# HG changeset patch
# User goeckslab
# Date 1763774247 0
# Node ID f13455263ac82e98de3266efdf5ca37afde738c1
# Parent cda3d431d2376c0e3fd727fe81f12a9b3baa29e5
planemo upload for repository https://github.com/goeckslab/Galaxy-Ludwig.git commit e2ab4c0f9ce8b7a0a48f749ef5dd9899d6c2b1f8

diff -r cda3d431d237 -r f13455263ac8 ludwig_experiment.py
--- a/ludwig_experiment.py Sat Sep 06 01:52:31 2025 +0000
+++ b/ludwig_experiment.py Sat Nov 22 01:17:27 2025 +0000
@@ -1,10 +1,15 @@
+import base64
+import html
 import json
 import logging
 import os
 import pickle
+import re
 import sys
+from io import BytesIO
 
 import pandas as pd
+from ludwig.api import LudwigModel
 from ludwig.experiment import cli
 from ludwig.globals import (
     DESCRIPTION_FILE_NAME,
@@ -21,6 +26,11 @@
     get_html_template
 )
 
+try:  # pragma: no cover - optional dependency in runtime containers
+    import matplotlib.pyplot as plt
+except ImportError:  # pragma: no cover
+    plt = None
+
 logging.basicConfig(level=logging.DEBUG)
@@ -158,44 +168,435 @@
     LOG.error(f"Error converting Parquet to CSV: {e}")
 
 
-def generate_html_report(title, ludwig_output_directory_name):
-    # ludwig_output_directory = os.path.join(
-    #     output_directory, ludwig_output_directory_name)
+def _resolve_dataset_path(dataset_path):
+    if not dataset_path:
+        return None
+
+    candidates = [dataset_path]
+
+    if not os.path.isabs(dataset_path):
+        candidates.extend([
+            os.path.join(output_directory, dataset_path),
+            os.path.join(os.getcwd(), dataset_path),
+        ])
+
+    for candidate in candidates:
+        if candidate and os.path.exists(candidate):
+            return os.path.abspath(candidate)
+
+    return None
+
+
+def _load_dataset_dataframe(dataset_path):
+    if not dataset_path:
+        return None
+
+    _, ext = os.path.splitext(dataset_path.lower())
+
+    try:
+        if ext in {".csv", ".tsv"}:
+            sep = "\t" if ext == ".tsv" else ","
+            return pd.read_csv(dataset_path, sep=sep)
+        if ext == ".parquet":
+            return pd.read_parquet(dataset_path)
+        if ext == ".json":
+            return pd.read_json(dataset_path)
+        if 
ext == ".h5": + return pd.read_hdf(dataset_path) + except Exception as exc: + LOG.warning(f"Unable to load dataset '{dataset_path}': {exc}") + + LOG.warning("Unsupported dataset format for feature importance computation") + return None + + +def sanitize_feature_name(name): + """Mirror Ludwig's get_sanitized_feature_name implementation.""" + return re.sub(r"[(){}.:\"\"\'\'\[\]]", "_", str(name)) + + +def _sanitize_dataframe_columns(dataframe): + """Rename dataframe columns to Ludwig-sanitized names for explainability.""" + column_map = {col: sanitize_feature_name(col) for col in dataframe.columns} + + sanitized_df = dataframe.rename(columns=column_map) + if len(set(column_map.values())) != len(column_map.values()): + LOG.warning( + "Column name collision after sanitization; feature importance may be unreliable" + ) + + return sanitized_df + + +def _feature_importance_plot(label_df, label_name, top_n=10, max_abs_importance=None): + """ + Return base64-encoded bar plot for a label's top-N feature importances. + + max_abs_importance lets us pin the x-axis across labels so readers can + compare magnitudes. 
+ """ + if plt is None or label_df.empty: + return "" + + top_features = label_df.nlargest(top_n, "abs_importance") + if top_features.empty: + return "" + + fig, ax = plt.subplots(figsize=(6, 3 + 0.2 * len(top_features))) + ax.barh(top_features["feature"], top_features["abs_importance"], color="#3f8fd2") + ax.set_xlabel("|importance|") + if max_abs_importance and max_abs_importance > 0: + ax.set_xlim(0, max_abs_importance * 1.05) + ax.invert_yaxis() + fig.tight_layout() + + buf = BytesIO() + fig.savefig(buf, format="png", dpi=150) + plt.close(fig) + encoded = base64.b64encode(buf.getvalue()).decode("utf-8") + return encoded + + +def render_feature_importance_table(df: pd.DataFrame) -> str: + """Render a sortable HTML table for feature importance values.""" + if df.empty: + return "" + + columns = list(df.columns) + headers = "".join( + f"{html.escape(str(col).replace('_', ' '))}" + for col in columns + ) + + body_rows = [] + for _, row in df.iterrows(): + cells = [] + for col in columns: + val = row[col] + if isinstance(val, float): + val_str = f"{val:.6f}" + else: + val_str = str(val) + cells.append(f"{html.escape(val_str)}") + body_rows.append("" + "".join(cells) + "") + + return ( + "
" + "" + f"{headers}" + f"{''.join(body_rows)}" + "
" + "
" + ) + + +def compute_feature_importance(ludwig_output_directory_name, + sample_size=200, + random_seed=42): + ludwig_output_directory = os.path.join( + output_directory, ludwig_output_directory_name) + model_dir = os.path.join(ludwig_output_directory, "model") + + output_csv_path = os.path.join( + ludwig_output_directory, "feature_importance.csv") + + if not os.path.exists(model_dir): + LOG.info("Model directory not found; skipping feature importance computation") + return - # test_statistics_html = "" - # # Read test statistics JSON and convert to HTML table - # try: - # test_statistics_path = os.path.join( - # ludwig_output_directory, TEST_STATISTICS_FILE_NAME) - # with open(test_statistics_path, "r") as f: - # test_statistics = json.load(f) - # test_statistics_html = "
<h2>Test Statistics</h2>
" - # test_statistics_html += json_to_html_table( - # test_statistics) - # except Exception as e: - # LOG.info(f"Error reading test statistics: {e}") + try: + ludwig_model = LudwigModel.load(model_dir) + except Exception as exc: + LOG.warning(f"Unable to load Ludwig model for explanations: {exc}") + return + + training_metadata = getattr(ludwig_model, "training_set_metadata", {}) + + output_feature_name, dataset_path = get_output_feature_name( + ludwig_output_directory) + + if not output_feature_name or not dataset_path: + LOG.warning("Output feature or dataset path missing; skipping feature importance") + if hasattr(ludwig_model, "close"): + ludwig_model.close() + return + + dataset_full_path = _resolve_dataset_path(dataset_path) + if not dataset_full_path: + LOG.warning(f"Unable to resolve dataset path '{dataset_path}' for explanations") + if hasattr(ludwig_model, "close"): + ludwig_model.close() + return + + dataframe = _load_dataset_dataframe(dataset_full_path) + if dataframe is None or dataframe.empty: + LOG.warning("Dataset unavailable or empty; skipping feature importance") + if hasattr(ludwig_model, "close"): + ludwig_model.close() + return + + dataframe = _sanitize_dataframe_columns(dataframe) + + data_subset = dataframe if len(dataframe) <= sample_size else dataframe.head(sample_size) + sample_df = dataframe.sample( + n=min(sample_size, len(dataframe)), + random_state=random_seed, + replace=False, + ) if len(dataframe) > sample_size else dataframe + + try: + from ludwig.explain.captum import IntegratedGradientsExplainer + except ImportError as exc: + LOG.warning(f"Integrated Gradients explainer unavailable: {exc}") + if hasattr(ludwig_model, "close"): + ludwig_model.close() + return + + sanitized_output_feature = sanitize_feature_name(output_feature_name) + + try: + explainer = IntegratedGradientsExplainer( + ludwig_model, + data_subset, + sample_df, + sanitized_output_feature, + ) + explanations = explainer.explain() + except Exception as exc: + 
LOG.warning(f"Unable to compute feature importance: {exc}") + if hasattr(ludwig_model, "close"): + ludwig_model.close() + return + + if hasattr(ludwig_model, "close"): + try: + ludwig_model.close() + except Exception: + pass - # Convert visualizations to HTML + label_names = [] + target_metadata = {} + if isinstance(training_metadata, dict): + target_metadata = training_metadata.get(sanitized_output_feature, {}) + + if isinstance(target_metadata, dict): + if "idx2str" in target_metadata: + idx2str = target_metadata["idx2str"] + if isinstance(idx2str, dict): + def _idx_key(item): + idx_key = item[0] + try: + return (0, int(idx_key)) + except (TypeError, ValueError): + return (1, str(idx_key)) + + label_names = [value for key, value in sorted( + idx2str.items(), key=_idx_key)] + else: + label_names = idx2str + elif "str2idx" in target_metadata and isinstance( + target_metadata["str2idx"], dict): + # invert mapping + label_names = [label for label, _ in sorted( + target_metadata["str2idx"].items(), + key=lambda item: item[1])] + + rows = [] + global_explanation = explanations.global_explanation + for label_index, label_explanation in enumerate( + global_explanation.label_explanations): + if label_names and label_index < len(label_names): + label_value = str(label_names[label_index]) + elif len(global_explanation.label_explanations) == 1: + label_value = output_feature_name + else: + label_value = str(label_index) + + for feature in label_explanation.feature_attributions: + rows.append({ + "label": label_value, + "feature": feature.feature_name, + "importance": feature.attribution, + "abs_importance": abs(feature.attribution), + }) + + if not rows: + LOG.warning("No feature importance rows produced") + return + + importance_df = pd.DataFrame(rows) + importance_df.sort_values([ + "label", + "abs_importance" + ], ascending=[True, False], inplace=True) + + importance_df.to_csv(output_csv_path, index=False) + + LOG.info(f"Feature importance saved to {output_csv_path}") + + 
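Aside (not part of the patch): the idx2str/str2idx branch in `compute_feature_importance` above is easy to get backwards, so here is a standalone sketch — with a hypothetical helper name `order_labels` — that mirrors the patch's label-ordering logic and can be sanity-checked without Ludwig installed:

```python
def order_labels(target_metadata):
    """Recover class labels in index order from Ludwig-style metadata.

    Mirrors the branch in compute_feature_importance: prefer "idx2str"
    (index -> label); fall back to inverting "str2idx" (label -> index).
    """
    if "idx2str" in target_metadata:
        idx2str = target_metadata["idx2str"]
        if isinstance(idx2str, dict):
            def _idx_key(item):
                # Sort numeric keys numerically, everything else lexically.
                try:
                    return (0, int(item[0]))
                except (TypeError, ValueError):
                    return (1, str(item[0]))
            return [value for _, value in sorted(idx2str.items(), key=_idx_key)]
        return list(idx2str)
    if "str2idx" in target_metadata and isinstance(target_metadata["str2idx"], dict):
        return [label for label, _ in sorted(
            target_metadata["str2idx"].items(), key=lambda kv: kv[1])]
    return []


print(order_labels({"str2idx": {"pre": 1, "post": 0}}))  # ['post', 'pre']
```

Sorting `str2idx` by its integer values reproduces the label order ("post", then "pre") seen in the breast-cancer expected report below.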
+def generate_html_report(title, ludwig_output_directory_name): plots_html = "" - if len(os.listdir(viz_output_directory)) > 0: + plot_files = [] + if os.path.isdir(viz_output_directory): + plot_files = sorted(os.listdir(viz_output_directory)) + if plot_files: plots_html = "
<h2>Visualizations</h2>
" - for plot_file in sorted(os.listdir(viz_output_directory)): + for plot_file in plot_files: plot_path = os.path.join(viz_output_directory, plot_file) if os.path.isfile(plot_path) and plot_file.endswith((".png", ".jpg")): encoded_image = encode_image_to_base64(plot_path) + plot_title = os.path.splitext(plot_file)[0].replace("_", " ") plots_html += ( f'
<div class="plot">'
-                        f'<h3>{os.path.splitext(plot_file)[0]}</h3>'
+                        f'<h3>{plot_title}</h3>'
                         f'<img src="data:image/png;base64,{encoded_image}" '
                         f'alt="{plot_file}">'
                         f'</div>
' ) + feature_importance_html = "" + importance_path = os.path.join( + output_directory, + ludwig_output_directory_name, + "feature_importance.csv", + ) + if os.path.exists(importance_path): + try: + importance_df = pd.read_csv(importance_path) + if not importance_df.empty: + sorted_df = ( + importance_df + .sort_values(["label", "abs_importance"], ascending=[True, False]) + ) + top_rows = ( + sorted_df + .groupby("label", as_index=False) + .head(5) + ) + max_abs_importance = pd.to_numeric( + importance_df.get("abs_importance", pd.Series(dtype=float)), + errors="coerce", + ).max() + if pd.isna(max_abs_importance): + max_abs_importance = None + + plot_sections = [] + for label in sorted(importance_df["label"].unique()): + encoded_plot = _feature_importance_plot( + importance_df[importance_df["label"] == label], + label, + max_abs_importance=max_abs_importance, + ) + if encoded_plot: + plot_sections.append( + f'
<div class="plot">'
+                        f'<h3>Top features for {label}</h3>'
+                        f'<img src="data:image/png;base64,{encoded_plot}" '
+                        f'alt="Feature importance plot for {label}">'
+                        f'</div>
' + ) + explanation_text = ( + "
<p>Feature importance scores come from Ludwig's Integrated Gradients explainer. "
+        "It interpolates between each example and a neutral baseline sample, summing "
+        "the change in the model output along that path. Higher |importance| values "
+        "indicate stronger influence. Plots share a common x-axis to make magnitudes "
+        "comparable across labels, and the table columns can be sorted for quick scans.</p>
" + ) + feature_importance_html = ( + "
<h2>Feature Importance</h2>
" + + explanation_text + + render_feature_importance_table(top_rows) + + "".join(plot_sections) + ) + except Exception as exc: + LOG.info(f"Unable to embed feature importance table: {exc}") + # Generate the full HTML content + feature_section = feature_importance_html or "
<p>No feature importance artifacts were generated.</p>
" + viz_section = plots_html or "
<p>No visualizations were generated.</p>
" + tabs_style = """ + + """ + tabs_script = """ + + """ + tabs_html = f""" +
+ + +
+
+ {viz_section} +
+
+ {feature_section} +
+ """ html_content = f""" {get_html_template()}
<h1>{title}</h1>
- {plots_html} + {tabs_style} + {tabs_html} + {tabs_script} {get_html_closing()} """ @@ -217,4 +618,5 @@ make_visualizations(ludwig_output_directory_name) convert_parquet_to_csv(ludwig_output_directory_name) + compute_feature_importance(ludwig_output_directory_name) generate_html_report("Ludwig Experiment", ludwig_output_directory_name) diff -r cda3d431d237 -r f13455263ac8 ludwig_macros.xml --- a/ludwig_macros.xml Sat Sep 06 01:52:31 2025 +0000 +++ b/ludwig_macros.xml Sat Nov 22 01:17:27 2025 +0000 @@ -1,7 +1,7 @@ 0.10.1 - 2 + 3 @LUDWIG_VERSION@+@SUFFIX@ diff -r cda3d431d237 -r f13455263ac8 test-data/breast_config.yml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/breast_config.yml Sat Nov 22 01:17:27 2025 +0000 @@ -0,0 +1,12 @@ +input_features: + - name: RPL23AP24 + type: number + - name: RP11-206L10.9 + type: number + - name: RP11-465B22.8 + type: number +output_features: + - name: AGE_CAT + type: category +trainer: + epochs: 2 diff -r cda3d431d237 -r f13455263ac8 test-data/breast_sample.csv --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/breast_sample.csv Sat Nov 22 01:17:27 2025 +0000 @@ -0,0 +1,9 @@ +RPL23AP24,RP11-206L10.9,RP11-465B22.8,B3GALT6,UBE2J2,RP4-758J18.2,RP4-758J18.13,SSU72,RP11-345P4.9,RP1-140A9.1,TMEM52,RP4-740C4.5,PEX10,TNFRSF14,LRRC47,DFFB,LINC01134,RP1-37J18.1,GPR153,KLHL21,AGE_CAT +0.1211,0.4986,3.354,19.86,24.1,14.84,3.367,38.35,6.023,0.7205,1.089,1.356,15.64,108.6,24.97,5.79,0.1377,0.0,15.68,24.43,post +0.0,1.291,25.7,24.05,27.97,19.14,6.678,52.93,17.52,0.6363,0.412,3.319,14.88,99.48,35.01,5.38,0.8004,0.0,7.475,29.76,pre +0.0,0.5758,6.357,19.08,26.45,17.41,5.742,54.13,12.31,0.7358,1.308,2.053,12.47,100.9,29.83,6.159,0.2316,0.0,12.34,36.57,pre +0.1467,0.6449,0.4005,11.81,21.93,8.1,1.557,43.66,5.914,0.5921,1.111,1.372,10.96,96.68,23.17,4.133,0.09032,0.0,11.1,20.92,post +0.1191,0.3823,0.1858,12.3,23.48,13.18,2.318,56.68,7.94,0.7591,9.574,0.6858,16.57,74.72,30.11,3.118,0.2482,0.0,38.67,26.86,post 
+0.0,0.4321,2.784,18.11,30.75,11.82,2.282,56.51,9.529,0.7135,1.213,3.559,17.05,122.8,38.61,8.208,0.3281,0.0,19.0,36.32,pre +0.09253,0.4649,2.418,14.98,17.76,11.07,1.637,31.54,7.266,1.219,1.212,2.812,12.59,135.5,25.63,5.315,0.2016,0.0,8.482,15.74,pre +0.09606,0.4759,0.07494,11.48,14.75,8.049,1.554,32.96,7.075,0.6939,1.234,1.029,11.62,74.73,21.45,3.678,0.0182,0.0,14.31,10.59,post diff -r cda3d431d237 -r f13455263ac8 test-data/ludwig_experiment_report_test.html --- a/test-data/ludwig_experiment_report_test.html Sat Sep 06 01:52:31 2025 +0000 +++ b/test-data/ludwig_experiment_report_test.html Sat Nov 22 01:17:27 2025 +0000 @@ -55,6 +55,39 @@ background-color: #4CAF50; color: white; } + /* feature importance layout tweaks */ + table.feature-importance-table { + table-layout: auto; + } + table.feature-importance-table th, + table.feature-importance-table td { + white-space: nowrap; + word-break: normal; + } + /* sortable tables */ + .sortable-table th.sortable { + cursor: pointer; + position: relative; + user-select: none; + } + .sortable-table th.sortable::after { + content: '⇅'; + position: absolute; + right: 12px; + top: 50%; + transform: translateY(-50%); + font-size: 0.8em; + color: #eaf5ea; + text-shadow: 0 0 1px rgba(0,0,0,0.15); + } + .sortable-table th.sortable.sorted-none::after { content: '⇅'; color: #eaf5ea; } + .sortable-table th.sortable.sorted-asc::after { content: '↑'; color: #ffffff; } + .sortable-table th.sortable.sorted-desc::after { content: '↓'; color: #ffffff; } + .scroll-rows-30 { + max-height: 900px; + overflow-y: auto; + overflow-x: auto; + } .plot { text-align: center; margin: 20px 0; @@ -64,12 +97,150 @@ height: auto; } +

Ludwig Experiment

-

Visualizations

learning_curves_temperature_loss

learning_curves_temperature_loss.png
+ + + + +
+ + +
+
+

Visualizations

learning curves temperature loss

learning_curves_temperature_loss.png
+
+
+

Feature Importance

Feature importance scores come from Ludwig's Integrated Gradients explainer. It interpolates between each example and a neutral baseline sample, summing the change in the model output along that path. Higher |importance| values indicate stronger influence. Plots share a common x-axis to make magnitudes comparable across labels, and the table columns can be sorted for quick scans.

labelfeatureimportanceabs importance
temperaturetemperature_feature0.4908210.490821

Top features for temperature

Feature importance plot for temperature
+
+ + + +
diff -r cda3d431d237 -r f13455263ac8 test-data/ludwig_experiment_report_test_breast.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/ludwig_experiment_report_test_breast.html Sat Nov 22 01:17:27 2025 +0000 @@ -0,0 +1,249 @@ + + + + + Galaxy-Ludwig Report + + + + +
+ +

Ludwig Experiment

+ + + + +
+ + +
+
+

Visualizations

confusion matrix AGE CAT top2

confusion_matrix__AGE_CAT_top2.png

confusion matrix entropy AGE CAT top2

confusion_matrix_entropy__AGE_CAT_top2.png

frequency vs f1 AGE CAT

frequency_vs_f1__AGE_CAT.png

learning curves AGE CAT accuracy

learning_curves_AGE_CAT_accuracy.png

learning curves AGE CAT loss

learning_curves_AGE_CAT_loss.png

roc curves

roc_curves.png
+
+
+

Feature Importance

Feature importance scores come from Ludwig's Integrated Gradients explainer. It interpolates between each example and a neutral baseline sample, summing the change in the model output along that path. Higher |importance| values indicate stronger influence. Plots share a common x-axis to make magnitudes comparable across labels, and the table columns can be sorted for quick scans.

labelfeatureimportanceabs importance
postRP11-465B22_8-0.0075520.007552
postRP11-206L10_9-0.0045790.004579
postRPL23AP240.0033860.003386
preRP11-465B22_80.0075520.007552
preRP11-206L10_90.0045790.004579
preRPL23AP24-0.0033860.003386

Top features for post

Feature importance plot for post

Top features for pre

Feature importance plot for pre
+
+ + + + + +
+ + + + \ No newline at end of file diff -r cda3d431d237 -r f13455263ac8 utils.py --- a/utils.py Sat Sep 06 01:52:31 2025 +0000 +++ b/utils.py Sat Nov 22 01:17:27 2025 +0000 @@ -59,6 +59,39 @@ background-color: #4CAF50; color: white; } + /* feature importance layout tweaks */ + table.feature-importance-table { + table-layout: auto; + } + table.feature-importance-table th, + table.feature-importance-table td { + white-space: nowrap; + word-break: normal; + } + /* sortable tables */ + .sortable-table th.sortable { + cursor: pointer; + position: relative; + user-select: none; + } + .sortable-table th.sortable::after { + content: '⇅'; + position: absolute; + right: 12px; + top: 50%; + transform: translateY(-50%); + font-size: 0.8em; + color: #eaf5ea; + text-shadow: 0 0 1px rgba(0,0,0,0.15); + } + .sortable-table th.sortable.sorted-none::after { content: '⇅'; color: #eaf5ea; } + .sortable-table th.sortable.sorted-asc::after { content: '↑'; color: #ffffff; } + .sortable-table th.sortable.sorted-desc::after { content: '↓'; color: #ffffff; } + .scroll-rows-30 { + max-height: 900px; + overflow-y: auto; + overflow-x: auto; + } .plot { text-align: center; margin: 20px 0; @@ -68,6 +101,69 @@ height: auto; } +
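Aside (not part of the patch): the `sanitize_feature_name` helper added in `ludwig_experiment.py` is a small pure function, reproduced here verbatim so its behavior on the breast test data's column names can be checked standalone. The expected outputs match the sanitized names (`RP11-206L10_9`, `RP11-465B22_8`) that appear in `ludwig_experiment_report_test_breast.html`:

```python
import re


def sanitize_feature_name(name):
    # Same regex as the patch: replace parentheses, braces, dots, colons,
    # quotes, and square brackets in feature names with underscores,
    # mirroring Ludwig's internal name sanitization.
    return re.sub(r"[(){}.:\"\"\'\'\[\]]", "_", str(name))


print(sanitize_feature_name("RP11-206L10.9"))  # RP11-206L10_9
print(sanitize_feature_name("RP11-465B22.8"))  # RP11-465B22_8
```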