Mercurial repository: bimib/cobraxy
changeset 542:fcdbc81feb45 draft
Uploaded
--- a/COBRAxy/.gitignore Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/.gitignore Sun Oct 26 19:27:41 2025 +0000
@@ -4,4 +4,5 @@
 .vscode/
 outputs/
 build/
-dist/
\ No newline at end of file
+dist/
+*/.pytest_cache/
\ No newline at end of file
--- a/COBRAxy/README.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/README.md Sun Oct 26 19:27:41 2025 +0000 @@ -4,12 +4,12 @@ # COBRAxy -A Python toolkit for metabolic flux analysis and visualization, with Galaxy integration. +A Python-based command-line suite for metabolic flux analysis and visualization, with [Galaxy](http://marea4galaxy.cloud.ba.infn.it/galaxy) integration. COBRAxy transforms gene expression and metabolite data into meaningful metabolic insights through flux sampling and interactive pathway maps. DOC: https://compbtbs.github.io/COBRAxy ## Features - +- **Import/Export** of metabolic models in multiple formats (SBML, JSON, MAT, YAML) - **Reaction Activity Scores (RAS)** from gene expression data - **Reaction Propensity Scores (RPS)** from metabolite abundance - **Flux sampling** with CBS or OptGP algorithms @@ -18,13 +18,41 @@ - **Galaxy tools** for web-based analysis - **Built-in models** including ENGRO2 and Recon -## Quick Start +## Requirements + +- **Python**: 3.8-3.13 +- **OS**: Linux, macOS, Windows (Linux/macOS recommended) +- **Dependencies**: Automatically installed via pip (COBRApy, pandas, numpy, etc.) +- **Build tools**: C/C++ compiler (gcc, clang, or MSVC), CMake for compiling Python extensions, pkg-config + +**System dependencies** (install before pip): +```bash +# Ubuntu/Debian +sudo apt-get install build-essential cmake pkg-config libvips libglpk40 glpk-utils + +# macOS +xcode-select --install +brew install cmake pkg-config vips glpk + +# Windows (with Chocolatey) +choco install cmake visualstudio2022buildtools pkgconfiglite +``` ### Installation +**Recommended: Using Conda** + ```bash +# Create a new conda environment +conda create -n cobraxy python=3.13 -y +conda activate cobraxy + +# Install system dependencies via conda (optional, if not using system packages) +conda install -c conda-forge gcc cmake pkg-config swiglpk -y + +# Clone and install COBRAxy git clone https://github.com/CompBtBs/COBRAxy.git -cd COBRAxy +cd COBRAxy/src pip install . 
``` @@ -54,29 +82,16 @@ | Tool | Purpose | Input | Output | |------|---------|--------|---------| -| `metabolic_model_setting` | Extract model components | SBML model | Rules, reactions, bounds, medium | -| `ras_generator` | Compute reaction activity scores | Gene expression data | RAS values | -| `rps_generator` | Compute reaction propensity scores | Metabolite abundance | RPS values | -| `marea` | Statistical pathway analysis | RAS + RPS data | Enrichment + base maps | -| `ras_to_bounds` | Apply RAS constraints to model | RAS + SBML model | Constrained bounds | -| `flux_simulation` | Sample metabolic fluxes | Constrained model | Flux distributions | -| `flux_to_map` | Add fluxes to enriched maps | Flux samples + base maps | Final styled maps | -| `marea_cluster` | Cluster analysis | Expression/flux data | Sample clusters | - -## Requirements +| [`importMetabolicModel`](https://compbtbs.github.io/COBRAxy/#/tools/import-metabolic-model) | Import and extract model components | SBML/JSON/MAT/YAML model | Tabular model data | +| [`exportMetabolicModel`](https://compbtbs.github.io/COBRAxy/#/tools/export-metabolic-model) | Export tabular data to model format | Tabular model data | SBML/JSON/MAT/YAML model | +| [`ras_generator`](https://compbtbs.github.io/COBRAxy/#/tools/ras-generator) | Compute reaction activity scores | Gene expression data | RAS values | +| [`rps_generator`](https://compbtbs.github.io/COBRAxy/#/tools/rps-generator) | Compute reaction propensity scores | Metabolite abundance | RPS values | +| [`marea`](https://compbtbs.github.io/COBRAxy/#/tools/marea) | Statistical pathway analysis | RAS + RPS data | Enrichment + base maps | +| [`ras_to_bounds`](https://compbtbs.github.io/COBRAxy/#/tools/ras-to-bounds) | Apply RAS constraints to model | RAS + SBML model | Constrained bounds | +| [`flux_simulation`](https://compbtbs.github.io/COBRAxy/#/tools/flux-simulation) | Sample metabolic fluxes | Constrained model | Flux distributions | +| [`flux_to_map`](https://compbtbs.github.io/COBRAxy/#/tools/flux-to-map) | Add fluxes to enriched maps | Flux samples + base maps | Final styled maps | +| [`marea_cluster`](https://compbtbs.github.io/COBRAxy/#/tools/marea-cluster) | Cluster analysis | Expression/flux data | Sample clusters | -- **Python**: 3.8-3.12 -- **OS**: Linux, macOS, Windows (Linux recommended) -- **Dependencies**: Automatically installed via pip (COBRApy, pandas, numpy, etc.) - -**Optional system libraries** (for enhanced features): -```bash -# Ubuntu/Debian -sudo apt-get install libvips libglpk40 glpk-utils - -# For Python GLPK bindings -pip install swiglpk -``` ## Data Flow @@ -112,10 +127,10 @@ - **Models**: ENGRO2, Recon (human metabolism) - **Gene mappings**: HGNC, Ensembl, Entrez ID conversions -- **Pathway maps**: Pre-styled SVG templates +- **Pathway map**: Pre-styled SVG templates for ENGRO2 - **Medium compositions**: Standard growth conditions -Located in `local/` directory for immediate use. +Located in `src/local/` directory for immediate use. ## Command Line Usage @@ -147,141 +162,8 @@ - Upload data through Galaxy interface - Chain tools in visual workflows - Share and reproduce analyses -- Access via Galaxy ToolShed -## Tutorials - -### Local Galaxy Installation - -To set up a local Galaxy instance with COBRAxy tools: - -1. **Install Galaxy**: - ```bash - # Clone Galaxy repository - git clone -b release_23.1 https://github.com/galaxyproject/galaxy.git - cd galaxy - - # Install dependencies and start Galaxy - sh run.sh - ``` - -2. 
**Install COBRAxy tools**: - ```bash - # Add COBRAxy tools to Galaxy - mkdir -p tools/cobraxy - cp path/to/COBRAxy/Galaxy_tools/*.xml tools/cobraxy/ - - # Update tool_conf.xml to include COBRAxy tools - # Add section in config/tool_conf.xml: - # <section id="cobraxy" name="COBRAxy"> - # <tool file="cobraxy/ras_generator.xml" /> - # <tool file="cobraxy/rps_generator.xml" /> - # <tool file="cobraxy/marea.xml" /> - # <!-- Add other tools --> - # </section> - ``` - -3. **Galaxy Tutorial Resources**: - - [Galaxy Installation Guide](https://docs.galaxyproject.org/en/master/admin/) - - [Tool Development Tutorial](https://training.galaxyproject.org/training-material/topics/dev/) - - [Galaxy Admin Training](https://training.galaxyproject.org/training-material/topics/admin/) - -### Python Direct Usage - -For programmatic use of COBRAxy tools in Python scripts: - -1. **Installation for Development**: - ```bash - # Clone and install in development mode - git clone https://github.com/CompBtBs/COBRAxy.git - cd COBRAxy - pip install -e . - ``` - -2. **Python API Usage**: - ```python - import sys - import os - - # Add COBRAxy to Python path - sys.path.append('/path/to/COBRAxy') - - # Import tool modules - import ras_generator - import rps_generator - import flux_simulation - import marea - import ras_to_bounds - - # Set working directory - tool_dir = "/path/to/COBRAxy" - os.chdir(tool_dir) - - # Generate RAS scores - ras_args = [ - '-td', tool_dir, - '-in', 'data/expression.tsv', - '-ra', 'output/ras_values.tsv', - '-rs', 'ENGRO2' - ] - ras_generator.main(ras_args) - - # Generate RPS scores (optional) - rps_args = [ - '-td', tool_dir, - '-id', 'data/metabolites.tsv', - '-rp', 'output/rps_values.tsv' - ] - rps_generator.main(rps_args) - - # Create enriched pathway maps - marea_args = [ - '-td', tool_dir, - '-using_RAS', 'true', - '-input_data', 'output/ras_values.tsv', - '-choice_map', 'ENGRO2', - '-gs', 'true', - '-idop', 'maps' - ] - marea.main(marea_args) - - # Apply RAS constraints to model - bounds_args = [ - '-td', tool_dir, - '-ms', 'ENGRO2', - '-ir', 'output/ras_values.tsv', - '-rs', 'true', - '-idop', 'bounds' - ] - ras_to_bounds.main(bounds_args) - - # Sample metabolic fluxes - flux_args = [ - '-td', tool_dir, - '-ms', 'ENGRO2', - '-in', 'bounds/bounds_output.tsv', - '-a', 'CBS', - '-ns', '1000', - '-idop', 'flux_results' - ] - flux_simulation.main(flux_args) - ``` - -3. **Python Tutorial Resources**: - - [COBRApy Documentation](https://cobrapy.readthedocs.io/) - - [Metabolic Modeling with Python](https://opencobra.github.io/cobrapy/building_model.html) - - [Flux Sampling Tutorial](https://cobrapy.readthedocs.io/en/stable/sampling.html) - - [Jupyter Notebooks Examples](examples/) (included in repository) - -## Input/Output Formats - -| Data Type | Format | Description | -|-----------|---------|-------------| -| Gene expression | TSV | Genes (rows) × Samples (columns) | -| Metabolites | TSV | Metabolites (rows) × Samples (columns) | -| Models | SBML | Standard metabolic model format | -| Results | TSV/CSV | Tabular flux/score data | -| Maps | SVG/PDF | Styled pathway visualizations | +For Galaxy installation and setup, refer to the [official Galaxy documentation](https://docs.galaxyproject.org/). ## Troubleshooting
--- a/COBRAxy/dist/cobraxy/meta.yaml Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/dist/cobraxy/meta.yaml Sun Oct 26 19:27:41 2025 +0000 @@ -11,69 +11,82 @@ build: entry_points: - - custom_data_generator=custom_data_generator:main + - importMetabolicModel=importMetabolicModel:main + - exportMetabolicModel=exportMetabolicModel:main + - ras_generator=ras_generator:main + - rps_generator=rps_generator:main + - marea_cluster=marea_cluster:main + - marea=marea:main + - ras_to_bounds=ras_to_bounds:main - flux_simulation=flux_simulation:main - flux_to_map=flux_to_map:main - - marea_cluster=marea_cluster:main - - marea=marea:main - - ras_generator=ras_generator:main - - ras_to_bounds=ras_to_bounds:main - - rps_generator=rps_generator:main noarch: python script: {{ PYTHON }} -m pip install . -vv --no-build-isolation number: 0 requirements: host: - - python >=3.8.20,<3.12 + - python >=3.8,<3.14 - pip - setuptools + - cmake + - pkg-config run: - - python >=3.8.20,<3.12 - - cairosvg ==2.7.1 - - cobra ==0.29.0 - - joblib ==1.4.2 - - lxml ==5.2.2 - - matplotlib-base ==3.7.3 - - numpy ==1.24.4 - - pandas ==2.0.3 - - pyvips ==2.2.3 # [linux or osx] - - scikit-learn ==1.3.2 - - scipy ==1.10.1 - - seaborn ==0.13.0 - - svglib ==1.5.1 - pip: - - pyvips==2.2.3 # [win] + - python >=3.8,<3.14 + - cairosvg >=2.7.0 + - cobra >=0.29.0 + - joblib >=1.3.0 + - lxml >=5.0.0 + - matplotlib-base >=3.7.0 + - numpy >=1.24.0 + - pandas >=2.0.0 + - pyvips >=2.2.0 + - scikit-learn >=1.3.0 + - scipy >=1.11.0 + - seaborn >=0.13.0 + - svglib >=1.5.0 + - anndata >=0.8.0 + - pydeseq2 >=0.4.0 + - cmake + - pkg-config test: imports: - utils - - custom_data_generator - - flux_simulation - - marea_cluster + - importMetabolicModel + - exportMetabolicModel - ras_generator - - ras_to_bounds - rps_generator + - marea_cluster + - marea + - ras_to_bounds + - flux_simulation + - flux_to_map commands: - - pip install pyvips==2.2.3 - - python -c "import pyvips; print('pyvips version:', pyvips.__version__)" - pip check - - custom_data_generator --help + - importMetabolicModel --help + - exportMetabolicModel --help + - ras_generator --help + - rps_generator --help + - marea_cluster --help + - marea --help + - ras_to_bounds --help - flux_simulation --help - flux_to_map --help - - marea_cluster --help - - marea --help - - ras_generator --help - - ras_to_bounds --help - - rps_generator --help requires: - pip about: - home: https://github.com/CompBtBs/COBRAxy.git - summary: A collection of tools for metabolic flux analysis in Galaxy. - #license: '' - #license_file: PLEASE_ADD_LICENSE_FILE + home: https://github.com/CompBtBs/COBRAxy + summary: A Python-based command-line suite for metabolic flux analysis and visualization + description: | + COBRAxy transforms gene expression and metabolite data into meaningful metabolic insights + through flux sampling and interactive pathway maps. It provides tools for computing reaction + activity scores (RAS), reaction propensity scores (RPS), flux sampling, and statistical analysis. + doc_url: https://compbtbs.github.io/COBRAxy + dev_url: https://github.com/CompBtBs/COBRAxy + license: MIT + license_file: LICENSE extra: recipe-maintainers:
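Note: the updated recipe above can be exercised locally before publishing to a channel. This is a minimal sketch, assuming conda-build is installed and the recipe directory is COBRAxy/dist/cobraxy as in this changeset; the environment name is illustrative, and the entry points exercised are the ones declared in the recipe.

```bash
# Build the package from the recipe directory
conda install -n base -c conda-forge conda-build -y
conda build COBRAxy/dist/cobraxy -c conda-forge

# Install the locally built package into a throwaway environment and run the recipe's smoke tests
conda create -n cobraxy-test -c conda-forge --use-local cobraxy -y
conda activate cobraxy-test
importMetabolicModel --help
exportMetabolicModel --help
```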
--- a/COBRAxy/docs/README.md Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/docs/README.md Sun Oct 26 19:27:41 2025 +0000
@@ -1,8 +1,8 @@
 # COBRAxy Documentation
 
-> A Python toolkit for metabolic flux analysis and visualization, with Galaxy integration.
+> A Python-based command-line suite for metabolic flux analysis and visualization, with Galaxy integration.
 
-COBRAxy enables the integration of transcriptomics data with COBRA-based metabolic models, offering a comprehensive framework for studying metabolism in both health and disease. With COBRAxy, users can load and enrich metabolic models by incorporating transcriptomic data and adjusting the model’s medium conditions.
+COBRAxy enables the integration of transcriptomics data with COBRA-based metabolic models, offering a comprehensive framework for studying metabolism in both health and disease. With COBRAxy, users can load and enrich metabolic models by incorporating transcriptomic data and adjusting the model's medium conditions.
 
 ## Overview
 
@@ -30,35 +30,28 @@
 Install COBRAxy and get it running on your system
 
 ### [Tutorials](tutorials/)
-Step-by-step guides for Galaxy and Python usage
+Step-by-step guides for Galaxy usage
 
 ### [Tools Documentation](tools/)
 Complete reference for all COBRAxy tools
 
 ## Data Flow
 
-COBRAxy follows several analysis paths:
+COBRAxy supports three main analysis workflows:
+
+1. **RAS-based Enrichment Analysis**:
+   Gene Expression → RAS Generation → MAREA → Enriched Pathway Maps
 
-1. **RAS Enrichment Analysis**: RAS computation → MAREA → Enriched Maps
-2. **Flux Enrichment Analysis Simulation**: RAS computation → Model Constraints → Flux Sampling → Flux Maps
-3. **RAS/RPS Enrichment Analysis**: RAS + RPS computation → MAREA → Enriched Maps
+2. **Flux Sampling Analysis**:
+   Gene Expression → RAS Generation → RAS to Bounds → Flux Simulation → Flux to Map → Flux-enriched Maps
+
+3. **RAS+RPS Combined Enrichment**:
+   Gene Expression + Metabolite Data → RAS + RPS Generation → MAREA → Multi-omics Enriched Maps
 
 ## Community & Support
 
-- **Documentation**: Complete guides and API reference
-- **Discussions**: Ask questions and share experiences
-- **Issues**: Report bugs and request features
+- **Documentation**: Complete guides and references
+- **Issues**: Report bugs and request features on [GitHub](https://github.com/CompBtBs/COBRAxy/issues)
 - **Contributing**: Help improve COBRAxy
 
-## Quick Links
-
-| Resource | Description |
-|----------|-------------|
-| [Installation Guide](installation.md) | Get COBRAxy running on your system |
-| [Galaxy Tutorial](tutorials/galaxy-setup.md) | Web-based analysis setup |
-| [Python Tutorial](tutorials/python-api.md) | Use COBRAxy programmatically |
-| [Tools Documentation](tools/) | Complete tool reference |
-
----
-
 **Ready to start?** Follow the [Installation Guide](installation.md) to get COBRAxy up and running!
\ No newline at end of file
--- a/COBRAxy/docs/_sidebar.md Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/docs/_sidebar.md Sun Oct 26 19:27:41 2025 +0000
@@ -2,19 +2,18 @@
 
 * [Home](/)
 
-* [Installation](installation.md)
+* [Installation](/installation.md)
 
-* [Tutorials](tutorials/)
-  * [Galaxy Setup](tutorials/galaxy-setup.md)
-  * [Python API](tutorials/python-api.md)
+* [Tutorials](/tutorials/)
+  * [Galaxy Setup](/tutorials/galaxy-setup.md)
 
-* [Tools Documentation](tools/)
-  * [RAS Generator](tools/ras-generator.md)
-  * [RPS Generator](tools/rps-generator.md)
-  * [MAREA](tools/marea.md)
-  * [RAS to Bounds](tools/ras-to-bounds.md)
-  * [Flux Simulation](tools/flux-simulation.md)
-  * [Flux to Map](tools/flux-to-map.md)
-  * [Model Setting](tools/metabolic-model-setting.md)
-  * [Tabular to Model](tools/tabular-to-model.md)
-  * [MAREA Cluster](tools/marea-cluster.md)
\ No newline at end of file
+* [Tools Documentation](/tools/)
+  * [Import Metabolic Model](/tools/import-metabolic-model.md)
+  * [Export Metabolic Model](/tools/export-metabolic-model.md)
+  * [RAS Generator](/tools/ras-generator.md)
+  * [RPS Generator](/tools/rps-generator.md)
+  * [MAREA](/tools/marea.md)
+  * [RAS to Bounds](/tools/ras-to-bounds.md)
+  * [Flux Simulation](/tools/flux-simulation.md)
+  * [Flux to Map](/tools/flux-to-map.md)
+  * [MAREA Cluster](/tools/marea-cluster.md)
\ No newline at end of file
--- a/COBRAxy/docs/getting-started.md Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/docs/getting-started.md Sun Oct 26 19:27:41 2025 +0000
@@ -74,7 +74,8 @@
 
 ```bash
 # Generate RAS from expression data
-ras_generator -td $(pwd) \
+# Note: -td is optional and auto-detected after pip install
+ras_generator \
   -in expression_data.tsv \
   -ra ras_output.tsv \
   -rs ENGRO2
@@ -84,7 +85,8 @@
 
 ```bash
 # Generate enriched pathway maps
-marea -td $(pwd) \
+# Note: -td is optional and auto-detected after pip install
+marea \
   -using_RAS true \
   -input_data ras_output.tsv \
   -choice_map ENGRO2 \
@@ -108,7 +110,7 @@
 | **ENGRO2** | Human | ~2,000 | ~500 | Focused human metabolism model |
 | **Recon** | Human | ~10,000 | ~2,000 | Comprehensive human metabolism |
 
-Models are stored in the `local/` directory and include:
+Models are stored in the `src/local/` directory and include:
 - SBML files
 - GPR rules
 - Gene mapping tables
@@ -132,34 +134,17 @@
 lactate 23.9 41.2 19.4
 ```
 
-## Command Line vs Python API
-
-COBRAxy offers two usage modes:
-
-### Command Line (Quick Analysis)
-```bash
-# Simple command-line execution
-ras_generator -td $(pwd) -in data.tsv -ra output.tsv -rs ENGRO2
-```
-
-### Python API (Programming)
-```python
-import ras_generator
-# Call main function with arguments
-ras_generator.main(['-td', '/path', '-in', 'data.tsv', '-ra', 'output.tsv', '-rs', 'ENGRO2'])
-```
-
 ## Next Steps
 
 Now that you understand the basics:
 
-1. **[Quick Start Guide](quickstart.md)** - Complete walkthrough with example data
-2. **[Python API Tutorial](tutorials/python-api.md)** - Learn programmatic usage
-3. **[Tools Reference](tools/)** - Detailed documentation for each tool
-4. **[Examples](examples/)** - Real-world analysis examples
+1. **[Quick Start Guide](/quickstart.md)** - Complete walkthrough with example data
+2. **[Galaxy Tutorial](/tutorials/galaxy-setup.md)** - Web-based analysis setup
+3. **[Tools Reference](/tools/)** - Detailed documentation for each tool
+4. **[Examples](/examples/)** - Real-world analysis examples
 
 ## Need Help?
 
-- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions
+- **[Troubleshooting](/troubleshooting.md)** - Common issues and solutions
 - **[GitHub Issues](https://github.com/CompBtBs/COBRAxy/issues)** - Report bugs or ask questions
-- **[Contributing](contributing.md)** - Help improve COBRAxy
\ No newline at end of file
+- **[Contributing](/contributing.md)** - Help improve COBRAxy
\ No newline at end of file
--- a/COBRAxy/docs/index.html Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/index.html Sun Oct 26 19:27:41 2025 +0000 @@ -34,6 +34,48 @@ color: var(--theme-color); } + /* Main category styling - bold and larger */ + .sidebar-nav > ul > li > a, + .sidebar-nav > ul > li > strong > a { + font-weight: 700 !important; + font-size: 1.1em !important; + color: var(--theme-color) !important; + text-transform: uppercase; + letter-spacing: 0.5px; + margin-top: 1em; + display: block; + } + + /* Sub-items styling - normal weight and smaller */ + .sidebar-nav > ul > li > ul > li > a { + font-weight: 400 !important; + font-size: 0.95em !important; + color: var(--text-color-base) !important; + padding-left: 1.5em; + } + + /* Collapsible arrows */ + .sidebar-nav > ul > li.folder > a::before { + content: '▶'; + display: inline-block; + margin-right: 0.5em; + transition: transform 0.2s; + font-size: 0.7em; + } + + .sidebar-nav > ul > li.folder.open > a::before { + transform: rotate(90deg); + } + + /* Hide sub-items by default */ + .sidebar-nav > ul > li.folder > ul { + display: none; + } + + .sidebar-nav > ul > li.folder.open > ul { + display: block; + } + .app-name-link { color: var(--theme-color) !important; font-weight: 600; @@ -102,6 +144,12 @@ subMaxLevel: 3, auto2top: true, + // Sidebar configuration + alias: { + '/.*/_sidebar.md': '/_sidebar.md' + }, + sidebarDisplayLevel: 1, // Expand sidebar to level 1 by default + // Search plugin search: { maxAge: 86400000, // Expiration time, the default one day @@ -160,5 +208,68 @@ <script src="//cdn.jsdelivr.net/npm/prismjs@1/components/prism-yaml.min.js"></script> <script src="//cdn.jsdelivr.net/npm/prismjs@1/components/prism-xml.min.js"></script> <script src="//cdn.jsdelivr.net/npm/prismjs@1/components/prism-json.min.js"></script> + + <!-- Sidebar collapse script --> + <script> + window.addEventListener('load', function() { + // Wait for sidebar to be rendered + setTimeout(function() { + const sidebar = document.querySelector('.sidebar-nav'); + if (!sidebar) return; + + // Mark items with sub-lists as folders + const items = sidebar.querySelectorAll('ul > li'); + items.forEach(item => { + const hasSublist = item.querySelector('ul'); + if (hasSublist) { + item.classList.add('folder'); + + // Get the main link + const mainLink = item.querySelector('a'); + if (mainLink) { + // Create a toggle button for the arrow + const arrow = mainLink.querySelector('::before') || mainLink; + + // Add click handler that allows navigation but also toggles on second click + let clickCount = 0; + let clickTimer = null; + + mainLink.addEventListener('click', function(e) { + clickCount++; + + if (clickCount === 1) { + // First click: navigate to the page + clickTimer = setTimeout(function() { + clickCount = 0; + }, 300); + // Don't prevent default, allow navigation + item.classList.add('open'); + } else if (clickCount === 2) { + // Second click: toggle the folder + clearTimeout(clickTimer); + clickCount = 0; + e.preventDefault(); + item.classList.toggle('open'); + } + }); + } + } + }); + + // Open the active section by default + const activeLink = sidebar.querySelector('li.active'); + if (activeLink) { + let parent = activeLink.parentElement; + while (parent && parent.tagName === 'UL') { + const parentLi = parent.parentElement; + if (parentLi && parentLi.classList.contains('folder')) { + parentLi.classList.add('open'); + } + parent = parentLi ? parentLi.parentElement : null; + } + } + }, 300); + }); + </script> </body> </html> \ No newline at end of file
--- a/COBRAxy/docs/installation.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/installation.md Sun Oct 26 19:27:41 2025 +0000 @@ -4,38 +4,70 @@ ## System Requirements -- **Python**: 3.8-3.12 +- **Python**: 3.8-3.13 - **Operating System**: Linux (recommended), macOS, Windows -- **Storage**: 2GB free space for installation and temporary files +- **Build tools**: C/C++ compiler (gcc, clang, or MSVC), CMake, pkg-config + +## System Dependencies + +Install required build tools before installing COBRAxy: + +```bash +# Ubuntu/Debian +sudo apt-get install build-essential cmake pkg-config libvips libglpk40 glpk-utils + +# macOS +xcode-select --install +brew install cmake pkg-config vips glpk + +# Windows (with Chocolatey) +choco install cmake visualstudio2022buildtools pkgconfiglite +``` + +## Installation Methods -## Quick Install +### Recommended: Using Conda + +Create an isolated environment with all dependencies: + +```bash +# Create a new conda environment +conda create -n cobraxy python=3.13 -y +conda activate cobraxy -The fastest way to install COBRAxy: +# Install build tools via conda +conda install -c conda-forge cmake pkg-config swiglpk -y + +# Clone and install COBRAxy +git clone https://github.com/CompBtBs/COBRAxy.git +cd COBRAxy/src +pip install . +``` + +### Alternative: Direct Installation + +If you have system dependencies already installed: ```bash # Clone the repository git clone https://github.com/CompBtBs/COBRAxy.git -cd COBRAxy +cd COBRAxy/src # Install COBRAxy pip install . ``` -## Development Install +### Development Install For development or if you want to modify COBRAxy: ```bash -# Clone and install in development mode +# Clone and install in editable mode git clone https://github.com/CompBtBs/COBRAxy.git -cd COBRAxy +cd COBRAxy/src pip install -e . ``` -## Dependencies - -COBRAxy automatically installs its Python dependencies (COBRApy, pandas, numpy, etc.) - ## Verify Installation Test your installation: @@ -44,9 +76,53 @@ # Check if COBRAxy tools are available ras_generator --help flux_simulation --help +marea --help + +# Check Python can import COBRAxy modules +python -c "import ras_generator; print('COBRAxy installed successfully!')" +``` + +## Troubleshooting Installation + +### Missing Compiler Errors + +If you see errors about missing compilers during installation: + +```bash +# Ubuntu/Debian +sudo apt-get install build-essential + +# macOS +xcode-select --install ``` -## Virtual Environment (Recommended) +### CMake Not Found + +```bash +# Ubuntu/Debian +sudo apt-get install cmake + +# macOS +brew install cmake + +# Or via conda +conda install -c conda-forge cmake +``` + +### pkg-config Issues + +```bash +# Ubuntu/Debian +sudo apt-get install pkg-config + +# macOS +brew install pkg-config + +# Or via conda +conda install -c conda-forge pkg-config +``` + +## Alternative: Virtual Environment (without Conda) Using a virtual environment prevents conflicts with other Python packages: @@ -59,6 +135,7 @@ # cobraxy-env\Scripts\activate # Windows # Install COBRAxy +cd COBRAxy/src pip install . # When done, deactivate @@ -69,15 +146,14 @@ After successful installation: -1. **[Quick Start Guide](quickstart.md)** - Run your first analysis -2. **[Tutorial: Python API](tutorials/python-api.md)** - Learn programmatic usage -3. **[Tutorial: Galaxy Setup](tutorials/galaxy-setup.md)** - Set up web interface +1. **[Quick Start Guide](/quickstart.md)** - Run your first analysis +2. 
**[Tutorial: Galaxy Setup](/tutorials/galaxy-setup.md)** - Set up web interface ## Getting Help If you encounter issues: -1. Check the [Troubleshooting Guide](troubleshooting.md) +1. Check the [Troubleshooting Guide](/troubleshooting.md) 2. Search [existing issues](https://github.com/CompBtBs/COBRAxy/issues) 3. Create a [new issue](https://github.com/CompBtBs/COBRAxy/issues/new) with: - Your operating system
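Note: a quick pre-flight check complements the system-dependency lists in the installation guide above. This is a minimal sketch, assuming a POSIX shell; it only confirms that the build toolchain named in the guide is on PATH and that the GLPK Python bindings import.

```bash
# Verify the build toolchain required by COBRAxy's native dependencies
for tool in gcc cmake pkg-config; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool ($(command -v "$tool"))"
  else
    echo "MISSING: $tool - install it before running 'pip install .'"
  fi
done

# Optional: confirm the GLPK Python bindings are importable
python -c "import swiglpk" 2>/dev/null && echo "OK: swiglpk" || echo "MISSING: swiglpk"
```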
--- a/COBRAxy/docs/quickstart.md Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/docs/quickstart.md Sun Oct 26 19:27:41 2025 +0000
@@ -38,7 +38,8 @@
 
 ```bash
 # Generate RAS scores using built-in ENGRO2 model
-ras_generator -td $(pwd) \
+# Note: -td is optional and auto-detected after pip install
+ras_generator \
   -in sample_expression.tsv \
   -ra ras_scores.tsv \
   -rs ENGRO2
@@ -61,7 +62,8 @@
 
 ```bash
 # Create pathway maps with statistical analysis
-marea -td $(pwd) \
+# Note: -td is optional and auto-detected after pip install
+marea \
   -using_RAS true \
   -input_data ras_scores.tsv \
   -choice_map ENGRO2 \
@@ -95,20 +97,19 @@
 ### Learn More About the Analysis
 
-- **[Understanding RAS](tools/ras-generator.md)** - How activity scores are computed
-- **[MAREA Analysis](tools/marea.md)** - Statistical enrichment methods
+- **[Understanding RAS](/tools/ras-generator.md)** - How activity scores are computed
+- **[MAREA Analysis](/tools/marea.md)** - Statistical enrichment methods
 - **[Data Flow](getting-started.md#analysis-workflows)** - Complete workflow overview
 
 ### Try Advanced Features
 
 - **[Flux Sampling](tutorials/workflow.md#flux-simulation-workflow)** - Predict metabolic flux distributions
-- **[Python API](tutorials/python-api.md)** - Integrate into scripts and pipelines
-- **[Galaxy Interface](tutorials/galaxy-setup.md)** - Web-based analysis
+- **[Galaxy Interface](/tutorials/galaxy-setup.md)** - Web-based analysis
 
 ### Use Your Own Data
 
-- **[Data Formats](tutorials/data-formats.md)** - Prepare your expression data
-- **[Troubleshooting](troubleshooting.md)** - Common issues and solutions
+- **[Data Formats](/tutorials/data-formats.md)** - Prepare your expression data
+- **[Troubleshooting](/troubleshooting.md)** - Common issues and solutions
 
 ## Complete Example Pipeline
 
@@ -124,8 +125,9 @@
 EOF
 
 # Run analysis pipeline
-ras_generator -td /path/to/COBRAxy -in expression.tsv -ra ras.tsv -rs ENGRO2
-marea -td /path/to/COBRAxy -using_RAS true -input_data ras.tsv -choice_map ENGRO2 -gs true -idop maps
+# Note: -td is optional and auto-detected after pip install
+ras_generator -in expression.tsv -ra ras.tsv -rs ENGRO2
+marea -using_RAS true -input_data ras.tsv -choice_map ENGRO2 -gs true -idop maps
 
 # View results
 ls maps/*.svg
@@ -138,16 +140,4 @@
 1. **Check Prerequisites**: Ensure COBRAxy is properly installed
 2. **Verify File Format**: Make sure your data is tab-separated TSV
 3. **Review Logs**: Look for error messages in the terminal output
-4. **Consult Guides**: [Troubleshooting](troubleshooting.md) and [Installation](installation.md)
-
-**Still stuck?** Ask for help in [GitHub Discussions](https://github.com/CompBtBs/COBRAxy/discussions).
-
-## Summary
-
-🎉 **Congratulations!** You've completed your first COBRAxy analysis. You now know how to:
-
-- ✅ Generate metabolic activity scores from gene expression
-- ✅ Create statistical pathway visualizations
-- ✅ Interpret basic COBRAxy outputs
-
-Ready for more? Explore the [full documentation](/) to unlock COBRAxy's complete potential!
\ No newline at end of file
+4. **Consult Guides**: [Troubleshooting](/troubleshooting.md) and [Installation](/installation.md)
\ No newline at end of file
--- a/COBRAxy/docs/tools/README.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tools/README.md Sun Oct 26 19:27:41 2025 +0000 @@ -4,23 +4,16 @@ | Tool | Purpose | Input | Output | |------|---------|--------|--------| -| [RAS Generator](ras-generator.md) | Compute reaction activity scores | Gene expression + GPR rules | RAS values | -| [RPS Generator](rps-generator.md) | Compute reaction propensity scores | Metabolite abundance | RPS values | -| [MAREA](marea.md) | Statistical pathway enrichment | RAS/RPS data | Enriched maps + statistics | -| [RAS to Bounds](ras-to-bounds.md) | Apply RAS constraints to model | RAS + SBML model | Constrained bounds | -| [Flux Simulation](flux-simulation.md) | Sample metabolic fluxes | Constrained model | Flux distributions | -| [Flux to Map](flux-to-map.md) | Visualize flux data on maps | Flux samples + statistical comparison | Color-coded pathway maps | -| [Model Setting](metabolic-model-setting.md) | Extract model components | SBML/JSON/MAT/YML model | Tabular model data | -| [MAREA Cluster](marea-cluster.md) | Cluster analysis | Expression/RAS/RPS/flux data | Sample clusters + validation plots | +| [Import Metabolic Model](tools/import-metabolic-model) | Import and extract model components | SBML/JSON/MAT/YAML model | Tabular model data | +| [Export Metabolic Model](tools/export-metabolic-model) | Export tabular data to model format | Tabular model data | SBML/JSON/MAT/YAML model | +| [RAS Generator](tools/ras-generator) | Compute reaction activity scores | Gene expression + GPR rules | RAS values | +| [RPS Generator](tools/rps-generator) | Compute reaction propensity scores | Metabolite abundance | RPS values | +| [MAREA](tools/marea) | Statistical pathway enrichment | RAS/RPS data | Enriched maps + statistics | +| [RAS to Bounds](tools/ras-to-bounds) | Apply RAS constraints to model | RAS + SBML model | Constrained bounds | +| [Flux Simulation](tools/flux-simulation) | Sample metabolic fluxes | Constrained model | Flux distributions | +| [Flux to Map](tools/flux-to-map) | Visualize flux data on maps | Flux samples + statistical comparison | Color-coded pathway maps | +| [MAREA Cluster](tools/marea-cluster) | Cluster analysis | Expression/RAS/RPS/flux data | Sample clusters + validation plots | -## Common Parameters - -All tools share these basic parameters: - -- **`-td, --tool_dir`**: COBRAxy installation directory (required) -- **`-in, --input`**: Input dataset file -- **`-idop, --output_dir`**: Output directory for results -- **`-rs, --rules_selector`**: Built-in model (ENGRO2, Recon, HMRcore) ## Analysis Workflows @@ -74,23 +67,27 @@ ### Choose Your Analysis Path **For Pathway Enrichment** -1. [RAS Generator](ras-generator.md) → Generate activity scores -2. [RPS Generator](rps-generator.md) → Generate propensity scores (optional) -3. [MAREA](marea.md) → Statistical analysis and visualization +1. [RAS Generator](tools/ras-generator) → Generate activity scores +2. [RPS Generator](tools/rps-generator) → Generate propensity scores (optional) +3. [MAREA](tools/marea) → Statistical analysis and visualization **For Flux Analysis** -1. [RAS Generator](ras-generator.md) → Generate activity scores -2. [RAS to Bounds](ras-to-bounds.md) → Apply constraints -3. [Flux Simulation](flux-simulation.md) → Sample fluxes -4. [Flux to Map](flux-to-map.md) → Create visualizations +1. [RAS Generator](tools/ras-generator) → Generate activity scores +2. [RAS to Bounds](tools/ras-to-bounds) → Apply constraints +3. 
[Flux Simulation](tools/flux-simulation) → Sample fluxes +4. [Flux to Map](tools/flux-to-map) → Create visualizations **For Model Exploration** -1. [Model Setting](metabolic-model-setting.md) → Extract model info +1. [Import Metabolic Model](tools/import-metabolic-model) → Extract model info 2. Analyze model structure and gene coverage +**For Model Creation** +1. Create/edit tabular model data +2. [Export Metabolic Model](tools/export-metabolic-model) → Create SBML/JSON/MAT/YAML model + **For Sample Classification** 1. Generate RAS/RPS scores -2. [MAREA Cluster](marea-cluster.md) → Cluster samples +2. [MAREA Cluster](tools/marea-cluster) → Cluster samples @@ -98,16 +95,6 @@ ### Common Issues Across Tools -**File Path Problems** -- Use absolute paths when possible -- Ensure all input files exist before starting -- Check write permissions for output directories - -**File Issues** -- Check file paths and permissions -- Verify input file formats -- Ensure sufficient disk space - **Model Issues** - Verify model file format and gene ID consistency - Check gene ID mapping between data and model @@ -119,7 +106,7 @@ 1. Check individual tool documentation 2. Review parameter requirements and formats 3. Test with smaller datasets first -4. Consult [troubleshooting guide](../troubleshooting.md) +4. Consult [troubleshooting guide](/troubleshooting.md) ## Contributing
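Note: the "For Pathway Enrichment" path listed in the tools README above can be run end to end from the shell. This is a minimal sketch that reuses the flags documented in the quickstart and getting-started pages of this changeset; file names and the output directory are illustrative, and -td is omitted because the docs describe it as auto-detected after pip install.

```bash
# 1. Reaction activity scores from gene expression (built-in ENGRO2 rules)
ras_generator -in expression.tsv -ra ras.tsv -rs ENGRO2

# 2. Optional: reaction propensity scores from metabolite abundance
rps_generator -id metabolites.tsv -rp rps.tsv

# 3. Statistical enrichment and styled pathway maps
marea -using_RAS true -input_data ras.tsv -choice_map ENGRO2 -gs true -idop enrichment_maps
```

The rps_generator flags follow the Python example removed from the top-level README in this changeset (-id for the metabolite table, -rp for the output); treat them as assumptions and check `rps_generator --help` for the authoritative list.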
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COBRAxy/docs/tools/export-metabolic-model.md Sun Oct 26 19:27:41 2025 +0000 @@ -0,0 +1,493 @@ +# Export Metabolic Model + +Export tabular data (CSV/TSV) into COBRA metabolic models in various formats. + +## Overview + +Export Metabolic Model (exportMetabolicModel) converts structured tabular data containing reaction information into fully functional COBRA metabolic models. This tool enables creation of custom models from spreadsheet data and supports multiple output formats including SBML, JSON, MATLAB, and YAML. + +## Usage + +### Command Line + +```bash +exportMetabolicModel --input model_data.csv \ + --format sbml \ + --output custom_model.xml \ + --out_log conversion.log \ + --tool_dir /path/to/COBRAxy/src +``` + +### Galaxy Interface + +Select "Export Metabolic Model" from the COBRAxy tool suite and configure conversion parameters. + +## Parameters + +### Required Parameters + +| Parameter | Flag | Description | +|-----------|------|-------------| +| Input File | `--input` | Tabular file (CSV/TSV) with model data | +| Output Format | `--format` | Model format (sbml, json, mat, yaml) | +| Output File | `--output` | Output model file path | +| Output Log | `--out_log` | Log file for conversion process | + +### Optional Parameters + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Tool Directory | `--tool_dir` | COBRAxy installation directory | Current directory | + +## Input Format + +### Tabular Model Data + +The input file must contain structured model information with the following columns: + +```csv +Reaction_ID,GPR_Rule,Reaction_Formula,Lower_Bound,Upper_Bound,Objective_Coefficient,Medium_Member,Compartment,Subsystem +R00001,GENE1 or GENE2,A + B -> C + D,-1000.0,1000.0,0.0,FALSE,cytosol,Glycolysis +R00002,GENE3 and GENE4,E <-> F,-1000.0,1000.0,0.0,FALSE,mitochondria,TCA_Cycle +EX_glc_e,-,glc_e <->,-1000.0,1000.0,0.0,TRUE,extracellular,Exchange +BIOMASS,GENE5,0.5 A + 0.3 B -> 1 BIOMASS,0.0,1000.0,1.0,FALSE,cytosol,Biomass +``` + +### Required Columns + +| Column | Description | Format | +|--------|-------------|--------| +| **Reaction_ID** | Unique reaction identifier | String | +| **Reaction_Formula** | Stoichiometric equation | Metabolite notation | +| **Lower_Bound** | Minimum flux constraint | Numeric | +| **Upper_Bound** | Maximum flux constraint | Numeric | + +### Optional Columns + +| Column | Description | Default | +|--------|-------------|---------| +| **GPR_Rule** | Gene-protein-reaction association | Empty string | +| **Objective_Coefficient** | Biomass/objective weight | 0.0 | +| **Medium_Member** | Exchange reaction flag | FALSE | +| **Compartment** | Subcellular location | Empty | +| **Subsystem** | Metabolic pathway | Empty | + +## Output Formats + +### SBML (Systems Biology Markup Language) +- **Format**: XML-based standard +- **Extension**: `.xml` or `.sbml` +- **Use Case**: Interoperability with other tools +- **Advantages**: Widely supported, standardized + +### JSON (JavaScript Object Notation) +- **Format**: COBRApy native JSON +- **Extension**: `.json` +- **Use Case**: Python/COBRApy workflows +- **Advantages**: Human-readable, lightweight + +### MATLAB (.mat) +- **Format**: MATLAB workspace format +- **Extension**: `.mat` +- **Use Case**: MATLAB COBRA Toolbox +- **Advantages**: Direct MATLAB compatibility + +### YAML (YAML Ain't Markup Language) +- **Format**: Human-readable data serialization +- **Extension**: `.yml` or `.yaml` +- **Use Case**: Configuration and 
documentation +- **Advantages**: Most human-readable format + +## Reaction Formula Syntax + +### Standard Notation +``` +# Irreversible reaction +A + B -> C + D + +# Reversible reaction +A + B <-> C + D + +# With stoichiometric coefficients +2 A + 3 B -> 1 C + 4 D + +# Compartmentalized metabolites +glc_c + atp_c -> g6p_c + adp_c +``` + +### Compartment Suffixes +- `_c`: Cytosol +- `_m`: Mitochondria +- `_e`: Extracellular +- `_r`: Endoplasmic reticulum +- `_x`: Peroxisome +- `_n`: Nucleus + +### Exchange Reactions +``` +# Import reaction +EX_glc_e: glc_e <-> + +# Export reaction +EX_co2_e: co2_e <-> +``` + +## GPR Rule Syntax + +### Logical Operators +- **AND**: Gene products required together +- **OR**: Alternative gene products +- **Parentheses**: Grouping for complex logic + +### Examples +``` +# Single gene +GENE1 + +# Alternative genes (isozymes) +GENE1 or GENE2 or GENE3 + +# Required genes (complex) +GENE1 and GENE2 + +# Complex logic +(GENE1 and GENE2) or (GENE3 and GENE4) +``` + +## Examples + +### Create Basic Model + +```bash +# Convert simple CSV to SBML model +exportMetabolicModel --input simple_model.csv \ + --format sbml \ + --output simple_model.xml \ + --out_log simple_conversion.log \ + --tool_dir /opt/COBRAxy/src +``` + +### Multi-format Export + +```bash +# Create models in all supported formats +formats=("sbml" "json" "mat" "yaml") +for fmt in "${formats[@]}"; do + exportMetabolicModel --input comprehensive_model.csv \ + --format "$fmt" \ + --output "model.$fmt" \ + --out_log "conversion_$fmt.log" \ + --tool_dir /opt/COBRAxy/src +done +``` + +### Custom Model Creation + +```bash +# Build tissue-specific model from curated data +exportMetabolicModel --input liver_reactions.tsv \ + --format sbml \ + --output liver_model.xml \ + --out_log liver_model.log \ + --tool_dir /opt/COBRAxy/src +``` + +### Model Integration Pipeline + +```bash +# Extract existing model, modify, and recreate +importMetabolicModel --model ENGRO2 \ + --out_tabular base_model.csv \ + --tool_dir /opt/COBRAxy/src + +# Edit base_model.csv with custom reactions/constraints + +# Create modified model +exportMetabolicModel --input modified_model.csv \ + --format sbml \ + --output custom_model.xml \ + --out_log custom_creation.log \ + --tool_dir /opt/COBRAxy/src +``` + +## Model Validation + +### Automatic Checks + +The tool performs validation during conversion: +- **Stoichiometric Balance**: Reaction mass balance +- **Metabolite Consistency**: Compartment assignments +- **Bound Validation**: Feasible constraint ranges +- **Objective Function**: Valid biomass reaction + +### Post-conversion Validation + +```python +import cobra + +# Load and validate model +model = cobra.io.read_sbml_model('custom_model.xml') + +# Check basic properties +print(f"Reactions: {len(model.reactions)}") +print(f"Metabolites: {len(model.metabolites)}") +print(f"Genes: {len(model.genes)}") + +# Test model solvability +solution = model.optimize() +print(f"Growth rate: {solution.objective_value}") + +# Validate mass balance +unbalanced = cobra.flux_analysis.check_mass_balance(model) +if unbalanced: + print("Unbalanced reactions found:", unbalanced) +``` + +## Integration Workflow + +### Upstream Data Sources + +#### COBRAxy Tools +- [Import Metabolic Model](import-metabolic-model.md) - Extract tabular data for modification + +#### External Sources +- **Databases**: KEGG, Reactome, BiGG +- **Literature**: Manually curated reactions +- **Spreadsheets**: User-defined custom models + +### Downstream Applications + +#### COBRAxy Analysis 
+- [RAS to Bounds](ras-to-bounds.md) - Apply constraints to custom model +- [Flux Simulation](flux-simulation.md) - Sample fluxes from custom model +- [MAREA](marea.md) - Analyze custom pathways + +#### External Tools +- **COBRApy**: Python-based analysis +- **COBRA Toolbox**: MATLAB analysis +- **OptFlux**: Strain design +- **Escher**: Pathway visualization + +### Typical Pipeline + +```bash +# 1. Start with existing model data +importMetabolicModel --model ENGRO2 \ + --out_tabular base_reactions.csv \ + --tool_dir /opt/COBRAxy/src + +# 2. Modify/extend the reaction data +# Edit base_reactions.csv to add tissue-specific reactions + +# 3. Create custom model +exportMetabolicModel --input modified_reactions.csv \ + --format sbml \ + --output tissue_model.xml \ + --out_log tissue_creation.log \ + --tool_dir /opt/COBRAxy/src + +# 4. Validate and use custom model +ras_to_bounds --model Custom --input tissue_model.xml \ + --ras_input tissue_expression.tsv \ + --idop tissue_bounds/ \ + --tool_dir /opt/COBRAxy/src + +# 5. Perform flux analysis +flux_simulation --model Custom --input tissue_model.xml \ + --bounds tissue_bounds/*.tsv \ + --algorithm CBS --idop tissue_fluxes/ \ + --tool_dir /opt/COBRAxy/src +``` + +## Quality Control + +### Input Data Validation + +#### Pre-conversion Checks +- **Format Consistency**: Verify column headers and data types +- **Reaction Completeness**: Check for missing required fields +- **Stoichiometric Validity**: Validate reaction formulas +- **Bound Feasibility**: Ensure lower ≤ upper bounds + +#### Common Data Issues +```bash +# Check for missing reaction IDs +awk -F',' 'NR>1 && ($1=="" || $1=="NA") {print "Empty ID in line " NR}' input.csv + +# Validate reaction directions +awk -F',' 'NR>1 && $3 !~ /->|<->/ {print "Invalid formula: " $1 ", " $3}' input.csv + +# Check bound consistency +awk -F',' 'NR>1 && $4>$5 {print "Invalid bounds: " $1 ", LB=" $4 " > UB=" $5}' input.csv +``` + +### Model Quality Assessment + +#### Structural Properties +- **Network Connectivity**: Ensure realistic pathway structure +- **Compartmentalization**: Validate transport reactions +- **Exchange Reactions**: Verify medium composition +- **Biomass Function**: Check objective reaction completeness + +#### Functional Testing +```python +# Test model functionality +model = cobra.io.read_sbml_model('custom_model.xml') + +# Check growth capability +growth = model.optimize().objective_value +print(f"Maximum growth rate: {growth}") + +# Flux Variability Analysis +fva_result = cobra.flux_analysis.flux_variability_analysis(model) +blocked_reactions = fva_result[(fva_result.minimum == 0) & (fva_result.maximum == 0)] +print(f"Blocked reactions: {len(blocked_reactions)}") + +# Essential gene analysis +essential_genes = cobra.flux_analysis.find_essential_genes(model) +print(f"Essential genes: {len(essential_genes)}") +``` + +## Tips and Best Practices + +### Data Preparation +- **Consistent Naming**: Use systematic metabolite/reaction IDs +- **Compartment Notation**: Follow standard suffixes (_c, _m, _e) +- **Balanced Reactions**: Verify mass and charge balance +- **Realistic Bounds**: Use physiologically relevant constraints + +### Model Design +- **Modular Structure**: Organize reactions by pathway/subsystem +- **Exchange Reactions**: Include all necessary transport processes +- **Biomass Function**: Define appropriate growth objective +- **Gene Associations**: Add GPR rules where available + +### Format Selection +- **SBML**: Choose for maximum compatibility and sharing +- **JSON**: Use for 
COBRApy-specific workflows +- **MATLAB**: Select for COBRA Toolbox integration +- **YAML**: Pick for human-readable documentation + +### Performance Optimization +- **Model Size**: Balance comprehensiveness with computational efficiency +- **Reaction Pruning**: Remove unnecessary or blocked reactions +- **Compartmentalization**: Minimize unnecessary compartments +- **Validation**: Test model properties before distribution + +## Troubleshooting + +### Common Issues + +**Conversion fails with format error** +- Check CSV/TSV column headers and data consistency +- Verify reaction formula syntax +- Ensure numeric fields contain valid numbers + +**Model is infeasible after conversion** +- Check reaction bounds for conflicts +- Verify exchange reaction setup +- Validate stoichiometric balance + +**Missing metabolites or reactions** +- Confirm all required columns present in input +- Check for empty rows or malformed data +- Validate reaction formula parsing + +### Error Messages + +| Error | Cause | Solution | +|-------|-------|----------| +| "Input file not found" | Invalid file path | Check file location and permissions | +| "Unknown format" | Invalid output format | Use: sbml, json, mat, or yaml | +| "Formula parsing failed" | Malformed reaction equation | Check reaction formula syntax | +| "Model infeasible" | Conflicting constraints | Review bounds and exchange reactions | + +### Performance Issues + +**Slow conversion** +- Large input files require more processing time +- Complex GPR rules increase parsing overhead +- Monitor system memory usage + +**Memory errors** +- Reduce model size or split into smaller files +- Increase available system memory +- Use more efficient data structures + +**Output file corruption** +- Ensure sufficient disk space +- Check file write permissions +- Verify format-specific requirements + +## Advanced Usage + +### Batch Model Creation + +```python +#!/usr/bin/env python3 +import subprocess +import pandas as pd + +# Create multiple tissue-specific models +tissues = ['liver', 'muscle', 'brain', 'heart'] +base_data = pd.read_csv('base_model.csv') + +for tissue in tissues: + # Modify base data for tissue specificity + tissue_data = customize_for_tissue(base_data, tissue) + tissue_data.to_csv(f'{tissue}_model.csv', index=False) + + # Convert to SBML + subprocess.run([ + 'exportMetabolicModel', + '--input', f'{tissue}_model.csv', + '--format', 'sbml', + '--output', f'{tissue}_model.xml', + '--out_log', f'{tissue}_conversion.log', + '--tool_dir', '/opt/COBRAxy/src' + ]) +``` + +### Model Merging + +Combine multiple tabular files into comprehensive models: + +```bash +# Merge core metabolism with tissue-specific pathways +cat core_reactions.csv > combined_model.csv +tail -n +2 tissue_reactions.csv >> combined_model.csv +tail -n +2 disease_reactions.csv >> combined_model.csv + +# Create merged model +exportMetabolicModel --input combined_model.csv \ + --format sbml \ + --output comprehensive_model.xml \ + --tool_dir /opt/COBRAxy/src +``` + +### Model Versioning + +Track model versions and changes: + +```bash +# Version control for model development +git add model_v1.csv +git commit -m "Initial model version" + +# Create versioned models +exportMetabolicModel --input model_v1.csv --format sbml \ + --output model_v1.xml --tool_dir /opt/COBRAxy/src +exportMetabolicModel --input model_v2.csv --format sbml \ + --output model_v2.xml --tool_dir /opt/COBRAxy/src + +# Compare model versions +cobra_compare_models model_v1.xml model_v2.xml +``` + +## See Also + +- [Import 
Metabolic Model](import-metabolic-model.md) - Extract tabular data from existing models +- [RAS to Bounds](ras-to-bounds.md) - Apply constraints to custom models +- [Flux Simulation](flux-simulation.md) - Analyze custom models with flux sampling +- [Model Creation Tutorial](/tutorials/custom-model-creation.md) +- [COBRA Model Standards](/tutorials/cobra-model-standards.md) \ No newline at end of file
--- a/COBRAxy/docs/tools/flux-simulation.md Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/docs/tools/flux-simulation.md Sun Oct 26 19:27:41 2025 +0000
@@ -300,7 +300,7 @@
 ### Upstream Tools
 
 - [RAS to Bounds](ras-to-bounds.md) - Generate constrained bounds from RAS
-- [Model Setting](metabolic-model-setting.md) - Extract model components
+- [Import Metabolic Model](import-metabolic-model.md) - Extract model components
 
 ### Downstream Tools
 - [Flux to Map](flux-to-map.md) - Visualize flux distributions on metabolic maps
@@ -402,5 +402,5 @@
 
 - [RAS to Bounds](ras-to-bounds.md) - Generate input constraints
 - [Flux to Map](flux-to-map.md) - Visualize flux results
-- [CBS Algorithm Documentation](../tutorials/cbs-algorithm.md)
-- [OPTGP Algorithm Documentation](../tutorials/optgp-algorithm.md)
\ No newline at end of file
+- [CBS Algorithm Documentation](/tutorials/cbs-algorithm.md)
+- [OPTGP Algorithm Documentation](/tutorials/optgp-algorithm.md)
\ No newline at end of file
--- a/COBRAxy/docs/tools/flux-to-map.md Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/docs/tools/flux-to-map.md Sun Oct 26 19:27:41 2025 +0000
@@ -463,5 +463,5 @@
 
 - [Flux Simulation](flux-simulation.md) - Generate input flux distributions
 - [MAREA](marea.md) - Alternative pathway analysis approach
-- [Custom Map Creation Guide](../tutorials/custom-map-creation.md)
-- [Statistical Methods Reference](../tutorials/statistical-methods.md)
\ No newline at end of file
+- [Custom Map Creation Guide](/tutorials/custom-map-creation.md)
+- [Statistical Methods Reference](/tutorials/statistical-methods.md)
\ No newline at end of file
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COBRAxy/docs/tools/import-metabolic-model.md Sun Oct 26 19:27:41 2025 +0000 @@ -0,0 +1,387 @@ +# Import Metabolic Model + +Import and extract metabolic model components into tabular format for analysis and integration. + +## Overview + +Import Metabolic Model (importMetabolicModel) imports metabolic models from various formats (SBML, JSON, MAT, YAML) and extracts key components into comprehensive tabular summaries. This tool processes built-in or custom models, applies medium constraints, handles gene nomenclature conversion, and outputs structured data for downstream analysis. + +## Usage + +### Command Line + +```bash +importMetabolicModel --model ENGRO2 \ + --name ENGRO2 \ + --medium_selector allOpen \ + --out_tabular model_data.csv \ + --out_log extraction.log \ + --tool_dir /path/to/COBRAxy/src +``` + +### Galaxy Interface + +Select "Import Metabolic Model" from the COBRAxy tool suite and configure model extraction parameters. + +## Parameters + +### Required Parameters + +| Parameter | Flag | Description | +|-----------|------|-------------| +| Model Name | `--name` | Model identifier for output files | +| Medium Selector | `--medium_selector` | Medium configuration option | +| Output Tabular | `--out_tabular` | Output file path (CSV or XLSX) | +| Output Log | `--out_log` | Log file for processing information | +| Tool Directory | `--tool_dir` | COBRAxy installation directory | + +### Model Selection Parameters + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Built-in Model | `--model` | Pre-installed model (ENGRO2, Recon, HMRcore) | - | +| Custom Model | `--input` | Path to custom SBML/JSON model file | - | + +**Note**: Provide either `--model` OR `--input`, not both. + +### Optional Parameters + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Custom Medium | `--custom_medium` | CSV file with medium constraints | - | + +## Model Selection + +### Built-in Models + +#### ENGRO2 +- **Species**: Homo sapiens +- **Scope**: Genome-scale reconstruction +- **Reactions**: ~2,000 reactions +- **Metabolites**: ~1,500 metabolites +- **Coverage**: Comprehensive human metabolism + +#### Recon +- **Species**: Homo sapiens +- **Scope**: Recon3D human reconstruction +- **Reactions**: ~10,000+ reactions +- **Metabolites**: ~5,000+ metabolites +- **Coverage**: Most comprehensive human model + +#### HMRcore +- **Species**: Homo sapiens +- **Scope**: Core metabolic network +- **Reactions**: ~300 essential reactions +- **Metabolites**: ~200 core metabolites +- **Coverage**: Central carbon and energy metabolism + +### Custom Models + +Supported formats for custom model import: +- **SBML**: Systems Biology Markup Language (.xml, .sbml) +- **JSON**: COBRApy JSON format (.json) +- **MAT**: MATLAB format (.mat) +- **YML**: YAML format (.yml, .yaml) +- **Compressed**: All formats support .gz, .zip, .bz2 compression + +## Medium Configuration + +### allOpen (Default) +- All exchange reactions unconstrained +- Maximum metabolic flexibility +- Suitable for general analysis + +### Custom Medium +Users can specify custom medium constraints by providing a CSV file with exchange reaction bounds. 
+ +## Output Format + +### Tabular Summary File + +The output contains comprehensive model information in CSV or XLSX format: + +#### Column Structure +``` +Reaction_ID GPR_Rule Reaction_Formula Lower_Bound Upper_Bound Objective_Coefficient Medium_Member Compartment Subsystem +R00001 GENE1 or GENE2 A + B -> C + D -1000.0 1000.0 0.0 FALSE cytosol Glycolysis +R00002 GENE3 and GENE4 E <-> F -1000.0 1000.0 0.0 FALSE mitochondria TCA_Cycle +EX_glc_e - glc_e <-> -1000.0 1000.0 0.0 TRUE extracellular Exchange +``` + +#### Data Fields + +| Field | Description | Values | +|-------|-------------|---------| +| Reaction_ID | Unique reaction identifier | String | +| GPR_Rule | Gene-protein-reaction association | Logical expression | +| Reaction_Formula | Stoichiometric equation | Metabolites with coefficients | +| Lower_Bound | Minimum flux constraint | Numeric (typically -1000) | +| Upper_Bound | Maximum flux constraint | Numeric (typically 1000) | +| Objective_Coefficient | Biomass/objective weight | Numeric (0 or 1) | +| Medium_Member | Exchange reaction flag | TRUE/FALSE | +| Compartment | Subcellular location | String (for ENGRO2 only) | +| Subsystem | Metabolic pathway | String | + +## Examples + +### Extract Built-in Model Data + +```bash +# Extract ENGRO2 model with default settings +importMetabolicModel --model ENGRO2 \ + --name ENGRO2_extraction \ + --medium_selector allOpen \ + --out_tabular ENGRO2_data.csv \ + --out_log ENGRO2_log.txt \ + --tool_dir /opt/COBRAxy/src +``` + +### Process Custom Model + +```bash +# Extract custom SBML model +importMetabolicModel --input /data/custom_model.xml \ + --name CustomModel \ + --medium_selector allOpen \ + --out_tabular custom_model_data.csv \ + --out_log custom_extraction.log \ + --tool_dir /opt/COBRAxy/src +``` + +### Extract Core Model for Quick Analysis + +```bash +# Extract HMRcore for rapid prototyping +importMetabolicModel --model HMRcore \ + --name CoreModel \ + --medium_selector allOpen \ + --out_tabular core_reactions.csv \ + --out_log core_log.txt \ + --tool_dir /opt/COBRAxy/src +``` + +### Batch Processing Multiple Models + +```bash +#!/bin/bash +models=("ENGRO2" "HMRcore" "Recon") +for model in "${models[@]}"; do + importMetabolicModel --model "$model" \ + --name "${model}_extract" \ + --medium_selector allOpen \ + --out_tabular "${model}_data.csv" \ + --out_log "${model}_log.txt" \ + --tool_dir /opt/COBRAxy/src +done +``` + +## Use Cases + +### Model Comparison +Extract multiple models to compare: +- Reaction coverage across different reconstructions +- Gene-reaction associations +- Pathway representation +- Metabolite compartmentalization + +### Data Integration +Prepare model data for: +- Custom analysis pipelines +- Database integration +- Pathway annotation +- Cross-reference mapping + +### Quality Control +Validate model properties: +- Check reaction balancing +- Verify gene associations +- Assess network connectivity +- Identify missing annotations + +### Custom Analysis +Export structured data for: +- Network analysis (graph theory) +- Machine learning applications +- Statistical modeling +- Comparative genomics + +## Integration Workflow + +### Downstream Tools + +The extracted tabular data serves as input for: + +#### COBRAxy Tools +- [RAS Generator](ras-generator.md) - Use extracted GPR rules +- [RPS Generator](rps-generator.md) - Use reaction formulas +- [RAS to Bounds](ras-to-bounds.md) - Use reaction bounds +- [MAREA](marea.md) - Use reaction annotations + +#### External Analysis +- **R/Bioconductor**: Import CSV for pathway 
analysis +- **Python/pandas**: Load data for network analysis +- **MATLAB**: Process XLSX for modeling +- **Cytoscape**: Network visualization +- **Databases**: Populate reaction databases + +### Typical Pipeline + +```bash +# 1. Extract model components +importMetabolicModel --model ENGRO2 --name ModelData \ + --out_tabular model_components.csv \ + --tool_dir /opt/COBRAxy/src + +# 2. Use extracted data for RAS analysis +ras_generator -td /opt/COBRAxy/src -rs Custom \ + -rl model_components.csv \ + -in expression_data.tsv -ra ras_scores.tsv + +# 3. Apply constraints and sample fluxes +ras_to_bounds -td /opt/COBRAxy/src -ms Custom -mo model_components.csv \ + -ir ras_scores.tsv -idop constrained_bounds/ + +# 4. Visualize results +marea -td /opt/COBRAxy/src -input_data ras_scores.tsv \ + -choice_map Custom -custom_map custom.svg -idop results/ +``` + +## Quality Control + +### Pre-extraction Validation +- Verify model file integrity and format +- Check SBML compliance for custom models +- Validate gene ID formats and coverage +- Confirm medium constraint specifications + +### Post-extraction Checks +- **Completeness**: Verify all expected reactions extracted +- **Consistency**: Check stoichiometric balance +- **Annotations**: Validate gene-reaction associations +- **Formatting**: Confirm output file structure + +### Data Validation + +#### Reaction Balancing +```bash +# Check for unbalanced reactions +awk -F'\t' 'NR>1 && $3 !~ /\<->\|->/ {print $1, $3}' model_data.csv +``` + +#### Gene Coverage +```bash +# Count reactions with GPR rules +awk -F'\t' 'NR>1 && $2 != "" {count++} END {print count " reactions with GPR"}' model_data.csv +``` + +#### Exchange Reactions +```bash +# List medium components +awk -F'\t' 'NR>1 && $7 == "TRUE" {print $1}' model_data.csv +``` + +## Tips and Best Practices + +### Model Selection +- **ENGRO2**: Balanced coverage for human tissue analysis +- **HMRcore**: Fast processing for algorithm development +- **Recon**: Comprehensive analysis requiring computational resources +- **Custom**: Organism-specific or specialized models + +### Output Format Optimization +- **CSV**: Lightweight, universal compatibility +- Choose based on downstream analysis requirements + +### Performance Considerations +- Large models (Recon) may require substantial memory +- Consider batch processing for multiple extractions + +## Troubleshooting + +### Common Issues + +**Model loading fails** +- Check file format and compression +- Verify SBML/JSON/MAT/YAML validity for custom models +- Ensure sufficient system memory + +**Empty output file** +- Model may contain no reactions +- Check model file integrity +- Verify tool directory configuration + +### Error Messages + +| Error | Cause | Solution | +|-------|-------|----------| +| "Model file not found" | Invalid file path | Check file location and permissions | +| "Unsupported format" | Invalid model format | Use SBML, JSON, MAT, or YAML | +| "Memory allocation error" | Insufficient system memory | Use smaller model or increase memory | + +### Performance Issues + +**Slow processing** +- Large models require more time +- Monitor system resource usage + +**Memory errors** +- Reduce model size if possible +- Process in smaller batches +- Increase available system memory + +**Output file corruption** +- Check disk space availability +- Verify file write permissions +- Monitor for system interruptions + +## Advanced Usage + +### Batch Extraction Script + +```python +#!/usr/bin/env python3 +import subprocess +import sys + +models = ['ENGRO2', 
'HMRcore', 'Recon'] + +for model in models: + cmd = [ + 'importMetabolicModel', + '--model', model, + '--name', f'{model}_data', + '--medium_selector', 'allOpen', + '--out_tabular', f'{model}.csv', + '--out_log', f'{model}.log', + '--tool_dir', '/opt/COBRAxy/src' + ] + subprocess.run(cmd, check=True) +``` + +### Database Integration + +Export model data to databases: + +```sql +-- Load CSV into PostgreSQL +CREATE TABLE model_reactions ( + reaction_id VARCHAR(50), + gpr_rule TEXT, + reaction_formula TEXT, + lower_bound FLOAT, + upper_bound FLOAT, + objective_coefficient FLOAT, + medium_member BOOLEAN, + compartment VARCHAR(50), + subsystem VARCHAR(100) +); + +COPY model_reactions FROM 'model_data.csv' WITH CSV HEADER; +``` + +## See Also + +- [Export Metabolic Model](export-metabolic-model.md) - Export tabular data to model formats +- [RAS Generator](ras-generator.md) - Use extracted GPR rules for RAS computation +- [RPS Generator](rps-generator.md) - Use reaction formulas for RPS analysis +- [Custom Model Tutorial](/tutorials/custom-model-integration.md) \ No newline at end of file
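The Python/pandas route mentioned under "External Analysis" pairs naturally with the post-extraction checks above. A minimal sketch, assuming the column layout documented in the Output Format section and a comma-separated export (adjust `sep=` if the file turns out to be tab-delimited):

```python
import pandas as pd

# Load the extracted model table (column names as documented above)
df = pd.read_csv("ENGRO2_data.csv")

# Completeness: total reactions and GPR coverage
print(f"Reactions extracted: {len(df)}")
print(f"Reactions with a GPR rule: {df['GPR_Rule'].fillna('').astype(str).str.strip().ne('').sum()}")

# Medium composition: exchange reactions flagged as medium members
medium = df[df["Medium_Member"].astype(str).str.upper() == "TRUE"]
print(f"Medium components: {len(medium)}")

# Formatting: flag formulas without a reaction arrow and inverted bounds
no_arrow = df[~df["Reaction_Formula"].astype(str).str.contains("->")]
bad_bounds = df[df["Lower_Bound"] > df["Upper_Bound"]]
print(f"Formulas without an arrow: {len(no_arrow)}; inverted bounds: {len(bad_bounds)}")
```

The same questions can be answered with the awk one-liners shown earlier; the pandas version is simply easier to extend into a full network analysis.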
--- a/COBRAxy/docs/tools/marea-cluster.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tools/marea-cluster.md Sun Oct 26 19:27:41 2025 +0000 @@ -508,5 +508,5 @@ - [MAREA](marea.md) - Statistical analysis of cluster differences - [RAS Generator](ras-generator.md) - Generate clustering input data - [Flux Simulation](flux-simulation.md) - Alternative clustering data source -- [Clustering Tutorial](../tutorials/clustering-analysis.md) -- [Validation Methods Reference](../tutorials/cluster-validation.md) \ No newline at end of file +- [Clustering Tutorial](/tutorials/clustering-analysis.md) +- [Validation Methods Reference](/tutorials/cluster-validation.md) \ No newline at end of file
--- a/COBRAxy/docs/tools/marea.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tools/marea.md Sun Oct 26 19:27:41 2025 +0000 @@ -274,6 +274,6 @@ ## See Also -- [Statistical Tests Documentation](../tutorials/statistical-tests.md) -- [Map Customization Guide](../tutorials/custom-maps.md) -- [Multi-modal Analysis Tutorial](../tutorials/multimodal-analysis.md) \ No newline at end of file +- [Statistical Tests Documentation](/tutorials/statistical-tests.md) +- [Map Customization Guide](/tutorials/custom-maps.md) +- [Multi-modal Analysis Tutorial](/tutorials/multimodal-analysis.md) \ No newline at end of file
--- a/COBRAxy/docs/tools/metabolic-model-setting.md Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,425 +0,0 @@ -# Metabolic Model Setting - -Extract and organize metabolic model components into tabular format for analysis and integration. - -## Overview - -Metabolic Model Setting (metabolicModel2Tabular) extracts key components from SBML metabolic models and generates comprehensive tabular summaries. This tool processes built-in or custom models, applies medium constraints, handles gene nomenclature conversion, and outputs structured data for downstream analysis. - -## Usage - -### Command Line - -```bash -metabolicModel2Tabular --model ENGRO2 \ - --name ENGRO2 \ - --medium_selector allOpen \ - --gene_format Default \ - --out_tabular model_data.csv \ - --out_log extraction.log \ - --tool_dir /path/to/COBRAxy -``` - -### Galaxy Interface - -Select "Metabolic Model Setting" from the COBRAxy tool suite and configure model extraction parameters. - -## Parameters - -### Required Parameters - -| Parameter | Flag | Description | -|-----------|------|-------------| -| Model Name | `--name` | Model identifier for output files | -| Medium Selector | `--medium_selector` | Medium configuration option | -| Output Tabular | `--out_tabular` | Output file path (CSV or XLSX) | -| Output Log | `--out_log` | Log file for processing information | -| Tool Directory | `--tool_dir` | COBRAxy installation directory | - -### Model Selection Parameters - -| Parameter | Flag | Description | Default | -|-----------|------|-------------|---------| -| Built-in Model | `--model` | Pre-installed model (ENGRO2, Recon, HMRcore) | - | -| Custom Model | `--input` | Path to custom SBML/JSON model file | - | - -**Note**: Provide either `--model` OR `--input`, not both. - -### Optional Parameters - -| Parameter | Flag | Description | Default | -|-----------|------|-------------|---------| -| Gene Format | `--gene_format` | Gene ID format conversion | Default | - -## Model Selection - -### Built-in Models - -#### ENGRO2 -- **Species**: Homo sapiens -- **Scope**: Genome-scale reconstruction -- **Reactions**: ~2,000 reactions -- **Metabolites**: ~1,500 metabolites -- **Coverage**: Comprehensive human metabolism - -#### Recon -- **Species**: Homo sapiens -- **Scope**: Recon3D human reconstruction -- **Reactions**: ~10,000+ reactions -- **Metabolites**: ~5,000+ metabolites -- **Coverage**: Most comprehensive human model - -#### HMRcore -- **Species**: Homo sapiens -- **Scope**: Core metabolic network -- **Reactions**: ~300 essential reactions -- **Metabolites**: ~200 core metabolites -- **Coverage**: Central carbon and energy metabolism - -### Custom Models - -Supported formats for custom model import: -- **SBML**: Systems Biology Markup Language (.xml, .sbml) -- **JSON**: COBRApy JSON format (.json) -- **MAT**: MATLAB format (.mat) -- **YML**: YAML format (.yml, .yaml) -- **Compressed**: All formats support .gz, .zip, .bz2 compression - -## Medium Configuration - -### allOpen (Default) -- All exchange reactions unconstrained -- Maximum metabolic flexibility -- Suitable for general analysis - -### Custom Medium -User can specify custom medium constraints through Galaxy interface or by modifying the tool configuration. 
- -## Gene Format Options - -| Format | Description | Example | -|--------|-------------|---------| -| Default | Original model gene IDs | As stored in model | -| ENSNG | Ensembl Gene IDs | ENSG00000139618 | -| HGNC_SYMBOL | HUGO Gene Symbols | BRCA2 | -| HGNC_ID | HUGO Gene Committee IDs | HGNC:1101 | -| ENTREZ | NCBI Entrez Gene IDs | 675 | - -Gene format conversion uses internal mapping tables and may not cover all genes in custom models. - -## Output Format - -### Tabular Summary File - -The output contains comprehensive model information in CSV or XLSX format: - -#### Column Structure -``` -Reaction_ID GPR_Rule Reaction_Formula Lower_Bound Upper_Bound Objective_Coefficient Medium_Member Compartment Subsystem -R00001 GENE1 or GENE2 A + B -> C + D -1000.0 1000.0 0.0 FALSE cytosol Glycolysis -R00002 GENE3 and GENE4 E <-> F -1000.0 1000.0 0.0 FALSE mitochondria TCA_Cycle -EX_glc_e - glc_e <-> -1000.0 1000.0 0.0 TRUE extracellular Exchange -``` - -#### Data Fields - -| Field | Description | Values | -|-------|-------------|---------| -| Reaction_ID | Unique reaction identifier | String | -| GPR_Rule | Gene-protein-reaction association | Logical expression | -| Reaction_Formula | Stoichiometric equation | Metabolites with coefficients | -| Lower_Bound | Minimum flux constraint | Numeric (typically -1000) | -| Upper_Bound | Maximum flux constraint | Numeric (typically 1000) | -| Objective_Coefficient | Biomass/objective weight | Numeric (0 or 1) | -| Medium_Member | Exchange reaction flag | TRUE/FALSE | -| Compartment | Subcellular location | String (for ENGRO2 only) | -| Subsystem | Metabolic pathway | String | - -## Examples - -### Extract Built-in Model Data - -```bash -# Extract ENGRO2 model with default settings -metabolicModel2Tabular --model ENGRO2 \ - --name ENGRO2_extraction \ - --medium_selector allOpen \ - --gene_format Default \ - --out_tabular ENGRO2_data.csv \ - --out_log ENGRO2_log.txt \ - --tool_dir /opt/COBRAxy -``` - -### Process Custom Model - -```bash -# Extract custom SBML model with gene conversion -metabolicModel2Tabular --input /data/custom_model.xml \ - --name CustomModel \ - --medium_selector allOpen \ - --gene_format HGNC_SYMBOL \ - --out_tabular custom_model_data.xlsx \ - --out_log custom_extraction.log \ - --tool_dir /opt/COBRAxy -``` - -### Extract Core Model for Quick Analysis - -```bash -# Extract HMRcore for rapid prototyping -metabolicModel2Tabular --model HMRcore \ - --name CoreModel \ - --medium_selector allOpen \ - --gene_format ENSNG \ - --out_tabular core_reactions.csv \ - --out_log core_log.txt \ - --tool_dir /opt/COBRAxy -``` - -### Batch Processing Multiple Models - -```bash -#!/bin/bash -models=("ENGRO2" "HMRcore" "Recon") -for model in "${models[@]}"; do - metabolicModel2Tabular --model "$model" \ - --name "${model}_extract" \ - --medium_selector allOpen \ - --gene_format HGNC_SYMBOL \ - --out_tabular "${model}_data.csv" \ - --out_log "${model}_log.txt" \ - --tool_dir /opt/COBRAxy -done -``` - -## Use Cases - -### Model Comparison -Extract multiple models to compare: -- Reaction coverage across different reconstructions -- Gene-reaction associations -- Pathway representation -- Metabolite compartmentalization - -### Data Integration -Prepare model data for: -- Custom analysis pipelines -- Database integration -- Pathway annotation -- Cross-reference mapping - -### Quality Control -Validate model properties: -- Check reaction balancing -- Verify gene associations -- Assess network connectivity -- Identify missing annotations - -### Custom Analysis 
-Export structured data for: -- Network analysis (graph theory) -- Machine learning applications -- Statistical modeling -- Comparative genomics - -## Integration Workflow - -### Downstream Tools - -The extracted tabular data serves as input for: - -#### COBRAxy Tools -- [RAS Generator](ras-generator.md) - Use extracted GPR rules -- [RPS Generator](rps-generator.md) - Use reaction formulas -- [RAS to Bounds](ras-to-bounds.md) - Use reaction bounds -- [MAREA](marea.md) - Use reaction annotations - -#### External Analysis -- **R/Bioconductor**: Import CSV for pathway analysis -- **Python/pandas**: Load data for network analysis -- **MATLAB**: Process XLSX for modeling -- **Cytoscape**: Network visualization -- **Databases**: Populate reaction databases - -### Typical Pipeline - -```bash -# 1. Extract model components -metabolicModel2Tabular --model ENGRO2 --name ModelData \ - --out_tabular model_components.csv - -# 2. Use extracted data for RAS analysis -ras_generator -td /opt/COBRAxy -rs Custom \ - -rl model_components.csv \ - -in expression_data.tsv -ra ras_scores.tsv - -# 3. Apply constraints and sample fluxes -ras_to_bounds -td /opt/COBRAxy -ms Custom -mo model_components.csv \ - -ir ras_scores.tsv -idop constrained_bounds/ - -# 4. Visualize results -marea -td /opt/COBRAxy -input_data ras_scores.tsv \ - -choice_map Custom -custom_map custom.svg -idop results/ -``` - -## Quality Control - -### Pre-extraction Validation -- Verify model file integrity and format -- Check SBML compliance for custom models -- Validate gene ID formats and coverage -- Confirm medium constraint specifications - -### Post-extraction Checks -- **Completeness**: Verify all expected reactions extracted -- **Consistency**: Check stoichiometric balance -- **Annotations**: Validate gene-reaction associations -- **Formatting**: Confirm output file structure - -### Data Validation - -#### Reaction Balancing -```bash -# Check for unbalanced reactions -awk -F'\t' 'NR>1 && $3 !~ /\<->\|->/ {print $1, $3}' model_data.csv -``` - -#### Gene Coverage -```bash -# Count reactions with GPR rules -awk -F'\t' 'NR>1 && $2 != "" {count++} END {print count " reactions with GPR"}' model_data.csv -``` - -#### Exchange Reactions -```bash -# List medium components -awk -F'\t' 'NR>1 && $7 == "TRUE" {print $1}' model_data.csv -``` - -## Tips and Best Practices - -### Model Selection -- **ENGRO2**: Balanced coverage for human tissue analysis -- **HMRcore**: Fast processing for algorithm development -- **Recon**: Comprehensive analysis requiring computational resources -- **Custom**: Organism-specific or specialized models - -### Gene Format Selection -- **Default**: Preserve original model annotations -- **HGNC_SYMBOL**: Human-readable gene names -- **ENSNG**: Stable identifiers for bioinformatics -- **ENTREZ**: Cross-database compatibility - -### Output Format Optimization -- **CSV**: Lightweight, universal compatibility -- **XLSX**: Rich formatting, multiple sheets possible -- Choose based on downstream analysis requirements - -### Performance Considerations -- Large models (Recon) may require substantial memory -- Gene format conversion adds processing time -- Consider batch processing for multiple extractions - -## Troubleshooting - -### Common Issues - -**Model loading fails** -- Check file format and compression -- Verify SBML validity for custom models -- Ensure sufficient system memory - -**Gene format conversion errors** -- Mapping tables may not cover all genes -- Original gene IDs retained when conversion fails -- Check log file 
for conversion statistics - -**Empty output file** -- Model may contain no reactions -- Check model file integrity -- Verify tool directory configuration - -### Error Messages - -| Error | Cause | Solution | -|-------|-------|----------| -| "Model file not found" | Invalid file path | Check file location and permissions | -| "Unsupported format" | Invalid model format | Use SBML, JSON, MAT, or YML | -| "Gene mapping failed" | Missing gene conversion data | Use Default format or update mappings | -| "Memory allocation error" | Insufficient system memory | Use smaller model or increase memory | - -### Performance Issues - -**Slow processing** -- Large models require more time -- Gene conversion adds overhead -- Monitor system resource usage - -**Memory errors** -- Reduce model size if possible -- Process in smaller batches -- Increase available system memory - -**Output file corruption** -- Check disk space availability -- Verify file write permissions -- Monitor for system interruptions - -## Advanced Usage - -### Custom Gene Mapping - -Advanced users can extend gene format conversion by modifying mapping files in the `local/mappings/` directory. - -### Batch Extraction Script - -```python -#!/usr/bin env python3 -import subprocess -import sys - -models = ['ENGRO2', 'HMRcore', 'Recon'] -formats = ['Default', 'HGNC_SYMBOL', 'ENSNG'] - -for model in models: - for fmt in formats: - cmd = [ - 'metabolicModel2Tabular', - '--model', model, - '--name', f'{model}_{fmt}', - '--medium_selector', 'allOpen', - '--gene_format', fmt, - '--out_tabular', f'{model}_{fmt}.csv', - '--out_log', f'{model}_{fmt}.log', - '--tool_dir', '/opt/COBRAxy' - ] - subprocess.run(cmd, check=True) -``` - -### Database Integration - -Export model data to databases: - -```sql --- Load CSV into PostgreSQL -CREATE TABLE model_reactions ( - reaction_id VARCHAR(50), - gpr_rule TEXT, - reaction_formula TEXT, - lower_bound FLOAT, - upper_bound FLOAT, - objective_coefficient FLOAT, - medium_member BOOLEAN, - compartment VARCHAR(50), - subsystem VARCHAR(100) -); - -COPY model_reactions FROM 'model_data.csv' WITH CSV HEADER; -``` - -## See Also - -- [RAS Generator](ras-generator.md) - Use extracted GPR rules for RAS computation -- [RPS Generator](rps-generator.md) - Use reaction formulas for RPS analysis -- [Custom Model Tutorial](../tutorials/custom-model-integration.md) -- [Gene Mapping Reference](../tutorials/gene-id-conversion.md) \ No newline at end of file
--- a/COBRAxy/docs/tools/ras-generator.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tools/ras-generator.md Sun Oct 26 19:27:41 2025 +0000 @@ -18,7 +18,6 @@ | Parameter | Short | Type | Description | |-----------|--------|------|-------------| -| `--tool_dir` | `-td` | string | COBRAxy installation directory | | `--input` | `-in` | file | Gene expression dataset (TSV format) | | `--ras_output` | `-ra` | file | Output file for RAS values | | `--rules_selector` | `-rs` | choice | Built-in model (ENGRO2, Recon, HMRcore) | @@ -27,11 +26,14 @@ | Parameter | Short | Type | Default | Description | |-----------|--------|------|---------|-------------| +| `--tool_dir` | `-td` | string | auto-detected | COBRAxy installation directory (automatically detected after pip install) | | `--none` | `-n` | boolean | true | Handle missing gene values | | `--model_upload` | `-rl` | file | - | Custom GPR rules file | | `--model_upload_name` | `-rn` | string | - | Custom model name | | `--out_log` | - | file | log.txt | Output log file | +> **Note**: After installing COBRAxy via pip, the `--tool_dir` parameter is automatically detected and doesn't need to be specified. + ## Input Format ### Gene Expression File @@ -102,35 +104,25 @@ ### Command Line ```bash -# Basic usage with built-in model -ras_generator -td /path/to/COBRAxy \ +# Basic usage with built-in model (after pip install) +ras_generator \ -in expression_data.tsv \ -ra ras_output.tsv \ -rs ENGRO2 # With custom model and strict missing gene handling -ras_generator -td /path/to/COBRAxy \ +ras_generator \ -in expression_data.tsv \ -ra ras_output.tsv \ -rl custom_rules.tsv \ -rn "CustomModel" \ -n false -``` -### Python API - -```python -import ras_generator - -# Basic RAS generation -args = [ - '-td', '/path/to/COBRAxy', - '-in', 'expression_data.tsv', - '-ra', 'ras_output.tsv', - '-rs', 'ENGRO2' -] - -ras_generator.main(args) +# Explicitly specify tool directory (only needed if not using pip install) +ras_generator -td /path/to/COBRAxy \ + -in expression_data.tsv \ + -ra ras_output.tsv \ + -rs ENGRO2 ``` ### Galaxy Usage
--- a/COBRAxy/docs/tools/ras-to-bounds.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tools/ras-to-bounds.md Sun Oct 26 19:27:41 2025 +0000 @@ -330,4 +330,4 @@ - [RAS Generator](ras-generator.md) - Generate input RAS data - [Flux Simulation](flux-simulation.md) - Use constrained bounds for sampling -- [Model Setting](metabolic-model-setting.md) - Extract model components \ No newline at end of file +- [Import Metabolic Model](import-metabolic-model.md) - Extract model components \ No newline at end of file
--- a/COBRAxy/docs/tools/tabular-to-model.md Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,480 +0,0 @@ -# Tabular to Metabolic Model - -Convert tabular data (CSV/TSV) into COBRA metabolic models in various formats. - -## Overview - -Tabular to Metabolic Model (tabular2MetabolicModel) converts structured tabular data containing reaction information into fully functional COBRA metabolic models. This tool enables creation of custom models from spreadsheet data and supports multiple output formats including SBML, JSON, MATLAB, and YAML. - -## Usage - -### Command Line - -```bash -tabular2MetabolicModel --input model_data.csv \ - --format sbml \ - --output custom_model.xml \ - --out_log conversion.log \ - --tool_dir /path/to/COBRAxy -``` - -### Galaxy Interface - -Select "Tabular to Metabolic Model" from the COBRAxy tool suite and configure conversion parameters. - -## Parameters - -### Required Parameters - -| Parameter | Flag | Description | -|-----------|------|-------------| -| Input File | `--input` | Tabular file (CSV/TSV) with model data | -| Output Format | `--format` | Model format (sbml, json, mat, yaml) | -| Output File | `--output` | Output model file path | -| Output Log | `--out_log` | Log file for conversion process | - -### Optional Parameters - -| Parameter | Flag | Description | Default | -|-----------|------|-------------|---------| -| Tool Directory | `--tool_dir` | COBRAxy installation directory | Current directory | - -## Input Format - -### Tabular Model Data - -The input file must contain structured model information with the following columns: - -```csv -Reaction_ID,GPR_Rule,Reaction_Formula,Lower_Bound,Upper_Bound,Objective_Coefficient,Medium_Member,Compartment,Subsystem -R00001,GENE1 or GENE2,A + B -> C + D,-1000.0,1000.0,0.0,FALSE,cytosol,Glycolysis -R00002,GENE3 and GENE4,E <-> F,-1000.0,1000.0,0.0,FALSE,mitochondria,TCA_Cycle -EX_glc_e,-,glc_e <->,-1000.0,1000.0,0.0,TRUE,extracellular,Exchange -BIOMASS,GENE5,0.5 A + 0.3 B -> 1 BIOMASS,0.0,1000.0,1.0,FALSE,cytosol,Biomass -``` - -### Required Columns - -| Column | Description | Format | -|--------|-------------|--------| -| **Reaction_ID** | Unique reaction identifier | String | -| **Reaction_Formula** | Stoichiometric equation | Metabolite notation | -| **Lower_Bound** | Minimum flux constraint | Numeric | -| **Upper_Bound** | Maximum flux constraint | Numeric | - -### Optional Columns - -| Column | Description | Default | -|--------|-------------|---------| -| **GPR_Rule** | Gene-protein-reaction association | Empty string | -| **Objective_Coefficient** | Biomass/objective weight | 0.0 | -| **Medium_Member** | Exchange reaction flag | FALSE | -| **Compartment** | Subcellular location | Empty | -| **Subsystem** | Metabolic pathway | Empty | - -## Output Formats - -### SBML (Systems Biology Markup Language) -- **Format**: XML-based standard -- **Extension**: `.xml` or `.sbml` -- **Use Case**: Interoperability with other tools -- **Advantages**: Widely supported, standardized - -### JSON (JavaScript Object Notation) -- **Format**: COBRApy native JSON -- **Extension**: `.json` -- **Use Case**: Python/COBRApy workflows -- **Advantages**: Human-readable, lightweight - -### MATLAB (.mat) -- **Format**: MATLAB workspace format -- **Extension**: `.mat` -- **Use Case**: MATLAB COBRA Toolbox -- **Advantages**: Direct MATLAB compatibility - -### YAML (YAML Ain't Markup Language) -- **Format**: Human-readable data serialization -- **Extension**: `.yml` or `.yaml` -- **Use Case**: 
Configuration and documentation -- **Advantages**: Most human-readable format - -## Reaction Formula Syntax - -### Standard Notation -``` -# Irreversible reaction -A + B -> C + D - -# Reversible reaction -A + B <-> C + D - -# With stoichiometric coefficients -2 A + 3 B -> 1 C + 4 D - -# Compartmentalized metabolites -glc_c + atp_c -> g6p_c + adp_c -``` - -### Compartment Suffixes -- `_c`: Cytosol -- `_m`: Mitochondria -- `_e`: Extracellular -- `_r`: Endoplasmic reticulum -- `_x`: Peroxisome -- `_n`: Nucleus - -### Exchange Reactions -``` -# Import reaction -EX_glc_e: glc_e <-> - -# Export reaction -EX_co2_e: co2_e <-> -``` - -## GPR Rule Syntax - -### Logical Operators -- **AND**: Gene products required together -- **OR**: Alternative gene products -- **Parentheses**: Grouping for complex logic - -### Examples -``` -# Single gene -GENE1 - -# Alternative genes (isozymes) -GENE1 or GENE2 or GENE3 - -# Required genes (complex) -GENE1 and GENE2 - -# Complex logic -(GENE1 and GENE2) or (GENE3 and GENE4) -``` - -## Examples - -### Create Basic Model - -```bash -# Convert simple CSV to SBML model -tabular2MetabolicModel --input simple_model.csv \ - --format sbml \ - --output simple_model.xml \ - --out_log simple_conversion.log -``` - -### Multi-format Export - -```bash -# Create models in all supported formats -formats=("sbml" "json" "mat" "yaml") -for fmt in "${formats[@]}"; do - tabular2MetabolicModel --input comprehensive_model.csv \ - --format "$fmt" \ - --output "model.$fmt" \ - --out_log "conversion_$fmt.log" -done -``` - -### Custom Model Creation - -```bash -# Build tissue-specific model from curated data -tabular2MetabolicModel --input liver_reactions.tsv \ - --format sbml \ - --output liver_model.xml \ - --out_log liver_model.log \ - --tool_dir /opt/COBRAxy -``` - -### Model Integration Pipeline - -```bash -# Extract existing model, modify, and recreate -metabolicModel2Tabular --model ENGRO2 --out_tabular base_model.csv - -# Edit base_model.csv with custom reactions/constraints - -# Create modified model -tabular2MetabolicModel --input modified_model.csv \ - --format sbml \ - --output custom_model.xml \ - --out_log custom_creation.log -``` - -## Model Validation - -### Automatic Checks - -The tool performs validation during conversion: -- **Stoichiometric Balance**: Reaction mass balance -- **Metabolite Consistency**: Compartment assignments -- **Bound Validation**: Feasible constraint ranges -- **Objective Function**: Valid biomass reaction - -### Post-conversion Validation - -```python -import cobra - -# Load and validate model -model = cobra.io.read_sbml_model('custom_model.xml') - -# Check basic properties -print(f"Reactions: {len(model.reactions)}") -print(f"Metabolites: {len(model.metabolites)}") -print(f"Genes: {len(model.genes)}") - -# Test model solvability -solution = model.optimize() -print(f"Growth rate: {solution.objective_value}") - -# Validate mass balance -unbalanced = cobra.flux_analysis.check_mass_balance(model) -if unbalanced: - print("Unbalanced reactions found:", unbalanced) -``` - -## Integration Workflow - -### Upstream Data Sources - -#### COBRAxy Tools -- [Metabolic Model Setting](metabolic-model-setting.md) - Extract tabular data for modification - -#### External Sources -- **Databases**: KEGG, Reactome, BiGG -- **Literature**: Manually curated reactions -- **Spreadsheets**: User-defined custom models - -### Downstream Applications - -#### COBRAxy Analysis -- [RAS to Bounds](ras-to-bounds.md) - Apply constraints to custom model -- [Flux 
Simulation](flux-simulation.md) - Sample fluxes from custom model -- [MAREA](marea.md) - Analyze custom pathways - -#### External Tools -- **COBRApy**: Python-based analysis -- **COBRA Toolbox**: MATLAB analysis -- **OptFlux**: Strain design -- **Escher**: Pathway visualization - -### Typical Pipeline - -```bash -# 1. Start with existing model data -metabolicModel2Tabular --model ENGRO2 \ - --out_tabular base_reactions.csv - -# 2. Modify/extend the reaction data -# Edit base_reactions.csv to add tissue-specific reactions - -# 3. Create custom model -tabular2MetabolicModel --input modified_reactions.csv \ - --format sbml \ - --output tissue_model.xml \ - --out_log tissue_creation.log - -# 4. Validate and use custom model -ras_to_bounds --model Custom --input tissue_model.xml \ - --ras_input tissue_expression.tsv \ - --idop tissue_bounds/ - -# 5. Perform flux analysis -flux_simulation --model Custom --input tissue_model.xml \ - --bounds tissue_bounds/*.tsv \ - --algorithm CBS --idop tissue_fluxes/ -``` - -## Quality Control - -### Input Data Validation - -#### Pre-conversion Checks -- **Format Consistency**: Verify column headers and data types -- **Reaction Completeness**: Check for missing required fields -- **Stoichiometric Validity**: Validate reaction formulas -- **Bound Feasibility**: Ensure lower ≤ upper bounds - -#### Common Data Issues -```bash -# Check for missing reaction IDs -awk -F',' 'NR>1 && ($1=="" || $1=="NA") {print "Empty ID in line " NR}' input.csv - -# Validate reaction directions -awk -F',' 'NR>1 && $3 !~ /->|<->/ {print "Invalid formula: " $1 ", " $3}' input.csv - -# Check bound consistency -awk -F',' 'NR>1 && $4>$5 {print "Invalid bounds: " $1 ", LB=" $4 " > UB=" $5}' input.csv -``` - -### Model Quality Assessment - -#### Structural Properties -- **Network Connectivity**: Ensure realistic pathway structure -- **Compartmentalization**: Validate transport reactions -- **Exchange Reactions**: Verify medium composition -- **Biomass Function**: Check objective reaction completeness - -#### Functional Testing -```python -# Test model functionality -model = cobra.io.read_sbml_model('custom_model.xml') - -# Check growth capability -growth = model.optimize().objective_value -print(f"Maximum growth rate: {growth}") - -# Flux Variability Analysis -fva_result = cobra.flux_analysis.flux_variability_analysis(model) -blocked_reactions = fva_result[(fva_result.minimum == 0) & (fva_result.maximum == 0)] -print(f"Blocked reactions: {len(blocked_reactions)}") - -# Essential gene analysis -essential_genes = cobra.flux_analysis.find_essential_genes(model) -print(f"Essential genes: {len(essential_genes)}") -``` - -## Tips and Best Practices - -### Data Preparation -- **Consistent Naming**: Use systematic metabolite/reaction IDs -- **Compartment Notation**: Follow standard suffixes (_c, _m, _e) -- **Balanced Reactions**: Verify mass and charge balance -- **Realistic Bounds**: Use physiologically relevant constraints - -### Model Design -- **Modular Structure**: Organize reactions by pathway/subsystem -- **Exchange Reactions**: Include all necessary transport processes -- **Biomass Function**: Define appropriate growth objective -- **Gene Associations**: Add GPR rules where available - -### Format Selection -- **SBML**: Choose for maximum compatibility and sharing -- **JSON**: Use for COBRApy-specific workflows -- **MATLAB**: Select for COBRA Toolbox integration -- **YAML**: Pick for human-readable documentation - -### Performance Optimization -- **Model Size**: Balance comprehensiveness 
with computational efficiency -- **Reaction Pruning**: Remove unnecessary or blocked reactions -- **Compartmentalization**: Minimize unnecessary compartments -- **Validation**: Test model properties before distribution - -## Troubleshooting - -### Common Issues - -**Conversion fails with format error** -- Check CSV/TSV column headers and data consistency -- Verify reaction formula syntax -- Ensure numeric fields contain valid numbers - -**Model is infeasible after conversion** -- Check reaction bounds for conflicts -- Verify exchange reaction setup -- Validate stoichiometric balance - -**Missing metabolites or reactions** -- Confirm all required columns present in input -- Check for empty rows or malformed data -- Validate reaction formula parsing - -### Error Messages - -| Error | Cause | Solution | -|-------|-------|----------| -| "Input file not found" | Invalid file path | Check file location and permissions | -| "Unknown format" | Invalid output format | Use: sbml, json, mat, or yaml | -| "Formula parsing failed" | Malformed reaction equation | Check reaction formula syntax | -| "Model infeasible" | Conflicting constraints | Review bounds and exchange reactions | - -### Performance Issues - -**Slow conversion** -- Large input files require more processing time -- Complex GPR rules increase parsing overhead -- Monitor system memory usage - -**Memory errors** -- Reduce model size or split into smaller files -- Increase available system memory -- Use more efficient data structures - -**Output file corruption** -- Ensure sufficient disk space -- Check file write permissions -- Verify format-specific requirements - -## Advanced Usage - -### Batch Model Creation - -```python -#!/usr/bin/env python3 -import subprocess -import pandas as pd - -# Create multiple tissue-specific models -tissues = ['liver', 'muscle', 'brain', 'heart'] -base_data = pd.read_csv('base_model.csv') - -for tissue in tissues: - # Modify base data for tissue specificity - tissue_data = customize_for_tissue(base_data, tissue) - tissue_data.to_csv(f'{tissue}_model.csv', index=False) - - # Convert to SBML - subprocess.run([ - 'tabular2MetabolicModel', - '--input', f'{tissue}_model.csv', - '--format', 'sbml', - '--output', f'{tissue}_model.xml', - '--out_log', f'{tissue}_conversion.log' - ]) -``` - -### Model Merging - -Combine multiple tabular files into comprehensive models: - -```bash -# Merge core metabolism with tissue-specific pathways -cat core_reactions.csv > combined_model.csv -tail -n +2 tissue_reactions.csv >> combined_model.csv -tail -n +2 disease_reactions.csv >> combined_model.csv - -# Create merged model -tabular2MetabolicModel --input combined_model.csv \ - --format sbml \ - --output comprehensive_model.xml -``` - -### Model Versioning - -Track model versions and changes: - -```bash -# Version control for model development -git add model_v1.csv -git commit -m "Initial model version" - -# Create versioned models -tabular2MetabolicModel --input model_v1.csv --format sbml --output model_v1.xml -tabular2MetabolicModel --input model_v2.csv --format sbml --output model_v2.xml - -# Compare model versions -cobra_compare_models model_v1.xml model_v2.xml -``` - -## See Also - -- [Metabolic Model Setting](metabolic-model-setting.md) - Extract tabular data from existing models -- [RAS to Bounds](ras-to-bounds.md) - Apply constraints to custom models -- [Flux Simulation](flux-simulation.md) - Analyze custom models with flux sampling -- [Model Creation Tutorial](../tutorials/custom-model-creation.md) -- [COBRA Model 
Standards](../tutorials/cobra-model-standards.md) \ No newline at end of file
--- a/COBRAxy/docs/troubleshooting.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/troubleshooting.md Sun Oct 26 19:27:41 2025 +0000 @@ -4,23 +4,50 @@ ## Installation Issues +### Missing Build Tools + +**Problem**: `gcc: command not found` or compilation errors (Linux/macOS) +```bash +# Ubuntu/Debian +sudo apt-get install build-essential cmake pkg-config + +# macOS +xcode-select --install +brew install cmake pkg-config +``` + +**Problem**: `CMake not found` +```bash +# Ubuntu/Debian +sudo apt-get install cmake + +# macOS +brew install cmake + +# Or via conda +conda install -c conda-forge cmake +``` + ### Python Import Errors **Problem**: `ModuleNotFoundError: No module named 'cobra'` ```bash -# Solution: Install missing dependencies -pip install cobra pandas numpy scipy +# Solution: Reinstall COBRAxy with dependencies +cd COBRAxy/src +pip install . -# Or reinstall COBRAxy -cd COBRAxy -pip install -e . +# Or install missing dependency directly +pip install cobra ``` **Problem**: `ImportError: No module named 'cobraxy'` ```python -# Solution: Add COBRAxy to Python path +# Solution: Ensure COBRAxy is installed +pip install /path/to/COBRAxy/src/ + +# Or add to Python path temporarily import sys -sys.path.insert(0, '/path/to/COBRAxy') +sys.path.insert(0, '/path/to/COBRAxy/src') ``` ### System Dependencies @@ -64,7 +91,7 @@ ```python # Check gene overlap with model import pickle -genes_dict = pickle.load(open('local/pickle files/ENGRO2_genes.p', 'rb')) +genes_dict = pickle.load(open('src/local/pickle files/ENGRO2_genes.p', 'rb')) model_genes = set(genes_dict['hugo_id'].keys()) import pandas as pd @@ -244,7 +271,7 @@ conda activate cobraxy # Install COBRAxy fresh -cd COBRAxy +cd COBRAxy/src pip install -e . ``` @@ -328,8 +355,7 @@ ### Community Resources -- **GitHub Issues**: [Report bugs](https://github.com/CompBtBs/COBRAxy/issues) -- **Discussions**: [Ask questions](https://github.com/CompBtBs/COBRAxy/discussions) +- **GitHub Issues**: [Report bugs and ask questions](https://github.com/CompBtBs/COBRAxy/issues) - **COBRApy Community**: [General metabolic modeling help](https://github.com/opencobra/cobrapy) ### Self-Help Checklist
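A quick way to work through the import-related items on this checklist is to confirm that the core dependencies resolve from the active environment. A minimal sketch (package names as used throughout this guide; versions are only printed, not enforced):

```python
import sys

# Show which interpreter is active (a frequent source of "module not found" errors)
print(f"Python: {sys.version.split()[0]} ({sys.executable})")

# Core dependencies installed with COBRAxy via pip
for pkg in ("cobra", "pandas", "numpy", "scipy"):
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown version')}")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc}) -> run 'pip install .' from COBRAxy/src")
```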
--- a/COBRAxy/docs/tutorials/README.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tutorials/README.md Sun Oct 26 19:27:41 2025 +0000 @@ -1,31 +1,26 @@ # Tutorials -Learn COBRAxy through hands-on tutorials covering the two main usage scenarios. +Learn COBRAxy through hands-on tutorials for web-based analysis. ## Available Tutorials | Tutorial | Duration | Description | |----------|----------|-------------| -| [Galaxy Setup](galaxy-setup.md) | 30 min | Set up Galaxy for web-based analysis | -| [Python API Usage](python-api.md) | 45 min | Programmatic analysis with Python | +| [Galaxy Setup](tutorials/galaxy-setup) | 30 min | Set up Galaxy for web-based analysis | -## Choose Your Path +## Web Interface Tutorial -### Web Interface → [Galaxy Setup Tutorial](galaxy-setup.md) - -Set up a local Galaxy instance with COBRAxy tools for point-and-click analysis. Perfect for users who prefer graphical interfaces and don't want to write code. +### [Galaxy Setup Tutorial](tutorials/galaxy-setup) -### Python Programming → [Python API Tutorial](python-api.md) - -Learn to call COBRAxy tools programmatically in your analysis pipelines. Ideal for integrating COBRAxy into custom workflows and automation. +Set up a local Galaxy instance with COBRAxy tools for point-and-click analysis. Perfect for users who prefer graphical interfaces and reproducible workflows. ## Prerequisites Before starting the tutorials, make sure you have: -- ✅ [COBRAxy installed](../installation.md) +- ✅ [COBRAxy installed](installation) - ✅ Basic understanding of metabolic modeling (helpful but not required) -- ✅ Familiarity with command line or Python (depending on tutorial) +- ✅ Familiarity with command line basics ## Tutorial Data @@ -43,23 +38,12 @@ - Pre-configured Galaxy workflows - Expected output files for verification -## Learning Path - -We recommend following tutorials in this order: - -1. **[Data Formats](data-formats.md)** - Understand input requirements -2. **[Basic Workflow](workflow.md)** - Learn the analysis pipeline -3. Choose your interface: - - **[Galaxy Setup](galaxy-setup.md)** for web-based analysis - - **[Python API](python-api.md)** for programmatic analysis - ## Getting Help If you encounter issues during tutorials: 1. Check the specific tutorial's troubleshooting section -2. Refer to the main [Troubleshooting Guide](../troubleshooting.md) -3. Ask questions in [GitHub Discussions](https://github.com/CompBtBs/COBRAxy/discussions) +2. Refer to the main [Troubleshooting Guide](/troubleshooting.md) ## Contributing
--- a/COBRAxy/docs/tutorials/galaxy-setup.md Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/docs/tutorials/galaxy-setup.md Sun Oct 26 19:27:41 2025 +0000 @@ -44,17 +44,27 @@ ### COBRAxy-Specific Setup -1. **Copy COBRAxy files** to Galaxy's tools directory: +1. **Link COBRAxy to Galaxy tools** directory: ```bash - mkdir -p tools/cobraxy - cp /path/to/COBRAxy/*.xml tools/cobraxy/ - cp /path/to/COBRAxy/*.py tools/cobraxy/ - cp -r /path/to/COBRAxy/utils tools/cobraxy/ - cp -r /path/to/COBRAxy/local tools/cobraxy/ + cd /path/to/galaxy + ln -s /path/to/COBRAxy/src tools/cobraxy ``` 2. **Add tools to Galaxy configuration**: - Edit `config/tool_conf.xml` and add a COBRAxy section with all tool XML files. + Edit `config/tool_conf.xml` and add a COBRAxy section: + ```xml + <section id="cobraxy" name="COBRAxy"> + <tool file="cobraxy/importMetabolicModel.xml" /> + <tool file="cobraxy/exportMetabolicModel.xml" /> + <tool file="cobraxy/ras_generator.xml" /> + <tool file="cobraxy/rps_generator.xml" /> + <tool file="cobraxy/marea.xml" /> + <tool file="cobraxy/ras_to_bounds.xml" /> + <tool file="cobraxy/flux_simulation.xml" /> + <tool file="cobraxy/flux_to_map.xml" /> + <tool file="cobraxy/marea_cluster.xml" /> + </section> + ``` 3. **Restart Galaxy** to load the new tools. @@ -88,10 +98,6 @@ - Creating, editing, and sharing workflows - Workflow best practices -- **[Workflow Management](https://docs.galaxyproject.org/en/master/user/galaxy_workflow.html)** - - Official workflow documentation - - Advanced workflow features - ### Example COBRAxy Workflow A typical COBRAxy workflow might include: @@ -106,10 +112,6 @@ ### Galaxy Administration Resources -- **[Galaxy Admin Documentation](https://docs.galaxyproject.org/en/master/admin/)** - - Complete administrator guide - - Configuration, security, and maintenance - - **[Galaxy Training Materials](https://training.galaxyproject.org/)** - Hands-on tutorials for administrators and users - Best practices and troubleshooting @@ -118,29 +120,11 @@ - Community support and resources - Tool repositories and shared workflows -### COBRAxy-Specific Resources - -- **Dependencies**: Ensure `cobra`, `pandas`, `numpy`, `scipy` are installed in Galaxy's Python environment -- **Tool Files**: All COBRAxy XML and Python files should be accessible to Galaxy -- **Configuration**: Follow Galaxy's tool installation procedures for proper integration ## Troubleshooting For troubleshooting Galaxy installations and tool integration issues: -### Official Troubleshooting Resources - -- **[Galaxy FAQ](https://docs.galaxyproject.org/en/master/admin/faq.html)** - - Common installation and configuration issues - - Performance optimization tips - -- **[Galaxy Help Forum](https://help.galaxyproject.org/)** - - Community-driven support - - Search existing solutions or ask new questions - -- **[Galaxy GitHub Issues](https://github.com/galaxyproject/galaxy/issues)** - - Report bugs and technical issues - - Feature requests and discussions ### COBRAxy-Specific Issues @@ -150,7 +134,7 @@ - **Execution failures**: Verify Python dependencies and file permissions - **Parameter errors**: Ensure input data formats match tool requirements -Refer to the [COBRAxy Tools Documentation](../tools/) for detailed parameter information and data format requirements. +Refer to the [COBRAxy Tools Documentation](/tools/) for detailed parameter information and data format requirements. ## Summary
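Before moving on, it is worth confirming that the symlink created above actually exposes the tool wrappers referenced in `tool_conf.xml`. A minimal sketch, with the Galaxy path as a placeholder to adjust for your instance:

```python
from pathlib import Path

# Symlink created in the setup step (placeholder path)
galaxy_cobraxy = Path("/path/to/galaxy/tools/cobraxy")

# Wrapper files listed in the tool_conf.xml section above
wrappers = [
    "importMetabolicModel.xml", "exportMetabolicModel.xml",
    "ras_generator.xml", "rps_generator.xml", "marea.xml",
    "ras_to_bounds.xml", "flux_simulation.xml",
    "flux_to_map.xml", "marea_cluster.xml",
]

missing = [w for w in wrappers if not (galaxy_cobraxy / w).is_file()]
print("All COBRAxy wrappers visible to Galaxy" if not missing else f"Missing wrappers: {missing}")
```

If any wrapper is missing, re-check the `ln -s` target and restart Galaxy so the new tool section is picked up.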
--- a/COBRAxy/docs/tutorials/python-api.md Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,531 +0,0 @@ -# Python API Tutorial - -Learn how to use COBRAxy tools programmatically in Python scripts and analysis pipelines. - -## Overview - -This tutorial teaches you to integrate COBRAxy into Python workflows by calling tool main functions directly with parsed arguments. - -**Time required**: ~45 minutes -**Difficulty**: Intermediate -**Prerequisites**: Basic Python knowledge, COBRAxy installation - -## Understanding COBRAxy Architecture - -### Tool Structure - -Each COBRAxy tool is a Python module with: -- `main(args)` function that accepts argument list -- Command-line argument parsing -- Self-contained execution logic - -```python -# General pattern for all tools -import tool_module -tool_module.main(['-arg1', 'value1', '-arg2', 'value2']) -``` - -### Available Tools - -| Python Module | Purpose | Key Arguments | -|---------------|---------|---------------| -| `ras_generator` | Compute reaction activity scores | `-in`, `-ra`, `-rs` | -| `rps_generator` | Compute reaction propensity scores | `-id`, `-rp` | -| `marea` | Statistical pathway analysis | `-input_data`, `-choice_map` | -| `ras_to_bounds` | Apply RAS constraints to model | `-ir`, `-ms`, `-idop` | -| `flux_simulation` | Sample metabolic fluxes | `-ms`, `-in`, `-a`, `-ns` | -| `flux_to_map` | Add flux data to maps | `-if`, `-mp`, `-idop` | - -## Setup Your Environment - -### Import Required Modules - -```python -import sys -import os -from pathlib import Path - -# Add COBRAxy to Python path -cobraxy_path = "/path/to/COBRAxy" -sys.path.insert(0, cobraxy_path) - -# Import COBRAxy tools -import ras_generator -import rps_generator -import marea -import ras_to_bounds -import flux_simulation -import flux_to_map -import metabolicModel2Tabular as model_setting -``` - -### Set Working Directory - -```python -# Set up working directory -work_dir = Path("/path/to/analysis") -work_dir.mkdir(exist_ok=True) -os.chdir(work_dir) - -# COBRAxy tools expect this parameter -tool_dir = str(Path(cobraxy_path).absolute()) -``` - -## Basic Workflow Example - -### Step 1: Prepare Sample Data - -```python -import pandas as pd -import numpy as np - -# Create sample gene expression data -genes = ['HGNC:5', 'HGNC:10', 'HGNC:15', 'HGNC:25', 'HGNC:30'] -samples = ['Control_1', 'Control_2', 'Treatment_1', 'Treatment_2'] - -# Generate random expression values -np.random.seed(42) -data = np.random.lognormal(mean=2, sigma=1, size=(len(genes), len(samples))) - -# Create DataFrame -expression_df = pd.DataFrame(data, index=genes, columns=samples) -expression_df.index.name = 'Gene_ID' - -# Save to file -expression_file = work_dir / "expression_data.tsv" -expression_df.to_csv(expression_file, sep='\t') -print(f"Created sample data: {expression_file}") -``` - -### Step 2: Extract Model Information - -```python -# Extract model components (optional, for understanding model structure) -model_args = [ - '-td', tool_dir, - '-ms', 'ENGRO2', # Use built-in ENGRO2 model - '-idop', str(work_dir / 'model_info') -] - -try: - model_setting.main(model_args) - print("✓ Model information extracted") -except Exception as e: - print(f"Model extraction failed: {e}") -``` - -### Step 3: Generate RAS Scores - -```python -# Generate Reaction Activity Scores -ras_output = work_dir / "ras_scores.tsv" - -ras_args = [ - '-td', tool_dir, - '-in', str(expression_file), - '-ra', str(ras_output), - '-rs', 'ENGRO2', # Built-in model - '-n', 'true' # Handle missing genes 
-] - -try: - ras_generator.main(ras_args) - print(f"✓ RAS scores generated: {ras_output}") -except Exception as e: - print(f"RAS generation failed: {e}") - raise -``` - -### Step 4: Generate RPS Scores (Optional) - -```python -# Create sample metabolite data -metabolites = ['glucose', 'pyruvate', 'lactate', 'ATP', 'NADH'] -met_data = np.random.lognormal(mean=3, sigma=0.5, size=(len(metabolites), len(samples))) - -met_df = pd.DataFrame(met_data, index=metabolites, columns=samples) -met_df.index.name = 'Metabolite_ID' - -metabolite_file = work_dir / "metabolite_data.tsv" -met_df.to_csv(metabolite_file, sep='\t') - -# Generate Reaction Propensity Scores -rps_output = work_dir / "rps_scores.tsv" - -rps_args = [ - '-td', tool_dir, - '-id', str(metabolite_file), - '-rp', str(rps_output) -] - -try: - rps_generator.main(rps_args) - print(f"✓ RPS scores generated: {rps_output}") -except Exception as e: - print(f"RPS generation warning: {e}") - # RPS generation might fail with sample data - that's OK -``` - -### Step 5: Statistical Analysis with MAREA - -```python -# Create enriched pathway maps -maps_output = work_dir / "pathway_maps" - -marea_args = [ - '-td', tool_dir, - '-using_RAS', 'true', - '-input_data', str(ras_output), - '-choice_map', 'ENGRO2', - '-gs', 'true', # Gene set analysis - '-idop', str(maps_output) -] - -try: - marea.main(marea_args) - print(f"✓ Pathway maps created: {maps_output}") -except Exception as e: - print(f"MAREA analysis failed: {e}") -``` - -### Step 6: Flux Simulation Pipeline - -```python -# Apply RAS constraints to model -bounds_output = work_dir / "model_bounds" - -bounds_args = [ - '-td', tool_dir, - '-ms', 'ENGRO2', - '-ir', str(ras_output), - '-rs', 'true', # Use RAS values - '-idop', str(bounds_output) -] - -try: - ras_to_bounds.main(bounds_args) - print(f"✓ Model constraints applied: {bounds_output}") -except Exception as e: - print(f"Bounds generation failed: {e}") - raise - -# Sample metabolic fluxes -flux_output = work_dir / "flux_samples" - -flux_args = [ - '-td', tool_dir, - '-ms', 'ENGRO2', - '-in', str(bounds_output / "*.tsv"), # Will be expanded by tool - '-a', 'CBS', # Sampling algorithm - '-ns', '1000', # Number of samples - '-idop', str(flux_output) -] - -try: - flux_simulation.main(flux_args) - print(f"✓ Flux samples generated: {flux_output}") -except Exception as e: - print(f"Flux simulation failed: {e}") -``` - -### Step 7: Create Final Visualizations - -```python -# Add flux data to enriched maps -final_maps = work_dir / "final_visualizations" - -# Check if we have both maps and flux data -maps_dir = maps_output -flux_dir = flux_output - -if maps_dir.exists() and flux_dir.exists(): - flux_to_map_args = [ - '-td', tool_dir, - '-if', str(flux_dir / "*.tsv"), - '-mp', str(maps_dir / "*.svg"), - '-idop', str(final_maps) - ] - - try: - flux_to_map.main(flux_to_map_args) - print(f"✓ Final visualizations created: {final_maps}") - except Exception as e: - print(f"Final mapping failed: {e}") -else: - print("Skipping final visualization - missing input files") -``` - -## Advanced Usage Patterns - -### Error Handling and Validation - -```python -def run_cobraxy_tool(tool_module, args, description): - """Helper function to run COBRAxy tools with error handling.""" - try: - print(f"Running {description}...") - tool_module.main(args) - print(f"✓ {description} completed successfully") - return True - except Exception as e: - print(f"✗ {description} failed: {e}") - return False - -# Usage -success = run_cobraxy_tool( - ras_generator, - ras_args, - "RAS 
generation" -) - -if not success: - print("Pipeline stopped due to error") - exit(1) -``` - -### Batch Processing Multiple Datasets - -```python -def process_dataset(dataset_path, output_dir): - """Process a single dataset through COBRAxy pipeline.""" - - dataset_name = dataset_path.stem - out_dir = Path(output_dir) / dataset_name - out_dir.mkdir(exist_ok=True) - - # Generate RAS - ras_file = out_dir / "ras_scores.tsv" - ras_args = [ - '-td', tool_dir, - '-in', str(dataset_path), - '-ra', str(ras_file), - '-rs', 'ENGRO2' - ] - - if run_cobraxy_tool(ras_generator, ras_args, f"RAS for {dataset_name}"): - # Continue with MAREA analysis - maps_dir = out_dir / "maps" - marea_args = [ - '-td', tool_dir, - '-using_RAS', 'true', - '-input_data', str(ras_file), - '-choice_map', 'ENGRO2', - '-idop', str(maps_dir) - ] - run_cobraxy_tool(marea, marea_args, f"MAREA for {dataset_name}") - - return out_dir - -# Process multiple datasets -datasets = [ - "/path/to/dataset1.tsv", - "/path/to/dataset2.tsv", - "/path/to/dataset3.tsv" -] - -results = [] -for dataset in datasets: - result_dir = process_dataset(Path(dataset), work_dir / "batch_results") - results.append(result_dir) - -print(f"Processed {len(results)} datasets") -``` - -### Custom Analysis Pipelines - -```python -class COBRAxyPipeline: - """Custom COBRAxy analysis pipeline.""" - - def __init__(self, tool_dir, work_dir): - self.tool_dir = tool_dir - self.work_dir = Path(work_dir) - self.work_dir.mkdir(exist_ok=True) - - def run_enrichment_analysis(self, expression_file, model='ENGRO2'): - """Run enrichment-focused analysis.""" - - # Generate RAS - ras_file = self.work_dir / "ras_scores.tsv" - ras_args = ['-td', self.tool_dir, '-in', str(expression_file), - '-ra', str(ras_file), '-rs', model] - - if not run_cobraxy_tool(ras_generator, ras_args, "RAS generation"): - return None - - # Run MAREA - maps_dir = self.work_dir / "enrichment_maps" - marea_args = ['-td', self.tool_dir, '-using_RAS', 'true', - '-input_data', str(ras_file), '-choice_map', model, - '-gs', 'true', '-idop', str(maps_dir)] - - if run_cobraxy_tool(marea, marea_args, "MAREA analysis"): - return maps_dir - return None - - def run_flux_analysis(self, expression_file, model='ENGRO2', n_samples=1000): - """Run flux sampling analysis.""" - - # Generate RAS and apply bounds - ras_file = self.work_dir / "ras_scores.tsv" - bounds_dir = self.work_dir / "bounds" - flux_dir = self.work_dir / "flux_samples" - - # RAS generation - ras_args = ['-td', self.tool_dir, '-in', str(expression_file), - '-ra', str(ras_file), '-rs', model] - if not run_cobraxy_tool(ras_generator, ras_args, "RAS generation"): - return None - - # Apply bounds - bounds_args = ['-td', self.tool_dir, '-ms', model, '-ir', str(ras_file), - '-rs', 'true', '-idop', str(bounds_dir)] - if not run_cobraxy_tool(ras_to_bounds, bounds_args, "Bounds application"): - return None - - # Flux sampling - flux_args = ['-td', self.tool_dir, '-ms', model, - '-in', str(bounds_dir / "*.tsv"), - '-a', 'CBS', '-ns', str(n_samples), '-idop', str(flux_dir)] - - if run_cobraxy_tool(flux_simulation, flux_args, "Flux simulation"): - return flux_dir - return None - -# Usage -pipeline = COBRAxyPipeline(tool_dir, work_dir / "custom_analysis") - -# Run enrichment analysis -enrichment_results = pipeline.run_enrichment_analysis(expression_file) -if enrichment_results: - print(f"Enrichment analysis completed: {enrichment_results}") - -# Run flux analysis -flux_results = pipeline.run_flux_analysis(expression_file, n_samples=500) -if flux_results: - print(f"Flux 
analysis completed: {flux_results}") -``` - -## Integration with Data Analysis Libraries - -### Pandas Integration - -```python -# Read COBRAxy results back into pandas -ras_df = pd.read_csv(ras_output, sep='\t', index_col=0) -print(f"RAS data shape: {ras_df.shape}") -print(f"Sample statistics:\n{ras_df.describe()}") - -# Filter highly variable reactions -ras_std = ras_df.std(axis=1) -variable_reactions = ras_std.nlargest(20).index -print(f"Most variable reactions: {list(variable_reactions)}") -``` - -### Matplotlib Visualization - -```python -import matplotlib.pyplot as plt -import seaborn as sns - -# Visualize RAS distributions -plt.figure(figsize=(12, 8)) -sns.heatmap(ras_df.iloc[:50], cmap='RdBu_r', center=0, cbar_kws={'label': 'RAS Score'}) -plt.title('Reaction Activity Scores (Top 50 Reactions)') -plt.xlabel('Samples') -plt.ylabel('Reactions') -plt.tight_layout() -plt.savefig(work_dir / 'ras_heatmap.png', dpi=300) -plt.show() -``` - -## Best Practices - -### 1. Environment Management -```python -# Use pathlib for cross-platform compatibility -from pathlib import Path - -# Use absolute paths -tool_dir = str(Path(cobraxy_path).absolute()) -work_dir = Path("/analysis").absolute() -``` - -### 2. Error Handling -```python -# Always wrap tool calls in try-except -try: - ras_generator.main(ras_args) -except Exception as e: - print(f"RAS generation failed: {e}") - # Log details, cleanup, or alternative action -``` - -### 3. Argument Validation -```python -def validate_file_exists(filepath): - """Validate input file exists.""" - path = Path(filepath) - if not path.exists(): - raise FileNotFoundError(f"Input file not found: {filepath}") - return str(path.absolute()) - -# Use before calling tools -expression_file = validate_file_exists(expression_file) -``` - - - -## Troubleshooting - -### Common Issues - -**Import errors** -```python -# Check if COBRAxy path is correct -import sys -print("Python path includes:") -for p in sys.path: - print(f" {p}") - -# Add COBRAxy path -sys.path.insert(0, "/correct/path/to/COBRAxy") -``` - -**Tool execution failures** -```python -# Enable verbose output -import logging -logging.basicConfig(level=logging.DEBUG) - -# Check working directory -print(f"Current directory: {os.getcwd()}") -print(f"Directory contents: {list(Path('.').iterdir())}") -``` - -**File path issues** -```python -# Use absolute paths -ras_args = [ - '-td', str(Path(tool_dir).absolute()), - '-in', str(Path(expression_file).absolute()), - '-ra', str(Path(ras_output).absolute()), - '-rs', 'ENGRO2' -] -``` - -## Next Steps - -Now that you can use COBRAxy programmatically: - -1. **[Tools Reference](../tools/)** - Detailed parameter documentation -2. **[Examples](../examples/)** - Real-world analysis scripts -3. **Build custom analysis pipelines** for your research needs -4. **Integrate with workflow managers** like Snakemake or Nextflow - -## Resources - -- [COBRApy Documentation](https://cobrapy.readthedocs.io/) - Underlying metabolic modeling library -- [Pandas Documentation](https://pandas.pydata.org/) - Data manipulation -- [Matplotlib Gallery](https://matplotlib.org/gallery/) - Visualization examples -- [Python Pathlib](https://docs.python.org/3/library/pathlib.html) - Modern path handling \ No newline at end of file
--- a/COBRAxy/src/exportMetabolicModel.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/exportMetabolicModel.py Sun Oct 26 19:27:41 2025 +0000 @@ -1,117 +1,123 @@ -""" -Convert a tabular (CSV/TSV/Tabular) description of a COBRA model into a COBRA file. - -Supported output formats: SBML, JSON, MATLAB (.mat), YAML. -The script logs to a user-provided file for easier debugging in Galaxy. -""" - -import os -import cobra -import argparse -from typing import List -import logging -import utils.model_utils as modelUtils - -ARGS : argparse.Namespace -def process_args(args: List[str] = None) -> argparse.Namespace: - """ - Parse command-line arguments for the CSV-to-COBRA conversion tool. - - Returns: - argparse.Namespace: Parsed arguments. - """ - parser = argparse.ArgumentParser( - usage="%(prog)s [options]", - description="Convert a tabular/CSV file to a COBRA model" - ) - - - parser.add_argument("--out_log", type=str, required=True, - help="Output log file") - - - parser.add_argument("--input", type=str, required=True, - help="Input tabular file (CSV/TSV)") - - - parser.add_argument("--format", type=str, required=True, choices=["sbml", "json", "mat", "yaml"], - help="Model format (SBML, JSON, MATLAB, YAML)") - - - parser.add_argument("--output", type=str, required=True, - help="Output model file path") - - parser.add_argument("--tool_dir", type=str, default=os.path.dirname(__file__), - help="Tool directory (passed from Galaxy as $__tool_directory__)") - - - return parser.parse_args(args) - - -###############################- ENTRY POINT -################################ - -def main(args: List[str] = None) -> None: - """ - Entry point: parse arguments, build the COBRA model from a CSV/TSV file, - and save it in the requested format. - - Returns: - None - """ - global ARGS - ARGS = process_args(args) - - # configure logging to the requested log file (overwrite each run) - logging.basicConfig(filename=ARGS.out_log, - level=logging.DEBUG, - format='%(asctime)s %(levelname)s: %(message)s', - filemode='w') - - logging.info('Starting fromCSVtoCOBRA tool') - logging.debug('Args: input=%s format=%s output=%s tool_dir=%s', ARGS.input, ARGS.format, ARGS.output, ARGS.tool_dir) - - try: - # Basic sanity checks - if not os.path.exists(ARGS.input): - logging.error('Input file not found: %s', ARGS.input) - - out_dir = os.path.dirname(os.path.abspath(ARGS.output)) - - if out_dir and not os.path.isdir(out_dir): - try: - os.makedirs(out_dir, exist_ok=True) - logging.info('Created missing output directory: %s', out_dir) - except Exception as e: - logging.exception('Cannot create output directory: %s', out_dir) - - model = modelUtils.build_cobra_model_from_csv(ARGS.input) - - - logging.info('Created model with name: %s (ID: %s)', model.name, model.id) - - # Save model in requested format - Galaxy handles the filename - if ARGS.format == "sbml": - cobra.io.write_sbml_model(model, ARGS.output) - elif ARGS.format == "json": - cobra.io.save_json_model(model, ARGS.output) - elif ARGS.format == "mat": - cobra.io.save_matlab_model(model, ARGS.output) - elif ARGS.format == "yaml": - cobra.io.save_yaml_model(model, ARGS.output) - else: - logging.error('Unknown format requested: %s', ARGS.format) - raise ValueError(f"Unknown format: {ARGS.format}") - - - logging.info('Model successfully written to %s (format=%s)', ARGS.output, ARGS.format) - print(f"Model created successfully in {ARGS.format.upper()} format") - - except Exception as e: - # Log full traceback to the out_log so Galaxy users/admins can see what happened - 
logging.exception('Unhandled exception in fromCSVtoCOBRA') - print(f"ERROR: {str(e)}") - raise - - -if __name__ == '__main__': - main() +""" +Convert a tabular (CSV/TSV/Tabular) description of a COBRA model into a COBRA file. + +Supported output formats: SBML, JSON, MATLAB (.mat), YAML. +The script logs to a user-provided file for easier debugging in Galaxy. +""" + +import os +import cobra +import argparse +from typing import List +import logging + +try: + from .utils import model_utils as modelUtils + from .utils import general_utils as utils +except: + import utils.model_utils as modelUtils + import utils.general_utils as utils + +ARGS : argparse.Namespace +def process_args(args: List[str] = None) -> argparse.Namespace: + """ + Parse command-line arguments for the CSV-to-COBRA conversion tool. + + Returns: + argparse.Namespace: Parsed arguments. + """ + parser = argparse.ArgumentParser( + usage="%(prog)s [options]", + description="Convert a tabular/CSV file to a COBRA model" + ) + + + parser.add_argument("--out_log", type=str, required=True, + help="Output log file") + + + parser.add_argument("--input", type=str, required=True, + help="Input tabular file (CSV/TSV)") + + + parser.add_argument("--format", type=str, required=True, choices=["sbml", "json", "mat", "yaml"], + help="Model format (SBML, JSON, MATLAB, YAML)") + + + parser.add_argument("--output", type=str, required=True, + help="Output model file path") + + parser.add_argument("--tool_dir", type=str, default=os.path.dirname(os.path.abspath(__file__)), + help="Tool directory (default: auto-detected package location)") + + + return parser.parse_args(args) + + +###############################- ENTRY POINT -################################ + +def main(args: List[str] = None) -> None: + """ + Entry point: parse arguments, build the COBRA model from a CSV/TSV file, + and save it in the requested format. 
+ + Returns: + None + """ + global ARGS + ARGS = process_args(args) + + # configure logging to the requested log file (overwrite each run) + logging.basicConfig(filename=ARGS.out_log, + level=logging.DEBUG, + format='%(asctime)s %(levelname)s: %(message)s', + filemode='w') + + logging.info('Starting fromCSVtoCOBRA tool') + logging.debug('Args: input=%s format=%s output=%s tool_dir=%s', ARGS.input, ARGS.format, ARGS.output, ARGS.tool_dir) + + try: + # Basic sanity checks + if not os.path.exists(ARGS.input): + logging.error('Input file not found: %s', ARGS.input) + + out_dir = os.path.dirname(os.path.abspath(ARGS.output)) + + if out_dir and not os.path.isdir(out_dir): + try: + os.makedirs(out_dir, exist_ok=True) + logging.info('Created missing output directory: %s', out_dir) + except Exception as e: + logging.exception('Cannot create output directory: %s', out_dir) + + model = modelUtils.build_cobra_model_from_csv(ARGS.input) + + + logging.info('Created model with name: %s (ID: %s)', model.name, model.id) + + # Save model in requested format - Galaxy handles the filename + if ARGS.format == "sbml": + cobra.io.write_sbml_model(model, ARGS.output) + elif ARGS.format == "json": + cobra.io.save_json_model(model, ARGS.output) + elif ARGS.format == "mat": + cobra.io.save_matlab_model(model, ARGS.output) + elif ARGS.format == "yaml": + cobra.io.save_yaml_model(model, ARGS.output) + else: + logging.error('Unknown format requested: %s', ARGS.format) + raise ValueError(f"Unknown format: {ARGS.format}") + + + logging.info('Model successfully written to %s (format=%s)', ARGS.output, ARGS.format) + print(f"Model created successfully in {ARGS.format.upper()} format") + + except Exception as e: + # Log full traceback to the out_log so Galaxy users/admins can see what happened + logging.exception('Unhandled exception in fromCSVtoCOBRA') + print(f"ERROR: {str(e)}") + raise + + +if __name__ == '__main__': + main()
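For orientation, the converter defined above can also be driven from Python by passing an argument list to `main()`. This is a hedged sketch, not part of the changeset: file names are placeholders and the flat import assumes `COBRAxy/src` is on `sys.path` (the package-relative import in the `try` branch is used when installed as a package).

```python
import exportMetabolicModel  # assumes COBRAxy/src is on sys.path

# Convert a tabular model description to SBML; paths are placeholders.
exportMetabolicModel.main([
    "--input", "model_tabular.tsv",   # tabular model (e.g. produced by importMetabolicModel)
    "--format", "sbml",               # one of: sbml, json, mat, yaml
    "--output", "model.xml",
    "--out_log", "export_log.txt",
])
```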
--- a/COBRAxy/src/exportMetabolicModel.xml	Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/src/exportMetabolicModel.xml	Sun Oct 26 19:27:41 2025 +0000
@@ -38,7 +38,7 @@
     <!-- Tool outputs -->
     <outputs>
-        <data name="log" format="txt" label="Tabular to Model Conversion - Log" />
+        <data name="log" format="txt" label="Export Metabolic Model - Log" />
         <data name="output" format="xml" label="${model_name}.${format}">
             <change_format>
                 <when input="format" value="sbml" format="xml"/>
--- a/COBRAxy/src/flux_simulation.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/flux_simulation.py Sun Oct 26 19:27:41 2025 +0000 @@ -1,636 +1,642 @@ -""" -Flux sampling and analysis utilities for COBRA models. - -This script supports two modes: -- Mode 1 (model_and_bounds=True): load a base model and apply bounds from - separate files before sampling. -- Mode 2 (model_and_bounds=False): load complete models and sample directly. - -Sampling algorithms supported: OPTGP and CBS. Outputs include flux samples -and optional analyses (pFBA, FVA, sensitivity), saved as tabular files. -""" - -import argparse -import utils.general_utils as utils -from typing import List -import os -import pandas as pd -import numpy as np -import cobra -import utils.CBS_backend as CBS_backend -from joblib import Parallel, delayed, cpu_count -from cobra.sampling import OptGPSampler -import sys -import utils.model_utils as model_utils - - -################################# process args ############################### -def process_args(args: List[str] = None) -> argparse.Namespace: - """ - Processes command-line arguments. - - Args: - args (list): List of command-line arguments. - - Returns: - Namespace: An object containing parsed arguments. - """ - parser = argparse.ArgumentParser(usage='%(prog)s [options]', - description='process some value\'s') - - parser.add_argument("-mo", "--model_upload", type=str, - help="path to input file with custom rules, if provided") - - parser.add_argument("-mab", "--model_and_bounds", type=str, - choices=['True', 'False'], - required=True, - help="upload mode: True for model+bounds, False for complete models") - - parser.add_argument("-ens", "--sampling_enabled", type=str, - choices=['true', 'false'], - required=True, - help="enable sampling: 'true' for sampling, 'false' for no sampling") - - parser.add_argument('-ol', '--out_log', - help="Output log") - - parser.add_argument('-td', '--tool_dir', - type=str, - required=True, - help='your tool directory') - - parser.add_argument('-in', '--input', - required=True, - type=str, - help='input bounds files or complete model files') - - parser.add_argument('-ni', '--name', - required=True, - type=str, - help='cell names') - - parser.add_argument('-a', '--algorithm', - type=str, - choices=['OPTGP', 'CBS'], - required=True, - help='choose sampling algorithm') - - parser.add_argument('-th', '--thinning', - type=int, - default=100, - required=True, - help='choose thinning') - - parser.add_argument('-ns', '--n_samples', - type=int, - required=True, - help='choose how many samples (set to 0 for optimization only)') - - parser.add_argument('-sd', '--seed', - type=int, - required=True, - help='seed for random number generation') - - parser.add_argument('-nb', '--n_batches', - type=int, - required=True, - help='choose how many batches') - - parser.add_argument('-opt', '--perc_opt', - type=float, - default=0.9, - required=False, - help='choose the fraction of optimality for FVA (0-1)') - - parser.add_argument('-ot', '--output_type', - type=str, - required=True, - help='output type for sampling results') - - parser.add_argument('-ota', '--output_type_analysis', - type=str, - required=False, - help='output type analysis (optimization methods)') - - parser.add_argument('-idop', '--output_path', - type=str, - default='flux_simulation/', - help = 'output path for fluxes') - - parser.add_argument('-otm', '--out_mean', - type = str, - required=False, - help = 'output of mean of fluxes') - - parser.add_argument('-otmd', '--out_median', - type = str, - 
required=False, - help = 'output of median of fluxes') - - parser.add_argument('-otq', '--out_quantiles', - type = str, - required=False, - help = 'output of quantiles of fluxes') - - parser.add_argument('-otfva', '--out_fva', - type = str, - required=False, - help = 'output of FVA results') - parser.add_argument('-otp', '--out_pfba', - type = str, - required=False, - help = 'output of pFBA results') - parser.add_argument('-ots', '--out_sensitivity', - type = str, - required=False, - help = 'output of sensitivity results') - ARGS = parser.parse_args(args) - return ARGS -########################### warning ########################################### -def warning(s :str) -> None: - """ - Log a warning message to an output log file and print it to the console. - - Args: - s (str): The warning message to be logged and printed. - - Returns: - None - """ - with open(ARGS.out_log, 'a') as log: - log.write(s + "\n\n") - print(s) - - -def write_to_file(dataset: pd.DataFrame, path: str, keep_index:bool=False, name:str=None)->None: - """ - Write a DataFrame to a TSV file under path with a given base name. - - Args: - dataset: The DataFrame to write. - name: Base file name (without extension). If None, 'path' is treated as the full file path. - path: Directory path where the file will be saved. - keep_index: Whether to keep the DataFrame index in the file. - - Returns: - None - """ - dataset.index.name = 'Reactions' - if name: - dataset.to_csv(os.path.join(path, name + ".csv"), sep = '\t', index = keep_index) - else: - dataset.to_csv(path, sep = '\t', index = keep_index) - -############################ dataset input #################################### -def read_dataset(data :str, name :str) -> pd.DataFrame: - """ - Read a dataset from a CSV file and return it as a pandas DataFrame. - - Args: - data (str): Path to the CSV file containing the dataset. - name (str): Name of the dataset, used in error messages. - - Returns: - pandas.DataFrame: DataFrame containing the dataset. - - Raises: - pd.errors.EmptyDataError: If the CSV file is empty. - sys.exit: If the CSV file has the wrong format, the execution is aborted. - """ - try: - dataset = pd.read_csv(data, sep = '\t', header = 0, index_col=0, engine='python') - except pd.errors.EmptyDataError: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - if len(dataset.columns) < 2: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - return dataset - - - -def OPTGP_sampler(model: cobra.Model, model_name: str, n_samples: int = 1000, thinning: int = 100, n_batches: int = 1, seed: int = 0) -> None: - """ - Samples from the OPTGP (Optimal Global Perturbation) algorithm and saves the results to CSV files. - - Args: - model (cobra.Model): The COBRA model to sample from. - model_name (str): The name of the model, used in naming output files. - n_samples (int, optional): Number of samples per batch. Default is 1000. - thinning (int, optional): Thinning parameter for the sampler. Default is 100. - n_batches (int, optional): Number of batches to run. Default is 1. - seed (int, optional): Random seed for reproducibility. Default is 0. 
- - Returns: - None - """ - import numpy as np - - # Get reaction IDs for consistent column ordering - reaction_ids = [rxn.id for rxn in model.reactions] - - # Sample and save each batch as numpy file - for i in range(n_batches): - optgp = OptGPSampler(model, thinning, seed) - samples = optgp.sample(n_samples) - - # Save as numpy array (more memory efficient) - batch_filename = f"{ARGS.output_path}/{model_name}_{i}_OPTGP.npy" - np.save(batch_filename, samples.to_numpy()) - - seed += 1 - - # Merge all batches into a single DataFrame - all_samples = [] - - for i in range(n_batches): - batch_filename = f"{ARGS.output_path}/{model_name}_{i}_OPTGP.npy" - batch_data = np.load(batch_filename, allow_pickle=True) - all_samples.append(batch_data) - - # Concatenate all batches - samplesTotal_array = np.vstack(all_samples) - - # Convert back to DataFrame with proper column names - samplesTotal = pd.DataFrame(samplesTotal_array, columns=reaction_ids) - - # Save the final merged result as CSV - write_to_file(samplesTotal.T, ARGS.output_path, True, name=model_name) - - # Clean up temporary numpy files - for i in range(n_batches): - batch_filename = f"{ARGS.output_path}/{model_name}_{i}_OPTGP.npy" - if os.path.exists(batch_filename): - os.remove(batch_filename) - - -def CBS_sampler(model: cobra.Model, model_name: str, n_samples: int = 1000, n_batches: int = 1, seed: int = 0) -> None: - """ - Samples using the CBS (Constraint-based Sampling) algorithm and saves the results to CSV files. - - Args: - model (cobra.Model): The COBRA model to sample from. - model_name (str): The name of the model, used in naming output files. - n_samples (int, optional): Number of samples per batch. Default is 1000. - n_batches (int, optional): Number of batches to run. Default is 1. - seed (int, optional): Random seed for reproducibility. Default is 0. - - Returns: - None - """ - import numpy as np - - # Get reaction IDs for consistent column ordering - reaction_ids = [reaction.id for reaction in model.reactions] - - # Perform FVA analysis once for all batches - df_FVA = cobra.flux_analysis.flux_variability_analysis(model, fraction_of_optimum=0).round(6) - - # Generate random objective functions for all samples across all batches - df_coefficients = CBS_backend.randomObjectiveFunction(model, n_samples * n_batches, df_FVA, seed=seed) - - # Sample and save each batch as numpy file - for i in range(n_batches): - samples = pd.DataFrame(columns=reaction_ids, index=range(n_samples)) - - try: - CBS_backend.randomObjectiveFunctionSampling( - model, - n_samples, - df_coefficients.iloc[:, i * n_samples:(i + 1) * n_samples], - samples - ) - except Exception as e: - utils.logWarning( - f"Warning: GLPK solver has failed for {model_name}. Trying with COBRA interface. 
Error: {str(e)}", - ARGS.out_log - ) - CBS_backend.randomObjectiveFunctionSampling_cobrapy( - model, - n_samples, - df_coefficients.iloc[:, i * n_samples:(i + 1) * n_samples], - samples - ) - - # Save as numpy array (more memory efficient) - batch_filename = f"{ARGS.output_path}/{model_name}_{i}_CBS.npy" - utils.logWarning(batch_filename, ARGS.out_log) - np.save(batch_filename, samples.to_numpy()) - - # Merge all batches into a single DataFrame - all_samples = [] - - for i in range(n_batches): - batch_filename = f"{ARGS.output_path}/{model_name}_{i}_CBS.npy" - batch_data = np.load(batch_filename, allow_pickle=True) - all_samples.append(batch_data) - - # Concatenate all batches - samplesTotal_array = np.vstack(all_samples) - - # Convert back to DataFrame with proper column namesq - samplesTotal = pd.DataFrame(samplesTotal_array, columns=reaction_ids) - - # Save the final merged result as CSV - write_to_file(samplesTotal.T, ARGS.output_path, True, name=model_name) - - # Clean up temporary numpy files - for i in range(n_batches): - batch_filename = f"{ARGS.output_path}/{model_name}_{i}_CBS.npy" - if os.path.exists(batch_filename): - os.remove(batch_filename) - - - -def model_sampler_with_bounds(model_input_original: cobra.Model, bounds_path: str, cell_name: str) -> List[pd.DataFrame]: - """ - MODE 1: Prepares the model with bounds from separate bounds file and performs sampling. - - Args: - model_input_original (cobra.Model): The original COBRA model. - bounds_path (str): Path to the CSV file containing the bounds dataset. - cell_name (str): Name of the cell, used to generate filenames for output. - - Returns: - List[pd.DataFrame]: A list of DataFrames containing statistics and analysis results. - """ - - model_input = model_input_original.copy() - bounds_df = read_dataset(bounds_path, "bounds dataset") - - # Apply bounds to model - for rxn_index, row in bounds_df.iterrows(): - try: - model_input.reactions.get_by_id(rxn_index).lower_bound = row.lower_bound - model_input.reactions.get_by_id(rxn_index).upper_bound = row.upper_bound - except KeyError: - warning(f"Warning: Reaction {rxn_index} not found in model. Skipping.") - - return perform_sampling_and_analysis(model_input, cell_name) - - -def perform_sampling_and_analysis(model_input: cobra.Model, cell_name: str) -> List[pd.DataFrame]: - """ - Common function to perform sampling and analysis on a prepared model. - - Args: - model_input (cobra.Model): The prepared COBRA model with bounds applied. - cell_name (str): Name of the cell, used to generate filenames for output. - - Returns: - List[pd.DataFrame]: A list of DataFrames containing statistics and analysis results. 
- """ - - returnList = [] - - if ARGS.sampling_enabled == "true": - - if ARGS.algorithm == 'OPTGP': - OPTGP_sampler(model_input, cell_name, ARGS.n_samples, ARGS.thinning, ARGS.n_batches, ARGS.seed) - elif ARGS.algorithm == 'CBS': - CBS_sampler(model_input, cell_name, ARGS.n_samples, ARGS.n_batches, ARGS.seed) - - df_mean, df_median, df_quantiles = fluxes_statistics(cell_name, ARGS.output_types) - - if("fluxes" not in ARGS.output_types): - os.remove(ARGS.output_path + "/" + cell_name + '.csv') - - returnList = [df_mean, df_median, df_quantiles] - - df_pFBA, df_FVA, df_sensitivity = fluxes_analysis(model_input, cell_name, ARGS.output_type_analysis) - - if("pFBA" in ARGS.output_type_analysis): - returnList.append(df_pFBA) - if("FVA" in ARGS.output_type_analysis): - returnList.append(df_FVA) - if("sensitivity" in ARGS.output_type_analysis): - returnList.append(df_sensitivity) - - return returnList - -def fluxes_statistics(model_name: str, output_types:List)-> List[pd.DataFrame]: - """ - Computes statistics (mean, median, quantiles) for the fluxes. - - Args: - model_name (str): Name of the model, used in filename for input. - output_types (List[str]): Types of statistics to compute (mean, median, quantiles). - - Returns: - List[pd.DataFrame]: List of DataFrames containing mean, median, and quantiles statistics. - """ - - df_mean = pd.DataFrame() - df_median= pd.DataFrame() - df_quantiles= pd.DataFrame() - - df_samples = pd.read_csv(ARGS.output_path + "/" + model_name + '.csv', sep = '\t', index_col = 0).T - df_samples = df_samples.round(8) - - for output_type in output_types: - if(output_type == "mean"): - df_mean = df_samples.mean() - df_mean = df_mean.to_frame().T - df_mean = df_mean.reset_index(drop=True) - df_mean.index = [model_name] - elif(output_type == "median"): - df_median = df_samples.median() - df_median = df_median.to_frame().T - df_median = df_median.reset_index(drop=True) - df_median.index = [model_name] - elif(output_type == "quantiles"): - newRow = [] - cols = [] - for rxn in df_samples.columns: - quantiles = df_samples[rxn].quantile([0.25, 0.50, 0.75]) - newRow.append(quantiles[0.25]) - cols.append(rxn + "_q1") - newRow.append(quantiles[0.5]) - cols.append(rxn + "_q2") - newRow.append(quantiles[0.75]) - cols.append(rxn + "_q3") - df_quantiles = pd.DataFrame(columns=cols) - df_quantiles.loc[0] = newRow - df_quantiles = df_quantiles.reset_index(drop=True) - df_quantiles.index = [model_name] - - return df_mean, df_median, df_quantiles - -def fluxes_analysis(model:cobra.Model, model_name:str, output_types:List)-> List[pd.DataFrame]: - """ - Performs flux analysis including pFBA, FVA, and sensitivity analysis. The objective function - is assumed to be already set in the model. - - Args: - model (cobra.Model): The COBRA model to analyze. - model_name (str): Name of the model, used in filenames for output. - output_types (List[str]): Types of analysis to perform (pFBA, FVA, sensitivity). - - Returns: - List[pd.DataFrame]: List of DataFrames containing pFBA, FVA, and sensitivity analysis results. 
- """ - - df_pFBA = pd.DataFrame() - df_FVA= pd.DataFrame() - df_sensitivity= pd.DataFrame() - - for output_type in output_types: - if(output_type == "pFBA"): - solution = cobra.flux_analysis.pfba(model) - fluxes = solution.fluxes - df_pFBA.loc[0,[rxn.id for rxn in model.reactions]] = fluxes.tolist() - df_pFBA = df_pFBA.reset_index(drop=True) - df_pFBA.index = [model_name] - df_pFBA = df_pFBA.astype(float).round(6) - elif(output_type == "FVA"): - fva = cobra.flux_analysis.flux_variability_analysis(model, fraction_of_optimum=ARGS.perc_opt, processes=1).round(8) - columns = [] - for rxn in fva.index.to_list(): - columns.append(rxn + "_min") - columns.append(rxn + "_max") - df_FVA= pd.DataFrame(columns = columns) - for index_rxn, row in fva.iterrows(): - df_FVA.loc[0, index_rxn+ "_min"] = fva.loc[index_rxn, "minimum"] - df_FVA.loc[0, index_rxn+ "_max"] = fva.loc[index_rxn, "maximum"] - df_FVA = df_FVA.reset_index(drop=True) - df_FVA.index = [model_name] - df_FVA = df_FVA.astype(float).round(6) - elif(output_type == "sensitivity"): - solution_original = model.optimize().objective_value - reactions = model.reactions - single = cobra.flux_analysis.single_reaction_deletion(model) - newRow = [] - df_sensitivity = pd.DataFrame(columns = [rxn.id for rxn in reactions], index = [model_name]) - for rxn in reactions: - newRow.append(single.knockout[rxn.id].growth.values[0]/solution_original) - df_sensitivity.loc[model_name] = newRow - df_sensitivity = df_sensitivity.astype(float).round(6) - return df_pFBA, df_FVA, df_sensitivity - -############################# main ########################################### -def main(args: List[str] = None) -> None: - """ - Initialize and run sampling/analysis based on the frontend input arguments. - - Returns: - None - """ - - num_processors = max(1, cpu_count() - 1) - - global ARGS - ARGS = process_args(args) - - if not os.path.exists('flux_simulation'): - os.makedirs('flux_simulation') - - # --- Normalize inputs (the tool may pass comma-separated --input and either --name or --names) --- - ARGS.input_files = ARGS.input.split(",") if ARGS.input else [] - ARGS.file_names = ARGS.name.split(",") - # output types (required) -> list - ARGS.output_types = ARGS.output_type.split(",") if ARGS.output_type else [] - # optional analysis output types -> list or empty - ARGS.output_type_analysis = ARGS.output_type_analysis.split(",") if ARGS.output_type_analysis else [] - - # Determine if sampling should be performed - if ARGS.sampling_enabled == "true": - perform_sampling = True - else: - perform_sampling = False - - print("=== INPUT FILES ===") - print(f"{ARGS.input_files}") - print(f"{ARGS.file_names}") - print(f"{ARGS.output_type}") - print(f"{ARGS.output_types}") - print(f"{ARGS.output_type_analysis}") - print(f"Sampling enabled: {perform_sampling} (n_samples: {ARGS.n_samples})") - - if ARGS.model_and_bounds == "True": - # MODE 1: Model + bounds (separate files) - print("=== MODE 1: Model + Bounds (separate files) ===") - - # Load base model - if not ARGS.model_upload: - sys.exit("Error: model_upload is required for Mode 1") - - base_model = model_utils.build_cobra_model_from_csv(ARGS.model_upload) - - validation = model_utils.validate_model(base_model) - - print("\n=== MODEL VALIDATION ===") - for key, value in validation.items(): - print(f"{key}: {value}") - - # Set solver verbosity to 1 to see warning and error messages only. 
- base_model.solver.configuration.verbosity = 1 - - # Process each bounds file with the base model - results = Parallel(n_jobs=num_processors)( - delayed(model_sampler_with_bounds)(base_model, bounds_file, cell_name) - for bounds_file, cell_name in zip(ARGS.input_files, ARGS.file_names) - ) - - else: - # MODE 2: Multiple complete models - print("=== MODE 2: Multiple complete models ===") - - # Process each complete model file - results = Parallel(n_jobs=num_processors)( - delayed(perform_sampling_and_analysis)(model_utils.build_cobra_model_from_csv(model_file), cell_name) - for model_file, cell_name in zip(ARGS.input_files, ARGS.file_names) - ) - - # Handle sampling outputs (only if sampling was performed) - if perform_sampling: - print("=== PROCESSING SAMPLING RESULTS ===") - - all_mean = pd.concat([result[0] for result in results], ignore_index=False) - all_median = pd.concat([result[1] for result in results], ignore_index=False) - all_quantiles = pd.concat([result[2] for result in results], ignore_index=False) - - if "mean" in ARGS.output_types: - all_mean = all_mean.fillna(0.0) - all_mean = all_mean.sort_index() - write_to_file(all_mean.T, ARGS.out_mean, True) - - if "median" in ARGS.output_types: - all_median = all_median.fillna(0.0) - all_median = all_median.sort_index() - write_to_file(all_median.T, ARGS.out_median, True) - - if "quantiles" in ARGS.output_types: - all_quantiles = all_quantiles.fillna(0.0) - all_quantiles = all_quantiles.sort_index() - write_to_file(all_quantiles.T, ARGS.out_quantiles, True) - else: - print("=== SAMPLING SKIPPED (n_samples = 0 or sampling disabled) ===") - - # Handle optimization analysis outputs (always available) - print("=== PROCESSING OPTIMIZATION RESULTS ===") - - # Determine the starting index for optimization results - # If sampling was performed, optimization results start at index 3 - # If no sampling, optimization results start at index 0 - index_result = 3 if perform_sampling else 0 - - if "pFBA" in ARGS.output_type_analysis: - all_pFBA = pd.concat([result[index_result] for result in results], ignore_index=False) - all_pFBA = all_pFBA.sort_index() - write_to_file(all_pFBA.T, ARGS.out_pfba, True) - index_result += 1 - - if "FVA" in ARGS.output_type_analysis: - all_FVA = pd.concat([result[index_result] for result in results], ignore_index=False) - all_FVA = all_FVA.sort_index() - write_to_file(all_FVA.T, ARGS.out_fva, True) - index_result += 1 - - if "sensitivity" in ARGS.output_type_analysis: - all_sensitivity = pd.concat([result[index_result] for result in results], ignore_index=False) - all_sensitivity = all_sensitivity.sort_index() - write_to_file(all_sensitivity.T, ARGS.out_sensitivity, True) - - return - -############################################################################## -if __name__ == "__main__": +""" +Flux sampling and analysis utilities for COBRA models. + +This script supports two modes: +- Mode 1 (model_and_bounds=True): load a base model and apply bounds from + separate files before sampling. +- Mode 2 (model_and_bounds=False): load complete models and sample directly. + +Sampling algorithms supported: OPTGP and CBS. Outputs include flux samples +and optional analyses (pFBA, FVA, sensitivity), saved as tabular files. 
+""" + +import argparse +from typing import List +import os +import pandas as pd +import numpy as np +import cobra +from joblib import Parallel, delayed, cpu_count +from cobra.sampling import OptGPSampler +import sys + +try: + from .utils import general_utils as utils + from .utils import CBS_backend + from .utils import model_utils +except: + import utils.general_utils as utils + import utils.CBS_backend as CBS_backend + import utils.model_utils as model_utils + + +################################# process args ############################### +def process_args(args: List[str] = None) -> argparse.Namespace: + """ + Processes command-line arguments. + + Args: + args (list): List of command-line arguments. + + Returns: + Namespace: An object containing parsed arguments. + """ + parser = argparse.ArgumentParser(usage='%(prog)s [options]', + description='process some value\'s') + + parser.add_argument("-mo", "--model_upload", type=str, + help="path to input file with custom rules, if provided") + + parser.add_argument("-mab", "--model_and_bounds", type=str, + choices=['True', 'False'], + required=True, + help="upload mode: True for model+bounds, False for complete models") + + parser.add_argument("-ens", "--sampling_enabled", type=str, + choices=['true', 'false'], + required=True, + help="enable sampling: 'true' for sampling, 'false' for no sampling") + + parser.add_argument('-ol', '--out_log', + help="Output log") + + parser.add_argument('-td', '--tool_dir', + type=str, + default=os.path.dirname(os.path.abspath(__file__)), + help='your tool directory (default: auto-detected package location)') + + parser.add_argument('-in', '--input', + required=True, + type=str, + help='input bounds files or complete model files') + + parser.add_argument('-ni', '--name', + required=True, + type=str, + help='cell names') + + parser.add_argument('-a', '--algorithm', + type=str, + choices=['OPTGP', 'CBS'], + required=True, + help='choose sampling algorithm') + + parser.add_argument('-th', '--thinning', + type=int, + default=100, + required=True, + help='choose thinning') + + parser.add_argument('-ns', '--n_samples', + type=int, + required=True, + help='choose how many samples (set to 0 for optimization only)') + + parser.add_argument('-sd', '--seed', + type=int, + required=True, + help='seed for random number generation') + + parser.add_argument('-nb', '--n_batches', + type=int, + required=True, + help='choose how many batches') + + parser.add_argument('-opt', '--perc_opt', + type=float, + default=0.9, + required=False, + help='choose the fraction of optimality for FVA (0-1)') + + parser.add_argument('-ot', '--output_type', + type=str, + required=True, + help='output type for sampling results') + + parser.add_argument('-ota', '--output_type_analysis', + type=str, + required=False, + help='output type analysis (optimization methods)') + + parser.add_argument('-idop', '--output_path', + type=str, + default='flux_simulation/', + help = 'output path for fluxes') + + parser.add_argument('-otm', '--out_mean', + type = str, + required=False, + help = 'output of mean of fluxes') + + parser.add_argument('-otmd', '--out_median', + type = str, + required=False, + help = 'output of median of fluxes') + + parser.add_argument('-otq', '--out_quantiles', + type = str, + required=False, + help = 'output of quantiles of fluxes') + + parser.add_argument('-otfva', '--out_fva', + type = str, + required=False, + help = 'output of FVA results') + parser.add_argument('-otp', '--out_pfba', + type = str, + required=False, + help = 
'output of pFBA results') + parser.add_argument('-ots', '--out_sensitivity', + type = str, + required=False, + help = 'output of sensitivity results') + ARGS = parser.parse_args(args) + return ARGS +########################### warning ########################################### +def warning(s :str) -> None: + """ + Log a warning message to an output log file and print it to the console. + + Args: + s (str): The warning message to be logged and printed. + + Returns: + None + """ + with open(ARGS.out_log, 'a') as log: + log.write(s + "\n\n") + print(s) + + +def write_to_file(dataset: pd.DataFrame, path: str, keep_index:bool=False, name:str=None)->None: + """ + Write a DataFrame to a TSV file under path with a given base name. + + Args: + dataset: The DataFrame to write. + name: Base file name (without extension). If None, 'path' is treated as the full file path. + path: Directory path where the file will be saved. + keep_index: Whether to keep the DataFrame index in the file. + + Returns: + None + """ + dataset.index.name = 'Reactions' + if name: + dataset.to_csv(os.path.join(path, name + ".csv"), sep = '\t', index = keep_index) + else: + dataset.to_csv(path, sep = '\t', index = keep_index) + +############################ dataset input #################################### +def read_dataset(data :str, name :str) -> pd.DataFrame: + """ + Read a dataset from a CSV file and return it as a pandas DataFrame. + + Args: + data (str): Path to the CSV file containing the dataset. + name (str): Name of the dataset, used in error messages. + + Returns: + pandas.DataFrame: DataFrame containing the dataset. + + Raises: + pd.errors.EmptyDataError: If the CSV file is empty. + sys.exit: If the CSV file has the wrong format, the execution is aborted. + """ + try: + dataset = pd.read_csv(data, sep = '\t', header = 0, index_col=0, engine='python') + except pd.errors.EmptyDataError: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + if len(dataset.columns) < 2: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + return dataset + + + +def OPTGP_sampler(model: cobra.Model, model_name: str, n_samples: int = 1000, thinning: int = 100, n_batches: int = 1, seed: int = 0) -> None: + """ + Samples from the OPTGP (Optimal Global Perturbation) algorithm and saves the results to CSV files. + + Args: + model (cobra.Model): The COBRA model to sample from. + model_name (str): The name of the model, used in naming output files. + n_samples (int, optional): Number of samples per batch. Default is 1000. + thinning (int, optional): Thinning parameter for the sampler. Default is 100. + n_batches (int, optional): Number of batches to run. Default is 1. + seed (int, optional): Random seed for reproducibility. Default is 0. 
+ + Returns: + None + """ + import numpy as np + + # Get reaction IDs for consistent column ordering + reaction_ids = [rxn.id for rxn in model.reactions] + + # Sample and save each batch as numpy file + for i in range(n_batches): + optgp = OptGPSampler(model, thinning, seed) + samples = optgp.sample(n_samples) + + # Save as numpy array (more memory efficient) + batch_filename = f"{ARGS.output_path}/{model_name}_{i}_OPTGP.npy" + np.save(batch_filename, samples.to_numpy()) + + seed += 1 + + # Merge all batches into a single DataFrame + all_samples = [] + + for i in range(n_batches): + batch_filename = f"{ARGS.output_path}/{model_name}_{i}_OPTGP.npy" + batch_data = np.load(batch_filename, allow_pickle=True) + all_samples.append(batch_data) + + # Concatenate all batches + samplesTotal_array = np.vstack(all_samples) + + # Convert back to DataFrame with proper column names + samplesTotal = pd.DataFrame(samplesTotal_array, columns=reaction_ids) + + # Save the final merged result as CSV + write_to_file(samplesTotal.T, ARGS.output_path, True, name=model_name) + + # Clean up temporary numpy files + for i in range(n_batches): + batch_filename = f"{ARGS.output_path}/{model_name}_{i}_OPTGP.npy" + if os.path.exists(batch_filename): + os.remove(batch_filename) + + +def CBS_sampler(model: cobra.Model, model_name: str, n_samples: int = 1000, n_batches: int = 1, seed: int = 0) -> None: + """ + Samples using the CBS (Constraint-based Sampling) algorithm and saves the results to CSV files. + + Args: + model (cobra.Model): The COBRA model to sample from. + model_name (str): The name of the model, used in naming output files. + n_samples (int, optional): Number of samples per batch. Default is 1000. + n_batches (int, optional): Number of batches to run. Default is 1. + seed (int, optional): Random seed for reproducibility. Default is 0. + + Returns: + None + """ + import numpy as np + + # Get reaction IDs for consistent column ordering + reaction_ids = [reaction.id for reaction in model.reactions] + + # Perform FVA analysis once for all batches + df_FVA = cobra.flux_analysis.flux_variability_analysis(model, fraction_of_optimum=0).round(6) + + # Generate random objective functions for all samples across all batches + df_coefficients = CBS_backend.randomObjectiveFunction(model, n_samples * n_batches, df_FVA, seed=seed) + + # Sample and save each batch as numpy file + for i in range(n_batches): + samples = pd.DataFrame(columns=reaction_ids, index=range(n_samples)) + + try: + CBS_backend.randomObjectiveFunctionSampling( + model, + n_samples, + df_coefficients.iloc[:, i * n_samples:(i + 1) * n_samples], + samples + ) + except Exception as e: + utils.logWarning( + f"Warning: GLPK solver has failed for {model_name}. Trying with COBRA interface. 
Error: {str(e)}", + ARGS.out_log + ) + CBS_backend.randomObjectiveFunctionSampling_cobrapy( + model, + n_samples, + df_coefficients.iloc[:, i * n_samples:(i + 1) * n_samples], + samples + ) + + # Save as numpy array (more memory efficient) + batch_filename = f"{ARGS.output_path}/{model_name}_{i}_CBS.npy" + utils.logWarning(batch_filename, ARGS.out_log) + np.save(batch_filename, samples.to_numpy()) + + # Merge all batches into a single DataFrame + all_samples = [] + + for i in range(n_batches): + batch_filename = f"{ARGS.output_path}/{model_name}_{i}_CBS.npy" + batch_data = np.load(batch_filename, allow_pickle=True) + all_samples.append(batch_data) + + # Concatenate all batches + samplesTotal_array = np.vstack(all_samples) + + # Convert back to DataFrame with proper column namesq + samplesTotal = pd.DataFrame(samplesTotal_array, columns=reaction_ids) + + # Save the final merged result as CSV + write_to_file(samplesTotal.T, ARGS.output_path, True, name=model_name) + + # Clean up temporary numpy files + for i in range(n_batches): + batch_filename = f"{ARGS.output_path}/{model_name}_{i}_CBS.npy" + if os.path.exists(batch_filename): + os.remove(batch_filename) + + + +def model_sampler_with_bounds(model_input_original: cobra.Model, bounds_path: str, cell_name: str) -> List[pd.DataFrame]: + """ + MODE 1: Prepares the model with bounds from separate bounds file and performs sampling. + + Args: + model_input_original (cobra.Model): The original COBRA model. + bounds_path (str): Path to the CSV file containing the bounds dataset. + cell_name (str): Name of the cell, used to generate filenames for output. + + Returns: + List[pd.DataFrame]: A list of DataFrames containing statistics and analysis results. + """ + + model_input = model_input_original.copy() + bounds_df = read_dataset(bounds_path, "bounds dataset") + + # Apply bounds to model + for rxn_index, row in bounds_df.iterrows(): + try: + model_input.reactions.get_by_id(rxn_index).lower_bound = row.lower_bound + model_input.reactions.get_by_id(rxn_index).upper_bound = row.upper_bound + except KeyError: + warning(f"Warning: Reaction {rxn_index} not found in model. Skipping.") + + return perform_sampling_and_analysis(model_input, cell_name) + + +def perform_sampling_and_analysis(model_input: cobra.Model, cell_name: str) -> List[pd.DataFrame]: + """ + Common function to perform sampling and analysis on a prepared model. + + Args: + model_input (cobra.Model): The prepared COBRA model with bounds applied. + cell_name (str): Name of the cell, used to generate filenames for output. + + Returns: + List[pd.DataFrame]: A list of DataFrames containing statistics and analysis results. 
+ """ + + returnList = [] + + if ARGS.sampling_enabled == "true": + + if ARGS.algorithm == 'OPTGP': + OPTGP_sampler(model_input, cell_name, ARGS.n_samples, ARGS.thinning, ARGS.n_batches, ARGS.seed) + elif ARGS.algorithm == 'CBS': + CBS_sampler(model_input, cell_name, ARGS.n_samples, ARGS.n_batches, ARGS.seed) + + df_mean, df_median, df_quantiles = fluxes_statistics(cell_name, ARGS.output_types) + + if("fluxes" not in ARGS.output_types): + os.remove(ARGS.output_path + "/" + cell_name + '.csv') + + returnList = [df_mean, df_median, df_quantiles] + + df_pFBA, df_FVA, df_sensitivity = fluxes_analysis(model_input, cell_name, ARGS.output_type_analysis) + + if("pFBA" in ARGS.output_type_analysis): + returnList.append(df_pFBA) + if("FVA" in ARGS.output_type_analysis): + returnList.append(df_FVA) + if("sensitivity" in ARGS.output_type_analysis): + returnList.append(df_sensitivity) + + return returnList + +def fluxes_statistics(model_name: str, output_types:List)-> List[pd.DataFrame]: + """ + Computes statistics (mean, median, quantiles) for the fluxes. + + Args: + model_name (str): Name of the model, used in filename for input. + output_types (List[str]): Types of statistics to compute (mean, median, quantiles). + + Returns: + List[pd.DataFrame]: List of DataFrames containing mean, median, and quantiles statistics. + """ + + df_mean = pd.DataFrame() + df_median= pd.DataFrame() + df_quantiles= pd.DataFrame() + + df_samples = pd.read_csv(ARGS.output_path + "/" + model_name + '.csv', sep = '\t', index_col = 0).T + df_samples = df_samples.round(8) + + for output_type in output_types: + if(output_type == "mean"): + df_mean = df_samples.mean() + df_mean = df_mean.to_frame().T + df_mean = df_mean.reset_index(drop=True) + df_mean.index = [model_name] + elif(output_type == "median"): + df_median = df_samples.median() + df_median = df_median.to_frame().T + df_median = df_median.reset_index(drop=True) + df_median.index = [model_name] + elif(output_type == "quantiles"): + newRow = [] + cols = [] + for rxn in df_samples.columns: + quantiles = df_samples[rxn].quantile([0.25, 0.50, 0.75]) + newRow.append(quantiles[0.25]) + cols.append(rxn + "_q1") + newRow.append(quantiles[0.5]) + cols.append(rxn + "_q2") + newRow.append(quantiles[0.75]) + cols.append(rxn + "_q3") + df_quantiles = pd.DataFrame(columns=cols) + df_quantiles.loc[0] = newRow + df_quantiles = df_quantiles.reset_index(drop=True) + df_quantiles.index = [model_name] + + return df_mean, df_median, df_quantiles + +def fluxes_analysis(model:cobra.Model, model_name:str, output_types:List)-> List[pd.DataFrame]: + """ + Performs flux analysis including pFBA, FVA, and sensitivity analysis. The objective function + is assumed to be already set in the model. + + Args: + model (cobra.Model): The COBRA model to analyze. + model_name (str): Name of the model, used in filenames for output. + output_types (List[str]): Types of analysis to perform (pFBA, FVA, sensitivity). + + Returns: + List[pd.DataFrame]: List of DataFrames containing pFBA, FVA, and sensitivity analysis results. 
+ """ + + df_pFBA = pd.DataFrame() + df_FVA= pd.DataFrame() + df_sensitivity= pd.DataFrame() + + for output_type in output_types: + if(output_type == "pFBA"): + solution = cobra.flux_analysis.pfba(model) + fluxes = solution.fluxes + df_pFBA.loc[0,[rxn.id for rxn in model.reactions]] = fluxes.tolist() + df_pFBA = df_pFBA.reset_index(drop=True) + df_pFBA.index = [model_name] + df_pFBA = df_pFBA.astype(float).round(6) + elif(output_type == "FVA"): + fva = cobra.flux_analysis.flux_variability_analysis(model, fraction_of_optimum=ARGS.perc_opt, processes=1).round(8) + columns = [] + for rxn in fva.index.to_list(): + columns.append(rxn + "_min") + columns.append(rxn + "_max") + df_FVA= pd.DataFrame(columns = columns) + for index_rxn, row in fva.iterrows(): + df_FVA.loc[0, index_rxn+ "_min"] = fva.loc[index_rxn, "minimum"] + df_FVA.loc[0, index_rxn+ "_max"] = fva.loc[index_rxn, "maximum"] + df_FVA = df_FVA.reset_index(drop=True) + df_FVA.index = [model_name] + df_FVA = df_FVA.astype(float).round(6) + elif(output_type == "sensitivity"): + solution_original = model.optimize().objective_value + reactions = model.reactions + single = cobra.flux_analysis.single_reaction_deletion(model) + newRow = [] + df_sensitivity = pd.DataFrame(columns = [rxn.id for rxn in reactions], index = [model_name]) + for rxn in reactions: + newRow.append(single.knockout[rxn.id].growth.values[0]/solution_original) + df_sensitivity.loc[model_name] = newRow + df_sensitivity = df_sensitivity.astype(float).round(6) + return df_pFBA, df_FVA, df_sensitivity + +############################# main ########################################### +def main(args: List[str] = None) -> None: + """ + Initialize and run sampling/analysis based on the frontend input arguments. + + Returns: + None + """ + + num_processors = max(1, cpu_count() - 1) + + global ARGS + ARGS = process_args(args) + + if not os.path.exists('flux_simulation'): + os.makedirs('flux_simulation') + + # --- Normalize inputs (the tool may pass comma-separated --input and either --name or --names) --- + ARGS.input_files = ARGS.input.split(",") if ARGS.input else [] + ARGS.file_names = ARGS.name.split(",") + # output types (required) -> list + ARGS.output_types = ARGS.output_type.split(",") if ARGS.output_type else [] + # optional analysis output types -> list or empty + ARGS.output_type_analysis = ARGS.output_type_analysis.split(",") if ARGS.output_type_analysis else [] + + # Determine if sampling should be performed + if ARGS.sampling_enabled == "true": + perform_sampling = True + else: + perform_sampling = False + + print("=== INPUT FILES ===") + print(f"{ARGS.input_files}") + print(f"{ARGS.file_names}") + print(f"{ARGS.output_type}") + print(f"{ARGS.output_types}") + print(f"{ARGS.output_type_analysis}") + print(f"Sampling enabled: {perform_sampling} (n_samples: {ARGS.n_samples})") + + if ARGS.model_and_bounds == "True": + # MODE 1: Model + bounds (separate files) + print("=== MODE 1: Model + Bounds (separate files) ===") + + # Load base model + if not ARGS.model_upload: + sys.exit("Error: model_upload is required for Mode 1") + + base_model = model_utils.build_cobra_model_from_csv(ARGS.model_upload) + + validation = model_utils.validate_model(base_model) + + print("\n=== MODEL VALIDATION ===") + for key, value in validation.items(): + print(f"{key}: {value}") + + # Set solver verbosity to 1 to see warning and error messages only. 
+ base_model.solver.configuration.verbosity = 1 + + # Process each bounds file with the base model + results = Parallel(n_jobs=num_processors)( + delayed(model_sampler_with_bounds)(base_model, bounds_file, cell_name) + for bounds_file, cell_name in zip(ARGS.input_files, ARGS.file_names) + ) + + else: + # MODE 2: Multiple complete models + print("=== MODE 2: Multiple complete models ===") + + # Process each complete model file + results = Parallel(n_jobs=num_processors)( + delayed(perform_sampling_and_analysis)(model_utils.build_cobra_model_from_csv(model_file), cell_name) + for model_file, cell_name in zip(ARGS.input_files, ARGS.file_names) + ) + + # Handle sampling outputs (only if sampling was performed) + if perform_sampling: + print("=== PROCESSING SAMPLING RESULTS ===") + + all_mean = pd.concat([result[0] for result in results], ignore_index=False) + all_median = pd.concat([result[1] for result in results], ignore_index=False) + all_quantiles = pd.concat([result[2] for result in results], ignore_index=False) + + if "mean" in ARGS.output_types: + all_mean = all_mean.fillna(0.0) + all_mean = all_mean.sort_index() + write_to_file(all_mean.T, ARGS.out_mean, True) + + if "median" in ARGS.output_types: + all_median = all_median.fillna(0.0) + all_median = all_median.sort_index() + write_to_file(all_median.T, ARGS.out_median, True) + + if "quantiles" in ARGS.output_types: + all_quantiles = all_quantiles.fillna(0.0) + all_quantiles = all_quantiles.sort_index() + write_to_file(all_quantiles.T, ARGS.out_quantiles, True) + else: + print("=== SAMPLING SKIPPED (n_samples = 0 or sampling disabled) ===") + + # Handle optimization analysis outputs (always available) + print("=== PROCESSING OPTIMIZATION RESULTS ===") + + # Determine the starting index for optimization results + # If sampling was performed, optimization results start at index 3 + # If no sampling, optimization results start at index 0 + index_result = 3 if perform_sampling else 0 + + if "pFBA" in ARGS.output_type_analysis: + all_pFBA = pd.concat([result[index_result] for result in results], ignore_index=False) + all_pFBA = all_pFBA.sort_index() + write_to_file(all_pFBA.T, ARGS.out_pfba, True) + index_result += 1 + + if "FVA" in ARGS.output_type_analysis: + all_FVA = pd.concat([result[index_result] for result in results], ignore_index=False) + all_FVA = all_FVA.sort_index() + write_to_file(all_FVA.T, ARGS.out_fva, True) + index_result += 1 + + if "sensitivity" in ARGS.output_type_analysis: + all_sensitivity = pd.concat([result[index_result] for result in results], ignore_index=False) + all_sensitivity = all_sensitivity.sort_index() + write_to_file(all_sensitivity.T, ARGS.out_sensitivity, True) + + return + +############################################################################## +if __name__ == "__main__": main() \ No newline at end of file
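As a quick illustration of the two upload modes described in the module docstring, a Mode 1 run (base model plus per-sample bounds files, CBS sampling) can be driven from Python by passing the flags defined in `process_args`. This is a hedged sketch only: all file names are placeholders, the flat import assumes `COBRAxy/src` is on `sys.path`, and the tabular model is assumed to carry an objective function (required for the pFBA analysis).

```python
import flux_simulation  # assumes COBRAxy/src is on sys.path

# Mode 1: one base model, comma-separated bounds files and matching cell names.
flux_simulation.main([
    "-mab", "True",                               # Mode 1: model + bounds files
    "-mo", "ENGRO2_tabular.tsv",                  # tabular base model
    "-in", "bounds_sampleA.tsv,bounds_sampleB.tsv",
    "-ni", "sampleA,sampleB",
    "-ens", "true",                               # enable sampling
    "-a", "CBS",                                  # or "OPTGP"
    "-th", "100",                                 # thinning (used by OPTGP)
    "-ns", "500",                                 # samples per batch
    "-nb", "1",                                   # number of batches
    "-sd", "0",                                   # random seed
    "-ot", "mean,median",                         # sampling statistics to keep
    "-otm", "flux_mean.tsv",
    "-otmd", "flux_median.tsv",
    "-ota", "pFBA",                               # optional optimization analysis
    "-otp", "flux_pfba.tsv",
    "-ol", "flux_simulation_log.txt",
    "-idop", "flux_simulation",                   # output directory for per-sample fluxes
])
```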
--- a/COBRAxy/src/flux_to_map.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/flux_to_map.py Sun Oct 26 19:27:41 2025 +0000 @@ -1,1082 +1,1085 @@ -from __future__ import division -import csv -from enum import Enum -import re -import sys -import numpy as np -import pandas as pd -import itertools as it -import scipy.stats as st -import lxml.etree as ET -import math -import utils.general_utils as utils -from PIL import Image -import os -import copy -import argparse -import pyvips -from PIL import Image -from typing import Tuple, Union, Optional, List, Dict -import matplotlib.pyplot as plt - -ERRORS = [] -########################## argparse ########################################## -ARGS :argparse.Namespace -def process_args(args:List[str] = None) -> argparse.Namespace: - """ - Interfaces the script of a module with its frontend, making the user's choices for various parameters available as values in code. - - Args: - args : Always obtained (in file) from sys.argv - - Returns: - Namespace : An object containing the parsed arguments - """ - parser = argparse.ArgumentParser( - usage = "%(prog)s [options]", - description = "process some value's genes to create a comparison's map.") - - #General: - parser.add_argument( - '-td', '--tool_dir', - type = str, - required = True, - help = 'your tool directory') - - parser.add_argument('-on', '--control', type = str) - parser.add_argument('-ol', '--out_log', help = "Output log") - - #Computation details: - parser.add_argument( - '-co', '--comparison', - type = str, - default = 'manyvsmany', - choices = ['manyvsmany', 'onevsrest', 'onevsmany']) - - parser.add_argument( - '-te' ,'--test', - type = str, - default = 'ks', - choices = ['ks', 'ttest_p', 'ttest_ind', 'wilcoxon', 'mw'], - help = 'Statistical test to use (default: %(default)s)') - - parser.add_argument( - '-pv' ,'--pValue', - type = float, - default = 0.1, - help = 'P-Value threshold (default: %(default)s)') - - parser.add_argument( - '-adj' ,'--adjusted', - type = utils.Bool("adjusted"), default = False, - help = 'Apply the FDR (Benjamini-Hochberg) correction (default: %(default)s)') - - parser.add_argument( - '-fc', '--fChange', - type = float, - default = 1.5, - help = 'Fold-Change threshold (default: %(default)s)') - - parser.add_argument( - '-op', '--option', - type = str, - choices = ['datasets', 'dataset_class'], - help='dataset or dataset and class') - - parser.add_argument( - '-idf', '--input_data_fluxes', - type = str, - help = 'input dataset fluxes') - - parser.add_argument( - '-icf', '--input_class_fluxes', - type = str, - help = 'sample group specification fluxes') - - parser.add_argument( - '-idsf', '--input_datas_fluxes', - type = str, - nargs = '+', - help = 'input datasets fluxes') - - parser.add_argument( - '-naf', '--names_fluxes', - type = str, - nargs = '+', - help = 'input names fluxes') - - #Output: - parser.add_argument( - "-gs", "--generate_svg", - type = utils.Bool("generate_svg"), default = True, - help = "choose whether to generate svg") - - parser.add_argument( - "-gp", "--generate_pdf", - type = utils.Bool("generate_pdf"), default = True, - help = "choose whether to generate pdf") - - parser.add_argument( - '-cm', '--custom_map', - type = str, - help='custom map to use') - - parser.add_argument( - '-mc', '--choice_map', - type = utils.Model, default = utils.Model.HMRcore, - choices = [utils.Model.HMRcore, utils.Model.ENGRO2, utils.Model.Custom]) - - parser.add_argument( - '-colorm', '--color_map', - type = str, - choices = ["jet", "viridis"]) - - 
parser.add_argument( - '-idop', '--output_path', - type = str, - default='result', - help = 'output path for maps') - - args :argparse.Namespace = parser.parse_args(args) - args.net = True # TODO SICCOME I FLUSSI POSSONO ESSERE ANCHE NEGATIVI SONO SEMPRE CONSIDERATI NETTI - - return args - -############################ dataset input #################################### -def read_dataset(data :str, name :str) -> pd.DataFrame: - """ - Tries to read the dataset from its path (data) as a tsv and turns it into a DataFrame. - - Args: - data : filepath of a dataset (from frontend input params or literals upon calling) - name : name associated with the dataset (from frontend input params or literals upon calling) - - Returns: - pd.DataFrame : dataset in a runtime operable shape - - Raises: - sys.exit : if there's no data (pd.errors.EmptyDataError) or if the dataset has less than 2 columns - """ - try: - dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python') - except pd.errors.EmptyDataError: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - if len(dataset.columns) < 2: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - return dataset - -############################ dataset name ##################################### -def name_dataset(name_data :str, count :int) -> str: - """ - Produces a unique name for a dataset based on what was provided by the user. The default name for any dataset is "Dataset", thus if the user didn't change it this function appends f"_{count}" to make it unique. - - Args: - name_data : name associated with the dataset (from frontend input params) - count : counter from 1 to make these names unique (external) - - Returns: - str : the name made unique - """ - if str(name_data) == 'Dataset': - return str(name_data) + '_' + str(count) - else: - return str(name_data) - -############################ map_methods ###################################### -FoldChange = Union[float, int, str] # Union[float, Literal[0, "-INF", "INF"]] -def fold_change(avg1 :float, avg2 :float) -> FoldChange: - """ - Calculates the fold change between two gene expression values. - - Args: - avg1 : average expression value from one dataset avg2 : average expression value from the other dataset - - Returns: - FoldChange : - 0 : when both input values are 0 - "-INF" : when avg1 is 0 - "INF" : when avg2 is 0 - float : for any other combination of values - """ - if avg1 == 0 and avg2 == 0: - return 0 - elif avg1 == 0: - return '-INF' - elif avg2 == 0: - return 'INF' - else: # (threshold_F_C - 1) / (abs(threshold_F_C) + 1) con threshold_F_C > 1 - return (avg1 - avg2) / (abs(avg1) + abs(avg2)) - -def getElementById(reactionId :str, metabMap :ET.ElementTree) -> utils.Result[ET.Element, utils.Result.ResultErr]: - """ - Finds any element in the given map with the given ID. ID uniqueness in an svg file is recommended but - not enforced, if more than one element with the exact ID is found only the first will be returned. - - Args: - reactionId (str): exact ID of the requested element. - metabMap (ET.ElementTree): metabolic map containing the element. - - Returns: - utils.Result[ET.Element, ResultErr]: result of the search, either the first match found or a ResultErr. - """ - return utils.Result.Ok( - f"//*[@id=\"{reactionId}\"]").map( - lambda xPath : metabMap.xpath(xPath)[0]).mapErr( - lambda _ : utils.Result.ResultErr(f"No elements with ID \"{reactionId}\" found in map")) - # ^^^ we shamelessly ignore the contents of the IndexError, it offers nothing to the user. 
- -def styleMapElement(element :ET.Element, styleStr :str) -> None: - currentStyles :str = element.get("style", "") - if re.search(r";stroke:[^;]+;stroke-width:[^;]+;stroke-dasharray:[^;]+$", currentStyles): - currentStyles = ';'.join(currentStyles.split(';')[:-3]) - - element.set("style", currentStyles + styleStr) - -class ReactionDirection(Enum): - Unknown = "" - Direct = "_F" - Inverse = "_B" - - @classmethod - def fromDir(cls, s :str) -> "ReactionDirection": - # vvv as long as there's so few variants I actually condone the if spam: - if s == ReactionDirection.Direct.value: return ReactionDirection.Direct - if s == ReactionDirection.Inverse.value: return ReactionDirection.Inverse - return ReactionDirection.Unknown - - @classmethod - def fromReactionId(cls, reactionId :str) -> "ReactionDirection": - return ReactionDirection.fromDir(reactionId[-2:]) - -def getArrowBodyElementId(reactionId :str) -> str: - if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV - elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: reactionId = reactionId[:-2] - return f"R_{reactionId}" - -def getArrowHeadElementId(reactionId :str) -> Tuple[str, str]: - """ - We attempt extracting the direction information from the provided reaction ID, if unsuccessful we provide the IDs of both directions. - - Args: - reactionId : the provided reaction ID. - - Returns: - Tuple[str, str]: either a single str ID for the correct arrow head followed by an empty string or both options to try. - """ - if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV - elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: return reactionId[:-3:-1] + reactionId[:-2], "" - return f"F_{reactionId}", f"B_{reactionId}" - -class ArrowColor(Enum): - """ - Encodes possible arrow colors based on their meaning in the enrichment process. - """ - Invalid = "#BEBEBE" # gray, fold-change under treshold or not significant p-value - Transparent = "#ffffff00" # transparent, to make some arrow segments disappear - UpRegulated = "#ecac68" # red, up-regulated reaction - DownRegulated = "#6495ed" # blue, down-regulated reaction - - UpRegulatedInv = "#FF0000" - # ^^^ different shade of red (actually orange), up-regulated net value for a reversible reaction with - # conflicting enrichment in the two directions. - - DownRegulatedInv = "#0000FF" - # ^^^ different shade of blue (actually purple), down-regulated net value for a reversible reaction with - # conflicting enrichment in the two directions. - - @classmethod - def fromFoldChangeSign(cls, foldChange :float, *, useAltColor = False) -> "ArrowColor": - colors = (cls.DownRegulated, cls.DownRegulatedInv) if foldChange < 0 else (cls.UpRegulated, cls.UpRegulatedInv) - return colors[useAltColor] - - def __str__(self) -> str: return self.value - -class Arrow: - """ - Models the properties of a reaction arrow that change based on enrichment. - """ - MIN_W = 2 - MAX_W = 12 - - def __init__(self, width :int, col: ArrowColor, *, isDashed = False) -> None: - """ - (Private) Initializes an instance of Arrow. - - Args: - width : width of the arrow, ideally to be kept within Arrow.MIN_W and Arrow.MAX_W (not enforced). - col : color of the arrow. - isDashed : whether the arrow should be dashed, meaning the associated pValue resulted not significant. - - Returns: - None : practically, a Arrow instance. 
- """ - self.w = width - self.col = col - self.dash = isDashed - - def applyTo(self, reactionId :str, metabMap :ET.ElementTree, styleStr :str) -> None: - if getElementById(reactionId, metabMap).map(lambda el : styleMapElement(el, styleStr)).isErr: - ERRORS.append(reactionId) - - def styleReactionElements(self, metabMap :ET.ElementTree, reactionId :str, *, mindReactionDir = True) -> None: - if not mindReactionDir: - return self.applyTo(getArrowBodyElementId(reactionId), metabMap, self.toStyleStr()) - - # Now we style the arrow head(s): - idOpt1, idOpt2 = getArrowHeadElementId(reactionId) - self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) - if idOpt2: self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) - - def styleReactionElementsMeanMedian(self, metabMap :ET.ElementTree, reactionId :str, isNegative:bool) -> None: - - self.applyTo(getArrowBodyElementId(reactionId), metabMap, self.toStyleStr()) - idOpt1, idOpt2 = getArrowHeadElementId(reactionId) - - if(isNegative): - self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) - self.col = ArrowColor.Transparent - self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) #trasp - else: - self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) - self.col = ArrowColor.Transparent - self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) #trasp - - - - def getMapReactionId(self, reactionId :str, mindReactionDir :bool) -> str: - """ - Computes the reaction ID as encoded in the map for a given reaction ID from the dataset. - - Args: - reactionId: the reaction ID, as encoded in the dataset. - mindReactionDir: if True forward (F_) and backward (B_) directions will be encoded in the result. - - Returns: - str : the ID of an arrow's body or tips in the map. - """ - # we assume the reactionIds also don't encode reaction dir if they don't mind it when styling the map. - if not mindReactionDir: return "R_" + reactionId - - #TODO: this is clearly something we need to make consistent in fluxes - return (reactionId[:-3:-1] + reactionId[:-2]) if reactionId[:-2] in ["_F", "_B"] else f"F_{reactionId}" # "Pyr_F" --> "F_Pyr" - - def toStyleStr(self, *, downSizedForTips = False) -> str: - """ - Collapses the styles of this Arrow into a str, ready to be applied as part of the "style" property on an svg element. - - Returns: - str : the styles string. - """ - width = self.w - if downSizedForTips: width *= 0.8 - return f";stroke:{self.col};stroke-width:{width};stroke-dasharray:{'5,5' if self.dash else 'none'}" - -# vvv These constants could be inside the class itself a static properties, but python -# was built by brainless organisms so here we are! -INVALID_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid) -INSIGNIFICANT_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid, isDashed = True) - -def applyFluxesEnrichmentToMap(fluxesEnrichmentRes :Dict[str, Union[Tuple[float, FoldChange], Tuple[float, FoldChange, float, float]]], metabMap :ET.ElementTree, maxNumericZScore :float) -> None: - """ - Applies fluxes enrichment results to the provided metabolic map. - - Args: - fluxesEnrichmentRes : fluxes enrichment results. - metabMap : the metabolic map to edit. - maxNumericZScore : biggest finite z-score value found. 
- - Side effects: - metabMap : mut - - Returns: - None - """ - for reactionId, values in fluxesEnrichmentRes.items(): - pValue = values[0] - foldChange = values[1] - z_score = values[2] - - if math.isnan(pValue) or (isinstance(foldChange, float) and math.isnan(foldChange)): - continue - - if isinstance(foldChange, str): foldChange = float(foldChange) - if pValue > ARGS.pValue: # pValue above tresh: dashed arrow - INSIGNIFICANT_ARROW.styleReactionElements(metabMap, reactionId) - INSIGNIFICANT_ARROW.styleReactionElements(metabMap, reactionId, mindReactionDir = False) - - continue - - if abs(foldChange) < (ARGS.fChange - 1) / (abs(ARGS.fChange) + 1): - INVALID_ARROW.styleReactionElements(metabMap, reactionId) - INVALID_ARROW.styleReactionElements(metabMap, reactionId, mindReactionDir = False) - - continue - - width = Arrow.MAX_W - if not math.isinf(z_score): - try: - width = min( - max(abs(z_score * Arrow.MAX_W) / maxNumericZScore, Arrow.MIN_W), - Arrow.MAX_W) - - except ZeroDivisionError: pass - # TODO CHECK RV - #if not reactionId.endswith("_RV"): # RV stands for reversible reactions - # Arrow(width, ArrowColor.fromFoldChangeSign(foldChange)).styleReactionElements(metabMap, reactionId) - # continue - - #reactionId = reactionId[:-3] # Remove "_RV" - - inversionScore = (values[3] < 0) + (values[4] < 0) # Compacts the signs of averages into 1 easy to check score - if inversionScore == 2: foldChange *= -1 - # ^^^ Style the inverse direction with the opposite sign netValue - - # If the score is 1 (opposite signs) we use alternative colors vvv - arrow = Arrow(width, ArrowColor.fromFoldChangeSign(foldChange, useAltColor = inversionScore == 1)) - - # vvv These 2 if statements can both be true and can both happen - if ARGS.net: # style arrow head(s): - arrow.styleReactionElements(metabMap, reactionId + ("_B" if inversionScore == 2 else "_F")) - arrow.applyTo(("F_" if inversionScore == 2 else "B_") + reactionId, metabMap, f";stroke:{ArrowColor.Transparent};stroke-width:0;stroke-dasharray:None") - - arrow.styleReactionElements(metabMap, reactionId, mindReactionDir = False) - - -############################ split class ###################################### -def split_class(classes :pd.DataFrame, resolve_rules :Dict[str, List[float]]) -> Dict[str, List[List[float]]]: - """ - Generates a :dict that groups together data from a :DataFrame based on classes the data is related to. 
- - Args: - classes : a :DataFrame of only string values, containing class information (rows) and keys to query the resolve_rules :dict - resolve_rules : a :dict containing :float data - - Returns: - dict : the dict with data grouped by class - - Side effects: - classes : mut - """ - class_pat :Dict[str, List[List[float]]] = {} - for i in range(len(classes)): - classe :str = classes.iloc[i, 1] - if pd.isnull(classe): continue - - l :List[List[float]] = [] - for j in range(i, len(classes)): - if classes.iloc[j, 1] == classe: - pat_id :str = classes.iloc[j, 0] - tmp = resolve_rules.get(pat_id, None) - if tmp != None: - l.append(tmp) - classes.iloc[j, 1] = None - - if l: - class_pat[classe] = list(map(list, zip(*l))) - continue - - utils.logWarning( - f"Warning: no sample found in class \"{classe}\", the class has been disregarded", ARGS.out_log) - - return class_pat - -############################ conversion ############################################## -#conversion from svg to png -def svg_to_png_with_background(svg_path :utils.FilePath, png_path :utils.FilePath, dpi :int = 72, scale :int = 1, size :Optional[float] = None) -> None: - """ - Internal utility to convert an SVG to PNG (forced opaque) to aid in PDF conversion. - - Args: - svg_path : path to SVG file - png_path : path for new PNG file - dpi : dots per inch of the generated PNG - scale : scaling factor for the generated PNG, computed internally when a size is provided - size : final effective width of the generated PNG - - Returns: - None - """ - if size: - image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=1) - scale = size / image.width - image = image.resize(scale) - else: - image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=scale) - - white_background = pyvips.Image.black(image.width, image.height).new_from_image([255, 255, 255]) - white_background = white_background.affine([scale, 0, 0, scale]) - - if white_background.bands != image.bands: - white_background = white_background.extract_band(0) - - composite_image = white_background.composite2(image, 'over') - composite_image.write_to_file(png_path.show()) - -#funzione unica, lascio fuori i file e li passo in input -#conversion from png to pdf -def convert_png_to_pdf(png_file :utils.FilePath, pdf_file :utils.FilePath) -> None: - """ - Internal utility to convert a PNG to PDF to aid from SVG conversion. - - Args: - png_file : path to PNG file - pdf_file : path to new PDF file - - Returns: - None - """ - image = Image.open(png_file.show()) - image = image.convert("RGB") - image.save(pdf_file.show(), "PDF", resolution=100.0) - -#function called to reduce redundancy in the code -def convert_to_pdf(file_svg :utils.FilePath, file_png :utils.FilePath, file_pdf :utils.FilePath) -> None: - """ - Converts the SVG map at the provided path to PDF. 
- - Args: - file_svg : path to SVG file - file_png : path to PNG file - file_pdf : path to new PDF file - - Returns: - None - """ - svg_to_png_with_background(file_svg, file_png) - try: - convert_png_to_pdf(file_png, file_pdf) - print(f'PDF file {file_pdf.filePath} successfully generated.') - - except Exception as e: - raise utils.DataErr(file_pdf.show(), f'Error generating PDF file: {e}') - -############################ map ############################################## -def buildOutputPath(dataset1Name :str, dataset2Name = "rest", *, details = "", ext :utils.FileFormat) -> utils.FilePath: - """ - Builds a FilePath instance from the names of confronted datasets ready to point to a location in the - "result/" folder, used by this tool for output files in collections. - - Args: - dataset1Name : _description_ - dataset2Name : _description_. Defaults to "rest". - details : _description_ - ext : _description_ - - Returns: - utils.FilePath : _description_ - """ - # This function returns a util data structure but is extremely specific to this module. - # RAS also uses collections as output and as such might benefit from a method like this, but I'd wait - # TODO: until a third tool with multiple outputs appears before porting this to utils. - return utils.FilePath( - f"{dataset1Name}_vs_{dataset2Name}" + (f" ({details})" if details else ""), - # ^^^ yes this string is built every time even if the form is the same for the same 2 datasets in - # all output files: I don't care, this was never the performance bottleneck of the tool and - # there is no other net gain in saving and re-using the built string. - ext, - prefix = ARGS.output_path) - -FIELD_NOT_AVAILABLE = '/' -def writeToCsv(rows: List[list], fieldNames :List[str], outPath :utils.FilePath) -> None: - fieldsAmt = len(fieldNames) - with open(outPath.show(), "w", newline = "") as fd: - writer = csv.DictWriter(fd, fieldnames = fieldNames, delimiter = '\t') - writer.writeheader() - - for row in rows: - sizeMismatch = fieldsAmt - len(row) - if sizeMismatch > 0: row.extend([FIELD_NOT_AVAILABLE] * sizeMismatch) - writer.writerow({ field : data for field, data in zip(fieldNames, row) }) - -OldEnrichedScores = Dict[str, List[Union[float, FoldChange]]] #TODO: try to use Tuple whenever possible -def writeTabularResult(enrichedScores : OldEnrichedScores, outPath :utils.FilePath) -> None: - fieldNames = ["ids", "P_Value", "fold change", "z-score"] - fieldNames.extend(["average_1", "average_2"]) - - writeToCsv([ [reactId] + values for reactId, values in enrichedScores.items() ], fieldNames, outPath) - -def temp_thingsInCommon(tmp :Dict[str, List[Union[float, FoldChange]]], core_map :ET.ElementTree, max_z_score :float, dataset1Name :str, dataset2Name = "rest") -> None: - # this function compiles the things always in common between comparison modes after enrichment. - # TODO: organize, name better. - writeTabularResult(tmp, buildOutputPath(dataset1Name, dataset2Name, details = "Tabular Result", ext = utils.FileFormat.TSV)) - for reactId, enrichData in tmp.items(): tmp[reactId] = tuple(enrichData) - applyFluxesEnrichmentToMap(tmp, core_map, max_z_score) - -def computePValue(dataset1Data: List[float], dataset2Data: List[float]) -> Tuple[float, float]: - """ - Computes the statistical significance score (P-value) of the comparison between coherent data - from two datasets. 
The data is supposed to, in both datasets: - - be related to the same reaction ID; - - be ordered by sample, such that the item at position i in both lists is related to the - same sample or cell line. - - Args: - dataset1Data : data from the 1st dataset. - dataset2Data : data from the 2nd dataset. - - Returns: - tuple: (P-value, Z-score) - - P-value from the selected test on the provided data. - - Z-score of the difference between means of the two datasets. - """ - - match ARGS.test: - case "ks": - # Perform Kolmogorov-Smirnov test - _, p_value = st.ks_2samp(dataset1Data, dataset2Data) - case "ttest_p": - # Datasets should have same size - if len(dataset1Data) != len(dataset2Data): - raise ValueError("Datasets must have the same size for paired t-test.") - # Perform t-test for paired samples - _, p_value = st.ttest_rel(dataset1Data, dataset2Data) - case "ttest_ind": - # Perform t-test for independent samples - _, p_value = st.ttest_ind(dataset1Data, dataset2Data) - case "wilcoxon": - # Datasets should have same size - if len(dataset1Data) != len(dataset2Data): - raise ValueError("Datasets must have the same size for Wilcoxon signed-rank test.") - # Perform Wilcoxon signed-rank test - np.random.seed(42) # Ensure reproducibility since zsplit method is used - _, p_value = st.wilcoxon(dataset1Data, dataset2Data, zero_method="zsplit") - case "mw": - # Perform Mann-Whitney U test - _, p_value = st.mannwhitneyu(dataset1Data, dataset2Data) - - # Calculate means and standard deviations - mean1 = np.nanmean(dataset1Data) - mean2 = np.nanmean(dataset2Data) - std1 = np.nanstd(dataset1Data, ddof=1) - std2 = np.nanstd(dataset2Data, ddof=1) - - n1 = len(dataset1Data) - n2 = len(dataset2Data) - - # Calculate Z-score - z_score = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2)) - - return p_value, z_score - -def compareDatasetPair(dataset1Data :List[List[float]], dataset2Data :List[List[float]], ids :List[str]) -> Tuple[Dict[str, List[Union[float, FoldChange]]], float]: - #TODO: the following code still suffers from "dumbvarnames-osis" - comparisonResult :Dict[str, List[Union[float, FoldChange]]] = {} - count = 0 - max_z_score = 0 - for l1, l2 in zip(dataset1Data, dataset2Data): - reactId = ids[count] - count += 1 - if not reactId: continue # we skip ids that have already been processed - - try: - p_value, z_score = computePValue(l1, l2) - avg1 = sum(l1) / len(l1) - avg2 = sum(l2) / len(l2) - f_c = fold_change(avg1, avg2) - if np.isfinite(z_score) and max_z_score < abs(z_score): max_z_score = abs(z_score) - - comparisonResult[reactId] = [float(p_value), f_c, z_score, avg1, avg2] - except (TypeError, ZeroDivisionError): continue - - # Apply multiple testing correction if set by the user - if ARGS.adjusted: - - # Retrieve the p-values from the comparisonResult dictionary, they have to be different from NaN - validPValues = [(reactId, result[0]) for reactId, result in comparisonResult.items() if not np.isnan(result[0])] - - if not validPValues: - return comparisonResult, max_z_score - - # Unpack the valid p-values - reactIds, pValues = zip(*validPValues) - # Adjust the p-values using the Benjamini-Hochberg method - adjustedPValues = st.false_discovery_control(pValues) - # Update the comparisonResult dictionary with the adjusted p-values - for reactId , adjustedPValue in zip(reactIds, adjustedPValues): - comparisonResult[reactId][0] = adjustedPValue - - return comparisonResult, max_z_score - -def computeEnrichment(class_pat :Dict[str, List[List[float]]], ids :List[str]) -> List[Tuple[str, str, dict, 
float]]: - """ - Compares clustered data based on a given comparison mode and applies enrichment-based styling on the - provided metabolic map. - - Args: - class_pat : the clustered data. - ids : ids for data association. - - - Returns: - List[Tuple[str, str, dict, float]]: List of tuples with pairs of dataset names, comparison dictionary, and max z-score. - - Raises: - sys.exit : if there are less than 2 classes for comparison - - """ - class_pat = { k.strip() : v for k, v in class_pat.items() } - #TODO: simplfy this stuff vvv and stop using sys.exit (raise the correct utils error) - if (not class_pat) or (len(class_pat.keys()) < 2): sys.exit('Execution aborted: classes provided for comparisons are less than two\n') - - enrichment_results = [] - - - if ARGS.comparison == "manyvsmany": - for i, j in it.combinations(class_pat.keys(), 2): - comparisonDict, max_z_score = compareDatasetPair(class_pat.get(i), class_pat.get(j), ids) - enrichment_results.append((i, j, comparisonDict, max_z_score)) - - elif ARGS.comparison == "onevsrest": - for single_cluster in class_pat.keys(): - rest = [item for k, v in class_pat.items() if k != single_cluster for item in v] - - comparisonDict, max_z_score = compareDatasetPair(class_pat.get(single_cluster), rest, ids) - enrichment_results.append((single_cluster, "rest", comparisonDict, max_z_score)) - - #elif ARGS.comparison == "onevsmany": - # controlItems = class_pat.get(ARGS.control) - # for otherDataset in class_pat.keys(): - # if otherDataset == ARGS.control: - # continue - # comparisonDict, max_z_score = compareDatasetPair(controlItems, class_pat.get(otherDataset), ids) - # enrichment_results.append((ARGS.control, otherDataset, comparisonDict, max_z_score)) - elif ARGS.comparison == "onevsmany": - controlItems = class_pat.get(ARGS.control) - for otherDataset in class_pat.keys(): - if otherDataset == ARGS.control: - continue - comparisonDict, max_z_score = compareDatasetPair(class_pat.get(otherDataset),controlItems, ids) - enrichment_results.append(( otherDataset,ARGS.control, comparisonDict, max_z_score)) - - return enrichment_results - -def createOutputMaps(dataset1Name :str, dataset2Name :str, core_map :ET.ElementTree) -> None: - svgFilePath = buildOutputPath(dataset1Name, dataset2Name, details="SVG Map", ext=utils.FileFormat.SVG) - utils.writeSvg(svgFilePath, core_map) - - if ARGS.generate_pdf: - pngPath = buildOutputPath(dataset1Name, dataset2Name, details="PNG Map", ext=utils.FileFormat.PNG) - pdfPath = buildOutputPath(dataset1Name, dataset2Name, details="PDF Map", ext=utils.FileFormat.PDF) - convert_to_pdf(svgFilePath, pngPath, pdfPath) - - if not ARGS.generate_svg: - os.remove(svgFilePath.show()) - -ClassPat = Dict[str, List[List[float]]] -def getClassesAndIdsFromDatasets(datasetsPaths :List[str], datasetPath :str, classPath :str, names :List[str]) -> Tuple[List[str], ClassPat]: - # TODO: I suggest creating dicts with ids as keys instead of keeping class_pat and ids separate, - # for the sake of everyone's sanity. 
- class_pat :ClassPat = {} - if ARGS.option == 'datasets': - num = 1 #TODO: the dataset naming function could be a generator - for path, name in zip(datasetsPaths, names): - name = name_dataset(name, num) - resolve_rules_float, ids = getDatasetValues(path, name) - if resolve_rules_float != None: - class_pat[name] = list(map(list, zip(*resolve_rules_float.values()))) - - num += 1 - - elif ARGS.option == "dataset_class": - classes = read_dataset(classPath, "class") - classes = classes.astype(str) - resolve_rules_float, ids = getDatasetValues(datasetPath, "Dataset Class (not actual name)") - #check if classes have match on ids - if not all(classes.iloc[:, 0].isin(ids)): - utils.logWarning( - "No match between classes and sample IDs", ARGS.out_log) - if resolve_rules_float != None: class_pat = split_class(classes, resolve_rules_float) - - return ids, class_pat - #^^^ TODO: this could be a match statement over an enum, make it happen future marea dev with python 3.12! (it's why I kept the ifs) - -#TODO: create these damn args as FilePath objects -def getDatasetValues(datasetPath :str, datasetName :str) -> Tuple[ClassPat, List[str]]: - """ - Opens the dataset at the given path and extracts the values (expected nullable numerics) and the IDs. - - Args: - datasetPath : path to the dataset - datasetName (str): dataset name, used in error reporting - - Returns: - Tuple[ClassPat, List[str]]: values and IDs extracted from the dataset - """ - dataset = read_dataset(datasetPath, datasetName) - - # Ensure the first column is treated as the reaction name - dataset = dataset.set_index(dataset.columns[0]) - - # Check if required reactions exist in the dataset - required_reactions = ['EX_lac__L_e', 'EX_glc__D_e', 'EX_gln__L_e', 'EX_glu__L_e'] - missing_reactions = [reaction for reaction in required_reactions if reaction not in dataset.index] - - if missing_reactions: - sys.exit(f'Execution aborted: Missing required reactions {missing_reactions} in {datasetName}\n') - - # Calculate new rows using safe division - lact_glc = np.divide( - np.clip(dataset.loc['EX_lac__L_e'].to_numpy(), a_min=0, a_max=None), - np.clip(dataset.loc['EX_glc__D_e'].to_numpy(), a_min=None, a_max=0), - out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan), # Prepara un array con NaN come output di default - where=dataset.loc['EX_glc__D_e'].to_numpy() != 0 # Condizione per evitare la divisione per zero - ) - lact_gln = np.divide( - np.clip(dataset.loc['EX_lac__L_e'].to_numpy(), a_min=0, a_max=None), - np.clip(dataset.loc['EX_gln__L_e'].to_numpy(), a_min=None, a_max=0), - out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan), - where=dataset.loc['EX_gln__L_e'].to_numpy() != 0 - ) - lact_o2 = np.divide( - np.clip(dataset.loc['EX_lac__L_e'].to_numpy(), a_min=0, a_max=None), - np.clip(dataset.loc['EX_o2_e'].to_numpy(), a_min=None, a_max=0), - out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan), - where=dataset.loc['EX_o2_e'].to_numpy() != 0 - ) - glu_gln = np.divide( - dataset.loc['EX_glu__L_e'].to_numpy(), - np.clip(dataset.loc['EX_gln__L_e'].to_numpy(), a_min=None, a_max=0), - out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan), - where=dataset.loc['EX_gln__L_e'].to_numpy() != 0 - ) - - - values = {'lact_glc': lact_glc, 'lact_gln': lact_gln, 'lact_o2': lact_o2, 'glu_gln': glu_gln} - - # Sostituzione di inf e NaN con 0 se necessario - for key in values: - values[key] = np.nan_to_num(values[key], nan=0.0, posinf=0.0, neginf=0.0) - - # Creazione delle nuove righe da aggiungere al dataset - new_rows 
= pd.DataFrame({ - dataset.index.name: ['LactGlc', 'LactGln', 'LactO2', 'GluGln'], - **{col: [values['lact_glc'][i], values['lact_gln'][i], values['lact_o2'][i], values['glu_gln'][i]] - for i, col in enumerate(dataset.columns)} - }) - - #print(new_rows) - - # Ritorna il dataset originale con le nuove righe - dataset.reset_index(inplace=True) - dataset = pd.concat([dataset, new_rows], ignore_index=True) - - IDs = pd.Series.tolist(dataset.iloc[:, 0].astype(str)) - - dataset = dataset.drop(dataset.columns[0], axis = "columns").to_dict("list") - return { id : list(map(utils.Float("Dataset values, not an argument"), values)) for id, values in dataset.items() }, IDs - -def rgb_to_hex(rgb): - """ - Convert RGB values (0-1 range) to hexadecimal color format. - - Args: - rgb (numpy.ndarray): An array of RGB color components (in the range [0, 1]). - - Returns: - str: The color in hexadecimal format (e.g., '#ff0000' for red). - """ - # Convert RGB values (0-1 range) to hexadecimal format - rgb = (np.array(rgb) * 255).astype(int) - return '#{:02x}{:02x}{:02x}'.format(rgb[0], rgb[1], rgb[2]) - -def save_colormap_image(min_value: float, max_value: float, path: utils.FilePath, colorMap:str="viridis"): - """ - Create and save an image of the colormap showing the gradient and its range. - - Args: - min_value (float): The minimum value of the colormap range. - max_value (float): The maximum value of the colormap range. - filename (str): The filename for saving the image. - """ - - # Create a colormap using matplotlib - cmap = plt.get_cmap(colorMap) - - # Create a figure and axis - fig, ax = plt.subplots(figsize=(6, 1)) - fig.subplots_adjust(bottom=0.5) - - # Create a gradient image - gradient = np.linspace(0, 1, 256) - gradient = np.vstack((gradient, gradient)) - - # Add min and max value annotations - ax.text(0, 0.5, f'{np.round(min_value, 3)}', va='center', ha='right', transform=ax.transAxes, fontsize=12, color='black') - ax.text(1, 0.5, f'{np.round(max_value, 3)}', va='center', ha='left', transform=ax.transAxes, fontsize=12, color='black') - - - # Display the gradient image - ax.imshow(gradient, aspect='auto', cmap=cmap) - ax.set_axis_off() - - # Save the image - plt.savefig(path.show(), bbox_inches='tight', pad_inches=0) - plt.close() - pass - -def min_nonzero_abs(arr): - # Flatten the array and filter out zeros, then find the minimum of the remaining values - non_zero_elements = np.abs(arr)[np.abs(arr) > 0] - return np.min(non_zero_elements) if non_zero_elements.size > 0 else None - -def computeEnrichmentMeanMedian(metabMap: ET.ElementTree, class_pat: Dict[str, List[List[float]]], ids: List[str], colormap:str) -> None: - """ - Compute and visualize the metabolic map based on mean and median of the input fluxes. - The fluxes are normalised across classes/datasets and visualised using the given colormap. - - Args: - metabMap (ET.ElementTree): An XML tree representing the metabolic map. - class_pat (Dict[str, List[List[float]]]): A dictionary where keys are class names and values are lists of enrichment values. - ids (List[str]): A list of reaction IDs to be used for coloring arrows. 
- - Returns: - None - """ - # Create copies only if they are needed - metabMap_mean = copy.deepcopy(metabMap) - metabMap_median = copy.deepcopy(metabMap) - - # Compute medians and means - medians = {key: np.round(np.nanmedian(np.array(value), axis=1), 6) for key, value in class_pat.items()} - means = {key: np.round(np.nanmean(np.array(value), axis=1),6) for key, value in class_pat.items()} - - # Normalize medians and means - max_flux_medians = max(np.max(np.abs(arr)) for arr in medians.values()) - max_flux_means = max(np.max(np.abs(arr)) for arr in means.values()) - - min_flux_medians = min(min_nonzero_abs(arr) for arr in medians.values()) - min_flux_means = min(min_nonzero_abs(arr) for arr in means.values()) - - medians = {key: median/max_flux_medians for key, median in medians.items()} - means = {key: mean/max_flux_means for key, mean in means.items()} - - save_colormap_image(min_flux_medians, max_flux_medians, utils.FilePath("Color map median", ext=utils.FileFormat.PNG, prefix=ARGS.output_path), colormap) - save_colormap_image(min_flux_means, max_flux_means, utils.FilePath("Color map mean", ext=utils.FileFormat.PNG, prefix=ARGS.output_path), colormap) - - cmap = plt.get_cmap(colormap) - - min_width = 2.0 # Minimum arrow width - max_width = 15.0 # Maximum arrow width - - for key in class_pat: - # Create color mappings for median and mean - colors_median = { - rxn_id: rgb_to_hex(cmap(abs(medians[key][i]))) if medians[key][i] != 0 else '#bebebe' #grey blocked - for i, rxn_id in enumerate(ids) - } - - colors_mean = { - rxn_id: rgb_to_hex(cmap(abs(means[key][i]))) if means[key][i] != 0 else '#bebebe' #grey blocked - for i, rxn_id in enumerate(ids) - } - - for i, rxn_id in enumerate(ids): - # Calculate arrow width for median - width_median = np.interp(abs(medians[key][i]), [0, 1], [min_width, max_width]) - isNegative = medians[key][i] < 0 - apply_arrow(metabMap_median, rxn_id, colors_median[rxn_id], isNegative, width_median) - - # Calculate arrow width for mean - width_mean = np.interp(abs(means[key][i]), [0, 1], [min_width, max_width]) - isNegative = means[key][i] < 0 - apply_arrow(metabMap_mean, rxn_id, colors_mean[rxn_id], isNegative, width_mean) - - # Save and convert the SVG files - save_and_convert(metabMap_mean, "mean", key) - save_and_convert(metabMap_median, "median", key) - -def apply_arrow(metabMap, rxn_id, color, isNegative, width=5): - """ - Apply an arrow to a specific reaction in the metabolic map with a given color. - - Args: - metabMap (ET.ElementTree): An XML tree representing the metabolic map. - rxn_id (str): The ID of the reaction to which the arrow will be applied. - color (str): The color of the arrow in hexadecimal format. - isNegative (bool): A boolean indicating if the arrow represents a negative value. - width (int): The width of the arrow. - - Returns: - None - """ - arrow = Arrow(width=width, col=color) - arrow.styleReactionElementsMeanMedian(metabMap, rxn_id, isNegative) - pass - -def save_and_convert(metabMap, map_type, key): - """ - Save the metabolic map as an SVG file and optionally convert it to PNG and PDF formats. - - Args: - metabMap (ET.ElementTree): An XML tree representing the metabolic map. - map_type (str): The type of map ('mean' or 'median'). - key (str): The key identifying the specific map. 
- - Returns: - None - """ - svgFilePath = utils.FilePath(f"SVG Map {map_type} - {key}", ext=utils.FileFormat.SVG, prefix=ARGS.output_path) - utils.writeSvg(svgFilePath, metabMap) - if ARGS.generate_pdf: - pngPath = utils.FilePath(f"PNG Map {map_type} - {key}", ext=utils.FileFormat.PNG, prefix=ARGS.output_path) - pdfPath = utils.FilePath(f"PDF Map {map_type} - {key}", ext=utils.FileFormat.PDF, prefix=ARGS.output_path) - convert_to_pdf(svgFilePath, pngPath, pdfPath) - if not ARGS.generate_svg: - os.remove(svgFilePath.show()) - -############################ MAIN ############################################# -def main(args:List[str] = None) -> None: - """ - Initializes everything and sets the program in motion based on the fronted input arguments. - - Returns: - None - - Raises: - sys.exit : if a user-provided custom map is in the wrong format (ET.XMLSyntaxError, ET.XMLSchemaParseError) - """ - - global ARGS - ARGS = process_args(args) - - if ARGS.custom_map == 'None': - ARGS.custom_map = None - - if os.path.isdir(ARGS.output_path) == False: os.makedirs(ARGS.output_path) - - core_map :ET.ElementTree = ARGS.choice_map.getMap( - ARGS.tool_dir, - utils.FilePath.fromStrPath(ARGS.custom_map) if ARGS.custom_map else None) - # TODO: ^^^ ugly but fine for now, the argument is None if the model isn't custom because no file was given. - # getMap will None-check the customPath and panic when the model IS custom but there's no file (good). A cleaner - # solution can be derived from my comment in FilePath.fromStrPath - - ids, class_pat = getClassesAndIdsFromDatasets(ARGS.input_datas_fluxes, ARGS.input_data_fluxes, ARGS.input_class_fluxes, ARGS.names_fluxes) - - if(ARGS.choice_map == utils.Model.HMRcore): - temp_map = utils.Model.HMRcore_no_legend - computeEnrichmentMeanMedian(temp_map.getMap(ARGS.tool_dir), class_pat, ids, ARGS.color_map) - elif(ARGS.choice_map == utils.Model.ENGRO2): - temp_map = utils.Model.ENGRO2_no_legend - computeEnrichmentMeanMedian(temp_map.getMap(ARGS.tool_dir), class_pat, ids, ARGS.color_map) - else: - computeEnrichmentMeanMedian(core_map, class_pat, ids, ARGS.color_map) - - - enrichment_results = computeEnrichment(class_pat, ids) - for i, j, comparisonDict, max_z_score in enrichment_results: - map_copy = copy.deepcopy(core_map) - temp_thingsInCommon(comparisonDict, map_copy, max_z_score, i, j) - createOutputMaps(i, j, map_copy) - - if not ERRORS: return - utils.logWarning( - f"The following reaction IDs were mentioned in the dataset but weren't found in the map: {ERRORS}", - ARGS.out_log) - - print('Execution succeded') - -############################################################################### -if __name__ == "__main__": - main() - +from __future__ import division +import csv +from enum import Enum +import re +import sys +import numpy as np +import pandas as pd +import itertools as it +import scipy.stats as st +import lxml.etree as ET +import math +try: + from .utils import general_utils as utils +except ImportError: + import utils.general_utils as utils +from PIL import Image +import os +import copy +import argparse +import pyvips +from typing import Tuple, Union, Optional, List, Dict +import matplotlib.pyplot as plt + +ERRORS = [] +########################## argparse ########################################## +ARGS :argparse.Namespace +def process_args(args:List[str] = None) -> argparse.Namespace: + """ + Interfaces the script of a module with its frontend, making the user's choices for various parameters available as values in code.
+ + Args: + args : Always obtained (in file) from sys.argv + + Returns: + Namespace : An object containing the parsed arguments + """ + parser = argparse.ArgumentParser( + usage = "%(prog)s [options]", + description = "process some value's genes to create a comparison's map.") + + #General: + parser.add_argument( + '-td', '--tool_dir', + type = str, + default = os.path.dirname(os.path.abspath(__file__)), + help = 'your tool directory (default: auto-detected package location)') + + parser.add_argument('-on', '--control', type = str) + parser.add_argument('-ol', '--out_log', help = "Output log") + + #Computation details: + parser.add_argument( + '-co', '--comparison', + type = str, + default = 'manyvsmany', + choices = ['manyvsmany', 'onevsrest', 'onevsmany']) + + parser.add_argument( + '-te' ,'--test', + type = str, + default = 'ks', + choices = ['ks', 'ttest_p', 'ttest_ind', 'wilcoxon', 'mw'], + help = 'Statistical test to use (default: %(default)s)') + + parser.add_argument( + '-pv' ,'--pValue', + type = float, + default = 0.1, + help = 'P-Value threshold (default: %(default)s)') + + parser.add_argument( + '-adj' ,'--adjusted', + type = utils.Bool("adjusted"), default = False, + help = 'Apply the FDR (Benjamini-Hochberg) correction (default: %(default)s)') + + parser.add_argument( + '-fc', '--fChange', + type = float, + default = 1.5, + help = 'Fold-Change threshold (default: %(default)s)') + + parser.add_argument( + '-op', '--option', + type = str, + choices = ['datasets', 'dataset_class'], + help='dataset or dataset and class') + + parser.add_argument( + '-idf', '--input_data_fluxes', + type = str, + help = 'input dataset fluxes') + + parser.add_argument( + '-icf', '--input_class_fluxes', + type = str, + help = 'sample group specification fluxes') + + parser.add_argument( + '-idsf', '--input_datas_fluxes', + type = str, + nargs = '+', + help = 'input datasets fluxes') + + parser.add_argument( + '-naf', '--names_fluxes', + type = str, + nargs = '+', + help = 'input names fluxes') + + #Output: + parser.add_argument( + "-gs", "--generate_svg", + type = utils.Bool("generate_svg"), default = True, + help = "choose whether to generate svg") + + parser.add_argument( + "-gp", "--generate_pdf", + type = utils.Bool("generate_pdf"), default = True, + help = "choose whether to generate pdf") + + parser.add_argument( + '-cm', '--custom_map', + type = str, + help='custom map to use') + + parser.add_argument( + '-mc', '--choice_map', + type = utils.Model, default = utils.Model.HMRcore, + choices = [utils.Model.HMRcore, utils.Model.ENGRO2, utils.Model.Custom]) + + parser.add_argument( + '-colorm', '--color_map', + type = str, + choices = ["jet", "viridis"]) + + parser.add_argument( + '-idop', '--output_path', + type = str, + default='result', + help = 'output path for maps') + + args :argparse.Namespace = parser.parse_args(args) + args.net = True # TODO SICCOME I FLUSSI POSSONO ESSERE ANCHE NEGATIVI SONO SEMPRE CONSIDERATI NETTI + + return args + +############################ dataset input #################################### +def read_dataset(data :str, name :str) -> pd.DataFrame: + """ + Tries to read the dataset from its path (data) as a tsv and turns it into a DataFrame. 
+ + Args: + data : filepath of a dataset (from frontend input params or literals upon calling) + name : name associated with the dataset (from frontend input params or literals upon calling) + + Returns: + pd.DataFrame : dataset in a runtime operable shape + + Raises: + sys.exit : if there's no data (pd.errors.EmptyDataError) or if the dataset has less than 2 columns + """ + try: + dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python') + except pd.errors.EmptyDataError: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + if len(dataset.columns) < 2: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + return dataset + +############################ dataset name ##################################### +def name_dataset(name_data :str, count :int) -> str: + """ + Produces a unique name for a dataset based on what was provided by the user. The default name for any dataset is "Dataset", thus if the user didn't change it this function appends f"_{count}" to make it unique. + + Args: + name_data : name associated with the dataset (from frontend input params) + count : counter from 1 to make these names unique (external) + + Returns: + str : the name made unique + """ + if str(name_data) == 'Dataset': + return str(name_data) + '_' + str(count) + else: + return str(name_data) + +############################ map_methods ###################################### +FoldChange = Union[float, int, str] # Union[float, Literal[0, "-INF", "INF"]] +def fold_change(avg1 :float, avg2 :float) -> FoldChange: + """ + Calculates the fold change between two gene expression values. + + Args: + avg1 : average expression value from one dataset avg2 : average expression value from the other dataset + + Returns: + FoldChange : + 0 : when both input values are 0 + "-INF" : when avg1 is 0 + "INF" : when avg2 is 0 + float : for any other combination of values + """ + if avg1 == 0 and avg2 == 0: + return 0 + elif avg1 == 0: + return '-INF' + elif avg2 == 0: + return 'INF' + else: # (threshold_F_C - 1) / (abs(threshold_F_C) + 1) con threshold_F_C > 1 + return (avg1 - avg2) / (abs(avg1) + abs(avg2)) + +def getElementById(reactionId :str, metabMap :ET.ElementTree) -> utils.Result[ET.Element, utils.Result.ResultErr]: + """ + Finds any element in the given map with the given ID. ID uniqueness in an svg file is recommended but + not enforced, if more than one element with the exact ID is found only the first will be returned. + + Args: + reactionId (str): exact ID of the requested element. + metabMap (ET.ElementTree): metabolic map containing the element. + + Returns: + utils.Result[ET.Element, ResultErr]: result of the search, either the first match found or a ResultErr. + """ + return utils.Result.Ok( + f"//*[@id=\"{reactionId}\"]").map( + lambda xPath : metabMap.xpath(xPath)[0]).mapErr( + lambda _ : utils.Result.ResultErr(f"No elements with ID \"{reactionId}\" found in map")) + # ^^^ we shamelessly ignore the contents of the IndexError, it offers nothing to the user. 
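For orientation: `getElementById` above is a thin `Result` wrapper around an XPath query by `id`, and `styleMapElement` just below appends a `;stroke:...;stroke-width:...;stroke-dasharray:...` run to whatever style the element already carries. A small self-contained sketch of that pattern on a hand-written SVG fragment (the reaction ID and the SVG are invented; the stroke color is the `UpRegulated` value from the `ArrowColor` palette further down):

```python
# Sketch only: XPath lookup by id + style concatenation, mirroring the two
# helpers around this point in the file. The SVG fragment and id are made up.
import lxml.etree as ET

svg = b'''<svg xmlns="http://www.w3.org/2000/svg">
  <path id="F_ExampleReaction" style="stroke:#000000;stroke-width:2"/>
</svg>'''
root = ET.fromstring(svg)

matches = root.xpath('//*[@id="F_ExampleReaction"]')  # same query shape as getElementById
if matches:
    element = matches[0]
    # Mimic styleMapElement: keep the existing style and append the arrow styling.
    element.set("style", element.get("style", "") +
                ";stroke:#ecac68;stroke-width:6;stroke-dasharray:none")

print(ET.tostring(root, pretty_print=True).decode())
```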
+ +def styleMapElement(element :ET.Element, styleStr :str) -> None: + currentStyles :str = element.get("style", "") + if re.search(r";stroke:[^;]+;stroke-width:[^;]+;stroke-dasharray:[^;]+$", currentStyles): + currentStyles = ';'.join(currentStyles.split(';')[:-3]) + + element.set("style", currentStyles + styleStr) + +class ReactionDirection(Enum): + Unknown = "" + Direct = "_F" + Inverse = "_B" + + @classmethod + def fromDir(cls, s :str) -> "ReactionDirection": + # vvv as long as there's so few variants I actually condone the if spam: + if s == ReactionDirection.Direct.value: return ReactionDirection.Direct + if s == ReactionDirection.Inverse.value: return ReactionDirection.Inverse + return ReactionDirection.Unknown + + @classmethod + def fromReactionId(cls, reactionId :str) -> "ReactionDirection": + return ReactionDirection.fromDir(reactionId[-2:]) + +def getArrowBodyElementId(reactionId :str) -> str: + if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV + elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: reactionId = reactionId[:-2] + return f"R_{reactionId}" + +def getArrowHeadElementId(reactionId :str) -> Tuple[str, str]: + """ + We attempt extracting the direction information from the provided reaction ID, if unsuccessful we provide the IDs of both directions. + + Args: + reactionId : the provided reaction ID. + + Returns: + Tuple[str, str]: either a single str ID for the correct arrow head followed by an empty string or both options to try. + """ + if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV + elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: return reactionId[:-3:-1] + reactionId[:-2], "" + return f"F_{reactionId}", f"B_{reactionId}" + +class ArrowColor(Enum): + """ + Encodes possible arrow colors based on their meaning in the enrichment process. + """ + Invalid = "#BEBEBE" # gray, fold-change under treshold or not significant p-value + Transparent = "#ffffff00" # transparent, to make some arrow segments disappear + UpRegulated = "#ecac68" # red, up-regulated reaction + DownRegulated = "#6495ed" # blue, down-regulated reaction + + UpRegulatedInv = "#FF0000" + # ^^^ different shade of red (actually orange), up-regulated net value for a reversible reaction with + # conflicting enrichment in the two directions. + + DownRegulatedInv = "#0000FF" + # ^^^ different shade of blue (actually purple), down-regulated net value for a reversible reaction with + # conflicting enrichment in the two directions. + + @classmethod + def fromFoldChangeSign(cls, foldChange :float, *, useAltColor = False) -> "ArrowColor": + colors = (cls.DownRegulated, cls.DownRegulatedInv) if foldChange < 0 else (cls.UpRegulated, cls.UpRegulatedInv) + return colors[useAltColor] + + def __str__(self) -> str: return self.value + +class Arrow: + """ + Models the properties of a reaction arrow that change based on enrichment. + """ + MIN_W = 2 + MAX_W = 12 + + def __init__(self, width :int, col: ArrowColor, *, isDashed = False) -> None: + """ + (Private) Initializes an instance of Arrow. + + Args: + width : width of the arrow, ideally to be kept within Arrow.MIN_W and Arrow.MAX_W (not enforced). + col : color of the arrow. + isDashed : whether the arrow should be dashed, meaning the associated pValue resulted not significant. + + Returns: + None : practically, a Arrow instance. 
+ """ + self.w = width + self.col = col + self.dash = isDashed + + def applyTo(self, reactionId :str, metabMap :ET.ElementTree, styleStr :str) -> None: + if getElementById(reactionId, metabMap).map(lambda el : styleMapElement(el, styleStr)).isErr: + ERRORS.append(reactionId) + + def styleReactionElements(self, metabMap :ET.ElementTree, reactionId :str, *, mindReactionDir = True) -> None: + if not mindReactionDir: + return self.applyTo(getArrowBodyElementId(reactionId), metabMap, self.toStyleStr()) + + # Now we style the arrow head(s): + idOpt1, idOpt2 = getArrowHeadElementId(reactionId) + self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) + if idOpt2: self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) + + def styleReactionElementsMeanMedian(self, metabMap :ET.ElementTree, reactionId :str, isNegative:bool) -> None: + + self.applyTo(getArrowBodyElementId(reactionId), metabMap, self.toStyleStr()) + idOpt1, idOpt2 = getArrowHeadElementId(reactionId) + + if(isNegative): + self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) + self.col = ArrowColor.Transparent + self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) #trasp + else: + self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) + self.col = ArrowColor.Transparent + self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) #trasp + + + + def getMapReactionId(self, reactionId :str, mindReactionDir :bool) -> str: + """ + Computes the reaction ID as encoded in the map for a given reaction ID from the dataset. + + Args: + reactionId: the reaction ID, as encoded in the dataset. + mindReactionDir: if True forward (F_) and backward (B_) directions will be encoded in the result. + + Returns: + str : the ID of an arrow's body or tips in the map. + """ + # we assume the reactionIds also don't encode reaction dir if they don't mind it when styling the map. + if not mindReactionDir: return "R_" + reactionId + + #TODO: this is clearly something we need to make consistent in fluxes + return (reactionId[:-3:-1] + reactionId[:-2]) if reactionId[:-2] in ["_F", "_B"] else f"F_{reactionId}" # "Pyr_F" --> "F_Pyr" + + def toStyleStr(self, *, downSizedForTips = False) -> str: + """ + Collapses the styles of this Arrow into a str, ready to be applied as part of the "style" property on an svg element. + + Returns: + str : the styles string. + """ + width = self.w + if downSizedForTips: width *= 0.8 + return f";stroke:{self.col};stroke-width:{width};stroke-dasharray:{'5,5' if self.dash else 'none'}" + +# vvv These constants could be inside the class itself a static properties, but python +# was built by brainless organisms so here we are! +INVALID_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid) +INSIGNIFICANT_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid, isDashed = True) + +def applyFluxesEnrichmentToMap(fluxesEnrichmentRes :Dict[str, Union[Tuple[float, FoldChange], Tuple[float, FoldChange, float, float]]], metabMap :ET.ElementTree, maxNumericZScore :float) -> None: + """ + Applies fluxes enrichment results to the provided metabolic map. + + Args: + fluxesEnrichmentRes : fluxes enrichment results. + metabMap : the metabolic map to edit. + maxNumericZScore : biggest finite z-score value found. 
+ + Side effects: + metabMap : mut + + Returns: + None + """ + for reactionId, values in fluxesEnrichmentRes.items(): + pValue = values[0] + foldChange = values[1] + z_score = values[2] + + if math.isnan(pValue) or (isinstance(foldChange, float) and math.isnan(foldChange)): + continue + + if isinstance(foldChange, str): foldChange = float(foldChange) + if pValue > ARGS.pValue: # pValue above tresh: dashed arrow + INSIGNIFICANT_ARROW.styleReactionElements(metabMap, reactionId) + INSIGNIFICANT_ARROW.styleReactionElements(metabMap, reactionId, mindReactionDir = False) + + continue + + if abs(foldChange) < (ARGS.fChange - 1) / (abs(ARGS.fChange) + 1): + INVALID_ARROW.styleReactionElements(metabMap, reactionId) + INVALID_ARROW.styleReactionElements(metabMap, reactionId, mindReactionDir = False) + + continue + + width = Arrow.MAX_W + if not math.isinf(z_score): + try: + width = min( + max(abs(z_score * Arrow.MAX_W) / maxNumericZScore, Arrow.MIN_W), + Arrow.MAX_W) + + except ZeroDivisionError: pass + # TODO CHECK RV + #if not reactionId.endswith("_RV"): # RV stands for reversible reactions + # Arrow(width, ArrowColor.fromFoldChangeSign(foldChange)).styleReactionElements(metabMap, reactionId) + # continue + + #reactionId = reactionId[:-3] # Remove "_RV" + + inversionScore = (values[3] < 0) + (values[4] < 0) # Compacts the signs of averages into 1 easy to check score + if inversionScore == 2: foldChange *= -1 + # ^^^ Style the inverse direction with the opposite sign netValue + + # If the score is 1 (opposite signs) we use alternative colors vvv + arrow = Arrow(width, ArrowColor.fromFoldChangeSign(foldChange, useAltColor = inversionScore == 1)) + + # vvv These 2 if statements can both be true and can both happen + if ARGS.net: # style arrow head(s): + arrow.styleReactionElements(metabMap, reactionId + ("_B" if inversionScore == 2 else "_F")) + arrow.applyTo(("F_" if inversionScore == 2 else "B_") + reactionId, metabMap, f";stroke:{ArrowColor.Transparent};stroke-width:0;stroke-dasharray:None") + + arrow.styleReactionElements(metabMap, reactionId, mindReactionDir = False) + + +############################ split class ###################################### +def split_class(classes :pd.DataFrame, resolve_rules :Dict[str, List[float]]) -> Dict[str, List[List[float]]]: + """ + Generates a :dict that groups together data from a :DataFrame based on classes the data is related to. 
+ + Args: + classes : a :DataFrame of only string values, containing class information (rows) and keys to query the resolve_rules :dict + resolve_rules : a :dict containing :float data + + Returns: + dict : the dict with data grouped by class + + Side effects: + classes : mut + """ + class_pat :Dict[str, List[List[float]]] = {} + for i in range(len(classes)): + classe :str = classes.iloc[i, 1] + if pd.isnull(classe): continue + + l :List[List[float]] = [] + for j in range(i, len(classes)): + if classes.iloc[j, 1] == classe: + pat_id :str = classes.iloc[j, 0] + tmp = resolve_rules.get(pat_id, None) + if tmp != None: + l.append(tmp) + classes.iloc[j, 1] = None + + if l: + class_pat[classe] = list(map(list, zip(*l))) + continue + + utils.logWarning( + f"Warning: no sample found in class \"{classe}\", the class has been disregarded", ARGS.out_log) + + return class_pat + +############################ conversion ############################################## +#conversion from svg to png +def svg_to_png_with_background(svg_path :utils.FilePath, png_path :utils.FilePath, dpi :int = 72, scale :int = 1, size :Optional[float] = None) -> None: + """ + Internal utility to convert an SVG to PNG (forced opaque) to aid in PDF conversion. + + Args: + svg_path : path to SVG file + png_path : path for new PNG file + dpi : dots per inch of the generated PNG + scale : scaling factor for the generated PNG, computed internally when a size is provided + size : final effective width of the generated PNG + + Returns: + None + """ + if size: + image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=1) + scale = size / image.width + image = image.resize(scale) + else: + image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=scale) + + white_background = pyvips.Image.black(image.width, image.height).new_from_image([255, 255, 255]) + white_background = white_background.affine([scale, 0, 0, scale]) + + if white_background.bands != image.bands: + white_background = white_background.extract_band(0) + + composite_image = white_background.composite2(image, 'over') + composite_image.write_to_file(png_path.show()) + +#funzione unica, lascio fuori i file e li passo in input +#conversion from png to pdf +def convert_png_to_pdf(png_file :utils.FilePath, pdf_file :utils.FilePath) -> None: + """ + Internal utility to convert a PNG to PDF to aid from SVG conversion. + + Args: + png_file : path to PNG file + pdf_file : path to new PDF file + + Returns: + None + """ + image = Image.open(png_file.show()) + image = image.convert("RGB") + image.save(pdf_file.show(), "PDF", resolution=100.0) + +#function called to reduce redundancy in the code +def convert_to_pdf(file_svg :utils.FilePath, file_png :utils.FilePath, file_pdf :utils.FilePath) -> None: + """ + Converts the SVG map at the provided path to PDF. 
+ + Args: + file_svg : path to SVG file + file_png : path to PNG file + file_pdf : path to new PDF file + + Returns: + None + """ + svg_to_png_with_background(file_svg, file_png) + try: + convert_png_to_pdf(file_png, file_pdf) + print(f'PDF file {file_pdf.filePath} successfully generated.') + + except Exception as e: + raise utils.DataErr(file_pdf.show(), f'Error generating PDF file: {e}') + +############################ map ############################################## +def buildOutputPath(dataset1Name :str, dataset2Name = "rest", *, details = "", ext :utils.FileFormat) -> utils.FilePath: + """ + Builds a FilePath instance from the names of confronted datasets ready to point to a location in the + "result/" folder, used by this tool for output files in collections. + + Args: + dataset1Name : _description_ + dataset2Name : _description_. Defaults to "rest". + details : _description_ + ext : _description_ + + Returns: + utils.FilePath : _description_ + """ + # This function returns a util data structure but is extremely specific to this module. + # RAS also uses collections as output and as such might benefit from a method like this, but I'd wait + # TODO: until a third tool with multiple outputs appears before porting this to utils. + return utils.FilePath( + f"{dataset1Name}_vs_{dataset2Name}" + (f" ({details})" if details else ""), + # ^^^ yes this string is built every time even if the form is the same for the same 2 datasets in + # all output files: I don't care, this was never the performance bottleneck of the tool and + # there is no other net gain in saving and re-using the built string. + ext, + prefix = ARGS.output_path) + +FIELD_NOT_AVAILABLE = '/' +def writeToCsv(rows: List[list], fieldNames :List[str], outPath :utils.FilePath) -> None: + fieldsAmt = len(fieldNames) + with open(outPath.show(), "w", newline = "") as fd: + writer = csv.DictWriter(fd, fieldnames = fieldNames, delimiter = '\t') + writer.writeheader() + + for row in rows: + sizeMismatch = fieldsAmt - len(row) + if sizeMismatch > 0: row.extend([FIELD_NOT_AVAILABLE] * sizeMismatch) + writer.writerow({ field : data for field, data in zip(fieldNames, row) }) + +OldEnrichedScores = Dict[str, List[Union[float, FoldChange]]] #TODO: try to use Tuple whenever possible +def writeTabularResult(enrichedScores : OldEnrichedScores, outPath :utils.FilePath) -> None: + fieldNames = ["ids", "P_Value", "fold change", "z-score"] + fieldNames.extend(["average_1", "average_2"]) + + writeToCsv([ [reactId] + values for reactId, values in enrichedScores.items() ], fieldNames, outPath) + +def temp_thingsInCommon(tmp :Dict[str, List[Union[float, FoldChange]]], core_map :ET.ElementTree, max_z_score :float, dataset1Name :str, dataset2Name = "rest") -> None: + # this function compiles the things always in common between comparison modes after enrichment. + # TODO: organize, name better. + writeTabularResult(tmp, buildOutputPath(dataset1Name, dataset2Name, details = "Tabular Result", ext = utils.FileFormat.TSV)) + for reactId, enrichData in tmp.items(): tmp[reactId] = tuple(enrichData) + applyFluxesEnrichmentToMap(tmp, core_map, max_z_score) + +def computePValue(dataset1Data: List[float], dataset2Data: List[float]) -> Tuple[float, float]: + """ + Computes the statistical significance score (P-value) of the comparison between coherent data + from two datasets. 
The data is supposed to, in both datasets: + - be related to the same reaction ID; + - be ordered by sample, such that the item at position i in both lists is related to the + same sample or cell line. + + Args: + dataset1Data : data from the 1st dataset. + dataset2Data : data from the 2nd dataset. + + Returns: + tuple: (P-value, Z-score) + - P-value from the selected test on the provided data. + - Z-score of the difference between means of the two datasets. + """ + + match ARGS.test: + case "ks": + # Perform Kolmogorov-Smirnov test + _, p_value = st.ks_2samp(dataset1Data, dataset2Data) + case "ttest_p": + # Datasets should have same size + if len(dataset1Data) != len(dataset2Data): + raise ValueError("Datasets must have the same size for paired t-test.") + # Perform t-test for paired samples + _, p_value = st.ttest_rel(dataset1Data, dataset2Data) + case "ttest_ind": + # Perform t-test for independent samples + _, p_value = st.ttest_ind(dataset1Data, dataset2Data) + case "wilcoxon": + # Datasets should have same size + if len(dataset1Data) != len(dataset2Data): + raise ValueError("Datasets must have the same size for Wilcoxon signed-rank test.") + # Perform Wilcoxon signed-rank test + np.random.seed(42) # Ensure reproducibility since zsplit method is used + _, p_value = st.wilcoxon(dataset1Data, dataset2Data, zero_method="zsplit") + case "mw": + # Perform Mann-Whitney U test + _, p_value = st.mannwhitneyu(dataset1Data, dataset2Data) + + # Calculate means and standard deviations + mean1 = np.nanmean(dataset1Data) + mean2 = np.nanmean(dataset2Data) + std1 = np.nanstd(dataset1Data, ddof=1) + std2 = np.nanstd(dataset2Data, ddof=1) + + n1 = len(dataset1Data) + n2 = len(dataset2Data) + + # Calculate Z-score + z_score = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2)) + + return p_value, z_score + +def compareDatasetPair(dataset1Data :List[List[float]], dataset2Data :List[List[float]], ids :List[str]) -> Tuple[Dict[str, List[Union[float, FoldChange]]], float]: + #TODO: the following code still suffers from "dumbvarnames-osis" + comparisonResult :Dict[str, List[Union[float, FoldChange]]] = {} + count = 0 + max_z_score = 0 + for l1, l2 in zip(dataset1Data, dataset2Data): + reactId = ids[count] + count += 1 + if not reactId: continue # we skip ids that have already been processed + + try: + p_value, z_score = computePValue(l1, l2) + avg1 = sum(l1) / len(l1) + avg2 = sum(l2) / len(l2) + f_c = fold_change(avg1, avg2) + if np.isfinite(z_score) and max_z_score < abs(z_score): max_z_score = abs(z_score) + + comparisonResult[reactId] = [float(p_value), f_c, z_score, avg1, avg2] + except (TypeError, ZeroDivisionError): continue + + # Apply multiple testing correction if set by the user + if ARGS.adjusted: + + # Retrieve the p-values from the comparisonResult dictionary, they have to be different from NaN + validPValues = [(reactId, result[0]) for reactId, result in comparisonResult.items() if not np.isnan(result[0])] + + if not validPValues: + return comparisonResult, max_z_score + + # Unpack the valid p-values + reactIds, pValues = zip(*validPValues) + # Adjust the p-values using the Benjamini-Hochberg method + adjustedPValues = st.false_discovery_control(pValues) + # Update the comparisonResult dictionary with the adjusted p-values + for reactId , adjustedPValue in zip(reactIds, adjustedPValues): + comparisonResult[reactId][0] = adjustedPValue + + return comparisonResult, max_z_score + +def computeEnrichment(class_pat :Dict[str, List[List[float]]], ids :List[str]) -> List[Tuple[str, str, dict, 
float]]: + """ + Compares clustered data based on a given comparison mode and applies enrichment-based styling on the + provided metabolic map. + + Args: + class_pat : the clustered data. + ids : ids for data association. + + + Returns: + List[Tuple[str, str, dict, float]]: List of tuples with pairs of dataset names, comparison dictionary, and max z-score. + + Raises: + sys.exit : if there are less than 2 classes for comparison + + """ + class_pat = { k.strip() : v for k, v in class_pat.items() } + #TODO: simplfy this stuff vvv and stop using sys.exit (raise the correct utils error) + if (not class_pat) or (len(class_pat.keys()) < 2): sys.exit('Execution aborted: classes provided for comparisons are less than two\n') + + enrichment_results = [] + + + if ARGS.comparison == "manyvsmany": + for i, j in it.combinations(class_pat.keys(), 2): + comparisonDict, max_z_score = compareDatasetPair(class_pat.get(i), class_pat.get(j), ids) + enrichment_results.append((i, j, comparisonDict, max_z_score)) + + elif ARGS.comparison == "onevsrest": + for single_cluster in class_pat.keys(): + rest = [item for k, v in class_pat.items() if k != single_cluster for item in v] + + comparisonDict, max_z_score = compareDatasetPair(class_pat.get(single_cluster), rest, ids) + enrichment_results.append((single_cluster, "rest", comparisonDict, max_z_score)) + + #elif ARGS.comparison == "onevsmany": + # controlItems = class_pat.get(ARGS.control) + # for otherDataset in class_pat.keys(): + # if otherDataset == ARGS.control: + # continue + # comparisonDict, max_z_score = compareDatasetPair(controlItems, class_pat.get(otherDataset), ids) + # enrichment_results.append((ARGS.control, otherDataset, comparisonDict, max_z_score)) + elif ARGS.comparison == "onevsmany": + controlItems = class_pat.get(ARGS.control) + for otherDataset in class_pat.keys(): + if otherDataset == ARGS.control: + continue + comparisonDict, max_z_score = compareDatasetPair(class_pat.get(otherDataset),controlItems, ids) + enrichment_results.append(( otherDataset,ARGS.control, comparisonDict, max_z_score)) + + return enrichment_results + +def createOutputMaps(dataset1Name :str, dataset2Name :str, core_map :ET.ElementTree) -> None: + svgFilePath = buildOutputPath(dataset1Name, dataset2Name, details="SVG Map", ext=utils.FileFormat.SVG) + utils.writeSvg(svgFilePath, core_map) + + if ARGS.generate_pdf: + pngPath = buildOutputPath(dataset1Name, dataset2Name, details="PNG Map", ext=utils.FileFormat.PNG) + pdfPath = buildOutputPath(dataset1Name, dataset2Name, details="PDF Map", ext=utils.FileFormat.PDF) + convert_to_pdf(svgFilePath, pngPath, pdfPath) + + if not ARGS.generate_svg: + os.remove(svgFilePath.show()) + +ClassPat = Dict[str, List[List[float]]] +def getClassesAndIdsFromDatasets(datasetsPaths :List[str], datasetPath :str, classPath :str, names :List[str]) -> Tuple[List[str], ClassPat]: + # TODO: I suggest creating dicts with ids as keys instead of keeping class_pat and ids separate, + # for the sake of everyone's sanity. 
+
+    class_pat :ClassPat = {}
+    if ARGS.option == 'datasets':
+        num = 1 #TODO: the dataset naming function could be a generator
+        for path, name in zip(datasetsPaths, names):
+            name = name_dataset(name, num)
+            resolve_rules_float, ids = getDatasetValues(path, name)
+            if resolve_rules_float is not None:
+                class_pat[name] = list(map(list, zip(*resolve_rules_float.values())))
+
+            num += 1
+
+    elif ARGS.option == "dataset_class":
+        classes = read_dataset(classPath, "class")
+        classes = classes.astype(str)
+        resolve_rules_float, ids = getDatasetValues(datasetPath, "Dataset Class (not actual name)")
+        # Check that the sample names in the class file match the dataset IDs
+        if not all(classes.iloc[:, 0].isin(ids)):
+            utils.logWarning(
+                "No match between classes and sample IDs", ARGS.out_log)
+        if resolve_rules_float is not None: class_pat = split_class(classes, resolve_rules_float)
+
+    return ids, class_pat
+    #^^^ TODO: this could be a match statement over an enum, make it happen future marea dev with python 3.12! (it's why I kept the ifs)
+
+#TODO: create these damn args as FilePath objects
+def getDatasetValues(datasetPath :str, datasetName :str) -> Tuple[ClassPat, List[str]]:
+    """
+    Opens the dataset at the given path and extracts the values (expected nullable numerics) and the IDs.
+
+    Args:
+        datasetPath : path to the dataset
+        datasetName (str): dataset name, used in error reporting
+
+    Returns:
+        Tuple[ClassPat, List[str]]: values and IDs extracted from the dataset
+    """
+    dataset = read_dataset(datasetPath, datasetName)
+
+    # Ensure the first column is treated as the reaction name
+    dataset = dataset.set_index(dataset.columns[0])
+
+    # Check if required reactions exist in the dataset
+    required_reactions = ['EX_lac__L_e', 'EX_glc__D_e', 'EX_gln__L_e', 'EX_glu__L_e']
+    missing_reactions = [reaction for reaction in required_reactions if reaction not in dataset.index]
+
+    if missing_reactions:
+        sys.exit(f'Execution aborted: Missing required reactions {missing_reactions} in {datasetName}\n')
+
+    # Calculate new rows using safe division
+    lact_glc = np.divide(
+        np.clip(dataset.loc['EX_lac__L_e'].to_numpy(), a_min=0, a_max=None),
+        np.clip(dataset.loc['EX_glc__D_e'].to_numpy(), a_min=None, a_max=0),
+        out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan), # Prepare an array of NaN as the default output
+        where=dataset.loc['EX_glc__D_e'].to_numpy() != 0 # Condition to avoid division by zero
+    )
+    lact_gln = np.divide(
+        np.clip(dataset.loc['EX_lac__L_e'].to_numpy(), a_min=0, a_max=None),
+        np.clip(dataset.loc['EX_gln__L_e'].to_numpy(), a_min=None, a_max=0),
+        out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan),
+        where=dataset.loc['EX_gln__L_e'].to_numpy() != 0
+    )
+    lact_o2 = np.divide(
+        np.clip(dataset.loc['EX_lac__L_e'].to_numpy(), a_min=0, a_max=None),
+        np.clip(dataset.loc['EX_o2_e'].to_numpy(), a_min=None, a_max=0),
+        out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan),
+        where=dataset.loc['EX_o2_e'].to_numpy() != 0
+    )
+    glu_gln = np.divide(
+        dataset.loc['EX_glu__L_e'].to_numpy(),
+        np.clip(dataset.loc['EX_gln__L_e'].to_numpy(), a_min=None, a_max=0),
+        out=np.full_like(dataset.loc['EX_lac__L_e'].to_numpy(), np.nan),
+        where=dataset.loc['EX_gln__L_e'].to_numpy() != 0
+    )
+
+    values = {'lact_glc': lact_glc, 'lact_gln': lact_gln, 'lact_o2': lact_o2, 'glu_gln': glu_gln}
+
+    # Replace inf and NaN with 0 where needed
+    for key in values:
+        values[key] = np.nan_to_num(values[key], nan=0.0, posinf=0.0, neginf=0.0)
+
+    # Build the new rows to append to the dataset
+    new_rows = pd.DataFrame({
+        dataset.index.name: ['LactGlc', 'LactGln', 'LactO2', 'GluGln'],
+        **{col: [values['lact_glc'][i], values['lact_gln'][i], values['lact_o2'][i], values['glu_gln'][i]]
+           for i, col in enumerate(dataset.columns)}
+    })
+
+    # Return the original dataset with the new rows appended
+    dataset.reset_index(inplace=True)
+    dataset = pd.concat([dataset, new_rows], ignore_index=True)
+
+    IDs = pd.Series.tolist(dataset.iloc[:, 0].astype(str))
+
+    dataset = dataset.drop(dataset.columns[0], axis = "columns").to_dict("list")
+    return { id : list(map(utils.Float("Dataset values, not an argument"), values)) for id, values in dataset.items() }, IDs
+
+def rgb_to_hex(rgb):
+    """
+    Convert RGB values (0-1 range) to hexadecimal color format.
+
+    Args:
+        rgb (numpy.ndarray): An array of RGB color components (in the range [0, 1]).
+
+    Returns:
+        str: The color in hexadecimal format (e.g., '#ff0000' for red).
+    """
+    # Convert RGB values (0-1 range) to hexadecimal format
+    rgb = (np.array(rgb) * 255).astype(int)
+    return '#{:02x}{:02x}{:02x}'.format(rgb[0], rgb[1], rgb[2])
+
+def save_colormap_image(min_value: float, max_value: float, path: utils.FilePath, colorMap:str="viridis"):
+    """
+    Create and save an image of the colormap showing the gradient and its range.
+
+    Args:
+        min_value (float): The minimum value of the colormap range.
+        max_value (float): The maximum value of the colormap range.
+        path (utils.FilePath): Destination path for the image.
+        colorMap (str): Name of the matplotlib colormap to render.
+    """
+    # Create a colormap using matplotlib
+    cmap = plt.get_cmap(colorMap)
+
+    # Create a figure and axis
+    fig, ax = plt.subplots(figsize=(6, 1))
+    fig.subplots_adjust(bottom=0.5)
+
+    # Create a gradient image
+    gradient = np.linspace(0, 1, 256)
+    gradient = np.vstack((gradient, gradient))
+
+    # Add min and max value annotations
+    ax.text(0, 0.5, f'{np.round(min_value, 3)}', va='center', ha='right', transform=ax.transAxes, fontsize=12, color='black')
+    ax.text(1, 0.5, f'{np.round(max_value, 3)}', va='center', ha='left', transform=ax.transAxes, fontsize=12, color='black')
+
+    # Display the gradient image
+    ax.imshow(gradient, aspect='auto', cmap=cmap)
+    ax.set_axis_off()
+
+    # Save the image
+    plt.savefig(path.show(), bbox_inches='tight', pad_inches=0)
+    plt.close()
+
+def min_nonzero_abs(arr):
+    # Flatten the array and filter out zeros, then find the minimum of the remaining values
+    non_zero_elements = np.abs(arr)[np.abs(arr) > 0]
+    return np.min(non_zero_elements) if non_zero_elements.size > 0 else None
+
+def computeEnrichmentMeanMedian(metabMap: ET.ElementTree, class_pat: Dict[str, List[List[float]]], ids: List[str], colormap:str) -> None:
+    """
+    Compute and visualize the metabolic map based on mean and median of the input fluxes.
+    The fluxes are normalised across classes/datasets and visualised using the given colormap.
+
+    Args:
+        metabMap (ET.ElementTree): An XML tree representing the metabolic map.
+        class_pat (Dict[str, List[List[float]]]): A dictionary where keys are class names and values are lists of enrichment values.
+        ids (List[str]): A list of reaction IDs to be used for coloring arrows.
+        colormap (str): Name of the matplotlib colormap used for the styling.
+ + Returns: + None + """ + # Create copies only if they are needed + metabMap_mean = copy.deepcopy(metabMap) + metabMap_median = copy.deepcopy(metabMap) + + # Compute medians and means + medians = {key: np.round(np.nanmedian(np.array(value), axis=1), 6) for key, value in class_pat.items()} + means = {key: np.round(np.nanmean(np.array(value), axis=1),6) for key, value in class_pat.items()} + + # Normalize medians and means + max_flux_medians = max(np.max(np.abs(arr)) for arr in medians.values()) + max_flux_means = max(np.max(np.abs(arr)) for arr in means.values()) + + min_flux_medians = min(min_nonzero_abs(arr) for arr in medians.values()) + min_flux_means = min(min_nonzero_abs(arr) for arr in means.values()) + + medians = {key: median/max_flux_medians for key, median in medians.items()} + means = {key: mean/max_flux_means for key, mean in means.items()} + + save_colormap_image(min_flux_medians, max_flux_medians, utils.FilePath("Color map median", ext=utils.FileFormat.PNG, prefix=ARGS.output_path), colormap) + save_colormap_image(min_flux_means, max_flux_means, utils.FilePath("Color map mean", ext=utils.FileFormat.PNG, prefix=ARGS.output_path), colormap) + + cmap = plt.get_cmap(colormap) + + min_width = 2.0 # Minimum arrow width + max_width = 15.0 # Maximum arrow width + + for key in class_pat: + # Create color mappings for median and mean + colors_median = { + rxn_id: rgb_to_hex(cmap(abs(medians[key][i]))) if medians[key][i] != 0 else '#bebebe' #grey blocked + for i, rxn_id in enumerate(ids) + } + + colors_mean = { + rxn_id: rgb_to_hex(cmap(abs(means[key][i]))) if means[key][i] != 0 else '#bebebe' #grey blocked + for i, rxn_id in enumerate(ids) + } + + for i, rxn_id in enumerate(ids): + # Calculate arrow width for median + width_median = np.interp(abs(medians[key][i]), [0, 1], [min_width, max_width]) + isNegative = medians[key][i] < 0 + apply_arrow(metabMap_median, rxn_id, colors_median[rxn_id], isNegative, width_median) + + # Calculate arrow width for mean + width_mean = np.interp(abs(means[key][i]), [0, 1], [min_width, max_width]) + isNegative = means[key][i] < 0 + apply_arrow(metabMap_mean, rxn_id, colors_mean[rxn_id], isNegative, width_mean) + + # Save and convert the SVG files + save_and_convert(metabMap_mean, "mean", key) + save_and_convert(metabMap_median, "median", key) + +def apply_arrow(metabMap, rxn_id, color, isNegative, width=5): + """ + Apply an arrow to a specific reaction in the metabolic map with a given color. + + Args: + metabMap (ET.ElementTree): An XML tree representing the metabolic map. + rxn_id (str): The ID of the reaction to which the arrow will be applied. + color (str): The color of the arrow in hexadecimal format. + isNegative (bool): A boolean indicating if the arrow represents a negative value. + width (int): The width of the arrow. + + Returns: + None + """ + arrow = Arrow(width=width, col=color) + arrow.styleReactionElementsMeanMedian(metabMap, rxn_id, isNegative) + pass + +def save_and_convert(metabMap, map_type, key): + """ + Save the metabolic map as an SVG file and optionally convert it to PNG and PDF formats. + + Args: + metabMap (ET.ElementTree): An XML tree representing the metabolic map. + map_type (str): The type of map ('mean' or 'median'). + key (str): The key identifying the specific map. 
+
+    Returns:
+        None
+    """
+    svgFilePath = utils.FilePath(f"SVG Map {map_type} - {key}", ext=utils.FileFormat.SVG, prefix=ARGS.output_path)
+    utils.writeSvg(svgFilePath, metabMap)
+    if ARGS.generate_pdf:
+        pngPath = utils.FilePath(f"PNG Map {map_type} - {key}", ext=utils.FileFormat.PNG, prefix=ARGS.output_path)
+        pdfPath = utils.FilePath(f"PDF Map {map_type} - {key}", ext=utils.FileFormat.PDF, prefix=ARGS.output_path)
+        convert_to_pdf(svgFilePath, pngPath, pdfPath)
+    if not ARGS.generate_svg:
+        os.remove(svgFilePath.show())
+
+############################ MAIN #############################################
+def main(args:List[str] = None) -> None:
+    """
+    Initializes everything and sets the program in motion based on the frontend input arguments.
+
+    Returns:
+        None
+
+    Raises:
+        sys.exit : if a user-provided custom map is in the wrong format (ET.XMLSyntaxError, ET.XMLSchemaParseError)
+    """
+    global ARGS
+    ARGS = process_args(args)
+
+    if ARGS.custom_map == 'None':
+        ARGS.custom_map = None
+
+    if not os.path.isdir(ARGS.output_path):
+        os.makedirs(ARGS.output_path)
+
+    core_map :ET.ElementTree = ARGS.choice_map.getMap(
+        ARGS.tool_dir,
+        utils.FilePath.fromStrPath(ARGS.custom_map) if ARGS.custom_map else None)
+    # TODO: ^^^ ugly but fine for now, the argument is None if the model isn't custom because no file was given.
+    # getMap will None-check the customPath and panic when the model IS custom but there's no file (good). A cleaner
+    # solution can be derived from my comment in FilePath.fromStrPath
+
+    ids, class_pat = getClassesAndIdsFromDatasets(ARGS.input_datas_fluxes, ARGS.input_data_fluxes, ARGS.input_class_fluxes, ARGS.names_fluxes)
+
+    if ARGS.choice_map == utils.Model.HMRcore:
+        temp_map = utils.Model.HMRcore_no_legend
+        computeEnrichmentMeanMedian(temp_map.getMap(ARGS.tool_dir), class_pat, ids, ARGS.color_map)
+    elif ARGS.choice_map == utils.Model.ENGRO2:
+        temp_map = utils.Model.ENGRO2_no_legend
+        computeEnrichmentMeanMedian(temp_map.getMap(ARGS.tool_dir), class_pat, ids, ARGS.color_map)
+    else:
+        computeEnrichmentMeanMedian(core_map, class_pat, ids, ARGS.color_map)
+
+    enrichment_results = computeEnrichment(class_pat, ids)
+    for i, j, comparisonDict, max_z_score in enrichment_results:
+        map_copy = copy.deepcopy(core_map)
+        temp_thingsInCommon(comparisonDict, map_copy, max_z_score, i, j)
+        createOutputMaps(i, j, map_copy)
+
+    if not ERRORS: return
+    utils.logWarning(
+        f"The following reaction IDs were mentioned in the dataset but weren't found in the map: {ERRORS}",
+        ARGS.out_log)
+
+    print('Execution succeeded')
+
+###############################################################################
+if __name__ == "__main__":
+    main()
+
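Editor's note: the enrichment statistics added above (`computePValue` and `compareDatasetPair`) boil down to a two-sample test plus a z-score on the difference of means. The sketch below reproduces that computation on toy data with the same scipy/numpy calls; the sample values and variable names are illustrative only and are not part of the repository.

```python
import numpy as np
import scipy.stats as st

# Toy flux samples for one reaction in two classes (illustrative values only)
flux_a = [1.2, 1.5, 1.1, 1.4, 1.3]
flux_b = [0.7, 0.9, 0.8, 1.0, 0.6]

# Kolmogorov-Smirnov p-value, as in the default "ks" branch of computePValue
_, p_value = st.ks_2samp(flux_a, flux_b)

# z-score of the difference between means, matching the formula in the diff
mean_a, mean_b = np.nanmean(flux_a), np.nanmean(flux_b)
std_a, std_b = np.nanstd(flux_a, ddof=1), np.nanstd(flux_b, ddof=1)
z_score = (mean_a - mean_b) / np.sqrt(std_a**2 / len(flux_a) + std_b**2 / len(flux_b))

print(f"p={p_value:.4f}, z={z_score:.2f}")  # a low p and a large |z| flag the reaction as differentially active
```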
--- a/COBRAxy/src/importMetabolicModel.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/importMetabolicModel.py Sun Oct 26 19:27:41 2025 +0000 @@ -12,9 +12,13 @@ import cobra import argparse import pandas as pd -import utils.general_utils as utils +try: + from .utils import general_utils as utils + from .utils import model_utils as modelUtils +except: + import utils.general_utils as utils + import utils.model_utils as modelUtils from typing import Optional, Tuple, List -import utils.model_utils as modelUtils import logging from pathlib import Path @@ -40,7 +44,7 @@ parser.add_argument("--name", nargs='*', required=True, help="Model name (default or custom)") - parser.add_argument("--medium_selector", type=str, required=True, + parser.add_argument("--medium_selector", type=str, default="Default", help="Medium selection option") parser.add_argument("--gene_format", type=str, default="Default", @@ -49,8 +53,8 @@ parser.add_argument("--out_tabular", type=str, help="Output file for the merged dataset (CSV or XLSX)") - parser.add_argument("--tool_dir", type=str, default=os.path.dirname(__file__), - help="Tool directory (passed from Galaxy as $__tool_directory__)") + parser.add_argument("--tool_dir", type=str, default=os.path.dirname(os.path.abspath(__file__)), + help="Tool directory (default: auto-detected package location)") return parser.parse_args(args)
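Editor's note: the new `--tool_dir` default of `os.path.dirname(os.path.abspath(__file__))` matters when the script is launched from a different working directory: without `abspath`, `__file__` may be a relative path whose dirname is empty. A small standalone illustration of that behaviour (the paths shown in the comments are hypothetical):

```python
import os

# With a relative script path, dirname() alone yields "" (implicitly the current directory)
print(os.path.dirname("importMetabolicModel.py"))                   # -> ""

# abspath() first resolves the path against the current working directory,
# so the dirname is always an absolute, usable tool directory
print(os.path.dirname(os.path.abspath("importMetabolicModel.py")))  # -> e.g. /home/user/COBRAxy/src
```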
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COBRAxy/src/local/__init__.py Sun Oct 26 19:27:41 2025 +0000 @@ -0,0 +1,2 @@ +# Local data directory for COBRAxy +# Contains models, mappings, SVG maps, and pickle files
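Editor's note: adding an `__init__.py` makes `src/local/` an importable package, so the bundled models, mappings, and SVG maps ship with a pip install and can be located relative to the installed code. One common way to resolve such packaged data at runtime is sketched below (Python 3.9+); the package path and file name are assumptions for illustration only, and the COBRAxy tools themselves resolve these files through the `--tool_dir` argument instead.

```python
from importlib import resources

# Hypothetical lookup of a bundled file inside the packaged data directory;
# "cobraxy.local" and "ENGRO2_map.svg" are illustrative names, not taken from the repo.
data_dir = resources.files("cobraxy.local")
svg_text = (data_dir / "ENGRO2_map.svg").read_text(encoding="utf-8")
```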
--- a/COBRAxy/src/marea.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/marea.py Sun Oct 26 19:27:41 2025 +0000 @@ -1,1052 +1,1055 @@ -""" -MAREA: Enrichment and map styling for RAS/RPS data. - -This module compares groups of samples using RAS (Reaction Activity Scores) and/or -RPS (Reaction Propensity Scores), computes statistics (p-values, z-scores, fold change), -and applies visual styling to an SVG metabolic map (with optional PDF/PNG export). -""" -from __future__ import division -import csv -from enum import Enum -import re -import sys -import numpy as np -import pandas as pd -import itertools as it -import scipy.stats as st -import lxml.etree as ET -import math -import utils.general_utils as utils -from PIL import Image -import os -import argparse -import pyvips -from typing import Tuple, Union, Optional, List, Dict -import copy - -from pydeseq2.dds import DeseqDataSet -from pydeseq2.default_inference import DefaultInference -from pydeseq2.ds import DeseqStats - -ERRORS = [] -########################## argparse ########################################## -ARGS :argparse.Namespace -def process_args(args:List[str] = None) -> argparse.Namespace: - """ - Parse command-line arguments exposed by the Galaxy frontend for this module. - - Args: - args: Optional list of arguments, defaults to sys.argv when None. - - Returns: - Namespace: Parsed arguments. - """ - parser = argparse.ArgumentParser( - usage = "%(prog)s [options]", - description = "process some value's genes to create a comparison's map.") - - #General: - parser.add_argument( - '-td', '--tool_dir', - type = str, - required = True, - help = 'your tool directory') - - parser.add_argument('-on', '--control', type = str) - parser.add_argument('-ol', '--out_log', help = "Output log") - - #Computation details: - parser.add_argument( - '-co', '--comparison', - type = str, - default = 'manyvsmany', - choices = ['manyvsmany', 'onevsrest', 'onevsmany']) - - parser.add_argument( - '-te' ,'--test', - type = str, - default = 'ks', - choices = ['ks', 'ttest_p', 'ttest_ind', 'wilcoxon', 'mw', 'DESeq'], - help = 'Statistical test to use (default: %(default)s)') - - parser.add_argument( - '-pv' ,'--pValue', - type = float, - default = 0.1, - help = 'P-Value threshold (default: %(default)s)') - - parser.add_argument( - '-adj' ,'--adjusted', - type = utils.Bool("adjusted"), default = False, - help = 'Apply the FDR (Benjamini-Hochberg) correction (default: %(default)s)') - - parser.add_argument( - '-fc', '--fChange', - type = float, - default = 1.5, - help = 'Fold-Change threshold (default: %(default)s)') - - parser.add_argument( - "-ne", "--net", - type = utils.Bool("net"), default = False, - help = "choose if you want net enrichment for RPS") - - parser.add_argument( - '-op', '--option', - type = str, - choices = ['datasets', 'dataset_class'], - help='dataset or dataset and class') - - #RAS: - parser.add_argument( - "-ra", "--using_RAS", - type = utils.Bool("using_RAS"), default = True, - help = "choose whether to use RAS datasets.") - - parser.add_argument( - '-id', '--input_data', - type = str, - help = 'input dataset') - - parser.add_argument( - '-ic', '--input_class', - type = str, - help = 'sample group specification') - - parser.add_argument( - '-ids', '--input_datas', - type = str, - nargs = '+', - help = 'input datasets') - - parser.add_argument( - '-na', '--names', - type = str, - nargs = '+', - help = 'input names') - - #RPS: - parser.add_argument( - "-rp", "--using_RPS", - type = utils.Bool("using_RPS"), default = False, - help = 
"choose whether to use RPS datasets.") - - parser.add_argument( - '-idr', '--input_data_rps', - type = str, - help = 'input dataset rps') - - parser.add_argument( - '-icr', '--input_class_rps', - type = str, - help = 'sample group specification rps') - - parser.add_argument( - '-idsr', '--input_datas_rps', - type = str, - nargs = '+', - help = 'input datasets rps') - - parser.add_argument( - '-nar', '--names_rps', - type = str, - nargs = '+', - help = 'input names rps') - - #Output: - parser.add_argument( - "-gs", "--generate_svg", - type = utils.Bool("generate_svg"), default = True, - help = "choose whether to use RAS datasets.") - - parser.add_argument( - "-gp", "--generate_pdf", - type = utils.Bool("generate_pdf"), default = True, - help = "choose whether to use RAS datasets.") - - parser.add_argument( - '-cm', '--custom_map', - type = str, - help='custom map to use') - - parser.add_argument( - '-idop', '--output_path', - type = str, - default='result', - help = 'output path for maps') - - parser.add_argument( - '-mc', '--choice_map', - type = utils.Model, default = utils.Model.HMRcore, - choices = [utils.Model.HMRcore, utils.Model.ENGRO2, utils.Model.Custom]) - - args :argparse.Namespace = parser.parse_args(args) - if args.using_RAS and not args.using_RPS: args.net = False - - return args - -############################ dataset input #################################### -def read_dataset(data :str, name :str) -> pd.DataFrame: - """ - Tries to read the dataset from its path (data) as a tsv and turns it into a DataFrame. - - Args: - data : filepath of a dataset (from frontend input params or literals upon calling) - name : name associated with the dataset (from frontend input params or literals upon calling) - - Returns: - pd.DataFrame : dataset in a runtime operable shape - - Raises: - sys.exit : if there's no data (pd.errors.EmptyDataError) or if the dataset has less than 2 columns - """ - try: - dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python') - except pd.errors.EmptyDataError: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - if len(dataset.columns) < 2: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - return dataset - -############################ map_methods ###################################### -FoldChange = Union[float, int, str] # Union[float, Literal[0, "-INF", "INF"]] -def fold_change(avg1 :float, avg2 :float) -> FoldChange: - """ - Calculates the fold change between two gene expression values. - - Args: - avg1 : average expression value from one dataset avg2 : average expression value from the other dataset - - Returns: - FoldChange : - 0 : when both input values are 0 - "-INF" : when avg1 is 0 - "INF" : when avg2 is 0 - float : for any other combination of values - """ - if avg1 == 0 and avg2 == 0: - return 0 - - if avg1 == 0: - return '-INF' # TODO: maybe fix - - if avg2 == 0: - return 'INF' - - # (threshold_F_C - 1) / (abs(threshold_F_C) + 1) con threshold_F_C > 1 - return (avg1 - avg2) / (abs(avg1) + abs(avg2)) - -# TODO: I would really like for this one to get the Thanos treatment -def fix_style(l :str, col :Optional[str], width :str, dash :str) -> str: - """ - Produces a "fixed" style string to assign to a reaction arrow in the SVG map, assigning style properties to the corresponding values passed as input params. 
- - Args: - l : current style string of an SVG element - col : new value for the "stroke" style property - width : new value for the "stroke-width" style property - dash : new value for the "stroke-dasharray" style property - - Returns: - str : the fixed style string - """ - tmp = l.split(';') - flag_col = False - flag_width = False - flag_dash = False - for i in range(len(tmp)): - if tmp[i].startswith('stroke:'): - tmp[i] = 'stroke:' + col - flag_col = True - if tmp[i].startswith('stroke-width:'): - tmp[i] = 'stroke-width:' + width - flag_width = True - if tmp[i].startswith('stroke-dasharray:'): - tmp[i] = 'stroke-dasharray:' + dash - flag_dash = True - if not flag_col: - tmp.append('stroke:' + col) - if not flag_width: - tmp.append('stroke-width:' + width) - if not flag_dash: - tmp.append('stroke-dasharray:' + dash) - return ';'.join(tmp) - -def fix_map(d :Dict[str, List[Union[float, FoldChange]]], core_map :ET.ElementTree, threshold_P_V :float, threshold_F_C :float, max_z_score :float) -> ET.ElementTree: - """ - Edits the selected SVG map based on the p-value and fold change data (d) and some significance thresholds also passed as inputs. - - Args: - d : dictionary mapping a p-value and a fold-change value (values) to each reaction ID as encoded in the SVG map (keys) - core_map : SVG map to modify - threshold_P_V : threshold for a p-value to be considered significant - threshold_F_C : threshold for a fold change value to be considered significant - max_z_score : highest z-score (absolute value) - - Returns: - ET.ElementTree : the modified core_map - - Side effects: - core_map : mut - """ - maxT = 12 - minT = 2 - grey = '#BEBEBE' - blue = '#6495ed' - red = '#ecac68' - for el in core_map.iter(): - el_id = str(el.get('id')) - if el_id.startswith('R_'): - tmp = d.get(el_id[2:]) - if tmp != None: - p_val, f_c, z_score, avg1, avg2 = tmp - - if math.isnan(p_val) or (isinstance(f_c, float) and math.isnan(f_c)): continue - - if p_val <= threshold_P_V: # p-value is OK - if not isinstance(f_c, str): # FC is finite - if abs(f_c) < ((threshold_F_C - 1) / (abs(threshold_F_C) + 1)): # FC is not OK - col = grey - width = str(minT) - else: # FC is OK - if f_c < 0: - col = blue - elif f_c > 0: - col = red - width = str( - min( - max(abs(z_score * maxT) / max_z_score, minT), - maxT)) - - else: # FC is infinite - if f_c == '-INF': - col = blue - elif f_c == 'INF': - col = red - width = str(maxT) - dash = 'none' - else: # p-value is not OK - dash = '5,5' - col = grey - width = str(minT) - el.set('style', fix_style(el.get('style', ""), col, width, dash)) - return core_map - -def getElementById(reactionId :str, metabMap :ET.ElementTree) -> utils.Result[ET.Element, utils.Result.ResultErr]: - """ - Finds any element in the given map with the given ID. ID uniqueness in an svg file is recommended but - not enforced, if more than one element with the exact ID is found only the first will be returned. - - Args: - reactionId (str): exact ID of the requested element. - metabMap (ET.ElementTree): metabolic map containing the element. - - Returns: - utils.Result[ET.Element, ResultErr]: result of the search, either the first match found or a ResultErr. 
- """ - return utils.Result.Ok( - f"//*[@id=\"{reactionId}\"]").map( - lambda xPath : metabMap.xpath(xPath)[0]).mapErr( - lambda _ : utils.Result.ResultErr(f"No elements with ID \"{reactionId}\" found in map")) - -def styleMapElement(element :ET.Element, styleStr :str) -> None: - """Append/override stroke-related styles on a given SVG element.""" - currentStyles :str = element.get("style", "") - if re.search(r";stroke:[^;]+;stroke-width:[^;]+;stroke-dasharray:[^;]+$", currentStyles): - currentStyles = ';'.join(currentStyles.split(';')[:-3]) - - element.set("style", currentStyles + styleStr) - -class ReactionDirection(Enum): - Unknown = "" - Direct = "_F" - Inverse = "_B" - - @classmethod - def fromDir(cls, s :str) -> "ReactionDirection": - # vvv as long as there's so few variants I actually condone the if spam: - if s == ReactionDirection.Direct.value: return ReactionDirection.Direct - if s == ReactionDirection.Inverse.value: return ReactionDirection.Inverse - return ReactionDirection.Unknown - - @classmethod - def fromReactionId(cls, reactionId :str) -> "ReactionDirection": - return ReactionDirection.fromDir(reactionId[-2:]) - -def getArrowBodyElementId(reactionId :str) -> str: - """Return the SVG element id for a reaction arrow body, normalizing direction tags.""" - if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV - elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: reactionId = reactionId[:-2] - return f"R_{reactionId}" - -def getArrowHeadElementId(reactionId :str) -> Tuple[str, str]: - """ - We attempt extracting the direction information from the provided reaction ID, if unsuccessful we provide the IDs of both directions. - - Args: - reactionId : the provided reaction ID. - - Returns: - Tuple[str, str]: either a single str ID for the correct arrow head followed by an empty string or both options to try. - """ - if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV - elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: - return reactionId[:-3:-1] + reactionId[:-2], "" # ^^^ Invert _F to F_ - - return f"F_{reactionId}", f"B_{reactionId}" - -class ArrowColor(Enum): - """ - Encodes possible arrow colors based on their meaning in the enrichment process. - """ - Invalid = "#BEBEBE" # gray, fold-change under treshold or not significant p-value - Transparent = "#ffffff00" # transparent, to make some arrow segments disappear - UpRegulated = "#ecac68" # orange, up-regulated reaction - DownRegulated = "#6495ed" # lightblue, down-regulated reaction - - UpRegulatedInv = "#FF0000" # bright red for reversible with conflicting directions - - DownRegulatedInv = "#0000FF" # bright blue for reversible with conflicting directions - - @classmethod - def fromFoldChangeSign(cls, foldChange :float, *, useAltColor = False) -> "ArrowColor": - colors = (cls.DownRegulated, cls.DownRegulatedInv) if foldChange < 0 else (cls.UpRegulated, cls.UpRegulatedInv) - return colors[useAltColor] - - def __str__(self) -> str: return self.value - -class Arrow: - """ - Models the properties of a reaction arrow that change based on enrichment. - """ - MIN_W = 2 - MAX_W = 12 - - def __init__(self, width :int, col: ArrowColor, *, isDashed = False) -> None: - """ - (Private) Initializes an instance of Arrow. - - Args: - width : width of the arrow, ideally to be kept within Arrow.MIN_W and Arrow.MAX_W (not enforced). - col : color of the arrow. 
- isDashed : whether the arrow should be dashed, meaning the associated pValue resulted not significant. - - Returns: - None : practically, a Arrow instance. - """ - self.w = width - self.col = col - self.dash = isDashed - - def applyTo(self, reactionId :str, metabMap :ET.ElementTree, styleStr :str) -> None: - if getElementById(reactionId, metabMap).map(lambda el : styleMapElement(el, styleStr)).isErr: - ERRORS.append(reactionId) - - def styleReactionElements(self, metabMap :ET.ElementTree, reactionId :str, *, mindReactionDir = True) -> None: - # If direction is irrelevant (e.g., RAS), style only the arrow body - if not mindReactionDir: - return self.applyTo(getArrowBodyElementId(reactionId), metabMap, self.toStyleStr()) - - # Now we style the arrow head(s): - idOpt1, idOpt2 = getArrowHeadElementId(reactionId) - self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) - if idOpt2: self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) - - def toStyleStr(self, *, downSizedForTips = False) -> str: - """ - Collapses the styles of this Arrow into a str, ready to be applied as part of the "style" property on an svg element. - - Returns: - str : the styles string. - """ - width = self.w - if downSizedForTips: width *= 0.8 - return f";stroke:{self.col};stroke-width:{width};stroke-dasharray:{'5,5' if self.dash else 'none'}" - -# Default arrows used for different significance states -INVALID_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid) -INSIGNIFICANT_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid, isDashed = True) -TRANSPARENT_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Transparent) # Who cares how big it is if it's transparent - -def applyRpsEnrichmentToMap(rpsEnrichmentRes :Dict[str, Union[Tuple[float, FoldChange], Tuple[float, FoldChange, float, float]]], metabMap :ET.ElementTree, maxNumericZScore :float) -> None: - """ - Applies RPS enrichment results to the provided metabolic map. - - Args: - rpsEnrichmentRes : RPS enrichment results. - metabMap : the metabolic map to edit. - maxNumericZScore : biggest finite z-score value found. 
- - Side effects: - metabMap : mut - - Returns: - None - """ - for reactionId, values in rpsEnrichmentRes.items(): - pValue = values[0] - foldChange = values[1] - z_score = values[2] - - if math.isnan(pValue) or (isinstance(foldChange, float) and math.isnan(foldChange)): continue - - if isinstance(foldChange, str): foldChange = float(foldChange) - if pValue > ARGS.pValue: # pValue above tresh: dashed arrow - INSIGNIFICANT_ARROW.styleReactionElements(metabMap, reactionId) - continue - - if abs(foldChange) < (ARGS.fChange - 1) / (abs(ARGS.fChange) + 1): - INVALID_ARROW.styleReactionElements(metabMap, reactionId) - continue - - width = Arrow.MAX_W - if not math.isinf(z_score): - try: width = min( - max(abs(z_score * Arrow.MAX_W) / maxNumericZScore, Arrow.MIN_W), - Arrow.MAX_W) - - except ZeroDivisionError: pass - - if not reactionId.endswith("_RV"): # RV stands for reversible reactions - Arrow(width, ArrowColor.fromFoldChangeSign(foldChange)).styleReactionElements(metabMap, reactionId) - continue - - reactionId = reactionId[:-3] # Remove "_RV" - - inversionScore = (values[3] < 0) + (values[4] < 0) # Compacts the signs of averages into 1 easy to check score - if inversionScore == 2: foldChange *= -1 - - # If the score is 1 (opposite signs) we use alternative colors vvv - arrow = Arrow(width, ArrowColor.fromFoldChangeSign(foldChange, useAltColor = inversionScore == 1)) - - # vvv These 2 if statements can both be true and can both happen - if ARGS.net: # style arrow head(s): - arrow.styleReactionElements(metabMap, reactionId + ("_B" if inversionScore == 2 else "_F")) - - if not ARGS.using_RAS: # style arrow body - arrow.styleReactionElements(metabMap, reactionId, mindReactionDir = False) - -############################ split class ###################################### -def split_class(classes :pd.DataFrame, dataset_values :Dict[str, List[float]]) -> Dict[str, List[List[float]]]: - """ - Generates a :dict that groups together data from a :DataFrame based on classes the data is related to. - - Args: - classes : a :DataFrame of only string values, containing class information (rows) and keys to query the resolve_rules :dict - dataset_values : a :dict containing :float data - - Returns: - dict : the dict with data grouped by class - - Side effects: - classes : mut - """ - class_pat :Dict[str, List[List[float]]] = {} - for i in range(len(classes)): - classe :str = classes.iloc[i, 1] - if pd.isnull(classe): continue - - l :List[List[float]] = [] - sample_ids: List[str] = [] - - for j in range(i, len(classes)): - if classes.iloc[j, 1] == classe: - pat_id :str = classes.iloc[j, 0] # sample name - values = dataset_values.get(pat_id, None) # the column of values for that sample - if values != None: - l.append(values) - sample_ids.append(pat_id) - classes.iloc[j, 1] = None # TODO: problems? - - if l: - class_pat[classe] = { - "values": list(map(list, zip(*l))), # transpose - "samples": sample_ids - } - continue - - utils.logWarning( - f"Warning: no sample found in class \"{classe}\", the class has been disregarded", ARGS.out_log) - - return class_pat - -############################ conversion ############################################## -# Conversion from SVG to PNG -def svg_to_png_with_background(svg_path :utils.FilePath, png_path :utils.FilePath, dpi :int = 72, scale :int = 1, size :Optional[float] = None) -> None: - """ - Internal utility to convert an SVG to PNG (forced opaque) to aid in PDF conversion. 
- - Args: - svg_path : path to SVG file - png_path : path for new PNG file - dpi : dots per inch of the generated PNG - scale : scaling factor for the generated PNG, computed internally when a size is provided - size : final effective width of the generated PNG - - Returns: - None - """ - if size: - image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=1) - scale = size / image.width - image = image.resize(scale) - else: - image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=scale) - - white_background = pyvips.Image.black(image.width, image.height).new_from_image([255, 255, 255]) - white_background = white_background.affine([scale, 0, 0, scale]) - - if white_background.bands != image.bands: - white_background = white_background.extract_band(0) - - composite_image = white_background.composite2(image, 'over') - composite_image.write_to_file(png_path.show()) - -def convert_to_pdf(file_svg :utils.FilePath, file_png :utils.FilePath, file_pdf :utils.FilePath) -> None: - """ - Converts the SVG map at the provided path to PDF. - - Args: - file_svg : path to SVG file - file_png : path to PNG file - file_pdf : path to new PDF file - - Returns: - None - """ - svg_to_png_with_background(file_svg, file_png) - try: - image = Image.open(file_png.show()) - image = image.convert("RGB") - image.save(file_pdf.show(), "PDF", resolution=100.0) - print(f'PDF file {file_pdf.filePath} successfully generated.') - - except Exception as e: - raise utils.DataErr(file_pdf.show(), f'Error generating PDF file: {e}') - -############################ map ############################################## -def buildOutputPath(dataset1Name :str, dataset2Name = "rest", *, details = "", ext :utils.FileFormat) -> utils.FilePath: - """ - Builds a FilePath instance from the names of confronted datasets ready to point to a location in the - "result/" folder, used by this tool for output files in collections. - - Args: - dataset1Name : _description_ - dataset2Name : _description_. Defaults to "rest". 
- details : _description_ - ext : _description_ - - Returns: - utils.FilePath : _description_ - """ - return utils.FilePath( - f"{dataset1Name}_vs_{dataset2Name}" + (f" ({details})" if details else ""), - ext, - prefix = ARGS.output_path) - -FIELD_NOT_AVAILABLE = '/' -def writeToCsv(rows: List[list], fieldNames :List[str], outPath :utils.FilePath) -> None: - fieldsAmt = len(fieldNames) - with open(outPath.show(), "w", newline = "") as fd: - writer = csv.DictWriter(fd, fieldnames = fieldNames, delimiter = '\t') - writer.writeheader() - - for row in rows: - sizeMismatch = fieldsAmt - len(row) - if sizeMismatch > 0: row.extend([FIELD_NOT_AVAILABLE] * sizeMismatch) - writer.writerow({ field : data for field, data in zip(fieldNames, row) }) - -OldEnrichedScores = Dict[str, List[Union[float, FoldChange]]] -def temp_thingsInCommon(tmp :OldEnrichedScores, core_map :ET.ElementTree, max_z_score :float, dataset1Name :str, dataset2Name = "rest", ras_enrichment = True) -> None: - suffix = "RAS" if ras_enrichment else "RPS" - writeToCsv( - [ [reactId] + values for reactId, values in tmp.items() ], - ["ids", "P_Value", "fold change", "z-score", "average_1", "average_2"], - buildOutputPath(dataset1Name, dataset2Name, details = f"Tabular Result ({suffix})", ext = utils.FileFormat.TSV)) - - if ras_enrichment: - fix_map(tmp, core_map, ARGS.pValue, ARGS.fChange, max_z_score) - return - - for reactId, enrichData in tmp.items(): tmp[reactId] = tuple(enrichData) - applyRpsEnrichmentToMap(tmp, core_map, max_z_score) - -def computePValue(dataset1Data: List[float], dataset2Data: List[float]) -> Tuple[float, float]: - """ - Computes the statistical significance score (P-value) of the comparison between coherent data - from two datasets. The data is supposed to, in both datasets: - - be related to the same reaction ID; - - be ordered by sample, such that the item at position i in both lists is related to the - same sample or cell line. - - Args: - dataset1Data : data from the 1st dataset. - dataset2Data : data from the 2nd dataset. - - Returns: - tuple: (P-value, Z-score) - - P-value from the selected test on the provided data. - - Z-score of the difference between means of the two datasets. 
- """ - match ARGS.test: - case "ks": - # Perform Kolmogorov-Smirnov test - _, p_value = st.ks_2samp(dataset1Data, dataset2Data) - case "ttest_p": - # Datasets should have same size - if len(dataset1Data) != len(dataset2Data): - raise ValueError("Datasets must have the same size for paired t-test.") - # Perform t-test for paired samples - _, p_value = st.ttest_rel(dataset1Data, dataset2Data) - case "ttest_ind": - # Perform t-test for independent samples - _, p_value = st.ttest_ind(dataset1Data, dataset2Data) - case "wilcoxon": - # Datasets should have same size - if len(dataset1Data) != len(dataset2Data): - raise ValueError("Datasets must have the same size for Wilcoxon signed-rank test.") - # Perform Wilcoxon signed-rank test - np.random.seed(42) # Ensure reproducibility since zsplit method is used - _, p_value = st.wilcoxon(dataset1Data, dataset2Data, zero_method='zsplit') - case "mw": - # Perform Mann-Whitney U test - _, p_value = st.mannwhitneyu(dataset1Data, dataset2Data) - case _: - p_value = np.nan # Default value if no valid test is selected - - # Calculate means and standard deviations - mean1 = np.mean(dataset1Data) - mean2 = np.mean(dataset2Data) - std1 = np.std(dataset1Data, ddof=1) - std2 = np.std(dataset2Data, ddof=1) - - n1 = len(dataset1Data) - n2 = len(dataset2Data) - - # Calculate Z-score - z_score = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2)) - - return p_value, z_score - - -def DESeqPValue(comparisonResult :Dict[str, List[Union[float, FoldChange]]], dataset1Data :List[List[float]], dataset2Data :List[List[float]], ids :List[str]) -> None: - """ - Computes the p-value for each reaction in the comparisonResult dictionary using DESeq2. - - Args: - comparisonResult : dictionary mapping a p-value and a fold-change value (values) to each reaction ID as encoded in the SVG map (keys) - dataset1Data : data from the 1st dataset. - dataset2Data : data from the 2nd dataset. - ids : list of reaction IDs. - - Returns: - None : mutates the comparisonResult dictionary in place with the p-values. 
- """ - - # pyDESeq2 needs at least 2 replicates per sample so I check this - if len(dataset1Data[0]) < 2 or len(dataset2Data[0]) < 2: - raise ValueError("Datasets must have at least 2 replicates each") - - # pyDESeq2 is based on pandas, so we need to convert the data into a DataFrame and clean it from NaN values - dataframe1 = pd.DataFrame(dataset1Data, index=ids) - dataframe2 = pd.DataFrame(dataset2Data, index=ids) - - # pyDESeq2 requires datasets to be samples x reactions and integer values - dataframe1_clean = dataframe1.dropna(axis=0, how="any").T.astype(int) - dataframe2_clean = dataframe2.dropna(axis=0, how="any").T.astype(int) - dataframe1_clean.index = [f"ds1_rep{i+1}" for i in range(dataframe1_clean.shape[0])] - dataframe2_clean.index = [f"ds2_rep{j+1}" for j in range(dataframe2_clean.shape[0])] - - # pyDESeq2 works on a DataFrame with values and another with infos about how samples are split (like dataset class) - dataframe = pd.concat([dataframe1_clean, dataframe2_clean], axis=0) - metadata = pd.DataFrame({"dataset": (["dataset1"]*dataframe1_clean.shape[0] + ["dataset2"]*dataframe2_clean.shape[0])}, index=dataframe.index) - - # Ensure the index of the metadata matches the index of the dataframe - if not dataframe.index.equals(metadata.index): - raise ValueError("The index of the metadata DataFrame must match the index of the counts DataFrame.") - - # Prepare and run pyDESeq2 - inference = DefaultInference() - dds = DeseqDataSet(counts=dataframe, metadata=metadata, design="~dataset", inference=inference, quiet=True, low_memory=True) - dds.deseq2() - ds = DeseqStats(dds, contrast=["dataset", "dataset1", "dataset2"], inference=inference, quiet=True) - ds.summary() - - # Retrieve the p-values from the DESeq2 results - for reactId in ds.results_df.index: - comparisonResult[reactId][0] = ds.results_df["pvalue"][reactId] - - -# TODO: the net RPS computation should be done in the RPS module -def compareDatasetPair(dataset1Data :List[List[float]], dataset2Data :List[List[float]], ids :List[str]) -> Tuple[Dict[str, List[Union[float, FoldChange]]], float, Dict[str, Tuple[np.ndarray, np.ndarray]]]: - - netRPS :Dict[str, Tuple[np.ndarray, np.ndarray]] = {} - comparisonResult :Dict[str, List[Union[float, FoldChange]]] = {} - count = 0 - max_z_score = 0 - - for l1, l2 in zip(dataset1Data, dataset2Data): - reactId = ids[count] - count += 1 - if not reactId: continue - - try: #TODO: identify the source of these errors and minimize code in the try block - reactDir = ReactionDirection.fromReactionId(reactId) - # Net score is computed only for reversible reactions when user wants it on arrow tips or when RAS datasets aren't used - if (ARGS.net or not ARGS.using_RAS) and reactDir is not ReactionDirection.Unknown: - try: position = ids.index(reactId[:-1] + ('B' if reactDir is ReactionDirection.Direct else 'F')) - except ValueError: continue # we look for the complementary id, if not found we skip - - nets1 = np.subtract(l1, dataset1Data[position]) - nets2 = np.subtract(l2, dataset2Data[position]) - netRPS[reactId] = (nets1, nets2) - - # Compute p-value and z-score for the RPS scores, if the pyDESeq option is set, p-values will be computed after and this function will return p_value = 0 - p_value, z_score = computePValue(nets1, nets2) - avg1 = sum(nets1) / len(nets1) - avg2 = sum(nets2) / len(nets2) - net = fold_change(avg1, avg2) - - if math.isnan(net): continue - comparisonResult[reactId[:-1] + "RV"] = [p_value, net, z_score, avg1, avg2] - - # vvv complementary directional ids are set to None once 
processed if net is to be applied to tips - if ARGS.net: # If only using RPS, we cannot delete the inverse, as it's needed to color the arrows - ids[position] = None - continue - - # fallthrough is intended, regular scores need to be computed when tips aren't net but RAS datasets aren't used - # Compute p-value and z-score for the RAS scores, if the pyDESeq option is set, p-values will be computed after and this function will return p_value = 0 - p_value, z_score = computePValue(l1, l2) - avg = fold_change(sum(l1) / len(l1), sum(l2) / len(l2)) - # vvv TODO: Check numpy version compatibility - if np.isfinite(z_score) and max_z_score < abs(z_score): max_z_score = abs(z_score) - comparisonResult[reactId] = [float(p_value), avg, z_score, sum(l1) / len(l1), sum(l2) / len(l2)] - - except (TypeError, ZeroDivisionError): continue - - if ARGS.test == "DESeq": - # Compute p-values using DESeq2 - DESeqPValue(comparisonResult, dataset1Data, dataset2Data, ids) - - # Apply multiple testing correction if set by the user - if ARGS.adjusted: - - # Retrieve the p-values from the comparisonResult dictionary, they have to be different from NaN - validPValues = [(reactId, result[0]) for reactId, result in comparisonResult.items() if not np.isnan(result[0])] - # Unpack the valid p-values - reactIds, pValues = zip(*validPValues) - # Adjust the p-values using the Benjamini-Hochberg method - adjustedPValues = st.false_discovery_control(pValues) - # Update the comparisonResult dictionary with the adjusted p-values - for reactId , adjustedPValue in zip(reactIds, adjustedPValues): - comparisonResult[reactId][0] = adjustedPValue - - return comparisonResult, max_z_score, netRPS - -def computeEnrichment(class_pat: Dict[str, List[List[float]]], ids: List[str], *, fromRAS=True) -> Tuple[List[Tuple[str, str, dict, float]], dict]: - """ - Compares clustered data based on a given comparison mode and applies enrichment-based styling on the - provided metabolic map. - - Args: - class_pat : the clustered data. - ids : ids for data association. - fromRAS : whether the data to enrich consists of RAS scores. - - Returns: - tuple: A tuple containing: - - List[Tuple[str, str, dict, float]]: List of tuples with pairs of dataset names, comparison dictionary and max z-score. - - dict : net RPS values for each dataset's reactions - - Raises: - sys.exit : if there are less than 2 classes for comparison - """ - class_pat = {k.strip(): v for k, v in class_pat.items()} - if (not class_pat) or (len(class_pat.keys()) < 2): - sys.exit('Execution aborted: classes provided for comparisons are less than two\n') - - # { datasetName : { reactId : netRPS, ... }, ... 
} - netRPSResults :Dict[str, Dict[str, np.ndarray]] = {} - enrichment_results = [] - - if ARGS.comparison == "manyvsmany": - for i, j in it.combinations(class_pat.keys(), 2): - comparisonDict, max_z_score, netRPS = compareDatasetPair(class_pat.get(i), class_pat.get(j), ids) - enrichment_results.append((i, j, comparisonDict, max_z_score)) - netRPSResults[i] = { reactId : net[0] for reactId, net in netRPS.items() } - netRPSResults[j] = { reactId : net[1] for reactId, net in netRPS.items() } - - elif ARGS.comparison == "onevsrest": - for single_cluster in class_pat.keys(): - rest = [item for k, v in class_pat.items() if k != single_cluster for item in v] - comparisonDict, max_z_score, netRPS = compareDatasetPair(class_pat.get(single_cluster), rest, ids) - enrichment_results.append((single_cluster, "rest", comparisonDict, max_z_score)) - netRPSResults[single_cluster] = { reactId : net[0] for reactId, net in netRPS.items() } - netRPSResults["rest"] = { reactId : net[1] for reactId, net in netRPS.items() } - - elif ARGS.comparison == "onevsmany": - controlItems = class_pat.get(ARGS.control) - for otherDataset in class_pat.keys(): - if otherDataset == ARGS.control: - continue - - #comparisonDict, max_z_score, netRPS = compareDatasetPair(controlItems, class_pat.get(otherDataset), ids) - comparisonDict, max_z_score, netRPS = compareDatasetPair(class_pat.get(otherDataset),controlItems, ids) - #enrichment_results.append((ARGS.control, otherDataset, comparisonDict, max_z_score)) - enrichment_results.append(( otherDataset,ARGS.control, comparisonDict, max_z_score)) - netRPSResults[otherDataset] = { reactId : net[0] for reactId, net in netRPS.items() } - netRPSResults[ARGS.control] = { reactId : net[1] for reactId, net in netRPS.items() } - - return enrichment_results, netRPSResults - -def createOutputMaps(dataset1Name: str, dataset2Name: str, core_map: ET.ElementTree) -> None: - svgFilePath = buildOutputPath(dataset1Name, dataset2Name, details="SVG Map", ext=utils.FileFormat.SVG) - utils.writeSvg(svgFilePath, core_map) - - if ARGS.generate_pdf: - pngPath = buildOutputPath(dataset1Name, dataset2Name, details="PNG Map", ext=utils.FileFormat.PNG) - pdfPath = buildOutputPath(dataset1Name, dataset2Name, details="PDF Map", ext=utils.FileFormat.PDF) - svg_to_png_with_background(svgFilePath, pngPath) - try: - image = Image.open(pngPath.show()) - image = image.convert("RGB") - image.save(pdfPath.show(), "PDF", resolution=100.0) - print(f'PDF file {pdfPath.filePath} successfully generated.') - - except Exception as e: - raise utils.DataErr(pdfPath.show(), f'Error generating PDF file: {e}') - - if not ARGS.generate_svg: - os.remove(svgFilePath.show()) - -ClassPat = Dict[str, List[List[float]]] -def getClassesAndIdsFromDatasets(datasetsPaths :List[str], datasetPath :str, classPath :str, names :List[str]) -> Tuple[List[str], ClassPat, Dict[str, List[str]]]: - columnNames :Dict[str, List[str]] = {} # { datasetName : [ columnName, ... ], ... } - class_pat :ClassPat = {} - if ARGS.option == 'datasets': - num = 1 - for path, name in zip(datasetsPaths, names): - name = str(name) - if name == 'Dataset': - name += '_' + str(num) - - values, ids = getDatasetValues(path, name) - if values != None: - class_pat[name] = list(map(list, zip(*values.values()))) # TODO: ??? 
- columnNames[name] = ["Reactions", *values.keys()] - - num += 1 - - elif ARGS.option == "dataset_class": - classes = read_dataset(classPath, "class") - classes = classes.astype(str) - - values, ids = getDatasetValues(datasetPath, "Dataset Class (not actual name)") - if values != None: - class_pat_with_samples_id = split_class(classes, values) - - for clas, values_and_samples_id in class_pat_with_samples_id.items(): - class_pat[clas] = values_and_samples_id["values"] - columnNames[clas] = ["Reactions", *values_and_samples_id["samples"]] - - return ids, class_pat, columnNames - -def getDatasetValues(datasetPath :str, datasetName :str) -> Tuple[ClassPat, List[str]]: - """ - Opens the dataset at the given path and extracts the values (expected nullable numerics) and the IDs. - - Args: - datasetPath : path to the dataset - datasetName (str): dataset name, used in error reporting - - Returns: - Tuple[ClassPat, List[str]]: values and IDs extracted from the dataset - """ - dataset = read_dataset(datasetPath, datasetName) - IDs = pd.Series.tolist(dataset.iloc[:, 0].astype(str)) - - dataset = dataset.drop(dataset.columns[0], axis = "columns").to_dict("list") - return { id : list(map(utils.Float("Dataset values, not an argument"), values)) for id, values in dataset.items() }, IDs - -############################ MAIN ############################################# -def main(args:List[str] = None) -> None: - """ - Initializes everything and sets the program in motion based on the fronted input arguments. - - Returns: - None - - Raises: - sys.exit : if a user-provided custom map is in the wrong format (ET.XMLSyntaxError, ET.XMLSchemaParseError) - """ - global ARGS - ARGS = process_args(args) - - # Create output folder - if not os.path.isdir(ARGS.output_path): - os.makedirs(ARGS.output_path, exist_ok=True) - - core_map: ET.ElementTree = ARGS.choice_map.getMap( - ARGS.tool_dir, - utils.FilePath.fromStrPath(ARGS.custom_map) if ARGS.custom_map else None) - - # Prepare enrichment results containers - ras_results = [] - rps_results = [] - - # Compute RAS enrichment if requested - if ARGS.using_RAS: - ids_ras, class_pat_ras, _ = getClassesAndIdsFromDatasets( - ARGS.input_datas, ARGS.input_data, ARGS.input_class, ARGS.names) - ras_results, _ = computeEnrichment(class_pat_ras, ids_ras, fromRAS=True) - - - # Compute RPS enrichment if requested - if ARGS.using_RPS: - ids_rps, class_pat_rps, columnNames = getClassesAndIdsFromDatasets( - ARGS.input_datas_rps, ARGS.input_data_rps, ARGS.input_class_rps, ARGS.names_rps) - - rps_results, netRPS = computeEnrichment(class_pat_rps, ids_rps, fromRAS=False) - - # Organize by comparison pairs - comparisons: Dict[Tuple[str, str], Dict[str, Tuple]] = {} - for i, j, comparison_data, max_z_score in ras_results: - comparisons[(i, j)] = {'ras': (comparison_data, max_z_score), 'rps': None} - - for i, j, comparison_data, max_z_score, in rps_results: - comparisons.setdefault((i, j), {}).update({'rps': (comparison_data, max_z_score)}) - - # For each comparison, create a styled map with RAS bodies and RPS heads - for (i, j), res in comparisons.items(): - map_copy = copy.deepcopy(core_map) - - # Apply RAS styling to arrow bodies - if res.get('ras'): - tmp_ras, max_z_ras = res['ras'] - temp_thingsInCommon(tmp_ras, map_copy, max_z_ras, i, j, ras_enrichment=True) - - # Apply RPS styling to arrow heads - if res.get('rps'): - tmp_rps, max_z_rps = res['rps'] - - temp_thingsInCommon(tmp_rps, map_copy, max_z_rps, i, j, ras_enrichment=False) - - # Output both SVG and PDF/PNG as configured - 
createOutputMaps(i, j, map_copy) - - # Add net RPS output file - if ARGS.net or not ARGS.using_RAS: - for datasetName, rows in netRPS.items(): - writeToCsv( - [[reactId, *netValues] for reactId, netValues in rows.items()], - columnNames.get(datasetName, ["Reactions"]), - utils.FilePath( - "Net_RPS_" + datasetName, - ext = utils.FileFormat.CSV, - prefix = ARGS.output_path)) - - print('Execution succeeded') -############################################################################### -if __name__ == "__main__": - main() +""" +MAREA: Enrichment and map styling for RAS/RPS data. + +This module compares groups of samples using RAS (Reaction Activity Scores) and/or +RPS (Reaction Propensity Scores), computes statistics (p-values, z-scores, fold change), +and applies visual styling to an SVG metabolic map (with optional PDF/PNG export). +""" +from __future__ import division +import csv +from enum import Enum +import re +import sys +import numpy as np +import pandas as pd +import itertools as it +import scipy.stats as st +import lxml.etree as ET +import math +try: + from .utils import general_utils as utils +except: + import utils.general_utils as utils +from PIL import Image +import os +import argparse +import pyvips +from typing import Tuple, Union, Optional, List, Dict +import copy + +from pydeseq2.dds import DeseqDataSet +from pydeseq2.default_inference import DefaultInference +from pydeseq2.ds import DeseqStats + +ERRORS = [] +########################## argparse ########################################## +ARGS :argparse.Namespace +def process_args(args:List[str] = None) -> argparse.Namespace: + """ + Parse command-line arguments exposed by the Galaxy frontend for this module. + + Args: + args: Optional list of arguments, defaults to sys.argv when None. + + Returns: + Namespace: Parsed arguments. 
+ """ + parser = argparse.ArgumentParser( + usage = "%(prog)s [options]", + description = "process some value's genes to create a comparison's map.") + + #General: + parser.add_argument( + '-td', '--tool_dir', + type = str, + default = os.path.dirname(os.path.abspath(__file__)), + help = 'your tool directory (default: auto-detected package location)') + + parser.add_argument('-on', '--control', type = str) + parser.add_argument('-ol', '--out_log', help = "Output log") + + #Computation details: + parser.add_argument( + '-co', '--comparison', + type = str, + default = 'manyvsmany', + choices = ['manyvsmany', 'onevsrest', 'onevsmany']) + + parser.add_argument( + '-te' ,'--test', + type = str, + default = 'ks', + choices = ['ks', 'ttest_p', 'ttest_ind', 'wilcoxon', 'mw', 'DESeq'], + help = 'Statistical test to use (default: %(default)s)') + + parser.add_argument( + '-pv' ,'--pValue', + type = float, + default = 0.1, + help = 'P-Value threshold (default: %(default)s)') + + parser.add_argument( + '-adj' ,'--adjusted', + type = utils.Bool("adjusted"), default = False, + help = 'Apply the FDR (Benjamini-Hochberg) correction (default: %(default)s)') + + parser.add_argument( + '-fc', '--fChange', + type = float, + default = 1.5, + help = 'Fold-Change threshold (default: %(default)s)') + + parser.add_argument( + "-ne", "--net", + type = utils.Bool("net"), default = False, + help = "choose if you want net enrichment for RPS") + + parser.add_argument( + '-op', '--option', + type = str, + choices = ['datasets', 'dataset_class'], + help='dataset or dataset and class') + + #RAS: + parser.add_argument( + "-ra", "--using_RAS", + type = utils.Bool("using_RAS"), default = True, + help = "choose whether to use RAS datasets.") + + parser.add_argument( + '-id', '--input_data', + type = str, + help = 'input dataset') + + parser.add_argument( + '-ic', '--input_class', + type = str, + help = 'sample group specification') + + parser.add_argument( + '-ids', '--input_datas', + type = str, + nargs = '+', + help = 'input datasets') + + parser.add_argument( + '-na', '--names', + type = str, + nargs = '+', + help = 'input names') + + #RPS: + parser.add_argument( + "-rp", "--using_RPS", + type = utils.Bool("using_RPS"), default = False, + help = "choose whether to use RPS datasets.") + + parser.add_argument( + '-idr', '--input_data_rps', + type = str, + help = 'input dataset rps') + + parser.add_argument( + '-icr', '--input_class_rps', + type = str, + help = 'sample group specification rps') + + parser.add_argument( + '-idsr', '--input_datas_rps', + type = str, + nargs = '+', + help = 'input datasets rps') + + parser.add_argument( + '-nar', '--names_rps', + type = str, + nargs = '+', + help = 'input names rps') + + #Output: + parser.add_argument( + "-gs", "--generate_svg", + type = utils.Bool("generate_svg"), default = True, + help = "choose whether to use RAS datasets.") + + parser.add_argument( + "-gp", "--generate_pdf", + type = utils.Bool("generate_pdf"), default = True, + help = "choose whether to use RAS datasets.") + + parser.add_argument( + '-cm', '--custom_map', + type = str, + help='custom map to use') + + parser.add_argument( + '-idop', '--output_path', + type = str, + default='result', + help = 'output path for maps') + + parser.add_argument( + '-mc', '--choice_map', + type = utils.Model, default = utils.Model.HMRcore, + choices = [utils.Model.HMRcore, utils.Model.ENGRO2, utils.Model.Custom]) + + args :argparse.Namespace = parser.parse_args(args) + if args.using_RAS and not args.using_RPS: args.net = False + + 
return args + +############################ dataset input #################################### +def read_dataset(data :str, name :str) -> pd.DataFrame: + """ + Tries to read the dataset from its path (data) as a tsv and turns it into a DataFrame. + + Args: + data : filepath of a dataset (from frontend input params or literals upon calling) + name : name associated with the dataset (from frontend input params or literals upon calling) + + Returns: + pd.DataFrame : dataset in a runtime operable shape + + Raises: + sys.exit : if there's no data (pd.errors.EmptyDataError) or if the dataset has less than 2 columns + """ + try: + dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python') + except pd.errors.EmptyDataError: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + if len(dataset.columns) < 2: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + return dataset + +############################ map_methods ###################################### +FoldChange = Union[float, int, str] # Union[float, Literal[0, "-INF", "INF"]] +def fold_change(avg1 :float, avg2 :float) -> FoldChange: + """ + Calculates the fold change between two gene expression values. + + Args: + avg1 : average expression value from one dataset avg2 : average expression value from the other dataset + + Returns: + FoldChange : + 0 : when both input values are 0 + "-INF" : when avg1 is 0 + "INF" : when avg2 is 0 + float : for any other combination of values + """ + if avg1 == 0 and avg2 == 0: + return 0 + + if avg1 == 0: + return '-INF' # TODO: maybe fix + + if avg2 == 0: + return 'INF' + + # (threshold_F_C - 1) / (abs(threshold_F_C) + 1) con threshold_F_C > 1 + return (avg1 - avg2) / (abs(avg1) + abs(avg2)) + +# TODO: I would really like for this one to get the Thanos treatment +def fix_style(l :str, col :Optional[str], width :str, dash :str) -> str: + """ + Produces a "fixed" style string to assign to a reaction arrow in the SVG map, assigning style properties to the corresponding values passed as input params. + + Args: + l : current style string of an SVG element + col : new value for the "stroke" style property + width : new value for the "stroke-width" style property + dash : new value for the "stroke-dasharray" style property + + Returns: + str : the fixed style string + """ + tmp = l.split(';') + flag_col = False + flag_width = False + flag_dash = False + for i in range(len(tmp)): + if tmp[i].startswith('stroke:'): + tmp[i] = 'stroke:' + col + flag_col = True + if tmp[i].startswith('stroke-width:'): + tmp[i] = 'stroke-width:' + width + flag_width = True + if tmp[i].startswith('stroke-dasharray:'): + tmp[i] = 'stroke-dasharray:' + dash + flag_dash = True + if not flag_col: + tmp.append('stroke:' + col) + if not flag_width: + tmp.append('stroke-width:' + width) + if not flag_dash: + tmp.append('stroke-dasharray:' + dash) + return ';'.join(tmp) + +def fix_map(d :Dict[str, List[Union[float, FoldChange]]], core_map :ET.ElementTree, threshold_P_V :float, threshold_F_C :float, max_z_score :float) -> ET.ElementTree: + """ + Edits the selected SVG map based on the p-value and fold change data (d) and some significance thresholds also passed as inputs. 
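+
+    The user-facing fold-change threshold is first rescaled onto the bounded
+    (avg1 - avg2) / (|avg1| + |avg2|) scale returned by fold_change(), via
+    (threshold_F_C - 1) / (|threshold_F_C| + 1); e.g. the default threshold of
+    1.5 corresponds to 0.2 on that scale.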
+ + Args: + d : dictionary mapping a p-value and a fold-change value (values) to each reaction ID as encoded in the SVG map (keys) + core_map : SVG map to modify + threshold_P_V : threshold for a p-value to be considered significant + threshold_F_C : threshold for a fold change value to be considered significant + max_z_score : highest z-score (absolute value) + + Returns: + ET.ElementTree : the modified core_map + + Side effects: + core_map : mut + """ + maxT = 12 + minT = 2 + grey = '#BEBEBE' + blue = '#6495ed' + red = '#ecac68' + for el in core_map.iter(): + el_id = str(el.get('id')) + if el_id.startswith('R_'): + tmp = d.get(el_id[2:]) + if tmp != None: + p_val, f_c, z_score, avg1, avg2 = tmp + + if math.isnan(p_val) or (isinstance(f_c, float) and math.isnan(f_c)): continue + + if p_val <= threshold_P_V: # p-value is OK + if not isinstance(f_c, str): # FC is finite + if abs(f_c) < ((threshold_F_C - 1) / (abs(threshold_F_C) + 1)): # FC is not OK + col = grey + width = str(minT) + else: # FC is OK + if f_c < 0: + col = blue + elif f_c > 0: + col = red + width = str( + min( + max(abs(z_score * maxT) / max_z_score, minT), + maxT)) + + else: # FC is infinite + if f_c == '-INF': + col = blue + elif f_c == 'INF': + col = red + width = str(maxT) + dash = 'none' + else: # p-value is not OK + dash = '5,5' + col = grey + width = str(minT) + el.set('style', fix_style(el.get('style', ""), col, width, dash)) + return core_map + +def getElementById(reactionId :str, metabMap :ET.ElementTree) -> utils.Result[ET.Element, utils.Result.ResultErr]: + """ + Finds any element in the given map with the given ID. ID uniqueness in an svg file is recommended but + not enforced, if more than one element with the exact ID is found only the first will be returned. + + Args: + reactionId (str): exact ID of the requested element. + metabMap (ET.ElementTree): metabolic map containing the element. + + Returns: + utils.Result[ET.Element, ResultErr]: result of the search, either the first match found or a ResultErr. 
+ """ + return utils.Result.Ok( + f"//*[@id=\"{reactionId}\"]").map( + lambda xPath : metabMap.xpath(xPath)[0]).mapErr( + lambda _ : utils.Result.ResultErr(f"No elements with ID \"{reactionId}\" found in map")) + +def styleMapElement(element :ET.Element, styleStr :str) -> None: + """Append/override stroke-related styles on a given SVG element.""" + currentStyles :str = element.get("style", "") + if re.search(r";stroke:[^;]+;stroke-width:[^;]+;stroke-dasharray:[^;]+$", currentStyles): + currentStyles = ';'.join(currentStyles.split(';')[:-3]) + + element.set("style", currentStyles + styleStr) + +class ReactionDirection(Enum): + Unknown = "" + Direct = "_F" + Inverse = "_B" + + @classmethod + def fromDir(cls, s :str) -> "ReactionDirection": + # vvv as long as there's so few variants I actually condone the if spam: + if s == ReactionDirection.Direct.value: return ReactionDirection.Direct + if s == ReactionDirection.Inverse.value: return ReactionDirection.Inverse + return ReactionDirection.Unknown + + @classmethod + def fromReactionId(cls, reactionId :str) -> "ReactionDirection": + return ReactionDirection.fromDir(reactionId[-2:]) + +def getArrowBodyElementId(reactionId :str) -> str: + """Return the SVG element id for a reaction arrow body, normalizing direction tags.""" + if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV + elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: reactionId = reactionId[:-2] + return f"R_{reactionId}" + +def getArrowHeadElementId(reactionId :str) -> Tuple[str, str]: + """ + We attempt extracting the direction information from the provided reaction ID, if unsuccessful we provide the IDs of both directions. + + Args: + reactionId : the provided reaction ID. + + Returns: + Tuple[str, str]: either a single str ID for the correct arrow head followed by an empty string or both options to try. + """ + if reactionId.endswith("_RV"): reactionId = reactionId[:-3] #TODO: standardize _RV + elif ReactionDirection.fromReactionId(reactionId) is not ReactionDirection.Unknown: + return reactionId[:-3:-1] + reactionId[:-2], "" # ^^^ Invert _F to F_ + + return f"F_{reactionId}", f"B_{reactionId}" + +class ArrowColor(Enum): + """ + Encodes possible arrow colors based on their meaning in the enrichment process. + """ + Invalid = "#BEBEBE" # gray, fold-change under treshold or not significant p-value + Transparent = "#ffffff00" # transparent, to make some arrow segments disappear + UpRegulated = "#ecac68" # orange, up-regulated reaction + DownRegulated = "#6495ed" # lightblue, down-regulated reaction + + UpRegulatedInv = "#FF0000" # bright red for reversible with conflicting directions + + DownRegulatedInv = "#0000FF" # bright blue for reversible with conflicting directions + + @classmethod + def fromFoldChangeSign(cls, foldChange :float, *, useAltColor = False) -> "ArrowColor": + colors = (cls.DownRegulated, cls.DownRegulatedInv) if foldChange < 0 else (cls.UpRegulated, cls.UpRegulatedInv) + return colors[useAltColor] + + def __str__(self) -> str: return self.value + +class Arrow: + """ + Models the properties of a reaction arrow that change based on enrichment. + """ + MIN_W = 2 + MAX_W = 12 + + def __init__(self, width :int, col: ArrowColor, *, isDashed = False) -> None: + """ + (Private) Initializes an instance of Arrow. + + Args: + width : width of the arrow, ideally to be kept within Arrow.MIN_W and Arrow.MAX_W (not enforced). + col : color of the arrow. 
+ isDashed : whether the arrow should be dashed, meaning the associated pValue resulted not significant. + + Returns: + None : practically, a Arrow instance. + """ + self.w = width + self.col = col + self.dash = isDashed + + def applyTo(self, reactionId :str, metabMap :ET.ElementTree, styleStr :str) -> None: + if getElementById(reactionId, metabMap).map(lambda el : styleMapElement(el, styleStr)).isErr: + ERRORS.append(reactionId) + + def styleReactionElements(self, metabMap :ET.ElementTree, reactionId :str, *, mindReactionDir = True) -> None: + # If direction is irrelevant (e.g., RAS), style only the arrow body + if not mindReactionDir: + return self.applyTo(getArrowBodyElementId(reactionId), metabMap, self.toStyleStr()) + + # Now we style the arrow head(s): + idOpt1, idOpt2 = getArrowHeadElementId(reactionId) + self.applyTo(idOpt1, metabMap, self.toStyleStr(downSizedForTips = True)) + if idOpt2: self.applyTo(idOpt2, metabMap, self.toStyleStr(downSizedForTips = True)) + + def toStyleStr(self, *, downSizedForTips = False) -> str: + """ + Collapses the styles of this Arrow into a str, ready to be applied as part of the "style" property on an svg element. + + Returns: + str : the styles string. + """ + width = self.w + if downSizedForTips: width *= 0.8 + return f";stroke:{self.col};stroke-width:{width};stroke-dasharray:{'5,5' if self.dash else 'none'}" + +# Default arrows used for different significance states +INVALID_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid) +INSIGNIFICANT_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Invalid, isDashed = True) +TRANSPARENT_ARROW = Arrow(Arrow.MIN_W, ArrowColor.Transparent) # Who cares how big it is if it's transparent + +def applyRpsEnrichmentToMap(rpsEnrichmentRes :Dict[str, Union[Tuple[float, FoldChange], Tuple[float, FoldChange, float, float]]], metabMap :ET.ElementTree, maxNumericZScore :float) -> None: + """ + Applies RPS enrichment results to the provided metabolic map. + + Args: + rpsEnrichmentRes : RPS enrichment results. + metabMap : the metabolic map to edit. + maxNumericZScore : biggest finite z-score value found. 
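+            (used below to scale arrow widths as min(max(|z| * Arrow.MAX_W / maxNumericZScore, Arrow.MIN_W), Arrow.MAX_W))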
+ + Side effects: + metabMap : mut + + Returns: + None + """ + for reactionId, values in rpsEnrichmentRes.items(): + pValue = values[0] + foldChange = values[1] + z_score = values[2] + + if math.isnan(pValue) or (isinstance(foldChange, float) and math.isnan(foldChange)): continue + + if isinstance(foldChange, str): foldChange = float(foldChange) + if pValue > ARGS.pValue: # pValue above tresh: dashed arrow + INSIGNIFICANT_ARROW.styleReactionElements(metabMap, reactionId) + continue + + if abs(foldChange) < (ARGS.fChange - 1) / (abs(ARGS.fChange) + 1): + INVALID_ARROW.styleReactionElements(metabMap, reactionId) + continue + + width = Arrow.MAX_W + if not math.isinf(z_score): + try: width = min( + max(abs(z_score * Arrow.MAX_W) / maxNumericZScore, Arrow.MIN_W), + Arrow.MAX_W) + + except ZeroDivisionError: pass + + if not reactionId.endswith("_RV"): # RV stands for reversible reactions + Arrow(width, ArrowColor.fromFoldChangeSign(foldChange)).styleReactionElements(metabMap, reactionId) + continue + + reactionId = reactionId[:-3] # Remove "_RV" + + inversionScore = (values[3] < 0) + (values[4] < 0) # Compacts the signs of averages into 1 easy to check score + if inversionScore == 2: foldChange *= -1 + + # If the score is 1 (opposite signs) we use alternative colors vvv + arrow = Arrow(width, ArrowColor.fromFoldChangeSign(foldChange, useAltColor = inversionScore == 1)) + + # vvv These 2 if statements can both be true and can both happen + if ARGS.net: # style arrow head(s): + arrow.styleReactionElements(metabMap, reactionId + ("_B" if inversionScore == 2 else "_F")) + + if not ARGS.using_RAS: # style arrow body + arrow.styleReactionElements(metabMap, reactionId, mindReactionDir = False) + +############################ split class ###################################### +def split_class(classes :pd.DataFrame, dataset_values :Dict[str, List[float]]) -> Dict[str, List[List[float]]]: + """ + Generates a :dict that groups together data from a :DataFrame based on classes the data is related to. + + Args: + classes : a :DataFrame of only string values, containing class information (rows) and keys to query the resolve_rules :dict + dataset_values : a :dict containing :float data + + Returns: + dict : the dict with data grouped by class + + Side effects: + classes : mut + """ + class_pat :Dict[str, List[List[float]]] = {} + for i in range(len(classes)): + classe :str = classes.iloc[i, 1] + if pd.isnull(classe): continue + + l :List[List[float]] = [] + sample_ids: List[str] = [] + + for j in range(i, len(classes)): + if classes.iloc[j, 1] == classe: + pat_id :str = classes.iloc[j, 0] # sample name + values = dataset_values.get(pat_id, None) # the column of values for that sample + if values != None: + l.append(values) + sample_ids.append(pat_id) + classes.iloc[j, 1] = None # TODO: problems? + + if l: + class_pat[classe] = { + "values": list(map(list, zip(*l))), # transpose + "samples": sample_ids + } + continue + + utils.logWarning( + f"Warning: no sample found in class \"{classe}\", the class has been disregarded", ARGS.out_log) + + return class_pat + +############################ conversion ############################################## +# Conversion from SVG to PNG +def svg_to_png_with_background(svg_path :utils.FilePath, png_path :utils.FilePath, dpi :int = 72, scale :int = 1, size :Optional[float] = None) -> None: + """ + Internal utility to convert an SVG to PNG (forced opaque) to aid in PDF conversion. 
+ + Args: + svg_path : path to SVG file + png_path : path for new PNG file + dpi : dots per inch of the generated PNG + scale : scaling factor for the generated PNG, computed internally when a size is provided + size : final effective width of the generated PNG + + Returns: + None + """ + if size: + image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=1) + scale = size / image.width + image = image.resize(scale) + else: + image = pyvips.Image.new_from_file(svg_path.show(), dpi=dpi, scale=scale) + + white_background = pyvips.Image.black(image.width, image.height).new_from_image([255, 255, 255]) + white_background = white_background.affine([scale, 0, 0, scale]) + + if white_background.bands != image.bands: + white_background = white_background.extract_band(0) + + composite_image = white_background.composite2(image, 'over') + composite_image.write_to_file(png_path.show()) + +def convert_to_pdf(file_svg :utils.FilePath, file_png :utils.FilePath, file_pdf :utils.FilePath) -> None: + """ + Converts the SVG map at the provided path to PDF. + + Args: + file_svg : path to SVG file + file_png : path to PNG file + file_pdf : path to new PDF file + + Returns: + None + """ + svg_to_png_with_background(file_svg, file_png) + try: + image = Image.open(file_png.show()) + image = image.convert("RGB") + image.save(file_pdf.show(), "PDF", resolution=100.0) + print(f'PDF file {file_pdf.filePath} successfully generated.') + + except Exception as e: + raise utils.DataErr(file_pdf.show(), f'Error generating PDF file: {e}') + +############################ map ############################################## +def buildOutputPath(dataset1Name :str, dataset2Name = "rest", *, details = "", ext :utils.FileFormat) -> utils.FilePath: + """ + Builds a FilePath instance from the names of confronted datasets ready to point to a location in the + "result/" folder, used by this tool for output files in collections. + + Args: + dataset1Name : _description_ + dataset2Name : _description_. Defaults to "rest". 
+ details : _description_ + ext : _description_ + + Returns: + utils.FilePath : _description_ + """ + return utils.FilePath( + f"{dataset1Name}_vs_{dataset2Name}" + (f" ({details})" if details else ""), + ext, + prefix = ARGS.output_path) + +FIELD_NOT_AVAILABLE = '/' +def writeToCsv(rows: List[list], fieldNames :List[str], outPath :utils.FilePath) -> None: + fieldsAmt = len(fieldNames) + with open(outPath.show(), "w", newline = "") as fd: + writer = csv.DictWriter(fd, fieldnames = fieldNames, delimiter = '\t') + writer.writeheader() + + for row in rows: + sizeMismatch = fieldsAmt - len(row) + if sizeMismatch > 0: row.extend([FIELD_NOT_AVAILABLE] * sizeMismatch) + writer.writerow({ field : data for field, data in zip(fieldNames, row) }) + +OldEnrichedScores = Dict[str, List[Union[float, FoldChange]]] +def temp_thingsInCommon(tmp :OldEnrichedScores, core_map :ET.ElementTree, max_z_score :float, dataset1Name :str, dataset2Name = "rest", ras_enrichment = True) -> None: + suffix = "RAS" if ras_enrichment else "RPS" + writeToCsv( + [ [reactId] + values for reactId, values in tmp.items() ], + ["ids", "P_Value", "fold change", "z-score", "average_1", "average_2"], + buildOutputPath(dataset1Name, dataset2Name, details = f"Tabular Result ({suffix})", ext = utils.FileFormat.TSV)) + + if ras_enrichment: + fix_map(tmp, core_map, ARGS.pValue, ARGS.fChange, max_z_score) + return + + for reactId, enrichData in tmp.items(): tmp[reactId] = tuple(enrichData) + applyRpsEnrichmentToMap(tmp, core_map, max_z_score) + +def computePValue(dataset1Data: List[float], dataset2Data: List[float]) -> Tuple[float, float]: + """ + Computes the statistical significance score (P-value) of the comparison between coherent data + from two datasets. The data is supposed to, in both datasets: + - be related to the same reaction ID; + - be ordered by sample, such that the item at position i in both lists is related to the + same sample or cell line. + + Args: + dataset1Data : data from the 1st dataset. + dataset2Data : data from the 2nd dataset. + + Returns: + tuple: (P-value, Z-score) + - P-value from the selected test on the provided data. + - Z-score of the difference between means of the two datasets. 
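+
+        The z-score is the difference between the dataset means scaled by the
+        standard error: z = (mean1 - mean2) / sqrt(std1**2 / n1 + std2**2 / n2).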
+ """ + match ARGS.test: + case "ks": + # Perform Kolmogorov-Smirnov test + _, p_value = st.ks_2samp(dataset1Data, dataset2Data) + case "ttest_p": + # Datasets should have same size + if len(dataset1Data) != len(dataset2Data): + raise ValueError("Datasets must have the same size for paired t-test.") + # Perform t-test for paired samples + _, p_value = st.ttest_rel(dataset1Data, dataset2Data) + case "ttest_ind": + # Perform t-test for independent samples + _, p_value = st.ttest_ind(dataset1Data, dataset2Data) + case "wilcoxon": + # Datasets should have same size + if len(dataset1Data) != len(dataset2Data): + raise ValueError("Datasets must have the same size for Wilcoxon signed-rank test.") + # Perform Wilcoxon signed-rank test + np.random.seed(42) # Ensure reproducibility since zsplit method is used + _, p_value = st.wilcoxon(dataset1Data, dataset2Data, zero_method='zsplit') + case "mw": + # Perform Mann-Whitney U test + _, p_value = st.mannwhitneyu(dataset1Data, dataset2Data) + case _: + p_value = np.nan # Default value if no valid test is selected + + # Calculate means and standard deviations + mean1 = np.mean(dataset1Data) + mean2 = np.mean(dataset2Data) + std1 = np.std(dataset1Data, ddof=1) + std2 = np.std(dataset2Data, ddof=1) + + n1 = len(dataset1Data) + n2 = len(dataset2Data) + + # Calculate Z-score + z_score = (mean1 - mean2) / np.sqrt((std1**2 / n1) + (std2**2 / n2)) + + return p_value, z_score + + +def DESeqPValue(comparisonResult :Dict[str, List[Union[float, FoldChange]]], dataset1Data :List[List[float]], dataset2Data :List[List[float]], ids :List[str]) -> None: + """ + Computes the p-value for each reaction in the comparisonResult dictionary using DESeq2. + + Args: + comparisonResult : dictionary mapping a p-value and a fold-change value (values) to each reaction ID as encoded in the SVG map (keys) + dataset1Data : data from the 1st dataset. + dataset2Data : data from the 2nd dataset. + ids : list of reaction IDs. + + Returns: + None : mutates the comparisonResult dictionary in place with the p-values. 
+ """ + + # pyDESeq2 needs at least 2 replicates per sample so I check this + if len(dataset1Data[0]) < 2 or len(dataset2Data[0]) < 2: + raise ValueError("Datasets must have at least 2 replicates each") + + # pyDESeq2 is based on pandas, so we need to convert the data into a DataFrame and clean it from NaN values + dataframe1 = pd.DataFrame(dataset1Data, index=ids) + dataframe2 = pd.DataFrame(dataset2Data, index=ids) + + # pyDESeq2 requires datasets to be samples x reactions and integer values + dataframe1_clean = dataframe1.dropna(axis=0, how="any").T.astype(int) + dataframe2_clean = dataframe2.dropna(axis=0, how="any").T.astype(int) + dataframe1_clean.index = [f"ds1_rep{i+1}" for i in range(dataframe1_clean.shape[0])] + dataframe2_clean.index = [f"ds2_rep{j+1}" for j in range(dataframe2_clean.shape[0])] + + # pyDESeq2 works on a DataFrame with values and another with infos about how samples are split (like dataset class) + dataframe = pd.concat([dataframe1_clean, dataframe2_clean], axis=0) + metadata = pd.DataFrame({"dataset": (["dataset1"]*dataframe1_clean.shape[0] + ["dataset2"]*dataframe2_clean.shape[0])}, index=dataframe.index) + + # Ensure the index of the metadata matches the index of the dataframe + if not dataframe.index.equals(metadata.index): + raise ValueError("The index of the metadata DataFrame must match the index of the counts DataFrame.") + + # Prepare and run pyDESeq2 + inference = DefaultInference() + dds = DeseqDataSet(counts=dataframe, metadata=metadata, design="~dataset", inference=inference, quiet=True, low_memory=True) + dds.deseq2() + ds = DeseqStats(dds, contrast=["dataset", "dataset1", "dataset2"], inference=inference, quiet=True) + ds.summary() + + # Retrieve the p-values from the DESeq2 results + for reactId in ds.results_df.index: + comparisonResult[reactId][0] = ds.results_df["pvalue"][reactId] + + +# TODO: the net RPS computation should be done in the RPS module +def compareDatasetPair(dataset1Data :List[List[float]], dataset2Data :List[List[float]], ids :List[str]) -> Tuple[Dict[str, List[Union[float, FoldChange]]], float, Dict[str, Tuple[np.ndarray, np.ndarray]]]: + + netRPS :Dict[str, Tuple[np.ndarray, np.ndarray]] = {} + comparisonResult :Dict[str, List[Union[float, FoldChange]]] = {} + count = 0 + max_z_score = 0 + + for l1, l2 in zip(dataset1Data, dataset2Data): + reactId = ids[count] + count += 1 + if not reactId: continue + + try: #TODO: identify the source of these errors and minimize code in the try block + reactDir = ReactionDirection.fromReactionId(reactId) + # Net score is computed only for reversible reactions when user wants it on arrow tips or when RAS datasets aren't used + if (ARGS.net or not ARGS.using_RAS) and reactDir is not ReactionDirection.Unknown: + try: position = ids.index(reactId[:-1] + ('B' if reactDir is ReactionDirection.Direct else 'F')) + except ValueError: continue # we look for the complementary id, if not found we skip + + nets1 = np.subtract(l1, dataset1Data[position]) + nets2 = np.subtract(l2, dataset2Data[position]) + netRPS[reactId] = (nets1, nets2) + + # Compute p-value and z-score for the RPS scores, if the pyDESeq option is set, p-values will be computed after and this function will return p_value = 0 + p_value, z_score = computePValue(nets1, nets2) + avg1 = sum(nets1) / len(nets1) + avg2 = sum(nets2) / len(nets2) + net = fold_change(avg1, avg2) + + if math.isnan(net): continue + comparisonResult[reactId[:-1] + "RV"] = [p_value, net, z_score, avg1, avg2] + + # vvv complementary directional ids are set to None once 
processed if net is to be applied to tips + if ARGS.net: # If only using RPS, we cannot delete the inverse, as it's needed to color the arrows + ids[position] = None + continue + + # fallthrough is intended, regular scores need to be computed when tips aren't net but RAS datasets aren't used + # Compute p-value and z-score for the RAS scores, if the pyDESeq option is set, p-values will be computed after and this function will return p_value = 0 + p_value, z_score = computePValue(l1, l2) + avg = fold_change(sum(l1) / len(l1), sum(l2) / len(l2)) + # vvv TODO: Check numpy version compatibility + if np.isfinite(z_score) and max_z_score < abs(z_score): max_z_score = abs(z_score) + comparisonResult[reactId] = [float(p_value), avg, z_score, sum(l1) / len(l1), sum(l2) / len(l2)] + + except (TypeError, ZeroDivisionError): continue + + if ARGS.test == "DESeq": + # Compute p-values using DESeq2 + DESeqPValue(comparisonResult, dataset1Data, dataset2Data, ids) + + # Apply multiple testing correction if set by the user + if ARGS.adjusted: + + # Retrieve the p-values from the comparisonResult dictionary, they have to be different from NaN + validPValues = [(reactId, result[0]) for reactId, result in comparisonResult.items() if not np.isnan(result[0])] + # Unpack the valid p-values + reactIds, pValues = zip(*validPValues) + # Adjust the p-values using the Benjamini-Hochberg method + adjustedPValues = st.false_discovery_control(pValues) + # Update the comparisonResult dictionary with the adjusted p-values + for reactId , adjustedPValue in zip(reactIds, adjustedPValues): + comparisonResult[reactId][0] = adjustedPValue + + return comparisonResult, max_z_score, netRPS + +def computeEnrichment(class_pat: Dict[str, List[List[float]]], ids: List[str], *, fromRAS=True) -> Tuple[List[Tuple[str, str, dict, float]], dict]: + """ + Compares clustered data based on a given comparison mode and applies enrichment-based styling on the + provided metabolic map. + + Args: + class_pat : the clustered data. + ids : ids for data association. + fromRAS : whether the data to enrich consists of RAS scores. + + Returns: + tuple: A tuple containing: + - List[Tuple[str, str, dict, float]]: List of tuples with pairs of dataset names, comparison dictionary and max z-score. + - dict : net RPS values for each dataset's reactions + + Raises: + sys.exit : if there are less than 2 classes for comparison + """ + class_pat = {k.strip(): v for k, v in class_pat.items()} + if (not class_pat) or (len(class_pat.keys()) < 2): + sys.exit('Execution aborted: classes provided for comparisons are less than two\n') + + # { datasetName : { reactId : netRPS, ... }, ... 
} + netRPSResults :Dict[str, Dict[str, np.ndarray]] = {} + enrichment_results = [] + + if ARGS.comparison == "manyvsmany": + for i, j in it.combinations(class_pat.keys(), 2): + comparisonDict, max_z_score, netRPS = compareDatasetPair(class_pat.get(i), class_pat.get(j), ids) + enrichment_results.append((i, j, comparisonDict, max_z_score)) + netRPSResults[i] = { reactId : net[0] for reactId, net in netRPS.items() } + netRPSResults[j] = { reactId : net[1] for reactId, net in netRPS.items() } + + elif ARGS.comparison == "onevsrest": + for single_cluster in class_pat.keys(): + rest = [item for k, v in class_pat.items() if k != single_cluster for item in v] + comparisonDict, max_z_score, netRPS = compareDatasetPair(class_pat.get(single_cluster), rest, ids) + enrichment_results.append((single_cluster, "rest", comparisonDict, max_z_score)) + netRPSResults[single_cluster] = { reactId : net[0] for reactId, net in netRPS.items() } + netRPSResults["rest"] = { reactId : net[1] for reactId, net in netRPS.items() } + + elif ARGS.comparison == "onevsmany": + controlItems = class_pat.get(ARGS.control) + for otherDataset in class_pat.keys(): + if otherDataset == ARGS.control: + continue + + #comparisonDict, max_z_score, netRPS = compareDatasetPair(controlItems, class_pat.get(otherDataset), ids) + comparisonDict, max_z_score, netRPS = compareDatasetPair(class_pat.get(otherDataset),controlItems, ids) + #enrichment_results.append((ARGS.control, otherDataset, comparisonDict, max_z_score)) + enrichment_results.append(( otherDataset,ARGS.control, comparisonDict, max_z_score)) + netRPSResults[otherDataset] = { reactId : net[0] for reactId, net in netRPS.items() } + netRPSResults[ARGS.control] = { reactId : net[1] for reactId, net in netRPS.items() } + + return enrichment_results, netRPSResults + +def createOutputMaps(dataset1Name: str, dataset2Name: str, core_map: ET.ElementTree) -> None: + svgFilePath = buildOutputPath(dataset1Name, dataset2Name, details="SVG Map", ext=utils.FileFormat.SVG) + utils.writeSvg(svgFilePath, core_map) + + if ARGS.generate_pdf: + pngPath = buildOutputPath(dataset1Name, dataset2Name, details="PNG Map", ext=utils.FileFormat.PNG) + pdfPath = buildOutputPath(dataset1Name, dataset2Name, details="PDF Map", ext=utils.FileFormat.PDF) + svg_to_png_with_background(svgFilePath, pngPath) + try: + image = Image.open(pngPath.show()) + image = image.convert("RGB") + image.save(pdfPath.show(), "PDF", resolution=100.0) + print(f'PDF file {pdfPath.filePath} successfully generated.') + + except Exception as e: + raise utils.DataErr(pdfPath.show(), f'Error generating PDF file: {e}') + + if not ARGS.generate_svg: + os.remove(svgFilePath.show()) + +ClassPat = Dict[str, List[List[float]]] +def getClassesAndIdsFromDatasets(datasetsPaths :List[str], datasetPath :str, classPath :str, names :List[str]) -> Tuple[List[str], ClassPat, Dict[str, List[str]]]: + columnNames :Dict[str, List[str]] = {} # { datasetName : [ columnName, ... ], ... } + class_pat :ClassPat = {} + if ARGS.option == 'datasets': + num = 1 + for path, name in zip(datasetsPaths, names): + name = str(name) + if name == 'Dataset': + name += '_' + str(num) + + values, ids = getDatasetValues(path, name) + if values != None: + class_pat[name] = list(map(list, zip(*values.values()))) # TODO: ??? 
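+                # class_pat[name] is now reactions x samples: zip(*values.values()) transposes
+                # the per-sample columns, e.g. {"s1": [1, 2], "s2": [3, 4]} -> [[1, 3], [2, 4]].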
+ columnNames[name] = ["Reactions", *values.keys()] + + num += 1 + + elif ARGS.option == "dataset_class": + classes = read_dataset(classPath, "class") + classes = classes.astype(str) + + values, ids = getDatasetValues(datasetPath, "Dataset Class (not actual name)") + if values != None: + class_pat_with_samples_id = split_class(classes, values) + + for clas, values_and_samples_id in class_pat_with_samples_id.items(): + class_pat[clas] = values_and_samples_id["values"] + columnNames[clas] = ["Reactions", *values_and_samples_id["samples"]] + + return ids, class_pat, columnNames + +def getDatasetValues(datasetPath :str, datasetName :str) -> Tuple[ClassPat, List[str]]: + """ + Opens the dataset at the given path and extracts the values (expected nullable numerics) and the IDs. + + Args: + datasetPath : path to the dataset + datasetName (str): dataset name, used in error reporting + + Returns: + Tuple[ClassPat, List[str]]: values and IDs extracted from the dataset + """ + dataset = read_dataset(datasetPath, datasetName) + IDs = pd.Series.tolist(dataset.iloc[:, 0].astype(str)) + + dataset = dataset.drop(dataset.columns[0], axis = "columns").to_dict("list") + return { id : list(map(utils.Float("Dataset values, not an argument"), values)) for id, values in dataset.items() }, IDs + +############################ MAIN ############################################# +def main(args:List[str] = None) -> None: + """ + Initializes everything and sets the program in motion based on the fronted input arguments. + + Returns: + None + + Raises: + sys.exit : if a user-provided custom map is in the wrong format (ET.XMLSyntaxError, ET.XMLSchemaParseError) + """ + global ARGS + ARGS = process_args(args) + + # Create output folder + if not os.path.isdir(ARGS.output_path): + os.makedirs(ARGS.output_path, exist_ok=True) + + core_map: ET.ElementTree = ARGS.choice_map.getMap( + ARGS.tool_dir, + utils.FilePath.fromStrPath(ARGS.custom_map) if ARGS.custom_map else None) + + # Prepare enrichment results containers + ras_results = [] + rps_results = [] + + # Compute RAS enrichment if requested + if ARGS.using_RAS: + ids_ras, class_pat_ras, _ = getClassesAndIdsFromDatasets( + ARGS.input_datas, ARGS.input_data, ARGS.input_class, ARGS.names) + ras_results, _ = computeEnrichment(class_pat_ras, ids_ras, fromRAS=True) + + + # Compute RPS enrichment if requested + if ARGS.using_RPS: + ids_rps, class_pat_rps, columnNames = getClassesAndIdsFromDatasets( + ARGS.input_datas_rps, ARGS.input_data_rps, ARGS.input_class_rps, ARGS.names_rps) + + rps_results, netRPS = computeEnrichment(class_pat_rps, ids_rps, fromRAS=False) + + # Organize by comparison pairs + comparisons: Dict[Tuple[str, str], Dict[str, Tuple]] = {} + for i, j, comparison_data, max_z_score in ras_results: + comparisons[(i, j)] = {'ras': (comparison_data, max_z_score), 'rps': None} + + for i, j, comparison_data, max_z_score, in rps_results: + comparisons.setdefault((i, j), {}).update({'rps': (comparison_data, max_z_score)}) + + # For each comparison, create a styled map with RAS bodies and RPS heads + for (i, j), res in comparisons.items(): + map_copy = copy.deepcopy(core_map) + + # Apply RAS styling to arrow bodies + if res.get('ras'): + tmp_ras, max_z_ras = res['ras'] + temp_thingsInCommon(tmp_ras, map_copy, max_z_ras, i, j, ras_enrichment=True) + + # Apply RPS styling to arrow heads + if res.get('rps'): + tmp_rps, max_z_rps = res['rps'] + + temp_thingsInCommon(tmp_rps, map_copy, max_z_rps, i, j, ras_enrichment=False) + + # Output both SVG and PDF/PNG as configured + 
createOutputMaps(i, j, map_copy) + + # Add net RPS output file + if ARGS.net or not ARGS.using_RAS: + for datasetName, rows in netRPS.items(): + writeToCsv( + [[reactId, *netValues] for reactId, netValues in rows.items()], + columnNames.get(datasetName, ["Reactions"]), + utils.FilePath( + "Net_RPS_" + datasetName, + ext = utils.FileFormat.CSV, + prefix = ARGS.output_path)) + + print('Execution succeeded') +############################################################################### +if __name__ == "__main__": + main()
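
For context, a minimal usage sketch of the module above (added for illustration only; the file names are hypothetical and the import assumes `COBRAxy/src` is on `sys.path`). Since `main()` accepts the same flags as the command line, a two-group RAS comparison could be driven as:

```python
# Hedged sketch: compare two RAS tables with the Kolmogorov-Smirnov test and
# write the styled maps and tabular results under "result/".
from marea import main  # assumes COBRAxy/src is importable

main([
    "-op", "datasets",                          # one RAS table per group
    "-ids", "tumor_ras.tsv", "normal_ras.tsv",  # hypothetical input files
    "-na", "Tumor", "Normal",                   # group names used in output file names
    "-co", "manyvsmany",                        # compare every pair of groups
    "-te", "ks",                                # statistical test
    "-pv", "0.05",                              # p-value threshold
    "-fc", "1.5",                               # fold-change threshold
    "-idop", "result",                          # output directory
])
```
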
--- a/COBRAxy/src/marea_cluster.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/marea_cluster.py Sun Oct 26 19:27:41 2025 +0000 @@ -77,8 +77,8 @@ parser.add_argument('-td', '--tool_dir', type = str, - required = True, - help = 'your tool directory') + default = os.path.dirname(os.path.abspath(__file__)), + help = 'your tool directory (default: auto-detected package location)') parser.add_argument('-ms', '--min_samples', type = int,
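
The hunk above applies the same default-`tool_dir` change as in marea.py: the argument is no longer required and falls back to the directory containing the module (which relies on `os` being imported there). A self-contained sketch of the pattern, not part of the changeset:

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Auto-detect the tool directory instead of forcing the caller to pass -td.
parser.add_argument('-td', '--tool_dir',
                    type=str,
                    default=os.path.dirname(os.path.abspath(__file__)),
                    help='your tool directory (default: auto-detected package location)')

args = parser.parse_args([])   # no -td supplied on the command line
print(args.tool_dir)           # -> the directory holding this script
```
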
--- a/COBRAxy/src/ras_generator.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/ras_generator.py Sun Oct 26 19:27:41 2025 +0000 @@ -1,574 +1,578 @@ -""" -Generate Reaction Activity Scores (RAS) from a gene expression dataset and GPR rules. - -The script reads a tabular dataset (genes x samples) and a rules file (GPRs), -computes RAS per reaction for each sample/cell line, and writes a tabular output. -""" -from __future__ import division -import sys -import argparse -import pandas as pd -import numpy as np -import utils.general_utils as utils -from typing import List, Dict -import ast - -# Optional imports for AnnData mode (not used in ras_generator.py) -try: - from progressbar import ProgressBar, Bar, Percentage - from scanpy import AnnData - from cobra.flux_analysis.variability import find_essential_reactions, find_essential_genes -except ImportError: - # These are only needed for AnnData mode, not for ras_generator.py - pass - -ERRORS = [] -########################## argparse ########################################## -ARGS :argparse.Namespace -def process_args(args:List[str] = None) -> argparse.Namespace: - """ - Processes command-line arguments. - - Args: - args (list): List of command-line arguments. - - Returns: - Namespace: An object containing parsed arguments. - """ - parser = argparse.ArgumentParser( - usage = '%(prog)s [options]', - description = "process some value's genes to create a comparison's map.") - - parser.add_argument("-rl", "--model_upload", type = str, - help = "path to input file containing the rules") - - parser.add_argument("-rn", "--model_upload_name", type = str, help = "custom rules name") - # Galaxy converts files into .dat, this helps infer the original extension when needed. - - parser.add_argument( - '-n', '--none', - type = utils.Bool("none"), default = True, - help = 'compute Nan values') - - parser.add_argument( - '-td', '--tool_dir', - type = str, - required = True, help = 'your tool directory') - - parser.add_argument( - '-ol', '--out_log', - type = str, - help = "Output log") - - parser.add_argument( - '-in', '--input', - type = str, - help = 'input dataset') - - parser.add_argument( - '-ra', '--ras_output', - type = str, - required = True, help = 'ras output') - - - return parser.parse_args(args) - -############################ dataset input #################################### -def read_dataset(data :str, name :str) -> pd.DataFrame: - """ - Read a dataset from a CSV file and return it as a pandas DataFrame. - - Args: - data (str): Path to the CSV file containing the dataset. - name (str): Name of the dataset, used in error messages. - - Returns: - pandas.DataFrame: DataFrame containing the dataset. - - Raises: - pd.errors.EmptyDataError: If the CSV file is empty. - sys.exit: If the CSV file has the wrong format, the execution is aborted. - """ - try: - dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python', index_col=0) - dataset = dataset.astype(float) - except pd.errors.EmptyDataError: - sys.exit('Execution aborted: wrong file format of ' + name + '\n') - if len(dataset.columns) < 2: - sys.exit('Execution aborted: wrong file format of ' + name + '\n') - return dataset - - -def load_custom_rules() -> Dict[str,str]: - """ - Opens custom rules file and extracts the rules. If the file is in .csv format an additional parsing step will be - performed, significantly impacting the runtime. - - Returns: - Dict[str, ruleUtils.OpList] : dict mapping reaction IDs to rules. 
- """ - datFilePath = utils.FilePath.fromStrPath(ARGS.model_upload) # actual file, stored in Galaxy as a .dat - - dict_rule = {} - - try: - rows = utils.readCsv(datFilePath, delimiter = "\t", skipHeader=False) - if len(rows) <= 1: - raise ValueError("Model tabular with 1 column is not supported.") - - if not rows: - raise ValueError("Model tabular is file is empty.") - - id_idx, idx_gpr = utils.findIdxByName(rows[0], "GPR") - - # First, try using a tab delimiter - for line in rows[1:]: - if len(line) <= idx_gpr: - utils.logWarning(f"Skipping malformed line: {line}", ARGS.out_log) - continue - - dict_rule[line[id_idx]] = line[idx_gpr] - - except Exception as e: - # If parsing with tabs fails, try comma delimiter - try: - rows = utils.readCsv(datFilePath, delimiter = ",", skipHeader=False) - - if len(rows) <= 1: - raise ValueError("Model tabular with 1 column is not supported.") - - if not rows: - raise ValueError("Model tabular is file is empty.") - - id_idx, idx_gpr = utils.findIdxByName(rows[0], "GPR") - - # Try again parsing row content with the GPR column using comma-separated values - for line in rows[1:]: - if len(line) <= idx_gpr: - utils.logWarning(f"Skipping malformed line: {line}", ARGS.out_log) - continue - - dict_rule[line[id_idx]] =line[idx_gpr] - - except Exception as e2: - raise ValueError(f"Unable to parse rules file. Tried both tab and comma delimiters. Original errors: Tab: {e}, Comma: {e2}") - - if not dict_rule: - raise ValueError("No valid rules found in the uploaded file. Please check the file format.") - # csv rules need to be parsed, those in a pickle format are taken to be pre-parsed. - return dict_rule - - -""" -Class to compute the RAS values - -""" - -class RAS_computation: - - def __init__(self, adata=None, model=None, dataset=None, gene_rules=None, rules_total_string=None): - """ - Initialize RAS computation with two possible input modes: - - Mode 1 (Original - for sampling_main.py): - adata: AnnData object with gene expression (cells × genes) - model: COBRApy model object with reactions and GPRs - - Mode 2 (New - for ras_generator.py): - dataset: pandas DataFrame with gene expression (genes × samples) - gene_rules: dict mapping reaction IDs to GPR strings - rules_total_string: list of all gene names in GPRs (for validation) - """ - self._logic_operators = ['and', 'or', '(', ')'] - self.val_nan = np.nan - - # Determine which mode we're in - if adata is not None and model is not None: - # Mode 1: AnnData + COBRApy model (original) - self._init_from_anndata(adata, model) - elif dataset is not None and gene_rules is not None: - # Mode 2: DataFrame + rules dict (ras_generator style) - self._init_from_dataframe(dataset, gene_rules, rules_total_string) - else: - raise ValueError( - "Invalid initialization. 
Provide either:\n" - " - adata + model (for AnnData input), or\n" - " - dataset + gene_rules (for DataFrame input)" - ) - - def _normalize_gene_name(self, gene_name): - """Normalize gene names by replacing special characters.""" - return gene_name.replace("-", "_").replace(":", "_") - - def _normalize_rule(self, rule): - """Normalize GPR rule: lowercase operators, add spaces around parentheses, normalize gene names.""" - rule = rule.replace("OR", "or").replace("AND", "and") - rule = rule.replace("(", "( ").replace(")", " )") - # Normalize gene names in the rule - tokens = rule.split() - normalized_tokens = [token if token in self._logic_operators else self._normalize_gene_name(token) for token in tokens] - return " ".join(normalized_tokens) - - def _init_from_anndata(self, adata, model): - """Initialize from AnnData and COBRApy model (original mode).""" - # Build the dictionary for the GPRs - df_reactions = pd.DataFrame(index=[reaction.id for reaction in model.reactions]) - gene_rules = [self._normalize_rule(reaction.gene_reaction_rule) for reaction in model.reactions] - df_reactions['rule'] = gene_rules - df_reactions = df_reactions.reset_index() - df_reactions = df_reactions.groupby('rule').agg(lambda x: sorted(list(x))) - - self.dict_rule_reactions = df_reactions.to_dict()['index'] - - # build useful structures for RAS computation - self.model = model - self.count_adata = adata.copy() - - # Normalize gene names in both model and dataset - model_genes = [self._normalize_gene_name(gene.id) for gene in model.genes] - dataset_genes = [self._normalize_gene_name(gene) for gene in self.count_adata.var.index] - self.genes = pd.Index(dataset_genes).intersection(model_genes) - - if len(self.genes) == 0: - raise ValueError("ERROR: No genes from the count matrix match the metabolic model. 
Check that gene annotations are consistent between model and dataset.") - - self.cell_ids = list(self.count_adata.obs.index.values) - # Get expression data with normalized gene names - self.count_df_filtered = self.count_adata.to_df().T - self.count_df_filtered.index = [self._normalize_gene_name(g) for g in self.count_df_filtered.index] - self.count_df_filtered = self.count_df_filtered.loc[self.genes] - - def _init_from_dataframe(self, dataset, gene_rules, rules_total_string): - """Initialize from DataFrame and rules dict (ras_generator mode).""" - reactions = list(gene_rules.keys()) - - # Build the dictionary for the GPRs - df_reactions = pd.DataFrame(index=reactions) - gene_rules_list = [self._normalize_rule(gene_rules[reaction_id]) for reaction_id in reactions] - df_reactions['rule'] = gene_rules_list - df_reactions = df_reactions.reset_index() - df_reactions = df_reactions.groupby('rule').agg(lambda x: sorted(list(x))) - - self.dict_rule_reactions = df_reactions.to_dict()['index'] - - # build useful structures for RAS computation - self.model = None - self.count_adata = None - - # Normalize gene names in dataset - dataset_normalized = dataset.copy() - dataset_normalized.index = [self._normalize_gene_name(g) for g in dataset_normalized.index] - - # Determine which genes are in both dataset and GPRs - if rules_total_string is not None: - rules_genes = [self._normalize_gene_name(g) for g in rules_total_string] - self.genes = dataset_normalized.index.intersection(rules_genes) - else: - # Extract all genes from rules - all_genes_in_rules = set() - for rule in gene_rules_list: - tokens = rule.split() - for token in tokens: - if token not in self._logic_operators: - all_genes_in_rules.add(token) - self.genes = dataset_normalized.index.intersection(all_genes_in_rules) - - if len(self.genes) == 0: - raise ValueError("ERROR: No genes from the count matrix match the metabolic model. Check that gene annotations are consistent between model and dataset.") - - self.cell_ids = list(dataset_normalized.columns) - self.count_df_filtered = dataset_normalized.loc[self.genes] - - def compute(self, - or_expression=np.sum, # type of operation to do in case of an or expression (sum, max, mean) - and_expression=np.min, # type of operation to do in case of an and expression(min, sum) - drop_na_rows=False, # if True remove the nan rows of the ras matrix - drop_duplicates=False, # if true, remove duplicates rows - ignore_nan=True, # if True, ignore NaN values in GPR evaluation (e.g., A or NaN -> A) - print_progressbar=True, # if True, print the progress bar - add_count_metadata=True, # if True add metadata of cells in the ras adata - add_met_metadata=True, # if True add metadata from the metabolic model (gpr and compartments of reactions) - add_essential_reactions=False, - add_essential_genes=False - ): - - self.or_function = or_expression - self.and_function = and_expression - - ras_df = np.full((len(self.dict_rule_reactions), len(self.cell_ids)), np.nan) - genes_not_mapped = set() # Track genes not in dataset - - if print_progressbar: - pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(self.dict_rule_reactions)).start() - - # Process each unique GPR rule - for ind, (rule, reaction_ids) in enumerate(self.dict_rule_reactions.items()): - if len(rule) == 0: - # Empty rule - keep as NaN - pass - else: - # Extract genes from rule - rule_genes = [token for token in rule.split() if token not in self._logic_operators] - rule_genes_unique = list(set(rule_genes)) - - # Which genes are in the dataset? 
- genes_present = [g for g in rule_genes_unique if g in self.genes] - genes_missing = [g for g in rule_genes_unique if g not in self.genes] - - if genes_missing: - genes_not_mapped.update(genes_missing) - - if len(genes_present) == 0: - # No genes in dataset - keep as NaN - pass - elif len(genes_missing) > 0 and not ignore_nan: - # Some genes missing and we don't ignore NaN - set to NaN - pass - else: - # Evaluate the GPR expression using AST - # For single gene, AST handles it fine: ast.parse("GENE_A") works - # more genes in the formula - check_only_and=("and" in rule and "or" not in rule) #only and - check_only_or=("or" in rule and "and" not in rule) #only or - if check_only_and or check_only_or: - #or/and sequence - matrix = self.count_df_filtered.loc[genes_present].values - #compute for all cells - if check_only_and: - ras_df[ind] = self.and_function(matrix, axis=0) - else: - ras_df[ind] = self.or_function(matrix, axis=0) - else: - # complex expression (e.g. A or (B and C)) - data = self.count_df_filtered.loc[genes_present] # dataframe of genes in the GPRs - tree = ast.parse(rule, mode="eval").body - values_by_cell = [dict(zip(data.index, data[col].values)) for col in data.columns] - for j, values in enumerate(values_by_cell): - ras_df[ind, j] =self._evaluate_ast(tree, values, self.or_function, self.and_function, ignore_nan) - - if print_progressbar: - pbar.update(ind + 1) - - if print_progressbar: - pbar.finish() - - # Store genes not mapped for later use - self.genes_not_mapped = sorted(genes_not_mapped) - - # create the dataframe of ras (rules x samples) - ras_df = pd.DataFrame(data=ras_df, index=range(len(self.dict_rule_reactions)), columns=self.cell_ids) - ras_df['Reactions'] = [reaction_ids for rule, reaction_ids in self.dict_rule_reactions.items()] - - reactions_common = pd.DataFrame() - reactions_common["Reactions"] = ras_df['Reactions'] - reactions_common["proof2"] = ras_df['Reactions'] - reactions_common = reactions_common.explode('Reactions') - reactions_common = reactions_common.set_index("Reactions") - - ras_df = ras_df.explode("Reactions") - ras_df = ras_df.set_index("Reactions") - - if drop_na_rows: - ras_df = ras_df.dropna(how="all") - - if drop_duplicates: - ras_df = ras_df.drop_duplicates() - - # If initialized from DataFrame (ras_generator mode), return DataFrame instead of AnnData - if self.count_adata is None: - return ras_df, self.genes_not_mapped - - # Original AnnData mode: create AnnData structure for RAS - ras_adata = AnnData(ras_df.T) - - #add metadata - if add_count_metadata: - ras_adata.var["common_gprs"] = reactions_common.loc[ras_df.index] - ras_adata.var["common_gprs"] = ras_adata.var["common_gprs"].apply(lambda x: ",".join(x)) - for el in self.count_adata.obs.columns: - ras_adata.obs["countmatrix_"+el]=self.count_adata.obs[el] - - if add_met_metadata: - if self.model is not None and len(self.model.compartments)>0: - ras_adata.var['compartments']=[list(self.model.reactions.get_by_id(reaction).compartments) for reaction in ras_adata.var.index] - ras_adata.var['compartments']=ras_adata.var["compartments"].apply(lambda x: ",".join(x)) - - if self.model is not None: - ras_adata.var['GPR rule'] = [self.model.reactions.get_by_id(reaction).gene_reaction_rule for reaction in ras_adata.var.index] - - if add_essential_reactions: - if self.model is not None: - essential_reactions=find_essential_reactions(self.model) - essential_reactions=[el.id for el in essential_reactions] - ras_adata.var['essential reactions']=["yes" if el in essential_reactions else "no" for 
el in ras_adata.var.index] - - if add_essential_genes: - if self.model is not None: - essential_genes=find_essential_genes(self.model) - essential_genes=[el.id for el in essential_genes] - ras_adata.var['essential genes']=[" ".join([gene for gene in genes.split() if gene in essential_genes]) for genes in ras_adata.var["GPR rule"]] - - return ras_adata - - def _evaluate_ast(self, node, values, or_function, and_function, ignore_nan): - """ - Evaluate a boolean expression using AST (Abstract Syntax Tree). - Handles all GPR types: single gene, simple (A and B), nested (A or (B and C)). - - Args: - node: AST node to evaluate - values: Dictionary mapping gene names to their expression values - or_function: Function to apply for OR operations - and_function: Function to apply for AND operations - ignore_nan: If True, ignore None/NaN values (e.g., A or None -> A) - - Returns: - Evaluated expression result (float or np.nan) - """ - if isinstance(node, ast.BoolOp): - # Boolean operation (and/or) - vals = [self._evaluate_ast(v, values, or_function, and_function, ignore_nan) for v in node.values] - - if ignore_nan: - # Filter out None/NaN values - vals = [v for v in vals if v is not None and not (isinstance(v, float) and np.isnan(v))] - - if not vals: - return np.nan - - if isinstance(node.op, ast.Or): - return or_function(vals) - elif isinstance(node.op, ast.And): - return and_function(vals) - - elif isinstance(node, ast.Name): - # Variable (gene name) - return values.get(node.id, None) - elif isinstance(node, ast.Constant): - # Constant (shouldn't happen in GPRs, but handle it) - return values.get(str(node.value), None) - else: - raise ValueError(f"Unexpected node type in GPR: {ast.dump(node)}") - - -# ============================================================================ -# STANDALONE FUNCTION FOR RAS_GENERATOR COMPATIBILITY -# ============================================================================ - -def computeRAS( - dataset, - gene_rules, - rules_total_string, - or_function=np.sum, - and_function=np.min, - ignore_nan=True -): - """ - Compute RAS from tabular data and GPR rules (ras_generator.py compatible). - - This is a standalone function that wraps the RAS_computation class - to provide the same interface as ras_generator.py. 
- - Args: - dataset: pandas DataFrame with gene expression (genes × samples) - gene_rules: dict mapping reaction IDs to GPR strings - rules_total_string: list of all gene names in GPRs - or_function: function for OR operations (default: np.sum) - and_function: function for AND operations (default: np.min) - ignore_nan: if True, ignore NaN in GPR evaluation (default: True) - - Returns: - tuple: (ras_df, genes_not_mapped) - - ras_df: DataFrame with RAS values (reactions × samples) - - genes_not_mapped: list of genes in GPRs not found in dataset - """ - # Create RAS computation object in DataFrame mode - ras_obj = RAS_computation( - dataset=dataset, - gene_rules=gene_rules, - rules_total_string=rules_total_string - ) - - # Compute RAS - result = ras_obj.compute( - or_expression=or_function, - and_expression=and_function, - ignore_nan=ignore_nan, - print_progressbar=False, # No progress bar for ras_generator - add_count_metadata=False, # No metadata in DataFrame mode - add_met_metadata=False, - add_essential_reactions=False, - add_essential_genes=False - ) - - # Result is a tuple (ras_df, genes_not_mapped) in DataFrame mode - return result - -def main(args:List[str] = None) -> None: - """ - Initializes everything and sets the program in motion based on the fronted input arguments. - - Returns: - None - """ - # get args from frontend (related xml) - global ARGS - ARGS = process_args(args) - - # read dataset and remove versioning from gene names - dataset = read_dataset(ARGS.input, "dataset") - orig_gene_list=dataset.index.copy() - dataset.index = [str(el.split(".")[0]) for el in dataset.index] - - #load GPR rules - rules = load_custom_rules() - - #create a list of all the gpr - rules_total_string="" - for id,rule in rules.items(): - rules_total_string+=rule.replace("(","").replace(")","") + " " - rules_total_string=list(set(rules_total_string.split(" "))) - - if any(dataset.index.duplicated(keep=False)): - genes_duplicates=orig_gene_list[dataset.index.duplicated(keep=False)] - genes_duplicates_in_model=[elem for elem in genes_duplicates if elem in rules_total_string] - - if len(genes_duplicates_in_model)>0:#metabolic genes have duplicated entries in the dataset - list_str=", ".join(genes_duplicates_in_model) - list_genes=f"ERROR: Duplicate entries in the gene dataset present in one or more GPR. The following metabolic genes are duplicated: "+list_str - raise ValueError(list_genes) - else: - list_str=", ".join(genes_duplicates) - list_genes=f"INFO: Duplicate entries in the gene dataset. The following genes are duplicated in the dataset but not mentioned in the GPRs: "+list_str - utils.logWarning(list_genes,ARGS.out_log) - - #check if nan value must be ignored in the GPR - if ARGS.none: - # #e.g. (A or nan --> A) - ignore_nan = True - else: - #e.g. 
(A or nan --> nan)
-        ignore_nan = False
-    
-    #compure ras
-    ras_df,genes_not_mapped=computeRAS(dataset,rules,
-                        rules_total_string,
-                        or_function=np.sum, # type of operation to do in case of an or expression (max, sum, mean)
-                        and_function=np.min,
-                        ignore_nan=ignore_nan)
-
-    #save to csv and replace nan with None
-    ras_df.replace([np.nan,None],"None").to_csv(ARGS.ras_output, sep = '\t')
-
-    #report genes not present in the data
-    if len(genes_not_mapped)>0:
-        genes_not_mapped_str=", ".join(genes_not_mapped)
-        utils.logWarning(
-            f"INFO: The following genes are mentioned in the GPR rules but don't appear in the dataset: "+genes_not_mapped_str,
-            ARGS.out_log)
-
-    print("Execution succeeded")
-
-###############################################################################
-if __name__ == "__main__":
+"""
+Generate Reaction Activity Scores (RAS) from a gene expression dataset and GPR rules.
+
+The script reads a tabular dataset (genes x samples) and a rules file (GPRs),
+computes RAS per reaction for each sample/cell line, and writes a tabular output.
+"""
+from __future__ import division
+import os
+import sys
+import argparse
+import pandas as pd
+import numpy as np
+try:
+    from .utils import general_utils as utils
+except:
+    import utils.general_utils as utils
+from typing import List, Dict
+import ast
+
+# Optional imports for AnnData mode (not used in ras_generator.py)
+try:
+    from progressbar import ProgressBar, Bar, Percentage
+    from scanpy import AnnData
+    from cobra.flux_analysis.variability import find_essential_reactions, find_essential_genes
+except ImportError:
+    # These are only needed for AnnData mode, not for ras_generator.py
+    pass
+
+ERRORS = []
+########################## argparse ##########################################
+ARGS :argparse.Namespace
+def process_args(args:List[str] = None) -> argparse.Namespace:
+    """
+    Processes command-line arguments.
+
+    Args:
+        args (list): List of command-line arguments.
+
+    Returns:
+        Namespace: An object containing parsed arguments.
+    """
+    parser = argparse.ArgumentParser(
+        usage = '%(prog)s [options]',
+        description = "Compute Reaction Activity Scores (RAS) from a gene expression dataset and GPR rules.")
+
+    parser.add_argument("-rl", "--model_upload", type = str,
+        help = "path to input file containing the rules")
+
+    parser.add_argument("-rn", "--model_upload_name", type = str, help = "custom rules name")
+    # Galaxy converts files into .dat, this helps infer the original extension when needed.
+
+    parser.add_argument(
+        '-n', '--none',
+        type = utils.Bool("none"), default = True,
+        help = 'ignore NaN values when evaluating GPR rules (e.g. A or NaN -> A)')
+
+    parser.add_argument(
+        '-td', '--tool_dir',
+        type = str,
+        default = os.path.dirname(os.path.abspath(__file__)),
+        help = 'your tool directory (default: auto-detected package location)')
+
+    parser.add_argument(
+        '-ol', '--out_log',
+        type = str,
+        help = "Output log")
+
+    parser.add_argument(
+        '-in', '--input',
+        type = str,
+        help = 'input dataset')
+
+    parser.add_argument(
+        '-ra', '--ras_output',
+        type = str,
+        required = True, help = 'ras output')
+
+
+    return parser.parse_args(args)
+
+############################ dataset input ####################################
+def read_dataset(data :str, name :str) -> pd.DataFrame:
+    """
+    Read a dataset from a CSV file and return it as a pandas DataFrame.
+
+    Args:
+        data (str): Path to the CSV file containing the dataset.
+        name (str): Name of the dataset, used in error messages.
+
+    Returns:
+        pandas.DataFrame: DataFrame containing the dataset.
+
+    Raises:
+        pd.errors.EmptyDataError: If the CSV file is empty.
+        sys.exit: If the CSV file has the wrong format, the execution is aborted.
+    """
+    try:
+        dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python', index_col=0)
+        dataset = dataset.astype(float)
+    except pd.errors.EmptyDataError:
+        sys.exit('Execution aborted: wrong file format of ' + name + '\n')
+    if len(dataset.columns) < 2:
+        sys.exit('Execution aborted: wrong file format of ' + name + '\n')
+    return dataset
+
+
+def load_custom_rules() -> Dict[str,str]:
+    """
+    Open the rules file (tab- or comma-separated tabular model data) and extract the GPR rules.
+
+    Returns:
+        Dict[str, str]: dict mapping reaction IDs to GPR rule strings.
+    """
+    datFilePath = utils.FilePath.fromStrPath(ARGS.model_upload) # actual file, stored in Galaxy as a .dat
+
+    dict_rule = {}
+
+    try:
+        rows = utils.readCsv(datFilePath, delimiter = "\t", skipHeader=False)
+        if len(rows) <= 1:
+            raise ValueError("Model tabular with 1 column is not supported.")
+
+        if not rows:
+            raise ValueError("Model tabular file is empty.")
+
+        id_idx, idx_gpr = utils.findIdxByName(rows[0], "GPR")
+
+        # First, try using a tab delimiter
+        for line in rows[1:]:
+            if len(line) <= idx_gpr:
+                utils.logWarning(f"Skipping malformed line: {line}", ARGS.out_log)
+                continue
+
+            dict_rule[line[id_idx]] = line[idx_gpr]
+
+    except Exception as e:
+        # If parsing with tabs fails, try comma delimiter
+        try:
+            rows = utils.readCsv(datFilePath, delimiter = ",", skipHeader=False)
+
+            if len(rows) <= 1:
+                raise ValueError("Model tabular with 1 column is not supported.")
+
+            if not rows:
+                raise ValueError("Model tabular file is empty.")
+
+            id_idx, idx_gpr = utils.findIdxByName(rows[0], "GPR")
+
+            # Try again parsing row content with the GPR column using comma-separated values
+            for line in rows[1:]:
+                if len(line) <= idx_gpr:
+                    utils.logWarning(f"Skipping malformed line: {line}", ARGS.out_log)
+                    continue
+
+                dict_rule[line[id_idx]] = line[idx_gpr]
+
+        except Exception as e2:
+            raise ValueError(f"Unable to parse rules file. Tried both tab and comma delimiters. Original errors: Tab: {e}, Comma: {e2}")
+
+    if not dict_rule:
+        raise ValueError("No valid rules found in the uploaded file. Please check the file format.")
+    # Rules are kept as raw GPR strings here; they are parsed into expressions later, during RAS computation.
+ return dict_rule + + +""" +Class to compute the RAS values + +""" + +class RAS_computation: + + def __init__(self, adata=None, model=None, dataset=None, gene_rules=None, rules_total_string=None): + """ + Initialize RAS computation with two possible input modes: + + Mode 1 (Original - for sampling_main.py): + adata: AnnData object with gene expression (cells × genes) + model: COBRApy model object with reactions and GPRs + + Mode 2 (New - for ras_generator.py): + dataset: pandas DataFrame with gene expression (genes × samples) + gene_rules: dict mapping reaction IDs to GPR strings + rules_total_string: list of all gene names in GPRs (for validation) + """ + self._logic_operators = ['and', 'or', '(', ')'] + self.val_nan = np.nan + + # Determine which mode we're in + if adata is not None and model is not None: + # Mode 1: AnnData + COBRApy model (original) + self._init_from_anndata(adata, model) + elif dataset is not None and gene_rules is not None: + # Mode 2: DataFrame + rules dict (ras_generator style) + self._init_from_dataframe(dataset, gene_rules, rules_total_string) + else: + raise ValueError( + "Invalid initialization. Provide either:\n" + " - adata + model (for AnnData input), or\n" + " - dataset + gene_rules (for DataFrame input)" + ) + + def _normalize_gene_name(self, gene_name): + """Normalize gene names by replacing special characters.""" + return gene_name.replace("-", "_").replace(":", "_") + + def _normalize_rule(self, rule): + """Normalize GPR rule: lowercase operators, add spaces around parentheses, normalize gene names.""" + rule = rule.replace("OR", "or").replace("AND", "and") + rule = rule.replace("(", "( ").replace(")", " )") + # Normalize gene names in the rule + tokens = rule.split() + normalized_tokens = [token if token in self._logic_operators else self._normalize_gene_name(token) for token in tokens] + return " ".join(normalized_tokens) + + def _init_from_anndata(self, adata, model): + """Initialize from AnnData and COBRApy model (original mode).""" + # Build the dictionary for the GPRs + df_reactions = pd.DataFrame(index=[reaction.id for reaction in model.reactions]) + gene_rules = [self._normalize_rule(reaction.gene_reaction_rule) for reaction in model.reactions] + df_reactions['rule'] = gene_rules + df_reactions = df_reactions.reset_index() + df_reactions = df_reactions.groupby('rule').agg(lambda x: sorted(list(x))) + + self.dict_rule_reactions = df_reactions.to_dict()['index'] + + # build useful structures for RAS computation + self.model = model + self.count_adata = adata.copy() + + # Normalize gene names in both model and dataset + model_genes = [self._normalize_gene_name(gene.id) for gene in model.genes] + dataset_genes = [self._normalize_gene_name(gene) for gene in self.count_adata.var.index] + self.genes = pd.Index(dataset_genes).intersection(model_genes) + + if len(self.genes) == 0: + raise ValueError("ERROR: No genes from the count matrix match the metabolic model. 
Check that gene annotations are consistent between model and dataset.") + + self.cell_ids = list(self.count_adata.obs.index.values) + # Get expression data with normalized gene names + self.count_df_filtered = self.count_adata.to_df().T + self.count_df_filtered.index = [self._normalize_gene_name(g) for g in self.count_df_filtered.index] + self.count_df_filtered = self.count_df_filtered.loc[self.genes] + + def _init_from_dataframe(self, dataset, gene_rules, rules_total_string): + """Initialize from DataFrame and rules dict (ras_generator mode).""" + reactions = list(gene_rules.keys()) + + # Build the dictionary for the GPRs + df_reactions = pd.DataFrame(index=reactions) + gene_rules_list = [self._normalize_rule(gene_rules[reaction_id]) for reaction_id in reactions] + df_reactions['rule'] = gene_rules_list + df_reactions = df_reactions.reset_index() + df_reactions = df_reactions.groupby('rule').agg(lambda x: sorted(list(x))) + + self.dict_rule_reactions = df_reactions.to_dict()['index'] + + # build useful structures for RAS computation + self.model = None + self.count_adata = None + + # Normalize gene names in dataset + dataset_normalized = dataset.copy() + dataset_normalized.index = [self._normalize_gene_name(g) for g in dataset_normalized.index] + + # Determine which genes are in both dataset and GPRs + if rules_total_string is not None: + rules_genes = [self._normalize_gene_name(g) for g in rules_total_string] + self.genes = dataset_normalized.index.intersection(rules_genes) + else: + # Extract all genes from rules + all_genes_in_rules = set() + for rule in gene_rules_list: + tokens = rule.split() + for token in tokens: + if token not in self._logic_operators: + all_genes_in_rules.add(token) + self.genes = dataset_normalized.index.intersection(all_genes_in_rules) + + if len(self.genes) == 0: + raise ValueError("ERROR: No genes from the count matrix match the metabolic model. Check that gene annotations are consistent between model and dataset.") + + self.cell_ids = list(dataset_normalized.columns) + self.count_df_filtered = dataset_normalized.loc[self.genes] + + def compute(self, + or_expression=np.sum, # type of operation to do in case of an or expression (sum, max, mean) + and_expression=np.min, # type of operation to do in case of an and expression(min, sum) + drop_na_rows=False, # if True remove the nan rows of the ras matrix + drop_duplicates=False, # if true, remove duplicates rows + ignore_nan=True, # if True, ignore NaN values in GPR evaluation (e.g., A or NaN -> A) + print_progressbar=True, # if True, print the progress bar + add_count_metadata=True, # if True add metadata of cells in the ras adata + add_met_metadata=True, # if True add metadata from the metabolic model (gpr and compartments of reactions) + add_essential_reactions=False, + add_essential_genes=False + ): + + self.or_function = or_expression + self.and_function = and_expression + + ras_df = np.full((len(self.dict_rule_reactions), len(self.cell_ids)), np.nan) + genes_not_mapped = set() # Track genes not in dataset + + if print_progressbar: + pbar = ProgressBar(widgets=[Percentage(), Bar()], maxval=len(self.dict_rule_reactions)).start() + + # Process each unique GPR rule + for ind, (rule, reaction_ids) in enumerate(self.dict_rule_reactions.items()): + if len(rule) == 0: + # Empty rule - keep as NaN + pass + else: + # Extract genes from rule + rule_genes = [token for token in rule.split() if token not in self._logic_operators] + rule_genes_unique = list(set(rule_genes)) + + # Which genes are in the dataset? 
+ genes_present = [g for g in rule_genes_unique if g in self.genes] + genes_missing = [g for g in rule_genes_unique if g not in self.genes] + + if genes_missing: + genes_not_mapped.update(genes_missing) + + if len(genes_present) == 0: + # No genes in dataset - keep as NaN + pass + elif len(genes_missing) > 0 and not ignore_nan: + # Some genes missing and we don't ignore NaN - set to NaN + pass + else: + # Evaluate the GPR expression using AST + # For single gene, AST handles it fine: ast.parse("GENE_A") works + # more genes in the formula + check_only_and=("and" in rule and "or" not in rule) #only and + check_only_or=("or" in rule and "and" not in rule) #only or + if check_only_and or check_only_or: + #or/and sequence + matrix = self.count_df_filtered.loc[genes_present].values + #compute for all cells + if check_only_and: + ras_df[ind] = self.and_function(matrix, axis=0) + else: + ras_df[ind] = self.or_function(matrix, axis=0) + else: + # complex expression (e.g. A or (B and C)) + data = self.count_df_filtered.loc[genes_present] # dataframe of genes in the GPRs + tree = ast.parse(rule, mode="eval").body + values_by_cell = [dict(zip(data.index, data[col].values)) for col in data.columns] + for j, values in enumerate(values_by_cell): + ras_df[ind, j] =self._evaluate_ast(tree, values, self.or_function, self.and_function, ignore_nan) + + if print_progressbar: + pbar.update(ind + 1) + + if print_progressbar: + pbar.finish() + + # Store genes not mapped for later use + self.genes_not_mapped = sorted(genes_not_mapped) + + # create the dataframe of ras (rules x samples) + ras_df = pd.DataFrame(data=ras_df, index=range(len(self.dict_rule_reactions)), columns=self.cell_ids) + ras_df['Reactions'] = [reaction_ids for rule, reaction_ids in self.dict_rule_reactions.items()] + + reactions_common = pd.DataFrame() + reactions_common["Reactions"] = ras_df['Reactions'] + reactions_common["proof2"] = ras_df['Reactions'] + reactions_common = reactions_common.explode('Reactions') + reactions_common = reactions_common.set_index("Reactions") + + ras_df = ras_df.explode("Reactions") + ras_df = ras_df.set_index("Reactions") + + if drop_na_rows: + ras_df = ras_df.dropna(how="all") + + if drop_duplicates: + ras_df = ras_df.drop_duplicates() + + # If initialized from DataFrame (ras_generator mode), return DataFrame instead of AnnData + if self.count_adata is None: + return ras_df, self.genes_not_mapped + + # Original AnnData mode: create AnnData structure for RAS + ras_adata = AnnData(ras_df.T) + + #add metadata + if add_count_metadata: + ras_adata.var["common_gprs"] = reactions_common.loc[ras_df.index] + ras_adata.var["common_gprs"] = ras_adata.var["common_gprs"].apply(lambda x: ",".join(x)) + for el in self.count_adata.obs.columns: + ras_adata.obs["countmatrix_"+el]=self.count_adata.obs[el] + + if add_met_metadata: + if self.model is not None and len(self.model.compartments)>0: + ras_adata.var['compartments']=[list(self.model.reactions.get_by_id(reaction).compartments) for reaction in ras_adata.var.index] + ras_adata.var['compartments']=ras_adata.var["compartments"].apply(lambda x: ",".join(x)) + + if self.model is not None: + ras_adata.var['GPR rule'] = [self.model.reactions.get_by_id(reaction).gene_reaction_rule for reaction in ras_adata.var.index] + + if add_essential_reactions: + if self.model is not None: + essential_reactions=find_essential_reactions(self.model) + essential_reactions=[el.id for el in essential_reactions] + ras_adata.var['essential reactions']=["yes" if el in essential_reactions else "no" for 
el in ras_adata.var.index] + + if add_essential_genes: + if self.model is not None: + essential_genes=find_essential_genes(self.model) + essential_genes=[el.id for el in essential_genes] + ras_adata.var['essential genes']=[" ".join([gene for gene in genes.split() if gene in essential_genes]) for genes in ras_adata.var["GPR rule"]] + + return ras_adata + + def _evaluate_ast(self, node, values, or_function, and_function, ignore_nan): + """ + Evaluate a boolean expression using AST (Abstract Syntax Tree). + Handles all GPR types: single gene, simple (A and B), nested (A or (B and C)). + + Args: + node: AST node to evaluate + values: Dictionary mapping gene names to their expression values + or_function: Function to apply for OR operations + and_function: Function to apply for AND operations + ignore_nan: If True, ignore None/NaN values (e.g., A or None -> A) + + Returns: + Evaluated expression result (float or np.nan) + """ + if isinstance(node, ast.BoolOp): + # Boolean operation (and/or) + vals = [self._evaluate_ast(v, values, or_function, and_function, ignore_nan) for v in node.values] + + if ignore_nan: + # Filter out None/NaN values + vals = [v for v in vals if v is not None and not (isinstance(v, float) and np.isnan(v))] + + if not vals: + return np.nan + + if isinstance(node.op, ast.Or): + return or_function(vals) + elif isinstance(node.op, ast.And): + return and_function(vals) + + elif isinstance(node, ast.Name): + # Variable (gene name) + return values.get(node.id, None) + elif isinstance(node, ast.Constant): + # Constant (shouldn't happen in GPRs, but handle it) + return values.get(str(node.value), None) + else: + raise ValueError(f"Unexpected node type in GPR: {ast.dump(node)}") + + +# ============================================================================ +# STANDALONE FUNCTION FOR RAS_GENERATOR COMPATIBILITY +# ============================================================================ + +def computeRAS( + dataset, + gene_rules, + rules_total_string, + or_function=np.sum, + and_function=np.min, + ignore_nan=True +): + """ + Compute RAS from tabular data and GPR rules (ras_generator.py compatible). + + This is a standalone function that wraps the RAS_computation class + to provide the same interface as ras_generator.py. 
+
+    Args:
+        dataset: pandas DataFrame with gene expression (genes × samples)
+        gene_rules: dict mapping reaction IDs to GPR strings
+        rules_total_string: list of all gene names in GPRs
+        or_function: function for OR operations (default: np.sum)
+        and_function: function for AND operations (default: np.min)
+        ignore_nan: if True, ignore NaN in GPR evaluation (default: True)
+
+    Returns:
+        tuple: (ras_df, genes_not_mapped)
+            - ras_df: DataFrame with RAS values (reactions × samples)
+            - genes_not_mapped: list of genes in GPRs not found in dataset
+    """
+    # Create RAS computation object in DataFrame mode
+    ras_obj = RAS_computation(
+        dataset=dataset,
+        gene_rules=gene_rules,
+        rules_total_string=rules_total_string
+    )
+
+    # Compute RAS
+    result = ras_obj.compute(
+        or_expression=or_function,
+        and_expression=and_function,
+        ignore_nan=ignore_nan,
+        print_progressbar=False,       # No progress bar for ras_generator
+        add_count_metadata=False,      # No metadata in DataFrame mode
+        add_met_metadata=False,
+        add_essential_reactions=False,
+        add_essential_genes=False
+    )
+
+    # Result is a tuple (ras_df, genes_not_mapped) in DataFrame mode
+    return result
+
+def main(args:List[str] = None) -> None:
+    """
+    Initializes everything and sets the program in motion based on the frontend input arguments.
+
+    Returns:
+        None
+    """
+    # get args from frontend (related xml)
+    global ARGS
+    ARGS = process_args(args)
+
+    # read dataset and remove versioning from gene names
+    dataset = read_dataset(ARGS.input, "dataset")
+    orig_gene_list = dataset.index.copy()
+    dataset.index = [str(el.split(".")[0]) for el in dataset.index]
+
+    # load GPR rules
+    rules = load_custom_rules()
+
+    # create a list of all the gene names mentioned in the GPRs
+    rules_total_string = ""
+    for id, rule in rules.items():
+        rules_total_string += rule.replace("(", "").replace(")", "") + " "
+    rules_total_string = list(set(rules_total_string.split(" ")))
+
+    if any(dataset.index.duplicated(keep=False)):
+        genes_duplicates = orig_gene_list[dataset.index.duplicated(keep=False)]
+        genes_duplicates_in_model = [elem for elem in genes_duplicates if elem in rules_total_string]
+
+        if len(genes_duplicates_in_model) > 0:  # metabolic genes have duplicated entries in the dataset
+            list_str = ", ".join(genes_duplicates_in_model)
+            list_genes = "ERROR: Duplicate entries in the gene dataset present in one or more GPR. The following metabolic genes are duplicated: " + list_str
+            raise ValueError(list_genes)
+        else:
+            list_str = ", ".join(genes_duplicates)
+            list_genes = "INFO: Duplicate entries in the gene dataset. The following genes are duplicated in the dataset but not mentioned in the GPRs: " + list_str
+            utils.logWarning(list_genes, ARGS.out_log)
+
+    # check whether NaN values must be ignored during GPR evaluation
+    if ARGS.none:
+        # e.g. (A or nan --> A)
+        ignore_nan = True
+    else:
+        # e.g. (A or nan --> nan)
+        ignore_nan = False
+
+    # compute RAS
+    ras_df, genes_not_mapped = computeRAS(dataset, rules,
+                        rules_total_string,
+                        or_function=np.sum,  # operation to apply for an OR expression (max, sum, mean)
+                        and_function=np.min,
+                        ignore_nan=ignore_nan)
+
+    # save to csv and replace nan with None
+    ras_df.replace([np.nan, None], "None").to_csv(ARGS.ras_output, sep='\t')
+
+    # report genes not present in the data
+    if len(genes_not_mapped) > 0:
+        genes_not_mapped_str = ", ".join(genes_not_mapped)
+        utils.logWarning(
+            "INFO: The following genes are mentioned in the GPR rules but don't appear in the dataset: " + genes_not_mapped_str,
+            ARGS.out_log)
+
+    print("Execution succeeded")
+
+###############################################################################
+if __name__ == "__main__":
     main()
\ No newline at end of file
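
The rewritten `ras_generator.py` above exposes `computeRAS` for use outside Galaxy. A minimal sketch of a programmatic call, assuming the `cobraxy` package layout introduced later in this changeset and using made-up gene names and expression values:

```python
import numpy as np
import pandas as pd
from cobraxy.ras_generator import computeRAS  # import path assumes `pip install .` from src/

# Toy expression matrix: genes (rows) x samples (columns)
expr = pd.DataFrame(
    {"S1": [5.0, 2.0, 1.0], "S2": [0.0, 3.0, 4.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# GPR rules per reaction ID
gprs = {
    "R1": "GENE_A and GENE_B",              # pure AND -> and_function (np.min) per sample
    "R2": "GENE_A or (GENE_B and GENE_C)",  # nested rule -> evaluated through the AST walker
}
all_rule_genes = ["GENE_A", "GENE_B", "GENE_C"]

ras_df, genes_not_mapped = computeRAS(
    expr, gprs, all_rule_genes,
    or_function=np.sum, and_function=np.min, ignore_nan=True,
)
print(ras_df)            # reactions x samples RAS table
print(genes_not_mapped)  # GPR genes absent from the dataset
```

The Galaxy tool goes through `main()` instead: it reads the expression table (`--input`) and the tabular model with a GPR column (`--model_upload`), then writes the same reactions x samples matrix to `--ras_output`.
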
--- a/COBRAxy/src/ras_to_bounds.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/ras_to_bounds.py Sun Oct 26 19:27:41 2025 +0000 @@ -1,355 +1,360 @@ -""" -Apply RAS-based scaling to reaction bounds and optionally save updated models. - -Workflow: -- Read one or more RAS matrices (patients/samples x reactions) -- Normalize and merge them, optionally adding class suffixes to sample IDs -- Build a COBRA model from a tabular CSV -- Run FVA to initialize bounds, then scale per-sample based on RAS values -- Save bounds per sample and optionally export updated models in chosen formats -""" -import argparse -import utils.general_utils as utils -from typing import Optional, Dict, Set, List, Tuple, Union -import os -import numpy as np -import pandas as pd -import cobra -from cobra import Model -import sys -from joblib import Parallel, delayed, cpu_count -import utils.model_utils as modelUtils - -################################# process args ############################### -def process_args(args :List[str] = None) -> argparse.Namespace: - """ - Processes command-line arguments. - - Args: - args (list): List of command-line arguments. - - Returns: - Namespace: An object containing parsed arguments. - """ - parser = argparse.ArgumentParser(usage = '%(prog)s [options]', - description = 'process some value\'s') - - - parser.add_argument("-mo", "--model_upload", type = str, - help = "path to input file with custom rules, if provided") - - parser.add_argument('-ol', '--out_log', - help = "Output log") - - parser.add_argument('-td', '--tool_dir', - type = str, - required = True, - help = 'your tool directory') - - parser.add_argument('-ir', '--input_ras', - type=str, - required = False, - help = 'input ras') - - parser.add_argument('-rn', '--name', - type=str, - help = 'ras class names') - - parser.add_argument('-cc', '--cell_class', - type = str, - help = 'output of cell class') - parser.add_argument( - '-idop', '--output_path', - type = str, - default='ras_to_bounds/', - help = 'output path for maps') - - parser.add_argument('-sm', '--save_models', - type=utils.Bool("save_models"), - default=False, - help = 'whether to save models with applied bounds') - - parser.add_argument('-smp', '--save_models_path', - type = str, - default='saved_models/', - help = 'output path for saved models') - - parser.add_argument('-smf', '--save_models_format', - type = str, - default='csv', - help = 'format for saved models (csv, xml, json, mat, yaml, tabular)') - - - ARGS = parser.parse_args(args) - return ARGS - -########################### warning ########################################### -def warning(s :str) -> None: - """ - Log a warning message to an output log file and print it to the console. - - Args: - s (str): The warning message to be logged and printed. - - Returns: - None - """ - if ARGS.out_log: - with open(ARGS.out_log, 'a') as log: - log.write(s + "\n\n") - print(s) - -############################ dataset input #################################### -def read_dataset(data :str, name :str) -> pd.DataFrame: - """ - Read a dataset from a CSV file and return it as a pandas DataFrame. - - Args: - data (str): Path to the CSV file containing the dataset. - name (str): Name of the dataset, used in error messages. - - Returns: - pandas.DataFrame: DataFrame containing the dataset. - - Raises: - pd.errors.EmptyDataError: If the CSV file is empty. - sys.exit: If the CSV file has the wrong format, the execution is aborted. 
- """ - try: - dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python') - except pd.errors.EmptyDataError: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - if len(dataset.columns) < 2: - sys.exit('Execution aborted: wrong format of ' + name + '\n') - return dataset - - -def apply_ras_bounds(bounds, ras_row): - """ - Adjust the bounds of reactions in the model based on RAS values. - - Args: - bounds (pd.DataFrame): Model bounds. - ras_row (pd.Series): A row from a RAS DataFrame containing scaling factors for reaction bounds. - Returns: - new_bounds (pd.DataFrame): integrated bounds. - """ - new_bounds = bounds.copy() - for reaction in ras_row.index: - scaling_factor = ras_row[reaction] - if not np.isnan(scaling_factor): - lower_bound=bounds.loc[reaction, "lower_bound"] - upper_bound=bounds.loc[reaction, "upper_bound"] - valMax=float((upper_bound)*scaling_factor) - valMin=float((lower_bound)*scaling_factor) - if upper_bound!=0 and lower_bound==0: - new_bounds.loc[reaction, "upper_bound"] = valMax - if upper_bound==0 and lower_bound!=0: - new_bounds.loc[reaction, "lower_bound"] = valMin - if upper_bound!=0 and lower_bound!=0: - new_bounds.loc[reaction, "lower_bound"] = valMin - new_bounds.loc[reaction, "upper_bound"] = valMax - return new_bounds - - -def save_model(model, filename, output_folder, file_format='csv'): - """ - Save a COBRA model to file in the specified format. - - Args: - model (cobra.Model): The model to save. - filename (str): Base filename (without extension). - output_folder (str): Output directory. - file_format (str): File format ('xml', 'json', 'mat', 'yaml', 'tabular', 'csv'). - - Returns: - None - """ - if not os.path.exists(output_folder): - os.makedirs(output_folder) - - try: - if file_format == 'tabular' or file_format == 'csv': - # Special handling for tabular format using utils functions - filepath = os.path.join(output_folder, f"{filename}.csv") - - # Use unified function for tabular export - merged = modelUtils.export_model_to_tabular( - model=model, - output_path=filepath, - include_objective=True - ) - - else: - # Standard COBRA formats - filepath = os.path.join(output_folder, f"{filename}.{file_format}") - - if file_format == 'xml': - cobra.io.write_sbml_model(model, filepath) - elif file_format == 'json': - cobra.io.save_json_model(model, filepath) - elif file_format == 'mat': - cobra.io.save_matlab_model(model, filepath) - elif file_format == 'yaml': - cobra.io.save_yaml_model(model, filepath) - else: - raise ValueError(f"Unsupported format: {file_format}") - - print(f"Model saved: {filepath}") - - except Exception as e: - warning(f"Error saving model {filename}: {str(e)}") - -def apply_bounds_to_model(model, bounds): - """ - Apply bounds from a DataFrame to a COBRA model. - - Args: - model (cobra.Model): The metabolic model to modify. - bounds (pd.DataFrame): DataFrame with reaction bounds. - - Returns: - cobra.Model: Modified model with new bounds. - """ - model_copy = model.copy() - for reaction_id in bounds.index: - try: - reaction = model_copy.reactions.get_by_id(reaction_id) - reaction.lower_bound = bounds.loc[reaction_id, "lower_bound"] - reaction.upper_bound = bounds.loc[reaction_id, "upper_bound"] - except KeyError: - # Reaction not found in model, skip - continue - return model_copy - -def process_ras_cell(cellName, ras_row, model, rxns_ids, output_folder, save_models=False, save_models_path='saved_models/', save_models_format='csv'): - """ - Process a single RAS cell, apply bounds, and save the bounds to a CSV file. 
- - Args: - cellName (str): The name of the RAS cell (used for naming the output file). - ras_row (pd.Series): A row from a RAS DataFrame containing scaling factors for reaction bounds. - model (cobra.Model): The metabolic model to be modified. - rxns_ids (list of str): List of reaction IDs to which the scaling factors will be applied. - output_folder (str): Folder path where the output CSV file will be saved. - save_models (bool): Whether to save models with applied bounds. - save_models_path (str): Path where to save models. - save_models_format (str): Format for saved models. - - Returns: - None - """ - bounds = pd.DataFrame([(rxn.lower_bound, rxn.upper_bound) for rxn in model.reactions], index=rxns_ids, columns=["lower_bound", "upper_bound"]) - new_bounds = apply_ras_bounds(bounds, ras_row) - new_bounds.to_csv(output_folder + cellName + ".csv", sep='\t', index=True) - - # Save model if requested - if save_models: - modified_model = apply_bounds_to_model(model, new_bounds) - save_model(modified_model, cellName, save_models_path, save_models_format) - - return - -def generate_bounds_model(model: cobra.Model, ras=None, output_folder='output/', save_models=False, save_models_path='saved_models/', save_models_format='csv') -> pd.DataFrame: - """ - Generate reaction bounds for a metabolic model based on medium conditions and optional RAS adjustments. - - Args: - model (cobra.Model): The metabolic model for which bounds will be generated. - ras (pd.DataFrame, optional): RAS pandas dataframe. Defaults to None. - output_folder (str, optional): Folder path where output CSV files will be saved. Defaults to 'output/'. - save_models (bool): Whether to save models with applied bounds. - save_models_path (str): Path where to save models. - save_models_format (str): Format for saved models. - - Returns: - pd.DataFrame: DataFrame containing the bounds of reactions in the model. - """ - rxns_ids = [rxn.id for rxn in model.reactions] - - # Perform Flux Variability Analysis (FVA) on this medium - df_FVA = cobra.flux_analysis.flux_variability_analysis(model, fraction_of_optimum=0, processes=1).round(8) - - # Set FVA bounds - for reaction in rxns_ids: - model.reactions.get_by_id(reaction).lower_bound = float(df_FVA.loc[reaction, "minimum"]) - model.reactions.get_by_id(reaction).upper_bound = float(df_FVA.loc[reaction, "maximum"]) - - if ras is not None: - Parallel(n_jobs=cpu_count())(delayed(process_ras_cell)( - cellName, ras_row, model, rxns_ids, output_folder, - save_models, save_models_path, save_models_format - ) for cellName, ras_row in ras.iterrows()) - else: - raise ValueError("RAS DataFrame is None. Cannot generate bounds without RAS data.") - return - -############################# main ########################################### -def main(args:List[str] = None) -> None: - """ - Initialize and execute RAS-to-bounds pipeline based on the frontend input arguments. - - Returns: - None - """ - if not os.path.exists('ras_to_bounds'): - os.makedirs('ras_to_bounds') - - global ARGS - ARGS = process_args(args) - - - ras_file_list = ARGS.input_ras.split(",") - ras_file_names = ARGS.name.split(",") - if len(ras_file_names) != len(set(ras_file_names)): - error_message = "Duplicated file names in the uploaded RAS matrices." 
- warning(error_message) - raise ValueError(error_message) - - ras_class_names = [] - for file in ras_file_names: - ras_class_names.append(file.rsplit(".", 1)[0]) - ras_list = [] - class_assignments = pd.DataFrame(columns=["Patient_ID", "Class"]) - for ras_matrix, ras_class_name in zip(ras_file_list, ras_class_names): - ras = read_dataset(ras_matrix, "ras dataset") - ras.replace("None", None, inplace=True) - ras.set_index("Reactions", drop=True, inplace=True) - ras = ras.T - ras = ras.astype(float) - if(len(ras_file_list)>1): - # Append class name to patient id (DataFrame index) - ras.index = [f"{idx}_{ras_class_name}" for idx in ras.index] - else: - ras.index = [f"{idx}" for idx in ras.index] - ras_list.append(ras) - for patient_id in ras.index: - class_assignments.loc[class_assignments.shape[0]] = [patient_id, ras_class_name] - - - # Concatenate all RAS DataFrames into a single DataFrame - ras_combined = pd.concat(ras_list, axis=0) - # Normalize RAS values column-wise by max RAS - ras_combined = ras_combined.div(ras_combined.max(axis=0)) - ras_combined.dropna(axis=1, how='all', inplace=True) - - model = modelUtils.build_cobra_model_from_csv(ARGS.model_upload) - - validation = modelUtils.validate_model(model) - - print("\n=== MODEL VALIDATION ===") - for key, value in validation.items(): - print(f"{key}: {value}") - - - generate_bounds_model(model, ras=ras_combined, output_folder=ARGS.output_path, - save_models=ARGS.save_models, save_models_path=ARGS.save_models_path, - save_models_format=ARGS.save_models_format) - class_assignments.to_csv(ARGS.cell_class, sep='\t', index=False) - - - return - -############################################################################## -if __name__ == "__main__": +""" +Apply RAS-based scaling to reaction bounds and optionally save updated models. + +Workflow: +- Read one or more RAS matrices (patients/samples x reactions) +- Normalize and merge them, optionally adding class suffixes to sample IDs +- Build a COBRA model from a tabular CSV +- Run FVA to initialize bounds, then scale per-sample based on RAS values +- Save bounds per sample and optionally export updated models in chosen formats +""" +import argparse +from typing import Optional, Dict, Set, List, Tuple, Union +import os +import numpy as np +import pandas as pd +import cobra +from cobra import Model +import sys +from joblib import Parallel, delayed, cpu_count + +try: + from .utils import general_utils as utils + from .utils import model_utils as modelUtils +except: + import utils.general_utils as utils + import utils.model_utils as modelUtils + +################################# process args ############################### +def process_args(args :List[str] = None) -> argparse.Namespace: + """ + Processes command-line arguments. + + Args: + args (list): List of command-line arguments. + + Returns: + Namespace: An object containing parsed arguments. 
+ """ + parser = argparse.ArgumentParser(usage = '%(prog)s [options]', + description = 'process some value\'s') + + + parser.add_argument("-mo", "--model_upload", type = str, + help = "path to input file with custom rules, if provided") + + parser.add_argument('-ol', '--out_log', + help = "Output log") + + parser.add_argument('-td', '--tool_dir', + type = str, + default = os.path.dirname(os.path.abspath(__file__)), + help = 'your tool directory (default: auto-detected package location)') + + parser.add_argument('-ir', '--input_ras', + type=str, + required = False, + help = 'input ras') + + parser.add_argument('-rn', '--name', + type=str, + help = 'ras class names') + + parser.add_argument('-cc', '--cell_class', + type = str, + help = 'output of cell class') + parser.add_argument( + '-idop', '--output_path', + type = str, + default='ras_to_bounds/', + help = 'output path for maps') + + parser.add_argument('-sm', '--save_models', + type=utils.Bool("save_models"), + default=False, + help = 'whether to save models with applied bounds') + + parser.add_argument('-smp', '--save_models_path', + type = str, + default='saved_models/', + help = 'output path for saved models') + + parser.add_argument('-smf', '--save_models_format', + type = str, + default='csv', + help = 'format for saved models (csv, xml, json, mat, yaml, tabular)') + + + ARGS = parser.parse_args(args) + return ARGS + +########################### warning ########################################### +def warning(s :str) -> None: + """ + Log a warning message to an output log file and print it to the console. + + Args: + s (str): The warning message to be logged and printed. + + Returns: + None + """ + if ARGS.out_log: + with open(ARGS.out_log, 'a') as log: + log.write(s + "\n\n") + print(s) + +############################ dataset input #################################### +def read_dataset(data :str, name :str) -> pd.DataFrame: + """ + Read a dataset from a CSV file and return it as a pandas DataFrame. + + Args: + data (str): Path to the CSV file containing the dataset. + name (str): Name of the dataset, used in error messages. + + Returns: + pandas.DataFrame: DataFrame containing the dataset. + + Raises: + pd.errors.EmptyDataError: If the CSV file is empty. + sys.exit: If the CSV file has the wrong format, the execution is aborted. + """ + try: + dataset = pd.read_csv(data, sep = '\t', header = 0, engine='python') + except pd.errors.EmptyDataError: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + if len(dataset.columns) < 2: + sys.exit('Execution aborted: wrong format of ' + name + '\n') + return dataset + + +def apply_ras_bounds(bounds, ras_row): + """ + Adjust the bounds of reactions in the model based on RAS values. + + Args: + bounds (pd.DataFrame): Model bounds. + ras_row (pd.Series): A row from a RAS DataFrame containing scaling factors for reaction bounds. + Returns: + new_bounds (pd.DataFrame): integrated bounds. 
+ """ + new_bounds = bounds.copy() + for reaction in ras_row.index: + scaling_factor = ras_row[reaction] + if not np.isnan(scaling_factor): + lower_bound=bounds.loc[reaction, "lower_bound"] + upper_bound=bounds.loc[reaction, "upper_bound"] + valMax=float((upper_bound)*scaling_factor) + valMin=float((lower_bound)*scaling_factor) + if upper_bound!=0 and lower_bound==0: + new_bounds.loc[reaction, "upper_bound"] = valMax + if upper_bound==0 and lower_bound!=0: + new_bounds.loc[reaction, "lower_bound"] = valMin + if upper_bound!=0 and lower_bound!=0: + new_bounds.loc[reaction, "lower_bound"] = valMin + new_bounds.loc[reaction, "upper_bound"] = valMax + return new_bounds + + +def save_model(model, filename, output_folder, file_format='csv'): + """ + Save a COBRA model to file in the specified format. + + Args: + model (cobra.Model): The model to save. + filename (str): Base filename (without extension). + output_folder (str): Output directory. + file_format (str): File format ('xml', 'json', 'mat', 'yaml', 'tabular', 'csv'). + + Returns: + None + """ + if not os.path.exists(output_folder): + os.makedirs(output_folder) + + try: + if file_format == 'tabular' or file_format == 'csv': + # Special handling for tabular format using utils functions + filepath = os.path.join(output_folder, f"{filename}.csv") + + # Use unified function for tabular export + merged = modelUtils.export_model_to_tabular( + model=model, + output_path=filepath, + include_objective=True + ) + + else: + # Standard COBRA formats + filepath = os.path.join(output_folder, f"{filename}.{file_format}") + + if file_format == 'xml': + cobra.io.write_sbml_model(model, filepath) + elif file_format == 'json': + cobra.io.save_json_model(model, filepath) + elif file_format == 'mat': + cobra.io.save_matlab_model(model, filepath) + elif file_format == 'yaml': + cobra.io.save_yaml_model(model, filepath) + else: + raise ValueError(f"Unsupported format: {file_format}") + + print(f"Model saved: {filepath}") + + except Exception as e: + warning(f"Error saving model {filename}: {str(e)}") + +def apply_bounds_to_model(model, bounds): + """ + Apply bounds from a DataFrame to a COBRA model. + + Args: + model (cobra.Model): The metabolic model to modify. + bounds (pd.DataFrame): DataFrame with reaction bounds. + + Returns: + cobra.Model: Modified model with new bounds. + """ + model_copy = model.copy() + for reaction_id in bounds.index: + try: + reaction = model_copy.reactions.get_by_id(reaction_id) + reaction.lower_bound = bounds.loc[reaction_id, "lower_bound"] + reaction.upper_bound = bounds.loc[reaction_id, "upper_bound"] + except KeyError: + # Reaction not found in model, skip + continue + return model_copy + +def process_ras_cell(cellName, ras_row, model, rxns_ids, output_folder, save_models=False, save_models_path='saved_models/', save_models_format='csv'): + """ + Process a single RAS cell, apply bounds, and save the bounds to a CSV file. + + Args: + cellName (str): The name of the RAS cell (used for naming the output file). + ras_row (pd.Series): A row from a RAS DataFrame containing scaling factors for reaction bounds. + model (cobra.Model): The metabolic model to be modified. + rxns_ids (list of str): List of reaction IDs to which the scaling factors will be applied. + output_folder (str): Folder path where the output CSV file will be saved. + save_models (bool): Whether to save models with applied bounds. + save_models_path (str): Path where to save models. + save_models_format (str): Format for saved models. 
+ + Returns: + None + """ + bounds = pd.DataFrame([(rxn.lower_bound, rxn.upper_bound) for rxn in model.reactions], index=rxns_ids, columns=["lower_bound", "upper_bound"]) + new_bounds = apply_ras_bounds(bounds, ras_row) + new_bounds.to_csv(output_folder + cellName + ".csv", sep='\t', index=True) + + # Save model if requested + if save_models: + modified_model = apply_bounds_to_model(model, new_bounds) + save_model(modified_model, cellName, save_models_path, save_models_format) + + return + +def generate_bounds_model(model: cobra.Model, ras=None, output_folder='output/', save_models=False, save_models_path='saved_models/', save_models_format='csv') -> pd.DataFrame: + """ + Generate reaction bounds for a metabolic model based on medium conditions and optional RAS adjustments. + + Args: + model (cobra.Model): The metabolic model for which bounds will be generated. + ras (pd.DataFrame, optional): RAS pandas dataframe. Defaults to None. + output_folder (str, optional): Folder path where output CSV files will be saved. Defaults to 'output/'. + save_models (bool): Whether to save models with applied bounds. + save_models_path (str): Path where to save models. + save_models_format (str): Format for saved models. + + Returns: + pd.DataFrame: DataFrame containing the bounds of reactions in the model. + """ + rxns_ids = [rxn.id for rxn in model.reactions] + + # Perform Flux Variability Analysis (FVA) on this medium + df_FVA = cobra.flux_analysis.flux_variability_analysis(model, fraction_of_optimum=0, processes=1).round(8) + + # Set FVA bounds + for reaction in rxns_ids: + model.reactions.get_by_id(reaction).lower_bound = float(df_FVA.loc[reaction, "minimum"]) + model.reactions.get_by_id(reaction).upper_bound = float(df_FVA.loc[reaction, "maximum"]) + + if ras is not None: + Parallel(n_jobs=cpu_count())(delayed(process_ras_cell)( + cellName, ras_row, model, rxns_ids, output_folder, + save_models, save_models_path, save_models_format + ) for cellName, ras_row in ras.iterrows()) + else: + raise ValueError("RAS DataFrame is None. Cannot generate bounds without RAS data.") + return + +############################# main ########################################### +def main(args:List[str] = None) -> None: + """ + Initialize and execute RAS-to-bounds pipeline based on the frontend input arguments. + + Returns: + None + """ + if not os.path.exists('ras_to_bounds'): + os.makedirs('ras_to_bounds') + + global ARGS + ARGS = process_args(args) + + + ras_file_list = ARGS.input_ras.split(",") + ras_file_names = ARGS.name.split(",") + if len(ras_file_names) != len(set(ras_file_names)): + error_message = "Duplicated file names in the uploaded RAS matrices." 
+ warning(error_message) + raise ValueError(error_message) + + ras_class_names = [] + for file in ras_file_names: + ras_class_names.append(file.rsplit(".", 1)[0]) + ras_list = [] + class_assignments = pd.DataFrame(columns=["Patient_ID", "Class"]) + for ras_matrix, ras_class_name in zip(ras_file_list, ras_class_names): + ras = read_dataset(ras_matrix, "ras dataset") + ras.replace("None", None, inplace=True) + ras.set_index("Reactions", drop=True, inplace=True) + ras = ras.T + ras = ras.astype(float) + if(len(ras_file_list)>1): + # Append class name to patient id (DataFrame index) + ras.index = [f"{idx}_{ras_class_name}" for idx in ras.index] + else: + ras.index = [f"{idx}" for idx in ras.index] + ras_list.append(ras) + for patient_id in ras.index: + class_assignments.loc[class_assignments.shape[0]] = [patient_id, ras_class_name] + + + # Concatenate all RAS DataFrames into a single DataFrame + ras_combined = pd.concat(ras_list, axis=0) + # Normalize RAS values column-wise by max RAS + ras_combined = ras_combined.div(ras_combined.max(axis=0)) + ras_combined.dropna(axis=1, how='all', inplace=True) + + model = modelUtils.build_cobra_model_from_csv(ARGS.model_upload) + + validation = modelUtils.validate_model(model) + + print("\n=== MODEL VALIDATION ===") + for key, value in validation.items(): + print(f"{key}: {value}") + + + generate_bounds_model(model, ras=ras_combined, output_folder=ARGS.output_path, + save_models=ARGS.save_models, save_models_path=ARGS.save_models_path, + save_models_format=ARGS.save_models_format) + class_assignments.to_csv(ARGS.cell_class, sep='\t', index=False) + + + return + +############################################################################## +if __name__ == "__main__": main() \ No newline at end of file
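
At the heart of the pipeline above is the per-sample scaling in `apply_ras_bounds`: FVA-derived bounds are multiplied by each (max-normalized) RAS value, touching only the non-zero side(s) of every reaction. A condensed restatement of that rule with toy numbers, not taken from the repository:

```python
import numpy as np
import pandas as pd

# FVA-style bounds for three reactions (illustrative values)
bounds = pd.DataFrame(
    {"lower_bound": [0.0, -10.0, -5.0], "upper_bound": [10.0, 0.0, 8.0]},
    index=["R_fwd", "R_rev", "R_bidir"],
)
# One sample's normalized RAS values; NaN leaves the reaction untouched
ras_row = pd.Series({"R_fwd": 0.5, "R_rev": 0.2, "R_bidir": np.nan})

new_bounds = bounds.copy()
for rxn, factor in ras_row.items():
    if np.isnan(factor):
        continue
    lb = bounds.loc[rxn, "lower_bound"]
    ub = bounds.loc[rxn, "upper_bound"]
    if ub != 0:  # scale the forward capacity
        new_bounds.loc[rxn, "upper_bound"] = ub * factor
    if lb != 0:  # scale the reverse capacity
        new_bounds.loc[rxn, "lower_bound"] = lb * factor

print(new_bounds)
# R_fwd -> [0, 5], R_rev -> [-2, 0], R_bidir -> unchanged [-5, 8]
```

In the tool itself, `process_ras_cell` writes one such bounds table per sample and, with `--save_models`, also exports the rescaled model through `save_model` in the chosen format.
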
--- a/COBRAxy/src/rps_generator.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/rps_generator.py Sun Oct 26 19:27:41 2025 +0000 @@ -14,8 +14,12 @@ from typing import Optional, List, Dict -import utils.general_utils as utils -import utils.reaction_parsing as reactionUtils +try: + from .utils import general_utils as utils + from .utils import reaction_parsing as reactionUtils +except: + import utils.general_utils as utils + import utils.reaction_parsing as reactionUtils ########################## argparse ########################################## ARGS :argparse.Namespace @@ -38,9 +42,10 @@ help = "path to input file containing the reactions") parser.add_argument('-td', '--tool_dir', - type = str, - required = True, - help = 'your tool directory') + type = str, + default = os.path.dirname(os.path.abspath(__file__)), + help = 'your tool directory (default: auto-detected package location)') + parser.add_argument('-ol', '--out_log', help = "Output log") parser.add_argument('-id', '--input',
--- a/COBRAxy/src/setup.py	Sat Oct 25 15:20:55 2025 +0000
+++ b/COBRAxy/src/setup.py	Sun Oct 26 19:27:41 2025 +0000
@@ -1,59 +1,66 @@
 from setuptools import setup, find_packages
+import os
+
+# Get the path to README.md in the parent directory
+readme_path = os.path.join(os.path.dirname(__file__), '..', 'README.md')
 
 setup(
     name='cobraxy',
     version='0.1.0',
     description='A collection of tools for metabolic flux analysis in Galaxy.',
-    long_description=open('README.md',encoding="utf-8").read(),
+    long_description=open(readme_path, encoding="utf-8").read(),
     long_description_content_type='text/markdown',
-    author='',
-    author_email='',
+    author='Francesco Lapi',
+    author_email='f.lapi@campus.unimib.it',
     url='https://github.com/CompBtBs/COBRAxy.git',
     license='',
-    packages=find_packages(include=["utils", "utils.*"]),
-    py_modules=[
-        'ras_generator',
-        'rps_generator',
-        'marea_cluster',
-        'marea',
-        'metabolic_model_setting',
-        'ras_to_bounds',
-        'flux_simulation',
-        'flux_to_map'
-    ],
+    package_dir={'cobraxy': '.'},  # Map the 'cobraxy' package to the current directory
+    packages=['cobraxy', 'cobraxy.utils', 'cobraxy.local'],  # Only packages under cobraxy
+    package_data={
+        'cobraxy': ['*.py'],  # Include the main Python modules
+        'cobraxy.local': ['**/*'],  # Include all files in local directory
+        'cobraxy.utils': ['**/*'],  # Include all files in utils directory
+    },
     include_package_data=True,
     install_requires=[
-        'cairosvg==2.7.1',
-        'cobra==0.29.0',
-        'joblib==1.4.2',
-        'lxml==5.2.2',
-        'matplotlib==3.7.3',
-        'numpy==1.24.4',
-        'pandas==2.0.3',
-        'pyvips==2.2.3',
-        'scikit-learn==1.3.2',
-        'scipy==1.11',
-        'seaborn==0.13.0',
-        'svglib==1.5.1',
-        'anndata==0.8.0',
-        'pydeseq2==0.5.1'
+        'cairosvg>=2.7.0',
+        'cobra>=0.29.0',
+        'joblib>=1.3.0',
+        'lxml>=5.0.0',
+        'matplotlib>=3.7.0',
+        'numpy>=1.24.0',
+        'pandas>=2.0.0',
+        'pyvips>=2.2.0',
+        'scikit-learn>=1.3.0',
+        'scipy>=1.11.0',
+        'seaborn>=0.13.0',
+        'svglib>=1.5.0',
+        'anndata>=0.8.0',
+        'pydeseq2>=0.4.0'
     ],
     entry_points={
         'console_scripts': [
-            'metabolic_model_setting=metabolic_model_setting:main',
-            'ras_generator=ras_generator:main',
-            'rps_generator=rps_generator:main',
-            'marea_cluster=marea_cluster:main',
-            'marea=marea:main',
-            'ras_to_bounds=ras_to_bounds:main',
-            'flux_simulation=flux_simulation:main',
-            'flux_to_map=flux_to_map:main'
+            'importMetabolicModel=cobraxy.importMetabolicModel:main',
+            'exportMetabolicModel=cobraxy.exportMetabolicModel:main',
+            'ras_generator=cobraxy.ras_generator:main',
+            'rps_generator=cobraxy.rps_generator:main',
+            'marea_cluster=cobraxy.marea_cluster:main',
+            'marea=cobraxy.marea:main',
+            'ras_to_bounds=cobraxy.ras_to_bounds:main',
+            'flux_simulation=cobraxy.flux_simulation:main',
+            'flux_to_map=cobraxy.flux_to_map:main'
        ],
    },
    classifiers=[
        'Programming Language :: Python :: 3',
+        'Programming Language :: Python :: 3.8',
+        'Programming Language :: Python :: 3.9',
+        'Programming Language :: Python :: 3.10',
+        'Programming Language :: Python :: 3.11',
+        'Programming Language :: Python :: 3.12',
+        'Programming Language :: Python :: 3.13',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
    ],
-    python_requires='>=3.8.20,<3.12',
+    python_requires='>=3.8,<3.14',
 )
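
With the `package_dir`/`entry_points` layout above, the tools become importable under the `cobraxy` namespace once `pip install .` has been run from `src/`, in addition to the shell commands listed in `console_scripts`. A small sketch of a programmatic call; the file names are placeholders, not files shipped with the repository:

```python
# Run the RAS-to-bounds step without going through the shell entry point.
from cobraxy import ras_to_bounds

ras_to_bounds.main([
    "--model_upload", "ENGRO2_tabular.csv",  # tabular model, e.g. produced by importMetabolicModel
    "--input_ras", "ras.tsv",                # RAS matrix written by ras_generator
    "--name", "ras.tsv",
    "--cell_class", "cell_classes.tsv",
    "--output_path", "ras_to_bounds/",
])
```

The equivalent shell invocation uses the `ras_to_bounds` console script that this `setup.py` installs.
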
--- a/COBRAxy/src/shed.yml Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/shed.yml Sun Oct 26 19:27:41 2025 +0000 @@ -9,7 +9,9 @@ tools: - tool_panel_section_label: "COBRA Toolbox" - file: ./metabolic_model_setting.xml + file: ./importMetabolicModel.xml + - tool_panel_section_label: "COBRA Toolbox" + file: ./exportMetabolicModel.xml - tool_panel_section_label: "COBRA Toolbox" file: ./ras_generator.xml - tool_panel_section_label: "COBRA Toolbox"
--- a/COBRAxy/src/test/README.md Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,198 +0,0 @@ -# COBRAxy Test Suite - -This directory contains comprehensive unit tests for all COBRAxy modules. - -## Test Files - -- **test_utils.py** - Tests for utility modules (general_utils, rule_parsing, reaction_parsing, model_utils, CBS_backend) -- **test_generators.py** - Tests for RAS and RPS generator modules -- **test_marea.py** - Tests for MAREA, flux_simulation, flux_to_map, and visualization modules -- **test_clustering.py** - Tests for marea_cluster and clustering algorithms -- **testing.py** - Original testing framework (legacy) - -## Running Tests - -### Option 1: Using pytest (Recommended) - -Install pytest if not already installed: -```bash -pip install pytest pytest-cov -``` - -Run all tests: -```bash -cd /hdd/home/flapi/COBRAxy/src/test -pytest -v -``` - -Run specific test file: -```bash -pytest test_utils.py -v -pytest test_generators.py -v -pytest test_marea.py -v -pytest test_clustering.py -v -``` - -Run tests with coverage: -```bash -pytest --cov=../ --cov-report=html -``` - -### Option 2: Run individual test files - -Each test file can be run standalone: -```bash -python test_utils.py -python test_generators.py -python test_marea.py -python test_clustering.py -``` - -### Option 3: Run all tests with the run script - -```bash -python run_all_tests.py -``` - -## Test Structure - -Each test file is organized into classes that group related tests: - -```python -class TestModuleName: - """Tests for module_name""" - - def test_specific_feature(self): - """Test description""" - # Test code - assert result == expected -``` - -## Adding New Tests - -To add new tests: - -1. Choose the appropriate test file or create a new one -2. Create a test class if needed -3. Add test methods (must start with `test_`) -4. Use assertions to verify behavior - -Example: -```python -class TestMyFeature: - def test_my_new_function(self): - """Test my new function""" - result = my_function(input_data) - assert result == expected_output -``` - -## Test Coverage - -Current test coverage includes: - -### Utils Module -- ✓ Bool validator -- ✓ CustomErr class -- ✓ FilePath creation and validation -- ✓ Model enum -- ✓ Rule parsing -- ✓ Reaction parsing -- ✓ Model utilities - -### Generators -- ✓ RAS calculation with AND/OR rules -- ✓ RPS calculation with metabolite abundances -- ✓ Missing metabolite handling -- ✓ Blacklist functionality -- ✓ Complex nested rules - -### MAREA and Visualization -- ✓ Argument processing -- ✓ Data format validation -- ✓ Statistical operations (fold change, p-values) -- ✓ SVG map visualization -- ✓ Model conversion tools - -### Clustering -- ✓ K-means clustering -- ✓ DBSCAN clustering -- ✓ Hierarchical clustering -- ✓ Data scaling/normalization -- ✓ Cluster evaluation metrics -- ✓ Visualization preparation - -## Dependencies - -Required packages for running tests: -- pytest (optional but recommended) -- pandas -- numpy -- scipy -- scikit-learn -- cobra -- lxml - -All dependencies are listed in the main setup.py file. - -## Continuous Integration - -Tests can be integrated into CI/CD pipelines: - -```yaml -# Example GitHub Actions workflow -name: Tests -on: [push, pull_request] -jobs: - test: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v2 - - name: Set up Python - uses: actions/setup-python@v2 - with: - python-version: '3.10' - - name: Install dependencies - run: | - pip install -e . 
- pip install pytest pytest-cov - - name: Run tests - run: | - cd src/test - pytest -v --cov -``` - -## Troubleshooting - -### Import Errors -If you get import errors, make sure you're running from the test directory: -```bash -cd /hdd/home/flapi/COBRAxy/src/test -python test_utils.py -``` - -### Missing Files -If tests fail due to missing pickle files or models, verify that the `local/` directory structure is intact: -``` -src/ -├── local/ -│ ├── pickle files/ -│ ├── svg metabolic maps/ -│ ├── models/ -│ ├── mappings/ -│ └── medium/ -``` - -### Path Issues -Tests automatically set up paths relative to the test directory. If you encounter path issues, check that `TOOL_DIR` is set correctly in the test file. - -## Contributing - -When adding new features to COBRAxy: -1. Write tests for the new functionality -2. Ensure all existing tests still pass -3. Aim for >80% code coverage -4. Document any new test files in this README - -## License - -Same license as COBRAxy main project.
--- a/COBRAxy/src/test/run_all_tests.py Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,160 +0,0 @@ -#!/usr/bin/env python3 -""" -Run all COBRAxy tests. - -This script runs all test files and provides a summary of results. -Can be run with or without pytest. -""" - -import sys -import os -import subprocess -from pathlib import Path - -# Colors for terminal output -class Colors: - HEADER = '\033[95m' - OKBLUE = '\033[94m' - OKCYAN = '\033[96m' - OKGREEN = '\033[92m' - WARNING = '\033[93m' - FAIL = '\033[91m' - ENDC = '\033[0m' - BOLD = '\033[1m' - UNDERLINE = '\033[4m' - - -def print_header(text): - """Print a formatted header""" - print(f"\n{Colors.HEADER}{Colors.BOLD}{'='*70}{Colors.ENDC}") - print(f"{Colors.HEADER}{Colors.BOLD}{text:^70}{Colors.ENDC}") - print(f"{Colors.HEADER}{Colors.BOLD}{'='*70}{Colors.ENDC}\n") - - -def run_with_pytest(): - """Run tests using pytest""" - print_header("Running Tests with pytest") - - test_dir = Path(__file__).parent - - # Run pytest with coverage - cmd = [ - sys.executable, "-m", "pytest", - str(test_dir), - "-v", - "--tb=short", - f"--cov={test_dir.parent}", - "--cov-report=term-missing", - "--cov-report=html" - ] - - print(f"{Colors.OKCYAN}Command: {' '.join(cmd)}{Colors.ENDC}\n") - - result = subprocess.run(cmd) - - if result.returncode == 0: - print(f"\n{Colors.OKGREEN}{Colors.BOLD}✓ All tests passed!{Colors.ENDC}") - print(f"{Colors.OKGREEN}Coverage report generated in htmlcov/index.html{Colors.ENDC}") - else: - print(f"\n{Colors.FAIL}{Colors.BOLD}✗ Some tests failed!{Colors.ENDC}") - - return result.returncode - - -def run_without_pytest(): - """Run tests without pytest""" - print_header("Running Tests (without pytest)") - print(f"{Colors.WARNING}Note: pytest not found. Running basic test execution.{Colors.ENDC}\n") - - test_dir = Path(__file__).parent - test_files = [ - "test_utils.py", - "test_generators.py", - "test_marea.py", - "test_clustering.py" - ] - - total_passed = 0 - total_failed = 0 - failed_files = [] - - for test_file in test_files: - test_path = test_dir / test_file - - if not test_path.exists(): - print(f"{Colors.WARNING}⊘ {test_file}: Not found{Colors.ENDC}") - continue - - print(f"\n{Colors.OKBLUE}{Colors.BOLD}Running {test_file}...{Colors.ENDC}") - print(f"{Colors.OKBLUE}{'─'*70}{Colors.ENDC}") - - result = subprocess.run( - [sys.executable, str(test_path)], - capture_output=False - ) - - if result.returncode == 0: - print(f"{Colors.OKGREEN}✓ {test_file} passed{Colors.ENDC}") - else: - print(f"{Colors.FAIL}✗ {test_file} failed{Colors.ENDC}") - failed_files.append(test_file) - - # Print summary - print_header("Test Summary") - - if not failed_files: - print(f"{Colors.OKGREEN}{Colors.BOLD}✓ All test files passed!{Colors.ENDC}") - return 0 - else: - print(f"{Colors.FAIL}{Colors.BOLD}✗ Failed test files:{Colors.ENDC}") - for file in failed_files: - print(f"{Colors.FAIL} - {file}{Colors.ENDC}") - return 1 - - -def check_dependencies(): - """Check if required dependencies are installed""" - print_header("Checking Dependencies") - - required = [ - "pandas", "numpy", "scipy", "sklearn", - "cobra", "lxml", "matplotlib", "seaborn" - ] - - missing = [] - - for package in required: - try: - __import__(package) - print(f"{Colors.OKGREEN}✓ {package:20} installed{Colors.ENDC}") - except ImportError: - print(f"{Colors.FAIL}✗ {package:20} missing{Colors.ENDC}") - missing.append(package) - - if missing: - print(f"\n{Colors.WARNING}Missing packages: {', '.join(missing)}{Colors.ENDC}") - 
print(f"{Colors.WARNING}Install with: pip install {' '.join(missing)}{Colors.ENDC}") - return False - - return True - - -def main(): - """Main entry point""" - print_header("COBRAxy Test Suite") - - # Check dependencies - if not check_dependencies(): - print(f"\n{Colors.FAIL}Please install missing dependencies first.{Colors.ENDC}") - return 1 - - # Try to use pytest if available - try: - import pytest - return run_with_pytest() - except ImportError: - return run_without_pytest() - - -if __name__ == "__main__": - sys.exit(main())
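The deleted `run_all_tests.py` above wrapped pytest in a colourised dependency check and subprocess calls. If a local checkout wants to keep an equivalent entry point after this removal, a much smaller stand-in is enough; the sketch below is hypothetical (the file name is an assumption) and presumes `pytest` and `pytest-cov` are installed:

```python
# run_tests.py - minimal stand-in for the removed run_all_tests.py
# Runs every test module in the current directory with verbose output,
# short tracebacks, and coverage over the parent src/ tree, mirroring
# the command the old runner assembled via subprocess.
import sys
import pytest

sys.exit(pytest.main(["-v", "--tb=short", "--cov=..", "--cov-report=term-missing"]))
```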
--- a/COBRAxy/src/test/test_clustering.py Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,390 +0,0 @@ -""" -Unit tests for marea_cluster module. - -Run with: python -m pytest test_clustering.py -v -Or: python test_clustering.py -""" - -import sys -import os -import pandas as pd -import numpy as np -import tempfile -from pathlib import Path - -# Try to import pytest, but don't fail if not available -try: - import pytest - HAS_PYTEST = True -except ImportError: - HAS_PYTEST = False - class _DummyPytest: - class raises: - def __init__(self, *args, **kwargs): - self.expected_exceptions = args - def __enter__(self): - return self - def __exit__(self, exc_type, exc_val, exc_tb): - if exc_type is None: - raise AssertionError("Expected an exception but none was raised") - if not any(issubclass(exc_type, e) for e in self.expected_exceptions): - return False - return True - pytest = _DummyPytest() - -# Add parent directory to path -sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) - -import marea_cluster - -# Get the tool directory -TOOL_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..')) - - -class TestMAREACluster: - """Tests for marea_cluster module""" - - def test_process_args(self): - """Test argument processing""" - args = marea_cluster.process_args([ - '-td', TOOL_DIR, - '-cy', 'kmeans' - ]) - assert hasattr(args, 'tool_dir') - assert hasattr(args, 'cluster_type') - - def test_clustering_types(self): - """Test that all clustering types are available""" - args = marea_cluster.process_args(['-cy', 'kmeans']) - assert args.cluster_type == 'kmeans' - - args = marea_cluster.process_args(['-cy', 'dbscan']) - assert args.cluster_type == 'dbscan' - - args = marea_cluster.process_args(['-cy', 'hierarchy']) - assert args.cluster_type == 'hierarchy' - - -class TestKMeansClustering: - """Tests for K-means clustering""" - - def test_kmeans_basic(self): - """Test basic K-means clustering""" - from sklearn.cluster import KMeans - - # Create sample data - data = np.array([ - [1.0, 2.0], - [1.5, 1.8], - [5.0, 8.0], - [8.0, 8.0], - [1.0, 0.6], - [9.0, 11.0] - ]) - - # Perform clustering - kmeans = KMeans(n_clusters=2, random_state=42) - labels = kmeans.fit_predict(data) - - assert len(labels) == len(data) - assert len(set(labels)) == 2 # Should have 2 clusters - - def test_kmeans_with_dataframe(self): - """Test K-means clustering with DataFrame""" - from sklearn.cluster import KMeans - - # Create sample DataFrame - df = pd.DataFrame({ - 'feature1': [1.0, 1.5, 5.0, 8.0, 1.0, 9.0], - 'feature2': [2.0, 1.8, 8.0, 8.0, 0.6, 11.0] - }) - - kmeans = KMeans(n_clusters=2, random_state=42) - df['cluster'] = kmeans.fit_predict(df[['feature1', 'feature2']]) - - assert 'cluster' in df.columns - assert len(df['cluster'].unique()) == 2 - - -class TestDBSCANClustering: - """Tests for DBSCAN clustering""" - - def test_dbscan_basic(self): - """Test basic DBSCAN clustering""" - from sklearn.cluster import DBSCAN - - # Create sample data with clear clusters - data = np.array([ - [1.0, 2.0], - [1.5, 1.8], - [1.2, 2.1], - [8.0, 8.0], - [8.5, 8.3], - [8.2, 8.1] - ]) - - # Perform clustering - dbscan = DBSCAN(eps=1.0, min_samples=2) - labels = dbscan.fit_predict(data) - - assert len(labels) == len(data) - # DBSCAN should find at least 2 clusters (excluding noise as -1) - unique_labels = set(labels) - unique_labels.discard(-1) # Remove noise label - assert len(unique_labels) >= 1 - - -class TestHierarchicalClustering: - """Tests for Hierarchical clustering""" - - def 
test_hierarchical_basic(self): - """Test basic hierarchical clustering""" - from sklearn.cluster import AgglomerativeClustering - - # Create sample data - data = np.array([ - [1.0, 2.0], - [1.5, 1.8], - [5.0, 8.0], - [8.0, 8.0], - [1.0, 0.6], - [9.0, 11.0] - ]) - - # Perform clustering - hierarchical = AgglomerativeClustering(n_clusters=2) - labels = hierarchical.fit_predict(data) - - assert len(labels) == len(data) - assert len(set(labels)) == 2 - - -class TestScaling: - """Tests for data scaling/normalization""" - - def test_standard_scaling(self): - """Test standard scaling""" - from sklearn.preprocessing import StandardScaler - - data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) - - scaler = StandardScaler() - scaled_data = scaler.fit_transform(data) - - # Check that mean is close to 0 - assert np.abs(scaled_data.mean(axis=0)).max() < 1e-10 - - # Check that std is close to 1 - assert np.abs(scaled_data.std(axis=0) - 1.0).max() < 1e-10 - - def test_minmax_scaling(self): - """Test min-max scaling""" - from sklearn.preprocessing import MinMaxScaler - - data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) - - scaler = MinMaxScaler() - scaled_data = scaler.fit_transform(data) - - # Check that min is 0 and max is 1 - assert scaled_data.min() == 0.0 - assert scaled_data.max() == 1.0 - - -class TestClusterEvaluation: - """Tests for cluster evaluation metrics""" - - def test_silhouette_score(self): - """Test silhouette score calculation""" - from sklearn.cluster import KMeans - from sklearn.metrics import silhouette_score - - # Create well-separated data - data = np.array([ - [1.0, 1.0], - [1.5, 1.5], - [1.2, 1.3], - [10.0, 10.0], - [10.5, 10.5], - [10.2, 10.3] - ]) - - kmeans = KMeans(n_clusters=2, random_state=42) - labels = kmeans.fit_predict(data) - - score = silhouette_score(data, labels) - - # Well-separated clusters should have high silhouette score - assert score > 0.5 - assert score <= 1.0 - - -class TestDataPreparation: - """Tests for data preparation for clustering""" - - def test_remove_constant_features(self): - """Test removal of constant features""" - df = pd.DataFrame({ - 'var_feature': [1, 2, 3, 4, 5], - 'const_feature': [1, 1, 1, 1, 1], - 'var_feature2': [5, 4, 3, 2, 1] - }) - - # Remove constant columns - df_filtered = df.loc[:, df.std() > 0] - - assert 'const_feature' not in df_filtered.columns - assert 'var_feature' in df_filtered.columns - assert 'var_feature2' in df_filtered.columns - - def test_handle_missing_values(self): - """Test handling of missing values""" - df = pd.DataFrame({ - 'feature1': [1.0, 2.0, np.nan, 4.0], - 'feature2': [5.0, np.nan, 7.0, 8.0] - }) - - # Drop rows with NaN - df_clean = df.dropna() - assert len(df_clean) == 2 - - # Fill NaN with mean - df_filled = df.fillna(df.mean()) - assert not df_filled.isnull().any().any() - - -class TestVisualization: - """Tests for clustering visualization""" - - def test_dendrogram_data(self): - """Test that we can create dendrogram data""" - from scipy.cluster.hierarchy import linkage - - data = np.array([ - [1.0, 2.0], - [1.5, 1.8], - [5.0, 8.0], - [8.0, 8.0] - ]) - - # Create linkage matrix - Z = linkage(data, method='ward') - - assert Z.shape[0] == len(data) - 1 - assert Z.shape[1] == 4 - - def test_elbow_method_data(self): - """Test data preparation for elbow method""" - from sklearn.cluster import KMeans - - data = np.array([ - [1.0, 2.0], - [1.5, 1.8], - [5.0, 8.0], - [8.0, 8.0], - [1.0, 0.6], - [9.0, 11.0] - ]) - - inertias = [] - K_range = range(1, 4) - - for k in K_range: - kmeans = 
KMeans(n_clusters=k, random_state=42) - kmeans.fit(data) - inertias.append(kmeans.inertia_) - - # Inertia should decrease as K increases - assert inertias[0] > inertias[1] > inertias[2] - - -class TestRealWorldScenarios: - """Tests with realistic metabolic data scenarios""" - - def test_cluster_reactions_by_flux(self): - """Test clustering reactions based on flux patterns""" - # Create sample flux data for different conditions - df = pd.DataFrame({ - 'reaction': ['r1', 'r2', 'r3', 'r4', 'r5'], - 'condition1': [1.5, 1.6, 0.1, 0.2, 2.0], - 'condition2': [1.4, 1.7, 0.15, 0.18, 2.1], - 'condition3': [1.6, 1.5, 0.12, 0.22, 1.9] - }) - - # Extract numeric features - features = df[['condition1', 'condition2', 'condition3']] - - # Cluster - from sklearn.cluster import KMeans - kmeans = KMeans(n_clusters=2, random_state=42) - df['cluster'] = kmeans.fit_predict(features) - - # r1, r2, r5 should be in one cluster (high flux) - # r3, r4 should be in another cluster (low flux) - high_flux_reactions = df[df['condition1'] > 1.0]['reaction'].tolist() - low_flux_reactions = df[df['condition1'] < 1.0]['reaction'].tolist() - - assert len(high_flux_reactions) == 3 - assert len(low_flux_reactions) == 2 - - def test_cluster_samples_by_metabolic_profile(self): - """Test clustering samples based on metabolic profiles""" - # Create sample data: samples x reactions - df = pd.DataFrame({ - 'sample': ['normal1', 'normal2', 'cancer1', 'cancer2'], - 'r1': [1.5, 1.6, 3.0, 3.1], - 'r2': [0.5, 0.6, 0.1, 0.15], - 'r3': [2.0, 2.1, 4.0, 4.2] - }) - - features = df[['r1', 'r2', 'r3']] - - from sklearn.cluster import KMeans - kmeans = KMeans(n_clusters=2, random_state=42) - df['cluster'] = kmeans.fit_predict(features) - - # Normal samples should cluster separately from cancer samples - assert len(df['cluster'].unique()) == 2 - - -if __name__ == "__main__": - # Run tests with pytest if available - if HAS_PYTEST: - pytest.main([__file__, "-v"]) - else: - print("pytest not available, running basic tests...") - - test_classes = [ - TestMAREACluster(), - TestKMeansClustering(), - TestDBSCANClustering(), - TestHierarchicalClustering(), - TestScaling(), - TestClusterEvaluation(), - TestDataPreparation(), - TestVisualization(), - TestRealWorldScenarios() - ] - - failed = 0 - passed = 0 - - for test_class in test_classes: - class_name = test_class.__class__.__name__ - print(f"\n{class_name}:") - - for method_name in dir(test_class): - if method_name.startswith("test_"): - try: - method = getattr(test_class, method_name) - method() - print(f" ✓ {method_name}") - passed += 1 - except Exception as e: - print(f" ✗ {method_name}: {str(e)}") - failed += 1 - - print(f"\n{'='*60}") - print(f"Results: {passed} passed, {failed} failed") - if failed > 0: - sys.exit(1)
--- a/COBRAxy/src/test/test_generators.py Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,388 +0,0 @@ -""" -Unit tests for RAS and RPS generator modules. - -Run with: python -m pytest test_generators.py -v -Or: python test_generators.py -""" - -import sys -import os -import pandas as pd -import numpy as np -import math -from pathlib import Path -import tempfile - -# Try to import pytest, but don't fail if not available -try: - import pytest - HAS_PYTEST = True -except ImportError: - HAS_PYTEST = False - class _DummyPytest: - class raises: - def __init__(self, *args, **kwargs): - self.expected_exceptions = args - def __enter__(self): - return self - def __exit__(self, exc_type, exc_val, exc_tb): - if exc_type is None: - raise AssertionError("Expected an exception but none was raised") - if not any(issubclass(exc_type, e) for e in self.expected_exceptions): - return False - return True - pytest = _DummyPytest() - -# Add parent directory to path -sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) - -import ras_generator -import rps_generator -import utils.general_utils as utils -import utils.rule_parsing as ruleUtils - -# Get the tool directory -TOOL_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..')) - - -class TestRASGenerator: - """Tests for ras_generator module""" - - def test_ras_op_list_and(self): - """Test RAS calculation with AND rule""" - # Create a mock args object - class MockArgs: - none = False - - ras_generator.ARGS = MockArgs() - - # Create an OpList with AND operator - rule = ruleUtils.OpList(ruleUtils.RuleOp.AND) - rule.extend(["gene1", "gene2", "gene3"]) - - # Create dataset - dataset = { - "gene1": 5.0, - "gene2": 2.0, - "gene3": 3.0 - } - - # Should return minimum value (AND logic) - result = ras_generator.ras_op_list(rule, dataset) - assert result == 2.0 - - def test_ras_op_list_or(self): - """Test RAS calculation with OR rule""" - class MockArgs: - none = False - - ras_generator.ARGS = MockArgs() - - # Create an OpList with OR operator - rule = ruleUtils.OpList(ruleUtils.RuleOp.OR) - rule.extend(["gene1", "gene2", "gene3"]) - - dataset = { - "gene1": 5.0, - "gene2": 2.0, - "gene3": 3.0 - } - - # Should return maximum value (OR logic) - result = ras_generator.ras_op_list(rule, dataset) - assert result == 5.0 - - def test_ras_op_list_with_none(self): - """Test RAS calculation with None values""" - class MockArgs: - none = True - - ras_generator.ARGS = MockArgs() - - rule = ruleUtils.OpList(ruleUtils.RuleOp.AND) - rule.extend(["gene1", "gene2"]) - - dataset = { - "gene1": 5.0, - "gene2": None - } - - # Should return None when none=True and a gene is None - result = ras_generator.ras_op_list(rule, dataset) - assert result is None - - def test_process_args(self): - """Test argument processing""" - # Test that process_args returns a valid Namespace - args = ras_generator.process_args([]) - assert hasattr(args, 'tool_dir') - - -class TestRPSGenerator: - """Tests for rps_generator module""" - - def test_get_abund_data(self): - """Test extracting abundance data from dataset""" - dataset = pd.DataFrame({ - "cell_lines": ["normal", "cancer"], - "pyruvate": [5.3, 7.01], - "glucose": [8.2, 4.0], - "atp": [7.05, 8.83] - }) - - # Get first row (normal) - result = rps_generator.get_abund_data(dataset, 0) - assert result is not None - assert result["pyruvate"] == 5.3 - assert result["glucose"] == 8.2 - assert result["atp"] == 7.05 - assert result["name"] == "normal" - - # Get second row (cancer) - result = 
rps_generator.get_abund_data(dataset, 1) - assert result is not None - assert result["pyruvate"] == 7.01 - assert result["name"] == "cancer" - - def test_get_abund_data_invalid_index(self): - """Test extracting abundance data with invalid index""" - dataset = pd.DataFrame({ - "cell_lines": ["normal", "cancer"], - "pyruvate": [5.3, 7.01] - }) - - # Try to get invalid index - result = rps_generator.get_abund_data(dataset, -1) - assert result is None - - result = rps_generator.get_abund_data(dataset, 999) - assert result is None - - def test_clean_metabolite_name(self): - """Test metabolite name cleaning""" - # Test removing special characters - result = rps_generator.clean_metabolite_name("4,4'-diphenylmethane diisocyanate") - assert "," not in result - assert "'" not in result - assert " " not in result - assert result == "44diphenylmethanediisocyanate" - - # Test with parentheses - result = rps_generator.clean_metabolite_name("(S)-lactate") - assert "(" not in result - assert ")" not in result - - def test_check_missing_metab(self): - """Test checking for missing metabolites""" - reactions_dict = { - "r1": {"glc__D": 1, "atp": 1}, - "r2": {"co2": 2, "pyr": 3} - } - - abundances = { - "glc__D": 8.2, - "atp": 7.05, - "pyr": 5.3 - # co2 is missing - } - - updated_abundances, missing = rps_generator.check_missing_metab( - reactions_dict, - abundances.copy() - ) - - # Should have added co2 with value 1 - assert "co2" in updated_abundances - assert updated_abundances["co2"] == 1 - - # Should report co2 as missing - assert "co2" in missing - - def test_calculate_rps(self): - """Test RPS calculation""" - reactions_dict = { - "r1": {"glc__D": 1}, - "r2": {"co2": 2, "pyr": 3}, - "r3": {"atp": 2, "glc__D": 4} - } - - abundances = { - "glc__D": 8.2, - "pyr": 5.3, - "atp": 7.05, - "co2": 1.0 - } - - black_list = [] - missing_in_dataset = ["co2"] - - result = rps_generator.calculate_rps( - reactions_dict, - abundances, - black_list, - missing_in_dataset - ) - - # Check that RPS values are calculated - assert "r1" in result - assert result["r1"] == 8.2 ** 1 - - assert "r2" in result - assert result["r2"] == (1.0 ** 2) * (5.3 ** 3) - - assert "r3" in result - assert result["r3"] == (8.2 ** 4) * (7.05 ** 2) - - def test_calculate_rps_with_blacklist(self): - """Test RPS calculation with blacklisted metabolites""" - reactions_dict = { - "r1": {"atp": 3}, # Only has blacklisted metabolite - "r2": {"glc__D": 2, "atp": 1} # Has both - } - - abundances = { - "glc__D": 8.2, - "atp": 7.05 - } - - black_list = ["atp"] - missing_in_dataset = [] - - result = rps_generator.calculate_rps( - reactions_dict, - abundances, - black_list, - missing_in_dataset - ) - - # r1 should be NaN (only has blacklisted metabolite) - assert "r1" in result - assert math.isnan(result["r1"]) - - # r2 should only use glc__D (atp is blacklisted) - assert "r2" in result - assert result["r2"] == 8.2 ** 2 - - def test_process_args(self): - """Test argument processing""" - args = rps_generator.process_args([]) - assert hasattr(args, 'tool_dir') - - -class TestGeneratorIntegration: - """Integration tests for generators with real data structures""" - - def test_ras_with_complex_rule(self): - """Test RAS with complex nested rules""" - class MockArgs: - none = False - - ras_generator.ARGS = MockArgs() - - # Create complex rule: (A and B) or (C and D) - rule = ruleUtils.OpList(ruleUtils.RuleOp.OR) - - sub_rule1 = ruleUtils.OpList(ruleUtils.RuleOp.AND) - sub_rule1.extend(["geneA", "geneB"]) - - sub_rule2 = ruleUtils.OpList(ruleUtils.RuleOp.AND) - 
sub_rule2.extend(["geneC", "geneD"]) - - rule.extend([sub_rule1, sub_rule2]) - - dataset = { - "geneA": 5.0, - "geneB": 3.0, - "geneC": 8.0, - "geneD": 2.0 - } - - # sub_rule1 (A and B) = min(5.0, 3.0) = 3.0 - # sub_rule2 (C and D) = min(8.0, 2.0) = 2.0 - # final (OR) = max(3.0, 2.0) = 3.0 - result = ras_generator.ras_op_list(rule, dataset) - assert result == 3.0 - - def test_rps_with_multiple_cell_lines(self): - """Test RPS calculation with multiple cell lines""" - dataset = pd.DataFrame({ - "cell_lines": ["normal", "cancer", "treated"], - "glucose": [8.2, 4.0, 6.5], - "pyruvate": [5.3, 7.0, 6.0] - }) - - # Test that we can extract data for all cell lines - for i in range(len(dataset)): - result = rps_generator.get_abund_data(dataset, i) - assert result is not None - assert "name" in result - assert result["glucose"] == dataset.iloc[i]["glucose"] - - -class TestFileStructure: - """Test that required files and directories exist""" - - def test_pickle_files_accessible(self): - """Test that pickle files are accessible""" - pickle_dir = os.path.join(TOOL_DIR, "local", "pickle files") - - # Check synonyms pickle - synonyms_path = os.path.join(pickle_dir, "synonyms.pickle") - assert os.path.exists(synonyms_path), f"Synonyms file not found at {synonyms_path}" - - # Check blacklist pickle - blacklist_path = os.path.join(pickle_dir, "black_list.pickle") - assert os.path.exists(blacklist_path), f"Blacklist file not found at {blacklist_path}" - - def test_can_load_synonyms(self): - """Test that we can load the synonyms dictionary""" - pickle_path = utils.FilePath( - "synonyms", - utils.FileFormat.PICKLE, - prefix=os.path.join(TOOL_DIR, "local", "pickle files") - ) - - try: - syns_dict = utils.readPickle(pickle_path) - assert syns_dict is not None - assert isinstance(syns_dict, dict) - except Exception as e: - pytest.skip(f"Could not load synonyms pickle: {e}") - - -if __name__ == "__main__": - # Run tests with pytest if available - if HAS_PYTEST: - pytest.main([__file__, "-v"]) - else: - print("pytest not available, running basic tests...") - - test_classes = [ - TestRASGenerator(), - TestRPSGenerator(), - TestGeneratorIntegration(), - TestFileStructure() - ] - - failed = 0 - passed = 0 - - for test_class in test_classes: - class_name = test_class.__class__.__name__ - print(f"\n{class_name}:") - - for method_name in dir(test_class): - if method_name.startswith("test_"): - try: - method = getattr(test_class, method_name) - method() - print(f" ✓ {method_name}") - passed += 1 - except Exception as e: - print(f" ✗ {method_name}: {str(e)}") - failed += 1 - - print(f"\n{'='*60}") - print(f"Results: {passed} passed, {failed} failed") - if failed > 0: - sys.exit(1)
--- a/COBRAxy/src/test/test_marea.py Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,370 +0,0 @@ -""" -Unit tests for MAREA, flux_simulation, and related visualization modules. - -Run with: python -m pytest test_marea.py -v -Or: python test_marea.py -""" - -import sys -import os -import pandas as pd -import numpy as np -import tempfile -from pathlib import Path - -# Try to import pytest, but don't fail if not available -try: - import pytest - HAS_PYTEST = True -except ImportError: - HAS_PYTEST = False - class _DummyPytest: - class raises: - def __init__(self, *args, **kwargs): - self.expected_exceptions = args - def __enter__(self): - return self - def __exit__(self, exc_type, exc_val, exc_tb): - if exc_type is None: - raise AssertionError("Expected an exception but none was raised") - if not any(issubclass(exc_type, e) for e in self.expected_exceptions): - return False - return True - pytest = _DummyPytest() - -# Add parent directory to path -sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) - -import marea -import flux_simulation -import flux_to_map -import ras_to_bounds -import utils.general_utils as utils - -# Get the tool directory -TOOL_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..')) - - -class TestMAREA: - """Tests for marea module""" - - def test_process_args(self): - """Test argument processing for MAREA""" - # Create minimal args for testing - with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f: - f.write("reaction_id,value\n") - f.write("r1,1.5\n") - temp_file = f.name - - try: - args = marea.process_args([ - '-td', TOOL_DIR, - '--tool_dir', TOOL_DIR - ]) - assert hasattr(args, 'tool_dir') - assert args.tool_dir == TOOL_DIR - finally: - if os.path.exists(temp_file): - os.unlink(temp_file) - - def test_comparison_types(self): - """Test that comparison type enum exists and is correct""" - # Check that the ComparisonType enum has expected values - assert hasattr(marea, 'ComparisonType') or hasattr(marea, 'GroupingCriterion') - - def test_ras_transformation(self): - """Test RAS transformation logic""" - # Create sample RAS data - ras_data = pd.DataFrame({ - 'reaction': ['r1', 'r2', 'r3'], - 'value': [1.5, 0.5, 2.0] - }) - - # Test that data can be processed - assert len(ras_data) == 3 - assert ras_data['value'].max() == 2.0 - - -class TestFluxSimulation: - """Tests for flux_simulation module""" - - def test_process_args(self): - """Test argument processing for flux simulation""" - args = flux_simulation.process_args([ - '-td', TOOL_DIR - ]) - assert hasattr(args, 'tool_dir') - - def test_flux_balance_setup(self): - """Test that FBA setup functions exist""" - # Check that key functions exist - assert hasattr(flux_simulation, 'process_args') - assert hasattr(flux_simulation, 'main') - - -class TestFluxToMap: - """Tests for flux_to_map module""" - - def test_process_args(self): - """Test argument processing for flux to map""" - args = flux_to_map.process_args([ - '-td', TOOL_DIR - ]) - assert hasattr(args, 'tool_dir') - - def test_color_map_options(self): - """Test that color map options are available""" - # The module should have color map functionality - assert hasattr(flux_to_map, 'process_args') - - -class TestRasToBounds: - """Tests for ras_to_bounds module""" - - def test_process_args(self): - """Test argument processing for RAS to bounds""" - args = ras_to_bounds.process_args([ - '-td', TOOL_DIR - ]) - assert hasattr(args, 'tool_dir') - - def test_bounds_conversion(self): - """Test 
that bounds conversion logic exists""" - # Create sample RAS data - ras_data = { - 'r1': 1.5, - 'r2': 0.5, - 'r3': 2.0 - } - - # Test basic transformation logic - # Reactions with higher RAS should have higher bounds - assert ras_data['r3'] > ras_data['r1'] > ras_data['r2'] - - -class TestModelConversion: - """Tests for model conversion tools""" - - def test_tabular_to_model(self): - """Test tabular to model conversion""" - import COBRAxy.src.exportMetabolicModel as exportMetabolicModel - - args = exportMetabolicModel.process_args([]) - assert hasattr(args, 'tool_dir') - - def test_model_to_tabular(self): - """Test model to tabular conversion""" - import COBRAxy.src.importMetabolicModel as importMetabolicModel - - args = importMetabolicModel.process_args([]) - assert hasattr(args, 'tool_dir') - - -class TestDataProcessing: - """Tests for data processing utilities used across tools""" - - def test_ras_data_format(self): - """Test RAS data format validation""" - # Create valid RAS data - ras_df = pd.DataFrame({ - 'reaction_id': ['r1', 'r2', 'r3'], - 'group1': [1.5, 0.5, 2.0], - 'group2': [1.8, 0.3, 2.2] - }) - - assert 'reaction_id' in ras_df.columns - assert len(ras_df) > 0 - - def test_rps_data_format(self): - """Test RPS data format validation""" - # Create valid RPS data - rps_df = pd.DataFrame({ - 'reaction_id': ['r1', 'r2', 'r3'], - 'sample1': [100.5, 50.3, 200.1], - 'sample2': [150.2, 30.8, 250.5] - }) - - assert 'reaction_id' in rps_df.columns - assert len(rps_df) > 0 - - def test_flux_data_format(self): - """Test flux data format validation""" - # Create valid flux data - flux_df = pd.DataFrame({ - 'reaction_id': ['r1', 'r2', 'r3'], - 'flux': [1.5, -0.5, 2.0], - 'lower_bound': [-10, -10, 0], - 'upper_bound': [10, 10, 10] - }) - - assert 'reaction_id' in flux_df.columns - assert 'flux' in flux_df.columns - - -class TestStatistics: - """Tests for statistical operations in MAREA""" - - def test_fold_change_calculation(self): - """Test fold change calculation""" - # Simple fold change test - group1_mean = 2.0 - group2_mean = 4.0 - fold_change = group2_mean / group1_mean - - assert fold_change == 2.0 - - def test_log_fold_change(self): - """Test log fold change calculation""" - group1_mean = 2.0 - group2_mean = 8.0 - log_fc = np.log2(group2_mean / group1_mean) - - assert log_fc == 2.0 # log2(8/2) = log2(4) = 2 - - def test_pvalue_correction(self): - """Test that statistical functions handle edge cases""" - # Test with identical values (should give p-value close to 1) - group1 = [1.0, 1.0, 1.0] - group2 = [1.0, 1.0, 1.0] - - from scipy import stats - t_stat, p_value = stats.ttest_ind(group1, group2) - - # p-value should be NaN or close to 1 for identical groups - assert np.isnan(p_value) or p_value > 0.9 - - -class TestMapVisualization: - """Tests for SVG map visualization""" - - def test_svg_maps_exist(self): - """Test that SVG maps exist""" - map_dir = os.path.join(TOOL_DIR, "local", "svg metabolic maps") - assert os.path.exists(map_dir) - - # Check for at least one map - maps = [f for f in os.listdir(map_dir) if f.endswith('.svg')] - assert len(maps) > 0, "No SVG maps found" - - def test_model_has_map(self): - """Test that models have associated maps""" - # ENGRO2 should have a map - engro2_map = os.path.join(TOOL_DIR, "local", "svg metabolic maps", "ENGRO2_map.svg") - if os.path.exists(engro2_map): - assert os.path.getsize(engro2_map) > 0 - - def test_color_gradient(self): - """Test color gradient generation""" - # Test that we can generate colors for a range of values - values = 
[-2.0, -1.0, 0.0, 1.0, 2.0] - - # All values should be processable - for val in values: - # Simple color mapping test - if val < 0: - # Negative values should map to one color scheme - assert val < 0 - elif val > 0: - # Positive values should map to another - assert val > 0 - else: - # Zero should be neutral - assert val == 0 - - -class TestIntegration: - """Integration tests for complete workflows""" - - def test_ras_to_marea_workflow(self): - """Test that RAS data can flow into MAREA""" - # Create sample RAS data - ras_data = pd.DataFrame({ - 'reaction_id': ['r1', 'r2', 'r3'], - 'control': [1.5, 0.8, 1.2], - 'treatment': [2.0, 0.5, 1.8] - }) - - # Calculate fold changes - ras_data['fold_change'] = ras_data['treatment'] / ras_data['control'] - - assert 'fold_change' in ras_data.columns - assert len(ras_data) == 3 - - def test_rps_to_flux_workflow(self): - """Test that RPS data can be used for flux simulation""" - # Create sample RPS data - rps_data = pd.DataFrame({ - 'reaction_id': ['r1', 'r2', 'r3'], - 'rps': [100.0, 50.0, 200.0] - }) - - # RPS can be used to set bounds - rps_data['upper_bound'] = rps_data['rps'] / 10 - - assert 'upper_bound' in rps_data.columns - - -class TestErrorHandling: - """Tests for error handling across modules""" - - def test_invalid_model_name(self): - """Test handling of invalid model names""" - with pytest.raises((ValueError, KeyError, AttributeError)): - utils.Model("INVALID_MODEL") - - def test_missing_required_column(self): - """Test handling of missing required columns""" - # Create incomplete data - incomplete_data = pd.DataFrame({ - 'wrong_column': [1, 2, 3] - }) - - # Should fail when looking for required columns - with pytest.raises(KeyError): - value = incomplete_data['reaction_id'] - - -if __name__ == "__main__": - # Run tests with pytest if available - if HAS_PYTEST: - pytest.main([__file__, "-v"]) - else: - print("pytest not available, running basic tests...") - - test_classes = [ - TestMAREA(), - TestFluxSimulation(), - TestFluxToMap(), - TestRasToBounds(), - TestModelConversion(), - TestDataProcessing(), - TestStatistics(), - TestMapVisualization(), - TestIntegration(), - TestErrorHandling() - ] - - failed = 0 - passed = 0 - - for test_class in test_classes: - class_name = test_class.__class__.__name__ - print(f"\n{class_name}:") - - for method_name in dir(test_class): - if method_name.startswith("test_"): - try: - method = getattr(test_class, method_name) - method() - print(f" ✓ {method_name}") - passed += 1 - except Exception as e: - print(f" ✗ {method_name}: {str(e)}") - import traceback - traceback.print_exc() - failed += 1 - - print(f"\n{'='*60}") - print(f"Results: {passed} passed, {failed} failed") - if failed > 0: - sys.exit(1)
--- a/COBRAxy/src/test/test_utils.py Sat Oct 25 15:20:55 2025 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,280 +0,0 @@ -""" -Unit tests for the utils modules (general_utils, rule_parsing, reaction_parsing, model_utils, CBS_backend). - -Run with: python -m pytest test_utils.py -v -Or: python test_utils.py -""" - -import sys -import os -import pandas as pd -import numpy as np -from pathlib import Path - -# Try to import pytest, but don't fail if not available -try: - import pytest - HAS_PYTEST = True -except ImportError: - HAS_PYTEST = False - # Create a dummy pytest.raises for compatibility - class _DummyPytest: - class raises: - def __init__(self, *args, **kwargs): - self.expected_exceptions = args - def __enter__(self): - return self - def __exit__(self, exc_type, exc_val, exc_tb): - if exc_type is None: - raise AssertionError("Expected an exception but none was raised") - if not any(issubclass(exc_type, e) for e in self.expected_exceptions): - return False - return True - pytest = _DummyPytest() - -# Add parent directory to path -sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..')) - -import utils.general_utils as utils -import utils.rule_parsing as ruleUtils -import utils.reaction_parsing as reactionUtils -import utils.model_utils as modelUtils - -# Get the tool directory (one level up from test directory) -TOOL_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), '..')) - -class TestGeneralUtils: - """Tests for utils.general_utils module""" - - def test_bool_check_true(self): - """Test Bool validator with true values""" - bool_checker = utils.Bool("testArg") - assert bool_checker.check("true") == True - assert bool_checker.check("True") == True - assert bool_checker.check("TRUE") == True - - def test_bool_check_false(self): - """Test Bool validator with false values""" - bool_checker = utils.Bool("testArg") - assert bool_checker.check("false") == False - assert bool_checker.check("False") == False - assert bool_checker.check("FALSE") == False - - def test_bool_check_invalid(self): - """Test Bool validator with invalid values""" - bool_checker = utils.Bool("testArg") - with pytest.raises((ValueError, utils.ArgsErr)): - bool_checker.check("foo") - with pytest.raises((ValueError, utils.ArgsErr)): - bool_checker.check("1") - - def test_custom_error(self): - """Test CustomErr class""" - err = utils.CustomErr("Test message", "Test details") - assert err.msg == "Test message" - assert err.details == "Test details" - assert isinstance(err.id, int) - - def test_custom_error_with_id(self): - """Test CustomErr class with custom ID""" - err = utils.CustomErr("Test message", "Test details", explicitErrCode=42) - assert err.msg == "Test message" - assert err.details == "Test details" - assert err.id == 42 - - def test_filepath_creation(self): - """Test FilePath creation""" - fp = utils.FilePath("test", utils.FileFormat.CSV) - assert "test" in fp.show() - assert ".csv" in fp.show() or ".tsv" in fp.show() - - def test_filepath_with_prefix(self): - """Test FilePath with prefix""" - fp = utils.FilePath("test", utils.FileFormat.CSV, prefix="/tmp") - path = fp.show() - assert "/tmp" in path - assert "test" in path - - def test_model_enum(self): - """Test Model enum""" - assert utils.Model.ENGRO2 is not None - assert utils.Model.Recon is not None - assert utils.Model.Custom is not None - - -class TestRuleParsing: - """Tests for utils.rule_parsing module""" - - def test_parse_single_gene(self): - """Test parsing a single gene""" - result = 
ruleUtils.parseRuleToNestedList("GENE1") - assert "GENE1" in str(result) - - def test_parse_or_rule(self): - """Test parsing OR rule""" - result = ruleUtils.parseRuleToNestedList("A or B") - assert result.op == ruleUtils.RuleOp.OR - assert len(result) == 2 # OpList is a list itself - - def test_parse_and_rule(self): - """Test parsing AND rule""" - result = ruleUtils.parseRuleToNestedList("A and B") - assert result.op == ruleUtils.RuleOp.AND - assert len(result) == 2 # OpList is a list itself - - def test_parse_complex_rule(self): - """Test parsing complex nested rule""" - result = ruleUtils.parseRuleToNestedList("A or (B and C)") - assert result.op == ruleUtils.RuleOp.OR - - def test_parse_invalid_rule(self): - """Test parsing invalid rule""" - with pytest.raises(Exception): - ruleUtils.parseRuleToNestedList("A foo B") - - def test_parse_mismatched_parentheses(self): - """Test parsing rule with mismatched parentheses""" - with pytest.raises(Exception): - ruleUtils.parseRuleToNestedList("A)") - - def test_rule_op_enum(self): - """Test RuleOp enum""" - assert ruleUtils.RuleOp("or") == ruleUtils.RuleOp.OR - assert ruleUtils.RuleOp("and") == ruleUtils.RuleOp.AND - - def test_rule_op_is_operator(self): - """Test RuleOp.isOperator""" - assert ruleUtils.RuleOp.isOperator("or") == True - assert ruleUtils.RuleOp.isOperator("and") == True - assert ruleUtils.RuleOp.isOperator("foo") == False - - -class TestReactionParsing: - """Tests for utils.reaction_parsing module""" - - def test_reaction_dir_reversible(self): - """Test ReactionDir detection for reversible reactions""" - result = reactionUtils.ReactionDir.fromReaction("atp <=> adp + pi") - assert result == reactionUtils.ReactionDir.REVERSIBLE - - def test_reaction_dir_forward(self): - """Test ReactionDir detection for forward reactions""" - result = reactionUtils.ReactionDir.fromReaction("atp --> adp + pi") - assert result == reactionUtils.ReactionDir.FORWARD - - def test_reaction_dir_backward(self): - """Test ReactionDir detection for backward reactions""" - result = reactionUtils.ReactionDir.fromReaction("atp <-- adp + pi") - assert result == reactionUtils.ReactionDir.BACKWARD - - def test_reaction_dir_invalid(self): - """Test ReactionDir with invalid arrow""" - with pytest.raises(Exception): - reactionUtils.ReactionDir.fromReaction("atp ??? 
adp + pi") - - def test_create_reaction_dict(self): - """Test creating reaction dictionary""" - reactions = { - 'r1': '2 pyruvate + 1 h2o <=> 1 h2o + 2 acetate', - 'r2': '2 co2 + 6 h2o --> 3 atp' - } - result = reactionUtils.create_reaction_dict(reactions) - - # Check that we have the expected reactions - assert 'r1_B' in result or 'r1_F' in result - assert 'r2' in result - - -class TestModelUtils: - """Tests for utils.model_utils module""" - - def test_gene_type_detection(self): - """Test gene type detection""" - # Test with entrez ID (numeric) - assert modelUtils.gene_type("123456", "test") == "entrez_id" - - # Test with Ensembl ID - assert modelUtils.gene_type("ENSG00000123456", "test") == "ENSG" - - # Test with symbol - assert modelUtils.gene_type("TP53", "test") == "HGNC_symbol" - - -class TestModelLoading: - """Tests for model loading functionality""" - - def test_engro2_model_exists(self): - """Test that ENGRO2 model files exist""" - model_path = os.path.join(TOOL_DIR, "local", "models", "ENGRO2.xml") - assert os.path.exists(model_path), f"ENGRO2 model not found at {model_path}" - - def test_recon_model_exists(self): - """Test that Recon model files exist""" - model_path = os.path.join(TOOL_DIR, "local", "models", "Recon.xml") - assert os.path.exists(model_path), f"Recon model not found at {model_path}" - - def test_pickle_files_exist(self): - """Test that pickle files exist""" - pickle_dir = os.path.join(TOOL_DIR, "local", "pickle files") - assert os.path.exists(pickle_dir), f"Pickle directory not found at {pickle_dir}" - - # Check for some expected pickle files - expected_files = ["synonyms.pickle", "black_list.pickle"] - for fname in expected_files: - fpath = os.path.join(pickle_dir, fname) - assert os.path.exists(fpath), f"Expected pickle file not found: {fpath}" - - def test_map_files_exist(self): - """Test that SVG map files exist""" - map_dir = os.path.join(TOOL_DIR, "local", "svg metabolic maps") - assert os.path.exists(map_dir), f"Map directory not found at {map_dir}" - - def test_medium_file_exists(self): - """Test that medium file exists""" - medium_path = os.path.join(TOOL_DIR, "local", "medium", "medium.csv") - assert os.path.exists(medium_path), f"Medium file not found at {medium_path}" - - def test_mapping_file_exists(self): - """Test that mapping file exists""" - mapping_path = os.path.join(TOOL_DIR, "local", "mappings", "genes_human.csv") - assert os.path.exists(mapping_path), f"Mapping file not found at {mapping_path}" - - -if __name__ == "__main__": - # Run tests with pytest if available, otherwise run basic checks - if HAS_PYTEST: - pytest.main([__file__, "-v"]) - else: - print("pytest not available, running basic tests...") - - # Run basic tests manually - test_classes = [ - TestGeneralUtils(), - TestRuleParsing(), - TestReactionParsing(), - TestModelUtils(), - TestModelLoading() - ] - - failed = 0 - passed = 0 - - for test_class in test_classes: - class_name = test_class.__class__.__name__ - print(f"\n{class_name}:") - - for method_name in dir(test_class): - if method_name.startswith("test_"): - try: - method = getattr(test_class, method_name) - method() - print(f" ✓ {method_name}") - passed += 1 - except Exception as e: - print(f" ✗ {method_name}: {str(e)}") - failed += 1 - - print(f"\n{'='*60}") - print(f"Results: {passed} passed, {failed} failed") - if failed > 0: - sys.exit(1)
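With the whole test suite removed in this changeset, downstream users lose the quick sanity checks it provided. A minimal smoke test can be distilled from the deleted `test_utils.py`; the sketch below is hypothetical (the file name and its placement under `src/test/` are assumptions) and only exercises the rule-parsing behaviour the removed tests demonstrably covered:

```python
# test_rules_smoke.py - tiny standalone check distilled from the removed test_utils.py
import os
import sys

# Make the src/ modules importable when this file lives in src/test/,
# the same path trick the deleted tests used.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

import utils.rule_parsing as ruleUtils


def test_or_and_rules():
    """OR and AND rules parse into OpList nodes carrying the expected operator."""
    or_rule = ruleUtils.parseRuleToNestedList("A or B")
    assert or_rule.op == ruleUtils.RuleOp.OR
    assert len(or_rule) == 2

    and_rule = ruleUtils.parseRuleToNestedList("A and B")
    assert and_rule.op == ruleUtils.RuleOp.AND
    assert len(and_rule) == 2


if __name__ == "__main__":
    test_or_and_rules()
    print("rule parsing smoke test passed")
```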
--- a/COBRAxy/src/utils/model_utils.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/utils/model_utils.py Sun Oct 26 19:27:41 2025 +0000 @@ -15,11 +15,16 @@ import logging from typing import Optional, Tuple, Union, List, Dict, Set from collections import defaultdict -import utils.rule_parsing as rulesUtils -import utils.reaction_parsing as reactionUtils from cobra import Model as cobraModel, Reaction, Metabolite import sys +try: + from . import rule_parsing as rulesUtils + from . import reaction_parsing as reactionUtils +except: + import rule_parsing as rulesUtils + import reaction_parsing as reactionUtils + ############################ check_methods #################################### def gene_type(l :str, name :str) -> str:
--- a/COBRAxy/src/utils/reaction_parsing.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/utils/reaction_parsing.py Sun Oct 26 19:27:41 2025 +0000 @@ -8,10 +8,14 @@ - Loading custom reactions from a tabular file (TSV) """ from enum import Enum -import utils.general_utils as utils from typing import Dict import re +try: + from . import general_utils as utils +except: + import general_utils as utils + # Reaction direction encoding: class ReactionDir(Enum): """
--- a/COBRAxy/src/utils/rule_parsing.py Sat Oct 25 15:20:55 2025 +0000 +++ b/COBRAxy/src/utils/rule_parsing.py Sun Oct 26 19:27:41 2025 +0000 @@ -9,9 +9,13 @@ - parseRuleToNestedList: main entry to parse a rule string into an OpList """ from enum import Enum -import utils.general_utils as utils from typing import List, Union, Optional +try: + from . import general_utils as utils +except: + import general_utils as utils + class RuleErr(utils.CustomErr): """ Error type for rule syntax errors.
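The three hunks above (`model_utils.py`, `reaction_parsing.py`, `rule_parsing.py`) replace absolute `utils.*` imports with a relative import plus a plain-import fallback, so each module can resolve its siblings both when imported as part of the `utils` package and when executed as a loose script. A minimal sketch of the same pattern, with the bare `except:` narrowed to `ImportError` (a suggested tightening, not what the patch records):

```python
# Dual-mode sibling import: prefer the package-relative form, fall back to a
# top-level import when the module is run outside the package.
try:
    from . import general_utils as utils   # imported as utils.rule_parsing
except ImportError:
    import general_utils as utils           # run directly from src/utils/
```

Catching only `ImportError` keeps genuine errors raised while importing `general_utils` itself from being silently swallowed by the fallback.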
