diff COBRAxy/docs/tools/ras-generator.md @ 492:4ed95023af20 draft

Uploaded
author francesco_lapi
date Tue, 30 Sep 2025 14:02:17 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/COBRAxy/docs/tools/ras-generator.md	Tue Sep 30 14:02:17 2025 +0000
@@ -0,0 +1,320 @@
+# RAS Generator
+
+Generate Reaction Activity Scores (RAS) from gene expression data and GPR (Gene-Protein-Reaction) rules.
+
+## Overview
+
+The RAS Generator computes metabolic reaction activity by:
+1. Mapping gene expression to reactions via GPR rules
+2. Applying logical operations: AND for enzyme complexes, OR for isoenzymes
+3. Producing activity scores for each reaction in each sample
+
+**Input**: Gene expression data + GPR rules  
+**Output**: Reaction activity scores (RAS)
+
+## Parameters
+
+### Required Parameters
+
+| Parameter | Short | Type | Description |
+|-----------|--------|------|-------------|
+| `--tool_dir` | `-td` | string | COBRAxy installation directory |
+| `--input` | `-in` | file | Gene expression dataset (TSV format) |
+| `--ras_output` | `-ra` | file | Output file for RAS values |
+| `--rules_selector` | `-rs` | choice | Built-in model (ENGRO2, Recon, HMRcore) |
+
+### Optional Parameters
+
+| Parameter | Short | Type | Default | Description |
+|-----------|--------|------|---------|-------------|
+| `--none` | `-n` | boolean | true | How genes missing from the expression data are handled: `true` treats them as 0, `false` makes the reaction score null |
+| `--model_upload` | `-rl` | file | - | Custom GPR rules file |
+| `--model_upload_name` | `-rn` | string | - | Custom model name |
+| `--out_log` | - | file | log.txt | Output log file |
+
+## Input Format
+
+### Gene Expression File
+```tsv
+Gene_ID	Sample_1	Sample_2	Sample_3	Sample_4
+HGNC:5	10.5	11.2	15.7	14.3
+HGNC:10	3.2	4.1	8.8	7.9
+HGNC:15	7.9	8.2	4.4	5.1
+HGNC:25	12.1	13.5	18.2	17.8
+```
+
+**Requirements**:
+- First column: Gene identifiers (HGNC, Ensembl, Entrez, etc.)
+- Subsequent columns: Expression values (numeric)
+- Header row with sample names
+- Tab-separated format
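+
+The requirements above can be checked before running the tool. The following is a minimal sketch using pandas; `check_expression_table` is a hypothetical helper written for this page, not part of COBRAxy, and it only tests the two properties that are easy to verify mechanically (tab separation and numeric sample columns).
+
+```python
+import io
+import pandas as pd
+
+def check_expression_table(source):
+    """Return a list of problems found in a tab-separated expression table."""
+    # First column is taken as the gene identifier index.
+    df = pd.read_csv(source, sep="\t", index_col=0)
+    problems = []
+    # A file that is not tab-separated collapses into a single column.
+    if df.shape[1] == 0:
+        problems.append("no sample columns found (is the file tab-separated?)")
+    # Every sample column should have been parsed as numbers.
+    non_numeric = [col for col in df.columns
+                   if not pd.api.types.is_numeric_dtype(df[col])]
+    if non_numeric:
+        problems.append(f"non-numeric expression values in: {non_numeric}")
+    return problems
+
+example = "Gene_ID\tSample_1\tSample_2\nHGNC:5\t10.5\t11.2\nHGNC:10\t3.2\t4.1\n"
+print(check_expression_table(io.StringIO(example)))  # prints []
+```
+
+Passing a file path instead of `io.StringIO` works the same way, since `pd.read_csv` accepts both.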
+
+### Custom GPR Rules File (Optional)
+```tsv
+Reaction_ID	GPR
+R_HEX1	HGNC:4922
+R_PGI	HGNC:8906
+R_PFK	HGNC:8877 or HGNC:8878
+R_ALDOA	HGNC:414 and HGNC:417
+```
+
+## Algorithm Details
+
+### GPR Rule Processing
+
+**Gene Mapping**: Each gene in the expression data is mapped to reactions via GPR rules.
+
+**Logical Operations**:
+- **OR**: `Gene1 or Gene2` → `max(expr1, expr2)` or `expr1 + expr2`; the worked example under RAS Computation uses `max`
+- **AND**: `Gene1 and Gene2` → `min(expr1, expr2)`
+
+**Missing Gene Handling**:
+- `-n true`: Missing genes treated as 0, OR operations continue
+- `-n false`: Missing genes cause reaction score to be null
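+
+One way to read the two modes is sketched below. This is an interpretation of the description above, not the tool's actual code: `gene_value` and `combine` are hypothetical names, `None` stands for a null score, and OR is taken as `max`.
+
+```python
+def gene_value(gene, expression, missing_as_zero=True):
+    """Look up one gene; None stands for a null score."""
+    if gene in expression:
+        return expression[gene]
+    # Assumed semantics: "-n true" maps a missing gene to 0, "-n false" to null.
+    return 0.0 if missing_as_zero else None
+
+def combine(op, left, right):
+    """Combine two operands with 'and' (min) or 'or' (max); null propagates."""
+    if left is None or right is None:
+        return None
+    return min(left, right) if op == "and" else max(left, right)
+
+expression = {"HGNC:5": 10.5}  # HGNC:10 is missing on purpose
+lenient = combine("or", gene_value("HGNC:5", expression, True),
+                  gene_value("HGNC:10", expression, True))
+strict = combine("or", gene_value("HGNC:5", expression, False),
+                 gene_value("HGNC:10", expression, False))
+print(lenient, strict)  # prints: 10.5 None
+```
+
+Note that under this reading an AND with a missing gene scores 0 in the lenient mode, because `min(x, 0)` is 0 for non-negative expression values.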
+
+### RAS Computation
+
+For each reaction and sample:
+
+1. **Parse GPR rule** into nested logical structure
+2. **Replace gene names** with expression values  
+3. **Evaluate logical operations** recursively
+4. **Assign RAS score** based on final result
+
+**Example**:
+```
+GPR: (HGNC:5 and HGNC:10) or HGNC:15
+Expression: HGNC:5=10.5, HGNC:10=3.2, HGNC:15=7.9
+RAS = max(min(10.5, 3.2), 7.9) = max(3.2, 7.9) = 7.9
+```
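+
+The four steps above can be sketched as a small recursive evaluator. This is an illustrative sketch rather than the COBRAxy implementation: `evaluate_gpr` is a hypothetical function, AND is assumed to bind more tightly than OR, OR is evaluated as `max` to match the example, and a gene missing from `expression` simply raises `KeyError`.
+
+```python
+import re
+
+def tokenize(rule):
+    """Split a GPR rule into parentheses and words (gene IDs, 'and', 'or')."""
+    return re.findall(r"\(|\)|[^\s()]+", rule)
+
+def evaluate_gpr(rule, expression):
+    """Evaluate a GPR rule: AND maps to min, OR maps to max."""
+    tokens = tokenize(rule)
+    pos = 0
+
+    def parse_or():
+        nonlocal pos
+        value = parse_and()
+        while pos < len(tokens) and tokens[pos].lower() == "or":
+            pos += 1
+            value = max(value, parse_and())
+        return value
+
+    def parse_and():
+        nonlocal pos
+        value = parse_atom()
+        while pos < len(tokens) and tokens[pos].lower() == "and":
+            pos += 1
+            value = min(value, parse_atom())
+        return value
+
+    def parse_atom():
+        nonlocal pos
+        token = tokens[pos]
+        pos += 1
+        if token == "(":
+            value = parse_or()
+            pos += 1  # skip the matching ")"
+            return value
+        return expression[token]  # replace the gene name with its value
+
+    return parse_or()
+
+expression = {"HGNC:5": 10.5, "HGNC:10": 3.2, "HGNC:15": 7.9}
+print(evaluate_gpr("(HGNC:5 and HGNC:10) or HGNC:15", expression))  # prints 7.9
+```
+
+Replacing `max` with a sum in `parse_or` gives the additive OR variant listed under Logical Operations.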
+
+## Output Format
+
+### RAS Values File
+```tsv
+Reactions	Sample_1	Sample_2	Sample_3	Sample_4
+R_HEX1	8.5	9.2	12.1	11.3
+R_PGI	7.3	8.1	6.4	7.2
+R_PFK	15.2	16.8	20.1	18.9
+R_ALDOA	3.2	4.1	4.4	5.1
+```
+
+**Format**:
+- First column: Reaction identifiers
+- Subsequent columns: RAS values for each sample
+- Missing values represented as "None"
+
+## Usage Examples
+
+### Command Line
+
+```bash
+# Basic usage with built-in model
+ras_generator -td /path/to/COBRAxy \
+  -in expression_data.tsv \
+  -ra ras_output.tsv \
+  -rs ENGRO2
+
+# With custom model and strict missing gene handling
+ras_generator -td /path/to/COBRAxy \
+  -in expression_data.tsv \
+  -ra ras_output.tsv \
+  -rl custom_rules.tsv \
+  -rn "CustomModel" \
+  -n false
+```
+
+### Python API
+
+```python
+import ras_generator
+
+# Basic RAS generation
+args = [
+    '-td', '/path/to/COBRAxy',
+    '-in', 'expression_data.tsv', 
+    '-ra', 'ras_output.tsv',
+    '-rs', 'ENGRO2'
+]
+
+ras_generator.main(args)
+```
+
+### Galaxy Usage
+
+1. Upload gene expression file to Galaxy
+2. Select **RAS Generator** from COBRAxy tools
+3. Configure parameters:
+   - **Input dataset**: Your expression file
+   - **Rule selector**: ENGRO2 (or other model)
+   - **Handle missing genes**: Yes/No
+4. Click **Execute**
+
+## Built-in Models
+
+### ENGRO2 (Recommended for most analyses)
+- **Scope**: Focused human metabolism
+- **Reactions**: ~2,000
+- **Genes**: ~500
+- **Use case**: General metabolic analysis
+
+### Recon (Comprehensive analysis)
+- **Scope**: Complete human metabolism  
+- **Reactions**: ~10,000
+- **Genes**: ~2,000
+- **Use case**: Detailed metabolic studies
+
+### HMRcore (Balanced option)
+- **Scope**: Core human metabolism
+- **Reactions**: ~5,000  
+- **Genes**: ~1,000
+- **Use case**: Balanced coverage
+
+## Gene ID Mapping
+
+COBRAxy supports multiple gene identifier formats:
+
+| Format | Example | Notes |
+|--------|---------|--------|
+| **HGNC ID** | HGNC:5 | Recommended, most stable |
+| **HGNC Symbol** | ALDOA | Human-readable but may change |
+| **Ensembl** | ENSG00000149925 | Version-specific |
+| **Entrez** | 226 | Numeric identifier |
+
+**Recommendation**: Use HGNC IDs for best compatibility and stability.
+
+
+
+## Troubleshooting
+
+### Common Issues
+
+**"Gene not found" warnings**
+
+Check that the gene ID format matches what the model expects:
+- Verify gene identifiers (HGNC vs symbols vs Ensembl)
+- Use gene mapping tools if needed
+- Set `-n true` to handle missing genes gracefully
+
+**"No computable scores" error**
+
+The gene overlap between the data and the model is insufficient:
+- Check gene ID format compatibility
+- Verify expression file format
+- Try a different built-in model
+
+**Empty output file**
+
+Check the input file format and permissions:
+- Ensure TSV format with proper headers
+- Verify file paths are correct
+- Check write permissions for the output directory
+
+
+
+### Debug Mode
+
+Enable detailed logging:
+
+```bash
+ras_generator -td /path/to/COBRAxy \
+  -in expression_data.tsv \
+  -ra ras_output.tsv \
+  -rs ENGRO2 \
+  --out_log detailed_log.txt
+```
+
+Check log file for detailed error messages and processing statistics.
+
+## Validation
+
+### Check Output Quality
+
+```python
+import pandas as pd
+
+# Read RAS output; missing values are written as "None", so parse them as NaN
+ras_df = pd.read_csv('ras_output.tsv', sep='\t', index_col=0, na_values=['None'])
+
+# Basic statistics
+print(f"RAS matrix shape: {ras_df.shape}")
+print(f"Non-null values: {ras_df.count().sum()}")
+print(f"Value range: {ras_df.min().min():.2f} to {ras_df.max().max():.2f}")
+
+# Check for problematic reactions
+null_reactions = ras_df.isnull().all(axis=1).sum()
+print(f"Reactions with no data: {null_reactions}")
+```
+
+### Expected Results
+
+- **Coverage**: 60-90% of reactions should have computable scores
+- **Range**: RAS values typically 0-20 for log-transformed expression
+- **Distribution**: Should reflect biological variation in your samples
+
+## Integration with Other Tools
+
+### Downstream Analysis
+
+RAS output can be used with:
+
+- **[MAREA](marea.md)**: Statistical enrichment analysis
+- **[RAS to Bounds](ras-to-bounds.md)**: Flux constraint application
+- **[MAREA Cluster](marea-cluster.md)**: Sample clustering
+
+### Preprocessing Options
+
+Before RAS generation:
+- **Normalize** expression data (log2, quantile, etc.)
+- **Filter** low-expression genes
+- **Batch correct** if multiple datasets
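+
+A minimal sketch of the first two steps with pandas and NumPy follows. The `preprocess` helper, the mean-expression threshold, and the pseudocount are illustrative assumptions, not COBRAxy defaults; pick values suited to your data.
+
+```python
+import numpy as np
+import pandas as pd
+
+def preprocess(expr, min_mean=1.0, pseudocount=1.0):
+    """Drop low-expression genes, then log2-transform the remainder."""
+    # Keep genes whose mean expression across samples reaches the threshold.
+    kept = expr.loc[expr.mean(axis=1) >= min_mean]
+    # The pseudocount avoids log2(0) for genes with zero counts in a sample.
+    return np.log2(kept + pseudocount)
+
+raw = pd.DataFrame(
+    {"Sample_1": [3.0, 0.0], "Sample_2": [7.0, 0.5]},
+    index=["HGNC:5", "HGNC:10"],
+)
+print(preprocess(raw))
+```
+
+Here `HGNC:10` is filtered out (mean 0.25) and `HGNC:5` becomes 2.0 and 3.0. The result can be written with `to_csv(..., sep='\t')` and passed to `-in`.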
+
+## Advanced Usage
+
+### Custom Model Integration
+
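+```python
+import pandas as pd
+import ras_generator
+
+# Create custom GPR rules (reaction ID mapped to its rule)
+custom_rules = {
+    'R_CUSTOM1': 'HGNC:5 and HGNC:10',
+    'R_CUSTOM2': 'HGNC:15 or HGNC:20'
+}
+
+# Save as TSV with the Reaction_ID / GPR header shown under Input Format
+rules_df = pd.DataFrame(list(custom_rules.items()),
+                        columns=['Reaction_ID', 'GPR'])
+rules_df.to_csv('custom_rules.tsv', sep='\t', index=False)
+
+# Use with RAS generator (same flags as the command-line example)
+args = [
+    '-td', '/path/to/COBRAxy',
+    '-in', 'expression_data.tsv',
+    '-ra', 'ras_output.tsv',
+    '-rl', 'custom_rules.tsv',
+    '-rn', 'CustomModel'
+]
+ras_generator.main(args)
+```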
+
+### Batch Processing
+
+```python
+import ras_generator
+
+# Process multiple expression files
+expression_files = ['data1.tsv', 'data2.tsv', 'data3.tsv']
+
+for i, exp_file in enumerate(expression_files):
+    output_file = f'ras_output_{i}.tsv'
+    
+    args = [
+        '-td', '/path/to/COBRAxy',
+        '-in', exp_file,
+        '-ra', output_file,
+        '-rs', 'ENGRO2'
+    ]
+    
+    ras_generator.main(args)
+    print(f"Processed {exp_file} → {output_file}")
+```
+
+## References
+
+- [COBRApy documentation](https://cobrapy.readthedocs.io/) - Underlying metabolic modeling
+- [GPR rules format](https://cobrapy.readthedocs.io/en/stable/getting_started.html#gene-protein-reaction-rules) - Standard format specification  
+- [HGNC database](https://www.genenames.org/) - Gene nomenclature standards
\ No newline at end of file