diff COBRAxy/docs/tools/marea-cluster.md @ 492:4ed95023af20 draft

Uploaded
author francesco_lapi
date Tue, 30 Sep 2025 14:02:17 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/COBRAxy/docs/tools/marea-cluster.md	Tue Sep 30 14:02:17 2025 +0000
@@ -0,0 +1,512 @@
+# MAREA Cluster
+
+Perform clustering analysis on metabolic data to identify sample groups and patterns.
+
+## Overview
+
+MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis.
+
+## Usage
+
+### Command Line
+
+```bash
+marea_cluster -td /path/to/COBRAxy \
+              -in metabolic_data.tsv \
+              -cy kmeans \
+              -sc true \
+              -k1 2 \
+              -k2 8 \
+              -el true \
+              -si true \
+              -idop clustering_results/ \
+              -ol cluster.log
+```
+
+### Galaxy Interface
+
+Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface.
+
+## Parameters
+
+### Required Parameters
+
+| Parameter | Flag | Description |
+|-----------|------|-------------|
+| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory |
+| Input Data | `-in, --input` | Metabolic data file (TSV format) |
+
+### Clustering Parameters
+
+| Parameter | Flag | Description | Default |
+|-----------|------|-------------|---------|
+| Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans |
+| Data Scaling | `-sc, --scaling` | Apply data normalization | true |
+| Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 |
+| Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 |
+
+### Analysis Options
+
+| Parameter | Flag | Description | Default |
+|-----------|------|-------------|---------|
+| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false |
+| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false |
+
+### DBSCAN Specific Parameters
+
+| Parameter | Flag | Description | Default |
+|-----------|------|-------------|---------|
+| Min Samples | `-ms, --min_samples` | Minimum number of samples in a point's neighborhood for it to count as a core point | - |
+| Epsilon | `-ep, --eps` | Maximum distance between two samples for them to be considered neighbors | - |
+
+### Output Parameters
+
+| Parameter | Flag | Description | Default |
+|-----------|------|-------------|---------|
+| Output Path | `-idop, --output_path` | Results directory | clustering/ |
+| Output Log | `-ol, --out_log` | Log file path | - |
+| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - |
+
+## Clustering Algorithms
+
+### K-means
+**Method**: Partitional clustering using centroids
+- Assumes spherical clusters
+- Requires pre-specified number of clusters (k)
+- Fast and scalable
+- Works well with normalized data
+
+**Best for**:
+- Well-separated, compact clusters
+- Large datasets
+- When cluster number is approximately known
+
+### DBSCAN
+**Method**: Density-based clustering  
+- Identifies clusters of varying shapes
+- Automatically determines cluster number
+- Robust to outliers and noise
+- Requires epsilon and min_samples parameters
+
+**Best for**:
+- Irregular cluster shapes
+- Datasets with noise/outliers
+- Unknown number of clusters
+
+### Hierarchical
+**Method**: Agglomerative clustering with dendrograms
+- Creates tree-like cluster hierarchy
+- No need to specify cluster number initially  
+- Deterministic results
+- Provides multiple resolution levels
+
+**Best for**:
+- Small to medium datasets
+- When cluster hierarchy is important
+- Exploratory analysis
+
+## Input Format
+
+### Metabolic Data File
+
+Tab-separated format with samples as rows and reactions/metabolites as columns:
+
+```
+Sample	R00001	R00002	R00003	R00004	...
+Sample1	1.25	0.85	1.42	0.78	...
+Sample2	0.65	1.35	0.72	1.28	...
+Sample3	2.15	2.05	0.45	0.52	...
+Control1	1.05	0.98	1.15	1.08	...
+Control2	0.95	1.12	0.88	0.92	...
+```
+
+**Requirements**:
+- First column: sample identifiers
+- Subsequent columns: feature values (RAS, RPS, fluxes)
+- Missing values: use 0 or leave empty
+- Numeric data only (excluding sample names)
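The requirements above can be checked before running the tool. The following is a minimal sketch, not the parser used by `marea_cluster`; the function name `load_matrix` and the inline example data are assumptions made for illustration. It reads a tab-separated table, treats the first column as sample identifiers, and maps empty cells to 0.

```python
import csv
import io

def load_matrix(tsv_text):
    """Parse a samples-by-features TSV; empty cells become 0.0."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    features = header[1:]           # first column holds sample identifiers
    samples, rows = [], []
    for record in reader:
        if not record:
            continue                # skip blank lines
        samples.append(record[0])
        # pad short rows so every sample has one cell per feature
        cells = record[1:] + [""] * (len(features) - len(record[1:]))
        # float() raises ValueError on non-numeric cells, surfacing bad input early
        rows.append([float(c) if c.strip() else 0.0 for c in cells])
    return samples, features, rows

# Hypothetical two-sample input with one missing value
example = "Sample\tR00001\tR00002\nSample1\t1.25\t0.85\nSample2\t\t1.35\n"
samples, features, rows = load_matrix(example)
```

A non-numeric cell raises `ValueError`, which is a cheap way to catch formatting problems before clustering.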
+
+## Data Preprocessing
+
+### Scaling Options
+
+#### Standard Scaling (Recommended)
+- Mean centering and unit variance scaling
+- Formula: `(x - mean) / std`
+- Ensures equal feature contribution
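+- Strongly recommended for distance-based algorithms (K-means, DBSCAN, hierarchical), where unscaled high-magnitude features dominate the distances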
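As an illustration of the formula above, here is a small standard-library sketch. It assumes the population standard deviation and maps constant columns to 0; whether `marea_cluster` makes the same choices is not stated in this document, and `standard_scale` is a hypothetical helper name.

```python
from statistics import fmean, pstdev

def standard_scale(rows):
    """Z-score each column: (x - mean) / std, using the population std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (x - m) / s if s > 0 else 0.0   # constant columns map to 0
            for x, m, s in zip(row, means, stds)
        ])
    return scaled

# Hypothetical data: first column varies, second column is constant
scaled = standard_scale([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```

After scaling, each non-constant column has mean 0 and unit variance, so no single reaction dominates a Euclidean distance.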
+
+#### No Scaling
+- Use original data values
+- May be appropriate for already normalized data
+- Risk of feature dominance by high-magnitude variables
+
+### Feature Selection
+
+Consider preprocessing steps:
+- Remove low-variance features
+- Apply dimensionality reduction (PCA)
+- Select most variable reactions/metabolites
+- Handle missing data appropriately
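One of the steps listed above, removing low-variance features, can be sketched as follows. This is an optional preprocessing idea applied before calling the tool, not a feature of `marea_cluster`; the helper name `drop_low_variance` and the threshold default are assumptions.

```python
from statistics import pvariance

def drop_low_variance(features, rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_features = [features[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_features, kept_rows

# Hypothetical data: R00002 is constant across samples and gets dropped
names, filtered = drop_low_variance(
    ["R00001", "R00002"],
    [[1.25, 1.0], [0.65, 1.0], [2.15, 1.0]],
)
```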
+
+## Output Files
+
+### Cluster Assignments
+
+#### Best Clustering Result (`best_clusters.tsv`)
+```
+Sample	Cluster	Silhouette_Score
+Sample1	1	0.73
+Sample2	1	0.68  
+Sample3	2	0.81
+Control1	0	0.59
+Control2	0	0.62
+```
+
+#### All K Results (`clustering_results_k{n}.tsv`)
+Individual files for each tested cluster number.
+
+### Validation Metrics
+
+#### Elbow Plot (`elbow_plot.png`)
+- X-axis: Number of clusters (k)
+- Y-axis: Within-cluster sum of squares (WCSS)
+- Identifies optimal k at the "elbow" point
+
+#### Silhouette Plots (`silhouette_k{n}.png`)
+- Individual sample silhouette scores
+- Average silhouette width per cluster
+- Overall clustering quality assessment
+
+### Summary Statistics
+
+#### Clustering Summary (`clustering_summary.txt`)
+```
+Algorithm: kmeans
+Scaling: true
+Optimal K: 3
+Best Silhouette Score: 0.72
+Number of Samples: 20
+Feature Dimensions: 150
+```
+
+#### Cluster Characteristics (`cluster_stats.tsv`)
+```
+Cluster	Size	Centroid_R00001	Centroid_R00002	Avg_Silhouette
+0	8	0.95	1.12	0.68
+1	7	1.35	0.82	0.74
+2	5	0.65	1.55	0.69
+```
+
+## Examples
+
+### Basic K-means Clustering
+
+```bash
+# Simple K-means with elbow analysis
+marea_cluster -td /opt/COBRAxy \
+              -in ras_data.tsv \
+              -cy kmeans \
+              -sc true \
+              -k1 2 \
+              -k2 10 \
+              -el true \
+              -si true \
+              -idop kmeans_results/ \
+              -ol kmeans.log
+```
+
+### DBSCAN Analysis
+
+```bash
+# Density-based clustering with custom parameters
+marea_cluster -td /opt/COBRAxy \
+              -in flux_samples.tsv \
+              -cy dbscan \
+              -sc true \
+              -ms 5 \
+              -ep 0.5 \
+              -idop dbscan_results/ \
+              -bc best_dbscan_clusters.tsv \
+              -ol dbscan.log
+```
+
+### Hierarchical Clustering
+
+```bash
+# Hierarchical clustering for small dataset
+marea_cluster -td /opt/COBRAxy \
+              -in rps_scores.tsv \
+              -cy hierarchy \
+              -sc true \
+              -k1 2 \
+              -k2 6 \
+              -si true \
+              -idop hierarchical_results/ \
+              -ol hierarchy.log
+```
+
+### Comprehensive Clustering Analysis
+
+```bash
+# Compare multiple algorithms
+algorithms=("kmeans" "dbscan" "hierarchy")
+for alg in "${algorithms[@]}"; do
+    marea_cluster -td /opt/COBRAxy \
+                  -in metabolomics_data.tsv \
+                  -cy "$alg" \
+                  -sc true \
+                  -k1 2 \
+                  -k2 8 \
+                  -el true \
+                  -si true \
+                  -idop "${alg}_clustering/" \
+                  -ol "${alg}_cluster.log"
+done
+```
+
+## Parameter Optimization
+
+### K-means Optimization
+
+#### Elbow Method
+1. Run K-means for every k from k_min (`-k1`) to k_max (`-k2`)
+2. Plot WCSS vs k
+3. Identify "elbow" point where improvement diminishes
+4. Select k at elbow as optimal
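The steps above can be sketched in plain Python. `wcss` computes the quantity plotted on the y-axis for one clustering, and `elbow_k` applies one possible heuristic for locating the bend (the largest second difference of the curve). Both function names, the heuristic, and the WCSS values in the example are assumptions for illustration; this document does not say how `marea_cluster` picks the elbow, and in practice the plot is usually inspected by eye.

```python
def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

def elbow_k(k_values, wcss_values):
    """Pick the k where the WCSS curve bends most (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(k_values) - 1):
        bend = wcss_values[i - 1] - 2 * wcss_values[i] + wcss_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k

# Two tight clusters of two points each
total = wcss([(0, 0), (0, 2), (10, 0), (10, 2)], [0, 0, 1, 1])
# Made-up WCSS curve for k = 2..6, used only to exercise the heuristic
chosen = elbow_k([2, 3, 4, 5, 6], [100.0, 40.0, 30.0, 25.0, 22.0])
```

Note that this heuristic can never select the first or last tested k, so the tested range should extend past the expected cluster number on both sides.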
+
+#### Silhouette Analysis
+1. Compute silhouette scores for each k
+2. Select k with highest average silhouette score
+3. Validate with silhouette plots
+4. Ensure clusters are well-separated
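For reference, the silhouette coefficient of a sample is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster. The sketch below is a plain-Python version with Euclidean distance; the function name is hypothetical, and singleton clusters are scored as 0 by convention. It is meant to make the definition concrete, not to reproduce the tool's implementation.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            scores.append(0.0)      # convention for singleton clusters
            continue
        a = mean_dist(i, clusters[label])
        b = min(mean_dist(i, m) for l, m in clusters.items() if l != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two well-separated one-dimensional clusters
score = silhouette_score([(0,), (1,), (10,), (11,)], [0, 0, 1, 1])
```

The `ValueError` for a single cluster reflects why a silhouette score is undefined when an algorithm places every sample in one group.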
+
+### DBSCAN Parameter Tuning
+
+#### Epsilon (eps) Selection
+- Use k-distance plot to identify knee point
+- Start with eps = average distance to k-th nearest neighbor
+- Adjust based on cluster quality metrics
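The k-distance values mentioned above can be computed with a few lines of standard-library Python. This is a sketch under the assumption of Euclidean distance; `k_distances` is a hypothetical helper, and the plotting step is left out. Plotting the returned list and looking for the sharp rise gives a candidate `eps`.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Three nearby points and one outlier: the outlier produces the jump
kd = k_distances([(0,), (1,), (2,), (10,)], k=2)
```

In this toy case the jump from 2.0 to 9.0 suggests an `eps` just above 2.0, which would leave the point at 10 as noise.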
+
+#### Min Samples Selection
+- Rule of thumb: min_samples ≥ dimensionality + 1
+- Higher values create denser clusters
+- Lower values may increase noise sensitivity
+
+### Hierarchical Clustering
+
+#### Linkage Method
+- Ward: Minimizes within-cluster variance
+- Complete: Maximum distance between clusters
+- Average: Mean distance between clusters
+- Single: Minimum distance (prone to chaining)
+
+## Quality Assessment
+
+### Internal Validation Metrics
+
+#### Silhouette Score
+- Range: [-1, 1]
+- >0.7: Strong clustering
+- 0.5-0.7: Reasonable clustering
+- <0.5: Weak clustering
+
+#### Calinski-Harabasz Index
+- Higher values indicate better clustering
+- Ratio of between-cluster to within-cluster variance
+
+#### Davies-Bouldin Index
+- Lower values indicate better clustering
+- Average similarity between each cluster and its most similar cluster
+
+### External Validation
+
+When ground truth labels are available:
+- Adjusted Rand Index (ARI)
+- Normalized Mutual Information (NMI)
+- Homogeneity and Completeness scores
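As an example of the first metric in this list, the Adjusted Rand Index can be computed from pair counts. The sketch below is a standard-library version for illustration; the function name is hypothetical and it assumes at least two samples. Nothing in this document indicates that `marea_cluster` reports this metric itself, so it would be computed separately from the cluster assignments and known labels.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the pair-counting contingency table."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0                  # degenerate case: both partitions trivial
    return (sum_cells - expected) / (maximum - expected)

# Same grouping with swapped label names scores 1.0
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that splits every true pair scores below 0
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which samples are grouped together, the numeric cluster labels themselves do not need to match the ground-truth labels.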
+
+## Biological Interpretation
+
+### Cluster Characterization
+
+#### Metabolic Pathway Analysis
+- Identify enriched pathways per cluster
+- Compare metabolic profiles between clusters
+- Relate clusters to biological conditions
+
+#### Sample Annotation
+- Map clusters to experimental conditions
+- Identify batch effects or confounders
+- Validate with independent datasets
+
+#### Feature Importance
+- Determine reactions/metabolites driving clustering
+- Analyze cluster centroids for biological insights
+- Connect to known metabolic phenotypes
+
+## Integration Workflow
+
+### Upstream Data Sources
+
+#### COBRAxy Tools
+- [RAS Generator](ras-generator.md) - Cluster based on reaction activities
+- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities  
+- [Flux Simulation](flux-simulation.md) - Cluster flux distributions
+
+#### External Data
+- Gene expression matrices
+- Metabolomics datasets
+- Clinical metadata
+
+### Downstream Analysis
+
+#### Supervised Learning
+Use cluster labels for:
+- Classification model training
+- Biomarker discovery
+- Outcome prediction
+
+#### Differential Analysis
+- Compare clusters with [MAREA](marea.md)
+- Identify cluster-specific metabolic signatures
+- Pathway enrichment analysis
+
+### Typical Pipeline
+
+```bash
+# 1. Generate metabolic scores
+ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv
+
+# 2. Perform clustering analysis
+marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \
+              -sc true -k1 2 -k2 8 -el true -si true \
+              -idop clusters/ -bc best_clusters.tsv
+
+# 3. Analyze cluster differences
+marea -td /opt/COBRAxy -input_data ras.tsv \
+      -input_class best_clusters.tsv -comparison manyvsmany \
+      -test ks -choice_map ENGRO2 -idop cluster_analysis/
+```
+
+## Tips and Best Practices
+
+### Data Preparation
+- **Normalization**: Always scale features for distance-based methods
+- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features)
+- **Missing Values**: Handle appropriately (imputation or removal)
+- **Outliers**: Identify and consider removal for K-means
+
+### Algorithm Selection
+- **K-means**: Start here for most applications
+- **DBSCAN**: Use when clusters have irregular shapes or noise is present
+- **Hierarchical**: Choose for small datasets or when hierarchy matters
+
+### Parameter Selection
+- **Start Simple**: Begin with default parameters
+- **Use Validation**: Always employ silhouette analysis
+- **Cross-Validate**: Test stability across parameter ranges
+- **Biological Validation**: Ensure clusters make biological sense
+
+### Result Interpretation
+- **Multiple Algorithms**: Compare results across methods
+- **Stability Assessment**: Check clustering reproducibility
+- **Biological Context**: Integrate with known sample characteristics
+- **Statistical Testing**: Validate cluster differences formally
+
+## Troubleshooting
+
+### Common Issues
+
+**Poor clustering quality**
+- Check data scaling and normalization
+- Assess feature selection and dimensionality
+- Try different algorithms or parameters
+- Evaluate data structure with PCA/t-SNE
+
+**Algorithm doesn't converge**
+- Increase iteration limits for K-means
+- Adjust epsilon/min_samples for DBSCAN
+- Check for numerical stability issues
+- Verify input data format
+
+**Memory or performance issues**
+- Reduce dataset size or dimensionality
+- Use sampling for large datasets
+- Consider approximate algorithms
+- Monitor system resources
+
+### Error Messages
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| "Convergence failed" | K-means iteration limit | Increase max iterations or check data |
+| "No clusters found" | DBSCAN parameters too strict (every sample labeled as noise) | Increase eps or reduce min_samples |
+| "Memory allocation error" | Dataset too large | Reduce size or increase memory |
+| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm |
+
+### Performance Optimization
+
+**Large Datasets**
+- Use mini-batch K-means for speed
+- Sample data for parameter optimization
+- Employ dimensionality reduction
+- Consider distributed computing
+
+**High-Dimensional Data**
+- Apply feature selection
+- Use PCA preprocessing
+- Consider specialized algorithms
+- Validate results carefully
+
+## Advanced Usage
+
+### Custom Distance Metrics
+
+For specialized applications, modify distance calculations:
+
+```python
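+# Illustrative custom distance for metabolic profiles (hypothetical example)
+import numpy as np
+
+def metabolic_distance(x, y):
+    # Placeholder metric: 1 - Pearson correlation; swap in a pathway-aware metric here
+    return 1.0 - np.corrcoef(x, y)[0, 1]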
+```
+
+### Ensemble Clustering
+
+Combine multiple clustering results:
+
+```bash
+# Run multiple algorithms and combine
+for method in kmeans dbscan hierarchy; do
+    marea_cluster -td /opt/COBRAxy -cy "$method" -in data.tsv -idop "${method}_results/"
+done
+
+# Consensus clustering (requires a custom, user-provided script)
+python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv
+```
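One common way to build such a consensus is a co-association matrix: for each pair of samples, the fraction of runs in which they were placed in the same cluster. The sketch below is one possible core for a custom script, not part of COBRAxy; the function name is hypothetical, and it assumes noise samples are labeled `-1` (the scikit-learn convention for DBSCAN), which should be checked against the actual output files.

```python
def co_association(labelings, noise_label=-1):
    """Fraction of runs in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                # noise points never count as co-clustered with other samples
                same = labels[i] == labels[j] and labels[i] != noise_label
                if i == j or same:
                    matrix[i][j] += 1.0 / len(labelings)
    return matrix

# Hypothetical labels for three samples from two clustering runs
agreement = co_association([[0, 0, 1], [1, 1, 1]])
```

The resulting matrix can be treated as a similarity matrix (or `1 - agreement` as a distance) and clustered once more, for example hierarchically, to obtain the consensus assignment.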
+
+### Interactive Analysis
+
+Generate interactive plots for exploration:
+
+```python
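+import pandas as pd
+import plotly.express as px
+from sklearn.decomposition import PCA
+
+# Load clustering results and the original data (first column = sample IDs)
+results = pd.read_csv('best_clusters.tsv', sep='\t', index_col=0)
+data = pd.read_csv('metabolic_data.tsv', sep='\t', index_col=0).fillna(0)
+
+# Project samples onto the first two principal components
+pcs = pd.DataFrame(PCA(n_components=2).fit_transform(data),
+                   columns=['PC1', 'PC2'], index=data.index)
+pcs['Cluster'] = results['Cluster'].reindex(pcs.index).astype(str)
+
+# Interactive scatter plot colored by cluster
+fig = px.scatter(pcs, x='PC1', y='PC2', color='Cluster')
+fig.show()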
+```
+
+## See Also
+
+- [MAREA](marea.md) - Statistical analysis of cluster differences
+- [RAS Generator](ras-generator.md) - Generate clustering input data
+- [Flux Simulation](flux-simulation.md) - Alternative clustering data source
+- [Clustering Tutorial](../tutorials/clustering-analysis.md)
+- [Validation Methods Reference](../tutorials/cluster-validation.md)
\ No newline at end of file