Mercurial > repos > bimib > cobraxy
diff COBRAxy/docs/tools/marea-cluster.md @ 492:4ed95023af20 draft
Uploaded
author | francesco_lapi |
---|---|
date | Tue, 30 Sep 2025 14:02:17 +0000 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COBRAxy/docs/tools/marea-cluster.md Tue Sep 30 14:02:17 2025 +0000 @@ -0,0 +1,512 @@ +# MAREA Cluster + +Perform clustering analysis on metabolic data to identify sample groups and patterns. + +## Overview + +MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis. + +## Usage + +### Command Line + +```bash +marea_cluster -td /path/to/COBRAxy \ + -in metabolic_data.tsv \ + -cy kmeans \ + -sc true \ + -k1 2 \ + -k2 8 \ + -el true \ + -si true \ + -idop clustering_results/ \ + -ol cluster.log +``` + +### Galaxy Interface + +Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface. + +## Parameters + +### Required Parameters + +| Parameter | Flag | Description | +|-----------|------|-------------| +| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory | +| Input Data | `-in, --input` | Metabolic data file (TSV format) | + +### Clustering Parameters + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans | +| Data Scaling | `-sc, --scaling` | Apply data normalization | true | +| Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 | +| Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 | + +### Analysis Options + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false | +| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false | + +### DBSCAN Specific Parameters + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Min Samples | `-ms, --min_samples` | Minimum samples per cluster | - | +| Epsilon | `-ep, --eps` | Maximum distance between samples | - | + +### Output Parameters + +| Parameter | Flag | Description | Default | +|-----------|------|-------------|---------| +| Output Path | `-idop, --output_path` | Results directory | clustering/ | +| Output Log | `-ol, --out_log` | Log file path | - | +| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - | + +## Clustering Algorithms + +### K-means +**Method**: Partitional clustering using centroids +- Assumes spherical clusters +- Requires pre-specified number of clusters (k) +- Fast and scalable +- Works well with normalized data + +**Best for**: +- Well-separated, compact clusters +- Large datasets +- When cluster number is approximately known + +### DBSCAN +**Method**: Density-based clustering +- Identifies clusters of varying shapes +- Automatically determines cluster number +- Robust to outliers and noise +- Requires epsilon and min_samples parameters + +**Best for**: +- Irregular cluster shapes +- Datasets with noise/outliers +- Unknown number of clusters + +### Hierarchical +**Method**: Agglomerative clustering with dendrograms +- Creates tree-like cluster hierarchy +- No need to specify cluster number initially +- Deterministic results +- Provides multiple resolution levels + +**Best for**: +- Small to medium datasets +- When cluster hierarchy is important +- Exploratory analysis + +## Input Format + +### Metabolic Data File + +Tab-separated format with samples as rows and reactions/metabolites as columns: + +``` +Sample R00001 R00002 R00003 R00004 ... +Sample1 1.25 0.85 1.42 0.78 ... +Sample2 0.65 1.35 0.72 1.28 ... +Sample3 2.15 2.05 0.45 0.52 ... +Control1 1.05 0.98 1.15 1.08 ... +Control2 0.95 1.12 0.88 0.92 ... +``` + +**Requirements**: +- First column: sample identifiers +- Subsequent columns: feature values (RAS, RPS, fluxes) +- Missing values: use 0 or leave empty +- Numeric data only (excluding sample names) + +## Data Preprocessing + +### Scaling Options + +#### Standard Scaling (Recommended) +- Mean centering and unit variance scaling +- Formula: `(x - mean) / std` +- Ensures equal feature contribution +- Required for distance-based algorithms + +#### No Scaling +- Use original data values +- May be appropriate for already normalized data +- Risk of feature dominance by high-magnitude variables + +### Feature Selection + +Consider preprocessing steps: +- Remove low-variance features +- Apply dimensionality reduction (PCA) +- Select most variable reactions/metabolites +- Handle missing data appropriately + +## Output Files + +### Cluster Assignments + +#### Best Clustering Result (`best_clusters.tsv`) +``` +Sample Cluster Silhouette_Score +Sample1 1 0.73 +Sample2 1 0.68 +Sample3 2 0.81 +Control1 0 0.59 +Control2 0 0.62 +``` + +#### All K Results (`clustering_results_k{n}.tsv`) +Individual files for each tested cluster number. + +### Validation Metrics + +#### Elbow Plot (`elbow_plot.png`) +- X-axis: Number of clusters (k) +- Y-axis: Within-cluster sum of squares (WCSS) +- Identifies optimal k at the "elbow" point + +#### Silhouette Plots (`silhouette_k{n}.png`) +- Individual sample silhouette scores +- Average silhouette width per cluster +- Overall clustering quality assessment + +### Summary Statistics + +#### Clustering Summary (`clustering_summary.txt`) +``` +Algorithm: kmeans +Scaling: true +Optimal K: 3 +Best Silhouette Score: 0.72 +Number of Samples: 20 +Feature Dimensions: 150 +``` + +#### Cluster Characteristics (`cluster_stats.tsv`) +``` +Cluster Size Centroid_R00001 Centroid_R00002 Avg_Silhouette +0 8 0.95 1.12 0.68 +1 7 1.35 0.82 0.74 +2 5 0.65 1.55 0.69 +``` + +## Examples + +### Basic K-means Clustering + +```bash +# Simple K-means with elbow analysis +marea_cluster -td /opt/COBRAxy \ + -in ras_data.tsv \ + -cy kmeans \ + -sc true \ + -k1 2 \ + -k2 10 \ + -el true \ + -si true \ + -idop kmeans_results/ \ + -ol kmeans.log +``` + +### DBSCAN Analysis + +```bash +# Density-based clustering with custom parameters +marea_cluster -td /opt/COBRAxy \ + -in flux_samples.tsv \ + -cy dbscan \ + -sc true \ + -ms 5 \ + -ep 0.5 \ + -idop dbscan_results/ \ + -bc best_dbscan_clusters.tsv \ + -ol dbscan.log +``` + +### Hierarchical Clustering + +```bash +# Hierarchical clustering for small dataset +marea_cluster -td /opt/COBRAxy \ + -in rps_scores.tsv \ + -cy hierarchy \ + -sc true \ + -k1 2 \ + -k2 6 \ + -si true \ + -idop hierarchical_results/ \ + -ol hierarchy.log +``` + +### Comprehensive Clustering Analysis + +```bash +# Compare multiple algorithms +algorithms=("kmeans" "dbscan" "hierarchy") +for alg in "${algorithms[@]}"; do + marea_cluster -td /opt/COBRAxy \ + -in metabolomics_data.tsv \ + -cy "$alg" \ + -sc true \ + -k1 2 \ + -k2 8 \ + -el true \ + -si true \ + -idop "${alg}_clustering/" \ + -ol "${alg}_cluster.log" +done +``` + +## Parameter Optimization + +### K-means Optimization + +#### Elbow Method +1. Run K-means for k = 2 to k_max +2. Plot WCSS vs k +3. Identify "elbow" point where improvement diminishes +4. Select k at elbow as optimal + +#### Silhouette Analysis +1. Compute silhouette scores for each k +2. Select k with highest average silhouette score +3. Validate with silhouette plots +4. Ensure clusters are well-separated + +### DBSCAN Parameter Tuning + +#### Epsilon (eps) Selection +- Use k-distance plot to identify knee point +- Start with eps = average distance to k-th nearest neighbor +- Adjust based on cluster quality metrics + +#### Min Samples Selection +- Rule of thumb: min_samples ≥ dimensionality + 1 +- Higher values create denser clusters +- Lower values may increase noise sensitivity + +### Hierarchical Clustering + +#### Linkage Method +- Ward: Minimizes within-cluster variance +- Complete: Maximum distance between clusters +- Average: Mean distance between clusters +- Single: Minimum distance (prone to chaining) + +## Quality Assessment + +### Internal Validation Metrics + +#### Silhouette Score +- Range: [-1, 1] +- >0.7: Strong clustering +- 0.5-0.7: Reasonable clustering +- <0.5: Weak clustering + +#### Calinski-Harabasz Index +- Higher values indicate better clustering +- Ratio of between-cluster to within-cluster variance + +#### Davies-Bouldin Index +- Lower values indicate better clustering +- Average similarity between clusters + +### External Validation + +When ground truth labels available: +- Adjusted Rand Index (ARI) +- Normalized Mutual Information (NMI) +- Homogeneity and Completeness scores + +## Biological Interpretation + +### Cluster Characterization + +#### Metabolic Pathway Analysis +- Identify enriched pathways per cluster +- Compare metabolic profiles between clusters +- Relate clusters to biological conditions + +#### Sample Annotation +- Map clusters to experimental conditions +- Identify batch effects or confounders +- Validate with independent datasets + +#### Feature Importance +- Determine reactions/metabolites driving clustering +- Analyze cluster centroids for biological insights +- Connect to known metabolic phenotypes + +## Integration Workflow + +### Upstream Data Sources + +#### COBRAxy Tools +- [RAS Generator](ras-generator.md) - Cluster based on reaction activities +- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities +- [Flux Simulation](flux-simulation.md) - Cluster flux distributions + +#### External Data +- Gene expression matrices +- Metabolomics datasets +- Clinical metadata + +### Downstream Analysis + +#### Supervised Learning +Use cluster labels for: +- Classification model training +- Biomarker discovery +- Outcome prediction + +#### Differential Analysis +- Compare clusters with [MAREA](marea.md) +- Identify cluster-specific metabolic signatures +- Pathway enrichment analysis + +### Typical Pipeline + +```bash +# 1. Generate metabolic scores +ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv + +# 2. Perform clustering analysis +marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \ + -sc true -k1 2 -k2 8 -el true -si true \ + -idop clusters/ -bc best_clusters.tsv + +# 3. Analyze cluster differences +marea -td /opt/COBRAxy -input_data ras.tsv \ + -input_class best_clusters.tsv -comparison manyvsmany \ + -test ks -choice_map ENGRO2 -idop cluster_analysis/ +``` + +## Tips and Best Practices + +### Data Preparation +- **Normalization**: Always scale features for distance-based methods +- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features) +- **Missing Values**: Handle appropriately (imputation or removal) +- **Outliers**: Identify and consider removal for K-means + +### Algorithm Selection +- **K-means**: Start here for most applications +- **DBSCAN**: Use when clusters have irregular shapes or noise present +- **Hierarchical**: Choose for small datasets or when hierarchy matters + +### Parameter Selection +- **Start Simple**: Begin with default parameters +- **Use Validation**: Always employ silhouette analysis +- **Cross-Validate**: Test stability across parameter ranges +- **Biological Validation**: Ensure clusters make biological sense + +### Result Interpretation +- **Multiple Algorithms**: Compare results across methods +- **Stability Assessment**: Check clustering reproducibility +- **Biological Context**: Integrate with known sample characteristics +- **Statistical Testing**: Validate cluster differences formally + +## Troubleshooting + +### Common Issues + +**Poor clustering quality** +- Check data scaling and normalization +- Assess feature selection and dimensionality +- Try different algorithms or parameters +- Evaluate data structure with PCA/t-SNE + +**Algorithm doesn't converge** +- Increase iteration limits for K-means +- Adjust epsilon/min_samples for DBSCAN +- Check for numerical stability issues +- Verify input data format + +**Memory or performance issues** +- Reduce dataset size or dimensionality +- Use sampling for large datasets +- Consider approximate algorithms +- Monitor system resources + +### Error Messages + +| Error | Cause | Solution | +|-------|-------|----------| +| "Convergence failed" | K-means iteration limit | Increase max iterations or check data | +| "No clusters found" | DBSCAN parameters too strict | Reduce eps or min_samples | +| "Memory allocation error" | Dataset too large | Reduce size or increase memory | +| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm | + +### Performance Optimization + +**Large Datasets** +- Use mini-batch K-means for speed +- Sample data for parameter optimization +- Employ dimensionality reduction +- Consider distributed computing + +**High-Dimensional Data** +- Apply feature selection +- Use PCA preprocessing +- Consider specialized algorithms +- Validate results carefully + +## Advanced Usage + +### Custom Distance Metrics + +For specialized applications, modify distance calculations: + +```python +# Custom distance function for metabolic data +def metabolic_distance(x, y): + # Implement pathway-aware distance metric + return custom_distance_value +``` + +### Ensemble Clustering + +Combine multiple clustering results: + +```bash +# Run multiple algorithms and combine +for method in kmeans dbscan hierarchy; do + marea_cluster -cy $method -in data.tsv -idop ${method}_results/ +done + +# Consensus clustering (requires custom script) +python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv +``` + +### Interactive Analysis + +Generate interactive plots for exploration: + +```python +import plotly.express as px +import pandas as pd + +# Load clustering results +results = pd.read_csv('best_clusters.tsv', sep='\t') +data = pd.read_csv('metabolic_data.tsv', sep='\t') + +# Interactive scatter plot +fig = px.scatter(data, x='PC1', y='PC2', color=results['Cluster']) +fig.show() +``` + +## See Also + +- [MAREA](marea.md) - Statistical analysis of cluster differences +- [RAS Generator](ras-generator.md) - Generate clustering input data +- [Flux Simulation](flux-simulation.md) - Alternative clustering data source +- [Clustering Tutorial](../tutorials/clustering-analysis.md) +- [Validation Methods Reference](../tutorials/cluster-validation.md) \ No newline at end of file