diff COBRAxy/docs/tools/marea-cluster.md @ 547:73f2f7e2be17 draft

Uploaded
author francesco_lapi
date Tue, 28 Oct 2025 10:44:07 +0000
parents fcdbc81feb45
children
--- a/COBRAxy/docs/tools/marea-cluster.md	Mon Oct 27 12:33:08 2025 +0000
+++ b/COBRAxy/docs/tools/marea-cluster.md	Tue Oct 28 10:44:07 2025 +0000
@@ -1,512 +1,74 @@
 # MAREA Cluster
 
-Perform clustering analysis on metabolic data to identify sample groups and patterns.
+Cluster analysis for metabolic data (RAS/RPS scores, flux distributions).
 
 ## Overview
 
-MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis.
+MAREA Cluster performs unsupervised clustering on metabolic data using K-means, DBSCAN, or hierarchical algorithms.
+
+## Galaxy Interface
+
+In Galaxy: **COBRAxy → Cluster Analysis**
 
-## Usage
+1. Upload metabolic data file
+2. Select clustering algorithm and parameters
+3. Click **Run tool**
 
-### Command Line
+## Command-line console
 
 ```bash
-marea_cluster -td /path/to/COBRAxy \
-              -in metabolic_data.tsv \
-              -cy kmeans \
-              -sc true \
-              -k1 2 \
-              -k2 8 \
-              -el true \
-              -si true \
-              -idop clustering_results/ \
-              -ol cluster.log
-```
-
-### Galaxy Interface
-
-Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface.
-
-## Parameters
-
-### Required Parameters
-
-| Parameter | Flag | Description |
-|-----------|------|-------------|
-| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory |
-| Input Data | `-in, --input` | Metabolic data file (TSV format) |
-
-### Clustering Parameters
-
-| Parameter | Flag | Description | Default |
-|-----------|------|-------------|---------|
-| Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans |
-| Data Scaling | `-sc, --scaling` | Apply data normalization | true |
-| Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 |
-| Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 |
-
-### Analysis Options
-
-| Parameter | Flag | Description | Default |
-|-----------|------|-------------|---------|
-| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false |
-| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false |
-
-### DBSCAN Specific Parameters
-
-| Parameter | Flag | Description | Default |
-|-----------|------|-------------|---------|
-| Min Samples | `-ms, --min_samples` | Minimum samples per cluster | - |
-| Epsilon | `-ep, --eps` | Maximum distance between samples | - |
-
-### Output Parameters
-
-| Parameter | Flag | Description | Default |
-|-----------|------|-------------|---------|
-| Output Path | `-idop, --output_path` | Results directory | clustering/ |
-| Output Log | `-ol, --out_log` | Log file path | - |
-| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - |
-
-## Clustering Algorithms
-
-### K-means
-**Method**: Partitional clustering using centroids
-- Assumes spherical clusters
-- Requires pre-specified number of clusters (k)
-- Fast and scalable
-- Works well with normalized data
-
-**Best for**:
-- Well-separated, compact clusters
-- Large datasets
-- When cluster number is approximately known
-
-### DBSCAN
-**Method**: Density-based clustering  
-- Identifies clusters of varying shapes
-- Automatically determines cluster number
-- Robust to outliers and noise
-- Requires epsilon and min_samples parameters
-
-**Best for**:
-- Irregular cluster shapes
-- Datasets with noise/outliers
-- Unknown number of clusters
-
-### Hierarchical
-**Method**: Agglomerative clustering with dendrograms
-- Creates tree-like cluster hierarchy
-- No need to specify cluster number initially  
-- Deterministic results
-- Provides multiple resolution levels
-
-**Best for**:
-- Small to medium datasets
-- When cluster hierarchy is important
-- Exploratory analysis
-
-## Input Format
-
-### Metabolic Data File
-
-Tab-separated format with samples as rows and reactions/metabolites as columns:
-
-```
-Sample	R00001	R00002	R00003	R00004	...
-Sample1	1.25	0.85	1.42	0.78	...
-Sample2	0.65	1.35	0.72	1.28	...
-Sample3	2.15	2.05	0.45	0.52	...
-Control1	1.05	0.98	1.15	1.08	...
-Control2	0.95	1.12	0.88	0.92	...
-```
-
-**Requirements**:
-- First column: sample identifiers
-- Subsequent columns: feature values (RAS, RPS, fluxes)
-- Missing values: use 0 or leave empty
-- Numeric data only (excluding sample names)
-
-## Data Preprocessing
-
-### Scaling Options
-
-#### Standard Scaling (Recommended)
-- Mean centering and unit variance scaling
-- Formula: `(x - mean) / std`
-- Ensures equal feature contribution
-- Required for distance-based algorithms
-
-#### No Scaling
-- Use original data values
-- May be appropriate for already normalized data
-- Risk of feature dominance by high-magnitude variables
-
-### Feature Selection
-
-Consider preprocessing steps:
-- Remove low-variance features
-- Apply dimensionality reduction (PCA)
-- Select most variable reactions/metabolites
-- Handle missing data appropriately
-
-## Output Files
-
-### Cluster Assignments
-
-#### Best Clustering Result (`best_clusters.tsv`)
-```
-Sample	Cluster	Silhouette_Score
-Sample1	1	0.73
-Sample2	1	0.68  
-Sample3	2	0.81
-Control1	0	0.59
-Control2	0	0.62
-```
-
-#### All K Results (`clustering_results_k{n}.tsv`)
-Individual files for each tested cluster number.
-
-### Validation Metrics
-
-#### Elbow Plot (`elbow_plot.png`)
-- X-axis: Number of clusters (k)
-- Y-axis: Within-cluster sum of squares (WCSS)
-- Identifies optimal k at the "elbow" point
-
-#### Silhouette Plots (`silhouette_k{n}.png`)
-- Individual sample silhouette scores
-- Average silhouette width per cluster
-- Overall clustering quality assessment
-
-### Summary Statistics
-
-#### Clustering Summary (`clustering_summary.txt`)
-```
-Algorithm: kmeans
-Scaling: true
-Optimal K: 3
-Best Silhouette Score: 0.72
-Number of Samples: 20
-Feature Dimensions: 150
-```
-
-#### Cluster Characteristics (`cluster_stats.tsv`)
-```
-Cluster	Size	Centroid_R00001	Centroid_R00002	Avg_Silhouette
-0	8	0.95	1.12	0.68
-1	7	1.35	0.82	0.74
-2	5	0.65	1.55	0.69
-```
-
-## Examples
-
-### Basic K-means Clustering
-
-```bash
-# Simple K-means with elbow analysis
-marea_cluster -td /opt/COBRAxy \
-              -in ras_data.tsv \
+marea_cluster -in metabolic_data.tsv \
               -cy kmeans \
               -sc true \
               -k1 2 \
               -k2 10 \
-              -el true \
-              -si true \
-              -idop kmeans_results/ \
-              -ol kmeans.log
-```
-
-### DBSCAN Analysis
-
-```bash
-# Density-based clustering with custom parameters
-marea_cluster -td /opt/COBRAxy \
-              -in flux_samples.tsv \
-              -cy dbscan \
-              -sc true \
-              -ms 5 \
-              -ep 0.5 \
-              -idop dbscan_results/ \
-              -bc best_dbscan_clusters.tsv \
-              -ol dbscan.log
-```
-
-### Hierarchical Clustering
-
-```bash
-# Hierarchical clustering for small dataset
-marea_cluster -td /opt/COBRAxy \
-              -in rps_scores.tsv \
-              -cy hierarchy \
-              -sc true \
-              -k1 2 \
-              -k2 6 \
-              -si true \
-              -idop hierarchical_results/ \
-              -ol hierarchy.log
-```
-
-### Comprehensive Clustering Analysis
-
-```bash
-# Compare multiple algorithms
-algorithms=("kmeans" "dbscan" "hierarchy")
-for alg in "${algorithms[@]}"; do
-    marea_cluster -td /opt/COBRAxy \
-                  -in metabolomics_data.tsv \
-                  -cy "$alg" \
-                  -sc true \
-                  -k1 2 \
-                  -k2 8 \
-                  -el true \
-                  -si true \
-                  -idop "${alg}_clustering/" \
-                  -ol "${alg}_cluster.log"
-done
+              -idop output/
 ```
 
-## Parameter Optimization
-
-### K-means Optimization
-
-#### Elbow Method
-1. Run K-means for k = 2 to k_max
-2. Plot WCSS vs k
-3. Identify "elbow" point where improvement diminishes
-4. Select k at elbow as optimal
-
-#### Silhouette Analysis
-1. Compute silhouette scores for each k
-2. Select k with highest average silhouette score
-3. Validate with silhouette plots
-4. Ensure clusters are well-separated
-
-### DBSCAN Parameter Tuning
-
-#### Epsilon (eps) Selection
-- Use k-distance plot to identify knee point
-- Start with eps = average distance to k-th nearest neighbor
-- Adjust based on cluster quality metrics
-
-#### Min Samples Selection
-- Rule of thumb: min_samples ≥ dimensionality + 1
-- Higher values create denser clusters
-- Lower values may increase noise sensitivity
-
-### Hierarchical Clustering
-
-#### Linkage Method
-- Ward: Minimizes within-cluster variance
-- Complete: Maximum distance between clusters
-- Average: Mean distance between clusters
-- Single: Minimum distance (prone to chaining)
-
-## Quality Assessment
-
-### Internal Validation Metrics
-
-#### Silhouette Score
-- Range: [-1, 1]
-- >0.7: Strong clustering
-- 0.5-0.7: Reasonable clustering
-- <0.5: Weak clustering
-
-#### Calinski-Harabasz Index
-- Higher values indicate better clustering
-- Ratio of between-cluster to within-cluster variance
-
-#### Davies-Bouldin Index
-- Lower values indicate better clustering
-- Average similarity between clusters
-
-### External Validation
-
-When ground truth labels available:
-- Adjusted Rand Index (ARI)
-- Normalized Mutual Information (NMI)
-- Homogeneity and Completeness scores
+## Parameters
 
-## Biological Interpretation
-
-### Cluster Characterization
-
-#### Metabolic Pathway Analysis
-- Identify enriched pathways per cluster
-- Compare metabolic profiles between clusters
-- Relate clusters to biological conditions
-
-#### Sample Annotation
-- Map clusters to experimental conditions
-- Identify batch effects or confounders
-- Validate with independent datasets
-
-#### Feature Importance
-- Determine reactions/metabolites driving clustering
-- Analyze cluster centroids for biological insights
-- Connect to known metabolic phenotypes
-
-## Integration Workflow
-
-### Upstream Data Sources
-
-#### COBRAxy Tools
-- [RAS Generator](ras-generator.md) - Cluster based on reaction activities
-- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities  
-- [Flux Simulation](flux-simulation.md) - Cluster flux distributions
+| Parameter | Flag | Description | Default |
+|-----------|------|-------------|---------|
+| Input Data | `-in` | Metabolic data TSV file | - |
+| Algorithm | `-cy` | kmeans, dbscan, hierarchy | kmeans |
+| Scaling | `-sc` | Scale data | false |
+| K Min | `-k1` | Minimum clusters (K-means/hierarchy) | 2 |
+| K Max | `-k2` | Maximum clusters (K-means/hierarchy) | 10 |
+| Epsilon | `-ep` | DBSCAN radius | 0.5 |
+| Min Samples | `-ms` | DBSCAN minimum samples | 5 |
+| Elbow Plot | `-el` | Generate elbow plot | false |
+| Silhouette | `-si` | Compute silhouette scores | false |
+| Output Path | `-idop` | Output directory | marea_cluster/ |
 
-#### External Data
-- Gene expression matrices
-- Metabolomics datasets
-- Clinical metadata
-
-### Downstream Analysis
-
-#### Supervised Learning
-Use cluster labels for:
-- Classification model training
-- Biomarker discovery
-- Outcome prediction
+## Input Format
 
-#### Differential Analysis
-- Compare clusters with [MAREA](marea.md)
-- Identify cluster-specific metabolic signatures
-- Pathway enrichment analysis
-
-### Typical Pipeline
-
-```bash
-# 1. Generate metabolic scores
-ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv
-
-# 2. Perform clustering analysis
-marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \
-              -sc true -k1 2 -k2 8 -el true -si true \
-              -idop clusters/ -bc best_clusters.tsv
-
-# 3. Analyze cluster differences
-marea -td /opt/COBRAxy -input_data ras.tsv \
-      -input_class best_clusters.tsv -comparison manyvsmany \
-      -test ks -choice_map ENGRO2 -idop cluster_analysis/
+```
+Reaction	Sample1	Sample2	Sample3
+R00001	1.25	0.85	1.42
+R00002	0.65	1.35	0.72
 ```
 
-## Tips and Best Practices
-
-### Data Preparation
-- **Normalization**: Always scale features for distance-based methods
-- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features)
-- **Missing Values**: Handle appropriately (imputation or removal)
-- **Outliers**: Identify and consider removal for K-means
-
-### Algorithm Selection
-- **K-means**: Start here for most applications
-- **DBSCAN**: Use when clusters have irregular shapes or noise present
-- **Hierarchical**: Choose for small datasets or when hierarchy matters
-
-### Parameter Selection
-- **Start Simple**: Begin with default parameters
-- **Use Validation**: Always employ silhouette analysis
-- **Cross-Validate**: Test stability across parameter ranges
-- **Biological Validation**: Ensure clusters make biological sense
-
-### Result Interpretation
-- **Multiple Algorithms**: Compare results across methods
-- **Stability Assessment**: Check clustering reproducibility
-- **Biological Context**: Integrate with known sample characteristics
-- **Statistical Testing**: Validate cluster differences formally
-
-## Troubleshooting
+**File Format Notes:**
+- Use **tab-separated** values (TSV) or **comma-separated** (CSV)
+- First row must contain column headers (Reaction, Sample names)
+- Numeric values only for metabolic data
+- Missing values should be avoided or handled before clustering
 
-### Common Issues
-
-**Poor clustering quality**
-- Check data scaling and normalization
-- Assess feature selection and dimensionality
-- Try different algorithms or parameters
-- Evaluate data structure with PCA/t-SNE
-
-**Algorithm doesn't converge**
-- Increase iteration limits for K-means
-- Adjust epsilon/min_samples for DBSCAN
-- Check for numerical stability issues
-- Verify input data format
-
-**Memory or performance issues**
-- Reduce dataset size or dimensionality
-- Use sampling for large datasets
-- Consider approximate algorithms
-- Monitor system resources
-
-### Error Messages
-
-| Error | Cause | Solution |
-|-------|-------|----------|
-| "Convergence failed" | K-means iteration limit | Increase max iterations or check data |
-| "No clusters found" | DBSCAN parameters too strict | Reduce eps or min_samples |
-| "Memory allocation error" | Dataset too large | Reduce size or increase memory |
-| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm |
+## Algorithms
 
-### Performance Optimization
-
-**Large Datasets**
-- Use mini-batch K-means for speed
-- Sample data for parameter optimization
-- Employ dimensionality reduction
-- Consider distributed computing
-
-**High-Dimensional Data**
-- Apply feature selection
-- Use PCA preprocessing
-- Consider specialized algorithms
-- Validate results carefully
-
-## Advanced Usage
-
-### Custom Distance Metrics
-
-For specialized applications, modify distance calculations:
-
-```python
-# Custom distance function for metabolic data
-def metabolic_distance(x, y):
-    # Implement pathway-aware distance metric
-    return custom_distance_value
-```
-
-### Ensemble Clustering
+- **K-means**: Fast, requires number of clusters
+- **DBSCAN**: Density-based, handles noise and irregular shapes
+- **Hierarchical**: Tree-based, good for small datasets
 
-Combine multiple clustering results:
-
-```bash
-# Run multiple algorithms and combine
-for method in kmeans dbscan hierarchy; do
-    marea_cluster -cy $method -in data.tsv -idop ${method}_results/
-done
-
-# Consensus clustering (requires custom script)
-python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv
-```
-
-### Interactive Analysis
+## Output
 
-Generate interactive plots for exploration:
-
-```python
-import plotly.express as px
-import pandas as pd
-
-# Load clustering results  
-results = pd.read_csv('best_clusters.tsv', sep='\t')
-data = pd.read_csv('metabolic_data.tsv', sep='\t')
-
-# Interactive scatter plot
-fig = px.scatter(data, x='PC1', y='PC2', color=results['Cluster'])
-fig.show()
-```
+- `clusters.tsv`: Sample assignments
+- `silhouette_scores.tsv`: Cluster quality metrics
+- `elbow_plot.svg`: Optimal K visualization (K-means)
+- `*.log`: Processing log
 
 ## See Also
 
-- [MAREA](marea.md) - Statistical analysis of cluster differences
-- [RAS Generator](ras-generator.md) - Generate clustering input data
-- [Flux Simulation](flux-simulation.md) - Alternative clustering data source
-- [Clustering Tutorial](/tutorials/clustering-analysis.md)
-- [Validation Methods Reference](/tutorials/cluster-validation.md)
\ No newline at end of file
+- [MAREA](tools/marea)
+- [RAS Generator](tools/ras-generator)
+- [Flux Simulation](tools/flux-simulation)
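
Note on the new Input Format section: the example table now puts reactions in rows and samples in columns, while clustering groups samples. The diff does not say how `marea_cluster` handles this internally, so the following is only a minimal standard-library sketch of one way such a table could be turned into per-sample feature vectors and z-scored before clustering. The helper names `load_samples` and `scale_features` are hypothetical and are not part of COBRAxy; the scaling shown, `(x - mean) / std`, is an assumption about what `-sc true` might do.

```python
import csv
import io
from statistics import mean, pstdev

# Example table copied from the Input Format section (reactions as rows).
RAW = (
    "Reaction\tSample1\tSample2\tSample3\n"
    "R00001\t1.25\t0.85\t1.42\n"
    "R00002\t0.65\t1.35\t0.72\n"
)


def load_samples(text):
    """Parse a reaction-by-sample TSV into (reactions, {sample: values}).

    Hypothetical helper: each sample maps to its values in reaction order.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    sample_names = header[1:]  # first column holds reaction IDs
    reactions = [row[0] for row in body]
    samples = {
        name: [float(row[i + 1]) for row in body]
        for i, name in enumerate(sample_names)
    }
    return reactions, samples


def scale_features(samples):
    """Z-score each reaction across samples (assumed scaling, population std).

    A reaction with zero variance is mapped to 0.0 to avoid division by zero.
    """
    names = list(samples)
    columns = list(zip(*(samples[n] for n in names)))  # one tuple per reaction
    scaled_columns = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        scaled_columns.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return {n: [col[i] for col in scaled_columns] for i, n in enumerate(names)}


reactions, samples = load_samples(RAW)
scaled = scale_features(samples)
```

Here `samples["Sample1"]` holds that sample's values for `R00001` and `R00002`, and `scaled` has the same shape with each reaction centered on zero across samples, which is the form a distance-based algorithm such as K-means would typically consume.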