view COBRAxy/docs/tools/marea-cluster.md @ 509:5956dcf94277 draft default tip

Uploaded
author francesco_lapi
date Wed, 01 Oct 2025 15:34:21 +0000
parents 4ed95023af20
children
line wrap: on
line source

# MAREA Cluster

Perform clustering analysis on metabolic data to identify sample groups and patterns.

## Overview

MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis.

## Usage

### Command Line

```bash
marea_cluster -td /path/to/COBRAxy \
              -in metabolic_data.tsv \
              -cy kmeans \
              -sc true \
              -k1 2 \
              -k2 8 \
              -el true \
              -si true \
              -idop clustering_results/ \
              -ol cluster.log
```

### Galaxy Interface

Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface.

## Parameters

### Required Parameters

| Parameter | Flag | Description |
|-----------|------|-------------|
| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory |
| Input Data | `-in, --input` | Metabolic data file (TSV format) |

### Clustering Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans |
| Data Scaling | `-sc, --scaling` | Apply data normalization | true |
| Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 |
| Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 |

### Analysis Options

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false |
| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false |

### DBSCAN Specific Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Min Samples | `-ms, --min_samples` | Minimum samples per cluster | - |
| Epsilon | `-ep, --eps` | Maximum distance between samples | - |

### Output Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Output Path | `-idop, --output_path` | Results directory | clustering/ |
| Output Log | `-ol, --out_log` | Log file path | - |
| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - |

## Clustering Algorithms

### K-means
**Method**: Partitional clustering using centroids
- Assumes spherical clusters
- Requires pre-specified number of clusters (k)
- Fast and scalable
- Works well with normalized data

**Best for**:
- Well-separated, compact clusters
- Large datasets
- When cluster number is approximately known

### DBSCAN
**Method**: Density-based clustering  
- Identifies clusters of varying shapes
- Automatically determines cluster number
- Robust to outliers and noise
- Requires epsilon and min_samples parameters

**Best for**:
- Irregular cluster shapes
- Datasets with noise/outliers
- Unknown number of clusters

### Hierarchical
**Method**: Agglomerative clustering with dendrograms
- Creates tree-like cluster hierarchy
- No need to specify cluster number initially  
- Deterministic results
- Provides multiple resolution levels

**Best for**:
- Small to medium datasets
- When cluster hierarchy is important
- Exploratory analysis

## Input Format

### Metabolic Data File

Tab-separated format with samples as rows and reactions/metabolites as columns:

```
Sample	R00001	R00002	R00003	R00004	...
Sample1	1.25	0.85	1.42	0.78	...
Sample2	0.65	1.35	0.72	1.28	...
Sample3	2.15	2.05	0.45	0.52	...
Control1	1.05	0.98	1.15	1.08	...
Control2	0.95	1.12	0.88	0.92	...
```

**Requirements**:
- First column: sample identifiers
- Subsequent columns: feature values (RAS, RPS, fluxes)
- Missing values: use 0 or leave empty
- Numeric data only (excluding sample names)

## Data Preprocessing

### Scaling Options

#### Standard Scaling (Recommended)
- Mean centering and unit variance scaling
- Formula: `(x - mean) / std`
- Ensures equal feature contribution
- Required for distance-based algorithms

#### No Scaling
- Use original data values
- May be appropriate for already normalized data
- Risk of feature dominance by high-magnitude variables

### Feature Selection

Consider preprocessing steps:
- Remove low-variance features
- Apply dimensionality reduction (PCA)
- Select most variable reactions/metabolites
- Handle missing data appropriately

## Output Files

### Cluster Assignments

#### Best Clustering Result (`best_clusters.tsv`)
```
Sample	Cluster	Silhouette_Score
Sample1	1	0.73
Sample2	1	0.68  
Sample3	2	0.81
Control1	0	0.59
Control2	0	0.62
```

#### All K Results (`clustering_results_k{n}.tsv`)
Individual files for each tested cluster number.

### Validation Metrics

#### Elbow Plot (`elbow_plot.png`)
- X-axis: Number of clusters (k)
- Y-axis: Within-cluster sum of squares (WCSS)
- Identifies optimal k at the "elbow" point

#### Silhouette Plots (`silhouette_k{n}.png`)
- Individual sample silhouette scores
- Average silhouette width per cluster
- Overall clustering quality assessment

### Summary Statistics

#### Clustering Summary (`clustering_summary.txt`)
```
Algorithm: kmeans
Scaling: true
Optimal K: 3
Best Silhouette Score: 0.72
Number of Samples: 20
Feature Dimensions: 150
```

#### Cluster Characteristics (`cluster_stats.tsv`)
```
Cluster	Size	Centroid_R00001	Centroid_R00002	Avg_Silhouette
0	8	0.95	1.12	0.68
1	7	1.35	0.82	0.74
2	5	0.65	1.55	0.69
```

## Examples

### Basic K-means Clustering

```bash
# Simple K-means with elbow analysis
marea_cluster -td /opt/COBRAxy \
              -in ras_data.tsv \
              -cy kmeans \
              -sc true \
              -k1 2 \
              -k2 10 \
              -el true \
              -si true \
              -idop kmeans_results/ \
              -ol kmeans.log
```

### DBSCAN Analysis

```bash
# Density-based clustering with custom parameters
marea_cluster -td /opt/COBRAxy \
              -in flux_samples.tsv \
              -cy dbscan \
              -sc true \
              -ms 5 \
              -ep 0.5 \
              -idop dbscan_results/ \
              -bc best_dbscan_clusters.tsv \
              -ol dbscan.log
```

### Hierarchical Clustering

```bash
# Hierarchical clustering for small dataset
marea_cluster -td /opt/COBRAxy \
              -in rps_scores.tsv \
              -cy hierarchy \
              -sc true \
              -k1 2 \
              -k2 6 \
              -si true \
              -idop hierarchical_results/ \
              -ol hierarchy.log
```

### Comprehensive Clustering Analysis

```bash
# Compare multiple algorithms
algorithms=("kmeans" "dbscan" "hierarchy")
for alg in "${algorithms[@]}"; do
    marea_cluster -td /opt/COBRAxy \
                  -in metabolomics_data.tsv \
                  -cy "$alg" \
                  -sc true \
                  -k1 2 \
                  -k2 8 \
                  -el true \
                  -si true \
                  -idop "${alg}_clustering/" \
                  -ol "${alg}_cluster.log"
done
```

## Parameter Optimization

### K-means Optimization

#### Elbow Method
1. Run K-means for k = 2 to k_max
2. Plot WCSS vs k
3. Identify "elbow" point where improvement diminishes
4. Select k at elbow as optimal

#### Silhouette Analysis
1. Compute silhouette scores for each k
2. Select k with highest average silhouette score
3. Validate with silhouette plots
4. Ensure clusters are well-separated

### DBSCAN Parameter Tuning

#### Epsilon (eps) Selection
- Use k-distance plot to identify knee point
- Start with eps = average distance to k-th nearest neighbor
- Adjust based on cluster quality metrics

#### Min Samples Selection
- Rule of thumb: min_samples ≥ dimensionality + 1
- Higher values create denser clusters
- Lower values may increase noise sensitivity

### Hierarchical Clustering

#### Linkage Method
- Ward: Minimizes within-cluster variance
- Complete: Maximum distance between clusters
- Average: Mean distance between clusters
- Single: Minimum distance (prone to chaining)

## Quality Assessment

### Internal Validation Metrics

#### Silhouette Score
- Range: [-1, 1]
- >0.7: Strong clustering
- 0.5-0.7: Reasonable clustering
- <0.5: Weak clustering

#### Calinski-Harabasz Index
- Higher values indicate better clustering
- Ratio of between-cluster to within-cluster variance

#### Davies-Bouldin Index
- Lower values indicate better clustering
- Average similarity between clusters

### External Validation

When ground truth labels available:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Homogeneity and Completeness scores

## Biological Interpretation

### Cluster Characterization

#### Metabolic Pathway Analysis
- Identify enriched pathways per cluster
- Compare metabolic profiles between clusters
- Relate clusters to biological conditions

#### Sample Annotation
- Map clusters to experimental conditions
- Identify batch effects or confounders
- Validate with independent datasets

#### Feature Importance
- Determine reactions/metabolites driving clustering
- Analyze cluster centroids for biological insights
- Connect to known metabolic phenotypes

## Integration Workflow

### Upstream Data Sources

#### COBRAxy Tools
- [RAS Generator](ras-generator.md) - Cluster based on reaction activities
- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities  
- [Flux Simulation](flux-simulation.md) - Cluster flux distributions

#### External Data
- Gene expression matrices
- Metabolomics datasets
- Clinical metadata

### Downstream Analysis

#### Supervised Learning
Use cluster labels for:
- Classification model training
- Biomarker discovery
- Outcome prediction

#### Differential Analysis
- Compare clusters with [MAREA](marea.md)
- Identify cluster-specific metabolic signatures
- Pathway enrichment analysis

### Typical Pipeline

```bash
# 1. Generate metabolic scores
ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv

# 2. Perform clustering analysis
marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \
              -sc true -k1 2 -k2 8 -el true -si true \
              -idop clusters/ -bc best_clusters.tsv

# 3. Analyze cluster differences
marea -td /opt/COBRAxy -input_data ras.tsv \
      -input_class best_clusters.tsv -comparison manyvsmany \
      -test ks -choice_map ENGRO2 -idop cluster_analysis/
```

## Tips and Best Practices

### Data Preparation
- **Normalization**: Always scale features for distance-based methods
- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features)
- **Missing Values**: Handle appropriately (imputation or removal)
- **Outliers**: Identify and consider removal for K-means

### Algorithm Selection
- **K-means**: Start here for most applications
- **DBSCAN**: Use when clusters have irregular shapes or noise present
- **Hierarchical**: Choose for small datasets or when hierarchy matters

### Parameter Selection
- **Start Simple**: Begin with default parameters
- **Use Validation**: Always employ silhouette analysis
- **Cross-Validate**: Test stability across parameter ranges
- **Biological Validation**: Ensure clusters make biological sense

### Result Interpretation
- **Multiple Algorithms**: Compare results across methods
- **Stability Assessment**: Check clustering reproducibility
- **Biological Context**: Integrate with known sample characteristics
- **Statistical Testing**: Validate cluster differences formally

## Troubleshooting

### Common Issues

**Poor clustering quality**
- Check data scaling and normalization
- Assess feature selection and dimensionality
- Try different algorithms or parameters
- Evaluate data structure with PCA/t-SNE

**Algorithm doesn't converge**
- Increase iteration limits for K-means
- Adjust epsilon/min_samples for DBSCAN
- Check for numerical stability issues
- Verify input data format

**Memory or performance issues**
- Reduce dataset size or dimensionality
- Use sampling for large datasets
- Consider approximate algorithms
- Monitor system resources

### Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| "Convergence failed" | K-means iteration limit | Increase max iterations or check data |
| "No clusters found" | DBSCAN parameters too strict | Reduce eps or min_samples |
| "Memory allocation error" | Dataset too large | Reduce size or increase memory |
| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm |

### Performance Optimization

**Large Datasets**
- Use mini-batch K-means for speed
- Sample data for parameter optimization
- Employ dimensionality reduction
- Consider distributed computing

**High-Dimensional Data**
- Apply feature selection
- Use PCA preprocessing
- Consider specialized algorithms
- Validate results carefully

## Advanced Usage

### Custom Distance Metrics

For specialized applications, modify distance calculations:

```python
# Custom distance function for metabolic data
def metabolic_distance(x, y):
    # Implement pathway-aware distance metric
    return custom_distance_value
```

### Ensemble Clustering

Combine multiple clustering results:

```bash
# Run multiple algorithms and combine
for method in kmeans dbscan hierarchy; do
    marea_cluster -cy $method -in data.tsv -idop ${method}_results/
done

# Consensus clustering (requires custom script)
python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv
```

### Interactive Analysis

Generate interactive plots for exploration:

```python
import plotly.express as px
import pandas as pd

# Load clustering results  
results = pd.read_csv('best_clusters.tsv', sep='\t')
data = pd.read_csv('metabolic_data.tsv', sep='\t')

# Interactive scatter plot
fig = px.scatter(data, x='PC1', y='PC2', color=results['Cluster'])
fig.show()
```

## See Also

- [MAREA](marea.md) - Statistical analysis of cluster differences
- [RAS Generator](ras-generator.md) - Generate clustering input data
- [Flux Simulation](flux-simulation.md) - Alternative clustering data source
- [Clustering Tutorial](../tutorials/clustering-analysis.md)
- [Validation Methods Reference](../tutorials/cluster-validation.md)