# MAREA Cluster

Perform clustering analysis on metabolic data to identify sample groups and patterns.

## Overview

MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis.

## Usage

### Command Line

```bash
marea_cluster -td /path/to/COBRAxy \
  -in metabolic_data.tsv \
  -cy kmeans \
  -sc true \
  -k1 2 \
  -k2 8 \
  -el true \
  -si true \
  -idop clustering_results/ \
  -ol cluster.log
```

### Galaxy Interface

Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface.

## Parameters

### Required Parameters

| Parameter | Flag | Description |
|-----------|------|-------------|
| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory |
| Input Data | `-in, --input` | Metabolic data file (TSV format) |

### Clustering Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Cluster Type | `-cy, --cluster_type` | Clustering algorithm (`kmeans`, `dbscan`, or `hierarchy`) | kmeans |
| Data Scaling | `-sc, --scaling` | Apply data normalization | true |
| Minimum K | `-k1, --k_min` | Minimum number of clusters to test | 2 |
| Maximum K | `-k2, --k_max` | Maximum number of clusters to test | 7 |

### Analysis Options

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false |
| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false |

### DBSCAN Specific Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Min Samples | `-ms, --min_samples` | Minimum number of neighbors within epsilon for a sample to count as a core point | - |
| Epsilon | `-ep, --eps` | Maximum distance between two samples for them to be considered neighbors | - |

### Output Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Output Path | `-idop, --output_path` | Results directory | clustering/ |
| Output Log | `-ol, --out_log` | Log file path | - |
| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - |

## Clustering Algorithms

### K-means
**Method**: Partitional clustering using centroids
- Assumes spherical clusters
- Requires pre-specified number of clusters (k)
- Fast and scalable
- Works well with normalized data

**Best for**:
- Well-separated, compact clusters
- Large datasets
- When cluster number is approximately known

### DBSCAN
**Method**: Density-based clustering
- Identifies clusters of varying shapes
- Automatically determines cluster number
- Robust to outliers and noise
- Requires epsilon and min_samples parameters

**Best for**:
- Irregular cluster shapes
- Datasets with noise/outliers
- Unknown number of clusters

### Hierarchical
**Method**: Agglomerative clustering with dendrograms
- Creates tree-like cluster hierarchy
- No need to specify cluster number initially
- Deterministic results
- Provides multiple resolution levels

**Best for**:
- Small to medium datasets
- When cluster hierarchy is important
- Exploratory analysis

## Input Format

### Metabolic Data File

Tab-separated format with samples as rows and reactions/metabolites as columns:

```
Sample    R00001  R00002  R00003  R00004  ...
Sample1   1.25    0.85    1.42    0.78    ...
Sample2   0.65    1.35    0.72    1.28    ...
Sample3   2.15    2.05    0.45    0.52    ...
Control1  1.05    0.98    1.15    1.08    ...
Control2  0.95    1.12    0.88    0.92    ...
```

**Requirements**:
- First column: sample identifiers
- Subsequent columns: feature values (RAS, RPS, fluxes)
- Missing values: use 0 or leave empty
- Numeric data only (excluding sample names)

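
The requirements above can be checked before running the tool. The following is a minimal sketch, not part of MAREA Cluster itself: `load_metabolic_table` is a hypothetical helper, and treating an empty cell as `0.0` is an assumption based on the "use 0 or leave empty" rule.

```python
import csv
import io

def load_metabolic_table(handle):
    """Parse a samples-by-features TSV into (sample_ids, feature_names, rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    feature_names = header[1:]
    sample_ids, rows = [], []
    for line_no, record in enumerate(reader, start=2):
        if not record:
            continue  # skip blank lines
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(record)}"
            )
        sample_ids.append(record[0])
        # Assumption: an empty cell means "missing" and is read as 0.0.
        # float() raises ValueError on any non-numeric value.
        rows.append([float(v) if v.strip() else 0.0 for v in record[1:]])
    return sample_ids, feature_names, rows

# Tiny in-memory example with one empty cell
example = "Sample\tR00001\tR00002\nSample1\t1.25\t0.85\nSample2\t\t1.35\n"
ids, features, matrix = load_metabolic_table(io.StringIO(example))
print(ids, features, matrix)
```

For a real file, pass an open file handle (for example `open("metabolic_data.tsv", newline="")`) instead of the `StringIO` object.
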
## Data Preprocessing

### Scaling Options

#### Standard Scaling (Recommended)
- Mean centering and unit variance scaling
- Formula: `(x - mean) / std`, applied to each feature (column) independently
- Ensures equal feature contribution
- Important for distance-based algorithms, where high-magnitude features would otherwise dominate the distances

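
As an illustration of the formula above, here is a small sketch of column-wise standard scaling using only the standard library. It is not the tool's implementation; the use of the population standard deviation and the handling of constant columns are assumptions made for this example.

```python
import statistics

def standard_scale(rows):
    """Column-wise z-score: (x - mean) / std for every feature."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation (divide by n)
    stds = [statistics.pstdev(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Assumption: a constant column (std == 0) is mapped to 0.0
            (x - m) / s if s > 0 else 0.0
            for x, m, s in zip(row, means, stds)
        ])
    return scaled

# Rows are samples, columns are features; the second feature is constant
data = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = standard_scale(data)
print(scaled)
```

After scaling, every non-constant feature has mean 0 and unit variance, so no single reaction dominates the Euclidean distances used by the clustering algorithms.
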
#### No Scaling
- Use original data values
- May be appropriate for already normalized data
- Risk of feature dominance by high-magnitude variables

### Feature Selection

Consider preprocessing steps:
- Remove low-variance features
- Apply dimensionality reduction (PCA)
- Select most variable reactions/metabolites
- Handle missing data appropriately

## Output Files

### Cluster Assignments

#### Best Clustering Result (`best_clusters.tsv`)
```
Sample    Cluster  Silhouette_Score
Sample1   1        0.73
Sample2   1        0.68
Sample3   2        0.81
Control1  0        0.59
Control2  0        0.62
```

#### All K Results (`clustering_results_k{n}.tsv`)
Individual files for each tested cluster number.

### Validation Metrics

#### Elbow Plot (`elbow_plot.png`)
- X-axis: Number of clusters (k)
- Y-axis: Within-cluster sum of squares (WCSS)
- Identifies optimal k at the "elbow" point

#### Silhouette Plots (`silhouette_k{n}.png`)
- Individual sample silhouette scores
- Average silhouette width per cluster
- Overall clustering quality assessment

### Summary Statistics

#### Clustering Summary (`clustering_summary.txt`)
```
Algorithm: kmeans
Scaling: true
Optimal K: 3
Best Silhouette Score: 0.72
Number of Samples: 20
Feature Dimensions: 150
```

#### Cluster Characteristics (`cluster_stats.tsv`)
```
Cluster  Size  Centroid_R00001  Centroid_R00002  Avg_Silhouette
0        8     0.95             1.12             0.68
1        7     1.35             0.82             0.74
2        5     0.65             1.55             0.69
```

## Examples

### Basic K-means Clustering

```bash
# Simple K-means with elbow analysis
marea_cluster -td /opt/COBRAxy \
  -in ras_data.tsv \
  -cy kmeans \
  -sc true \
  -k1 2 \
  -k2 10 \
  -el true \
  -si true \
  -idop kmeans_results/ \
  -ol kmeans.log
```

### DBSCAN Analysis

```bash
# Density-based clustering with custom parameters
marea_cluster -td /opt/COBRAxy \
  -in flux_samples.tsv \
  -cy dbscan \
  -sc true \
  -ms 5 \
  -ep 0.5 \
  -idop dbscan_results/ \
  -bc best_dbscan_clusters.tsv \
  -ol dbscan.log
```

### Hierarchical Clustering

```bash
# Hierarchical clustering for small dataset
marea_cluster -td /opt/COBRAxy \
  -in rps_scores.tsv \
  -cy hierarchy \
  -sc true \
  -k1 2 \
  -k2 6 \
  -si true \
  -idop hierarchical_results/ \
  -ol hierarchy.log
```

### Comprehensive Clustering Analysis

```bash
# Compare multiple algorithms
algorithms=("kmeans" "dbscan" "hierarchy")
for alg in "${algorithms[@]}"; do
  marea_cluster -td /opt/COBRAxy \
    -in metabolomics_data.tsv \
    -cy "$alg" \
    -sc true \
    -k1 2 \
    -k2 8 \
    -el true \
    -si true \
    -idop "${alg}_clustering/" \
    -ol "${alg}_cluster.log"
done
```

## Parameter Optimization

### K-means Optimization

#### Elbow Method
1. Run K-means for each k from k_min to k_max
2. Plot WCSS vs k
3. Identify the "elbow" point where adding clusters stops reducing WCSS substantially
4. Select k at the elbow as optimal

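
The elbow procedure above can be sketched in a few lines. This is an independent illustration, not the code MAREA Cluster runs: `kmeans_wcss` is a plain Lloyd's-algorithm K-means written for this example, and `elbow_k` uses the largest second difference of WCSS as one simple, assumed heuristic for locating the elbow.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wcss(points, k, n_iter=50, seed=0):
    """Run plain Lloyd's K-means and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty)
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow_k(wcss_by_k):
    """Return the k where the WCSS curve bends most (largest second difference)."""
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = wcss_by_k[prev] - 2 * wcss_by_k[cur] + wcss_by_k[nxt]
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k

# Three synthetic, well-separated groups (illustrative data, not tool output)
rng = random.Random(42)
points = [[cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)]
          for cx, cy in [(0, 0), (10, 0), (5, 8)] for _ in range(10)]
wcss = {k: kmeans_wcss(points, k) for k in range(2, 9)}
print("suggested k:", elbow_k(wcss))
```

Because the second difference needs a neighbor on each side, this heuristic can only suggest values strictly between the smallest and largest tested k. K-means with a single random start can also settle in a poor local optimum, so the plot itself should still be inspected.
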
#### Silhouette Analysis
1. Compute silhouette scores for each k
2. Select k with highest average silhouette score
3. Validate with silhouette plots
4. Ensure clusters are well-separated

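
To make the silhouette score concrete, the sketch below computes it directly from its definition: for each sample, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the score is `(b - a) / max(a, b)`. It is a standalone illustration with Euclidean distance assumed, not the tool's implementation.

```python
import math

def silhouette_samples(points, labels):
    """Per-sample silhouette values s = (b - a) / max(a, b)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score is 0 by convention
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
scores = silhouette_samples(points, labels)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

Repeating this for the labels obtained at each k and keeping the k with the highest mean score corresponds to steps 1 and 2 above. A single-cluster result has no defined silhouette, which is why the function raises an error in that case.
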
### DBSCAN Parameter Tuning

#### Epsilon (eps) Selection
- Use k-distance plot to identify knee point
- Start with eps = average distance to k-th nearest neighbor
- Adjust based on cluster quality metrics

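
The k-distance values mentioned above are straightforward to compute. The following sketch assumes Euclidean distance and uses a brute-force neighbor search, which is adequate for the sample counts typical of this kind of analysis; `k_distances` is a hypothetical helper, not a MAREA Cluster function.

```python
import math
import statistics

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Three close samples and one outlier (illustrative data)
points = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [10.0, 10.0]]
kd = k_distances(points, k=1)
starting_eps = statistics.fmean(kd)  # "average distance to k-th nearest neighbor"
print(kd, starting_eps)
```

Plotting the sorted values gives the k-distance plot: the sharp rise at the end of the curve marks the knee, and points beyond it are candidates for noise. Note that a single distant outlier can inflate the average considerably, so the knee is usually the more robust starting point.
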
#### Min Samples Selection
- Rule of thumb: min_samples ≥ dimensionality + 1
- Higher values require denser regions to form a cluster and label more samples as noise
- Lower values may increase noise sensitivity

### Hierarchical Clustering

#### Linkage Method
- Ward: Merges the pair of clusters that gives the smallest increase in within-cluster variance
- Complete: Maximum distance between clusters
- Average: Mean distance between clusters
- Single: Minimum distance (prone to chaining)

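
The four linkage criteria differ only in how the distance between two clusters is measured. The sketch below spells this out for two small clusters; it is illustrative only, assumes Euclidean distance, and expresses Ward linkage as the increase in within-cluster sum of squares caused by a merge.

```python
import math

def centroid(cluster):
    """Mean point of a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def cluster_distance(cluster_a, cluster_b, linkage="ward"):
    """Distance between two clusters under the given linkage criterion."""
    if linkage == "ward":
        # Increase in within-cluster sum of squares if the clusters are merged
        ca, cb = centroid(cluster_a), centroid(cluster_b)
        na, nb = len(cluster_a), len(cluster_b)
        return na * nb / (na + nb) * sum((x - y) ** 2 for x, y in zip(ca, cb))
    pair_dists = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "complete":
        return max(pair_dists)
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)
    if linkage == "single":
        return min(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")

a = [[0.0, 0.0], [0.0, 2.0]]
b = [[3.0, 0.0], [3.0, 2.0]]
for name in ("single", "complete", "average", "ward"):
    print(name, cluster_distance(a, b, name))
```

Agglomerative clustering repeatedly merges the two clusters with the smallest such distance. Single linkage only needs one close pair of points to merge two groups, which is what makes it prone to chaining.
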
## Quality Assessment

### Internal Validation Metrics

#### Silhouette Score
- Range: [-1, 1]
- Above 0.7: Strong clustering
- 0.5 to 0.7: Reasonable clustering
- Below 0.5: Weak clustering

#### Calinski-Harabasz Index
- Higher values indicate better clustering
- Ratio of between-cluster to within-cluster variance

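
The variance ratio described above can be written out explicitly. With `n` samples and `k` clusters, the index is `(B / (k - 1)) / (W / (n - k))`, where `B` is the size-weighted squared distance of the cluster centroids from the overall centroid and `W` is the within-cluster sum of squares. The sketch below is a standalone illustration of that formula rather than code taken from the tool.

```python
def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    n = len(points)
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    k = len(clusters)
    if not 2 <= k < n:
        raise ValueError("need at least 2 clusters and fewer clusters than samples")

    def centroid(members):
        return [sum(col) / len(members) for col in zip(*members)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    if within == 0:
        return float("inf")  # perfectly compact clusters
    return (between / (k - 1)) / (within / (n - k))

points = [[0.0], [2.0], [10.0], [12.0]]
labels = [0, 0, 1, 1]
score = calinski_harabasz(points, labels)
print(score)
```

The index has no fixed scale, so it is mainly useful for comparing different clusterings of the same dataset.
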
#### Davies-Bouldin Index
- Lower values indicate better clustering
- Average similarity of each cluster to its most similar cluster

### External Validation

When ground truth labels are available:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Homogeneity and Completeness scores

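
As an example of external validation, the Adjusted Rand Index can be computed from pair counts in the contingency table of the two labelings. The sketch below follows the standard formula; the sample conditions in the usage lines are invented for illustration and are not taken from any dataset.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI: 1.0 for identical partitions, about 0.0 for random agreement."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples grouped together in both, in the truth, and in the prediction
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical known conditions versus cluster assignments for five samples
conditions = ["treated", "treated", "treated", "control", "control"]
clusters = [1, 1, 2, 0, 0]
print(adjusted_rand_index(conditions, clusters))
```

The index depends only on which samples are grouped together, so the label names themselves (strings or integers, in any order) do not matter.
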
## Biological Interpretation

### Cluster Characterization

#### Metabolic Pathway Analysis
- Identify enriched pathways per cluster
- Compare metabolic profiles between clusters
- Relate clusters to biological conditions

#### Sample Annotation
- Map clusters to experimental conditions
- Identify batch effects or confounders
- Validate with independent datasets

#### Feature Importance
- Determine reactions/metabolites driving clustering
- Analyze cluster centroids for biological insights
- Connect to known metabolic phenotypes

## Integration Workflow

### Upstream Data Sources

#### COBRAxy Tools
- [RAS Generator](ras-generator.md) - Cluster based on reaction activities
- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities
- [Flux Simulation](flux-simulation.md) - Cluster flux distributions

#### External Data
- Gene expression matrices
- Metabolomics datasets
- Clinical metadata

### Downstream Analysis

#### Supervised Learning
Use cluster labels for:
- Classification model training
- Biomarker discovery
- Outcome prediction

#### Differential Analysis
- Compare clusters with [MAREA](marea.md)
- Identify cluster-specific metabolic signatures
- Pathway enrichment analysis

### Typical Pipeline

```bash
# 1. Generate metabolic scores
ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv

# 2. Perform clustering analysis
marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \
  -sc true -k1 2 -k2 8 -el true -si true \
  -idop clusters/ -bc best_clusters.tsv

# 3. Analyze cluster differences
marea -td /opt/COBRAxy -input_data ras.tsv \
  -input_class best_clusters.tsv -comparison manyvsmany \
  -test ks -choice_map ENGRO2 -idop cluster_analysis/
```

## Tips and Best Practices

### Data Preparation
- **Normalization**: Always scale features for distance-based methods
- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features)
- **Missing Values**: Handle appropriately (imputation or removal)
- **Outliers**: Identify and consider removal for K-means

### Algorithm Selection
- **K-means**: Start here for most applications
- **DBSCAN**: Use when clusters have irregular shapes or noise is present
- **Hierarchical**: Choose for small datasets or when hierarchy matters

### Parameter Selection
- **Start Simple**: Begin with default parameters
- **Use Validation**: Always employ silhouette analysis
- **Cross-Validate**: Test stability across parameter ranges
- **Biological Validation**: Ensure clusters make biological sense

### Result Interpretation
- **Multiple Algorithms**: Compare results across methods
- **Stability Assessment**: Check clustering reproducibility
- **Biological Context**: Integrate with known sample characteristics
- **Statistical Testing**: Validate cluster differences formally

## Troubleshooting

### Common Issues

**Poor clustering quality**
- Check data scaling and normalization
- Assess feature selection and dimensionality
- Try different algorithms or parameters
- Evaluate data structure with PCA/t-SNE

**Algorithm doesn't converge**
- Increase iteration limits for K-means
- Adjust epsilon/min_samples for DBSCAN
- Check for numerical stability issues
- Verify input data format

**Memory or performance issues**
- Reduce dataset size or dimensionality
- Use sampling for large datasets
- Consider approximate algorithms
- Monitor system resources

### Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| "Convergence failed" | K-means iteration limit reached | Increase max iterations or check data |
| "No clusters found" | DBSCAN parameters too strict (all samples labeled as noise) | Increase eps or reduce min_samples |
| "Memory allocation error" | Dataset too large | Reduce size or increase memory |
| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm |

### Performance Optimization

**Large Datasets**
- Use mini-batch K-means for speed
- Sample data for parameter optimization
- Employ dimensionality reduction
- Consider distributed computing

**High-Dimensional Data**
- Apply feature selection
- Use PCA preprocessing
- Consider specialized algorithms
- Validate results carefully

## Advanced Usage

### Custom Distance Metrics

For specialized applications, modify distance calculations. The function below is a placeholder: it computes a weighted Euclidean distance, where the optional per-reaction `weights` stand in for whatever pathway-aware weighting the application requires.

```python
import numpy as np

# Custom distance function for metabolic data (illustrative placeholder)
def metabolic_distance(x, y, weights=None):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    w = np.ones_like(diff) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * diff ** 2)))
```

### Ensemble Clustering

Combine multiple clustering results:

```bash
# Run multiple algorithms and combine
for method in kmeans dbscan hierarchy; do
  marea_cluster -td /opt/COBRAxy -cy "$method" -in data.tsv -idop "${method}_results/"
done

# Consensus clustering (requires custom script)
python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv
```

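
The consensus step is left to a custom script. One common approach, sketched here under stated assumptions, is a co-association matrix: for every pair of samples, count the fraction of clusterings that place them in the same cluster, then group samples whose agreement exceeds a threshold. The function names and the 0.5 threshold are illustrative choices, and DBSCAN noise labels are not given any special treatment in this sketch.

```python
def co_association(labelings):
    """Fraction of labelings in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("all labelings must cover the same samples")
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            together = sum(1 for labels in labelings if labels[i] == labels[j])
            matrix[i][j] = together / len(labelings)
    return matrix

def consensus_labels(labelings, threshold=0.5):
    """Group samples connected by co-association above the threshold."""
    co = co_association(labelings)
    n = len(co)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        # Flood-fill one connected component of the thresholded graph
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Three clusterings of the same four samples (cluster ids need not match)
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
print(consensus_labels(runs))
```

Each inner list would come from the `Cluster` column of one algorithm's result file, with samples in the same order in every list.
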
### Interactive Analysis

Generate interactive plots for exploration. The input file contains no `PC1`/`PC2` columns, so the example first projects the features onto two principal components:

```python
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

# Load clustering results and the input matrix, indexed by sample name
results = pd.read_csv('best_clusters.tsv', sep='\t', index_col=0)
data = pd.read_csv('metabolic_data.tsv', sep='\t', index_col=0).fillna(0)

# Project the feature matrix onto its first two principal components
coords = pd.DataFrame(PCA(n_components=2).fit_transform(data),
                      columns=['PC1', 'PC2'], index=data.index)
# Align cluster labels by sample name; strings give a discrete color scale
coords['Cluster'] = results['Cluster'].reindex(coords.index).astype(str)

# Interactive scatter plot
fig = px.scatter(coords, x='PC1', y='PC2', color='Cluster')
fig.show()
```

## See Also

- [MAREA](marea.md) - Statistical analysis of cluster differences
- [RAS Generator](ras-generator.md) - Generate clustering input data
- [Flux Simulation](flux-simulation.md) - Alternative clustering data source
- [Clustering Tutorial](../tutorials/clustering-analysis.md)
- [Validation Methods Reference](../tutorials/cluster-validation.md)