annotate COBRAxy/docs/tools/marea-cluster.md @ 509:5956dcf94277 draft default tip

Uploaded
author francesco_lapi
date Wed, 01 Oct 2025 15:34:21 +0000
parents 4ed95023af20
children
# MAREA Cluster

Perform clustering analysis on metabolic data to identify sample groups and patterns.

## Overview

MAREA Cluster performs unsupervised clustering analysis on RAS, RPS, or flux data to identify natural groupings among samples. It supports multiple clustering algorithms (K-means, DBSCAN, Hierarchical) with optional data scaling and validation metrics including elbow plots and silhouette analysis.

## Usage

### Command Line

```bash
marea_cluster -td /path/to/COBRAxy \
  -in metabolic_data.tsv \
  -cy kmeans \
  -sc true \
  -k1 2 \
  -k2 8 \
  -el true \
  -si true \
  -idop clustering_results/ \
  -ol cluster.log
```

### Galaxy Interface

Select "MAREA Cluster" from the COBRAxy tool suite and configure clustering parameters through the web interface.

## Parameters

### Required Parameters

| Parameter | Flag | Description |
|-----------|------|-------------|
| Tool Directory | `-td, --tool_dir` | Path to COBRAxy installation directory |
| Input Data | `-in, --input` | Metabolic data file (TSV format) |

### Clustering Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Cluster Type | `-cy, --cluster_type` | Clustering algorithm | kmeans |
| Data Scaling | `-sc, --scaling` | Apply data normalization | true |
| Minimum K | `-k1, --k_min` | Minimum number of clusters | 2 |
| Maximum K | `-k2, --k_max` | Maximum number of clusters | 7 |

### Analysis Options

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Elbow Plot | `-el, --elbow` | Generate elbow plot for K-means | false |
| Silhouette Analysis | `-si, --silhouette` | Generate silhouette plots | false |

### DBSCAN Specific Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Min Samples | `-ms, --min_samples` | Minimum number of samples in a neighborhood for a point to count as a core point | - |
| Epsilon | `-ep, --eps` | Maximum distance between two samples for them to be considered neighbors | - |

### Output Parameters

| Parameter | Flag | Description | Default |
|-----------|------|-------------|---------|
| Output Path | `-idop, --output_path` | Results directory | clustering/ |
| Output Log | `-ol, --out_log` | Log file path | - |
| Best Cluster | `-bc, --best_cluster` | Best clustering result file | - |

## Clustering Algorithms

### K-means
**Method**: Partitional clustering using centroids
- Assumes spherical clusters
- Requires pre-specified number of clusters (k)
- Fast and scalable
- Works well with normalized data

**Best for**:
- Well-separated, compact clusters
- Large datasets
- When cluster number is approximately known

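The centroid-based procedure described above can be illustrated with a small pure-Python sketch of Lloyd's algorithm. This is not MAREA Cluster's implementation; the function name `kmeans`, the toy points, and the choice of passing initial centroids explicitly (to keep the run deterministic) are assumptions made for illustration only.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two compact, well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Because k must be fixed before the loop starts, the sketch also shows why the tool scans a range of k values (`-k1` to `-k2`) rather than a single one.
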
### DBSCAN
**Method**: Density-based clustering
- Identifies clusters of varying shapes
- Automatically determines cluster number
- Robust to outliers and noise
- Requires epsilon and min_samples parameters

**Best for**:
- Irregular cluster shapes
- Datasets with noise/outliers
- Unknown number of clusters

### Hierarchical
**Method**: Agglomerative clustering with dendrograms
- Creates tree-like cluster hierarchy
- No need to specify cluster number initially
- Deterministic results
- Provides multiple resolution levels

**Best for**:
- Small to medium datasets
- When cluster hierarchy is important
- Exploratory analysis

## Input Format

### Metabolic Data File

Tab-separated format with samples as rows and reactions/metabolites as columns:

```
Sample    R00001  R00002  R00003  R00004  ...
Sample1   1.25    0.85    1.42    0.78    ...
Sample2   0.65    1.35    0.72    1.28    ...
Sample3   2.15    2.05    0.45    0.52    ...
Control1  1.05    0.98    1.15    1.08    ...
Control2  0.95    1.12    0.88    0.92    ...
```

**Requirements**:
- First column: sample identifiers
- Subsequent columns: feature values (RAS, RPS, fluxes)
- Missing values: use 0 or leave empty
- Numeric data only (excluding sample names)

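Before running the tool, it can help to check that a file meets the requirements above. The following is a minimal sketch using only the Python standard library; `load_matrix` is a hypothetical helper, and treating empty cells as 0 is an assumption based on the requirement list, not a verified description of how MAREA Cluster parses its input.

```python
import csv
import io

def load_matrix(handle):
    """Read a sample-by-feature TSV; return (sample_ids, feature_names, rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    features = header[1:]
    sample_ids, rows = [], []
    for line_no, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} columns, got {len(record)}"
            )
        sample_ids.append(record[0])
        # Empty cells become 0.0 (assumption); non-numeric cells raise ValueError.
        rows.append([float(cell) if cell.strip() else 0.0 for cell in record[1:]])
    return sample_ids, features, rows

# In-memory example with one empty cell in the second data row.
example = io.StringIO(
    "Sample\tR00001\tR00002\nSample1\t1.25\t0.85\nSample2\t\t1.35\n"
)
sample_ids, features, rows = load_matrix(example)
print(sample_ids, features, rows)
```

For a real file, pass an open handle instead, for example `open("metabolic_data.tsv", newline="")`.
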
## Data Preprocessing

### Scaling Options

#### Standard Scaling (Recommended)
- Mean centering and unit variance scaling
- Formula: `(x - mean) / std`
- Ensures equal feature contribution
- Required for distance-based algorithms

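The formula `(x - mean) / std` is applied per feature (column). A minimal sketch follows, assuming the population standard deviation is used (as scikit-learn's `StandardScaler` does); `standard_scale` is a hypothetical helper, and mapping constant columns to 0 is a choice made here to avoid division by zero.

```python
from statistics import fmean, pstdev

def standard_scale(rows):
    """Scale each column to zero mean and unit variance: (x - mean) / std."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std (assumption)
    return [
        # Constant columns (std == 0) are mapped to 0.0 instead of dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Two features on very different scales end up with identical scaled values.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standard_scale(rows)
for row in scaled:
    print([round(v, 4) for v in row])
```

After scaling, both columns contribute equally to Euclidean distances, which is the point of the "equal feature contribution" item above.
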
#### No Scaling
- Use original data values
- May be appropriate for already normalized data
- Risk of feature dominance by high-magnitude variables

### Feature Selection

Consider preprocessing steps:
- Remove low-variance features
- Apply dimensionality reduction (PCA)
- Select most variable reactions/metabolites
- Handle missing data appropriately

## Output Files

### Cluster Assignments

#### Best Clustering Result (`best_clusters.tsv`)
```
Sample    Cluster  Silhouette_Score
Sample1   1        0.73
Sample2   1        0.68
Sample3   2        0.81
Control1  0        0.59
Control2  0        0.62
```

#### All K Results (`clustering_results_k{n}.tsv`)
Individual files for each tested cluster number.

### Validation Metrics

#### Elbow Plot (`elbow_plot.png`)
- X-axis: Number of clusters (k)
- Y-axis: Within-cluster sum of squares (WCSS)
- Identifies optimal k at the "elbow" point

#### Silhouette Plots (`silhouette_k{n}.png`)
- Individual sample silhouette scores
- Average silhouette width per cluster
- Overall clustering quality assessment

### Summary Statistics

#### Clustering Summary (`clustering_summary.txt`)
```
Algorithm: kmeans
Scaling: true
Optimal K: 3
Best Silhouette Score: 0.72
Number of Samples: 20
Feature Dimensions: 150
```

#### Cluster Characteristics (`cluster_stats.tsv`)
```
Cluster  Size  Centroid_R00001  Centroid_R00002  Avg_Silhouette
0        8     0.95             1.12             0.68
1        7     1.35             0.82             0.74
2        5     0.65             1.55             0.69
```

## Examples

### Basic K-means Clustering

```bash
# Simple K-means with elbow analysis
marea_cluster -td /opt/COBRAxy \
  -in ras_data.tsv \
  -cy kmeans \
  -sc true \
  -k1 2 \
  -k2 10 \
  -el true \
  -si true \
  -idop kmeans_results/ \
  -ol kmeans.log
```

### DBSCAN Analysis

```bash
# Density-based clustering with custom parameters
marea_cluster -td /opt/COBRAxy \
  -in flux_samples.tsv \
  -cy dbscan \
  -sc true \
  -ms 5 \
  -ep 0.5 \
  -idop dbscan_results/ \
  -bc best_dbscan_clusters.tsv \
  -ol dbscan.log
```

### Hierarchical Clustering

```bash
# Hierarchical clustering for small dataset
marea_cluster -td /opt/COBRAxy \
  -in rps_scores.tsv \
  -cy hierarchy \
  -sc true \
  -k1 2 \
  -k2 6 \
  -si true \
  -idop hierarchical_results/ \
  -ol hierarchy.log
```

### Comprehensive Clustering Analysis

```bash
# Compare multiple algorithms
algorithms=("kmeans" "dbscan" "hierarchy")
for alg in "${algorithms[@]}"; do
  marea_cluster -td /opt/COBRAxy \
    -in metabolomics_data.tsv \
    -cy "$alg" \
    -sc true \
    -k1 2 \
    -k2 8 \
    -el true \
    -si true \
    -idop "${alg}_clustering/" \
    -ol "${alg}_cluster.log"
done
```

## Parameter Optimization

### K-means Optimization

#### Elbow Method
1. Run K-means for each k from k_min to k_max
2. Plot WCSS vs k
3. Identify the "elbow" point where improvement diminishes
4. Select k at the elbow as optimal

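The steps above can be sketched in a few lines. `wcss` computes the quantity plotted on the Y-axis, and `elbow_k` picks the elbow with a simple "largest second difference" heuristic over consecutive k values. Both function names, the heuristic, and the WCSS values in `curve` are illustrative assumptions; the tool itself produces a plot and the elbow is usually judged by eye.

```python
import math

def wcss(points, labels):
    """Within-cluster sum of squares: squared distance of each point to its cluster mean."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, point_lab in zip(points, labels) if point_lab == lab]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    return total

def elbow_k(wcss_by_k):
    """Return the k where the WCSS curve bends most (largest second difference)."""
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        # Improvement gained by reaching k minus improvement gained by going past k.
        bend = (wcss_by_k[prev] - wcss_by_k[k]) - (wcss_by_k[k] - wcss_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up WCSS values for k = 2..6, for illustration only.
curve = {2: 120.0, 3: 45.0, 4: 38.0, 5: 34.0, 6: 32.0}
print(elbow_k(curve))
```

Note that the heuristic can only select an interior k, since the first and last values have no neighbor on one side.
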
#### Silhouette Analysis
1. Compute silhouette scores for each k
2. Select k with highest average silhouette score
3. Validate with silhouette plots
4. Ensure clusters are well-separated

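For reference, the silhouette of one sample is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster. The sketch below is a plain-Python illustration with Euclidean distance; `silhouette_scores` is a hypothetical helper, and scoring singleton clusters as 0 is a common convention assumed here.

```python
import math

def silhouette_scores(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b); needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters (assumption)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, so dividing by len(own) - 1 is correct).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

# One-dimensional toy data with two obvious groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
average = sum(scores) / len(scores)
print(round(average, 4))
```

Repeating this for the labels obtained at each k and keeping the k with the highest average corresponds to steps 1 and 2 above.
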
### DBSCAN Parameter Tuning

#### Epsilon (eps) Selection
- Use k-distance plot to identify knee point
- Start with eps = average distance to k-th nearest neighbor
- Adjust based on cluster quality metrics

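A k-distance curve is the sorted list of each sample's distance to its k-th nearest neighbor. The sketch below is a brute-force illustration, suitable only for small inputs; `k_distances` is a hypothetical helper and the toy points (a tight square plus one outlier) are made up.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Four points on a unit square plus one distant outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
curve = k_distances(points, k=2)
print([round(d, 3) for d in curve])

# Starting value suggested above: the average k-th neighbor distance.
eps_start = sum(curve) / len(curve)
print(round(eps_start, 3))
```

The sharp jump at the end of the curve is the knee: an `-ep` value just below it keeps the dense points together and leaves the outlier as noise. The average is pulled upward by outliers, so it serves as a starting point rather than a final choice.
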
#### Min Samples Selection
- Rule of thumb: min_samples ≥ dimensionality + 1
- Higher values create denser clusters
- Lower values may increase noise sensitivity

### Hierarchical Clustering

#### Linkage Method
- Ward: Minimizes within-cluster variance
- Complete: Maximum distance between points of two clusters
- Average: Mean pairwise distance between points of two clusters
- Single: Minimum distance between points of two clusters (prone to chaining)

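The three distance-based linkage criteria differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, with `linkage_distance` as a hypothetical helper; Ward linkage is left out because it is defined by the increase in within-cluster variance after a merge rather than by a summary of pairwise distances.

```python
import math
from itertools import product
from statistics import fmean

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under single, complete, or average linkage."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)  # closest pair
    if method == "complete":
        return max(pairwise)  # farthest pair
    if method == "average":
        return fmean(pairwise)  # mean over all pairs
    raise ValueError(f"unsupported method: {method}")

# Two toy clusters on a line.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(a, b, method))
```

Agglomerative clustering repeatedly merges the pair of clusters with the smallest such distance, so the choice of linkage determines which merges happen first.
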
## Quality Assessment

### Internal Validation Metrics

#### Silhouette Score
- Range: [-1, 1]
- >0.7: Strong clustering
- 0.5-0.7: Reasonable clustering
- <0.5: Weak clustering

#### Calinski-Harabasz Index
- Higher values indicate better clustering
- Ratio of between-cluster to within-cluster variance

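As an illustration of the ratio described above, the index can be written as `(B / (k - 1)) / (W / (n - k))`, where `B` is the between-cluster dispersion, `W` the within-cluster dispersion, `n` the number of samples, and `k` the number of clusters. The sketch assumes Euclidean distance; `calinski_harabasz` is a hypothetical helper, and the documentation does not state whether the tool reports this index.

```python
import math

def calinski_harabasz(points, labels):
    """(B / (k - 1)) / (W / (n - k)); requires 1 < k < n and W > 0."""
    n, k = len(points), len(set(labels))
    overall = [sum(col) / n for col in zip(*points)]
    between = within = 0.0
    for lab in set(labels):
        members = [p for p, point_lab in zip(points, labels) if point_lab == lab]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        # Between: cluster size times squared distance of its centroid to the global mean.
        between += len(members) * math.dist(centroid, overall) ** 2
        # Within: squared distances of the members to their own centroid.
        within += sum(math.dist(p, centroid) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))

points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(points, labels)
print(ch)
```
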
#### Davies-Bouldin Index
- Lower values indicate better clustering
- Average similarity between each cluster and its most similar cluster

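A minimal sketch of the index, assuming Euclidean distance: for each cluster, take the worst-case ratio `(s_i + s_j) / d(c_i, c_j)` against every other cluster, where `s` is the mean distance of members to their centroid and `d` is the distance between centroids, then average these ratios. `davies_bouldin` is a hypothetical helper, not part of the tool.

```python
import math

def davies_bouldin(points, labels):
    """Mean over clusters of the highest (s_i + s_j) / d(c_i, c_j) ratio."""
    centroids, scatter = {}, {}
    for lab in set(labels):
        members = [p for p, point_lab in zip(points, labels) if point_lab == lab]
        centroids[lab] = [sum(col) / len(members) for col in zip(*members)]
        # Scatter: mean distance of the members to their centroid.
        scatter[lab] = sum(math.dist(p, centroids[lab]) for p in members) / len(members)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in centroids
            if j != i
        )
        for i in centroids
    ]
    return sum(worst) / len(worst)

points = [(0.0,), (2.0,), (10.0,), (12.0,)]
labels = [0, 0, 1, 1]
db = davies_bouldin(points, labels)
print(db)
```

Tight clusters (small scatter) that sit far apart (large centroid distance) give small ratios, which is why lower values indicate better clustering.
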
### External Validation

When ground truth labels are available:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- Homogeneity and Completeness scores

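As an example of external validation, the Adjusted Rand Index can be computed from pair counts alone. The sketch below uses only the standard library; `adjusted_rand_index` is a hypothetical helper and the labels are made up. It only shows how cluster assignments could be compared with known sample groups after a run and does not describe a feature of the tool.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, predicted):
    """ARI: 1.0 for identical partitions, about 0 for random agreement."""
    n = len(truth)
    # Pairs of samples grouped together in both labelings, in truth, and in predicted.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = pairs_truth * pairs_pred / comb(n, 2)
    maximum = (pairs_truth + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (pairs_joint - expected) / (maximum - expected)

truth = ["control", "control", "case", "case"]
perfect = adjusted_rand_index(truth, [0, 0, 1, 1])
crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
print(perfect, crossed)
```

The index ignores the label names themselves, so cluster `0` does not need to correspond to any particular group name.
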
## Biological Interpretation

### Cluster Characterization

#### Metabolic Pathway Analysis
- Identify enriched pathways per cluster
- Compare metabolic profiles between clusters
- Relate clusters to biological conditions

#### Sample Annotation
- Map clusters to experimental conditions
- Identify batch effects or confounders
- Validate with independent datasets

#### Feature Importance
- Determine reactions/metabolites driving clustering
- Analyze cluster centroids for biological insights
- Connect to known metabolic phenotypes

## Integration Workflow

### Upstream Data Sources

#### COBRAxy Tools
- [RAS Generator](ras-generator.md) - Cluster based on reaction activities
- [RPS Generator](rps-generator.md) - Cluster based on reaction propensities
- [Flux Simulation](flux-simulation.md) - Cluster flux distributions

#### External Data
- Gene expression matrices
- Metabolomics datasets
- Clinical metadata

### Downstream Analysis

#### Supervised Learning
Use cluster labels for:
- Classification model training
- Biomarker discovery
- Outcome prediction

#### Differential Analysis
- Compare clusters with [MAREA](marea.md)
- Identify cluster-specific metabolic signatures
- Pathway enrichment analysis

### Typical Pipeline

```bash
# 1. Generate metabolic scores
ras_generator -td /opt/COBRAxy -in expression.tsv -ra ras.tsv

# 2. Perform clustering analysis
marea_cluster -td /opt/COBRAxy -in ras.tsv -cy kmeans \
  -sc true -k1 2 -k2 8 -el true -si true \
  -idop clusters/ -bc best_clusters.tsv

# 3. Analyze cluster differences
marea -td /opt/COBRAxy -input_data ras.tsv \
  -input_class best_clusters.tsv -comparison manyvsmany \
  -test ks -choice_map ENGRO2 -idop cluster_analysis/
```

## Tips and Best Practices

### Data Preparation
- **Normalization**: Always scale features for distance-based methods
- **Dimensionality**: Consider PCA for high-dimensional data (>1000 features)
- **Missing Values**: Handle appropriately (imputation or removal)
- **Outliers**: Identify and consider removal for K-means

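Scaling matters because distance-based methods implicitly weight each feature by its numeric range. The sketch below shows z-score standardization per feature using only the standard library; it illustrates the idea and is not a claim about the exact transformation applied by `-sc true`.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each feature (column) to mean 0 and unit variance.

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    scaled_columns = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        scaled_columns.append(
            [(v - mu) / sigma if sigma > 0 else 0.0 for v in col]
        )
    # Transpose back to samples-by-features
    return [list(row) for row in zip(*scaled_columns)]

# Two features on very different scales (hypothetical values)
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = zscore_columns(rows)
print(scaled)
```

After scaling, both features contribute equally to Euclidean distances even though the raw values differ by three orders of magnitude.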
### Algorithm Selection
- **K-means**: Start here for most applications
- **DBSCAN**: Use when clusters have irregular shapes or the data contain noise
- **Hierarchical**: Choose for small datasets or when hierarchy matters

### Parameter Selection
- **Start Simple**: Begin with default parameters
- **Use Validation**: Always employ silhouette analysis
- **Cross-Validate**: Test stability across parameter ranges
- **Biological Validation**: Ensure clusters make biological sense

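For each sample, the silhouette coefficient compares the mean distance to its own cluster, a, with the mean distance to the nearest other cluster, b, as (b - a) / max(a, b). The following is a minimal pure-Python sketch with Euclidean distance, meant to show what the metric measures; it is quadratic in the number of samples and is not the tool's implementation.

```python
from math import dist
from statistics import mean

def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance.

    Requires at least two clusters; samples in singleton clusters score 0.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(mean(dist(point, other) for other in members)
                for other_label, members in clusters.items()
                if other_label != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

# Two well-separated groups (hypothetical data)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned samples. The score is undefined when only one cluster is found, which is why the function raises in that case.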
### Result Interpretation
- **Multiple Algorithms**: Compare results across methods
- **Stability Assessment**: Check clustering reproducibility
- **Biological Context**: Integrate with known sample characteristics
- **Statistical Testing**: Validate cluster differences formally

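To compare two clusterings of the same samples (two algorithms, or two runs of one algorithm), one option is the Rand index: the fraction of sample pairs on which both clusterings agree about "same cluster" versus "different cluster". The sketch below is pure Python with hypothetical label lists; in practice the adjusted Rand index, which corrects for chance agreement, is often preferred.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical labels from three methods on the same six samples
kmeans_labels = [0, 0, 1, 1, 2, 2]
hier_labels = [1, 1, 0, 0, 2, 2]   # same partition, different label names
dbscan_labels = [0, 0, 0, 1, 1, 1]

print(rand_index(kmeans_labels, hier_labels))
print(rand_index(kmeans_labels, dbscan_labels))
```

Because only pair membership is compared, the index is unaffected by how each method happens to number its clusters.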
## Troubleshooting

### Common Issues

**Poor clustering quality**
- Check data scaling and normalization
- Assess feature selection and dimensionality
- Try different algorithms or parameters
- Evaluate data structure with PCA/t-SNE

**Algorithm doesn't converge**
- Increase iteration limits for K-means
- Adjust epsilon/min_samples for DBSCAN
- Check for numerical stability issues
- Verify input data format

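A common heuristic for choosing DBSCAN's epsilon is the k-distance curve: compute each sample's distance to its k-th nearest neighbor (k is often tied to min_samples), sort the values, and look for the point where they rise sharply. The sketch below computes that curve in pure Python on hypothetical data; reading off the knee is left to a plot or to inspection.

```python
from math import dist

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, point in enumerate(points):
        neighbor_distances = sorted(
            dist(point, other) for j, other in enumerate(points) if j != i
        )
        result.append(neighbor_distances[k - 1])
    return sorted(result)

# Two tight groups plus one isolated point (hypothetical data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
curve = k_distances(points, k=2)
print(curve)
```

In this toy example the curve stays flat for the six grouped points and jumps for the isolated one, so an epsilon just above the flat part would be a reasonable starting value.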
**Memory or performance issues**
- Reduce dataset size or dimensionality
- Use sampling for large datasets
- Consider approximate algorithms
- Monitor system resources

### Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| "Convergence failed" | K-means reached its iteration limit | Increase max iterations or check data |
| "No clusters found" | DBSCAN parameters too strict | Increase eps or reduce min_samples |
| "Memory allocation error" | Dataset too large | Reduce size or increase memory |
| "Invalid silhouette score" | Single cluster found | Adjust parameters or algorithm |

### Performance Optimization

**Large Datasets**
- Use mini-batch K-means for speed
- Sample data for parameter optimization
- Employ dimensionality reduction
- Consider distributed computing

**High-Dimensional Data**
- Apply feature selection
- Use PCA preprocessing
- Consider specialized algorithms
- Validate results carefully

## Advanced Usage

### Custom Distance Metrics

For specialized applications, modify distance calculations:

```python
import numpy as np

# Custom distance function for metabolic data
def metabolic_distance(x, y):
    # Placeholder: plain Manhattan distance.
    # Replace with a pathway-aware metric as needed.
    return float(np.abs(np.asarray(x) - np.asarray(y)).sum())
```

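This page does not state how a custom metric is passed to `marea_cluster`, so the sketch below takes a tool-independent route: precompute a full distance matrix, a form that several clustering libraries accept as input. The helper names and feature weights are hypothetical, and the weighted Manhattan distance stands in for whatever pathway-aware metric is appropriate.

```python
def pairwise_distances(rows, metric):
    """Symmetric distance matrix for a list of samples under `metric`."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(rows[i], rows[j])
    return matrix

def weighted_manhattan(weights):
    """Build a metric that weights each feature, e.g. by pathway relevance."""
    def metric(x, y):
        return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))
    return metric

# Hypothetical weights: the first reaction counts twice as much
metric = weighted_manhattan([2.0, 1.0, 1.0])
rows = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
matrix = pairwise_distances(rows, metric)
for line in matrix:
    print(line)
```

Note that K-means relies on Euclidean geometry to compute centroids, so a precomputed matrix like this is better suited to hierarchical or density-based methods.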
### Ensemble Clustering

Combine multiple clustering results:

```bash
# Run multiple algorithms and combine
for method in kmeans dbscan hierarchy; do
  marea_cluster -cy "$method" -in data.tsv -idop "${method}_results/"
done

# Consensus clustering (consensus_clustering.py is a custom script you must supply)
python consensus_clustering.py -i *_results/best_clusters.tsv -o consensus.tsv
```

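The consensus step is not part of COBRAxy, so here is one possible shape for such a script's core logic, based on co-association: for each pair of samples, compute the fraction of clusterings that place them together, then group samples linked by pairs above a threshold. The function names, the 0.5 threshold, and the example labelings are all assumptions for illustration.

```python
from itertools import combinations

def co_association(labelings):
    """Fraction of clusterings in which each sample pair shares a cluster."""
    n = len(labelings[0])
    if any(len(labels) != n for labels in labelings):
        raise ValueError("all labelings must cover the same samples")
    return {
        (i, j): sum(labels[i] == labels[j] for labels in labelings) / len(labelings)
        for i, j in combinations(range(n), 2)
    }

def consensus_labels(labelings, threshold=0.5):
    """Link pairs whose co-association exceeds threshold; label the components."""
    n = len(labelings[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), score in co_association(labelings).items():
        if score > threshold:
            parent[find(i)] = find(j)

    # Number the connected components in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Three hypothetical labelings of the same five samples
labelings = [[0, 0, 1, 1, 1], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1]]
consensus = consensus_labels(labelings)
print(consensus)
```

DBSCAN noise points usually share a single label such as -1; they should be excluded or given unique labels before this step, otherwise they are treated as one cluster.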
### Interactive Analysis

Generate interactive plots for exploration:

```python
import pandas as pd
import plotly.express as px
from sklearn.decomposition import PCA

# Load clustering results and the data that was clustered
results = pd.read_csv('best_clusters.tsv', sep='\t')
data = pd.read_csv('metabolic_data.tsv', sep='\t', index_col=0)

# Project onto the first two principal components
# (assumes samples in rows, numeric values, no missing data)
pcs = PCA(n_components=2).fit_transform(data)
coords = pd.DataFrame(pcs, columns=['PC1', 'PC2'])

# Interactive scatter plot colored by cluster label
# (assumes a 'Cluster' column in the same sample order as the data)
fig = px.scatter(coords, x='PC1', y='PC2', color=results['Cluster'].astype(str))
fig.show()
```

## See Also

- [MAREA](marea.md) - Statistical analysis of cluster differences
- [RAS Generator](ras-generator.md) - Generate clustering input data
- [Flux Simulation](flux-simulation.md) - Alternative clustering data source
- [Clustering Tutorial](../tutorials/clustering-analysis.md)
- [Validation Methods Reference](../tutorials/cluster-validation.md)