comparison maaslin-4450aa4ecc84/README.md @ 1:a87d5a5f2776

Uploaded the version running on the prod server
author george-weingart
date Sun, 08 Feb 2015 23:08:38 -0500
0:e0b5980139d9 1:a87d5a5f2776
1 MaAsLin User Guide v3.1
2 =======================
3
4 September 2013 - Updated April 2014 for Galaxy
5
6 Timothy Tickle and Curtis Huttenhower
7
8 Table of Contents
9 -----------------
10
11 A. Introduction to MaAsLin
12 B. Related Projects and Scripts
13 C. Installing MaAsLin
14 D. MaAsLin Inputs
15 E. Process Flow Overview
16 F. Process Flow Detail
17 G. Expected Output Files
18 H. Other Analysis Flows
19 I. Troubleshooting
20 J. Installation as an Automated Pipeline
21 K. Commandline Options (Modifying Process and Figures)
22 # A. Introduction to MaAsLin
23
24 MaAsLin is a multivariate statistical framework that finds
25 associations between clinical metadata and potentially
26 high-dimensional experimental data. MaAsLin performs boosted additive
27 general linear models between one group of data (metadata/the
28 predictors) and another group (in our case relative taxonomic
29 abundances/the response). In our context we use it to discover
30 associations between clinical metadata and microbial community
31 relative abundance or function; however, it is applicable to other
32 data types.
33
34 Metagenomic data are sparse, and boosting is used to select metadata
35 that show some potential to be useful in a linear model between the
36 metadata and abundances. In the context of metadata and community
37 abundance, a sample's metadata is boosted for one Operational
38 Taxonomic Unit (OTU) (Yi). The metadata that are selected by boosting
39 are then used in a general linear model, with each combination of
40 metadata (as predictors) and OTU abundance (as response
41 variables). This occurs for every OTU and metadata combination. Given
42 we work with proportional data, the Yi (abundances) are
43 `arcsin(sqrt(Yi))` transformed. A final formula is as follows:
44
45 ![](https://bitbucket.org/biobakery/maaslin/downloads/maaslinformula2.png)
46
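As a small illustration of the transform mentioned above, the following Python sketch applies `arcsin(sqrt(Yi))` to a few relative abundances. It is not part of MaAsLin (which is written in R); the function name `arcsin_sqrt` and the sample values are assumptions made for this example.

```python
import math


def arcsin_sqrt(proportion: float) -> float:
    """Arcsine square-root transform of a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError(f"expected a proportion in [0, 1], got {proportion}")
    return math.asin(math.sqrt(proportion))


# Hypothetical relative abundances of one taxon across four samples.
abundances = [0.022, 0.014, 0.333, 0.125]
transformed = [arcsin_sqrt(y) for y in abundances]
print(transformed)
```

The transform maps 0 to 0 and 1 to pi/2, so it is only meaningful for abundances that have already been normalized to proportions.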
47 For more information about MaAsLin, please visit
48 [http://huttenhower.sph.harvard.edu/maaslin](http://huttenhower.sph.harvard.edu/maaslin).
49
50
51 # B. Related Projects and Scripts
52
53 Other projects hosted on Bitbucket (bitbucket.org) may help in your
54 analysis:
55
56 * **QiimeToMaAsLin** is a project that reformats abundance files from
57 Qiime for MaAsLin. Several formats of Qiime consensus lineages are
58 supported for this project. To download please visit
59 [https://bitbucket.org/timothyltickle/qiimetomaaslin](https://bitbucket.org/timothyltickle/qiimetomaaslin).
60
61 * **merge_metadata.py** is a script included in the MaAsLin project to
62 generically merge a metadata file with a table of microbial (or
63 other) measurements. This script is located in `maaslin/src` and
64 is documented in `maaslin/doc/Merge_Metadata_Read_Me.txt`.
65
66
67 # C. Installing MaAsLin
68
69 R Libraries: Several libraries need to be installed in R. They are
70 the following:
71
72 * agricolae, gam, gamlss, gbm, glmnet, inlinedocs, logging, MASS, nlme, optparse, outliers, penalized, pscl, robustbase, testthat, vegan
73
74 You can install them by typing R in a terminal and using the
75 install.packages command:
76
77 install.packages(c('agricolae', 'gam', 'gamlss', 'gbm', 'glmnet', 'inlinedocs', 'logging', 'MASS', 'nlme', 'optparse', 'outliers', 'penalized', 'pscl', 'robustbase', 'testthat', 'vegan'))
78
79 # D. MaAsLin Inputs
80
81 Each project has three input files: the "\*.read.config" file, the "\*.pcl" file, and the "\*.R" script. If you use the sfle automated pipeline, the "\*" in the file names can be anything but must be identical for all three files, and all three files must be placed in the `../sfle/input/maaslin/input` folder. Details of each file follow:
82
83 ### 1\. "\*.pcl"
84
85 Required input file. A PCL file is the file that contains all the data
86 and metadata. This file is formatted so that metadata/data (otus or
87 bugs) are rows and samples are columns. All metadata rows should come
88 first before any abundance data. The file should be a tab delimited
89 text file with the extension ".pcl".
90
91 ### 2\. "\*.read.config"
92
93 Required input file. A read config file indicates which data are read from a PCL file without having to change the pcl file or change code. This means a pcl file can be a superset of metadata and abundances that includes data you are not interested in for the run. This file is a text file with ".read.config" as an extension. It is described in detail in section **F. Process Flow Detail**, subsection **4. Create your read.config file**.
94
95 ### 3\. "\*.R"
96
97 Optional input file. The R script file uses a callback programming pattern that allows one to add or modify specific code to customize analysis without touching the main MaAsLin engine. A generic R script, "maaslin_demo2.R", is provided and can be renamed and used for any study. The R script can be modified to add quality control or formatting of data, add ecological measurements, or make other changes to the underlying data before MaAsLin runs on it. This file is not required to run MaAsLin.
98
99 # E. Process Flow Overview
100
101 1. Obtain your abundance or relative function table.
102 2. Obtain your metadata.
103 3. Format and combine your abundance table and metadata as a pcl file for MaAsLin.
104 4. Create your read.config file.
105 5. Create your R script or use the default.
106 6. Place .pcl, .read.config, .R files in `../sfle/input/maaslin/input/` (sfle only)
107 7. Run
108 8. Discover amazing associations in your results!
109
110 # F. Process Flow Detail
111
112 ### 1\. Obtain your abundance or relative function table.
113
114 Abundance tables are normally derived from sequence data using
115 *Mothur*, *Qiime*, *HUMAnN*, or *MetaPhlAn*. Please refer to their documentation
116 for further details.
117
118 ### 2\. Obtain your metadata.
119
120 Metadata are information about the samples in the study. For
121 instance, one may analyze a case / control study. In this study, you
122 may have a disease and healthy group (disease state), the sex of the
123 patients (patient demographics), medication use (chemical treatment),
124 smoking (patient lifestyle) or other types of data. All aforementioned
125 data would be study metadata. This section can have any type of data
126 (factor, ordered factor, continuous, integer, or logical
127 variables). If a metadata value is missing for a sample, please
128 write NA. It is preferable to write NA so that, when
129 looking at the data, it is understood the metadata is missing and its
130 absence is intentional and not a mistake. Often investigators are
131 interested in genetic measurements that may also be placed in the
132 metadata section to associate to bugs.
133
134 If you do not want to manually add metadata to your abundance
135 table, you may be interested in associated tools or scripts to help
136 combine your abundance table and metadata to create your pcl
137 file. Both require a specific format for your metadata file. Please
138 see the documentation for *QiimeToMaaslin* or *merge_metadata.py* (for
139 more details see section B).
140
141 ### 3\. Format and combine your abundance table and metadata as a pcl file for *MaAsLin*.
142
143
144 Please note two tools have been developed to help you! If you are
145 working from a Qiime OTU output and have a metadata text file try using
146 *QiimeToMaaslin* found at bitbucket. If you have a tab delimited file
147 which matches the below .pcl description (for instance MetaPhlAn
148 output) use the merge_metadata.py script provided in this project
149 (`maaslin/src/merge_metadata.py`) and documented in
150 `maaslin/doc/Merge_Metadata_Read_Me.txt`.
151
152 ### PCL format description:
153
154 i. Row 1 is expected to hold the sample IDs. Its first column entry is a feature name that identifies the row, for example "ID".
155
156 ii. Rows of metadata. Each row is one metadata, the first column entry
157 being the name of the metadata and each following column being the
158 metadata value for each sample.
159
160 iii. Rows of taxa/otu abundance. Each row is one taxon/otu, the first
161 column entry being the name of the taxon/otu followed by abundances of
162 the taxon/otu per sample.
163
164 iv. Abundances should be normalized by dividing each abundance measurement by the sum of the column (sample) abundances.
165
166 v. Here is an example of the contents of an extremely small pcl file;
167 another example can be found in this project at
168 `maaslin/input/maaslin_demo.pcl`.
169
170
171 ID Sample1 Sample2 Sample3 Sample4
172 metadata1 True True False False
173 metadata2 1.23 2.34 3.22 3.44
174 metadata3 Male Female Male Female
175 taxa1 0.022 0.014 0.333 0.125
176 taxa2 0.406 0.029 0.166 0.300
177 taxa3 0.571 0.955 0.500 0.575
178
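The normalization in item iv can be sketched in Python as follows. This is an illustrative helper, not a script shipped with MaAsLin: the function name `normalize_abundances`, the `n_metadata_rows` parameter, and the raw counts are all assumptions made for this example.

```python
import csv
import io


def normalize_abundances(rows, n_metadata_rows):
    """Divide each abundance by its sample (column) total.

    rows: the sample ID row, then n_metadata_rows metadata rows,
    then the abundance rows, each as a list of strings.
    """
    header = rows[0]
    metadata = rows[1:1 + n_metadata_rows]
    abundance = rows[1 + n_metadata_rows:]
    n_samples = len(header) - 1
    # One total per sample column, summed over the abundance rows only.
    totals = [sum(float(r[j + 1]) for r in abundance) for j in range(n_samples)]
    normalized = []
    for r in abundance:
        values = [float(v) / t if t > 0 else 0.0 for v, t in zip(r[1:], totals)]
        normalized.append([r[0]] + [f"{v:.3f}" for v in values])
    return [header] + metadata + normalized


# Hypothetical raw counts, tab-delimited as in a .pcl file.
raw_pcl = (
    "ID\tSample1\tSample2\n"
    "metadata1\tTrue\tFalse\n"
    "taxa1\t10\t30\n"
    "taxa2\t30\t30\n"
)
rows = list(csv.reader(io.StringIO(raw_pcl), delimiter="\t"))
result = normalize_abundances(rows, n_metadata_rows=1)
print("\n".join("\t".join(r) for r in result))
```

Metadata rows are passed through untouched, so the output keeps the metadata-first layout described above.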
179
180 ### 4\. Create your read.config file.
181
182 A *.read.config file is a structured text file used to indicate which
183 data in a *.pcl file should be read into MaAsLin and used for
184 analysis. This allows one to keep their *.pcl file intact while
185 varying analysis. Hopefully, this avoids errors that may be introduced
186 while manipulating the pcl files.
187
188 Here is an example of the contents of a *.read.config file.
189
190 Matrix: Metadata
191 Read_PCL_Columns: Sample2-Sample15
192 Read_PCL_Rows: Age-Height,Weight,Sex,Cohort-Profession
193
194 Matrix: Abundance
195 Read_PCL_Columns: Sample2-Sample15
196 Read_PCL_Rows: Bacteria-Bug100
197
198 The minimal requirement for a MaAsLin .read.config file is as
199 follows. The "Matrix:" should be specified. The matrix needs to be named
200 "Metadata" for the metadata section and "Abundance" for the abundance
201 section. "Read\_PCL\_Rows:" is used to indicate which rows are data or
202 metadata to be analyzed. Rows can be identified by their metadata/data
203 id. Separate ids by commas. If there is a consecutive group of
204 metadata/data, a range of rows can be defined by indicating the first
205 and last id separated by a "-". If the beginning or ending id is
206 missing around a "-", the rows are read from the beginning or to
207 the end of the pcl file, respectively.
208
209 A minimal example is shown here:
210
211 Matrix: Metadata
212 Read_PCL_Rows: -Weight
213
214 Matrix: Abundance
215 Read_PCL_Rows: Bacteria-
216
217 With this minimal example, the delimiter of the file is assumed to be
218 a tab, all columns are read (since they are not indicated
219 here). Metadata are read as all rows from the beginning of the pcl
220 file (skipping the first Sample ID row) to Weight; all data are read
221 as all rows from Bacteria to the end of the pcl file. This example
222 refers to the default input files given in the MaAsLin download as
223 maaslin_demo2.\*.
224
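To make the row-selection rules above concrete, here is a Python sketch of how a `Read_PCL_Rows:` value could be expanded into row ids. It mirrors the behaviour described in this section but is not MaAsLin's own parser; the function name `resolve_rows` and the row ids are hypothetical, and the sketch assumes that row ids contain no "-" or "," characters.

```python
def resolve_rows(spec, row_ids):
    """Expand a Read_PCL_Rows-style value into a list of row ids.

    row_ids lists the pcl rows in file order, without the sample ID row.
    """
    selected = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            first, last = item.split("-", 1)
            # A missing id means "from the beginning" or "to the end".
            start = row_ids.index(first) if first else 0
            stop = row_ids.index(last) if last else len(row_ids) - 1
            selected.extend(row_ids[start:stop + 1])
        else:
            # index() raises ValueError if the id is not in the file.
            selected.append(row_ids[row_ids.index(item)])
    return selected


# Hypothetical row ids of a small pcl file.
row_ids = ["Age", "Height", "Weight", "Bacteria", "Bug1", "Bug2"]
print(resolve_rows("-Weight", row_ids))
print(resolve_rows("Bacteria-", row_ids))
```

With these ids, `-Weight` selects every row up to and including Weight, and `Bacteria-` selects Bacteria and every row after it.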
225 ### 5\. Optionally, create your R script.
226
227 The R script is used to add code that manipulates your data before
228 analysis, and for manipulating the multifactorial analysis figure. A
229 default “*.R” script is available with the default MaAsLin project at
230 maaslin/input/maaslin_demo2.R. This is an expert option and should
231 only be used by someone very comfortable with the R language.
232
233 ### 6\. Optional step if using the sfle analysis pipeline. Place .pcl, .read.config, and optional .R files in `../sfle/input/maaslin/input`
234
235 ### 7\. Run.
236
237 When running the commandline script:
238 On the commandline, call the Maaslin.R script. Please refer to the help (-h, --help) for command line options. If running from the commandline, the PCL file needs to be transposed first. A script is included in MaAsLin for your convenience (src/transpose.py), and the example below includes such a call. An example call from the Maaslin folder for the demo data could be as follows.
239
240 ./src/transpose.py < input/maaslin_demo2.pcl > maaslin_demo2.tsv
241 ./src/Maaslin.R -i input/maaslin_demo2.read.config demo.text maaslin_demo2.tsv
242
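For readers who want to see what the transposition step amounts to, here is a minimal Python sketch that swaps the rows and columns of a tab-delimited table. It is not the bundled `src/transpose.py`, whose implementation may differ; the function name `transpose_table` and the sample table are assumptions made for this example.

```python
import csv
import io


def transpose_table(text):
    """Swap rows and columns of a tab-delimited table given as a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same number of columns")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(zip(*rows))  # zip(*rows) yields the columns as rows
    return out.getvalue()


# Hypothetical pcl content: samples as columns, features as rows.
pcl_text = "ID\tSample1\tSample2\nAge\t31\t45\ntaxa1\t0.2\t0.8\n"
print(transpose_table(pcl_text))
```

After transposition each sample is a row and each metadata or taxon is a column.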
243 When using sfle:
244 Go to ../sfle and type the following: scons output/maaslin
245
246 ### 8\. Discover amazing associations in your results!
247
248
249 # G. Expected Output Files
250
251 The following files will be generated per MaAsLin run. In the
252 following listing the term projectname refers to what you named your "\*.pcl" file without the extension.
253
254 ### Output files that are always created:
255
256 **projectname_log.txt**
257
258 This file contains the detail for the statistical engine. This can be
259 useful for detailed troubleshooting.
260
261 **projectname-metadata.txt**
262
263 Each metadata will have a file of associations. Any association
264 indicated to be performed after initial variable selection (boosting)
265 is recorded here. Included is the information from the final general
266 linear model (performed after the boosting) and the FDR corrected
267 p-value (q-value). Can be opened as a text file or spreadsheet.
268
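As background on what an FDR corrected p-value is, the Python sketch below computes q-values with the Benjamini-Hochberg procedure. The use of Benjamini-Hochberg is an assumption made for this example; this guide does not state which FDR correction MaAsLin applies, and the p-values are made up.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values (assumed method, see lead-in)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values for four associations.
pvalues = [0.01, 0.04, 0.03, 0.5]
qvalues = bh_qvalues(pvalues)
# Keep associations at or below the default plotting threshold of 0.25.
significant = [i for i, q in enumerate(qvalues) if q <= 0.25]
print(qvalues, significant)
```

The final filter uses the default 0.25 threshold that applies to the plotted associations.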
269 **projectname-metadata.pdf**
270
271 Any association that had a q-value less than or equal to the given
272 significance threshold will be plotted here (default is 0.25; can be
273 changed using the commandline argument -d). If this file does not
274 exist, the projectname-metadata.txt should not have an entry that is
275 less than or equal to the threshold. Factor data are plotted as
276 notched box plots; continuous data are plotted as a scatter plot with
277 a line of best fit. Two plots are given for MaAsLin methodology; the
278 left being a raw data plot, the right being a corresponding partial
279 residual plot.
280
281 **projectname.pdf**
282
283 Contains the biplot visualization. This visualization is presented as a build and can be affected by modifications in the R script or by using commandline options.
284
285 **projectname.txt**
286
287 A collection of all entries in the projectname-metadata.pdf. Can be
288 opened as a text file or spreadsheet.
289
290 ### Additional troubleshooting files when using the commandline:
291
292 **data.tsv**
293
294 The data matrix that was read in (transposed). Useful for making sure
295 the correct data was read in.
296
297 **data.read.config**
298
299 Can be used to read in the data.tsv.
300
301 **metadata.tsv**
302
303 The metadata that was read in (transposed). Useful for making sure the
304 correct metadata was read in.
305
306 **metadata.read.config**
307
308 Can be used to read in the metadata.tsv.
309
310 **read_merged.tsv**
311
312 The data and metadata merged (transposed). Useful for making sure the
313 merging occurred correctly.
314
315 **read_merged.read.config**
316
317 Can be used to read in the read_merged.tsv.
318
319 **read_cleaned.tsv**
320
321 The data read in, merged, and then cleaned. After this process the
322 data is written to this file for reference if needed.
323
324 **read_cleaned.read.config**
325
326 Can be used to read in read_cleaned.tsv.
327
328 **ProcessQC.txt**
329
330 Contains quality control for the MaAsLin analysis. This includes
331 information on the magnitude of outlier removal.
332
333 **Run_Parameters.txt**
334 Contains an account of all the options used when running MaAsLin so the exact methodology can be recreated if needed.
335
336 # H. Other Analysis Flows
337
338 ### 1\. All versus All
339 The all versus all analysis flow is a way of manipulating how metadata are used. In this method there is a group of metadata that are always evaluated, and another group whose members are added to the first group one at a time. To give a more concrete example: You may have metadata cage, diet, and treatment. You may always want to have the association of abundance evaluated controlling for cage but otherwise looking at the metadata one at a time. In this way the cage metadata is the "forced" part of the evaluation while the others are not forced and are evaluated in serial. The appropriate commandline to indicate this follows (placed in your args file if using sfle, otherwise added in the commandline call):
340
341 > -a -F cage
342
343 -a indicates all versus all is being used, -F indicates which metadata are forced (multiple metadata can be given comma delimited as shown here -F metadata1,metadata2,metadata3). This does not bypass the feature selection method, so the metadata that are not forced are subject to feature selection and may be removed before coming to the evaluation. If you want all the metadata that are not forced to be evaluated in serial, you will need to turn off feature selection and will have a final combined commandline as seen here:
344
345 > -a -F cage -s none
346
347 # I. Troubleshooting
348
349 ### 1\. (Only valid if using Sfle) ImportError: No module named sfle
350
351 When using the command "scons output/maaslin/..." to run my projects I
352 get the message:
353
354 ImportError: No module named sfle:
355 File "/home/user/sfle/SConstruct", line 2:
356 import sfle
357
358 **Solution:** You need to update your path. On a Linux or macOS terminal
359 in the sfle directory type the following.
360
361 export PATH=/usr/local/bin:`pwd`/src:$PATH
362 export PYTHONPATH=$PATH
363
364
365 ### 2\. When trying to run a script I am told I do not have permission even though file permissions have been set for myself.
366
367
368 **Solution:** Most likely, you need to set the main MaAsLin script
369 (Maaslin.R) to executable.
370
371 # J. Installation as an Automated Pipeline
372
373 SflE (pronounced souffle) is a framework for automation and
374 parallelization on a multiprocessor machine. MaAsLin has been
375 developed to be compatible with this framework. More information can
376 be found at
377 [http://huttenhower.sph.harvard.edu/sfle](http://huttenhower.sph.harvard.edu/sfle). To
378 install MaAsLin in a SflE environment, first install SflE, then
379 download or move the complete maaslin directory into
380 `sfle/input`. After setting up, place all maaslin input files in
381 `sfle/input/maaslin/input`. To run the automated pipeline and analyze
382 all files in the `sfle/input/maaslin/input` directory, type: `scons output/maaslin`
383 in a terminal in the sfle directory. This will produce
384 output in the `sfle/output/maaslin` directory.
385
386 # K. Commandline Options (Modifying Process and Figures)
387
388 Although we recommend the use of default options, commandline
389 arguments exist to modify both MaAsLin methodology and figures. To see
390 an up-to-date listing of argument usage, in a terminal in the
391 `maaslin/src` directory type `./Maaslin.R -h`.
392
393 An additional input file (the args file) can be used to apply
394 commandline arguments to a MaAsLin run. This is useful when using
395 MaAsLin as an automated pipeline (using SflE) and is a way to document
396 which commandline arguments are used for different projects. The args file should
397 be named the same as the *.pcl file except using the extension .args.
398 This file should be placed in the `maaslin/input` directory with the
399 other matching project input files. In this file please have one line
400 of arguments and values (if needed; some arguments are logical flags
401 and do not require a value), each separated by a space. The contents
402 of this file will be directly added to the commandline call for
403 Maaslin.R. An example of the contents of an args file is given here.
404
405 **Example.args:**
406
407 -v DEBUG -d 0.1 -b 5
408
409 In this example MaAsLin is modified to produce verbose output for
410 debugging (-v DEBUG), to change the threshold for making pdfs to a
411 q-value equal to or less than 0.1 (-d 0.1), and to plot
412 5 data (bug) features in the biplot (-b 5).
413
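The following Python sketch illustrates the idea that the single line of an args file is appended to the Maaslin.R call. It is not how SflE itself assembles the command; the function name `build_command` is hypothetical, and the sketch leaves out the positional input and output files that a real call also needs.

```python
import shlex


def build_command(args_text, script="./Maaslin.R"):
    """Turn the one-line content of an args file into an argument list."""
    lines = [line for line in args_text.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError("expected exactly one line of arguments")
    # shlex.split honours quoting, so quoted values stay intact.
    return [script] + shlex.split(lines[0])


# Content of the Example.args file shown above.
command = build_command("-v DEBUG -d 0.1 -b 5\n")
print(command)
```

Keeping the arguments as a list rather than one string avoids shell quoting problems if the list is later passed to a process launcher.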