comparison format_input.xml @ 0:e7cd19afda2e draft

Lefse
author george-weingart
date Tue, 13 May 2014 21:57:00 -0400
parents
children
-1:000000000000 0:e7cd19afda2e
1 <tool id="LEfSe_for" name="A) Format Data for LEfSe" version="1.0">
2 <code file="format_input_selector.py"/>
3 <description></description>
4 <!-- <command interpreter="python">./format_input.py $inp_data $formatted_input -f $feat_dir -c $cls_n -s $subcls_n -u $subj_n -o 1000000.0 </command> -->
5 <command interpreter="python">format_input.py $inp_data $formatted_input -f $cond.feat_dir -c $cond.cls_n -s $cond.subcls_n -u $cond.subj_n -o $norm </command>
6 <inputs>
7 <page>
8 <param format="tabular" name="inp_data" type="data" label="Upload a tabular file of relative abundances and class labels (possibly also subclass and subjects labels) for LEfSe - See samples below - Please use Galaxy Get-Data/Upload-File. Use File-Type = Tabular" help=""/>
9 <param name="cond" type="data_column" data_ref="inp_data" accept_default="true" />
10
11 <conditional name="cond" type="data_column" data_ref="inp_data" accept_default="true">
12 <param name="feat_dir" type="select" data_ref="inp_data" label="Select whether the vectors (features and meta-data information) are listed in rows or columns" help="">
13 <option value="r" selected='True'>Rows</option>
14 <option value="c">Columns</option>
15 </param>
16
17 <when value="r">
18 <param name="cls_n" label="Select which row to use as class" size ="70" type='select' dynamic_options="get_cols(inp_data,'r','cl')" />
19 <param name="subcls_n" label="Select which row to use as subclass" type='select' dynamic_options="get_cols(inp_data,'r','subclass')" />
20 <param name="subj_n" label="Select which row to use as subject" type='select' dynamic_options="get_cols(inp_data,'r','subject')" />
21 </when>
22 <when value="c">
23 <param name="cls_n" label="Select which column to use as class" type='select' dynamic_options="get_cols(inp_data,'c','cl')" />
24 <param name="subcls_n" label="Select which column to use as subclass" type='select' dynamic_options="get_cols(inp_data,'c','subclass')" />
25 <param name="subj_n" label="Select which column to use as subject" type='select' dynamic_options="get_cols(inp_data,'c','subject')" />
26 </when>
27
28 </conditional>
29
30 <param name="norm" type="select" label="Per-sample normalization of the sum of the values to 1M (recommended when very low values are present)" help="">
31 <option value="1000000.0" selected='True'>Yes</option>
32 <option value="-1">No</option>
33 </param>
34
35 <!-- <param name="row" label="on row" type="data_row" data_ref="inp_data" accept_default="true" /> -->
36 </page>
37 </inputs>
38 <outputs>
39 <data format="lefse" name="formatted_input" />
40 </outputs>
41
42 <tests>
43 <test>
44 <param name="inp_data" value="lefse_input" ftype="tabular" />
45 <param name="cond.feat_dir" value="r" />
46 <param name="cond.cls_n" value="1" />
47 <param name="cond.subcls_n" value="-1" />
48 <param name="cond.subj_n" value="-1" />
49 <param name="norm" value="1000000.0" />
50 <output name="formatted_input" file="lefse_output_a" />
51 </test>
52 </tests>
53
54
55
56
57 <help>
58
59
60 **What it does**
61
62 LDA Effect Size (LEfSe) `(Segata et. al 2011)`_ is an algorithm for high-dimensional biomarker discovery and
63 explanation that identifies genomic features (genes, pathways, or taxa) characterizing
64 the differences between two or more biological conditions (or classes, see figure below). It
65 emphasizes both statistical significance and biological relevance, allowing
66 researchers to identify differentially abundant features that are also consistent with
67 biologically meaningful categories (subclasses). LEfSe first robustly
68 identifies features that are statistically different among biological classes. It then
69 performs additional tests to assess whether these differences are consistent with
70 respect to expected biological behavior.
71
72 Specifically, we first use the non-parametric factorial
73 Kruskal-Wallis (KW) sum-rank test to detect features with
74 significant differential abundance with respect to the class of interest; biological
75 significance is subsequently investigated using a set of pairwise tests among
76 subclasses using the (unpaired) Wilcoxon rank-sum test. As a last step, LEfSe uses
77 Linear Discriminant Analysis to estimate the effect size of each differentially
78 abundant feature and, if desired by the investigator, to perform dimension reduction.
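
As a rough illustration of the first step above, the sketch below computes the Kruskal-Wallis H statistic for one feature split by class, using only the Python standard library. It is not the code LEfSe runs: the function names are hypothetical, no tie correction is applied, and turning H into a p-value would additionally require the chi-square distribution.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    offset = 0
    for group in groups:
        # Ranks are stored in pooled order, so each group is a contiguous slice.
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted += rank_sum ** 2 / len(group)
        offset += len(group)
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


# Bacteria|Firmicutes abundances from the example table in this help text.
mucosal = [0.494223, 0.173411, 0.715345, 0.813046, 0.124552]
non_mucosal = [0.177961, 0.189178, 0.188964, 0.0226835, 0.192665]
print(kruskal_wallis_h([mucosal, non_mucosal]))
```

A larger H means the class rank sums differ more than chance alone would suggest. In LEfSe, only features that pass this test move on to the pairwise Wilcoxon tests among subclasses.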
79
80 LEfSe consists of six modules performing the following steps (see the figure below).
81
82 The first step consists of **uploading your file** using Galaxy's "Get-Data / Upload-file".
83
84
85 The next steps are:
86
87 + **A) Format Data for LEfSe**: selects the structure of the problem (classes, subclasses, subjects) and formats the tabular abundance data for the B module
88 + **B) LDA Effect Size (LEfSe)**: performs the analysis using the data formatted with module A and provides input for the visualization modules (C, D, E, F)
89 + **C) Plot LEfSe Results**: graphically reports the discovered biomarkers (output of B) with their effect sizes
90 + **D) Plot Cladogram**: graphically represents the discovered biomarkers (output of B) in a taxonomic tree specified by the hierarchical feature names (not available for non-hierarchical features)
91 + **E) Plot One Feature**: plots the row values of a feature (biomarker or not) as an abundance histogram with classes and subclasses structure (only one feature at a time)
92 + **F) Plot Differential Features**: plots the row values of all features (biomarkers or not) as abundance histograms with classes and subclasses structure and provides a zip archive of the figures
93
94 .. image:: https://bytebucket.org/biobakery/galaxy_lefse/wiki/lefse_ove.png
95
96
97 ------
98
99
100 **Input file format**
101
102 The tab-delimited text input file consists of a list of numerical features, the class vector, and optionally the subclass and subject vectors. The features can be read counts or, more generally, floating-point abundance values; the first field of each feature is its name. Class, subclass, and subject vectors have a name (the first field) followed by a list of non-numerical strings.
103
104 Although both column and row feature organizations are accepted, listing the features in rows is preferred given the high-dimensional nature of metagenomic data. A partial example of an input file follows (all values are separated by a single tab)::
105
106 bodysite mucosal mucosal mucosal mucosal mucosal non_mucosal non_mucosal non_mucosal non_mucosal non_mucosal
107 subsite oral gut oral oral gut skin nasal skin ear nasal
108 id 1023 1023 1672 1876 1672 159005010 1023 1023 1023 1672
109 Bacteria 0.99999 0.99999 0.999993 0.999989 0.999997 0.999927 0.999977 0.999987 0.999997 0.999993
110 Bacteria|Actinobacteria 0.311037 0.000864363 0.00446132 0.0312045 0.000773642 0.359354 0.761108 0.603002 0.95913 0.753688
111 Bacteria|Bacteroidetes 0.0689602 0.804293 0.00983343 0.0303561 0.859838 0.0195298 0.0212741 0.145729 0.0115617 0.0114511
112 Bacteria|Firmicutes 0.494223 0.173411 0.715345 0.813046 0.124552 0.177961 0.189178 0.188964 0.0226835 0.192665
113 Bacteria|Proteobacteria 0.0914284 0.0180378 0.265664 0.109549 0.00941215 0.430869 0.0225884 0.0532684 0.00512034 0.0365453
114 Bacteria|Firmicutes|Clostridia 0.090041 0.170246 0.00483188 0.0465328 0.122702 0.0402301 0.0460614 0.135201 0.0115835 0.0537381
115
116 In this case one may want to use bodysite as class, subsite as subclass and id as subject. Notice that the features have a hierarchical structure specified using the character \|.
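
To make the layout above concrete, here is a minimal sketch of reading a row-oriented table into metadata vectors and numeric features. The helper `parse_lefse_table` and its arguments are hypothetical and only illustrate the file structure; they are not part of format_input.py.

```python
def parse_lefse_table(text, class_row, subclass_row=None, subject_row=None):
    """Split a row-oriented, tab-delimited table into metadata and features.

    Hypothetical helper: rows named as class/subclass/subject are kept as
    lists of strings, every other row is read as floating-point abundances.
    """
    roles = {"class": class_row, "subclass": subclass_row, "subject": subject_row}
    wanted = {row: role for role, row in roles.items() if row is not None}
    metadata, features = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, *fields = line.split("\t")
        if name in wanted:
            metadata[wanted[name]] = fields
        else:
            features[name] = [float(value) for value in fields]
    return metadata, features


# Three samples (columns 1, 2 and 6) taken from the example table above.
table = "\n".join([
    "bodysite\tmucosal\tmucosal\tnon_mucosal",
    "subsite\toral\tgut\tskin",
    "id\t1023\t1023\t159005010",
    "Bacteria\t0.99999\t0.99999\t0.999927",
    "Bacteria|Firmicutes\t0.494223\t0.173411\t0.177961",
])
metadata, features = parse_lefse_table(table, "bodysite", "subsite", "id")
print(metadata["class"])
print(sorted(features))
```

In this sketch every row that is not named as class, subclass, or subject must be numeric, which mirrors the description of the input file given above.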
117
118
119 **Input file sample**
120
121 You can try the LEfSe modules using the dataset available here_. You can upload the dataset using Galaxy's **Get-Data / Upload File**.
122
123 This is a 16S dataset from `(Garrett et. al 2010)`_ and `(Veiga et. al 2010)`_ for studying the characteristics of the fecal microbiota in a mouse model of spontaneous colitis. The dataset contains 30 abundance profiles (obtained by processing the 16S reads with RDP) belonging to 10 rag2 (control) and 20 truc (case) mice. The metadata consists of class information only, as we don't have subject or subclass information. The same dataset is used to show the graphical results in the module descriptions.
124
125
126
127 ------
128
129 STEP A:
130 -------
131
132
133 **What STEP A does**
134
135 Preprocessing module for the biomarker discovery tool called LEfSe:
136
137 This module of LEfSe preprocesses metagenomic abundance data for the analyses carried out by the "Run LEfSe" module. It is kept separate from the "Run LEfSe" module because one may want to preprocess the data only once but run multiple analyses.
138
139 For an overview of LEfSe please refer to the "Introduction" module or to `(Segata et. al 2011)`_.
140
141 ------
142
143 **Input format**
144
145 The module accepts tabular data with the feature list in rows or columns.
146
147 ------
148
149 **Output format**
150
151 The module generates data readable by the "Run LEfSe" module only.
152
153 ------
154
155 **Parameters**
156
157 The class vector represents the labels of the main condition under investigation. The (optional) subclass vector denotes the internal groupings with biological meaning inside each class (subclasses of different classes with the same name are processed as different subclasses). The subject vector (optional) reports a third dimension denoting meta-data (subject id, sample type, ... ) which is independent from the class and subclass definition.
158
159 The labels can have a hierarchical organization (see example below) reflecting taxonomies (like NCBI or RDP microbial taxonomy, SEED subsystems or GO terms). The taxonomic levels are specified using the character \|.
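
As a small illustration of the hierarchical naming described above, the sketch below expands one label into its chain of levels. The function name `ancestors` is hypothetical; only the separator character comes from this help text.

```python
def ancestors(feature_name, sep="|"):
    """List every level of a hierarchical label, from the root down to the label itself."""
    levels = feature_name.split(sep)
    return [sep.join(levels[:depth]) for depth in range(1, len(levels) + 1)]


print(ancestors("Bacteria|Firmicutes|Clostridia"))
# ['Bacteria', 'Bacteria|Firmicutes', 'Bacteria|Firmicutes|Clostridia']
```

A label without the separator is returned as a single-element list, so non-hierarchical features need no special handling in this sketch.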
160
161 The per-sample normalization is usually applied for metagenomic data in which the relative abundances are taken into account.
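
The per-sample normalization can be pictured with the sketch below, which rescales each sample so that its values sum to a target such as 1000000.0. This is an assumption-laden illustration, not the logic of format_input.py: the function name is hypothetical, features are assumed to be flat (this file does not show how hierarchical features are treated during normalization), and a non-positive target is taken to mean "no normalization", matching the -1 value of the "No" option.

```python
def normalize_samples(features, target=1000000.0):
    """Rescale each sample (column) so that its values sum to `target`.

    A non-positive target (such as the -1 used for the "No" option) leaves
    the data untouched.
    """
    if not target > 0:
        return {name: list(values) for name, values in features.items()}
    n_samples = len(next(iter(features.values())))
    # Column totals: one sum per sample across all features.
    totals = [sum(values[i] for values in features.values()) for i in range(n_samples)]
    return {
        name: [value * target / totals[i] if totals[i] else 0.0
               for i, value in enumerate(values)]
        for name, values in features.items()
    }


counts = {"taxon_a": [1.0, 3.0], "taxon_b": [3.0, 1.0]}
scaled = normalize_samples(counts)
print(scaled["taxon_a"])  # [250000.0, 750000.0]
```

Scaling to a large common total keeps very small relative abundances away from zero, which is why the option is recommended when very low values are present.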
162
163 ------
164
165 **Example**
166
167 Although both column and row feature organizations are accepted, listing the features in rows is preferred given the high-dimensional nature of metagenomic data. A partial example of an input file follows (all values are separated by a single tab)::
168
169 bodysite mucosal mucosal mucosal mucosal mucosal non_mucosal non_mucosal non_mucosal non_mucosal non_mucosal
170 subsite oral gut oral oral gut skin nasal skin ear nasal
171 id 1023 1023 1672 1876 1672 159005010 1023 1023 1023 1672
172 Bacteria 0.99999 0.99999 0.999993 0.999989 0.999997 0.999927 0.999977 0.999987 0.999997 0.999993
173 Bacteria|Actinobacteria 0.311037 0.000864363 0.00446132 0.0312045 0.000773642 0.359354 0.761108 0.603002 0.95913 0.753688
174 Bacteria|Bacteroidetes 0.0689602 0.804293 0.00983343 0.0303561 0.859838 0.0195298 0.0212741 0.145729 0.0115617 0.0114511
175 Bacteria|Firmicutes 0.494223 0.173411 0.715345 0.813046 0.124552 0.177961 0.189178 0.188964 0.0226835 0.192665
176 Bacteria|Proteobacteria 0.0914284 0.0180378 0.265664 0.109549 0.00941215 0.430869 0.0225884 0.0532684 0.00512034 0.0365453
177 Bacteria|Firmicutes|Clostridia 0.090041 0.170246 0.00483188 0.0465328 0.122702 0.0402301 0.0460614 0.135201 0.0115835 0.0537381
178
179 In this case one may want to use bodysite as class, subsite as subclass and id as subject. Notice that the features have a hierarchical structure specified using the character \|.
180
181 **Example with the "mouse model dataset"**
182
183 You can try the LEfSe modules using the dataset available here_. This is a 16S dataset from `(Garrett et. al 2010)`_ and `(Veiga et. al 2010)`_ for studying the characteristics of the fecal microbiota in a mouse model of spontaneous colitis. The dataset contains 30 abundance profiles (obtained by processing the 16S reads with RDP) belonging to 10 rag2 (control) and 20 truc (case) mice. The metadata consists of class information only, as we don't have subject or subclass information. The dataset contains the features organized in rows; select the first row as class, and select the "no subclass" and "no subject" options.
184
185
186 .. _here: http://www.huttenhower.org/webfm_send/73
187 .. _(Segata et. al 2011): http://www.ncbi.nlm.nih.gov/pubmed/21702898
188 .. _(Garrett et. al 2010): http://www.ncbi.nlm.nih.gov/pubmed/20833380
189 .. _(Veiga et. al 2010): http://www.ncbi.nlm.nih.gov/pubmed/20921388
190 .. _contact us: nsegata@hsph.harvard.edu
191
192
193
194
195 **How to Cite LEfSe**
196
197 If you find LEfSe useful in your research, please cite our paper `(Segata et. al 2011)`_:
198
199 | `Nicola Segata`_, Jacques Izard, Levi Waldron, Dirk Gevers, Larisa Miropolsky, Wendy Garrett, `Curtis Huttenhower`_.
200 | "`Metagenomic Biomarker Discovery and Explanation`_"
201 | Genome Biology, 2011 Jun 24;12(6):R60
202
203
204 Please do not hesitate to `contact us`_ with any questions or comments.
205
206 .. _here: http://www.huttenhower.org/webfm_send/73
207 .. _(Segata et. al 2010): http://www.ncbi.nlm.nih.gov/pubmed/21702898
208 .. _(Garrett et. al 2010): http://www.ncbi.nlm.nih.gov/pubmed/20833380
209 .. _(Veiga et. al 2010): http://www.ncbi.nlm.nih.gov/pubmed/20921388
210 .. _contact us: nsegata@hsph.harvard.edu
211 .. _Nicola Segata: nsegata@hsph.harvard.edu
212 .. _Curtis Huttenhower: chuttenh@hsph.harvard.edu
213 .. _Metagenomic Biomarker Discovery and Explanation: http://genomebiology.com/2011/12/6/R60
214
215
216
217
218 </help>
219 </tool>