comparison uchime/uchime.xml @ 0:fd0ab76b83f1 draft default tip

Uploaded
author qfab
date Wed, 28 May 2014 22:14:14 -0400
parents
children
-1:000000000000 0:fd0ab76b83f1
1 <tool id="uchime" name="Uchime" version="1.0.0">
2 <description>Detecting chimeric sequences with two or more segments.</description>
3 <command>
4 #if str( $runmode.mode ) == "denovo"
5 usearch -uchime_denovo $input -chimeras $output -nonchimeras $outputnon -uchimeout $outputtab -uchimealns $outputread -quiet
6 #else
7 usearch -uchime_ref $input -db $db -chimeras $output -nonchimeras $outputnon -uchimeout $outputtab -uchimealns $outputread -strand plus -quiet
8 #end if
9 </command>
10 <inputs>
11 <conditional name="runmode">
12 <param name="mode" type="select" label="Mode to detect chimeras" help="Which mode? See help below">
13 <option value="ref" selected="true">ref</option>
14 <option value="denovo">de novo</option>
15 </param>
16 <when value="denovo">
17 <param name='input' type='data' format='fasta,tabular' label='Input file' help='' />
18 </when>
19 <when value="ref">
20 <param name='input' type='data' format='fasta,tabular' label='Input query file' help='' />
21 <param name='db' type='data' format='fasta' label='Reference Database' />
22 </when>
23 </conditional>
24 </inputs>
25 <outputs>
26 <data name='output' format='fasta' label="${tool.name} on ${on_string}:chimeras" />
27 <data name='outputnon' format='fasta' label="${tool.name} on ${on_string}:non_chimeras" />
28 <data name='outputread' format='tabular' hidden="TRUE" label="${tool.name} on ${on_string}:Human-readable output" />
29 <data name='outputtab' format='tabular' hidden="TRUE" label="${tool.name} on ${on_string}:Tabbed output" help='Output in tabbed format with one record per sequence. First field is score (h), second field is query label.' />
30 </outputs>
31 <help>
32 ===========
33 Description
34 ===========
35
36 .. class:: infomark
37
38 This tool generates two additional files, logs in tabbed and human-readable format, that are hidden from the history list. You can view these outputs by clicking on the cogwheel next to the History panel and selecting "Include Hidden Datasets".
39
40 UCHIME is an algorithm for detecting chimeric sequences. It is implemented in the USEARCH-Tool-Suite_.
41
42 The fundamental step in UCHIME is a search for a 3-way alignment of a query sequence (Q) with two parent sequences (A and B) such that one parent is more similar to one segment of the query and the other parent is more similar to another segment.
43
44 A score is calculated from the alignment. Higher scores indicate a stronger chimeric signal. A score cutoff set by the -minh option (0.28 by default) determines whether the query is classified as a chimera.
45
46 This search can be performed with a reference database of parent sequences believed to be chimera-free provided by the user, or the database can be constructed de novo from the query sequences. In de novo mode, the sequences are assumed to be derived from one PCR run. In this case, parent sequences should be more abundant than their chimeras because the parent amplicons will have undergone more rounds of amplification.
47
48 .. _USEARCH-Tool-Suite: http://www.drive5.com/usearch/
49
50 .. class:: warningmark
51
52 Please note: The free 32-bit version of USEARCH is limited to 4 GB of RAM or less (Linux, OSX).
53 If you are using the free 32-bit version of USEARCH, we recommend using reference datasets no larger than 800 MB to avoid running into the "out of memory" error.
54 Please see the USEARCH_ site for more information on the memory requirements.
55
56 .. _USEARCH: http://drive5.com/usearch/manual/bitness.html
57
58 -----
59
60 ----------
61 Parameters
62 ----------
63
64 **Reference database (ref) mode**
65 A database file of nucleotide sequences must be specified using the Reference Database (ref) option. The database must be in FASTA format. The reference database should include sequences that might appear as parents in the query set. These should be high-quality sequences that are believed to be free of chimeras. Errors in reference sequences will degrade detection accuracy and increase the number of false positives. Chimeras will not be detected if their parents (or sufficiently close relatives) are not present in the database.
66
67 .. class:: warningmark
68
69 The reference database should contain high-quality sequences that are believed to be chimera-free.
70
71 **De novo mode**
72 De novo chimera detection using the UCHIME algorithm. In de novo mode, abundance skew is used to distinguish chimeras from parents. The input file must contain estimated amplicon sequences with integer abundances specified using size annotations, e.g.:
73
74 >FQ23BBGZ5;size=23;
75
76 The minimum abundance skew is specified by the -abskew parameter, which defaults to 2.0 (because one round of PCR doubles the abundance). Abundance is a measure of how many amplicons with a given unique sequence were present in the sample after amplification by PCR. One way to estimate this is to sum the total number of reads in the cluster used to estimate the given amplicon sequence. UCHIME uses only ratios of abundances, so the absolute value does not matter. However, the number of reads is a useful indicator; for example, a cluster containing one read is likely to be spurious. Amplicon sequences and abundances can be estimated using USEARCH, or by using another algorithm such as Chris Quince's PyroNoise or AmpliconNoise. When using de novo mode, sequences should be estimated amplicons from one sequencing run (strictly, one PCR amplification stage); otherwise abundances may not be directly comparable.
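
As a rough illustration of the abundance-skew rule described above, the following Python sketch parses size annotations and checks whether two candidate parents are abundant enough relative to a query. The helper names (parse_size, passes_abskew) and the example labels are hypothetical, and the skew rule used here (the less abundant parent must have at least abskew times the abundance of the query) is an assumption for illustration, not a description of the USEARCH implementation.

```python
import re

# Matches a size annotation such as ";size=23;" at the end of a FASTA label.
SIZE_RE = re.compile(r";size=(\d+);?$")


def parse_size(label):
    """Return the integer abundance from a label like 'FQ23BBGZ5;size=23;'."""
    match = SIZE_RE.search(label)
    if match is None:
        raise ValueError(f"no size annotation in label: {label!r}")
    return int(match.group(1))


def passes_abskew(query_label, parent_a_label, parent_b_label, abskew=2.0):
    """Assumed rule: the less abundant parent must have at least
    `abskew` times the abundance of the query."""
    q = parse_size(query_label)
    a = parse_size(parent_a_label)
    b = parse_size(parent_b_label)
    return min(a, b) >= abskew * q


if __name__ == "__main__":
    # Hypothetical labels: the query has 23 reads, the parents 120 and 50.
    print(passes_abskew("FQ23BBGZ5;size=23;", "P1;size=120;", "P2;size=50;"))
    # With a parent of only 30 reads the ratio is 30/23, below 2.0.
    print(passes_abskew("FQ23BBGZ5;size=23;", "P1;size=120;", "P2;size=30;"))
```

Only the ratio of abundances enters the check, which mirrors the statement that absolute values do not matter.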
77
78 ------
79 Inputs
80 ------
81 **Reference database mode**
82
83 (A) An input file containing the sequences in FASTA format.
84 (B) A reference database file in FASTA format containing nucleotide sequences believed to be free of chimeras.
85
86 **De novo mode**
87
88 (A) A FASTA file of estimated amplicon sequences, each with its abundance specified by a size annotation, e.g. >FQ23BBGZ5;size=23; .
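
One simple way to produce such a file is sketched below in Python. It assumes that abundance is estimated by counting exactly identical reads, which is a simplification; clustering or denoising tools estimate amplicons more carefully. The function name and the "Uniq" label prefix are hypothetical.

```python
from collections import Counter


def size_annotated_fasta(reads, prefix="Uniq"):
    """Collapse identical reads and return FASTA text whose labels carry
    ';size=N;' annotations, most abundant sequence first."""
    counts = Counter(seq.upper() for seq in reads)
    lines = []
    for index, (seq, size) in enumerate(counts.most_common(), start=1):
        lines.append(f">{prefix}{index};size={size};")
        lines.append(seq)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up reads for illustration only.
    reads = ["ACGT", "ACGT", "acgt", "TTGA"]
    print(size_annotated_fasta(reads), end="")
```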
89
90 ------
91 Output
92 ------
93
94 This tool produces four output files, two of which are hidden by default.
95
96 .. class:: infomark
97
98 To view the hidden files: click on the cogwheel icon in the history panel and select 'Include Hidden Datasets'.
99
100 (A) A FASTA file of predicted chimeras
101 (B) A FASTA file of non-chimeras
102 (C) *(hidden) A human readable file of chimeric alignments*
103 (D) *(hidden) A tab-separated file with the following 18 columns:*
104
105 +-------+---------------+--------------------------------------------------------------------------------------------+
106 |1 |Score |Value >= 0.0, high score means more likely to be a chimera |
107 +-------+---------------+--------------------------------------------------------------------------------------------+
108 |2 |Q |Query label |
109 +-------+---------------+--------------------------------------------------------------------------------------------+
110 |3 |A |Parent A label |
111 +-------+---------------+--------------------------------------------------------------------------------------------+
112 |4 |B |Parent B label |
113 +-------+---------------+--------------------------------------------------------------------------------------------+
114 |5 |T |Top parent (T) label. This is the closest reference sequence; usually either A or B |
115 +-------+---------------+--------------------------------------------------------------------------------------------+
116 |6 |IdQM |Percent identity of query and the model (M) constructed as a segment of A and a segment of B|
117 +-------+---------------+--------------------------------------------------------------------------------------------+
118 |7 |IdQA |Percent identity of Q and A |
119 +-------+---------------+--------------------------------------------------------------------------------------------+
120 |8 |IdQB |Percent identity of Q and B |
121 +-------+---------------+--------------------------------------------------------------------------------------------+
122 |9 |IdAB |Percent identity of A and B |
123 +-------+---------------+--------------------------------------------------------------------------------------------+
124 |10 |IdQT |Percent identity of Q and T |
125 +-------+---------------+--------------------------------------------------------------------------------------------+
126 |11 |LY |Yes votes in left segment |
127 +-------+---------------+--------------------------------------------------------------------------------------------+
128 |12 |LN |No votes in left segment |
129 +-------+---------------+--------------------------------------------------------------------------------------------+
130 |13 |LA |Abstain votes in left segment |
131 +-------+---------------+--------------------------------------------------------------------------------------------+
132 |14 |RY |Yes votes in right segment |
133 +-------+---------------+--------------------------------------------------------------------------------------------+
134 |15 |RN |No votes in right segment |
135 +-------+---------------+--------------------------------------------------------------------------------------------+
136 |16 |RA |Abstain votes in right segment |
137 +-------+---------------+--------------------------------------------------------------------------------------------+
138 |17 |Div |Divergence, defined as (IdQM - IdQT) |
139 +-------+---------------+--------------------------------------------------------------------------------------------+
140 |18 |YN |Y(yes) or N(no) classification as a chimera |
141 +-------+---------------+--------------------------------------------------------------------------------------------+
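
For downstream scripting, a row of the tab-separated file can be read into named fields. The Python sketch below assumes the 18 columns appear in the order listed above; the function name is hypothetical, and the sample row is made up for illustration and is not real UCHIME output. Numeric fields that cannot be parsed (for example, a placeholder for a missing value) are returned as None.

```python
# Column names in the order given in the table above.
COLUMNS = [
    "Score", "Q", "A", "B", "T", "IdQM", "IdQA", "IdQB", "IdAB", "IdQT",
    "LY", "LN", "LA", "RY", "RN", "RA", "Div", "YN",
]
TEXT_COLUMNS = {"Q", "A", "B", "T", "YN"}
VOTE_COLUMNS = {"LY", "LN", "LA", "RY", "RN", "RA"}


def convert(name, value):
    """Convert a raw field to text, int (votes), or float (everything else)."""
    if name in TEXT_COLUMNS:
        return value
    try:
        return int(value) if name in VOTE_COLUMNS else float(value)
    except ValueError:
        return None  # non-numeric placeholder


def parse_uchime_row(line):
    """Split one tab-separated row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {name: convert(name, value) for name, value in zip(COLUMNS, fields)}


if __name__ == "__main__":
    # Made-up row: Div (5.1) equals IdQM (99.2) minus IdQT (94.1).
    row = ("0.7521\tQ1;size=5;\tP1;size=40;\tP2;size=31;\tP1;size=40;\t"
           "99.2\t94.1\t93.8\t90.5\t94.1\t12\t0\t1\t9\t0\t2\t5.1\tY")
    record = parse_uchime_row(row)
    print(record["Q"], record["Score"], record["YN"])
```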
142
143 -----
144
145 =========
146 Resources
147 =========
148
149 UCHIME_
150
151 .. _UCHIME: http://drive5.com/usearch/manual/uchime_algo.html
152
153 **Author**
154
155 Robert C. Edgar (bob@drive5.com)
156
157 **Wrapper Author**
158
159 QFAB Bioinformatics (support@qfab.org)
160 </help>
161 <tests>
162 <test>
163 <param name="mode" value="ref" />
164 <param name="input" value="seqs.fasta" />
165 <param name="db" value="gold.fasta" />
166 <output name="output" file="chimeras.fasta" ftype="fasta" lines_diff="10" />
167 <output name="outputnon" file="non_chimeras.fasta" ftype="fasta" lines_diff="10" />
168 <output name="outputtab" file="output.tabular" ftype="tabular" lines_diff="10" />
169 <output name="outputread" file="outputread.tabular" ftype="tabular" lines_diff="10" />
170 </test>
171 </tests>
172 </tool>
173