
<tool id="uchime" name="Uchime" version="1.0.0">
<description>Detecting chimeric sequences with two or more segments.</description>
<command>
      #if str( $runmode.mode ) == "denovo"
        usearch -uchime_denovo $runmode.input -chimeras $output -nonchimeras $outputnon -uchimeout $outputtab -uchimealns $outputread -quiet
      #else
        usearch -uchime_ref $runmode.input -db $runmode.db -chimeras $output -nonchimeras $outputnon -uchimeout $outputtab -uchimealns $outputread -strand plus -quiet
      #end if
</command>
<inputs>
    <conditional name="runmode">
    	<param name="mode" type="select" label="Mode to detect chimeras" help="Which mode? See help below">
                <option value="ref" selected="true">ref</option>
    		<option value="denovo">de novo</option>
    	</param>
       <when value="denovo">
        <param name='input' type='data' format='fasta,tabular' label='Input file' help='' />
       </when>
       <when value="ref">
         <param name='input' type='data' format='fasta,tabular' label='Input file' help='' />
	 <param name='db' type='data' format='fasta' label='Reference Database' />
       </when>
    </conditional>
</inputs>
<outputs>
    <data name='output' format='fasta' label="${tool.name} on ${on_string}:chimeras" />
    <data name='outputnon' format='fasta' label="${tool.name} on ${on_string}:non_chimeras" />
    <data name='outputread' format='tabular' hidden="TRUE" label="${tool.name} on ${on_string}:Human-readable output" />
    <data name='outputtab' format='tabular' hidden="TRUE" label="${tool.name} on ${on_string}:Tabbed output" help='Output in tabbed format with one record per sequence. First field is score (h), second field is query label.'  />
</outputs>
<help>
===========
Description
===========

.. class:: infomark

This tool generates two additional log files, one tab-separated and one human-readable, that are hidden from the history list. To view them, click the cogwheel icon in the History panel and select "Include Hidden Datasets".

UCHIME is an algorithm for detecting chimeric sequences. It is implemented in the USEARCH-Tool-Suite_.

The fundamental step in UCHIME is a search for a 3-way alignment of a query sequence (Q) with two parent sequences (A and B) such that one parent is more similar to one segment of the query and the other parent is more similar to another segment.

A score is calculated from the alignment. Higher scores indicate a stronger chimeric signal. A score cutoff set by the -minh option (0.28 by default) determines whether the query is classified as a chimera.

This search can be performed with a reference database of parent sequences believed to be chimera-free provided by the user, or the database can be constructed de novo from the query sequences. In de novo mode, the sequences are assumed to be derived from one PCR run. In this case, parent sequences should be more abundant than their chimeras because the parent amplicons will have undergone more rounds of amplification.

.. _USEARCH-Tool-Suite: http://www.drive5.com/usearch/

.. class:: warningmark

Please note: The free 32-bit version of USEARCH is limited to 4 GB of RAM or less (Linux, OSX).
If you are using the free 32-bit version, we recommend using reference datasets of up to 800 MB to avoid "out of memory" errors.
See the USEARCH_ site for more information on memory requirements.

.. _USEARCH: http://drive5.com/usearch/manual/bitness.html

-----

----------
Parameters
----------

**Reference database (ref) mode**
 A database file of nucleotide sequences must be specified using the Reference Database (ref) option. The database must be in FASTA format. The reference database should include sequences that might appear as parents in the query set. These should be high-quality sequences that are believed to be free of chimeras. Errors in reference sequences will degrade detection accuracy and increase the number of false positives. Chimeras will not be detected if their parents (or sufficiently close relatives) are not present in the database.

.. class:: warningmark

The reference database should contain high-quality sequences that are believed to be chimera-free.

**De novo mode**
 Chimeras are detected de novo using the UCHIME algorithm. Abundance skew is used to distinguish chimeras from parents. The input file must contain estimated amplicon sequences with integer abundances specified using size annotations, e.g.:

 >FQ23BBGZ5;size=23;

The minimum abundance skew is specified by the -abskew parameter, which defaults to 2.0 (because one round of PCR doubles the abundance). Abundance is a measure of how many amplicons with a given unique sequence were present in the sample after amplification by PCR. One way to estimate this is to count the reads in the cluster used to estimate the given amplicon sequence. UCHIME uses only ratios of abundances, so the absolute value does not matter. However, the number of reads is a useful indicator; for example, a cluster containing one read is likely to be spurious. Amplicon sequences and abundances can be estimated using USEARCH, or by using another algorithm such as Chris Quince's PyroNoise or AmpliconNoise. When using de novo mode, sequences should be estimated amplicons from one sequencing run (strictly, one PCR amplification stage); otherwise abundances may not be directly comparable.
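
The size annotation and the skew test described above can be sketched in Python. This is a minimal illustration and not part of USEARCH: the helper names ``parse_size`` and ``passes_skew`` are hypothetical, and the skew is assumed to be the ratio of a candidate parent's abundance to the query's abundance.

```python
import re

# Hypothetical helpers for illustration only; the ";size=N;" label
# convention is the one shown in the example label above.
SIZE_PATTERN = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")


def parse_size(label):
    """Return the integer abundance from a FASTA label, or None if absent."""
    match = SIZE_PATTERN.search(label.lstrip(">"))
    return int(match.group(1)) if match else None


def passes_skew(parent_size, query_size, abskew=2.0):
    """True if the parent is at least `abskew` times as abundant as the query.

    Assumption: skew is taken as parent abundance divided by query abundance.
    """
    return parent_size >= abskew * query_size
```

For example, ``parse_size(">FQ23BBGZ5;size=23;")`` returns 23, and a candidate parent with abundance 46 passes the default skew of 2.0 against that query.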

------
Inputs
------
**Reference database mode**

(A) An input file containing the sequences in FASTA format.
(B) A reference database file in FASTA format containing nucleotide sequences believed to be free of chimeras.

**De novo mode**

(A) A FASTA file of estimated amplicon sequences, each with its abundance specified by a size annotation, e.g. >FQ23BBGZ5;size=23; .

------
Output
------

This tool produces four output files, two of which are hidden by default.

.. class:: infomark

To view the hidden files: click on the cogwheel icon in the history panel and select 'Include Hidden Datasets'.

(A) A FASTA file of predicted chimeras
(B) A FASTA file of non-chimeras
(C) *(hidden) A human readable file of chimeric alignments*
(D) *(hidden) A tab-separated file with the following 18 columns:*

+-------+---------------+--------------------------------------------------------------------------------------------+
|1	|Score		|Value >= 0.0, high score means more likely to be a chimera                                  |
+-------+---------------+--------------------------------------------------------------------------------------------+
|2	|Q		|Query label     						                             |
+-------+---------------+--------------------------------------------------------------------------------------------+
|3	|A      	|Parent A label                                                                              |
+-------+---------------+--------------------------------------------------------------------------------------------+
|4	|B      	|Parent B label                                                                              |
+-------+---------------+--------------------------------------------------------------------------------------------+
|5      |T              |Top parent (T) label. This is the closest reference sequence; usually either A or B         |
+-------+---------------+--------------------------------------------------------------------------------------------+
|6	|IdQM		|Percent identity of query and the model (M) constructed as a segment of A and a segment of B|
+-------+---------------+--------------------------------------------------------------------------------------------+
|7	|IdQA		|Percent identity of Q and A                                                                 |
+-------+---------------+--------------------------------------------------------------------------------------------+
|8	|IdQB		|Percent identity of Q and B                                                                 |
+-------+---------------+--------------------------------------------------------------------------------------------+
|9	|IdAB		|Percent identity of A and B                                                                 |
+-------+---------------+--------------------------------------------------------------------------------------------+
|10	|IdQT		|Percent identity of Q and T                                                                 |
+-------+---------------+--------------------------------------------------------------------------------------------+
|11	|LY		|Yes votes in left segment                                                                   |
+-------+---------------+--------------------------------------------------------------------------------------------+
|12	|LN		|No votes in left segment                                                                    |
+-------+---------------+--------------------------------------------------------------------------------------------+
|13	|LA		|Abstain votes in left segment                                                               |
+-------+---------------+--------------------------------------------------------------------------------------------+
|14	|RY		|Yes votes in right segment                                                                  |
+-------+---------------+--------------------------------------------------------------------------------------------+
|15	|RN		|No votes in right segment                                                                   |
+-------+---------------+--------------------------------------------------------------------------------------------+
|16     |RA             |Abstain votes in right segment                                                              |
+-------+---------------+--------------------------------------------------------------------------------------------+
|17     |Div            |Divergence, defined as (IdQM - IdQT)                                                        |
+-------+---------------+--------------------------------------------------------------------------------------------+
|18	|YN		|Y(yes) or N(no) classification as a chimera                                                 |
+-------+---------------+--------------------------------------------------------------------------------------------+
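
The tab-separated output can be read with a few lines of Python. The sketch below assumes the file has exactly the 18 columns listed above and no header row; the function names are hypothetical, and all values are kept as strings.

```python
import csv

# Column names follow the table above; the reader functions are
# hypothetical helpers, not part of USEARCH or this wrapper.
COLUMNS = ["Score", "Q", "A", "B", "T", "IdQM", "IdQA", "IdQB", "IdAB",
           "IdQT", "LY", "LN", "LA", "RY", "RN", "RA", "Div", "YN"]


def read_uchimeout(handle):
    """Yield one dict per record, keyed by the column names above."""
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        yield dict(zip(COLUMNS, fields))


def chimera_labels(handle):
    """Return the query labels of records whose YN column is "Y"."""
    return [row["Q"] for row in read_uchimeout(handle) if row["YN"] == "Y"]
```

Calling ``chimera_labels(open("output.tabular"))`` would then list the queries classified as chimeras; the file name here is only a placeholder.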

-----

=========
Resources
=========

UCHIME_

.. _UCHIME: http://drive5.com/usearch/manual/uchime_algo.html

**Author**

Robert C. Edgar (bob@drive5.com)

**Wrapper Author**

QFAB Bioinformatics (support@qfab.org)
</help>
<tests>
  <test>
        <param name="input" value="seqs.fasta" />
        <param name="mode" value="ref" />
        <param name="db" value="gold.fasta" />
        <output name="output" file="chimeras.fasta" ftype="fasta" lines_diff="10" />
        <output name="outputnon" file="non_chimeras.fasta" ftype="fasta" lines_diff="10"  />
        <output name="outputtab" file="output.tabular" ftype="tabular" lines_diff="10"  />
        <output name="outputread" file="outputread.tabular" ftype="tabular" lines_diff="10" />
  </test>
</tests>
</tool>
author	qfab
date	Wed, 28 May 2014 22:14:14 -0400