Mercurial > repos > bgruening > itsx

<tool id="itsx" name="ITSx" version="@TOOL_VERSION@+galaxy@SUFFIX_VERSION@" profile="20.01">
    <description>identifies ITS sequences and extracts the ITS region</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <xrefs>
        <xref type="bio.tools">ITSx</xref>
    </xrefs>
    <expand macro="requirements" />
    <version_command>ITSx --license</version_command>
    <command detect_errors="exit_code"><![CDATA[
    ITSx
    --cpu \${GALAXY_SLOTS:-4}
    -i '$i'
    --save_regions #echo ",".join($output_options.save_regions)
    --only-output $output_options.only_output
    --partial $output_options.partial
    --minlen $output_options.minlen
    --truncate $output_options.truncate
    -t #echo ",".join($sequence_section.character_code)
    -E $sequence_section.E
    -S $sequence_section.S
    -N $sequence_section.N
    --selection_priority $sequence_section.selection_priority
    --allow_single_domain $sequence_section.allow_single_domain
    --allow_reorder $sequence_section.allow_reorder
    --complement $sequence_section.complement
    --heuristics $sequence_section.heuristics
    --nhmmer $sequence_section.nhmmer
    #if 'summary' in $output_options.output_files
        --summary T
    #else
        --summary F
    #end if
    #if 'graphical' in $output_options.output_files
        --graphical T
    #else
        --graphical F
    #end if
    #if 'fasta' in $output_options.output_files
        --fasta T
    #else
        --fasta F
    #end if
    #if 'concat' in $output_options.output_files
        --concat T
    #else
        --concat F
    #end if
    #if 'table' in $output_options.output_files
        --table T
    #else
        --table F
    #end if
    #if 'not_found' in $output_options.output_files
        --not_found T
    #else
        --not_found F
    #end if
    #if 'detailed_results' in $output_options.output_files
        --detailed_results T
    #else
        --detailed_results F
    #end if
    #if 'positions' in $output_options.output_files
        --positions T
    #else
        --positions F
    #end if
    --preserve $output_options.preserve

    ]]></command>
    <inputs>
        <param argument="-i" type="data" format="fasta" label="FASTA file"/>
        <section name="sequence_section" title="Sequence selection options" expanded="true">
            <param name="character_code" argument="-t" type="select"
                multiple="true" label="Profile set to use for the search"
                help="Can be used to restrict the search to only a few organism groups types to save time, if
                    one or more of the origins are not relevant to the dataset under study. Default: all">
                <option value="all" selected="true">All</option>
                <option value="a">Alveolata</option>
                <option value="b">Bryophyta</option>
                <option value="c">Bacillariophyta</option>
                <option value="d">Amoebozoa</option>
                <option value="e">Euglenozoa</option>
                <option value="f">Fungi</option>
                <option value="g">Chlorophyta</option>
                <option value="h">Rhodophyta</option>
                <option value="i">Phaeophyceae</option>
                <option value="l">Marchantiophyta</option>
                <option value="m">Metazoa</option>
                <option value="o">Oomycota</option>
                <option value="p">Haptophyceae</option>
                <option value="q">Raphidophyceae</option>
                <option value="r">Rhizaria</option>
                <option value="s">Synurophyceae</option>
                <option value="t">Tracheophyta</option>
                <option value="u">Eustigmatophyceae</option>
                <option value="x">Apusozoa</option>
                <option value="y">Parabasalia</option>
            </param>
            <param argument="-E" type="float" min="0" max="1" value="0.00005" label="Domain E-value cutoff"
                help="Domain E-value cutoff for a sequence to be included in the output. Default: 1e-5"/>
            <param argument="-S" type="float" min="0" max="1" value="0" label="Domain score cutoff"
                help="Domain score cutoff for a sequence to be included in the output. Default:0" />
            <param argument="-N" type="integer" min="0" max="20" value="2" label="Minimal number of domains"
                help="The minimal number of domains that must match a sequence before it is included. Default:2" />
            <param argument="--selection_priority" type="select" label="Selection priority" help="Selects what will
                be of highest priority when determining the origin of the sequence. Default:score">
                <option value="score">Score</option>
                <option value="sum">Sum</option>
                <option value="domains">Domains</option>
                <option value="eval">Eval</option>
            </param>
            <conditional name="cutoff_conditional">
                <param name="cutoff_mode" type="select" label="HMMER search cutoff mode" help="Default: Score cutoff">
                    <option value="eval">E-value cutoff</option>
                    <option value="score" selected="true">Score cutoff</option>
                </param>
                <when value="eval">
                    <param argument="--search_eval" type="float" min="0" max="1" value="" optional="true"
                        label="E-value cutoff used in the HMMER search"
                        help="High numbers may slow down the process" />
                </when>
                <when value="score">
                    <param argument="--search_score" type="float" min="0" max="1" value="" optional="true"
                        label="The score cutoff used in the HMMER search"
                        help="Low numbers may slow down the process" />
                </when>
            </conditional>
            <param argument="--allow_single_domain" type="float" min="0" max="1" value="0" label="Allow single domain"
                help="Allow inclusion of sequences that only find a single domain, given that they meet the given E-value and score thresholds. Default: 0" />
            <param argument="--allow_reorder" type="boolean" truevalue="T" falsevalue="F" checked="false" label="Allow reorder"
                help="Allows profiles to be in the wrong order on extracted sequences" />
            <param argument="--complement" type="boolean" truevalue="T" falsevalue="F" checked="true" label="Check both DNA strains"
                help="Checks both DNA strands against the database, creating reverse complements" />
            <param argument="--heuristics" type="boolean" truevalue="T" falsevalue="F" checked="false"
                label="HMMER's heuristic filtering" help="Leave this setting off for higher precision" />
            <param argument="--nhmmer" type="boolean" truevalue="T" falsevalue="F" checked="false" label="Use nhmmer instead of hmmsearch" help="Default: No" />
        </section>
        <section name="output_options" title="Output options">
            <param name="output_files" type="select" multiple="true" display="checkboxes" label="Output files">
                <option value="summary" selected="true">Summary</option>
                <option value="graphical">Graphical output</option>
                <option value="fasta" selected="true">FASTA-format of extracted ITS sequences</option>
                <option value="concat">Concateneted ITS sequences</option>
                <option value="positions" selected="true">Positions of found IT sequences</option>
                <option value="table" selected="true">Positions of probable IT sequences</option>
                <option value="not_found" selected="true">List of not-found entries</option>
                <option value="detailed_results">Table of all results</option>
            </param>
            <param argument="--preserve" type="boolean" truevalue="T" falsevalue="F" checked="false" label="Preserve sequence headers in input file" help="Preserve sequence headers in input file instead of printing out ITSx headers. Default: No" />
            <param argument="--save_regions" type="select" label="Save regions" multiple="true" help="A comma separated list of regions to output separate FASTA files">
                <option value="none">None</option>
                <option value="SSU">SSU</option>
                <option value="ITS1" selected="true">ITS1</option>
                <option value="ITS2" selected="true">ITS2</option>
                <option value="5.8S">5.8S</option>
                <option value="LSU">LSU</option>
                <option value="all">All</option>
            </param>
            <param argument="--only_output" type="boolean" truevalue="T" falsevalue="F" checked="false" label="Only full-lenght regions" help="If enabled, output is limited to full-length regions. Default: no" />
            <param argument="--partial" type="integer" min="0" value="0" label="Saves additional FASTA-files for full and partial ITS sequences longer than the specified cutoff" help="Default: 0 (disabled)" />
            <param argument="--minlen" type="integer" min="0" value="0" label="Minimum length the ITS regions must be to be outputted in the concatenated file" help="Default: 0" />
            <param argument="--truncate" type="boolean" truevalue="T" falsevalue="F" checked="false" label="Truncates the FASTA output to only contain the actual ITS sequences found" help="Default: yes" />
        </section>
    </inputs>
    <outputs>
        <data name="not_detection_fasta" format="fasta" from_work_dir="ITSx_out_no_detections.fasta" label="${tool.name} on ${on_string}: no detection (FASTA)">
            <filter> output_options['output_files'] and 'not_found' in output_options['output_files']</filter>
        </data>
        <data name="not_detection_txt" format="txt" from_work_dir="ITSx_out_no_detections.txt" label="${tool.name} on ${on_string}: no detection (TXT)">
            <filter> output_options['output_files'] and 'not_found' in output_options['output_files']</filter>
        </data>
        <data name="its1" format="fasta" from_work_dir="ITSx_out.ITS1.fasta" label="${tool.name} on ${on_string}: ITS1">
            <filter>output_options['output_files'] and 'fasta' in output_options['output_files']</filter>
            <filter>output_options['save_regions'] and 'ITS1' in output_options['save_regions']</filter>
        </data>
        <data name="its2" format="fasta" from_work_dir="ITSx_out.ITS2.fasta" label="${tool.name} on ${on_string}: ITS2">
            <filter>output_options['output_files'] and 'fasta' in output_options['output_files']</filter>
            <filter>output_options['save_regions'] and 'ITS2' in output_options['save_regions']</filter>
        </data>
        <data name="ribosomall58" format="fasta" from_work_dir="ITSx_out.5_8S.fasta" label="${tool.name} on ${on_string}: 5.8S">
            <filter>output_options['output_files'] and 'fasta' in output_options['output_files']</filter>
            <filter>output_options['save_regions'] and '5.8S' in output_options['save_regions']</filter>
        </data>
        <data name="LSU" format="fasta" from_work_dir="ITSx_out.LSU.fasta" label="${tool.name} on ${on_string}: LSU">
            <filter>output_options['output_files'] and 'fasta' in output_options['output_files']</filter>
            <filter>output_options['save_regions'] and 'LSU' in output_options['save_regions']</filter>
        </data>
        <data name="SSU" format="fasta" from_work_dir="ITSx_out.SSU.fasta" label="${tool.name} on ${on_string}: SSU">
            <filter>output_options['output_files'] and 'fasta' in output_options['output_files']</filter>
            <filter>output_options['save_regions'] and 'SSU' in output_options['save_regions']</filter>
        </data>
        <data name="full_sequences" format="fasta" from_work_dir="ITSx_out.full.fasta" label="${tool.name} on ${on_string}: full sequences">
            <filter>output_options['output_files'] and 'fasta' in output_options['output_files']</filter>
        </data>
        <data name="concat" format="fasta" from_work_dir="ITSx_out.concat.fasta" label="${tool.name} on ${on_string}: concatenated sequences">
            <filter>output_options['output_files'] and 'concat' in output_options['output_files']</filter>
        </data>
        <data name="graph" format="txt" from_work_dir="ITSx_out.graph" label="${tool.name} on ${on_string}: graphical representation">
            <filter>output_options['output_files'] and 'graph' in output_options['output_files']</filter>
        </data>
        <data name="summary" format="txt" from_work_dir="ITSx_out.summary.txt" label="${tool.name} on ${on_string}: summary">
            <filter>output_options['output_files'] and 'summary' in output_options['output_files']</filter>
        </data>
        <data name="detailed_results" format="tabular" from_work_dir="ITSx_out.extraction.results" label="${tool.name} on ${on_string}: detailed results">
            <filter>output_options['output_files'] and 'detailed_results' in output_options['output_files']</filter>
        </data>
        <data name="positions_probable" format="tabular" from_work_dir="ITSx_out.hmmer.table" label="${tool.name} on ${on_string}: probable ITS positions">
            <filter>output_options['output_files'] and 'table' in output_options['output_files']</filter>
        </data>
        <data name="positions_found" format="tabular" from_work_dir="ITSx_out.positions.txt" label="${tool.name} on ${on_string}: positions">
            <filter>output_options['output_files'] and 'positions' in output_options['output_files']</filter>
        </data>
        <data name="problematic" format="tabular" from_work_dir="ITSx_out.problematic.txt" label="${tool.name} on ${on_string}: problematic reads"/>
    </outputs>
    <tests>
        <!--Default parameters-->
        <test expect_num_outputs="9">
            <param name="i" value="metagenomics.fasta"/>
            <section name="sequence_section">
                <param name="character_code" value="f"/>
                <param name="E" value="0.00005"/>
                <param name="S" value="0"/>
                <param name="N" value="2"/>
                <param name="selection_priority" value="score"/>
                <param name="allow_single_domain" value="0"/>
                <param name="allow_reorder" value="false"/>
                <param name="complement" value="true"/>
                <param name="heuristics" value="false"/>
                <param name="nhmmer" value="false"/>
            </section>
            <section name="output_options">
                <param name="output_files" value="summary,fasta,positions,table,not_found"/>
                <param name="preserve" value="false"/>
                <param name="save_regions" value="ITS1,ITS2"/>
                <param name="only_output" value="false"/>
                <param name="partial" value="0"/>
                <param name="minlen" value="0"/>
                <param name="truncate" value="false"/>
            </section>
            <output name="not_detection_fasta" file="test_01_not_detected.fasta" ftype="fasta"/>
            <output name="not_detection_txt" file="test_01_not_detected.txt" ftype="txt"/>
            <output name="its2" file="test_01_its2.fasta" ftype="fasta"/>
            <output name="summary" file="test_01_summary.txt" ftype="txt" lines_diff="4"/>
            <output name="positions_found" file="test_01_found.tabular" ftype="tabular"/>
            <output name="positions_probable" ftype="tabular">
                <assert_contents>
                    <has_text text="10088_SRR5572434|r"/>
                    <has_text text="19705_SRR5572434|fr"/>
                </assert_contents>
            </output>
            <output name="problematic" file="test_01_problematic.tabular" ftype="tabular"/>
        </test>
        <!--Custom options-->
        <test expect_num_outputs="9">
            <param name="i" value="metagenomics.fasta"/>
            <section name="sequence_section">
                <param name="character_code" value="f"/>
                <param name="E" value="0.0005"/>
                <param name="S" value="1"/>
                <param name="N" value="2"/>
                <param name="selection_priority" value="score"/>
                <param name="allow_single_domain" value="0.1"/>
                <param name="allow_reorder" value="true"/>
                <param name="complement" value="true"/>
                <param name="heuristics" value="true"/>
                <param name="nhmmer" value="false"/>
            </section>
            <section name="output_options">
                <param name="output_files" value="summary,fasta,positions,table,not_found"/>
                <param name="preserve" value="false"/>
                <param name="save_regions" value="ITS1,ITS2"/>
                <param name="only_output" value="false"/>
                <param name="partial" value="0"/>
                <param name="minlen" value="20"/>
                <param name="truncate" value="true"/>
            </section>
            <output name="not_detection_fasta" file="test_02_not_detected.fasta" ftype="fasta"/>
            <output name="not_detection_txt" file="test_02_not_detected.txt" ftype="txt"/>
            <output name="its2" file="test_02_its2.fasta" ftype="fasta"/>
            <output name="summary" file="test_02_summary.txt" ftype="txt" lines_diff="4"/>
            <output name="positions_found" file="test_02_found.tabular" ftype="tabular"/>
            <output name="positions_probable" ftype="tabular">
                <assert_contents>
                    <has_text text="10088_SRR5572434|r"/>
                    <has_text text="19705_SRR5572434|fr"/>
                </assert_contents>
            </output>
            <output name="problematic" file="test_02_problematic.tabular" ftype="tabular"/>
        </test>
    </tests>
    <help><![CDATA[

.. class:: infomark

**Purpose**

ITSx is an open source software utility to extract the highly variable ITS1 and ITS2 subregions from ITS sequences, which is commonly used as a molecular barcode for e.g. fungi. As the inclusion of parts of the neighbouring,
very conserved, ribosomal genes (SSU, 5S and LSU rRNA sequences) in the sequence identification process can lead to severely misleading results, ITSx identifies and extracts only the ITS regions themselves.

ITSx accepts input in the FASTA format. As it pre-processes the input sequences, it is possible to input both aligned and unaligned FASTA files, containing both DNA and RNA sequences.

----

.. class:: infomark

**Algorithm and implementation**

The main design goal for ITSx is to achieve fast and accurate extraction of ITS regions in large data sets, without introducing a large number of false positives. To be able to reach a high speed, ITSx relies on
the HMMER3 software, which allows for extremely fast comparisons of HMM-profiles to a sequence set. To achieve high detection accuracy, ITSx uses multiple HMM-profiles built from the conserved domains flanking
the ITS regions (SSU, 5.8S and LSU), representing a large number of species groups. This enabled ITSx to extract ITS regions from all eukaryote lineages for which a substantial number of reference ITS sequences
were available as of 2012. The 2017 release of 1.1 saw the introduction of additional HMM profiles to cover even more lineages, particularly within the fungi.

While the default settings of ITSx should be usable  in most situations, you should consider if they suitable for your purposes and for your data set. If the data set is small, this can be done by running the
software multiple times on the data, with different settings, and analyse the outcome. On larger data sets, it might be more feasible to only run ITSx on a subset of the sequences for testing. The graphical output
is very useful for determining whether ITSx performs as desired on the data, as the positions of the found conserved domains can be easily investigated. If domains are missing, the criteria might be set to be too
stringent. If they are not in sequential order (from SSU to LSU with the 5.8S in between), that might be an indication that there is something wrong with the input sequences

----

.. class:: infomark

**Notes on PCR primers and short input sequences**

One obvious use of ITSx would be to run it on sequence datasets generated by PCR amplicon studies of the ITS region, to extract ITS1 and/or ITS2 regions, and to sort
out non-target sequences in the data prior to further analysis. ITSx uses the conserved SSU, 5S and LSU genes to located and orient the ITS regions. To do this, the
software requires at least ~20 bp. (of at least one) of those genes to be present for each input sequence – preferably 25 bp. However, some primer pairs targeting the ITS
region will not include sufficiently large portions of these genes to be detected with the accuracy required by ITSx by default. This may lead to that fewer, or even none, of
the input sequences are recognized as ITS containing.


If the dataset is known to contain only ITS sequences, a remedy to this problem can be to lower the stringency of ITSx, using the -E option. By default, this is set to 1e-5,
but this can be increased to say 0.01, or even 1 to allow for detection of shorter portions of the conserved genes – down to some 15 bases. This feature comes at the
price of an increase proportion of false-positive matches, but in the case of ITS-only datasets this will be less of a concern. Note that this should normally be done only for
datasets that are known to contain only ITS sequences, and that caution needs to be taken in the downstream analysis so that false-positive extractions can be avoided (e.g.
investigate spuriously long or short ITS sequences with a sound degree of scepticism). Conversely, going for very stringent settings may come at the price of sensitivity as
sequences with deviant genes (or of reduced read quality) may be missed. The default settings of the software are calibrated with environmental datasets in mind, to keep
false-positive extractions at a minimum.

If the problem instead is that only one of those genes is present on the input sequence, another, even more polished, solution exists. In order to score a sequence as an ITS
sequence, ITSx prefers to see that the sequence produces matches to at least two HMMs (such as 3’ SSU and 5’ 5.8S). This helps ITSx to keep the number of falsepositive
identifications low. However, the software also allows sequences that produce a match to only one HMM (such as 3’ SSU) to be scored as ITS sequences, provided
that the match is particularly stringent. The stringency of this parameter is controlled through the --allow_single_domain switch (see above). The default value is 1e-9. If
ITSx is used on very short sequences (e.g. 3’ SSU + 100 bp. ITS1), then only one HMM can be expected to match, and the E-value of that match will be compared to
the value of this parameter. Therefore, if the dataset at hand contains sequences which should only contain a single conserved region, using e.g. --allow_single_domain
"1e-5,0" can be enough to go from no matches to all matches. However, as with the -E option above, increasing the E-value of this parameter comes at the expense of
higher risk for false positives.

  ]]></help>
    <expand macro="citations" />
</tool>
author	bgruening
date	Mon, 02 May 2022 13:03:49 +0000
parents
children