<macros>
    <token name="@TOOL_VERSION@">0.2.0</token>
    <xml name="requirements">
        <requirement type="package" version="0.2.0">coast</requirement>
    </xml>
    <xml name="citations_coast">
        <citation type="bibtex">@misc{noauthor_coast_nodate,
                title = {{COAST} - {Comparative} {Omics} {Alignment} {Search} {Tool}},
                url = {https://gitlab.com/coast_tool/COAST},
                abstract = {Alignment search tool that identifies similar proteomes},
                language = {en},
                urldate = {2021-06-22},
            }
        </citation>
    </xml>
    <xml name="citations_taxonkit">
        <citation type="bibtex">@article{shen_taxonkit_2021,
                abstract = {The National Center for Biotechnology Information (NCBI) Taxonomy is widely applied in biomedical and ecological studies. Typical demands include querying taxonomy identifier (TaxIds) by taxonomy names, querying complete taxonomic lineages by TaxIds, listing descendants of given TaxIds, and others. However, existed tools are either limited in functionalities or inefficient in terms of runtime. In this work, we present TaxonKit, a command-line toolkit for comprehensive and efficient manipulation of NCBI Taxonomy data. TaxonKit comprises seven core subcommands providing functions, including TaxIds querying, listing, filtering, lineage retrieving and reformatting, lowest common ancestor computation, and TaxIds change tracking. The practical functions, competitive processing performance, scalability with different scales of datasets and good accessibility could facilitate taxonomy data manipulations. TaxonKit provides free access under the permissive MIT license on GitHub, Brewsci, and Bioconda. The documents are also available at https://bioinf.shenwei.me/taxonkit/.},
                author = {Shen, Wei and Ren, Hong},
                doi = {10.1016/j.jgg.2021.03.006},
                file = {ScienceDirect Snapshot:/home/dm/Zotero/storage/Q3KYT6QS/S1673852721000837.html:text/html},
                issn = {1673-8527},
                journal = {Journal of Genetics and Genomics},
                keywords = {Lineage; NCBI Taxonomy; TaxId; TaxId changelog; TaxonKit},
                language = {en},
                month = apr,
                shorttitle = {{TaxonKit}},
                title = {{TaxonKit}: {A} practical and efficient {NCBI} taxonomy toolkit},
                url = {https://www.sciencedirect.com/science/article/pii/S1673852721000837},
                urldate = {2021-06-21},
                year = {2021}
            }
        </citation>
    </xml>
    <xml name="citations_diamond">
        <citation type="bibtex">@article{buchfink_sensitive_2021,
                title = {Sensitive protein alignments at tree-of-life scale using {DIAMOND}},
                volume = {18},
                issn = {1548-7091, 1548-7105},
                url = {http://www.nature.com/articles/s41592-021-01101-x},
                doi = {10.1038/s41592-021-01101-x},
                abstract = {We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP.},
                language = {en},
                number = {4},
                urldate = {2021-04-14},
                journal = {Nature Methods},
                author = {Buchfink, Benjamin and Reuter, Klaus and Drost, Hajk-Georg},
                month = apr,
                year = {2021},
                pages = {366--368},
                file = {Full Text:/home/dm/Zotero/storage/6HKCWF6S/Buchfink et al. - 2021 - Sensitive protein alignments at tree-of-life scale.pdf:application/pdf},
            }
        </citation>
    </xml>
    <xml name="citations_blast">
        <citation type="bibtex">@article{camacho_blast_2009,
                title = {{BLAST}+: architecture and applications},
                volume = {10},
                issn = {1471-2105},
                shorttitle = {{BLAST}+},
                url = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2803857/},
                doi = {10.1186/1471-2105-10-421},
                abstract = {Background
            Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications.

            Results
            We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site.

            Conclusion
            The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.},
                urldate = {2021-04-14},
                journal = {BMC Bioinformatics},
                author = {Camacho, Christiam and Coulouris, George and Avagyan, Vahram and Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden, Thomas L},
                month = dec,
                year = {2009},
                pmid = {20003500},
                pmcid = {PMC2803857},
                pages = {421},
                file = {PubMed Central Full Text PDF:/home/dm/Zotero/storage/5FCYSMW5/Camacho et al. - 2009 - BLAST+ architecture and applications.pdf:application/pdf},
            }
        </citation>
    </xml>
    <xml name="input_query">
        <conditional name="query_type">
            <param name="source" type="select" label="Select the type of input file">
                <option value="coast_gb">COAST from GenBank</option>
                <option value="coast_fa">COAST from FASTA</option>
            </param>
            <when value="coast_gb">
                <param name="query_file" type="data" format="GenBank" label="Load a query proteome in GenBank format"/>
                <param name="query_key" type="select" label="Select the GenBank feature to use as proteins. Choose the one that avoids duplicated proteins.">
                    <option value="CDS" selected="true">CDS</option>
                    <option value="product">product</option>
                </param>
            </when>
            <when value="coast_fa">
                <param name="query_file" type="data" format="FASTA" label="Load a query proteome in FASTA"/>
            </when>
        </conditional>
    </xml>
    <token name="@QUERY@"><![CDATA[
        "$query_type.query_file"
    ]]></token>
    <token name="@QUERY_KEYWORDS@"><![CDATA[
        #if $query_type.source == 'coast_gb'
            --keywords '$query_type.query_key'
        #end if
    ]]></token>

    <xml name="protein_db">
        <param name="db" type="select" optional="false" label="BLAST-Ready protein sequences database.">
            <options from_data_table="blastdb" />
        </param>
    </xml>
    <token name="@DB@"><![CDATA[
        "$db"
    ]]></token>

    <xml name="protein_db_diamond">
        <param name="db" type="select" optional="false" label="Diamond protein sequences database.">
            <options from_data_table="diamond_database" />
        </param>
    </xml>

    <xml name="output_format">
        <param name="outfmt" type="select" optional="true" multiple="true" display="checkboxes" label="Select outputs">
            <option value="b" selected="true">Best-hits tabular file</option>
            <option value="a" selected="true">Results tabular file</option>
<!--            <option value="r" selected="true">Summarized Report</option>-->
        </param>
    </xml>
    <token name="@OUTPUT_FORMAT@"><![CDATA[
        #if $outfmt
            --outfmt
            #for $format in $outfmt
                '${format}'
            #end for
        #end if
    ]]></token>
    <token name="@OUTPUT@"><![CDATA[
        --quiet
    ]]></token>

    <xml name="aai_filter">
        <param name="aai" type="integer" value="10" label="AAIc filtering score">
            <validator type="in_range" min="0" max="100" message="Value not in the permitted range. Only values from 0 to 100 are allowed."/>
        </param>
        <param name="min_cov" type="integer" value="50" label="Minimum Coverage for AAIbd hit selection">
            <validator type="in_range" min="0" max="100" message="Value not in the permitted range. Only values from 0 to 100 are allowed."/>
        </param>
        <param name="min_id" type="integer" value="40" label="Minimum Amino Acid Identity for AAIbd hit selection">
            <validator type="in_range" min="0" max="100" message="Value not in the permitted range. Only values from 0 to 100 are allowed."/>
        </param>
    </xml>
    <token name="@AAI_FILTER@"><![CDATA[
        --aai '$aai'
        --cov '$min_cov'
        --id '$min_id'
    ]]></token>

    <xml name="hypothetical_filter">
        <param name="hypothetical" type="boolean" checked="false" label="Filter hypothetical proteins from query. Read description for more information." truevalue="--filter_hypothetical" falsevalue=""/>
    </xml>
    <token name="@HYPO_FILTER@"><![CDATA[
        #if $hypothetical
            '$hypothetical'
        #end if
    ]]></token>

    <xml name="results_alignment">
        <data format_source="tabular" format="tabular" name="blast_results" label="COAST - Batch alignment results" from_work_dir="blast_results.tab"/>
    </xml>

    <xml name="results_report">
        <data format_source="html"  format="html" name="coast_report" label="COAST - Summarized report" from_work_dir="coast_report.html">
            <filter>"r" in outfmt</filter>
        </data>
        <data format_source="tabular" format="tabular" label="COAST - Best-hits table" name="bh_results" from_work_dir="bh_results.tab">
            <filter>"b" in outfmt</filter>
        </data>
        <data format_source="tabular" format="tabular" label="COAST - Results table" name="coast_results" from_work_dir="coast_results.tab">
            <filter>"a" in outfmt</filter>
        </data>
    </xml>

    <xml name="blast_taxon_filter">
        <conditional name="filter_type">
            <param name="taxon_filter_type" type="select" label="Type of taxonomic filter">
                <option value="taxidlist_dm">Pre-defined taxonomic filters</option>
                <option value="taxidlist_user">User-provided file based list</option>
                <option value="taxonlist">Comma separated list</option>
            </param>
            <when value="taxidlist_dm">
                <param name="taxidlist" type="select" optional="true" label="Select pre-defined taxonomic filters">
                    <options from_data_table="coast_taxonomic_filters" />
                </param>
            </when>
            <when value="taxidlist_user">
                <param name="taxidlist" type="data" format="txt" optional="true" label="Load file with filtering taxids."/>
            </when>
            <when value="taxonlist">
                <param name="taxonlist" type="text" optional="true" label="Comma-separated list of TaxID nodes, ranked species or lower"/>
            </when>
        </conditional>
    </xml>
    <token name="@BLAST_TAX_FILTER@"><![CDATA[
        #if $filter_type.taxon_filter_type == "taxidlist_dm"
            --taxidlist '$filter_type.taxidlist.fields.path'
        #end if
        #if $filter_type.taxon_filter_type == "taxidlist_user"
            --taxidlist '$filter_type.taxidlist'
        #end if
        #if $filter_type.taxon_filter_type == "taxonlist"
            --taxonlist '$filter_type.taxonlist'
        #end if
    ]]></token>

    <xml name="diamond_taxon_filter">
        <conditional name="filter_type">
            <param name="taxon_filter_type" type="select" label="Type of taxonomic filter">
                <option value="taxonlist_pre_defined">Pre-defined taxonomic filters</option>
                <option value="taxonlist">Comma separated list</option>
            </param>
            <when value="taxonlist_pre_defined">
                <param name="taxonlist" type="select" optional="true" label="Select pre-defined taxonomic filters">
                    <option value="10239">Viruses - 10239</option>
                    <option value="2157">Archaea - 2157</option>
                    <option value="2">Bacteria - 2</option>
                </param>
            </when>
            <when value="taxonlist">
                <param name="taxonlist" type="text" optional="true" label="Comma-separated list of TaxID nodes, ranked species or lower"/>
            </when>
        </conditional>
    </xml>
    <token name="@DIAMOND_TAX_FILTER@"><![CDATA[
        #if $filter_type.taxonlist
            --taxonlist '$filter_type.taxonlist'
        #end if
    ]]></token>

    <xml name="generic_aln_options">
        <param name="threshold_no" type="float" size="15" value="0.001" optional="true" label="E-Value Threshold"/>
        <param name="scoring_matrix" type="select" optional="true" label="Scoring matrix">
            <option value="BLOSUM45">BLOSUM45</option>
            <option value="BLOSUM50">BLOSUM50</option>
            <option value="BLOSUM62">BLOSUM62</option>
            <option value="BLOSUM80">BLOSUM80</option>
            <option value="BLOSUM90">BLOSUM90</option>
            <option value="PAM250">PAM250</option>
            <option value="PAM70">PAM70</option>
            <option value="PAM30">PAM30</option>
        </param>
        <param name="gap_open" type="integer" optional="true" label="Gap opening penalty">
            <validator type="in_range" min="0" max="50" message="Value not in the permitted range. Only values from 0 to 50 are allowed."/>
        </param>
        <param name="gap_ext" type="integer" optional="true" label="Gap extension penalty">
            <validator type="in_range" min="0" max="50" message="Value not in the permitted range. Only values from 0 to 50 are allowed."/>
        </param>
    </xml>
    <token name="@GENERIC_ALN_OPTIONS@"><![CDATA[
        #if $aln_adv.scoring_matrix
            --matrix '$aln_adv.scoring_matrix'
        #end if
        #if $aln_adv.threshold_no
            --evalue '$aln_adv.threshold_no'
        #end if
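        ## Compare as strings so that a penalty of 0, which the validators allow, is still passed on.
        #if str($aln_adv.gap_open) != ''
            --gapopen '$aln_adv.gap_open'
        #end if
        #if str($aln_adv.gap_ext) != ''
            --gapextend '$aln_adv.gap_ext'
        #end if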
    ]]></token>

    <xml name="blast_aln_options">
        <param name="task" type="select" optional="true" label="Type of BLAST">
            <option value="blast">blast</option>
            <option value="blastp-fast">blastp-fast</option>
            <option value="blastp-short">blastp-short</option>
        </param>
    </xml>
    <token name="@BLAST_ALN_OPTIONS@"><![CDATA[
        #if $aln_adv.task
            --task '$aln_adv.task'
        #end if
    ]]></token>

    <xml name="diamond_aln_options">
        <param name="diamond_sens" type="select" label="Select the desired sensitivity">
            <option value="sensitive" selected="true">sensitive</option>
            <option value="more-sensitive">more sensitive</option>
            <option value="very-sensitive">very sensitive</option>
            <option value="ultra-sensitive">ultra sensitive</option>
        </param>
    </xml>
    <token name="@DIAMOND_ALN_OPTIONS@"><![CDATA[
        #if $aln_adv.diamond_sens
            --sens '$aln_adv.diamond_sens'
        #end if
    ]]></token>

    <xml name="merlin_db_selection">
        <param name="db" type="select" label="Select the desired database">
            <option value="UniProtKB_SwissProt">SwissProt</option>
            <option value="UniProtKB_Trembl">Trembl</option>
        </param>
    </xml>
    <token name="@TIME_WARNING@"><![CDATA[
.. class:: warningmark

**WARNING** Proteome-wide search time depends on the size of the query proteome and on the size of the database, so queries can be slow.
Use taxonomic filters to decrease search time significantly.

    ]]></token>
    <token name="@GENERAL_DESC@"><![CDATA[

COAST is a tool designed to identify close proteomes for a user-provided query, particularly for viruses, using conventional alignment tools.
The close proteomes are reported at the level of NCBI taxonomy nodes. For more information, visit https://coast-tools.readthedocs.io

    ]]></token>
    <token name="@AAI_DESC@"><![CDATA[

Indices and Metrics
___________________

**AAIc - Average Amino Acid Identity coast**

The AAIc is an attempt to adapt the AAI into a measure that compares proteomes across all annotated proteins.
Low-identity hits, which the traditional method usually removes, are taken into account.
Proteins with no match at all are also included and counted as having 0 identity.
This makes it possible to compare the actual annotation and to select organisms, even taxonomically distant ones, whose proteins could be
relevant, for example, for determining the function of hypothetical proteins.
For this metric, the best hit is the one with the highest identity.
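
As a rough illustration of the description above, the AAIc of one taxonomic node can be read as the mean best-hit identity over every query protein, with unmatched proteins counted as 0. This is only a sketch and not COAST's actual code: the function name ``aaic`` and the input layout are assumptions made for this example.

```python
def aaic(query_proteins, hits):
    """Mean best-hit identity over ALL query proteins (hypothetical helper).

    query_proteins: list of query protein identifiers.
    hits: dict mapping a protein identifier to the percent identities (0-100)
          of its hits against one taxonomic node.
    """
    if not query_proteins:
        raise ValueError("query proteome is empty")
    total = 0.0
    for protein in query_proteins:
        identities = hits.get(protein, [])
        # Best hit = highest identity; a protein without hits contributes 0.
        total += max(identities, default=0.0)
    return total / len(query_proteins)


proteome = ["p1", "p2", "p3", "p4"]
node_hits = {"p1": [90.0, 35.0], "p2": [20.0], "p3": [60.0, 70.0]}
print(aaic(proteome, node_hits))  # (90 + 20 + 70 + 0) / 4
```

Here the low-identity hit of ``p2`` and the missing hit of ``p4`` both lower the score instead of being dropped.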

**AAIbd - Average Amino Acid Identity blast-diamond**

The AAIbd is an implementation of a calculation similar to that of the original
AAI, but computed in one direction only. By default it uses a minimum coverage of 50
and a minimum identity of 40. These values are also used by EzAAI, based on the recent study
by Nicholson et al. (2020). The best hit is then selected by the
highest identity.
The main purpose of this metric is to give the user an
estimate of how taxonomically close that taxonomic node might be. The designation **bd**
distinguishes it from the original AAIb and indicates that the score can be
produced from either BLAST or DIAMOND results.

The following options might be used to calibrate this selection to the user's preference:

- Minimum Identity: Minimum Amino Acid Identity, for hit selection for the AAIbd calculation;
- Minimum Coverage: Minimum coverage, for hit selection for the AAIbd calculation.
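
The effect of the two thresholds can be sketched as follows. This is not COAST's implementation: the function name ``aaibd`` and the input layout are hypothetical, and the sketch assumes that thresholds are inclusive and that, as in the traditional AAI, only proteins with at least one qualifying hit enter the average.

```python
def aaibd(hits, min_id=40.0, min_cov=50.0):
    """One-way average identity over proteins with a qualifying hit (sketch).

    hits: dict mapping a query protein identifier to a list of
          (identity, coverage) tuples, both in percent.
    Returns None when no protein has a hit passing both thresholds.
    """
    best = []
    for protein_hits in hits.values():
        passing = [ident for ident, cov in protein_hits
                   if ident >= min_id and cov >= min_cov]
        if passing:
            # Best hit = highest identity among the hits that pass the filters.
            best.append(max(passing))
    return sum(best) / len(best) if best else None


node_hits = {
    "p1": [(90.0, 80.0), (95.0, 30.0)],  # second hit fails the coverage filter
    "p2": [(35.0, 90.0)],                # fails the identity filter
    "p3": [(60.0, 55.0)],
}
print(aaibd(node_hits))  # (90 + 60) / 2
```

Raising either threshold removes more hits, so fewer proteins contribute to the average.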

**HITSPP - Hits Per Protein**

The score is the total number of hits obtained by all proteins divided by the number of proteins in the query
proteome.
It helps the user understand how well the proteome’s proteins are represented in that particular database.
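
The quotient can be written out as a short sketch. The function name ``hitspp`` and the per-protein hit counts below are invented for illustration and are not part of COAST.

```python
def hitspp(hit_counts, n_query_proteins):
    """Total number of hits divided by the number of query proteins (sketch).

    hit_counts: dict mapping a query protein identifier to its number of hits
                against one taxonomic node; proteins without hits may be omitted.
    n_query_proteins: number of proteins in the whole query proteome.
    """
    if n_query_proteins <= 0:
        raise ValueError("the query proteome must contain at least one protein")
    return sum(hit_counts.values()) / n_query_proteins


# Two of four query proteins have hits: (120 + 30) / 4
print(hitspp({"p1": 120, "p2": 30}, 4))
```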

.. class:: warningmark

**WARNING** Very high values, above 100, might indicate that the taxonomic node is heavily represented in the database.
Intermediate steps only handle up to 500 hits per protein before best-hit selection.
As such, a small number of organisms with very high HITSPP scores can reduce the number of organisms returned.

    ]]></token>
    <token name="@OUT_DESC@"><![CDATA[

Outputs
_______

**Batch alignment results**  This output is always produced. It contains all the alignment search results for all proteins in the proteome. It can also be used to generate new outputs with the COAST Report tool, using different parameters.

**Summarized report**  An HTML document that contains a list of filtered results ordered by AAIc. The report includes a heatmap visualization of protein identities.
It also contains metadata for the COAST job.

**Best-hits table** Tabular file with the best hit selected for each protein in the proteome. These are the hits used for the AAIc calculation.

**Results table** Tabular file with aggregated metrics for each proteome match, aggregated by TaxID.

    ]]></token>
    <token name="@TAX_FILTER_WARNING@"><![CDATA[

Taxonomic Filtering
___________________

Taxonomy-based filtering is available in both BLAST and DIAMOND. It is **THE** key to short COAST run times on large databases.

Most organisms in a database, like nr or Trembl, are not useful in the close proteome identification process.
For example, when users try to identify similar viruses, the bacteria and eukaryotes in the same database only slow the search down.
You should decide how wide you want the search to be and identify the corresponding TaxID node.
Some of these filters are provided along with this tool.

    ]]></token>
    <token name="@HYPO_FILTER_WARNING@"><![CDATA[
.. class:: warningmark

**WARNING - Experimental feature** Hypothetical protein filtering can lead to worse results. It should only be used when few of the proteins have corresponding best hits and the database might lack poorly studied proteins.

    ]]></token>

</macros>
author	diodupima
date	Wed, 17 Nov 2021 11:09:10 +0000
parents	00921ff6b0b7
children