Mercurial > repos > iuc > annotatemyids

<tool id="annotatemyids" name="annotateMyIDs" version="3.14.0+galaxy0">
    <description>annotate a generic set of identifiers</description>
    <xrefs>
        <xref type="bio.tools">annotatemyids</xref>
    </xrefs>
    <requirements>
        <requirement type="package" version="3.14.0">bioconductor-org.hs.eg.db</requirement>
        <requirement type="package" version="3.14.0">bioconductor-org.mm.eg.db</requirement>
        <requirement type="package" version="3.14.0">bioconductor-org.dm.eg.db</requirement>
        <requirement type="package" version="3.14.0">bioconductor-org.dr.eg.db</requirement>
        <requirement type="package" version="3.14.0">bioconductor-org.rn.eg.db</requirement>
        <requirement type="package" version="3.14.0">bioconductor-org.at.tair.db</requirement>
    </requirements>
    <version_command><![CDATA[
echo $(R --version | grep version | grep -v GNU)", org.Hs.eg.db version" $(R --vanilla --slave -e "library(org.Hs.eg.db); cat(sessionInfo()\$otherPkgs\$org.Hs.eg.db\$Version)" 2> /dev/null | grep -v -i "WARNING: ")", org.Dr.eg.db version" $(R --vanilla --slave -e "library(org.Dr.eg.db); cat(sessionInfo()\$otherPkgs\$org.Dr.eg.db\$Version)" 2> /dev/null | grep -v -i "WARNING: ")", org.Dm.eg.db version" $(R --vanilla --slave -e "library(org.Dm.eg.db); cat(sessionInfo()\$otherPkgs\$org.Dm.eg.db\$Version)" 2> /dev/null | grep -v -i "WARNING: ")", org.Mm.eg.db version" $(R --vanilla --slave -e "library(org.Mm.eg.db); cat(sessionInfo()\$otherPkgs\$org.Mm.eg.db\$Version)" 2> /dev/null | grep -v -i "WARNING: ")", org.Rn.eg.db version" $(R --vanilla --slave -e "library(org.Rn.eg.db); cat(sessionInfo()\$otherPkgs\$org.Rn.eg.db\$Version)" 2> /dev/null | grep -v -i "WARNING: ")
    ]]></version_command>
    <command detect_errors="exit_code"><![CDATA[
        #if $rscriptOpt:
            cp '${annotatemyids_script}' '${out_rscript}' &&
        #end if
        Rscript '${annotatemyids_script}'
    ]]>
    </command>
    <configfiles>
        <configfile name="annotatemyids_script"><![CDATA[
options( show.error.messages=F, error = function () { cat( geterrmessage(), file=stderr() ); q( "no", 1, F ) } )

# we need that to not crash galaxy with an UTF8 error on German LC settings.
loc <- Sys.setlocale("LC_MESSAGES", "en_US.UTF-8")

id_type <- "${id_type}"
organism <- "${organism}"
output_cols <- "${output_cols}"
file_has_header <- ${file_has_header}
remove_dups <- ${remove_dups}

input <- read.table('$id_file', header=file_has_header, sep="\t", quote="")
ids <- as.character(input[, 1])

if(organism == "Hs"){
    suppressPackageStartupMessages(library(org.Hs.eg.db))
    db <- org.Hs.eg.db
} else if (organism == "Mm"){
    suppressPackageStartupMessages(library(org.Mm.eg.db))
    db <- org.Mm.eg.db
} else if (organism == "Dm"){
    suppressPackageStartupMessages(library(org.Dm.eg.db))
    db <- org.Dm.eg.db
} else if (organism == "Dr"){
    suppressPackageStartupMessages(library(org.Dr.eg.db))
    db <- org.Dr.eg.db
} else if (organism == "Rn"){
    suppressPackageStartupMessages(library(org.Rn.eg.db))
    db <- org.Rn.eg.db
} else if (organism == "At"){
    suppressPackageStartupMessages(library(org.At.tair.db))
    db <- org.At.tair.db
} else {
    cat(paste("Organism type not supported", organism))
}

cols <- unlist(strsplit(output_cols, ","))
result <- select(db, keys=ids, keytype=id_type, columns=cols)

if(remove_dups) {
    result <- result[!duplicated(result$${id_type}),]
}

write.table(result, file='$out_tab', sep="\t", row.names=FALSE, quote=FALSE)

    ]]></configfile>
    </configfiles>
    <inputs>
        <param name="id_file" type="data" format="tabular" label="File with IDs" help="A tabular file with the first column containing one of the supported types of identifier, see Help below." />
        <param name="file_has_header" type="boolean" truevalue="TRUE" falsevalue="FALSE" checked="False" label="File has header?" help="If this option is set to Yes, the tool will assume that the input file has a column header in the first row and the identifers commence on the second line. Default: No" />
        <param name="organism" type="select" label="Organism" help="Select the organism the identifiers are from">
            <option value="Hs" selected="true">Human</option>
            <option value="Mm">Mouse</option>
            <option value="Rn">Rat</option>
            <option value="Dm">Fruit fly</option>
            <option value="Dr">Zebrafish</option>
            <option value="At">Arabidopsis thaliana</option>
        </param>
        <param name="id_type" type="select" label="ID Type" help="Select the type of IDs in your input file">
            <option value="ENSEMBL" selected="true">Ensembl Gene</option>
            <option value="ENSEMBLPROT">Ensembl Protein</option>
            <option value="ENSEMBLTRANS">Ensembl Transcript</option>
            <option value="ENTREZID">Entrez</option>
            <option value="FLYBASE">FlyBase</option>
            <option value="GO">GO</option>
            <option value="PATH">KEGG</option>
            <option value="MGI">MGI</option>
            <option value="REFSEQ">RefSeq</option>
            <option value="SYMBOL">Gene Symbol</option>
            <option value="ZFIN">Zfin</option>
        </param>
        <param name="output_cols" type="select" multiple="True" display="checkboxes" label="Output columns" help="Choose the columns you want in the output table. Note that selecting some columns such as GO or KEGG could make the table very large as some genes may be associated with many terms. Default: ENSEMBL, ENTREZID, SYMBOL, GENENAME">
            <option value="ALIAS">ALIAS</option>
            <option value="ENSEMBL" selected="True">ENSEMBL</option>
            <option value="ENTREZID" selected="True">ENTREZID</option>
            <option value="EVIDENCE">EVIDENCE</option>
            <option value="SYMBOL" selected="True">SYMBOL</option>
            <option value="GENENAME" selected="True">GENENAME</option>
            <option value="REFSEQ">REFSEQ</option>
            <option value="GO">GO</option>
            <option value="ONTOLOGY">ONTOLOGY</option>
            <option value="PATH">KEGG</option>
        </param>
        <param name="remove_dups" type="boolean" truevalue="TRUE" falsevalue="FALSE" checked="False" label="Remove duplicates?" help="If this option is set to Yes, only the first occurrence of each input Gene ID will be kept. Default: No" />
        <param name="rscriptOpt" type="boolean" truevalue="TRUE" falsevalue="FALSE" checked="False" label="Output Rscript?" help="If this option is set to Yes, the Rscript used to annotate the IDs will be provided as a text file in the output. Default: No" />
    </inputs>
    <outputs>
        <data name="out_tab" format="tabular" label="${tool.name} on ${on_string}: Annotated IDs" />
        <data name="out_rscript" format="txt" label="${tool.name} on ${on_string}: Rscript">
            <filter>rscriptOpt is True</filter>
        </data>
    </outputs>
    <tests>
        <!-- Ensure output table works -->
        <test expect_num_outputs="1">
            <param name="id_file" value="genelist.txt" ftype="tabular"/>
            <param name="id_type" value="SYMBOL"/>
            <param name="organism" value="Hs"/>
            <output name="out_tab" file="out.tab" />
        </test>
        <!-- Ensure Ensembl IDs input and Rscript output work -->
        <test expect_num_outputs="2">
            <param name="id_file" value="ensembl_ids.tab" ftype="tabular"/>
            <param name="id_type" value="ENSEMBL"/>
            <param name="organism" value="Hs"/>
            <param name="rscriptOpt" value="True" />
            <output name="out_tab" file="out_ensembl.tab" />
            <output name="out_rscript" file="out_rscript.txt" compare="sim_size" />
        </test>
        <!-- Ensure GO and KEGG output work -->
        <test expect_num_outputs="1">
            <param name="id_file" value="ensembl_ids.tab" ftype="tabular"/>
            <param name="id_type" value="ENSEMBL"/>
            <param name="organism" value="Hs"/>
            <param name="output_cols" value="ENSEMBL,GO,ONTOLOGY,EVIDENCE" />
            <output name="out_tab" file="out_gokegg.tab" />
        </test>
        <!-- Ensure duplicate Gene ID removal works -->
        <test expect_num_outputs="1">
            <param name="id_file" value="ensembl_ids.tab" ftype="tabular"/>
            <param name="id_type" value="ENSEMBL"/>
            <param name="organism" value="Hs"/>
            <param name="output_cols" value="ENSEMBL,GO,ONTOLOGY,EVIDENCE" />
            <param name="remove_dups" value="True" />
            <output name="out_tab" file="out_gokegg_dupsrem.tab" />
        </test>
        <!-- Arabidopsis database -->
        <test expect_num_outputs="1">
            <param name="id_file" value="genelist_arabidopsis.txt" ftype="tabular"/>
            <param name="id_type" value="SYMBOL"/>
            <param name="organism" value="At"/>
            <param name="output_cols" value="GO,ENTREZID" />
            <output name="out_tab" file="out_arabidopsis.tab" />
        </test>
    </tests>
    <help><![CDATA[

.. class:: infomark

**What it does**

This tool can get annotation for a generic set of IDs, using the Bioconductor_ annotation data packages. Supported organisms are human, mouse, rat, fruit fly and zebrafish. The org.db packages that are used here are primarily based on mapping using Entrez Gene identifiers. More information on the annotation packages can be found at the Bioconductor website, for example, information on the human annotation package (org.Hs.eg.db) can be found here_.

Examples of what this tool can be used for are:

* adding gene names to IDs
* mapping between IDs e.g. Entrez, Ensembl, Symbols
* adding GO and KEGG identifiers

.. _Bioconductor: https://www.bioconductor.org/
.. _here: http://bioconductor.org/packages/release/data/annotation/manuals/org.Hs.eg.db/man/org.Hs.eg.db.pdf

-----

**Inputs**

A tab-delimited file with identifiers in the first column. If the file contains a header row, select the file has a header option in the tool form above.

Example:

    =============== =======================
    **GeneID**      *Additional Columns...*
    --------------- -----------------------
    ENSG00000091831
    ENSG00000082175
    ENSG00000141736
    ENSG00000012048
    ENSG00000139618
    ENSG00000129514
    ENSG00000171862
    ENSG00000141510
    =============== =======================

    ID types supported for input are:

    * **ENSEMBL**: Ensembl gene IDs
    * **ENSEMBLPROT**: Ensembl protein IDs
    * **ENSEMBLTRANS**: Ensembl transcript IDs
    * **ENTREZID**: Entrez gene Identifiers
    * **FLYBASE**: FlyBase accession numbers
    * **GO**: GO Identifiers
    * **MGI**: Jackson Laboratory MGI gene accession numbers
    * **PATH**: KEGG Pathway Identifiers
    * **REFSEQ**: Refseq Identifiers
    * **SYMBOL**: The official gene symbol
    * **ZFIN**: Zfin accession numbers

.. class:: warningmark

This tool uses the ``select`` function from the Bioconductor AnnotationDBi_ package. Note that if you request columns that have multiple matches for your IDs, select will return *one row in the output for each possible match*. This has the effect that if you request multiple columns and some of them have a many-to-one relationship to the IDs, things will continue to multiply accordingly. So it's not a good idea to request a large number of columns unless you know what you are asking for should have a one-to-one relationship with the initial set of IDs. In general, if you need to retrieve a column like **GO** or **KEGG**, that has a many-to-one relationship to the original IDs, it is most useful to extract that separately.

.. _AnnotationDBi: https://www.bioconductor.org/packages/devel/bioc/manuals/AnnotationDbi/man/AnnotationDbi.pdf

-----

**Outputs**

If the input IDs are Ensembl, the default output will be similar to below, containing four columns. Other columns, such as GO and KEGG terms, can be selected above to be added as additional columns.

Example:

    =============== ============ ========== =================================
    **ENSEMBL**     **ENTREZID** **SYMBOL** **GENENAME**
    --------------- ------------ ---------- ---------------------------------
    ENSG00000091831  2099        ESR1       estrogen receptor 1
    ENSG00000082175  5241        PGR        progesterone receptor
    ENSG00000141736  2064        ERBB2      erb-b2 receptor tyrosine kinase 2
    ENSG00000012048  672         BRCA1      breast cancer 1
    ENSG00000139618  675         BRCA2      breast cancer 2
    ENSG00000129514  3169        FOXA1      forkhead box A1
    ENSG00000171862  5728        PTEN       phosphatase and tensin homolog
    ENSG00000141510  7157        TP53       tumor protein p53
    =============== ============ ========== =================================

    Columns available for output include many of the ID columns already described under Inputs above and also:

    * **ALIAS**: Commonly used gene symbols
    * **EVIDENCE**: Evidence codes for GO associations with a gene of interest
    * **GENENAME**: The full gene name
    * **ONTOLOGY**: For GO Identifiers, which Gene Ontology (BP, CC, or MF)

    ]]></help>
        <citations>
            <citation type="bibtex">
                @unpublished{None,
                author = {Mark Dunning},
                title = {annotateMyIDs},
                year = {2017},
                eprint = {None},
                url = {}
         }</citation>
    </citations>
</tool>
author	iuc
date	Fri, 01 Jul 2022 12:24:49 +0000
parents	f29602ae449e
children	fe3ca740a485