annotate dante.xml @ 5:ad3bbf392135 draft

Uploaded
author petr-novak
date Wed, 26 Jun 2019 11:14:05 -0400
parents a5f1638b73be
children
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
<tool id="dante" name="Domain based ANnotation of Transposable Elements - DANTE" version="1.0.0">
    <requirements>
        <requirement type="package">last</requirement>
        <requirement type="package">numpy</requirement>
        <requirement type="package" version="1.0.0">dante</requirement>
    </requirements>
    <stdio>
        <regex match="Traceback" source="stderr" level="fail" description="Unknown error" />
        <regex match="error" source="stderr" level="fail" description="Unknown error" />
    </stdio>
    <description>Tool for annotation of transposable elements based on similarity to a database of conserved protein domains.</description>
    <command>
        python3 ${__tool_directory__}/dante.py --query ${input} --domain_gff ${DomGff}
        --protein_database ${__tool_data_path__}/protein_domains/${db_type}_pdb
        --classification ${__tool_data_path__}/protein_domains/${db_type}_class
    </command>
    <inputs>
        <param format="fasta" type="data" name="input" label="Choose your input sequence" help="Input DNA must be in valid FASTA format; multi-FASTA files containing several sequences are allowed" />

        <param name="db_type" type="select" label="Select taxon and protein domain database version (REXdb)" help="">
            <options from_file="rexdb_versions.txt">
                <column name="name" index="0"/>
                <column name="value" index="1"/>
            </options>
        </param>

    </inputs>

    <outputs>
        <data format="gff3" name="DomGff" label="Unfiltered GFF3 file of ALL protein domains from dataset ${input.hid}" />
    </outputs>

    <help>

THIS IS A PRIMARY OUTPUT THAT SHOULD UNDERGO FURTHER QUALITY FILTERING TO GET RID OF POTENTIALLY FALSE-POSITIVE DOMAINS

**WHAT IT DOES**

This tool uses the external alignment programme `LAST`_ and the RepeatExplorer database of TE protein domains (REXdb) (Viridiplantae and Metazoa).

.. _LAST: http://last.cbrc.jp/

*Lastal* runs a similarity search to find hits between the query DNA sequence and our database of protein domains from all Viridiplantae repetitive elements. Hits with overlapping positions in the sequence (even through other hits) form a cluster which represents one potential protein domain. Strand orientation is taken into consideration when forming the clusters, which means that each cluster is built exclusively from either forward- or reverse-stranded hits.

The clusters are subsequently processed separately; within one cluster, positions are scanned base by base and classification strings are assigned to each of them based on the database sequences that were mapped to that position. These assigned classification strings consist of a domain type as well as the class and lineage of the repetitive element the database protein comes from. Different classification levels are separated by the "|" character. Every hit is scored according to the scoring matrix used for DNA-protein alignment (BLOSUM80). For a single position, only the hits reaching a certain percentage (80% by default) of the overall best score within the whole cluster are reported.

One cluster of overlapping hits represents one domain region and is recorded as one line in the resulting GFF3 file. Regarding the classification strings assigned to one region (cluster), three situations can occur:

1. There is a single classification string assigned to each position, and the classifications are uniform along all positions in the region; in this case the domain's final classification is this unique classification.
2. There are multiple classification strings assigned to one cluster, i.e. one domain; the domain is then classified to the common (less specific) level shared by all the strings.
3. There is a conflict at the domain type level; the domains are then reported with a slash (e.g. RT/INT) and the classification is ambiguous.
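
The clustering and classification steps described above can be sketched in Python as follows. This is only an illustrative simplification and not the code of dante.py: the hit dictionaries, their field names, the example classification strings and all function names are assumptions, and the score threshold is applied per cluster rather than base by base.

```python
def cluster_hits(hits):
    """Group same-strand hits whose ranges overlap, directly or through other hits."""
    clusters = []
    for strand in ("+", "-"):
        same_strand = sorted(
            (h for h in hits if h["strand"] == strand), key=lambda h: h["start"]
        )
        current, current_end = [], None
        for hit in same_strand:
            # A hit joins the open cluster when it starts at or before the cluster end.
            if current and current_end >= hit["start"]:
                current.append(hit)
                current_end = max(current_end, hit["end"])
            else:
                if current:
                    clusters.append(current)
                current, current_end = [hit], hit["end"]
        if current:
            clusters.append(current)
    return clusters


def filter_by_score(cluster, threshold=0.8):
    """Keep hits reaching the given fraction of the best score in the cluster."""
    best = max(h["score"] for h in cluster)
    return [h for h in cluster if h["score"] >= threshold * best]


def classify(cluster):
    """Return (domain name, final classification) for one cluster of hits."""
    domains = list(dict.fromkeys(h["domain"] for h in cluster))
    if len(domains) > 1:
        # Situation 3: conflict at the domain type level.
        return "/".join(domains), "Ambiguous_domain"
    levels = [h["classification"].split("|") for h in cluster]
    common = []
    for column in zip(*levels):
        # Situations 1 and 2: keep the levels shared by all strings.
        if len(set(column)) != 1:
            break
        common.append(column[0])
    return domains[0], "|".join(common)


# Hypothetical hits, for illustration only.
hits = [
    {"start": 100, "end": 400, "strand": "+", "score": 500,
     "domain": "RT", "classification": "Class_I|LTR|Ty1/copia|Ale"},
    {"start": 350, "end": 700, "strand": "+", "score": 450,
     "domain": "RT", "classification": "Class_I|LTR|Ty1/copia|Angela"},
    {"start": 650, "end": 900, "strand": "+", "score": 300,
     "domain": "RT", "classification": "Class_I|LTR|Ty3/gypsy|Athila"},
    {"start": 2000, "end": 2300, "strand": "-", "score": 200,
     "domain": "INT", "classification": "Class_I|LTR|Ty1/copia|Ale"},
]

for cluster in cluster_hits(hits):
    print(classify(filter_by_score(cluster)))
```

In this example the three overlapping forward-strand hits form one cluster, the lowest-scoring hit falls below the 80% threshold, and the two remaining hits agree only down to the Ty1/copia level, which becomes the final classification.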

**Output produced by this tool:**

1. GFF3 file of all protein domains built from all hits found by LAST. Domains are reported one per line as regions (start - end) on the original DNA sequence, including the sequence ID, alignment score and strand orientation. The last "Attributes" column contains several semicolon-separated attributes related to annotation, repeat classification, alignment and its quality. This file can undergo further filtering using the *Protein Domain Filter* tool.

   - Attributes reported always:

     Name
         type of domain; if ambiguous, reported with a slash

     Final_Classification
         definite classification based on all partial classifications of the Region_Hits_Classifications attribute, or "Ambiguous_domain" when there is an ambiguous domain type

     Region_Hits_Classifications
         classifications of all hits (comma separated) from a certain domain region that reach the set score threshold; in case of multiple annotations, the square brackets indicate the number of bases having this particular classification

   - Attributes reported only in case of an unambiguous domain type (all the attributes, including quality information, are related to the Best_Hit of the region):

     Best_Hit
         classification and position of the best alignment, i.e. the one with the highest score within the cluster; the square brackets give the percentage of the whole cluster range that this best hit covers

     Best_Hit_DB_Pos
         shows which part of the original database domain corresponding to the Best Hit was aligned to the query DNA (e.g. **Best_Hit_DB_Pos=17:75of79** means the Best Hit reported in the GFF represents the region from the 17th to the 75th of 79 amino acids in total in the original domain from the database)

     DB_Seq
         database protein sequence of the best hit mapped to the query DNA

     Query_Seq
         alignment sequence of the query DNA for the best hit

     Identity
         ratio of identical amino acids in the alignment sequence to the length of the alignment

     Similarity
         ratio of alignment positions with a positive score (according to the scoring matrix) to the length of the alignment

     Relat_Length
         ratio of the gapless length of the aligned protein sequence to the whole length of the database protein

     Relat_Interruptions
         number of interruptions (frameshifts + stop codons) in the aligned translated query sequence per each started 100 AA

     Hit_to_DB_Length
         proportion of the alignment length to the original length of the protein domain from the database
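
For downstream processing, the attributes column can be split into key/value pairs. The following Python sketch shows one possible way to do this and to decode a Best_Hit_DB_Pos value such as 17:75of79. The function names and the example attribute string are hypothetical, and the sketch assumes plain "key=value" fields separated by semicolons.

```python
import re


def parse_attributes(column):
    """Split a GFF3 attributes column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key] = value
    return attributes


def parse_db_pos(value):
    """Parse 'START:ENDofTOTAL' (e.g. '17:75of79') into three integers."""
    match = re.fullmatch(r"(\d+):(\d+)of(\d+)", value)
    if match is None:
        raise ValueError(f"unexpected Best_Hit_DB_Pos value: {value!r}")
    start, end, total = (int(group) for group in match.groups())
    return start, end, total


# Hypothetical attributes column, for illustration only.
column = "Name=RT;Best_Hit_DB_Pos=17:75of79;Identity=0.45"
attributes = parse_attributes(column)
start, end, total = parse_db_pos(attributes["Best_Hit_DB_Pos"])

# Fraction of the database domain spanned by the best hit (inclusive range).
spanned = (end - start + 1) / total
print(attributes["Name"], start, end, total, round(spanned, 2))
```

The spanned fraction computed here is derived only from the reported positions; it is not claimed to equal the Relat_Length or Hit_to_DB_Length attributes, which are based on the alignment itself.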

!NOTE: On average, the tool can process 0.5 Gbp of DNA sequence per day. This is only a rough estimate; it depends strongly on the input data (occurrence of repetitive elements) as well as on computing resources. The maximum running time of the tool is 7 days.
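
As a rough planning aid, the figures in the note can be turned into a small estimate. This Python sketch reads the quoted throughput as gigabase pairs (Gbp) per day; the function names are hypothetical and the result is no more precise than the estimate it is based on.

```python
GBP_PER_DAY = 0.5  # rough average throughput quoted in the note
MAX_DAYS = 7       # maximum running time quoted in the note


def estimated_days(input_size_gbp):
    """Estimate the running time in days for an input of the given size in Gbp."""
    return input_size_gbp / GBP_PER_DAY


def fits_time_limit(input_size_gbp):
    """Check whether the estimated running time stays within the maximum."""
    return MAX_DAYS >= estimated_days(input_size_gbp)


# Largest input expected to finish within the limit at the average rate.
max_input_gbp = GBP_PER_DAY * MAX_DAYS

print(estimated_days(1.2), fits_time_limit(1.2), max_input_gbp)
```

At the quoted average rate, inputs larger than about 3.5 Gbp would be expected to exceed the 7-day limit and may need to be split.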

    </help>
</tool>