comparison README.md @ 0:77d9f2ecb28a draft

Uploaded
author petr-novak
date Wed, 03 Jul 2019 02:45:00 -0400
parents
children
-1:000000000000 0:77d9f2ecb28a
1 # Domain based annotation of transposable elements - DANTE #
2
3 ### Authors
4 Nina Hostakova, Petr Novak, Pavel Neumann, Jiri Macas
5 Biology Centre CAS, Czech Republic
6
7
8 ### Introduction
9
10 * Protein Domains Finder [dante.py]
11 * The script scans the given DNA sequence(s) in (multi)fasta format to discover protein domains using our protein domains database.
12 * The domain search is performed with the LASTAL alignment tool.
13 * Domains are subsequently annotated and classified. If a domain has multiple annotations assigned, its classification is derived from the classification level common to all of them.
14
15 * Proteins Domains Filter [dante_gff_output_filtering.py]
16 * Filters the GFF3 output of the previous step to select a specific kind of domain and/or to adjust the quality filtering.
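
The "common classification level" rule described above can be illustrated with a minimal sketch. It assumes that each annotation is a hierarchical string whose levels are joined by `|` (for example `Class_I|LTR|Ty3/gypsy`); the separator, the example strings and the function name `common_classification` are assumptions made for illustration and are not taken from dante.py.

```python
def common_classification(annotations, sep="|"):
    """Return the deepest classification prefix shared by all annotations.

    `sep` is an assumed level separator, not confirmed by the DANTE source.
    """
    if not annotations:
        return ""
    split_levels = [a.split(sep) for a in annotations]
    common = []
    # zip stops at the shortest annotation, so the prefix never overruns it
    for levels in zip(*split_levels):
        if all(level == levels[0] for level in levels):
            common.append(levels[0])
        else:
            break
    return sep.join(common)


hits = [
    "Class_I|LTR|Ty3/gypsy|chromovirus|Tekay",
    "Class_I|LTR|Ty3/gypsy|chromovirus|CRM",
]
print(common_classification(hits))
```

With a single annotation the function returns that annotation unchanged; if the annotations already differ at the top level it returns an empty string.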
17
18 ### DEPENDENCIES ###
19
20 * python3.4 or higher with packages:
21 * numpy
22 * biopython
23 * [lastal](http://last.cbrc.jp/doc/last.html) 744 or higher
24 * ProfRep/DANTE modules:
25 * configuration.py
26
27
28 ### Protein Domains Finder ###
29
30 This tool provides **preliminary** output of all domain types; the domains are not filtered for quality.
31
32 #### INPUTS ####
33
34 * DNA sequence [multiFasta]
35
36 #### OUTPUTS ####
37
38 * **All protein domains GFF3** - individual domains are reported one per line as regions (start-end) on the original DNA sequence, including the sequence ID and strand orientation. The last "Attributes" column contains several comma-separated pieces of information related to the domain annotation, the alignment and its quality. This file can be filtered further using the Protein Domains Filter tool.
39
40 #### USAGE ####
41
42 usage: dante.py [-h] -q QUERY -pdb PROTEIN_DATABASE -cs
43 CLASSIFICATION [-oug DOMAIN_GFF] [-nld NEW_LDB]
44 [-dir OUTPUT_DIR] [-thsc THRESHOLD_SCORE]
45 [-wd WIN_DOM] [-od OVERLAP_DOM]
46
47 optional arguments:
48 -h, --help show this help message and exit
49 -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
50 output domains gff format (default: None)
51 -nld NEW_LDB, --new_ldb NEW_LDB
52 create indexed database files for lastal in case of
53 working with new protein db (default: False)
54 -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
55 specify if you want to change the output directory
56 (default: None)
57 -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
58 percentage of the best score in the cluster to be
59 tolerated when assigning annotations per base
60 (default: 80)
61 -wd WIN_DOM, --win_dom WIN_DOM
62 window to process large input sequences sequentially
63 (default: 10000000)
64 -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
65 overlap of sequences in two consecutive windows
66 (default: 10000)
67
68 required named arguments:
69 -q QUERY, --query QUERY
70 input DNA sequence to search for protein domains in a
71 fasta format. Multifasta format allowed. (default:
72 None)
73 -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
74 protein domains database file (default: None)
75 -cs CLASSIFICATION, --classification CLASSIFICATION
76 protein domains classification file (default: None)
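
The `-wd` and `-od` options describe processing a large input in consecutive, overlapping windows. The following is only a sketch of how such window coordinates could be computed; the function name `split_into_windows` and the half-open coordinate convention are assumptions, and how dante.py actually merges hits from neighbouring windows is not covered here.

```python
def split_into_windows(seq_length, window=10000000, overlap=10000):
    """Yield (start, end) half-open coordinates of consecutive windows.

    Defaults mirror the documented -wd and -od defaults; each window
    shares `overlap` bases with the previous one.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if seq_length <= 0:
        return
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, seq_length)
        yield start, end
        if end >= seq_length:
            break
        start += step


# Small numbers make the overlap easy to see
print(list(split_into_windows(25, window=10, overlap=2)))
```

A sequence shorter than one window yields a single interval covering the whole sequence.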
77
78
79
80 #### HOW TO RUN EXAMPLE ####
81 ./dante.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE
82
83 When running for the first time with a new database, use the -nld option, which lets lastal create the indexed database files:
84
85 -nld True
86
87 Use the other arguments if you wish to rename your outputs; otherwise they are created automatically with standard names.
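
When the finder is called from another Python script, the command line can be assembled from the flags listed in the usage above. This is a minimal sketch: the helper name `build_dante_command` and the file names are hypothetical, and only flags shown in the usage text are used.

```python
def build_dante_command(query, protein_db, classification,
                        new_ldb=False, script="./dante.py"):
    """Assemble the argument list for one dante.py run (hypothetical helper)."""
    cmd = [script, "-q", query, "-pdb", protein_db, "-cs", classification]
    if new_ldb:
        # first run with a new database: let lastal build its index files
        cmd += ["-nld", "True"]
    return cmd


# Placeholder file names, for illustration only
cmd = build_dante_command("genome.fasta", "protein_db.fasta",
                          "classification.csv", new_ldb=True)
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.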
88
89 ### Protein Domains Filter ###
90
91 The script filters the Protein Domains Finder output for quality and/or extracts a specific type of protein domain or mobile element of origin. For the filtered domains it reports the protein sequences translated from the original DNA.
92
93 WHEN NO PARAMETERS ARE GIVEN, IT PERFORMS QUALITY FILTERING USING THE DEFAULT PARAMETERS (optimized for Viridiplantae species)
94
95 #### INPUTS ####
96 * GFF3 file produced by dante.py OR an already filtered GFF3
97
98 #### Filtering options ####
99 * QUALITY:
100 - Min relative length of the alignment to the protein domain from the DB (without gaps)
101 - Identity
102 - Similarity (scoring matrix: BLOSUM80)
103 - Interruptions in the reading frame (frameshifts + stop codons) per each started 100 AA
104 - Max proportion of the alignment length to the original length of the database domain sequence
105 * DOMAIN TYPE: 'Name' attribute in GFF - see choices below
106 Records with an ambiguous domain type (e.g. INT/RH) are filtered out automatically
107
108 * MOBILE ELEMENT TYPE:
109 arbitrary substring of the element classification ('Final_Classification' attribute in GFF)
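
The quality criteria listed above can be read as a single predicate over one domain record. The sketch below assumes the record is a dictionary with the hypothetical keys `relative_length`, `identity`, `similarity`, `length_proportion`, `interruptions` and `alignment_length`; neither these names nor the exact comparison operators are taken from dante_gff_output_filtering.py. The default thresholds are the ones documented in the usage text of this tool.

```python
import math


def passes_quality(rec, th_length=0.8, th_identity=0.35, th_similarity=0.45,
                   interruptions=3, max_len_proportion=1.2):
    """Return True if a domain record meets all quality thresholds.

    Keys of `rec` are hypothetical names chosen for this illustration.
    """
    # the tolerance applies "per each started 100 AA" of the alignment
    started_hundreds = math.ceil(rec["alignment_length"] / 100)
    allowed_interruptions = interruptions * started_hundreds
    return (rec["relative_length"] >= th_length
            and rec["identity"] >= th_identity
            and rec["similarity"] >= th_similarity
            and rec["length_proportion"] <= max_len_proportion
            and rec["interruptions"] <= allowed_interruptions)


record = {
    "relative_length": 0.92, "identity": 0.41, "similarity": 0.55,
    "length_proportion": 1.05, "interruptions": 4, "alignment_length": 150,
}
print(passes_quality(record))
```

In this example the alignment spans two started hundreds of amino acids, so up to six interruptions are tolerated and the record passes.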
110
111 #### OUTPUTS ####
112 * filtered GFF3 file
113 * fasta file of translated protein sequences for the aligned domains that match the filtering criteria
114 ! As it is taken from the best hit alignment reported by LAST, it does not necessarily cover the whole region reported as a domain in the GFF
115
116 #### USAGE ####
117
118 usage: dante_gff_output_filtering.py [-h] -dg DOM_GFF [-ouf DOMAINS_FILTERED]
119 [-dps DOMAINS_PROT_SEQ]
120 [-thl {float range 0.0..1.0}]
121 [-thi {float range 0.0..1.0}]
122 [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
123 [-mlen MAX_LEN_PROPORTION]
124 [-sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}]
125 [-el ELEMENT_TYPE] [-dir OUTPUT_DIR]
126
127
128
129 optional arguments:
130 -h, --help show this help message and exit
131 -ouf DOMAINS_FILTERED, --domains_filtered DOMAINS_FILTERED
132 output filtered domains gff file (default: None)
133 -dps DOMAINS_PROT_SEQ, --domains_prot_seq DOMAINS_PROT_SEQ
134 output file containing domains protein sequences
135 (default: None)
136 -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
137 proportion of alignment length threshold (default:
138 0.8)
139 -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
140 proportion of alignment identity threshold (default:
141 0.35)
142 -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
143 threshold for alignment proportional similarity
144 (default: 0.45)
145 -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
146 interruptions (frameshifts + stop codons) tolerance
147 threshold per 100 AA (default: 3)
148 -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
149 maximal proportion of alignment length to the original
150 length of protein domain from database (default: 1.2)
151 -sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}, --selected_dom {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}
152 filter output domains based on the domain type
153 (default: All)
154 -el ELEMENT_TYPE, --element_type ELEMENT_TYPE
155 filter output domains by typing substring from
156 classification (default: )
157 -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
158 specify if you want to change the output directory
159 (default: None)
160
161 required named arguments:
162 -dg DOM_GFF, --dom_gff DOM_GFF
163 basic unfiltered gff file of all domains (default:
164 None)
165
166
167
168 #### HOW TO RUN EXAMPLE ####
169 e.g. getting quality-filtered integrase (INT) domains of all gypsy transposable elements:
170
171 ./dante_gff_output_filtering.py --dom_gff PATH_TO_INPUT_GFF --selected_dom INT --element_type Ty3/gypsy
172
173
174 ### Extract Domains Nucleotide Sequences ###
175
176 This tool extracts nucleotide sequences of protein domains from the reference DNA based on DANTE's output. It can be used, e.g., for deriving phylogenetic relations of individual mobile element classes within a species.
177
178 #### INPUTS ####
179
180 * original DNA sequence in multifasta format to extract the domains from
181 * GFF3 file of protein domains (**DANTE's output** - preferably filtered for quality and specific domain type)
182 * Domains database classification table (to check the classification level)
183
184 #### OUTPUTS ####
185
186 * fasta files of domain nucleotide sequences for individual transposon lineages
187 * txt file of domain counts extracted for individual lineages
188
189 **- For GALAXY usage, all sequences are concatenated in a single fasta file**
190
191 #### USAGE ####
192 usage: dante_gff_to_dna.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
193 CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]
194
195 optional arguments:
196 -h, --help show this help message and exit
197 -i INPUT_DNA, --input_dna INPUT_DNA
198 path to input DNA sequence
199 -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
200 GFF file of protein domains
201 -cs CLASSIFICATION, --classification CLASSIFICATION
202 protein domains classification file
203 -out OUT_DIR, --out_dir OUT_DIR
204 output directory
205 -ex EXTENDED, --extended EXTENDED
206 extend the domains edges if not the whole database
207 sequence was aligned
208
209 #### HOW TO RUN EXAMPLE ####
210 ./dante_gff_to_dna.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True
211
212
213
214
215
216
217
218
219