annotate README.md @ 4:e27e86406f56 draft

Uploaded
author petr-novak
date Wed, 26 Jun 2019 10:23:50 -0400
parents a5f1638b73be
children
# REPEATS ANNOTATION TOOLS FOR ASSEMBLIES #

## 1. PROFREP ##
*- **PROF**iles of **REP**eats -*

The main ProfRep tool uses the outputs of RepeatExplorer to annotate repeats in DNA sequences (typically assemblies, but not necessarily). It also provides repetitive profiles of the sequence, showing the quantitative representation of individual repeats along the sequence as well as its overall repetitiveness.

### DEPENDENCIES ###

* python 3.4 or higher with packages:
    * numpy
    * matplotlib
    * biopython
* [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279690/) or higher
* [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
* [cd-hit](http://weizhongli-lab.org/cd-hit/)
* [JBrowse](http://jbrowse.org/install/) - **Only the bin directory is needed; it does not have to be installed under a web server**

* ProfRep Modules:
    * gff.py
    * visualization.py
    * configuration.py
    * protein_domains.py
    * domains_filtering.py

* ProfRep databases

Precompiled ProfRep annotation datasets are available for a limited number of species. The list of species can be found in the file [prepared_datasets.txt](tool_data/prepared_datasets). The databases include large files and must be downloaded from our website:

    cd tool_data
    wget http://repeatexplorer.org/repeatexplorer/wp-content/uploads/profrep.tar.gz
    tar xzvf profrep.tar.gz

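Before the first run it can help to confirm that the dependencies listed above are visible from the current environment. The following is a minimal sketch, not part of ProfRep: the executable names (`blastn`, `makeblastdb`, `wigToBigWig`, `cd-hit`) are assumptions about how the tools are installed on your system, and `Bio` is the import name of biopython.

```python
import importlib.util
import shutil

# Assumed names; adjust them to your installation.
PYTHON_PACKAGES = ["numpy", "matplotlib", "Bio"]  # biopython is imported as "Bio"
EXECUTABLES = ["blastn", "makeblastdb", "wigToBigWig", "cd-hit"]


def find_missing(packages=PYTHON_PACKAGES, executables=EXECUTABLES):
    """Return the Python packages and executables that cannot be found."""
    missing_packages = [
        name for name in packages if importlib.util.find_spec(name) is None
    ]
    missing_executables = [
        name for name in executables if shutil.which(name) is None
    ]
    return {"packages": missing_packages, "executables": missing_executables}


if __name__ == "__main__":
    missing = find_missing()
    for kind, names in missing.items():
        status = ", ".join(names) if names else "all found"
        print("{}: {}".format(kind, status))
```

The check only tells you whether a name can be resolved; it does not verify the minimum versions given in the list.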
#### INPUTS ####

* **DNA sequence(s) to annotate** [multiFASTA]

* **Species-specific dataset** available from the RepeatExplorer (RE) archive, consisting of:

    * NGS read sequences [multiFASTA]
        * in RE archive: *seqclust -> sequences -> sequences.fasta*
    * CLS file of clusters and the reads belonging to them [multiFASTA]
        * in RE archive: *seqclust -> clustering -> hitsort.cls*
    * Classification table [TSV, CSV]
        * in RE archive: *PROFREP_CLASSIFICATION_TEMPLATE.csv* (automatic classification)

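To inspect which reads belong to which cluster, the CLS file can be read with a few lines of Python. This is only a sketch under an assumed layout: each cluster is taken to have a FASTA-like header (`>CL1` followed by the cluster size) and the next line(s) are taken to hold whitespace-separated read IDs. Check your own `hitsort.cls` before relying on it; `parse_cls` is a hypothetical helper, not a ProfRep function.

```python
def parse_cls(lines):
    """Map cluster name -> list of read IDs (assumed CLS layout, see above)."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header such as ">CL1<tab>3": keep only the cluster name.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            clusters[current] = []
        elif current is not None:
            # Read IDs belonging to the most recent header.
            clusters[current].extend(line.split())
    return clusters


# Made-up example content, for illustration only.
example = [">CL1\t3", "read1 read2 read3", ">CL2\t2", "read4 read5"]
clusters = parse_cls(example)
print(sorted(clusters))
```

With a real file, pass an open file handle instead of the list: `parse_cls(open("hitsort.cls"))`.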
#### OUTPUTS ####

* **HTML summary report, JBrowse Data Directory** - shows basic information and repetitive profile graphs, as well as protein domains (optional), for individual sequences (up to 50). This output also serves as a data directory for the [JBrowse](https://jbrowse.org/) genome browser. You can create a standalone JBrowse instance for further detailed visualization of the output tracks using the Galaxy-integrated tool. The output can also be downloaded as an archive containing all the data needed for visualization with a locally installed JBrowse server (see OUTPUT VISUALIZATION below for more about visualization).
* **Ns GFF** - reports regions of unspecified (N) bases in the sequence
* **Repeats GFF** - reports repetitive regions of at least a certain length (default **80**) whose hits/copy numbers are above a threshold (default **3**)
* **Domains GFF** - reports protein domains, domain classification, strand orientation and alignment sequences
* Log file

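The two defaults of the Repeats GFF (threshold **3**, minimum length **80**) can be illustrated on a per-position profile of hits. The sketch below is not ProfRep's implementation; it assumes the profile is a plain list of numbers and that "above the threshold" means strictly greater. The function name is hypothetical, and the parameter names mirror the `--threshold_repeat` and `--threshold_segment` options.

```python
def repetitive_segments(profile, threshold_repeat=3, threshold_segment=80):
    """Return 1-based inclusive (start, end) runs where every position is
    above threshold_repeat and the run is at least threshold_segment long."""
    segments = []
    run_start = None
    for pos, value in enumerate(profile, start=1):
        if value > threshold_repeat:
            if run_start is None:
                run_start = pos  # a new run begins here
        elif run_start is not None:
            # The run ended at pos - 1; keep it only if it is long enough.
            if pos - run_start >= threshold_segment:
                segments.append((run_start, pos - 1))
            run_start = None
    # A run may reach the end of the sequence.
    if run_start is not None and len(profile) - run_start + 1 >= threshold_segment:
        segments.append((run_start, len(profile)))
    return segments


# Made-up profile: a 100 bp repetitive stretch and a 20 bp one.
profile = [0] * 10 + [5] * 100 + [0] * 10 + [5] * 20 + [0] * 5
print(repetitive_segments(profile))
```

Only the 100 bp stretch is long enough to be reported; the 20 bp stretch falls under the length threshold.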
### Running ProfRep ###

    usage: profrep.py [-h] -q QUERY -rdb READS -a ANN_TBL -c CLS [-id DB_ID]
                      [-bs BIT_SCORE] [-m MAX_ALIGNMENTS] [-e E_VALUE]
                      [-df DUST_FILTER] [-ws WORD_SIZE] [-t TASK] [-n NEW_DB]
                      [-w WINDOW] [-o OVERLAP] [-pd PROTEIN_DOMAINS]
                      [-pdb PROTEIN_DATABASE] [-cs CLASSIFICATION] [-wd WIN_DOM]
                      [-od OVERLAP_DOM] [-thsc THRESHOLD_SCORE]
                      [-thl {float range 0.0..1.0}] [-thi {float range 0.0..1.0}]
                      [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
                      [-mlen MAX_LEN_PROPORTION] [-lg LOG_FILE] [-ouf OUTPUT_GFF]
                      [-oug DOMAIN_GFF] [-oun N_GFF] [-hf HTML_FILE]
                      [-hp HTML_PATH] [-cn COPY_NUMBERS] [-gs GENOME_SIZE]
                      [-thr THRESHOLD_REPEAT] [-thsg THRESHOLD_SEGMENT]
                      [-jb JBROWSE_BIN]

    optional arguments:
      -h, --help            show this help message and exit

    required arguments:
      -q QUERY, --query QUERY
                            input DNA sequence in (multi)fasta format (default:
                            None)
      -rdb READS, --reads READS
                            blast database of all sequencing reads (default: None)
      -a ANN_TBL, --ann_tbl ANN_TBL
                            clusters annotation table, tab-separated number of
                            cluster and its classification (default: None)
      -c CLS, --cls CLS     cls file containing reads assigned to clusters
                            (hitsort.cls) (default: None)

    alternative required arguments - prepared datasets:
      -id DB_ID, --db_id DB_ID
                            annotation dataset ID (first column of datasets table)
                            (default: None)

    optional arguments - BLAST Search:
      -bs BIT_SCORE, --bit_score BIT_SCORE
                            bitscore threshold (default: 50)
      -m MAX_ALIGNMENTS, --max_alignments MAX_ALIGNMENTS
                            blast filtering option: maximal number of alignments
                            in the output (default: 10000000)
      -e E_VALUE, --e_value E_VALUE
                            blast setting option: e-value (default: 0.1)
      -df DUST_FILTER, --dust_filter DUST_FILTER
                            dust filters low-complexity regions during BLAST
                            search (default: '20 64 1')
      -ws WORD_SIZE, --word_size WORD_SIZE
                            blast search option: initial word size for alignment
                            (default: 11)
      -t TASK, --task TASK  type of blast to be triggered (default: blastn)
      -n NEW_DB, --new_db NEW_DB
                            create a new blast database, USE THIS OPTION IF YOU
                            RUN PROFREP WITH NEW DATABASE FOR THE FIRST TIME
                            (default: True)

    optional arguments - Parallel Processing:
      -w WINDOW, --window WINDOW
                            sliding window size for parallel processing (default:
                            5000)
      -o OVERLAP, --overlap OVERLAP
                            overlap for parallely processed regions, set greater
                            than a read size (default: 150)

    optional arguments - Protein Domains:
      -pd PROTEIN_DOMAINS, --protein_domains PROTEIN_DOMAINS
                            use module for protein domains (default: False)
      -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
                            protein domains database (default: None)
      -cs CLASSIFICATION, --classification CLASSIFICATION
                            protein domains classification file (default: None)
      -wd WIN_DOM, --win_dom WIN_DOM
                            protein domains module: sliding window to process
                            large input sequences sequentially (default: 10000000)
      -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
                            protein domains module: overlap of sequences in two
                            consecutive windows (default: 10000)
      -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
                            protein domains module: percentage of the best score
                            within the cluster to significant domains (default:
                            80)
      -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
                            proportion of alignment length threshold (default:
                            0.8)
      -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
                            proportion of alignment identity threshold (default:
                            0.35)
      -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
                            threshold for alignment proportional similarity
                            (default: 0.45)
      -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
                            interruptions (frameshifts + stop codons) tolerance
                            threshold per 100 AA (default: 3)
      -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
                            maximal proportion of alignment length to the original
                            length of protein domain from database (default: 1.2)

    optional arguments - Output Paths:
      -lg LOG_FILE, --log_file LOG_FILE
                            path to log file (default: log.txt)
      -ouf OUTPUT_GFF, --output_gff OUTPUT_GFF
                            path to output gff of repetitive regions (default:
                            output_repeats.gff)
      -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
                            path to output gff of protein domains (default:
                            output_domains.gff)
      -oun N_GFF, --n_gff N_GFF
                            path to output gff of N regions (default:
                            N_regions.gff)
      -hf HTML_FILE, --html_file HTML_FILE
                            path to output html file (default: output.html)
      -hp HTML_PATH, --html_path HTML_PATH
                            path to html extra files (default: profrep_output_dir)

    optional arguments - Copy Numbers/Hits :
      -cn COPY_NUMBERS, --copy_numbers COPY_NUMBERS
                            convert hits to copy numbers (default: False)
      -gs GENOME_SIZE, --genome_size GENOME_SIZE
                            genome size is required when converting hits to copy
                            numbers and you use custom data (default: None)
      -thr THRESHOLD_REPEAT, --threshold_repeat THRESHOLD_REPEAT
                            threshold for hits/copy numbers per position to be
                            considered repetitive (default: 3)
      -thsg THRESHOLD_SEGMENT, --threshold_segment THRESHOLD_SEGMENT
                            threshold for the length of repetitive segment to be
                            reported (default: 80)

    optional arguments - Enviroment Variables:
      -jb JBROWSE_BIN, --jbrowse_bin JBROWSE_BIN
                            path to JBrowse bin directory (default: None)

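The `--window` and `--overlap` options describe how the query is cut into overlapping pieces for parallel processing. The sketch below shows one plausible way to compute such windows. The stepping rule (each window starts `window - overlap` bases after the previous one) is an assumption, not necessarily what profrep.py does, and `sliding_windows` is a hypothetical name. An overlap larger than the read length presumably ensures that a read-sized hit spanning a window boundary lies completely inside at least one window.

```python
def sliding_windows(seq_length, window=5000, overlap=150):
    """Yield 0-based half-open (start, end) windows over a sequence;
    consecutive windows share `overlap` bases."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    if seq_length <= 0:
        return
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, seq_length)
        yield start, end
        if end >= seq_length:
            break  # the last window reached the end of the sequence
        start += step


for start, end in sliding_windows(12000):
    print(start, end)
```

Each window can then be searched independently, and hits in the shared `overlap` bases have to be reconciled when the results are merged.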
#### HOW TO RUN EXAMPLE ####

    ./profrep.py --query PATH_TO_DNA_SEQ --reads PATH_TO_READS --ann_tbl PATH_TO_CLUSTERS_CLASSIFICATION --cls PATH_TO_hitsort.cls

When running for the first time with a new reads database, add:

    --new_db True

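When ProfRep is called from another script, the command line above can be assembled as an argument list. This is a small sketch using only the long options shown in the usage text; the input file names are placeholders and `build_profrep_command` is a hypothetical helper.

```python
import shlex


def build_profrep_command(query, reads, ann_tbl, cls, new_db=None,
                          script="./profrep.py"):
    """Assemble the profrep.py call as a list suitable for subprocess.run."""
    cmd = [script, "--query", query, "--reads", reads,
           "--ann_tbl", ann_tbl, "--cls", cls]
    if new_db is not None:
        # Pass --new_db only when the caller sets it explicitly.
        cmd += ["--new_db", str(new_db)]
    return cmd


# Placeholder file names, for illustration only.
cmd = build_profrep_command("genome.fasta", "reads.fasta",
                            "classification.csv", "hitsort.cls", new_db=True)
print(" ".join(shlex.quote(part) for part in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.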
### ProfRep Data Preparation ###

When using custom input datasets, these tools can be used to easily obtain the correct files and to prepare reduced datasets that speed up the main ProfRep analysis:

* Extract Data For ProfRep (extract_data_for_profrep.py)
* ProfRep DB Reducing (profrep_db_reducing.py)

### ProfRep Supplementary Tools ###

These additional tools can be used for further work with the ProfRep outputs:

* ProfRep Refiner (profrep_refining.py)
* ProfRep Masker (profrep_masking.py)
* GFF Region Selector (gff_selection.py)

### FOR MORE INFO ABOUT PREPARATION AND SUPPLEMENTARY TOOLS PLEASE READ PROFREP WIKI ###

## 2. DANTE ##
*- **D**omain based **AN**notation of **T**ransposable **E**lements -*

* Protein Domains Finder [protein_domains.py]
    * Scans the given DNA sequence(s) in (multi)fasta format to discover protein domains using our protein domains database.
    * Domain search is performed with the LASTAL alignment tool.
    * Domains are subsequently annotated and classified. If a domain has multiple annotations assigned, its classification is derived from the common classification level of all of them.

* Protein Domains Filter [domains_filtering.py]
    * Filters the GFF3 output from the previous step to obtain a certain kind of domain and/or to adjust the quality filtering.

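The "common classification level" of several annotations can be pictured as the longest shared leading part of their hierarchical classifications. The sketch below illustrates that idea only: the `|` separator and the example lineages are assumptions made for illustration, and `common_classification` is not a DANTE function.

```python
def common_classification(classifications, sep="|"):
    """Return the longest shared leading path of hierarchical labels."""
    split = [label.split(sep) for label in classifications]
    common = []
    # Walk the hierarchy level by level until the labels disagree.
    for levels in zip(*split):
        if len(set(levels)) != 1:
            break
        common.append(levels[0])
    return sep.join(common)


# Illustrative labels; real classification strings may look different.
hits = ["Class_I|LTR|Ty1/copia|Ale", "Class_I|LTR|Ty1/copia|Ivana"]
print(common_classification(hits))
```

When the annotations already disagree at the top level, the result is an empty string, meaning no common level exists.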
### DEPENDENCIES ###

* python 3.4 or higher with packages:
    * numpy
    * biopython
* [lastal](http://last.cbrc.jp/doc/last.html) 744 or higher
* ProfRep/DANTE modules:
    * configuration.py

### Protein Domains Finder ###

This tool provides **preliminary** output of all domain types, which are not filtered for quality.

#### INPUTS ####

* DNA sequence [multiFasta]

#### OUTPUTS ####

* **All protein domains GFF3** - individual domains are reported one per line as regions (start-end) on the original DNA sequence, including the sequence ID and strand orientation. The last "Attributes" column contains several comma-separated pieces of information related to the domain annotation, the alignment and its quality. This file can undergo further filtering using the Protein Domains Filter tool.

#### USAGE ####

    usage: protein_domains.py [-h] -q QUERY -pdb PROTEIN_DATABASE -cs
                              CLASSIFICATION [-oug DOMAIN_GFF] [-nld NEW_LDB]
                              [-dir OUTPUT_DIR] [-thsc THRESHOLD_SCORE]
                              [-wd WIN_DOM] [-od OVERLAP_DOM]

    optional arguments:
      -h, --help            show this help message and exit
      -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
                            output domains gff format (default: None)
      -nld NEW_LDB, --new_ldb NEW_LDB
                            create indexed database files for lastal in case of
                            working with new protein db (default: False)
      -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
                            specify if you want to change the output directory
                            (default: None)
      -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
                            percentage of the best score in the cluster to be
                            tolerated when assigning annotations per base
                            (default: 80)
      -wd WIN_DOM, --win_dom WIN_DOM
                            window to process large input sequences sequentially
                            (default: 10000000)
      -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
                            overlap of sequences in two consecutive windows
                            (default: 10000)

    required named arguments:
      -q QUERY, --query QUERY
                            input DNA sequence to search for protein domains in a
                            fasta format. Multifasta format allowed. (default:
                            None)
      -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
                            protein domains database file (default: None)
      -cs CLASSIFICATION, --classification CLASSIFICATION
                            protein domains classification file (default: None)

#### HOW TO RUN EXAMPLE ####

    ./protein_domains.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE

When running for the first time with a new database, use the -nld option to let lastal create the indexed database files:

    -nld True

Use the other arguments if you wish to rename your outputs; otherwise they will be created automatically with standard names.

a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
301 ### Protein Domains Filter ###
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
302
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
303 The script performs Protein Domains Finder output filtering for quality and/or extracting specific type of protein domain or mobile elements of origin. For the filtered domains it reports their translated protein sequence of original DNA.
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
304
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
305 WHEN NO PARAMETERS GIVEN, IT PERFORMS QUALITY FILTERING USING THE DEFAULT PARAMETRES (optimized for Viridiplantae species)
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
306
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
307 #### INPUTS ####
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
308 * GFF3 file produced by protein_domains.py OR already filtered GFF3
a5f1638b73be Uploaded
petr-novak
parents:
diff changeset
309
#### Filtering options ####
* QUALITY:
    - Minimum relative length of the alignment to the protein domain from the database (without gaps)
    - Identity
    - Similarity (scoring matrix: BLOSUM80)
    - Interruptions in the reading frame (frameshifts + stop codons) per each started 100 AA
    - Maximum proportion of the alignment length to the original length of the database domain sequence
* DOMAIN TYPE: 'Name' attribute in GFF - see the choices below.
  Records with an ambiguous domain type (e.g. INT/RH) are filtered out automatically.
* MOBILE ELEMENT TYPE: arbitrary substring of the element classification ('Final_Classification' attribute in GFF)

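The quality criteria listed above can be illustrated with a minimal sketch. It is not the code of `domains_filtering.py`: the `DomainHit` container and its field names are hypothetical, the default thresholds are copied from the tool's help text, and two details are assumptions (thresholds are treated as inclusive, and "per each started 100 AA" is read as rounding the alignment length up to the next hundred).

```python
import math
from dataclasses import dataclass


@dataclass
class DomainHit:
    """Hypothetical container for the values the quality filter looks at."""
    relative_length: float   # alignment length / database domain length, gaps excluded
    identity: float          # proportion of identical positions, 0..1
    similarity: float        # proportion of similar positions (BLOSUM80), 0..1
    interruptions: int       # frameshifts + stop codons in the alignment
    aligned_aa: int          # alignment length in amino acids
    len_proportion: float    # alignment length / original database domain length


def passes_quality(hit, th_length=0.8, th_identity=0.35, th_similarity=0.45,
                   interruptions_per_100=3, max_len_proportion=1.2):
    """Return True if the hit satisfies all five quality criteria."""
    # Assumption: the tolerance grows with every started block of 100 AA.
    allowed = interruptions_per_100 * math.ceil(hit.aligned_aa / 100)
    return (hit.relative_length >= th_length
            and hit.identity >= th_identity
            and hit.similarity >= th_similarity
            and hit.interruptions <= allowed
            and hit.len_proportion <= max_len_proportion)
```

Under these assumptions a 150 AA alignment may carry up to 6 interruptions with the default tolerance of 3 per 100 AA.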
#### OUTPUTS ####
* filtered GFF3 file
* fasta file of translated protein sequences for the aligned domains that match the filtering criteria
    ! As it is taken from the best hit alignment reported by LAST, it does not necessarily cover the whole region reported as a domain in the GFF

#### USAGE ####

    usage: domains_filtering.py [-h] -dg DOM_GFF [-ouf DOMAINS_FILTERED]
                                [-dps DOMAINS_PROT_SEQ]
                                [-thl {float range 0.0..1.0}]
                                [-thi {float range 0.0..1.0}]
                                [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
                                [-mlen MAX_LEN_PROPORTION]
                                [-sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}]
                                [-el ELEMENT_TYPE] [-dir OUTPUT_DIR]

    optional arguments:
      -h, --help            show this help message and exit
      -ouf DOMAINS_FILTERED, --domains_filtered DOMAINS_FILTERED
                            output filtered domains gff file (default: None)
      -dps DOMAINS_PROT_SEQ, --domains_prot_seq DOMAINS_PROT_SEQ
                            output file containg domains protein sequences
                            (default: None)
      -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
                            proportion of alignment length threshold (default:
                            0.8)
      -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
                            proportion of alignment identity threshold (default:
                            0.35)
      -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
                            threshold for alignment proportional similarity
                            (default: 0.45)
      -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
                            interruptions (frameshifts + stop codons) tolerance
                            threshold per 100 AA (default: 3)
      -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
                            maximal proportion of alignment length to the original
                            length of protein domain from database (default: 1.2)
      -sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}, --selected_dom {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}
                            filter output domains based on the domain type
                            (default: All)
      -el ELEMENT_TYPE, --element_type ELEMENT_TYPE
                            filter output domains by typing substring from
                            classification (default: )
      -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
                            specify if you want to change the output directory
                            (default: None)

    required named arguments:
      -dg DOM_GFF, --dom_gff DOM_GFF
                            basic unfiltered gff file of all domains (default:
                            None)

#### HOW TO RUN EXAMPLE ####
e.g. getting quality-filtered integrase (INT) domains of all gypsy transposable elements:

    ./domains_filtering.py --dom_gff PATH_TO_INPUT_GFF --selected_dom INT --element_type Ty3/gypsy

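To show what such a domain type and element type selection amounts to on individual GFF3 records, here is a small sketch. It only relies on the 'Name' and 'Final_Classification' attributes named in this README; the helper names are hypothetical, the script's real implementation may differ, and recognising an ambiguous domain type by the "/" in its name is an assumption based on the INT/RH example.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}


def select_record(gff_line, selected_dom="All", element_type=""):
    """Return True if a GFF3 record matches the requested domain and element type."""
    attrs = parse_attributes(gff_line.rstrip("\n").split("\t")[8])
    name = attrs.get("Name", "")
    if "/" in name:
        # Assumption: ambiguous domain types look like 'INT/RH' and are always dropped.
        return False
    if selected_dom != "All" and name != selected_dom:
        return False
    # The element type is matched as an arbitrary substring of the classification.
    return element_type in attrs.get("Final_Classification", "")
```

With `selected_dom="INT"` and `element_type="Ty3/gypsy"`, a record is kept only if its 'Name' is exactly INT and its 'Final_Classification' contains the substring Ty3/gypsy.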
### Extract Domains Nucleotide Sequences ###

This tool extracts nucleotide sequences of protein domains from the reference DNA based on DANTE's output. It can be used, e.g., for deriving phylogenetic relations of individual mobile element classes within a species.

#### INPUTS ####

* original DNA sequence in multifasta format to extract the domains from
* GFF3 file of protein domains (**DANTE's output** - preferably filtered for quality and specific domain type)
* Domains database classification table (to check the classification level)

#### OUTPUTS ####

* fasta files of domain nucleotide sequences for individual transposon lineages
* txt file of domain counts extracted for individual lineages

**- For GALAXY usage all are concatenated in a single fasta file**

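The core idea behind these outputs (cut each domain out of the reference by its GFF3 coordinates, reverse-complement hits on the minus strand, group the results per lineage and count them) can be sketched as follows. This is an illustration under stated assumptions, not the code of `extract_domains_seqs.py`: the function names and the record tuple layout are hypothetical, and the optional extension of domain edges is left out.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_domains(sequences, records):
    """Group domain nucleotide sequences by lineage.

    sequences: dict mapping sequence id to DNA string
    records:   iterable of (seq_id, start, end, strand, lineage) tuples,
               with 1-based inclusive coordinates as in GFF3
    """
    per_lineage = defaultdict(list)
    for seq_id, start, end, strand, lineage in records:
        fragment = sequences[seq_id][start - 1:end]
        if strand == "-":
            fragment = reverse_complement(fragment)
        per_lineage[lineage].append(fragment)
    return dict(per_lineage)


def count_per_lineage(per_lineage):
    """Number of extracted domains for every lineage."""
    return {lineage: len(seqs) for lineage, seqs in per_lineage.items()}
```

Each key of the returned dictionary would correspond to one per-lineage fasta file, and `count_per_lineage` to the contents of the counts txt file.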
#### USAGE ####

    usage: extract_domains_seqs.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
                                   CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUT_DNA, --input_dna INPUT_DNA
                            path to input DNA sequence
      -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
                            GFF file of protein domains
      -cs CLASSIFICATION, --classification CLASSIFICATION
                            protein domains classification file
      -out OUT_DIR, --out_dir OUT_DIR
                            output directory
      -ex EXTENDED, --extended EXTENDED
                            extend the domains edges if not the whole datatabase
                            sequence was aligned

#### HOW TO RUN EXAMPLE ####

    ./extract_domains_seqs.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True

### GALAXY implementation ###

#### Dependencies ####

* python 3.4 or higher with packages:
    * numpy
    * matplotlib
    * biopython
* [BLAST 2.2.28+](https://www.ncbi.nlm.nih.gov/books/NBK279671/) or higher
* [LAST](http://last.cbrc.jp/doc/last.html) 744 or higher:
    * [download](http://last.cbrc.jp/)
    * [install](http://last.cbrc.jp/doc/last.html)
* [wigToBigWig](http://hgdownload.cse.ucsc.edu/admin/exe/)
* [cd-hit](http://weizhongli-lab.org/cd-hit/)
* [JBrowse](http://jbrowse.org/install/) - **Only bin needed, does not have to be installed under a web server**

#### Source ####

https://nina_h@bitbucket.org/nina_h/profrep.git

branch "cerit" --> only Pisum sativum Terno in the prepared annotation datasets

branch "develop"/"master" --> extended internal database of species (not published, or for internal purposes)

#### Configuration ####

Add tools

    <section name="Assembly annotation" id="annotation">
      <label id="profrep_prepare" text="ProfRep Data Preparation" />
        <tool file="profrep/extract_data_for_profrep.xml" />
        <tool file="profrep/db_reducing.xml" />
      <label id="profrep_main" text="Profrep" />
        <tool file="profrep/profrep.xml" />
      <label id="profrep_supplementary" text="Profrep Supplementary" />
        <tool file="profrep/profrep_refine.xml" />
        <tool file="profrep/profrep_masking.xml" />
        <tool file="profrep/gff_select_region.xml" />
      <label id="domains" text="DANTE" />
        <tool file="profrep/protein_domains.xml" />
        <tool file="profrep/domains_filtering.xml" />
        <tool file="profrep/extract_domains_seqs.xml" />
    </section>

to

    $__root_dir__/config/tool_conf.xml

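After editing `tool_conf.xml`, a quick sanity check can confirm that the section is well-formed and lists the expected tool files. The sketch below uses only Python's standard `xml.etree.ElementTree`; the function name `tools_in_section` is hypothetical and the check is not part of ProfRep or Galaxy.

```python
import xml.etree.ElementTree as ET


def tools_in_section(tool_conf_xml, section_id="annotation"):
    """Return the 'file' attributes of all <tool> entries in the given section.

    tool_conf_xml is the XML text; parsing raises ET.ParseError if it is malformed.
    """
    root = ET.fromstring(tool_conf_xml)
    # iter() also yields the root itself, so a bare <section> snippet works too.
    for section in root.iter("section"):
        if section.get("id") == section_id:
            return [tool.get("file") for tool in section.findall("tool")]
    return []
```

For a real installation the text would be read from the `tool_conf.xml` file first, e.g. with `open(path).read()`, where `path` is wherever your Galaxy instance keeps that file.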
------------------------------------------------------------------------

Place PROFREP_DB files to

    $__tool_data_path__/profrep

*REMARK* PROFREP_DB files contain prepared annotation data for the species in the drop-down menu:

* sequences.fasta - including the BLAST database files, which were created by:

        makeblastdb -in sequences.fasta -dbtype nucl

* hitsort.cls file
* classification table

Place DANTE_DB files to

    $__tool_data_path__/protein_domains

*REMARK* DANTE_DB files contain the protein domains database files:

* protein domains database including the LASTAL database files, which were created by:

        lastdb -p -cR01 DATABASE_NAME DATABASE_NAME

    (the lastal database files are actually enough; the original database table does not have to be present)
* classification table

------------------------------------------------------------------------

Create

    $__root_dir__/database/dependencies/profrep/1.0.0/env.sh

containing:

    export JBROWSE_BIN=PATH_TO_JBROWSE_DIR/bin

------------------------------------------------------------------------

Link the following files into the galaxy tool-data dir

    ln -s $__tool_directory__/profrep/domains_data/select_domain.txt $__tool_data_path__
    ln -s $__tool_directory__/profrep/profrep_data/prepared_datasets.txt $__tool_data_path__
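
The same linking step can be scripted, which helps when it has to be repeated on several Galaxy instances. This is a minimal sketch: `link_tool_data` is a hypothetical helper, and its two parameters stand for the directories that `$__tool_directory__` and `$__tool_data_path__` resolve to on your installation.

```python
from pathlib import Path

# Files named in the two 'ln -s' commands, relative to the tool directory.
LINKED_FILES = (
    "profrep/domains_data/select_domain.txt",
    "profrep/profrep_data/prepared_datasets.txt",
)


def link_tool_data(tool_directory, tool_data_path):
    """Create the symlinks in tool_data_path; existing entries are left untouched."""
    links = []
    for relative in LINKED_FILES:
        source = Path(tool_directory) / relative
        target = Path(tool_data_path) / source.name
        if not target.exists() and not target.is_symlink():
            target.symlink_to(source)
        links.append(target)
    return links
```

Running it a second time is harmless, because a target that already exists (or is already a symlink) is skipped.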