diff README.md @ 0:77d9f2ecb28a draft
Uploaded
author | petr-novak |
---|---|
date | Wed, 03 Jul 2019 02:45:00 -0400 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Wed Jul 03 02:45:00 2019 -0400
@@ -0,0 +1,219 @@
+# Domain based annotation of transposable elements - DANTE #
+
+### Authors
+Nina Hostakova, Petr Novak, Pavel Neumann, Jiri Macas
+Biology Centre CAS, Czech Republic
+
+
+### Introduction
+
+* Protein Domains Finder [dante.py]
+    * Scans the given DNA sequence(s) in (multi)fasta format for protein domains, using our protein domains database.
+    * The domain search is performed with the LASTAL alignment tool.
+    * Domains are then annotated and classified. If a domain has multiple annotations assigned, its classification is derived from the classification level common to all of them.
+
+* Protein Domains Filter [dante_gff_output_filtering.py]
+    * Filters the GFF3 output of the previous step to select a specific domain type and/or to adjust the quality filtering.
+
+### DEPENDENCIES ###
+
+* python3.4 or higher with packages:
+    * numpy
+    * biopython
+* [lastal](http://last.cbrc.jp/doc/last.html) 744 or higher
+* ProfRep/DANTE modules:
+    * configuration.py
+
+
+### Protein Domains Finder ###
+
+This tool provides **preliminary** output of all domain types, not filtered for quality.
+
+#### INPUTS ####
+
+* DNA sequence [multiFasta]
+
+#### OUTPUTS ####
+
+* **All protein domains GFF3** - individual domains are reported one per line as regions (start-end) on the original DNA sequence, including the seq ID and strand orientation. The last "Attributes" column contains several comma-separated items of information about the domain annotation, the alignment and its quality. This file can be filtered further with the Protein Domains Filter tool.
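
Editor's note: the GFF3 layout described under OUTPUTS can be read with a few lines of Python. The sketch below is only an illustration and is not part of DANTE. It assumes the standard nine tab-separated GFF3 columns and `key=value` attribute pairs; the separator between pairs is left as a parameter (`;` as in the GFF3 specification by default), because the README describes the attributes as comma-separated. The example record, including its source and type columns and the classification string, is made up.

```python
from collections import namedtuple

# Hypothetical container for the fields this README mentions.
DomainRecord = namedtuple("DomainRecord", "seqid start end strand attributes")


def parse_gff3_line(line, attr_sep=";"):
    """Split one GFF3 feature line into coordinates and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    attributes = {}
    for item in fields[8].split(attr_sep):
        if "=" in item:
            # Split on the first '=' only, so values may contain '='.
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return DomainRecord(fields[0], int(fields[3]), int(fields[4]), fields[6], attributes)


# Made-up record for illustration only.
example = ("seq1\tdante\tprotein_domain\t101\t400\t.\t+\t.\t"
           "Name=RT;Final_Classification=Class_I|LTR|Ty3/gypsy")
record = parse_gff3_line(example)
print(record.seqid, record.start, record.end, record.strand, record.attributes["Name"])
```

If the attributes in a given file really are comma-separated, pass `attr_sep=","`.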
+
+#### USAGE ####
+
+    usage: dante.py [-h] -q QUERY -pdb PROTEIN_DATABASE -cs
+                    CLASSIFICATION [-oug DOMAIN_GFF] [-nld NEW_LDB]
+                    [-dir OUTPUT_DIR] [-thsc THRESHOLD_SCORE]
+                    [-wd WIN_DOM] [-od OVERLAP_DOM]
+
+    optional arguments:
+      -h, --help            show this help message and exit
+      -oug DOMAIN_GFF, --domain_gff DOMAIN_GFF
+                            output domains gff format (default: None)
+      -nld NEW_LDB, --new_ldb NEW_LDB
+                            create indexed database files for lastal in case of
+                            working with new protein db (default: False)
+      -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
+                            specify if you want to change the output directory
+                            (default: None)
+      -thsc THRESHOLD_SCORE, --threshold_score THRESHOLD_SCORE
+                            percentage of the best score in the cluster to be
+                            tolerated when assigning annotations per base
+                            (default: 80)
+      -wd WIN_DOM, --win_dom WIN_DOM
+                            window to process large input sequences sequentially
+                            (default: 10000000)
+      -od OVERLAP_DOM, --overlap_dom OVERLAP_DOM
+                            overlap of sequences in two consecutive windows
+                            (default: 10000)
+
+    required named arguments:
+      -q QUERY, --query QUERY
+                            input DNA sequence to search for protein domains in a
+                            fasta format. Multifasta format allowed. (default:
+                            None)
+      -pdb PROTEIN_DATABASE, --protein_database PROTEIN_DATABASE
+                            protein domains database file (default: None)
+      -cs CLASSIFICATION, --classification CLASSIFICATION
+                            protein domains classification file (default: None)
+
+
+
+#### HOW TO RUN EXAMPLE ####
+    ./dante.py -q PATH_TO_INPUT_SEQ -pdb PATH_TO_PROTEIN_DB -cs PATH_TO_CLASSIFICATION_FILE
+
+When running for the first time with a new database, use the -nld option so that lastal creates the indexed database files:
+
+    -nld True
+
+Use the other arguments to rename your outputs; otherwise they are created automatically with standard names.
+
+### Protein Domains Filter ###
+
+The script filters the Protein Domains Finder output for quality and/or extracts a specific type of protein domain or mobile element of origin.
+For the filtered domains it reports the protein sequence translated from the original DNA.
+
+WHEN NO PARAMETERS ARE GIVEN, IT PERFORMS QUALITY FILTERING USING THE DEFAULT PARAMETERS (optimized for Viridiplantae species)
+
+#### INPUTS ####
+* GFF3 file produced by dante.py OR already filtered GFF3
+
+#### Filtering options ####
+* QUALITY:
+    - Min relative length of alignment to the protein domain from DB (without gaps)
+    - Identity
+    - Similarity (scoring matrix: BLOSUM80)
+    - Interruptions in the reading frame (frameshifts + stop codons) per every started 100 AA
+    - Max alignment proportion to the original length of database domain sequence
+* DOMAIN TYPE: 'Name' attribute in GFF - see choices below
+Records for ambiguous domain type (e.g. INT/RH) are filtered out automatically
+
+* MOBILE ELEMENT TYPE:
+arbitrary substring of the element classification ('Final_Classification' attribute in GFF)
+
+#### OUTPUTS ####
+* filtered GFF3 file
+* fasta file of translated protein sequences for the aligned domains that match the filtering criteria
+    !
+    as it is taken from the best hit alignment reported by LAST, it does not necessarily cover the whole region reported as domain in GFF
+
+#### USAGE ####
+
+    usage: dante_gff_output_filtering.py [-h] -dg DOM_GFF [-ouf DOMAINS_FILTERED]
+                                         [-dps DOMAINS_PROT_SEQ]
+                                         [-thl {float range 0.0..1.0}]
+                                         [-thi {float range 0.0..1.0}]
+                                         [-ths {float range 0.0..1.0}] [-ir INTERRUPTIONS]
+                                         [-mlen MAX_LEN_PROPORTION]
+                                         [-sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}]
+                                         [-el ELEMENT_TYPE] [-dir OUTPUT_DIR]
+
+
+
+    optional arguments:
+      -h, --help            show this help message and exit
+      -ouf DOMAINS_FILTERED, --domains_filtered DOMAINS_FILTERED
+                            output filtered domains gff file (default: None)
+      -dps DOMAINS_PROT_SEQ, --domains_prot_seq DOMAINS_PROT_SEQ
+                            output file containing domains protein sequences
+                            (default: None)
+      -thl {float range 0.0..1.0}, --th_length {float range 0.0..1.0}
+                            proportion of alignment length threshold (default:
+                            0.8)
+      -thi {float range 0.0..1.0}, --th_identity {float range 0.0..1.0}
+                            proportion of alignment identity threshold (default:
+                            0.35)
+      -ths {float range 0.0..1.0}, --th_similarity {float range 0.0..1.0}
+                            threshold for alignment proportional similarity
+                            (default: 0.45)
+      -ir INTERRUPTIONS, --interruptions INTERRUPTIONS
+                            interruptions (frameshifts + stop codons) tolerance
+                            threshold per 100 AA (default: 3)
+      -mlen MAX_LEN_PROPORTION, --max_len_proportion MAX_LEN_PROPORTION
+                            maximal proportion of alignment length to the original
+                            length of protein domain from database (default: 1.2)
+      -sd {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}, --selected_dom {All,GAG,INT,PROT,RH,RT,aRH,CHDCR,CHDII,TPase,YR,HEL1,HEL2,ENDO}
+                            filter output domains based on the domain type
+                            (default: All)
+      -el ELEMENT_TYPE, --element_type ELEMENT_TYPE
+                            filter output domains by typing substring from
+                            classification (default: )
+      -dir OUTPUT_DIR, --output_dir OUTPUT_DIR
+                            specify if you want to change the output directory
+                            (default: None)
+
+    required named arguments:
+      -dg DOM_GFF, --dom_gff DOM_GFF
+                            basic unfiltered gff file of all domains (default:
+                            None)
+
+
+
+#### HOW TO RUN EXAMPLE ####
+e.g. getting quality filtered integrase (INT) domains of all gypsy transposable elements:
+
+    ./dante_gff_output_filtering.py --dom_gff PATH_TO_INPUT_GFF --selected_dom INT --element_type Ty3/gypsy
+
+
+### Extract Domains Nucleotide Sequences ###
+
+This tool extracts nucleotide sequences of protein domains from reference DNA based on DANTE's output. It can be used, e.g., for deriving phylogenetic relations of individual mobile element classes within a species.
+
+#### INPUTS ####
+
+* original DNA sequence in multifasta format to extract the domains from
+* GFF3 file of protein domains (**DANTE's output** - preferably filtered for quality and specific domain type)
+* Domains database classification table (to check the classification level)
+
+#### OUTPUTS ####
+
+* fasta files of domain nucleotide sequences for individual transposon lineages
+* txt file of domain counts extracted for individual lineages
+
+**- For GALAXY usage, all are concatenated in a single fasta file**
+
+#### USAGE ####
+    usage: dante_gff_to_dna.py [-h] -i INPUT_DNA -d DOMAINS_GFF -cs
+                               CLASSIFICATION [-out OUT_DIR] [-ex EXTENDED]
+
+    optional arguments:
+      -h, --help            show this help message and exit
+      -i INPUT_DNA, --input_dna INPUT_DNA
+                            path to input DNA sequence
+      -d DOMAINS_GFF, --domains_gff DOMAINS_GFF
+                            GFF file of protein domains
+      -cs CLASSIFICATION, --classification CLASSIFICATION
+                            protein domains classification file
+      -out OUT_DIR, --out_dir OUT_DIR
+                            output directory
+      -ex EXTENDED, --extended EXTENDED
+                            extend the domains edges if not the whole database
+                            sequence was aligned
+
+#### HOW TO RUN EXAMPLE ####
+    ./dante_gff_to_dna.py --domains_gff PATH_PROTEIN_DOMAINS_GFF --input_dna PATH_TO_INPUT_DNA --classification PROTEIN_DOMAINS_DB_CLASS_TBL --extended True
+
+
+
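
Editor's note: the core of extracting a domain's nucleotide sequence from GFF3 coordinates can be sketched as follows. This is a minimal illustration, not the code of `dante_gff_to_dna.py`: it assumes 1-based inclusive coordinates as in GFF3, reverse-complements regions on the minus strand, and ignores the `--extended` option and the grouping by lineage. The function name `extract_domain` and the toy sequence are hypothetical.

```python
# Complement table for plain DNA letters; other symbols (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_domain(sequence, start, end, strand):
    """Return the nucleotides of a region given 1-based inclusive coordinates."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    region = sequence[start - 1:end]
    if strand == "-":
        # Minus-strand features are reported as the reverse complement.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


dna = "AAACCCGTTT"  # toy sequence for illustration
print(extract_domain(dna, 4, 7, "+"))  # CCCG
print(extract_domain(dna, 4, 7, "-"))  # CGGG
```

With Biopython, which is already listed among the dependencies, `Bio.Seq.Seq(...).reverse_complement()` would serve the same purpose for the minus strand.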
+
+
+
+
+
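
Editor's note: the quality criteria of the Protein Domains Filter (relative length, identity, similarity, interruptions per 100 AA, maximal length proportion) can be combined into a single predicate. The sketch below uses the default values from the filter's usage text. The function and parameter names are hypothetical, the per-domain values are assumed to be already computed, and treating every threshold as inclusive is an assumption, not something the README states.

```python
# Default thresholds as listed in the usage text of dante_gff_output_filtering.py.
DEFAULTS = {
    "th_length": 0.8,
    "th_identity": 0.35,
    "th_similarity": 0.45,
    "interruptions": 3,
    "max_len_proportion": 1.2,
}


def passes_quality(relative_length, identity, similarity,
                   interruptions_per_100aa, length_proportion,
                   thresholds=DEFAULTS):
    """Return True if a domain hit meets all five quality criteria."""
    return (relative_length >= thresholds["th_length"]
            and identity >= thresholds["th_identity"]
            and similarity >= thresholds["th_similarity"]
            # Inclusive comparisons are an assumption of this sketch.
            and interruptions_per_100aa <= thresholds["interruptions"]
            and length_proportion <= thresholds["max_len_proportion"])


print(passes_quality(0.9, 0.5, 0.6, 1, 1.0))  # True
print(passes_quality(0.7, 0.5, 0.6, 1, 1.0))  # False: alignment too short
```

Passing a modified copy of `DEFAULTS` as `thresholds` mirrors adjusting `-thl`, `-thi`, `-ths`, `-ir` and `-mlen` on the command line.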