comparison README.md @ 0:1d1b9e1b2e2f draft
Uploaded
author | petr-novak |
---|---|
date | Thu, 19 Dec 2019 10:24:45 -0500 |
# RepeatExplorer2 with TAREAN (Tandem Repeat Analyzer) #
-------------------------------------------------------------------------------
New version of RepeatExplorer with TAndem REpeat ANalyzer

## Authors
Petr Novak, Jiri Macas, Pavel Neumann
Biology Centre CAS, Czech Republic

## Change log

[link](CHANGELOG.md)

## Installation ##
To use RepeatExplorer without installation, we recommend our freely
available Galaxy server at
[https://repeatexplorer-elixir.cerit-sc.cz](https://repeatexplorer-elixir.cerit-sc.cz).
This server is provided within the framework of the ELIXIR-CZ project. The Galaxy
server also includes additional tools useful for data preprocessing, quality control,
and genome annotation.

For the command-line version from a standalone installation, follow the instructions below.

To download the source using git:

    git clone https://bitbucket.org/petrnovak/repex_tarean.git
    cd repex_tarean

We recommend installing dependencies using conda (conda can be installed using [miniconda](https://docs.conda.io/en/latest/miniconda.html)). The required environment can be prepared with the command:

    conda env create -f environment.yml

Activate the prepared environment using:

    conda activate repeatexplorer

In the `repex_tarean` directory, compile the source and prepare the databases using:

    make

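Before running the steps above, it can help to confirm that the tools they rely on are on your `PATH`. The following is a minimal sketch and not part of RepeatExplorer; the function name `check_tools` is hypothetical.

```shell
#!/bin/sh
# check_tools is a hypothetical helper, not part of RepeatExplorer.
# It reports every listed command that is not found on PATH and
# returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools used by the installation steps in this README.
check_tools git conda make || echo "install the missing tools before continuing" >&2
```

The check only looks for the commands; it does not verify versions or the contents of the conda environment.
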
Support for 32-bit executables is required. If you are using an Ubuntu distribution, you can add 32-bit support by running:

    sudo dpkg --add-architecture i386
    sudo apt-get update
    sudo apt-get install libc6:i386 libncurses5:i386 libstdc++6:i386

To verify the installation, you can run clustering on the example data:

    ./seqclust -p -v tmp/clustering_output test_data/LAS_paired_10k.fas

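One way to confirm that the test run finished is to check its exit status and that the output directory is not empty. This is a sketch, assuming `seqclust` returns a non-zero exit status on failure (this README does not state that); `has_output` is a hypothetical helper, and no specific output file names are assumed.

```shell
#!/bin/sh
# has_output is a hypothetical helper: it succeeds if the directory exists
# and contains at least one entry.
has_output() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Run the example only where seqclust is present in the current directory.
if [ -x ./seqclust ]; then
    if ./seqclust -p -v tmp/clustering_output test_data/LAS_paired_10k.fas \
            && has_output tmp/clustering_output; then
        echo "test run finished and produced output"
    else
        echo "test run failed or produced no output" >&2
    fi
fi
```
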
## Protein databases

RepeatExplorer2 uses the REXdb database of protein domains for repeat annotation and classification. The structure of the database is described at [http://repeatexplorer.org/](http://repeatexplorer.org/). The current version of the database for RepeatExplorer is fetched from the Bitbucket repository [https://bitbucket.org/petrnovak/re_databases](https://bitbucket.org/petrnovak/re_databases) during compilation with the `make` command.

## RepeatExplorer command line options

    usage: seqclust [-h] [-p] [-A] [-t] [-l LOGFILE] [-m {float range 0.0..100.0}]
                    [-M {0,float range 0.1..1}] [-o {float range 30.0..80.0}]
                    [-c CPU] [-s SAMPLE] [-P PREFIX_LENGTH] [-v OUTPUT_DIR]
                    [-r MAX_MEMORY] [-d DATABASE DATABASE] [-C] [-k]
                    [-a {2,3,4,5}]
                    [-tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}]
                    [-opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SHORT,OXFORD_NANOPORE}]
                    [-D {BLASTX_W2,BLASTX_W3,DIAMOND}]
                    sequences

    RepeatExplorer:
        Repetitive sequence discovery and classification from NGS data

    positional arguments:
      sequences

    optional arguments:
      -h, --help            show this help message and exit
      -p, --paired
      -A, --automatic_filtering
      -t, --tarean_mode     analyze only tandem repeats without additional classification
      -l LOGFILE, --logfile LOGFILE
                            log file, logging goes to stdout if not defined
      -m {float range 0.0..100.0}, --mincl {float range 0.0..100.0}
      -M {0,float range 0.1..1}, --merge_threshold {0,float range 0.1..1}
                            threshold for mate-pair based cluster merging, default 0 - no merging
      -o {float range 30.0..80.0}, --min_lcov {float range 30.0..80.0}
                            minimal overlap coverage - relative to longer sequence length, default 55
      -c CPU, --cpu CPU     number of cpu to use, if 0 use max available
      -s SAMPLE, --sample SAMPLE
                            use only sample of input data [by default max reads is used]
      -P PREFIX_LENGTH, --prefix_length PREFIX_LENGTH
                            If you wish to keep part of the sequences name,
                            enter the number of characters which should be
                            kept (1-10) instead of zero. Use this setting if
                            you are doing comparative analysis
      -v OUTPUT_DIR, --output_dir OUTPUT_DIR
      -r MAX_MEMORY, --max_memory MAX_MEMORY
                            Maximal amount of available RAM in kB; if not set,
                            clustering tries to use all available RAM
      -d DATABASE DATABASE, --database DATABASE DATABASE
                            fasta file with database for annotation and name of database
      -C, --cleanup         remove unnecessary large files from working directory
      -k, --keep_names      keep sequence names, by default sequences are renamed
      -a {2,3,4,5}, --assembly_min {2,3,4,5}
                            Assembly is performed on individual clusters, by default
                            clusters with size less than 5 are not assembled. If you
                            need assembly of smaller clusters, set *assembly_min*
                            accordingly
      -tax {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}, --taxon {VIRIDIPLANTAE3.0,VIRIDIPLANTAE2.2,METAZOA2.0,METAZOA3.0}
                            Select taxon and protein database version
      -opt {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SHORT,OXFORD_NANOPORE}, --options {ILLUMINA,ILLUMINA_DUST_OFF,ILLUMINA_SHORT,OXFORD_NANOPORE}
      -D {BLASTX_W2,BLASTX_W3,DIAMOND}, --domain_search {BLASTX_W2,BLASTX_W3,DIAMOND}
                            Detection of protein domains can be performed by either blastx or
                            diamond program. Options are:
                              BLASTX_W2 - blastx with word size 2 (slowest, the most sensitive)
                              BLASTX_W3 - blastx with word size 3 (default)
                              DIAMOND   - diamond program (significantly faster, less sensitive)
                            To use this option diamond program must be installed in your PATH

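As an illustration of how the options combine, the sketch below builds a command line for paired-end reads using only flags listed in the help above. The wrapper `run_clustering`, the `DRY_RUN` variable, and the chosen values (4 CPUs, `VIRIDIPLANTAE3.0`, `DIAMOND`) are illustrative assumptions, not recommended settings.

```shell
#!/bin/sh
# run_clustering is a hypothetical wrapper around seqclust.
# Usage: run_clustering OUTPUT_DIR SEQUENCES
# With DRY_RUN=1 it only prints the command instead of running it.
run_clustering() {
    out_dir=$1
    reads=$2
    # Rebuild the positional parameters as the full seqclust command line.
    set -- ./seqclust \
        --paired \
        --cpu 4 \
        --taxon VIRIDIPLANTAE3.0 \
        --domain_search DIAMOND \
        --cleanup \
        --output_dir "$out_dir" \
        "$reads"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command that would be run on the bundled example data.
DRY_RUN=1 run_clustering tmp/clustering_output test_data/LAS_paired_10k.fas
```

According to the help text, `DIAMOND` trades sensitivity for speed and requires the diamond program on your `PATH`; dropping `--domain_search` falls back to the default `BLASTX_W3`.
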
## Galaxy toolshed
TODO

## Reproducibility
To make clustering reproducible between runs on the
same data, the environment variable `PYTHONHASHSEED` must be set:

    export PYTHONHASHSEED=0

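If you prefer not to export the variable for the whole shell session, it can be set for a single command instead. A minimal sketch; `with_fixed_hashseed` is a hypothetical helper name.

```shell
#!/bin/sh
# with_fixed_hashseed is a hypothetical helper: it runs the given command
# with PYTHONHASHSEED=0 set only in that command's environment.
with_fixed_hashseed() {
    env PYTHONHASHSEED=0 "$@"
}

# Show the value seen by a child process.
with_fixed_hashseed sh -c 'echo "PYTHONHASHSEED=$PYTHONHASHSEED"'

# Example with the clustering command from this README (not run here):
#   with_fixed_hashseed ./seqclust -p -v tmp/clustering_output test_data/LAS_paired_10k.fas
```
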
## Disk space requirements
A large SQLite database for temporary data is created in the OS-specific temp directory, usually `/tmp/`.
To use an alternative location, set the `TEMP` environment variable.

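For example, the temporary data can be redirected to a larger disk as sketched below. The helper `use_temp_dir` and the path `/data/repex_tmp` are hypothetical. As an assumption not confirmed by this README: if the pipeline locates its temp directory through Python's `tempfile` module, a `TMPDIR` variable that is already set would take precedence over `TEMP`.

```shell
#!/bin/sh
# use_temp_dir is a hypothetical helper: it creates the directory if needed,
# prints the free space there in kB, and exports TEMP to point at it.
use_temp_dir() {
    mkdir -p "$1" || return 1
    # Column 4 of POSIX-format df output is the available space in 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { print "available kB: " $4 }'
    TEMP=$1
    export TEMP
}

# Example (hypothetical path on a disk with enough free space):
#   use_temp_dir /data/repex_tmp
```
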
## CPU and RAM requirements

Resource requirements can be set either with the command line arguments `--max_memory` and `--cpu` or
with the environment variables `TAREAN_MAX_MEM` and `TAREAN_CPU`. If not set, the pipeline uses all
available resources.

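As a sketch of the environment-variable route: the help text states that `--max_memory` is given in kB, and the example assumes `TAREAN_MAX_MEM` uses the same unit, which this README does not state. The 16 GB and 8 CPU limits are arbitrary illustrative values.

```shell
#!/bin/sh
# Illustrative limits; adjust to your machine.
mem_gb=16
cpus=8

# Convert GB to kB (assumed unit for TAREAN_MAX_MEM, matching --max_memory).
mem_kb=$((mem_gb * 1024 * 1024))

export TAREAN_MAX_MEM="$mem_kb"
export TAREAN_CPU="$cpus"

echo "TAREAN_MAX_MEM=$TAREAN_MAX_MEM TAREAN_CPU=$TAREAN_CPU"
```

The same limits could be passed on the command line as `--max_memory 16777216 --cpu 8`.
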
## How to cite

If you use RepeatExplorer for general repeat characterization in your work, please cite:

- [Novak, P., Neumann, P., Pech, J., Steinhaisl, J., Macas, J. (2013) - RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads. Bioinformatics 29:792-793](http://bioinformatics.oxfordjournals.org/content/29/6/792)

or

- [Novak, P., Neumann, P., Macas, J. (2010) - Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11:378](http://www.biomedcentral.com/1471-2105/11/378)

If you use TAREAN for satellite detection and characterization, please cite:

- [Novak, P., Robledillo, L.A., Koblizkova, A., Vrbova, I., Neumann, P., Macas, J. (2017) - TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Research](https://doi.org/10.1093/nar/gkx257)