# HG changeset patch
# User petr-novak
# Date 1595589967 14400
# Node ID 422485508110d0252ea7b45c960ef0ee49fc1586
# Parent 43c4250c67611f8d34d541309f9ba5dd6a0f489f
Uploaded
diff -r 43c4250c6761 -r 422485508110 repex_full_clustering.xml
--- a/repex_full_clustering.xml Thu Apr 30 07:42:45 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,307 +0,0 @@
-
-
-
-
-
-
-
-
- Improved version of repeat discovery and characterization using graph-based sequence clustering
-
- last
- imagemagick
- mafft
- blast
- diamond
- blast-legacy
- r-igraph
- r-data.tree
- r-stringr
- r-r2html
- r-hwriter
- r-dt
- r-scales
- r-plotrix
- r-png
- r-plyr
- r-dplyr
- r-optparse
- r-dbi
- r-rsqlite
- r-rserve
- bioconductor-biostrings
- repex_tarean_testing
- REPEX
- REPEX_VERSION
- pyrserve
-
-
- export PYTHONHASHSEED=0;
- \${REPEX}/seqclust --sample ${read_sampling.sample} --output_dir=tarean_output --logfile=${log} --cleanup $paired --taxon $taxon
-
- #if $advanced_options.advanced:
- --mincl $advanced_options.size_threshold $advanced_options.keep_names $advanced_options.automatic_filtering -D $advanced_options.blastx.options_blastx
- --assembly_min $advanced_options.assembly_min_cluster_size
-
- #if $advanced_options.comparative.options_comparative:
- --prefix_length $advanced_options.comparative.prefix_length
- #end if
-
- #if $advanced_options.custom_library.options_custom_library:
- -d $advanced_options.custom_library.library extra_database
- #end if
-
- #if $advanced_options.options.options:
- -opt $advanced_options.options.options
- #end if
- #end if
- ${FastaFile} >stdout.log 2> stderr.log ;
- echo "STDOUT CONTENT:" >> ${log} ;
- cat stdout.log >> ${log} ;
- echo "STDERR CONTENT:" >> ${log};
- cat stderr.log >> ${log} &&
- \${REPEX}/stderr_filter.py stderr.log &&
- cd tarean_output &&
- zip -r ${ReportArchive}.zip * &&
- mv ${ReportArchive}.zip ${ReportArchive} &&
- cp index.html ${ReportFile} &&
- mkdir ${ReportFile.files_path} &&
- cp -r --parents libdir ${ReportFile.files_path} &&
- cp -r --parents seqclust/clustering/superclusters ${ReportFile.files_path} &&
- cp -r --parents seqclust/clustering/clusters ${ReportFile.files_path} &&
- cp seqclust/clustering/hitsort.cls ${ReportFile.files_path}/seqclust/clustering/hitsort.cls &&
- cp *.png ${ReportFile.files_path}/ &&
- cp *.csv ${ReportFile.files_path}/ &&
- cp *.html ${ReportFile.files_path}/ &&
- cp *.css ${ReportFile.files_path}/ &&
- cp *.fasta ${ReportFile.files_path}/ 2>>$log && rm -r ../tarean_output || :
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- **HELP**
-
- RepeatExplorer2 clustering is a computational pipeline for unsupervised
- identification of repeats from unassembled sequence reads. The
- pipeline uses low-pass whole genome sequence reads and performs graph-based
- clustering. Resulting clusters, representing all types of repeats, are then
- examined to identify and classify them into repeat groups.
-
- **Input data**
-
- The analysis requires either **single** or **paired-end reads** generated
- by whole genome shotgun sequencing provided as a single fasta-formatted file.
- Generally, paired-end reads provide significantly better results than single
- reads. Reads should be of uniform length (optimal size range is 100-200 nt) and
- the number of analyzed reads should represent less than 1x genome equivalent
- (genome coverage of 0.01 - 0.50 x is recommended). Reads should be
- quality-filtered (recommended filtering: quality score >=10 over 95% of bases
- and no Ns allowed) and only **complete read pairs** should be submitted for
- analysis. When paired reads are used, input data must be in **interlaced** format
- as a fasta file:
-
- example of interlaced input format::
-
- >0001_f
- CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
- >0001_r
- GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
- >0002_f
- ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
- >0002_r
- TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
- >0003_f
- TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
- >0003_r
- TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
- ...
-
-
- **Comparative analysis**
-
- For comparative analysis, sequence names must contain a code (prefix) for each group.
- The prefix in sequence names must be of fixed length.
-
- Example of labeling two groups where the **group code length** is 2 and the prefixes distinguishing the groups are AA and BB ::
-
- >AA0001_f
- CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
- >AA0001_r
- GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
- >AA0002_f
- ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
- >AA0002_r
- TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
- >BB0001_f
- TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
- >BB0001_r
- TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
- >BB0002_f
- TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
- >BB0002_r
- TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
-
-
- To prepare a quality-filtered and interlaced input fasta file from fastq
- files, use the `Preprocessing of paired-reads`__ tool.
-
- .. __: tool_runner?tool_id=paired_fastq_filtering
-
-
- **Additional parameters**
-
- **Sample size** defines how many reads should be used in the calculation.
- The default setting of 500,000 reads will enable detection of high copy
- repeats within several hours of computation time. For higher
- sensitivity the sample size can be set higher. Since sample size affects
- memory usage, this parameter may be automatically adjusted to a lower
- value during the run. The maximum sample size which can be processed depends on
- the repetitiveness of the analyzed genome.
-
-
- **Select taxon and protein domain database version (REXdb)**. Classification
- of transposable elements is based on similarity to our reference database
- of transposable element protein domains (**REXdb**). A standalone database for Viridiplantae species
- can be obtained from `repeatexplorer.org`__. The classification
- system used in REXdb is described in the article `Systematic survey of plant
- LTR-retrotransposons elucidates phylogenetic relationships of their
- polyprotein domains and provides a reference for element classification`__.
- The database for Metazoa species is still under development, so use it with caution.
-
- .. __: http://repeatexplorer.org
- .. __: https://doi.org/10.1186/s13100-018-0144-1
-
- **Select parameters for protein domain search** REXdb is compared with
- sequence clusters using either the blastx or the diamond aligner. Diamond
- is about three times faster than blastx with word size 3.
-
- **Similarity search options** By default, sequence reads are compared using
- the mgblast program. The default threshold is explicitly set to 90% sequence
- similarity spanning at least 55% of the read length (for reads
- differing in length, it applies to the longer one). Additionally, the sequence
- overlap must be at least 55 nt. If you select the option for reads shorter
- than 100 nt, the minimum overlap of 55 nt is not required.
-
- By default,
- the mgblast search uses the DUST program to filter out
- low-complexity sequences. If you want
- to increase the sensitivity of detection of satellites with shorter monomers,
- use the option '*no masking of low complexity repeats*'. Note that omitting
- DUST filtering will significantly increase running times.
-
-
- **Automatic filtering of abundant satellite repeats** performs clustering on
- a smaller dataset of sequence reads to detect abundant high-confidence
- satellite repeats. If such satellites are detected, sequence reads derived
- from these satellites are depleted from the input dataset. This step enables more
- sensitive detection of less abundant repeats, as more reads can be used
- in the clustering step.
-
- **Use custom repeat database**. This option allows users to perform similarity
- comparison of identified repeats to their custom databases. The repeat class must
- be encoded in FASTA headers of database entries in order to allow correct
- parsing of similarity hits. The required format for a custom database sequence name is: ::
-
- >repeatname#class/subclass
-
-
- **Output**
-
- A list of clusters identified as putative satellite repeats, their genomic
- abundance and various cluster characteristics are provided.
-
- The output includes an **HTML summary** with a table listing all analyzed
- clusters. More detailed information about clusters is provided in
- additional files and directories. All results are also provided as a
- downloadable **zip archive**. Additionally, a **log file** reporting
- the progress of the computational pipeline is provided.
-
-
-
-
diff -r 43c4250c6761 -r 422485508110 repex_tarean.xml
--- a/repex_tarean.xml Thu Apr 30 07:42:45 2020 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,251 +0,0 @@
-
-
-
-
-
-
-
- Identification of genomic tandem repeats from NGS data
-
- imagemagick
- mafft
- blast
- diamond
- blast-legacy
- r-igraph
- r-data.tree
- r-stringr
- r-r2html
- r-hwriter
- r-dt
- r-scales
- r-plotrix
- r-png
- r-plyr
- r-dplyr
- r-optparse
- r-dbi
- r-rsqlite
- r-rserve
- bioconductor-biostrings
- repex_tarean_testing
- REPEX
- REPEX_VERSION
- pyrserve
-
-
- export PYTHONHASHSEED=0;
- \${REPEX}/seqclust --paired --sample ${read_sampling.sample} --output_dir=tarean_output --logfile=${log} --cleanup --tarean_mode
- #if $advanced_options.advanced:
- --mincl $advanced_options.size_threshold $advanced_options.keep_names $advanced_options.automatic_filtering -M $advanced_options.merging
- #if $advanced_options.custom_library.options_custom_library :
- -d $advanced_options.custom_library.library extra_database
- #end if
- #if $advanced_options.options.options:
- -opt $advanced_options.options.options
- #end if
- #else:
- -M 0.2
-
- #end if
- ${FastaFile} >stdout.log 2> stderr.log ;
- echo "STDOUT CONTENT:" >> ${log} ;
- cat stdout.log >> ${log} ;
- echo "STDERR CONTENT:" >> ${log} ;
- cat stderr.log >> ${log} &&
- \${REPEX}/stderr_filter.py stderr.log &&
- cd tarean_output &&
- zip -r ${ReportArchive}.zip * &&
- mv ${ReportArchive}.zip ${ReportArchive} &&
- cp index.html ${ReportFile} &&
- mkdir ${ReportFile.files_path} &&
- cp -r --parents libdir ${ReportFile.files_path} &&
- cp -r --parents seqclust/clustering/superclusters ${ReportFile.files_path} &&
- cp -r --parents seqclust/clustering/clusters ${ReportFile.files_path} &&
- cp seqclust/clustering/hitsort.cls ${ReportFile.files_path}/seqclust/clustering/hitsort.cls &&
- cp *.png ${ReportFile.files_path}/ &&
- cp *.csv ${ReportFile.files_path}/ &&
- cp *.html ${ReportFile.files_path}/ &&
- cp *.css ${ReportFile.files_path}/ &&
- cp *.fasta ${ReportFile.files_path}/ 2>>$log && rm -r ../tarean_output || :
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- **HELP**
-
- TAREAN - TAndem REpeat ANalyzer is a computational pipeline for
- **unsupervised identification of satellite repeats** from unassembled
- sequence reads. The pipeline uses low-pass paired-end whole genome
- sequence reads and performs graph-based clustering. The resulting
- clusters, representing all types of repeats present in the genome, are
- then examined to identify those containing circular structures indicative
- of tandem repeats. A poster summarizing TAREAN principles and
- implementation can be found `here.`__
-
-
- .. __: http://w3lamc.umbr.cas.cz/lamc/?page_id=312
-
- **Input data**
-
-
- The analysis requires **paired-end reads** generated by whole genome
- shotgun sequencing. The data should be provided as a single input file in
- fasta format with the reads interlaced (see example below). All the pairs
- must be complete, i.e. both "forward" and "reverse" sequence reads must be
- present. The reads should all be trimmed to the same length. The optimal
- size range is between 100 and 200 nucleotides. The number of reads to be
- analyzed should not exceed 1x coverage of the genome. Genome coverage
- between 0.01 and 0.5x is recommended. The reads should be filtered for
- quality. The recommended quality filtering is as follows: each read should
- have a quality score >=10 for 95% of the bases, i.e. if your reads are 100
- bases long, then a read only passes this quality threshold if at least 95
- bases have a quality of 10 or higher. Additionally, any reads containing
- indeterminate bases (indicated as N in the reads) should be removed.
- Finally, if either one of the reads in a pair fails to meet the
- aforementioned thresholds, **both** sequences should be removed.
- Example of interlaced input format::
-
- >0001_f
- CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
- >0001_r
- GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
- >0002_f
- ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
- >0002_r
- TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
- >0003_f
- TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
- >0003_r
- TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
- ...
-
-
- To perform the quality filtering on your fastq-formatted data as described
- above, and to interlace your paired-end sequence reads,
- please use the `Preprocessing of paired-reads`__ tool.
-
- .. __: tool_runner?tool_id=paired_fastq_filtering
-
-
- **Additional parameters**
-
- **Sample size** defines how many reads will be used during the computation.
- The default setting of 500,000 reads will enable detection of high copy
- number satellites within several hours. For higher
- sensitivity the sample size can be increased. Since the sample size affects
- memory usage, this parameter may be automatically adjusted to a lower value
- during the run. The maximum sample size which can be processed depends on the
- repetitiveness of the analyzed genome. This significantly limits the number of reads
- that can be analyzed with the TAREAN pipeline.
-
- **Perform cluster merging**. Families of repetitive elements are
- frequently split into multiple clusters rather than being represented as a
- single one. If you do not want to merge clusters based on the presence
- of broken read pairs, disable this option.
-
- **Use custom repeat database**. This option allows users to perform similarity
- comparison of identified repeats to their custom databases. The repeat class should
- be encoded in FASTA headers of database entries in order to allow correct
- parsing of similarity hits.
-
- **Similarity search options** By default, sequence reads are compared using
- the mgblast program. The default threshold is explicitly set to 90% sequence
- similarity spanning at least 55% of the read length (for reads
- differing in length, it applies to the longer one). Additionally, the sequence
- overlap must be at least 55 nt. If you select the option for reads shorter
- than 100 nt, the minimum overlap of 55 nt is not required.
-
- By default,
- the mgblast search uses the DUST program to filter out
- low-complexity sequences. If you want
- to increase the sensitivity of detection of satellites with shorter monomers,
- use the option '*no masking of low complexity repeats*'. Note that omitting
- DUST filtering will significantly increase running times.
-
- **Output**
-
- A list of clusters identified as putative satellite repeats, their genomic
- abundance and various cluster characteristics are provided. Length and
- consensus sequences of reconstructed monomers are also shown and
- accompanied by a detailed output from kmer-based reconstruction including
- sequences and sequence logos of alternative variants of monomer sequences.
-
- The output includes an **HTML summary** with a table listing all analyzed
- clusters. More detailed information about clusters is provided in
- additional files and directories. All results are also provided as a
- downloadable **zip archive**. Since read clustering results in
- thousands of clusters, the search for satellite repeats is limited to
- a subset of the largest ones corresponding to the most abundant genomic
- repeats. The default setting of the pipeline is to analyze all clusters containing at least
- 0.01% of the input reads. Besides the satellite repeats, three other
- groups of clusters are reported in the output: (1) LTR-retrotransposons,
- (2) 45S and 5S rDNA, and (3) all remaining clusters passing the size
- threshold. As (1) and (2) contain sequences with circular
- graphs, their consensus is calculated in the same way as for satellite
- repeats. Additionally, a **log file** reporting the progress of the
- computational pipeline is provided.
-
-
-
-
-
diff -r 43c4250c6761 -r 422485508110 tool_dependencies.xml
--- a/tool_dependencies.xml Thu Apr 30 07:42:45 2020 -0400
+++ b/tool_dependencies.xml Fri Jul 24 07:26:07 2020 -0400
@@ -1,9 +1,28 @@
-
+
-
-
-
- prepare repex database and scripts
+
+
+
+ https://bitbucket.org/petrnovak/repex_tarean/get/7fa000f91080.zip
+
+ make
+
+
+ $TMP_WORK_DIR/petrnovak-repex_tarean-7fa000f91080
+ $INSTALL_DIR
+
+
+
+ $INSTALL_DIR
+
+
+ "version: 0.3.8-458(7fa000f) branch: almitey"
+
+
+
+
+ repeatexplorer executables and databases
-
-
\ No newline at end of file
+
+
+