# HG changeset patch
# User petr-novak
# Date 1588246965 14400
# Node ID 43c4250c67611f8d34d541309f9ba5dd6a0f489f
Uploaded
diff -r 000000000000 -r 43c4250c6761 repex_full_clustering.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/repex_full_clustering.xml Thu Apr 30 07:42:45 2020 -0400
@@ -0,0 +1,307 @@
+
+
+
+
+
+
+
+
+ Improved version of repeat discovery and characterization using graph-based sequence clustering
+
+ last
+ imagemagick
+ mafft
+ blast
+ diamond
+ blast-legacy
+ r-igraph
+ r-data.tree
+ r-stringr
+ r-r2html
+ r-hwriter
+ r-dt
+ r-scales
+ r-plotrix
+ r-png
+ r-plyr
+ r-dplyr
+ r-optparse
+ r-dbi
+ r-rsqlite
+ r-rserve
+ bioconductor-biostrings
+ repex_tarean_testing
+ REPEX
+ REPEX_VERSION
+ pyrserve
+
+
+ export PYTHONHASHSEED=0;
+ \${REPEX}/seqclust --sample ${read_sampling.sample} --output_dir=tarean_output --logfile=${log} --cleanup $paired --taxon $taxon
+
+ #if $advanced_options.advanced:
+ --mincl $advanced_options.size_threshold $advanced_options.keep_names $advanced_options.automatic_filtering -D $advanced_options.blastx.options_blastx
+ --assembly_min $advanced_options.assembly_min_cluster_size
+
+ #if $advanced_options.comparative.options_comparative:
+ --prefix_length $advanced_options.comparative.prefix_length
+ #end if
+
+ #if $advanced_options.custom_library.options_custom_library:
+ -d $advanced_options.custom_library.library extra_database
+ #end if
+
+ #if $advanced_options.options.options:
+ -opt $advanced_options.options.options
+ #end if
+ #end if
+ ${FastaFile} >stdout.log 2> stderr.log ;
+ echo "STDOUT CONTENT:" >> ${log} ;
+ cat stdout.log >> ${log} ;
+ echo "STDERR CONTENT:" >> ${log};
+ cat stderr.log >> ${log} &&
+ \${REPEX}/stderr_filter.py stderr.log &&
+ cd tarean_output &&
+ zip -r ${ReportArchive}.zip * &&
+ mv ${ReportArchive}.zip ${ReportArchive} &&
+ cp index.html ${ReportFile} &&
+ mkdir ${ReportFile.files_path} &&
+ cp -r --parents libdir ${ReportFile.files_path} &&
+ cp -r --parents seqclust/clustering/superclusters ${ReportFile.files_path} &&
+ cp -r --parents seqclust/clustering/clusters ${ReportFile.files_path} &&
+ cp seqclust/clustering/hitsort.cls ${ReportFile.files_path}/seqclust/clustering/hitsort.cls &&
+ cp *.png ${ReportFile.files_path}/ &&
+ cp *.csv ${ReportFile.files_path}/ &&
+ cp *.html ${ReportFile.files_path}/ &&
+ cp *.css ${ReportFile.files_path}/ &&
+ cp *.fasta ${ReportFile.files_path}/ 2>>$log && rm -r ../tarean_output || :
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ **HELP**
+
+ RepeatExplorer2 clustering is a computational pipeline for unsupervised
+ identification of repeats from unassembled sequence reads. The
+ pipeline uses low-pass whole genome sequence reads and performs graph-based
+ clustering. The resulting clusters, representing all types of repeats, are then
+ examined to identify the repeats and classify them into groups.
+
+ **Input data**
+
+ The analysis requires either **single** or **paired-end reads** generated
+ by whole genome shotgun sequencing, provided as a single FASTA-formatted file.
+ Generally, paired-end reads provide significantly better results than single
+ reads. Reads should be of uniform length (optimal size range is 100-200 nt) and
+ the number of analyzed reads should represent less than 1x genome equivalent
+ (genome coverage of 0.01-0.50x is recommended). Reads should be
+ quality-filtered (recommended filtering: quality score >=10 over 95% of bases
+ and no Ns allowed) and only **complete read pairs** should be submitted for
+ analysis. When paired reads are used, the input data must be provided as an
+ **interlaced** FASTA file.
+
+ Example of interlaced input format::
+
+ >0001_f
+ CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
+ >0001_r
+ GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
+ >0002_f
+ ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
+ >0002_r
+ TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
+ >0003_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >0003_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+ ...
+
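The interlacing requirement can be checked before submission. The following is a minimal sketch, not part of the pipeline: `read_fasta` and `find_unpaired` are hypothetical helper names, and the pairing rule (mates share a name and differ only in the last character, as `f`/`r` in the example) is an assumption drawn from the example above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) records from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_unpaired(names: List[str]) -> List[str]:
    """Return read names that break the forward/reverse interlacing.

    Assumption: two mates have identical names except for the last character.
    """
    unpaired = []
    i = 0
    while i < len(names):
        has_next = i + 1 < len(names)
        if has_next and names[i][:-1] == names[i + 1][:-1] and names[i] != names[i + 1]:
            i += 2  # a complete pair, skip both mates
        else:
            unpaired.append(names[i])
            i += 1
    return unpaired
```

For a file on disk, something like `find_unpaired([name for name, _ in read_fasta(open(path))])` would list the reads lacking a mate; an empty list suggests the file is consistently interlaced under the stated assumption.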
+
+ **Comparative analysis**
+
+ For comparative analysis, sequence names must contain a code (prefix) identifying each group.
+ The prefix must have the same fixed length in all sequence names.
+
+ Example of labeling two groups, AA and BB, where the **group code length** is 2::
+
+ >AA0001_f
+ CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
+ >AA0001_r
+ GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
+ >AA0002_f
+ ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
+ >AA0002_r
+ TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
+ >BB0001_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >BB0001_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+ >BB0002_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >BB0002_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+
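The fixed-length prefix rule amounts to taking the first characters of each read name as the group code. A minimal sketch under that reading follows; `count_by_prefix` is a hypothetical helper for checking input, not a function of the pipeline.

```python
from collections import Counter
from typing import Iterable


def count_by_prefix(names: Iterable[str], prefix_length: int) -> Counter:
    """Count reads per group code, taken as the first prefix_length characters."""
    counts = Counter()
    for name in names:
        if len(name) <= prefix_length:
            # A name consisting only of the prefix cannot identify a read.
            raise ValueError(f"name {name!r} is not longer than the prefix")
        counts[name[:prefix_length]] += 1
    return counts
```

Applied to the names in the example above with a prefix length of 2, this gives four reads for `AA` and four for `BB`. An unexpected group code in the result usually points to a mislabeled read or a wrong prefix length.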
+
+ To prepare a quality-filtered and interlaced input FASTA file from FASTQ
+ files, use the `Preprocessing of paired-reads`__ tool.
+
+ .. __: tool_runner?tool_id=paired_fastq_filtering
+
+
+ **Additional parameters**
+
+ **Sample size** defines how many reads are used in the calculation.
+ The default setting of 500,000 reads enables detection of high-copy
+ repeats within several hours of computation time. For higher
+ sensitivity, the sample size can be increased. Since the sample size affects
+ memory usage, this parameter may be automatically adjusted to a lower
+ value during the run. The maximum sample size that can be processed depends on
+ the repetitiveness of the analyzed genome.
+
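To relate a sample size to the recommended genome coverage, the usual estimate is the number of reads times the read length divided by the genome size. The sketch below illustrates the arithmetic only; the 1 Gbp genome size in the usage note is a made-up value, and `genome_coverage` is a hypothetical helper name.

```python
def genome_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Return the genome coverage (in x) represented by a read sample.

    All reads are assumed to have the same length, as recommended for
    the input data.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size
```

For instance, `genome_coverage(500_000, 100, 1_000_000_000)` evaluates to 0.05, i.e. the default sample of 500,000 reads of 100 nt would correspond to 0.05x coverage of a hypothetical 1 Gbp genome, which lies within the recommended range.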
+
+ **Select taxon and protein domain database version (REXdb)**. Classification
+ of transposable elements is based on similarity to our reference database
+ of transposable element protein domains (**REXdb**). A standalone database for Viridiplantae species
+ can be obtained from `repeatexplorer.org`__. The classification
+ system used in REXdb is described in the article `Systematic survey of plant
+ LTR-retrotransposons elucidates phylogenetic relationships of their
+ polyprotein domains and provides a reference for element classification`__.
+ The database for Metazoa species is still under development, so use it with caution.
+
+ .. __: http://repeatexplorer.org
+ .. __: https://doi.org/10.1186/s13100-018-0144-1
+
+ **Select parameters for protein domain search**. REXdb is compared with
+ sequence clusters using either the blastx or the diamond aligner. Diamond
+ is about three times faster than blastx with word size 3.
+
+ **Similarity search options**. By default, sequence reads are compared using
+ the mgblast program. The default threshold is explicitly set to 90% sequence
+ similarity spanning at least 55% of the read length (for reads
+ differing in length, this applies to the longer one). Additionally, the sequence
+ overlap must be at least 55 nt. If you select the option for reads shorter
+ than 100 nt, the minimum overlap of 55 nt is not required.
+
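The default similarity criteria described above can be written as a simple predicate. This is only one reading of the prose, not the actual mgblast invocation or the pipeline's filtering code; in particular, treating the "overlap" as the alignment length is an assumption, and `passes_default_threshold` is a hypothetical name.

```python
def passes_default_threshold(identity_pct: float, overlap_nt: int,
                             len_a: int, len_b: int,
                             short_reads: bool = False) -> bool:
    """Check a pairwise read hit against the default criteria.

    identity_pct: percent identity of the aligned region
    overlap_nt:   length of the aligned region (assumed meaning of "overlap")
    short_reads:  True when the option for reads shorter than 100 nt is used,
                  in which case the absolute 55 nt minimum is not applied
    """
    longer = max(len_a, len_b)
    if identity_pct < 90.0:
        return False
    if overlap_nt < 0.55 * longer:  # 55% of the longer read
        return False
    if not short_reads and overlap_nt < 55:
        return False
    return True
```

With two 100 nt reads, a 60 nt overlap at 95% identity passes, while the same overlap between a 100 nt and a 120 nt read does not, because 55% of the longer read is 66 nt.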
+ By default,
+ the mgblast search uses the DUST program to filter out
+ low-complexity sequences. If you want
+ to increase the sensitivity of detection of satellites with shorter monomers,
+ use the option '*no masking of low complexity repeats*'. Note that omitting
+ DUST filtering will significantly increase running times.
+
+
+ **Automatic filtering of abundant satellite repeats** performs clustering on
+ a smaller dataset of sequence reads to detect abundant high-confidence
+ satellite repeats. If such satellites are detected, sequence reads derived
+ from them are depleted from the input dataset. This step enables more
+ sensitive detection of less abundant repeats, as more reads can be used
+ in the clustering step.
+
+ **Use custom repeat database**. This option allows users to perform similarity
+ comparison of identified repeats to their custom databases. The repeat class must
+ be encoded in FASTA headers of database entries in order to allow correct
+ parsing of similarity hits. The required format for custom database sequence names is::
+
+ >repeatname#class/subclass
+
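A custom library can be checked against the header format above before it is uploaded. The sketch below is a hypothetical validator, not the pipeline's own parser; it assumes that the part after `#` is a `/`-separated classification path, and the entry name in the usage note is invented.

```python
from typing import List, Tuple


def parse_header(header: str) -> Tuple[str, List[str]]:
    """Split a 'name#class/subclass' FASTA header into name and class path."""
    text = header.strip()
    if text.startswith(">"):
        text = text[1:]
    name, sep, classification = text.partition("#")
    if not sep or not name or not classification:
        raise ValueError(f"header {header!r} does not match name#class/subclass")
    return name, classification.split("/")
```

For example, `parse_header(">myrepeat#class/subclass")` returns `("myrepeat", ["class", "subclass"])`, and a header without `#` raises `ValueError`.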
+
+ **Output**
+
+ A list of clusters identified as putative satellite repeats, together with
+ their genomic abundance and various cluster characteristics, is provided.
+
+ The output includes an **HTML summary** with a table listing all analyzed
+ clusters. More detailed information about clusters is provided in
+ additional files and directories. All results are also provided as a
+ downloadable **zip archive**. Additionally, a **log file** reporting
+ the progress of the computational pipeline is provided.
+
+
+
+
diff -r 000000000000 -r 43c4250c6761 repex_tarean.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/repex_tarean.xml Thu Apr 30 07:42:45 2020 -0400
@@ -0,0 +1,251 @@
+
+
+
+
+
+
+
+ Identification of genomic tandem repeats from NGS data
+
+ imagemagick
+ mafft
+ blast
+ diamond
+ blast-legacy
+ r-igraph
+ r-data.tree
+ r-stringr
+ r-r2html
+ r-hwriter
+ r-dt
+ r-scales
+ r-plotrix
+ r-png
+ r-plyr
+ r-dplyr
+ r-optparse
+ r-dbi
+ r-rsqlite
+ r-rserve
+ bioconductor-biostrings
+ repex_tarean_testing
+ REPEX
+ REPEX_VERSION
+ pyrserve
+
+
+ export PYTHONHASHSEED=0;
+ \${REPEX}/seqclust --paired --sample ${read_sampling.sample} --output_dir=tarean_output --logfile=${log} --cleanup --tarean_mode
+ #if $advanced_options.advanced:
+ --mincl $advanced_options.size_threshold $advanced_options.keep_names $advanced_options.automatic_filtering -M $advanced_options.merging
+ #if $advanced_options.custom_library.options_custom_library :
+ -d $advanced_options.custom_library.library extra_database
+ #end if
+ #if $advanced_options.options.options:
+ -opt $advanced_options.options.options
+ #end if
+ #else:
+ -M 0.2
+
+ #end if
+ ${FastaFile} >stdout.log 2> stderr.log ;
+ echo "STDOUT CONTENT:" >> ${log} ;
+ cat stdout.log >> ${log} ;
+ echo "STDERR CONTENT:" >> ${log} ;
+ cat stderr.log >> ${log} &&
+ \${REPEX}/stderr_filter.py stderr.log &&
+ cd tarean_output &&
+ zip -r ${ReportArchive}.zip * &&
+ mv ${ReportArchive}.zip ${ReportArchive} &&
+ cp index.html ${ReportFile} &&
+ mkdir ${ReportFile.files_path} &&
+ cp -r --parents libdir ${ReportFile.files_path} &&
+ cp -r --parents seqclust/clustering/superclusters ${ReportFile.files_path} &&
+ cp -r --parents seqclust/clustering/clusters ${ReportFile.files_path} &&
+ cp seqclust/clustering/hitsort.cls ${ReportFile.files_path}/seqclust/clustering/hitsort.cls &&
+ cp *.png ${ReportFile.files_path}/ &&
+ cp *.csv ${ReportFile.files_path}/ &&
+ cp *.html ${ReportFile.files_path}/ &&
+ cp *.css ${ReportFile.files_path}/ &&
+ cp *.fasta ${ReportFile.files_path}/ 2>>$log && rm -r ../tarean_output || :
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ **HELP**
+
+ TAREAN - TAndem REpeat ANalyzer is a computational pipeline for
+ **unsupervised identification of satellite repeats** from unassembled
+ sequence reads. The pipeline uses low-pass paired-end whole genome
+ sequence reads and performs graph-based clustering. The resulting
+ clusters, representing all types of repeats present in the genome, are
+ then examined to identify those containing circular structures indicative
+ of tandem repeats. A poster summarizing TAREAN principles and
+ implementation can be found `here`__.
+
+
+ .. __: http://w3lamc.umbr.cas.cz/lamc/?page_id=312
+
+ **Input data**
+
+
+ The analysis requires **paired-end reads** generated by whole genome
+ shotgun sequencing. The data should be provided as a single input file in
+ fasta format with the reads interlaced (see example below). All the pairs
+ must be complete, i.e. both "forward" and "reverse" sequence reads must be
+ present. The reads should all be trimmed to the same length. The optimal
+ size range is between 100 and 200 nucleotides. The number of reads to be
+ analyzed should not exceed 1x coverage of the genome. Genome coverage
+ between 0.01 and 0.5x is recommended. The reads should be filtered for
+ quality. The recommended quality filtering is as follows: each read should
+ have a quality score >=10 for 95% of the bases, i.e. if your reads are 100
+ base pairs long, then a read only passes this quality threshold if 95
+ bases have a quality of 10 or higher. Additionally, any reads containing
+ indeterminate bases (indicated as N in the reads) should be removed.
+ Finally, if either one of the reads in a pair fails to meet the
+ aforementioned thresholds, **both** sequences should be removed.
+
+ Example of interlaced input format::
+
+ >0001_f
+ CGTAATATACATACTTGCTAGCTAGTTGGATGCATCCAACTTGCAAGCTAGTTTGATG
+ >0001_r
+ GATTTGACGGACACACTAACTAGCTAGTTGCATCTAAGCGGGCACACTAACTAACTAT
+ >0002_f
+ ACTCATTTGGACTTAACTTTGATAATAAAAACTTAAAAAGGTTTCTGCACATGAATCG
+ >0002_r
+ TATGTTGAAAAATTGAATTTCGGGACGAAACAGCGTCTATCGTCACGACATAGTGCTC
+ >0003_f
+ TGACATTTGTGAACGTTAATGTTCAACAAATCTTTCCAATGTCTTTTTATCTTATCAT
+ >0003_r
+ TATTGAAATACTGGACACAAATTGGAAATGAAACCTTGTGAGTTATTCAATTTATGTT
+ ...
+
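The recommended quality filter can be expressed as two small predicates. This is a sketch of the rule as described in the text, not the implementation of the preprocessing tool; the function names are hypothetical and quality scores are assumed to be already decoded to integers.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base quality scores)


def read_passes(seq: str, quals: List[int],
                min_quality: int = 10, min_fraction: float = 0.95) -> bool:
    """True if the read has no N and enough bases at or above min_quality."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if not seq or "N" in seq.upper():
        return False
    good = sum(1 for q in quals if q >= min_quality)
    return good >= min_fraction * len(quals)


def pair_passes(forward: Read, reverse: Read) -> bool:
    """A pair is kept only if both mates pass; otherwise both are removed."""
    return read_passes(*forward) and read_passes(*reverse)
```

For a 100 nt read, 95 bases with quality 10 or higher are enough to pass, while 94 are not, which matches the worked example in the text.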
+
+ To perform the quality filtering on your FASTQ-formatted data as described
+ above, and to interlace your paired-end sequence reads,
+ please use the `Preprocessing of paired-reads`__ tool.
+
+ .. __: tool_runner?tool_id=paired_fastq_filtering
+
+
+ **Additional parameters**
+
+ **Sample size** defines how many reads will be used during the computation.
+ The default setting of 500,000 reads will enable detection of high copy
+ number satellites within several hours. For higher
+ sensitivity the sample size can be increased. Since the sample size affects
+ memory usage, this parameter may be automatically adjusted to a lower value
+ during the run. The maximum sample size which can be processed depends on the
+ repetitiveness of the analyzed genome. This significantly limits the number of reads
+ that can be analyzed with the TAREAN pipeline.
+
+ **Perform cluster merging**. Families of repetitive elements are
+ frequently split into multiple clusters rather than being represented as a
+ single one. If you do not want to merge clusters based on the presence
+ of broken read pairs, disable this option.
+
+ **Use custom repeat database**. This option allows users to perform similarity
+ comparison of identified repeats to their custom databases. The repeat class should
+ be encoded in FASTA headers of database entries in order to allow correct
+ parsing of similarity hits.
+
+ **Similarity search options**. By default, sequence reads are compared using
+ the mgblast program. The default threshold is explicitly set to 90% sequence
+ similarity spanning at least 55% of the read length (for reads
+ differing in length, this applies to the longer one). Additionally, the sequence
+ overlap must be at least 55 nt. If you select the option for reads shorter
+ than 100 nt, the minimum overlap of 55 nt is not required.
+
+ By default,
+ the mgblast search uses the DUST program to filter out
+ low-complexity sequences. If you want
+ to increase the sensitivity of detection of satellites with shorter monomers,
+ use the option '*no masking of low complexity repeats*'. Note that omitting
+ DUST filtering will significantly increase running times.
+
+ **Output**
+
+ A list of clusters identified as putative satellite repeats, their genomic
+ abundance and various cluster characteristics are provided. Length and
+ consensus sequences of reconstructed monomers are also shown and
+ accompanied by a detailed output from kmer-based reconstruction including
+ sequences and sequence logos of alternative variants of monomer sequences.
+
+ The output includes an **HTML summary** with a table listing all analyzed
+ clusters. More detailed information about clusters is provided in
+ additional files and directories. All results are also provided as a
+ downloadable **zip archive**. Since read clustering results in
+ thousands of clusters, the search for satellite repeats is limited to
+ a subset of the largest ones corresponding to the most abundant genomic
+ repeats. The default setting of the pipeline is to analyze all clusters containing at least
+ 0.01% of the input reads. Besides the satellite repeats, three other
+ groups of clusters are reported in the output: (1) LTR-retrotransposons,
+ (2) 45S and 5S rDNA and (3) all remaining clusters passing the size
+ threshold. As (1) and (2) contain sequences with circular
+ graphs, their consensus is calculated in the same way as for satellite
+ repeats. Additionally a **log file** reporting the progress of the
+ computational pipeline is provided.
+
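The size threshold of 0.01% of the input reads translates into a minimum number of reads per cluster. The sketch below shows only this arithmetic; rounding up to a whole read is an assumption, `min_cluster_size` is a hypothetical helper, and the percentage is passed as a string so that it can be converted to an exact fraction.

```python
import math
from fractions import Fraction


def min_cluster_size(n_reads: int, threshold_percent: str = "0.01") -> int:
    """Smallest cluster size (in reads) reaching threshold_percent of n_reads."""
    # Fraction avoids floating-point rounding before the ceiling is taken.
    return math.ceil(n_reads * Fraction(threshold_percent) / 100)
```

With the default sample of 500,000 reads, `min_cluster_size(500_000)` gives 50, so under these assumptions clusters of 50 or more reads would be analyzed.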
+
+
+
+
diff -r 000000000000 -r 43c4250c6761 tool_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml Thu Apr 30 07:42:45 2020 -0400
@@ -0,0 +1,9 @@
+
+
+
+
+
+ prepare repex database and scripts
+
+
+
\ No newline at end of file