RepeatExplorer documentation

Cluster annotation table

Cluster
Cluster index; contains a link to the individual cluster report
Supercluster
Supercluster index; contains a link to the individual supercluster report
Proportion[%]
Proportion of reads in the cluster relative to the total number of analyzed sequences.
Proportions adjusted[%]
The adjusted genome proportion can differ from the unadjusted value if the option Perform automatic filtering of abundant satellite repeats was enabled. In that case, sequences belonging to high-abundance satellites are partially removed from the all-to-all comparison and clustering, so the Genome proportion estimate for these satellites is an underestimate. The adjusted Genome proportion provides a corrected estimate of the ‘real’ genomic proportion of a particular satellite repeat.
Number of reads
number of reads in the cluster
Graph layout
Preview of the graph-based visualization of the sequence read cluster. A more detailed graph layout can be found in the individual cluster reports.
Similarity hits
Summarizes the proportion of reads in the cluster with similarity to REXdb or the DNA reference databases. Only hits with a proportion above 0.1% are shown.
LTR detection
Shows whether an LTR with a primer binding site was detected in the contig assembly and which type of tRNA is used for priming.
Satellite probability
Empirical probability that the cluster represents a satellite
TAREAN classification
TAREAN divides clusters into five categories described in box 9.
Consensus length
For clusters analyzed by the TAREAN module, the best estimate of the monomer length is shown.
Consensus
The best consensus estimate reconstructed by the TAREAN module
Kmer analysis
If the cluster was analyzed by TAREAN, this field contains a link to the detailed TAREAN k-mer analysis (box 10)
Connected component index C, Pair completeness index P, Kmer coverage
Statistics reported by the TAREAN module
|V|
Number of vertices of the graph
|E|
Number of edges of the graph
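
The Proportion[%] and Similarity hits columns can be illustrated with a short sketch. The function names and the example counts below are hypothetical and are not part of RepeatExplorer; the sketch only assumes that the proportion is the cluster's share of all analyzed reads and that the 0.1% threshold applies to the share of reads within the cluster that hit a given category.

```python
def cluster_proportion(reads_in_cluster, reads_analyzed):
    """Percentage of all analyzed reads that fall in the cluster."""
    if reads_analyzed <= 0:
        raise ValueError("reads_analyzed must be positive")
    return 100.0 * reads_in_cluster / reads_analyzed


def reported_hits(hit_counts, reads_in_cluster, threshold_percent=0.1):
    """Return the hit categories whose share of cluster reads exceeds the threshold.

    hit_counts maps a category name to the number of cluster reads with a
    similarity hit in that category (hypothetical input format).
    """
    shares = {
        name: 100.0 * count / reads_in_cluster
        for name, count in hit_counts.items()
    }
    return {name: share for name, share in shares.items()
            if share > threshold_percent}


# Illustrative numbers only: a cluster of 2500 reads out of 100000 analyzed.
proportion = cluster_proportion(2500, 100000)
hits = reported_hits({"Ty3/gypsy": 400, "rDNA": 1}, 2500)
```

With these made-up counts, the rDNA category accounts for 0.04% of the cluster reads and would therefore not be listed.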

Supercluster annotation table

Supercluster
supercluster index
Reads
number of reads in supercluster
Automatic classification
Result of automatic supercluster classification
Similarity hits
The numbers of similarity hits against REXdb and the DNA database are shown in the classification tree structure, together with the number of reads assigned to putative satellite clusters and information about detection of LTR/PBS. Parts of the tree without any evidence are pruned.
TAREAN annotation
Clusters that are part of the supercluster and classified by TAREAN as putative satellites are listed here
Clusters
Hyperlinked list of clusters that are part of the supercluster.

Tandem repeat analysis

TAREAN divides clusters into five categories with corresponding files in the archive:

Summary tables in the TAREAN HTML report include the following information:

Cluster
cluster identifier
Proportion[%]
(Number of sequences in cluster/Number of sequences in clustering) x 100%
Proportion adjusted[%]
Number of reads
Number of reads in the cluster
Satellite probability
Empirical probability estimate that the cluster sequences are derived from a satellite repeat. This estimate is based on analysis of manually annotated and experimentally validated satellite repeats
Consensus length
Consensus
The consensus sequence is the outcome of the k-mer based analysis and represents the most probable satellite monomer sequence. Alternative consensus sequences are included in the individual cluster reports
Graph layout
Graph-based visualization of similarities among sequence reads
Kmer analysis
Hyperlink to the individual cluster's TAREAN k-mer report (fig X, box 10)
Connected component index C
Proportion of nodes of the graph that are part of the largest strongly connected component
Pair completeness index P
Proportion of reads whose mate pair is present within the same cluster
Kmer coverage
Sum of relative frequencies of all kmers used for consensus sequence reconstruction
|V|
Number of vertices of the graph
|E|
Number of edges of the graph
PBS score
Primer binding site detection score
Similarity hits
Similarity hits based on a search using blastn/blastx against built-in databases of known sequences. By default, this contains similarity hits to the built-in database, which includes rDNA, plastid, and mitochondrial sequences. If TAREAN was run within the RepeatExplorer2 pipeline, it also contains information about similarity hits against the REXdb database.

Individual cluster TAREAN reports contain other variants of the consensus sequence, sorted by k-mer coverage score. For each consensus, the corresponding de Bruijn graph representation and sequence logo are shown.
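
The Connected component index C and Pair completeness index P from the summary table can be sketched as follows. This is a minimal illustration of the definitions given above, not TAREAN's own code: the input formats (a list of vertices with directed edges, and reads given as hypothetical `(pair_id, mate)` tuples) are assumptions, and Kosaraju's algorithm is used here simply as one standard way to find strongly connected components.

```python
from collections import defaultdict


def connected_component_index(vertices, edges):
    """Fraction of vertices in the largest strongly connected component.

    Assumes every edge (src, dst) only references items of `vertices`.
    """
    forward = defaultdict(list)
    backward = defaultdict(list)
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)

    # First pass: record vertices in order of depth-first completion.
    visited = set()
    finish_order = []
    for start in vertices:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All neighbours explored: this vertex is finished.
                finish_order.append(node)
                stack.pop()

    # Second pass: walk the reversed graph in reverse completion order;
    # each walk collects exactly one strongly connected component.
    assigned = set()
    largest = 0
    for start in reversed(finish_order):
        if start in assigned:
            continue
        assigned.add(start)
        size = 0
        todo = [start]
        while todo:
            node = todo.pop()
            size += 1
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        largest = max(largest, size)
    return largest / len(vertices) if vertices else 0.0


def pair_completeness_index(reads):
    """Fraction of reads whose mate is also in the cluster.

    `reads` holds (pair_id, mate) tuples with mate equal to 1 or 2
    (a hypothetical representation of paired reads).
    """
    reads = set(reads)
    if not reads:
        return 0.0
    with_mate = sum(1 for pair_id, mate in reads
                    if (pair_id, 3 - mate) in reads)
    return with_mate / len(reads)


# Toy graph: a, b and c form a cycle; d is only reachable from c.
c_index = connected_component_index(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")],
)
# Toy cluster: pair p1 is complete, p2 and p3 each lack their mate.
p_index = pair_completeness_index(
    [("p1", 1), ("p1", 2), ("p2", 1), ("p3", 2)]
)
```

In the toy graph three of the four vertices lie on a common cycle, and in the toy cluster two of the four reads have their mate present.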

TAREAN k-mer analysis report

The TAREAN module generates a k-mer analysis report for each cluster assigned to the putative satellite, rDNA, or putative LTR category. Monomer sequences of putative tandem repeats are reconstructed with a k-mer based method that uses the most frequent k-mers. Several k-mer lengths are evaluated and the best estimates of the monomer consensus sequence are reported. The k-mer analysis summary contains the following information:

k-mer length
length of the k-mer used for monomer reconstruction
Variant index
Each k-mer length can yield multiple consensus variants. Variants are indexed.
k-mer coverage score
Sum of the proportions of all k-mers used for reconstruction of a particular monomer. A value of 1 means that all k-mers from the corresponding cluster were used for reconstruction of the monomer, i.e. there is no variability. The more variable the monomer, the lower the k-mer coverage score.
Consensus length
length of estimated monomer
Consensus
Consensus sequence extracted from the position probability matrix.
k-mer based graph
Visualization of the de Bruijn graph. Each vertex corresponds to a single k-mer. The size of a vertex is proportional to the k-mer frequency. The path used to reconstruct the monomer sequence is greyed out.
Sequence logo
visualization of position probability matrices for corresponding consensus variant.
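
The k-mer coverage score described above can be illustrated with a small sketch. It is not the TAREAN implementation: the function names are hypothetical, k-mer proportions are taken as simple relative counts over the cluster reads, and the monomer is treated as circular (an assumption made here because tandem copies follow one another head to tail).

```python
from collections import Counter


def kmer_proportions(sequences, k):
    """Relative frequency of every k-mer observed in the given reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def monomer_kmers(monomer, k):
    """Set of k-mers of a monomer read circularly (assumes k <= len(monomer))."""
    if k > len(monomer):
        raise ValueError("k must not exceed the monomer length")
    wrapped = monomer + monomer[:k - 1]
    return {wrapped[i:i + k] for i in range(len(monomer))}


def kmer_coverage(sequences, monomer, k):
    """Sum of the proportions of the cluster k-mers that occur in the monomer."""
    proportions = kmer_proportions(sequences, k)
    return sum(proportions.get(kmer, 0.0) for kmer in monomer_kmers(monomer, k))


# Toy reads: one perfect tandem array of ACGT and one copy with a substitution.
perfect = "ACGTACGTACGT"
variant = "ACGAACGTACGT"
score_uniform = kmer_coverage([perfect], "ACGT", 3)
score_variable = kmer_coverage([perfect, variant], "ACGT", 3)
```

For the perfect array every k-mer belongs to the monomer, so the score is 1 up to floating-point rounding; adding the variant read introduces k-mers outside the monomer and lowers the score, in line with the description of more variable monomers.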