#+TITLE: TAREAN output description
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" />
#+LANGUAGE: en

* Introduction
The TAREAN output includes an *HTML report* listing all analyzed clusters. The clusters are classified into five categories:
+ high confidence satellites
+ low confidence satellites
+ potential LTR elements
+ rDNA
+ other clusters
Each cluster for which consensus sequences were reconstructed also has its own detailed report, which is linked from the main report.

* Main HTML report
This report contains basic information about all clusters larger than the specified threshold (the default is 0.01% of analyzed reads).
** Table legend
+ Cluster :: Cluster identifier
+ Genome Proportion[%] :: /(Number of sequences in cluster/Number of sequences in clustering) x 100%/
+ Size :: Number of reads in the cluster
+ Satellite probability :: Empirical estimate of the probability that the cluster sequences
  are derived from a satellite repeat. This estimate is based on analysis of more
  than xxx clusters, including yyy manually annotated and zzz experimentally
  validated satellite repeats
+ Consensus :: The consensus sequence is the outcome of the kmer-based
  analysis and represents the most probable satellite monomer
  sequence
+ Kmer analysis :: Link to the analysis report for the individual cluster
+ Graph layout :: Graph-based visualization of similarities among sequence
  reads
+ Connected component index :: Proportion of graph nodes that are part
  of the largest strongly connected component
+ Pair completeness index :: Proportion of reads whose
  mate pair is in the same cluster
+ Kmer coverage :: Sum of relative frequencies of all kmers used for consensus
  sequence reconstruction
+ |V| :: Number of vertices of the graph
+ |E| :: Number of edges of the graph
+ PBS score :: Primer binding site detection score
+ The longest ORF length :: Length of the longest open reading frame found in
  any of the six possible reading frames. The search was done on a dimer of
  the consensus, so ORFs can be longer than the 'monomer' length
+ Similarity-based annotation :: Annotation based on
  similarity search using blastn/blastx against a database of known
  repeats.
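The remark about ORF length in the legend above can be illustrated with a short sketch. It assumes that an ORF is any stretch of codons not interrupted by a stop codon; the ORF definition actually used by TAREAN is not stated here, and the function names and the toy monomer are hypothetical.

#+BEGIN_SRC python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_length(seq):
    """Longest stop-free stretch (in nucleotides) over all six reading frames.

    Assumption: an ORF is any run of codons without a stop codon.
    """
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            run = 0
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in STOP_CODONS:
                    run = 0  # a stop codon ends the current ORF
                else:
                    run += 3
                    best = max(best, run)
    return best


monomer = "ATGGCC"     # toy monomer, not real data
dimer = monomer * 2    # searching the dimer lets an ORF span the monomer junction
#+END_SRC

For this toy monomer, the dimer yields an ORF twice as long as the monomer itself, which is why reported ORF lengths can exceed the monomer length.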
* Detailed cluster report
The cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted by their significance (that is, by the proportion of k-mers they represent).
** Table legend
- kmer :: length of the k-mer used for consensus reconstruction.
- variant :: identifier of the consensus variant.
- total score :: measure of significance of the consensus variant. The score is calculated as the sum of weights of all k-mers used for consensus reconstruction.
- monomer length :: length of the consensus.
- consensus :: consensus sequence without ambiguous bases.
- graph image :: part of the de Bruijn graph based on the abundant k-mers. The size of
  vertices corresponds to k-mer frequencies. Paths in the graph that were used
  for reconstruction of consensus sequences are colored gray.
- logo image :: consensus sequences shown as a DNA logo. The height of letters corresponds to k-mer frequencies. Logo images are linked to the corresponding position probability matrices.

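The relationship between a position probability matrix and a consensus without ambiguous bases can be sketched as follows. This is only an illustration: the matrix is held in memory with one row per position and columns ordered A, C, G, T, which is an assumption, since the layout of the linked matrices is not described here. The function name and the values are hypothetical.

#+BEGIN_SRC python
BASES = "ACGT"  # assumed column order of the matrix


def consensus_from_ppm(ppm):
    """Pick the most probable base at each position of the matrix."""
    return "".join(
        BASES[max(range(len(BASES)), key=row.__getitem__)] for row in ppm
    )


# Toy matrix with three positions; values are made up for illustration.
ppm = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
consensus = consensus_from_ppm(ppm)
#+END_SRC

Taking the most probable base per position always yields a single unambiguous letter, at the cost of hiding how variable each position is; the logo image keeps that information visible.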
* Structure of the output archive
Complete results of the TAREAN analysis can be downloaded as a zip archive, which contains the following
files and directories:

#+BEGIN_SRC files & directories
.
├── clusters_info.csv           <-- list of clusters in tab delimited format
├── index.html                  <-- main HTML report
├── seqclust
│   ├── assembly                # not implemented yet
│   ├── blastn                  <-- results of read comparison with DNA database
│   ├── blastx                  <-- results of read comparison with protein database
│   ├── clustering
│   │   ├── clusters
│   │   │   ├── dir_CL0001      <--┐ detailed information about clusters
│   │   │   ├── dir_CL0002      <--│
│   │   │   ├── dir_CL0003      <--│
│   │   │   ....                <--┘
│   │   │
│   │   └── hitsort.cls         <-- list of reads in individual clusters
│   ├── mgblast
│   ├── prerun
│   └── sequences               <-- input reads
├── summary                     # not implemented yet
├── TR_consensus_rank_1_.fasta  <-- reconstructed monomer sequences for HIGH confidence satellites
├── TR_consensus_rank_2_.fasta  <-- reconstructed monomer sequences for LOW confidence satellites
├── TR_consensus_rank_3_.fasta  <-- reconstructed sequences of potential LTR elements
└── TR_consensus_rank_4_.fasta  <-- reconstructed consensus for rDNA

#+END_SRC

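As the listing above notes, =clusters_info.csv= is tab delimited despite its =.csv= extension, so a reader has to be told to split on tabs. The following is a minimal sketch using only the Python standard library. The column names in the sample are hypothetical and serve only to make the example runnable; the real file should be inspected for its actual columns.

#+BEGIN_SRC python
import csv
import io


def read_clusters_table(handle):
    """Return the rows of a tab-delimited table as lists of strings."""
    return list(csv.reader(handle, delimiter="\t"))


# Hypothetical two-column excerpt standing in for the real file.
sample = io.StringIO("Cluster\tSize\n1\t11\n2\t10\n")
rows = read_clusters_table(sample)
#+END_SRC

With the real archive, the same function could be called on a file opened with =open("clusters_info.csv", newline="")=.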
The list of all clusters shown in the HTML file =index.html= is also
available in tab delimited format in the file =clusters_info.csv=, which can be
easily viewed and edited in spreadsheet programs. The list of all clusters
and their reads is in the file =hitsort.cls=, which has the following
format:

: >CL1 11
: 134234r 55494f 85525f 136746r 96742f 91926f 239729r 105445f 222518r 136402r 9013
: >CL2 10
: 76205r 120735r 69527r 12235r 176778f 189307f 131952f 163507f 100038r 178475r
: >CL3 6
: 99835r 222598f 29715r 102023f 99524r 30116f
: >CL4 6
: 51723r 69073r 218774r 146425f 136314r 41744f
: >CL5 5
: 70686f 65565f 234078r 50430r 68247r

where =CL1 11= is the cluster ID followed by the number of reads in the cluster;
the next line lists the names of all reads belonging to the cluster.
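The =hitsort.cls= format described above is simple enough to parse in a few lines. The following is a minimal sketch, not part of TAREAN; the function name is hypothetical, and it assumes that read names are separated by whitespace and contain no spaces.

#+BEGIN_SRC python
def parse_hitsort(lines):
    """Map each cluster ID to the list of its read names.

    Header lines look like '>CL3 6' (ID, then read count); the lines
    that follow hold whitespace-separated read names.
    """
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # drop '>' and the read count
            clusters[current] = []
        elif current is not None:
            clusters[current].extend(line.split())
    return clusters


example = [
    ">CL3 6",
    "99835r 222598f 29715r 102023f 99524r 30116f",
    ">CL5 5",
    "70686f 65565f 234078r 50430r 68247r",
]
clusters = parse_hitsort(example)
#+END_SRC

Because the function accepts any iterable of lines, an open file handle for =hitsort.cls= can be passed to it directly.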
** Structure of cluster directories

Detailed information for each cluster is stored in subdirectories:

#+BEGIN_SRC folder directories
dir_CL0011
├── blast.csv                                 <-- tab delimited file, all-to-all comparison of reads within cluster
├── CL11_directed_graph.RData                 <-- directed graph representation of cluster saved as R igraph object
├── CL11.GL                                   <-- undirected graph representation of cluster saved as R igraph object
├── CL11.png                                  <--┐ images with graph visualization
├── CL11_tmb.png                              <--┘
├── dna_database_annotation.csv               <-- annotation of cluster reads based on the DNA database of repeats
├── reads_all.fas                             <-- all reads included in the cluster in fasta format
├── reads.fas                                 <-- subset of reads used for monomer reconstruction
├── reads_oriented.fas                        <-- subset of reads all in the same orientation
└── tarean
    ├── consensus.fasta                       <-- fasta file with tandem repeat consensus variants
    ├── ggmin.RData
    ├── img
    │   ├── graph_11mer_1.png                 <--┐
    │   ├── graph_11mer_2.png                 <--│
    │   ├── graph_15mer_2.png                 <--│
    │   ├── graph_15mer_3.png                 <--│
    │   ├── graph_15mer_4.png                 <--│ images of kmer-based graphs used for reconstruction of
    │   ├── graph_19mer_2.png                 <--│ monomer variants
    │   ├── graph_19mer_4.png                 <--│
    │   ├── graph_19mer_5.png                 <--│
    │   ├── graph_23mer_2.png                 <--│
    │   ├── graph_27mer_3.png                 <--┘
    │   │
    │   ├── logo_11mer_1.png                  <--┐
    │   ├── logo_11mer_2.png                  <--│
    │   ├── logo_15mer_2.png                  <--│
    │   ├── logo_15mer_3.png                  <--│
    │   ├── logo_15mer_4.png                  <--│ images with DNA logos representing consensus sequences
    │   ├── logo_19mer_2.png                  <--│ of monomer variants
    │   ├── logo_19mer_4.png                  <--│
    │   ├── logo_19mer_5.png                  <--│
    │   ├── logo_23mer_2.png                  <--│
    │   └── logo_27mer_3.png                  <--┘
    │
    ├── ppm_11mer_1.csv                       <--┐
    ├── ppm_11mer_2.csv                       <--│
    ├── ppm_15mer_2.csv                       <--│
    ├── ppm_15mer_3.csv                       <--│
    ├── ppm_15mer_4.csv                       <--│ position probability matrices for individual monomer
    ├── ppm_19mer_2.csv                       <--│ variants derived from k-mer frequencies
    ├── ppm_19mer_4.csv                       <--│
    ├── ppm_19mer_5.csv                       <--│
    ├── ppm_23mer_2.csv                       <--│
    ├── ppm_27mer_3.csv                       <--┘
    │
    ├── reads_oriented.fas_11.kmers           <--┐
    ├── reads_oriented.fas_15.kmers           <--│
    ├── reads_oriented.fas_19.kmers           <--│ k-mer frequencies calculated on oriented reads
    ├── reads_oriented.fas_23.kmers           <--│ for k-mer lengths 11 - 27
    ├── reads_oriented.fas_27.kmers           <--┘
    ├── reads_oriented.fasblast_out.cvs       <--┐ results of blastn search against database of tRNA
    ├── reads_oriented.fasblast_out.cvs_L.csv <--│ for purposes of LTR detection
    ├── reads_oriented.fasblast_out.cvs_R.csv <--┘
    └── report.html                           <-- cluster analysis HTML summary
#+END_SRC
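The graph, logo, and matrix files in the =tarean= directory appear to share a naming scheme. Assuming the two numbers in each name are the k-mer length and the variant identifier (an inference from the listing above, not a documented guarantee), the files belonging to one monomer variant can be grouped as in this sketch. All function names are hypothetical.

#+BEGIN_SRC python
import re

# Pattern inferred from names such as graph_11mer_1.png or ppm_15mer_2.csv.
NAME_PATTERN = re.compile(r"^(graph|logo|ppm)_(\d+)mer_(\d+)\.(?:png|csv)$")


def parse_variant_file(name):
    """Return (kind, kmer_length, variant), or None if the name does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    kind, kmer, variant = match.groups()
    return kind, int(kmer), int(variant)


def group_by_variant(names):
    """Group file names by (kmer_length, variant)."""
    groups = {}
    for name in names:
        parsed = parse_variant_file(name)
        if parsed is not None:
            kind, kmer, variant = parsed
            groups.setdefault((kmer, variant), {})[kind] = name
    return groups


names = ["graph_15mer_2.png", "logo_15mer_2.png", "ppm_15mer_2.csv", "consensus.fasta"]
groups = group_by_variant(names)
#+END_SRC

Files that do not follow the assumed scheme, such as =consensus.fasta=, are skipped without raising an error.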