comparison lib/tarean_output_help.org @ 0:1d1b9e1b2e2f draft

Uploaded
author petr-novak
date Thu, 19 Dec 2019 10:24:45 -0500
-1:000000000000 0:1d1b9e1b2e2f
1 #+TITLE: TAREAN output description
2 #+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" />
3 #+LANGUAGE: en
4
5 * Introduction
6 TAREAN output includes an *HTML report* with a list of all analyzed clusters; the clusters are classified into five categories:
7 + high confidence satellites
8 + low confidence satellites
9 + potential LTR elements
10 + rDNA
11 + other clusters
12 Each cluster for which consensus sequences were reconstructed also has its own detailed report, linked from the main report.
13
14 * Main HTML report
15 This report contains basic information about all clusters larger than the specified threshold (the default value is 0.01% of analyzed reads).
16 ** Table legend
17 + Cluster :: Cluster identifier
18 + Genome Proportion[%] :: /(number of sequences in the cluster / total number of sequences in the clustering) x 100%/
19 + Size :: Number of reads in the cluster
20 + Satellite probability :: Empirical estimate of the probability that the cluster sequences
21 are derived from a satellite repeat. This estimate is based on analysis of more
22 than xxx clusters, including yyy manually annotated and zzz experimentally
23 validated satellite repeats
24 + Consensus :: The consensus sequence is the outcome of k-mer-based
25 analysis and represents the most probable satellite monomer
26 sequence
27 + Kmer analysis ::
28 Link to the k-mer analysis report for the individual cluster
29 + Graph layout :: Graph-based visualization of similarities among sequence
30 reads
31 + Connected component index :: Proportion of nodes of the graph which are part
32 of the largest strongly connected component
33 + Pair completeness index :: Proportion of reads with an available
34 mate-pair within the same cluster
35 + Kmer coverage :: Sum of relative frequencies of all kmers used for consensus
36 sequence reconstruction
37 + |V| :: Number of vertices of the graph
38 + |E| :: Number of edges of the graph
39 + PBS score :: Primer binding site detection score
40 + The longest ORF length :: Length of the longest open reading frame found in
41 any of the six possible reading frames. The search is done on a dimer of the
42 consensus, so ORFs can be longer than the 'monomer' length
43 + Similarity-based annotation :: Annotation based on
44 similarity search using blastn/blastx against a database of known
45 repeats.
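The effect of searching a dimer rather than a single monomer can be illustrated with a minimal Python sketch. It assumes that an ORF is any run of codons not interrupted by a stop codon (TAA, TAG, TGA); the ORF definition actually used by TAREAN is not stated here, and all function names are hypothetical.

```python
# Illustrative sketch only: the ORF definition (a stop-free run of codons)
# is an assumption, not necessarily the one used by TAREAN.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_open_stretch(seq, frame):
    """Length (in nucleotides) of the longest stop-free codon run in one frame."""
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0  # a stop codon ends the current open stretch
        else:
            current += 3
            best = max(best, current)
    return best


def longest_orf_on_dimer(monomer):
    """Search all six reading frames of the consensus dimer."""
    dimer = monomer.upper() * 2
    strands = (dimer, reverse_complement(dimer))
    return max(longest_open_stretch(s, f) for s in strands for f in range(3))


monomer = "A" * 10
print(len(monomer), longest_orf_on_dimer(monomer))
```

Because the search runs over two joined copies of the consensus, the reported length can exceed the monomer length, as the legend notes.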
46 * Detailed cluster report
47 The cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted by their significance (that is, by the proportion of k-mers they represent).
48 ** Table legend
49 - kmer :: length of kmer used for consensus reconstruction.
50 - variant :: identifier of consensus variant.
51 - total score :: measure of significance of consensus variant. Score is calculated as a sum of weights of all k-mers used for consensus reconstruction.
52 - monomer length :: length of the consensus
53 - consensus :: consensus sequence without ambiguous bases.
54 - graph image :: part of the de Bruijn graph based on the abundant k-mers. Size of
55 vertices corresponds to k-mer frequencies. Paths in the graph which were used
56 for reconstruction of consensus sequences are colored gray.
57 - logo image :: consensus sequences shown as DNA logo. Height of letters corresponds to kmer frequencies. Logo images are linked to corresponding position probability matrices.
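The relation between a position probability matrix and a consensus without ambiguous bases can be sketched as follows. This is a minimal illustration, assuming the matrix is given as one row per position with four probabilities in the order A, C, G, T; the column order, the in-memory layout and the function name are assumptions, not a description of the actual PPM files.

```python
# Illustrative sketch only: the A, C, G, T column order is an assumption;
# check the layout of the real ppm_*.csv files before reusing this.
BASES = "ACGT"


def consensus_from_ppm(ppm, bases=BASES):
    """Pick the most probable base at every position of the matrix."""
    consensus = []
    for row in ppm:
        best_index = max(range(len(bases)), key=lambda i: row[i])
        consensus.append(bases[best_index])
    return "".join(consensus)


example_ppm = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.10],
    [0.00, 0.00, 0.20, 0.80],
]
print(consensus_from_ppm(example_ppm))
```

Taking the single most probable base per position is one simple way to obtain a sequence free of ambiguity codes; ties are resolved here in favor of the first base in the assumed column order.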
58
59 * Structure of the output archive
60 Complete results of the TAREAN analysis can be downloaded as a zip archive which contains the following
61 files and directories:
62
63 #+BEGIN_SRC files & directories
64 .
65 .
66 ├── clusters_info.csv <------------ list of clusters in tab delimited format
67 ├── index.html <------------ main html report
68 ├── seqclust
69 │   ├── assembly # not implemented yet
70 │   ├── blastn <------------ results of read comparison with DNA database
71 │   ├── blastx <------------ results of read comparison with protein database
72 │   ├── clustering
73 │   │   ├── clusters
74 │   │   │   ├── dir_CL0001 <----┐- detailed information about clusters
75 │   │   │   ├── dir_CL0002 <----│
76 │   │   │   ├── dir_CL0003 <----│
77 │   │   │   .... <----┘
78 │   │   │
79 │   │   └── hitsort.cls <--------- list of reads in individual clusters
80 │   ├── mgblast
81 │   ├── prerun
82 │   └── sequences <--------- input reads
83 ├── summary # not implemented yet
84 ├── TR_consensus_rank_1_.fasta <-- reconstructed monomer sequences for HIGH confidence satellites
85 ├── TR_consensus_rank_2_.fasta <-- reconstructed monomer sequences for LOW confidence satellites
86 ├── TR_consensus_rank_3_.fasta <-- reconstructed sequences of potential LTR elements
87 └── TR_consensus_rank_4_.fasta <-- reconstructed consensus for rDNA
88
89 #+END_SRC
90
91 The list of all clusters shown in the HTML file =index.html= is also
92 available in tab delimited format in the file =clusters_info.csv=, which can be
93 easily viewed and edited in spreadsheet programs. The list of all clusters
94 and their corresponding reads is in the file =hitsort.cls=, which has the following
95 format:
96
97 : >CL1 11
98 : 134234r 55494f 85525f 136746r 96742f 91926f 239729r 105445f 222518r 136402r 9013
99 : >CL2 10
100 : 76205r 120735r 69527r 12235r 176778f 189307f 131952f 163507f 100038r 178475r
101 : >CL3 6
102 : 99835r 222598f 29715r 102023f 99524r 30116f
103 : >CL4 6
104 : 51723r 69073r 218774r 146425f 136314r 41744f
105 : >CL5 5
106 : 70686f 65565f 234078r 50430r 68247r
107
108 where =CL1 11= is the cluster ID followed by the number of reads in the cluster;
109 the next line contains the list of all read names belonging to the cluster.
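The format described above can be read with a few lines of Python. The following is a minimal sketch, assuming each header line has exactly two fields (cluster ID and read count) and that read names are separated by whitespace; the function name =parse_hitsort= is hypothetical.

```python
# Illustrative sketch only: assumes ">ID COUNT" headers followed by
# whitespace-separated read names, as in the example above.
def parse_hitsort(lines):
    """Return a list of (cluster_id, declared_size, read_names) tuples."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            cluster_id, declared = line[1:].split()
            records.append((cluster_id, int(declared), []))
        elif records:
            # read names belong to the most recent header line
            records[-1][2].extend(line.split())
    return records


SAMPLE = """\
>CL3 6
99835r 222598f 29715r 102023f 99524r 30116f
>CL5 5
70686f 65565f 234078r 50430r 68247r
"""

clusters = parse_hitsort(SAMPLE.splitlines())
for cluster_id, declared, reads in clusters:
    print(cluster_id, declared, len(reads))
```

Comparing the declared size with the number of parsed read names is a cheap consistency check when reading a real =hitsort.cls= file.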
110 ** Structure of cluster directories
111
112 Detailed information for each cluster is stored in subdirectories:
113
114 #+BEGIN_SRC folder directories
115 dir_CL0011
116 ├── blast.csv <------------tab delimited file, all-to-all comparison of reads within cluster
117 ├── CL11_directed_graph.RData <----directed graph representation of cluster saved as R igraph object
118 ├── CL11.GL <-----------------undirected graph representation of cluster saved as R igraph object
119 ├── CL11.png <-----------┐- images with graph visualization
120 ├── CL11_tmb.png <-----------┘
121 ├── dna_database_annotation.csv <-- annotation of cluster reads based on the DNA database of repeats
122 ├── reads_all.fas <---------------- all reads included in the cluster in fasta format
123 ├── reads.fas <---------------- subset of reads used for monomer reconstruction
124 ├── reads_oriented.fas <------------ subset of reads all in the same orientation
125 └── tarean
126     ├── consensus.fasta <----------- fasta file with tandem repeat consensus variants
127     ├── ggmin.RData
128     ├── img
129     │   ├── graph_11mer_1.png <-----┐
130     │   ├── graph_11mer_2.png <-----│
131     │   ├── graph_15mer_2.png <-----│
132     │   ├── graph_15mer_3.png <-----│
133     │   ├── graph_15mer_4.png <-----│ images of kmer-based graphs used for reconstruction of
134     │   ├── graph_19mer_2.png <-----│ monomer variants
135     │   ├── graph_19mer_4.png <-----│
136     │   ├── graph_19mer_5.png <-----│
137     │   ├── graph_23mer_2.png <-----│
138     │   ├── graph_27mer_3.png <-----┘
139     │   │
140     │   ├── logo_11mer_1.png <-----┐
141     │   ├── logo_11mer_2.png <-----│
142     │   ├── logo_15mer_2.png <-----│
143     │   ├── logo_15mer_3.png <-----│
144     │   ├── logo_15mer_4.png <-----│ images with DNA logos representing consensus sequences
145     │   ├── logo_19mer_2.png <-----│ of monomer variants
146     │   ├── logo_19mer_4.png <-----│
147     │   ├── logo_19mer_5.png <-----│
148     │   ├── logo_23mer_2.png <-----│
149     │   └── logo_27mer_3.png <-----┘
150     │
151     ├── ppm_11mer_1.csv <-----┐
152     ├── ppm_11mer_2.csv <-----│
153     ├── ppm_15mer_2.csv <-----│
154     ├── ppm_15mer_3.csv <-----│
155     ├── ppm_15mer_4.csv <-----│ position probability matrices for individual monomer
156     ├── ppm_19mer_2.csv <-----│ variants derived from k-mer frequencies
157     ├── ppm_19mer_4.csv <-----│
158     ├── ppm_19mer_5.csv <-----│
159     ├── ppm_23mer_2.csv <-----│
160     ├── ppm_27mer_3.csv <-----┘
161     │
162     ├── reads_oriented.fas_11.kmers <-----┐
163     ├── reads_oriented.fas_15.kmers <-----│
164     ├── reads_oriented.fas_19.kmers <-----│ k-mer frequencies calculated on oriented reads
165     ├── reads_oriented.fas_23.kmers <-----│ for k-mer lengths 11 - 27
166     ├── reads_oriented.fas_27.kmers <-----┘
167     ├── reads_oriented.fasblast_out.cvs <---------┐ results of blastn search against database of tRNA
168     ├── reads_oriented.fasblast_out.cvs_L.csv <----│ for purposes of LTR detection
169     ├── reads_oriented.fasblast_out.cvs_R.csv <----┘
170     └── report.html <--- cluster analysis HTML summary
171 #+END_SRC
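The graph, logo and PPM files in the listing above appear to share one naming pattern, =<kind>_<k>mer_<number>=. The Python sketch below groups such files by k-mer length and number. Reading the trailing number as the variant identifier is an assumption based on the table legend of the detailed cluster report, and the function names are hypothetical.

```python
import re

# Illustrative sketch only: the pattern is inferred from the file names in
# the listing above; treating the last number as the variant identifier is
# an assumption.
NAME_PATTERN = re.compile(r"^(graph|logo|ppm)_(\d+)mer_(\d+)\.(png|csv)$")


def parse_variant_file(name):
    """Return (kind, k, variant) for a matching file name, otherwise None."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    kind, k, variant, _extension = match.groups()
    return kind, int(k), int(variant)


def group_by_variant(names):
    """Map (k, variant) to a dict of file names keyed by kind."""
    groups = {}
    for name in names:
        parsed = parse_variant_file(name)
        if parsed is not None:
            kind, k, variant = parsed
            groups.setdefault((k, variant), {})[kind] = name
    return groups


files = ["graph_15mer_2.png", "logo_15mer_2.png", "ppm_15mer_2.csv", "report.html"]
print(group_by_variant(files))
```

Grouping by the pair of k-mer length and number keeps the graph image, the logo image and the matrix of one monomer variant together.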
172
173
174