comparison lib/tarean_output_help.org @ 0:1d1b9e1b2e2f draft
Uploaded
author | petr-novak |
---|---|
date | Thu, 19 Dec 2019 10:24:45 -0500 |
parents | |
children |
comparison
-1:000000000000 | 0:1d1b9e1b2e2f |
---|---|
1 #+TITLE: TAREAN output description | |
2 #+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" /> | |
3 #+LANGUAGE: en | |
4 | |
5 * Introduction | |
6 TAREAN output includes an *HTML report* with a list of all analyzed clusters; the clusters are classified into five categories: | |
7 + high confidence satellites | |
8 + low confidence satellites | |
9 + potential LTR elements | |
10 + rDNA | |
11 + other clusters | |
12 Each cluster for which consensus sequences were reconstructed also has its own detailed report, which is linked from the main report. | |
13 | |
14 * Main HTML report | |
15 This report contains basic information about all clusters larger than the specified threshold (the default value is 0.01% of analyzed reads). | |
16 ** Table legend | |
17 + Cluster :: Cluster identifier | |
18 + Genome Proportion[%] :: /(Number of sequences in cluster/Number of sequences in clustering) x 100%/ | |
19 + Size :: Number of reads in the cluster | |
20 + Satellite probability :: Empirical probability estimate that cluster sequences | |
21 are derived from a satellite repeat. This estimate is based on analysis of more | |
22 than xxx clusters including yyy manually annotated and zzz experimentally | |
23 validated satellite repeats | |
24 + Consensus :: Consensus sequence is the outcome of kmer-based | |
25 analysis and represents the most probable satellite monomer | |
26 sequence | |
27 + Kmer analysis :: | |
28 link to the analysis report for the individual cluster | |
29 + Graph layout :: Graph-based visualization of similarities among sequence | |
30 reads | |
31 + Connected component index :: Proportion of nodes of the graph which are part | |
32 of the largest strongly connected component | |
33 + Pair completeness index :: Proportion of reads with available | |
34 mate-pair within the same cluster | |
35 + Kmer coverage :: Sum of relative frequencies of all kmers used for consensus | |
36 sequence reconstruction | |
37 + |V| :: Number of vertices of the graph | |
38 + |E| :: Number of edges of the graph | |
39 + PBS score :: Primer binding site detection score | |
40 + The longest ORF length :: Length of the longest open reading frame found in | |
41 any of the six possible reading frames. The search is done on a dimer of the | |
42 consensus, so ORFs can be longer than the 'monomer' length | |
43 + Similarity-based annotation :: Annotation based on | |
44 similarity search using blastn/blastx against a database of known | |
45 repeats. | |
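As a worked illustration of the Genome Proportion[%] entry in the legend above, the following minimal Python sketch evaluates the formula. The function name and the example numbers are hypothetical and are not taken from TAREAN itself.

```python
def genome_proportion(cluster_size: int, total_reads: int) -> float:
    """Return the percentage of all clustered reads that belong to one cluster."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return cluster_size / total_reads * 100.0


# Hypothetical example: a cluster of 250 reads in a clustering of 100000 reads.
proportion = genome_proportion(250, 100_000)
print(f"{proportion:.2f}%")  # 0.25%
```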
46 * Detailed cluster report | |
47 Cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted based on their significance (that is, what proportion of k-mers they represent). | |
48 ** Table legend | |
49 - kmer :: length of kmer used for consensus reconstruction. | |
50 - variant :: identifier of consensus variant. | |
51 - total score :: measure of significance of consensus variant. Score is calculated as a sum of weights of all k-mers used for consensus reconstruction. | |
52 - monomer length :: length of the consensus | |
53 - consensus :: consensus sequence without ambiguous bases. | |
54 - graph image :: part of the de Bruijn graph based on the abundant k-mers. Size of | |
55 vertices corresponds to k-mer frequencies. Paths in the graph which were used | |
56 for reconstruction of consensus sequences are colored gray. | |
57 - logo image :: consensus sequences shown as DNA logo. Height of letters corresponds to kmer frequencies. Logo images are linked to corresponding position probability matrices. | |
58 | |
59 * Structure of the output archive | |
60 Complete results from TAREAN analysis can be downloaded as a zip archive which contains the following | |
61 files and directories: | |
62 | |
63 #+BEGIN_SRC files & directories | |
64 . | |
65 . | |
66 ├── clusters_info.csv <------------ list of clusters in tab delimited format | |
67 ├── index.html <------------ main html report | |
68 ├── seqclust | |
69 │ ├── assembly # not implemented yet | |
70 │ ├── blastn <------------ results of read comparison with DNA database | |
71 │ ├── blastx <------------ results of read comparison with protein database | |
72 │ ├── clustering | |
73 │ │ ├── clusters | |
74 │ │ │ ├── dir_CL0001 <----┐- detailed information about clusters | |
75 │ │ │ ├── dir_CL0002 <----│ | |
76 │ │ │ ├── dir_CL0003 <----│ | |
77 │ │ │ .... <----┘ | |
78 │ │ │ | |
79 │ │ └── hitsort.cls <--------- list of reads in individual clusters | |
80 │ ├── mgblast | |
81 │ ├── prerun | |
82 │ └── sequences <--------- input reads | |
83 ├── summary # not implemented yet | |
84 ├── TR_consensus_rank_1_.fasta <-- reconstructed monomer sequences for HIGH confidence satellites | |
85 ├── TR_consensus_rank_2_.fasta <-- reconstructed monomer sequences for LOW confidence satellites | |
86 ├── TR_consensus_rank_3_.fasta <-- reconstructed sequences of potential LTR elements | |
87 └── TR_consensus_rank_4_.fasta <-- reconstructed consensus for rDNA | |
88 | |
89 #+END_SRC | |
90 | |
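The tree above describes =clusters_info.csv= as a tab delimited list of clusters. A minimal Python sketch for loading such a file is shown here. It assumes that the first line of the file holds the column names, which this document does not state; the function name =read_cluster_table= is hypothetical.

```python
import csv
from pathlib import Path


def read_cluster_table(path):
    """Read a tab-delimited table and return (header, records).

    Column names are not hard-coded; they are taken from the first line,
    which is an assumption about the file layout.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        return [], []
    header, body = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    table = Path("clusters_info.csv")  # location inside the unpacked archive
    if table.exists():
        header, records = read_cluster_table(table)
        print(header)
        print(len(records), "clusters")
```

All values are returned as strings, so numeric columns need to be converted by the caller.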
91 The list of all clusters shown in the HTML file =index.html= is also | |
92 available in tab-delimited format in the file =clusters_info.csv=, which can be | |
93 easily viewed and edited in spreadsheet programs. The list of all clusters | |
94 and the corresponding reads is in the file =hitsort.cls=, which has the following | |
95 format: | |
96 | |
97 : >CL1 11 | |
98 : 134234r 55494f 85525f 136746r 96742f 91926f 239729r 105445f 222518r 136402r 9013 | |
99 : >CL2 10 | |
100 : 76205r 120735r 69527r 12235r 176778f 189307f 131952f 163507f 100038r 178475r | |
101 : >CL3 6 | |
102 : 99835r 222598f 29715r 102023f 99524r 30116f | |
103 : >CL4 6 | |
104 : 51723r 69073r 218774r 146425f 136314r 41744f | |
105 : >CL5 5 | |
106 : 70686f 65565f 234078r 50430r 68247r | |
107 | |
108 where =CL1 11= is the cluster ID followed by the number of reads in the cluster; | |
109 the next line contains the list of all read names belonging to the cluster. | |
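The =hitsort.cls= layout described above can be read with a few lines of Python. This is a minimal sketch under the stated format only (a =>ID count= header line followed by whitespace-separated read names); the function name =parse_hitsort= is hypothetical and is not part of TAREAN.

```python
def parse_hitsort(lines):
    """Parse hitsort.cls-style lines.

    Returns two dicts keyed by cluster ID: the list of read names and the
    read count declared in the header line.
    """
    clusters = {}
    declared = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current_id = fields[0]
            declared[current_id] = int(fields[1])
            clusters[current_id] = []
        elif current_id is not None:
            clusters[current_id].extend(line.split())
    return clusters, declared


example = """\
>CL3 6
99835r 222598f 29715r 102023f 99524r 30116f
>CL5 5
70686f 65565f 234078r 50430r 68247r
"""
clusters, declared = parse_hitsort(example.splitlines())
for cluster_id, reads in clusters.items():
    # Compare the number of parsed reads with the count from the header.
    print(cluster_id, len(reads), len(reads) == declared[cluster_id])
```

For a real file, the same function can be called on an open file handle, for example =parse_hitsort(open("hitsort.cls"))=.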
110 ** Structure of cluster directories | |
111 | |
112 Detailed information for each cluster is stored in subdirectories: | |
113 | |
114 #+BEGIN_SRC folder directories | |
115 dir_CL0011 | |
116 ├── blast.csv <------------tab delimited file, all-to-all comparison of reads within cluster | |
117 ├── CL11_directed_graph.RData <----directed graph representation of cluster saved as R igraph object | |
118 ├── CL11.GL <-----------------undirected graph representation of cluster saved as R igraph object | |
119 ├── CL11.png <-----------┐- images with graph visualization | |
120 ├── CL11_tmb.png <-----------┘ | |
121 ├── dna_database_annotation.csv <-- annotation of cluster reads based on the DNA database of repeats | |
122 ├── reads_all.fas <---------------- all reads included in the cluster in fasta format | |
123 ├── reads.fas <---------------- subset of reads used for monomer reconstruction | |
124 ├── reads_oriented.fas <------------ subset of reads all in the same orientation | |
125 └── tarean | |
126 ├── consensus.fasta <----------- fasta file with tandem repeat consensus variants | |
127 ├── ggmin.RData | |
128 ├── img | |
129 │ ├── graph_11mer_1.png <-----┐ | |
130 │ ├── graph_11mer_2.png <-----│ | |
131 │ ├── graph_15mer_2.png <-----│ | |
132 │ ├── graph_15mer_3.png <-----│ | |
133 │ ├── graph_15mer_4.png <-----│ images of kmer-based graphs used for reconstruction of | |
134 │ ├── graph_19mer_2.png <-----│ monomer variants | |
135 │ ├── graph_19mer_4.png <-----│ | |
136 │ ├── graph_19mer_5.png <-----│ | |
137 │ ├── graph_23mer_2.png <-----│ | |
138 │ ├── graph_27mer_3.png <-----┘ | |
139 │ │ | |
140 │ ├── logo_11mer_1.png <-----┐ | |
141 │ ├── logo_11mer_2.png <-----│ | |
142 │ ├── logo_15mer_2.png <-----│ | |
143 │ ├── logo_15mer_3.png <-----│ | |
144 │ ├── logo_15mer_4.png <-----│ images with DNA logos representing consensus sequences | |
145 │ ├── logo_19mer_2.png <-----│ of monomer variants | |
146 │ ├── logo_19mer_4.png <-----│ | |
147 │ ├── logo_19mer_5.png <-----│ | |
148 │ ├── logo_23mer_2.png <-----│ | |
149 │ └── logo_27mer_3.png <-----┘ | |
150 │ | |
151 ├── ppm_11mer_1.csv <-----┐ | |
152 ├── ppm_11mer_2.csv <-----│ | |
153 ├── ppm_15mer_2.csv <-----│ | |
154 ├── ppm_15mer_3.csv <-----│ | |
155 ├── ppm_15mer_4.csv <-----│ position probability matrices for individual monomer | |
156 ├── ppm_19mer_2.csv <-----│ variants derived from k-mer frequencies | |
157 ├── ppm_19mer_4.csv <-----│ | |
158 ├── ppm_19mer_5.csv <-----│ | |
159 ├── ppm_23mer_2.csv <-----│ | |
160 ├── ppm_27mer_3.csv <-----┘ | |
161 │ | |
162 ├── reads_oriented.fas_11.kmers <-----┐ | |
163 ├── reads_oriented.fas_15.kmers <-----│ | |
164 ├── reads_oriented.fas_19.kmers <-----│ k-mer frequencies calculated on oriented reads | |
165 ├── reads_oriented.fas_23.kmers <-----│ for k-mer lengths 11 - 27 | |
166 ├── reads_oriented.fas_27.kmers <-----┘ | |
167 ├── reads_oriented.fasblast_out.cvs <---------┐results of blastn search against database of tRNA | |
168 ├── reads_oriented.fasblast_out.cvs_L.csv <----│for purposes of LTR detection | |
169 ├── reads_oriented.fasblast_out.cvs_R.csv <----┘ | |
170 └── report.html <--- cluster analysis HTML summary | |
171 #+END_SRC | |
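The graph, logo and ppm files in the listing above share one naming pattern, so the files belonging to one monomer variant can be collected by name. The Python sketch below does this. It assumes that the first number in a name is the k-mer length and the second is the variant identifier, which matches the listing but is not stated explicitly; the function name =group_variant_files= is hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as graph_11mer_1.png, logo_15mer_2.png, ppm_19mer_4.csv
NAME_PATTERN = re.compile(r"^(graph|logo|ppm)_(\d+)mer_(\d+)\.(png|csv)$")


def group_variant_files(filenames):
    """Map (k, variant) -> {kind: filename} for graph, logo and ppm files."""
    grouped = defaultdict(dict)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # other files, e.g. consensus.fasta or report.html
        kind = match.group(1)
        key = (int(match.group(2)), int(match.group(3)))
        grouped[key][kind] = name
    return dict(grouped)


names = [
    "graph_11mer_1.png", "logo_11mer_1.png", "ppm_11mer_1.csv",
    "graph_15mer_2.png", "logo_15mer_2.png", "ppm_15mer_2.csv",
    "report.html",
]
for key, files in sorted(group_variant_files(names).items()):
    print(key, sorted(files))
```

In practice the names would come from the =tarean= and =tarean/img= directories of one cluster, for example via =os.listdir=.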
172 | |
173 | |
174 |