view lib/tarean_output_help.org @ 0:1d1b9e1b2e2f draft

Uploaded
author petr-novak
date Thu, 19 Dec 2019 10:24:45 -0500

#+TITLE: TAREAN output description
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" />
#+LANGUAGE: en

* Introduction
TAREAN output includes an *HTML report* with a list of all analyzed clusters. The clusters are classified into five categories:
+ high confidence satellites
+ low confidence satellites
+ potential LTR elements
+ rDNA
+ other clusters
Each cluster for which consensus sequences were reconstructed also has its own detailed report, linked from the main report.

* Main HTML report
This report contains basic information about all clusters larger than the specified threshold (the default value is 0.01% of analyzed reads).
** Table legend
+ Cluster ::  Cluster identifier
+ Genome Proportion[%] :: /(Number of sequences in cluster/Number of sequences in clustering) x 100%/
+ Size :: Number of reads in the cluster
+ Satellite probability :: Empirical estimate of the probability that the cluster
     sequences are derived from a satellite repeat. This estimate is based on
     analysis of more than xxx clusters, including yyy manually annotated and zzz
     experimentally validated satellite repeats.
+ Consensus :: The consensus sequence is the outcome of k-mer-based
     analysis and represents the most probable satellite monomer
     sequence.
+ Kmer analysis ::
     Link to the analysis report for the individual cluster.
+ Graph layout :: Graph-based visualization of similarities among sequence
     reads
+ Connected component index :: Proportion of nodes of the graph which are part
     of the largest strongly connected component
+ Pair completeness index ::  Proportion of reads with an available
     mate pair within the same cluster
+ Kmer coverage :: Sum of relative frequencies of all kmers used for consensus
     sequence reconstruction
+ |V| :: Number of vertices of the graph
+ |E| :: Number of edges of the graph
+ PBS score :: Primer binding site detection score
+ The longest ORF length :: Length of the longest open reading frame found in
     any of the six possible reading frames. The search is done on a dimer of
     the consensus, so ORFs can be longer than the monomer length.
+ Similarity-based annotation :: Annotation based on
     similarity search using blastn/blastx against a database of known
     repeats.
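
The dimer-based ORF search described under /The longest ORF length/ can be illustrated with a minimal Python sketch. It is not the TAREAN implementation. It assumes that an ORF is any stretch of codons uninterrupted by a stop codon (no start codon required) and that the standard stop codons apply; the function names are hypothetical.

#+BEGIN_SRC python
# Illustrative sketch only; the ORF definition (stop-to-stop, no start codon
# required) is an assumption, not taken from the TAREAN source code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(seq, frame):
    """Length (in nucleotides) of the longest stop-free codon run in one frame."""
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0          # a stop codon ends the current run
        else:
            current += 3
            best = max(best, current)
    return best


def longest_orf_on_dimer(monomer):
    """Search all six reading frames of the consensus dimer."""
    dimer = monomer.upper() * 2
    strands = (dimer, reverse_complement(dimer))
    return max(longest_orf_in_frame(s, f) for s in strands for f in range(3))
#+END_SRC

Because the search runs on the dimer, the reported length can exceed the monomer length: a 9 bp monomer can yield a value of up to 18.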
* Detailed cluster report
The cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted by their significance (that is, by the proportion of k-mers they represent).
** Table legend
- kmer :: length of kmer used for consensus reconstruction.
- variant :: identifier of consensus variant.
- total score :: measure of significance of consensus variant. Score is calculated as a sum of weights of all k-mers used for consensus reconstruction.
- monomer length :: length of the consensus
- consensus :: consensus sequence without ambiguous bases. 
- graph image :: part of the de Bruijn graph based on the abundant k-mers. The size of
     vertices corresponds to k-mer frequencies. Paths in the graph which were used
     for reconstruction of consensus sequences are colored gray.
- logo image :: consensus sequences shown as a DNA logo. The height of letters corresponds to k-mer frequencies. Logo images are linked to the corresponding position probability matrices.
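
To illustrate how a position probability matrix relates to a consensus without ambiguous bases, the sketch below picks the most probable base at each position. The in-memory layout (one row per position, columns in A, C, G, T order) is an assumption made for the example; it is not a description of the =ppm_*.csv= file format, and the function name is hypothetical.

#+BEGIN_SRC python
# Illustrative sketch only; the row/column layout of the matrix is assumed.
BASES = "ACGT"


def consensus_from_ppm(ppm):
    """Return the most probable base for every row (position) of the matrix."""
    consensus = []
    for row in ppm:
        # Index of the largest probability; ties resolve to the first base.
        best_index = max(range(len(BASES)), key=lambda i: row[i])
        consensus.append(BASES[best_index])
    return "".join(consensus)
#+END_SRC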

* Structure of the output archive
Complete results from the TAREAN analysis can be downloaded as a zip archive which contains the following
files and directories:

#+BEGIN_SRC files & directories
.
├── clusters_info.csv <------------ list of clusters in tab delimited format 
├── index.html        <------------ main html report
├── seqclust
│   ├── assembly                  # not implemented yet
│   ├── blastn        <------------ results of read comparison with DNA database
│   ├── blastx        <------------ results of read comparison with protein database
│   ├── clustering
│   │   ├── clusters
│   │   │   ├── dir_CL0001  <----┐- detailed information about clusters
│   │   │   ├── dir_CL0002  <----│
│   │   │   ├── dir_CL0003  <----│
│   │   │   ....            <----┘
│   │   │   
│   │   └── hitsort.cls  <--------- list of reads in individual clusters
│   ├── mgblast
│   ├── prerun
│   └── sequences        <--------- input reads
├── summary                       # not implemented yet
├── TR_consensus_rank_1_.fasta  <-- reconstructed monomer sequences for HIGH confidence satellites
├── TR_consensus_rank_2_.fasta  <-- reconstructed monomer sequences for LOW confidence satellites
├── TR_consensus_rank_3_.fasta  <-- reconstructed sequences of potential LTR elements
└── TR_consensus_rank_4_.fasta  <-- reconstructed consensus for rDNA

#+END_SRC

The list of all clusters shown in the HTML file =index.html= is also
available in tab-delimited format in the file =clusters_info.csv=, which can be
easily viewed and edited in spreadsheet programs. The list of all clusters
and their corresponding reads is in the file =hitsort.cls=, which has the following
format:

  :  >CL1    11
  :  134234r 55494f  85525f  136746r 96742f  91926f  239729r 105445f 222518r 136402r 9013
  :  >CL2    10
  :  76205r  120735r 69527r  12235r  176778f 189307f 131952f 163507f 100038r 178475r 
  :  >CL3    6
  :  99835r  222598f 29715r  102023f 99524r  30116f 
  :  >CL4    6
  :  51723r  69073r  218774r 146425f 136314r 41744f 
  :  >CL5    5
  :  70686f  65565f  234078r 50430r  68247r 

where =CL1 11= is the cluster ID followed by the number of reads in the cluster;
the next line contains the list of all read names belonging to the cluster.
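
The format above can be read with a few lines of Python. This is a minimal sketch, not part of TAREAN: it assumes that fields are separated by whitespace (tabs or spaces) and that each header line starts with =>=; the function name =parse_hitsort= is hypothetical.

#+BEGIN_SRC python
# Illustrative sketch only; assumes whitespace-separated fields.
def parse_hitsort(lines):
    """Map each cluster ID to the list of read names that follow its header."""
    clusters = {}
    cluster_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: ">CL1    11" (cluster ID, then the read count)
            cluster_id = line[1:].split()[0]
            clusters[cluster_id] = []
        elif cluster_id is not None:
            # Read names belonging to the most recent header
            clusters[cluster_id].extend(line.split())
    return clusters
#+END_SRC

Since the function accepts any iterable of lines, an open file handle for =seqclust/clustering/hitsort.cls= can be passed to it directly.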
** Structure of cluster directories

Detailed information for each cluster is stored in subdirectories:

#+BEGIN_SRC folder directories
dir_CL0011
├── blast.csv        <------------tab delimited file, all-to-all comparison of reads within cluster
├── CL11_directed_graph.RData <----directed graph representation of cluster saved as R igraph object
├── CL11.GL     <-----------------undirected graph representation of cluster saved as R igraph object
├── CL11.png         <-----------┐- images with graph visualization
├── CL11_tmb.png     <-----------┘
├── dna_database_annotation.csv <-- annotation of cluster reads based on the DNA database of repeats
├── reads_all.fas   <---------------- all reads included in the cluster in fasta format
├── reads.fas      <---------------- subset of reads used for monomer reconstruction
├── reads_oriented.fas <------------ subset of reads all in the same orientation
└── tarean
    ├── consensus.fasta <----------- fasta file with tandem repeat consensus variants
    ├── ggmin.RData
    ├── img
    │   ├── graph_11mer_1.png  <-----┐  
    │   ├── graph_11mer_2.png  <-----│
    │   ├── graph_15mer_2.png  <-----│
    │   ├── graph_15mer_3.png  <-----│
    │   ├── graph_15mer_4.png  <-----│ images of kmer-based graphs used for reconstruction of
    │   ├── graph_19mer_2.png  <-----│ monomer variants
    │   ├── graph_19mer_4.png  <-----│
    │   ├── graph_19mer_5.png  <-----│
    │   ├── graph_23mer_2.png  <-----│
    │   ├── graph_27mer_3.png  <-----┘
    │   │
    │   ├── logo_11mer_1.png  <-----┐  
    │   ├── logo_11mer_2.png  <-----│
    │   ├── logo_15mer_2.png  <-----│
    │   ├── logo_15mer_3.png  <-----│
    │   ├── logo_15mer_4.png  <-----│ images with DNA logos representing consensus sequences
    │   ├── logo_19mer_2.png  <-----│ of monomer variants
    │   ├── logo_19mer_4.png  <-----│
    │   ├── logo_19mer_5.png  <-----│
    │   ├── logo_23mer_2.png  <-----│
    │   └── logo_27mer_3.png  <-----┘

    ├── ppm_11mer_1.csv  <-----┐
    ├── ppm_11mer_2.csv  <-----│
    ├── ppm_15mer_2.csv  <-----│
    ├── ppm_15mer_3.csv  <-----│
    ├── ppm_15mer_4.csv  <-----│ position probability matrices for individual monomer
    ├── ppm_19mer_2.csv  <-----│ variants derived from k-mer frequencies
    ├── ppm_19mer_4.csv  <-----│
    ├── ppm_19mer_5.csv  <-----│
    ├── ppm_23mer_2.csv  <-----│
    ├── ppm_27mer_3.csv  <-----┘

    ├── reads_oriented.fas_11.kmers  <-----┐
    ├── reads_oriented.fas_15.kmers  <-----│
    ├── reads_oriented.fas_19.kmers  <-----│ k-mer frequencies calculated on oriented reads
    ├── reads_oriented.fas_23.kmers  <-----│ for k-mer lengths 11 - 27
    ├── reads_oriented.fas_27.kmers  <-----┘
    ├── reads_oriented.fasblast_out.cvs  <---------┐results of blastn search against database of tRNA
    ├── reads_oriented.fasblast_out.cvs_L.csv <----│for purposes of LTR detection 
    ├── reads_oriented.fasblast_out.cvs_R.csv <----┘ 
    └── report.html       <--- cluster analysis HTML summary
#+END_SRC
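
Given this layout, the per-cluster consensus files can be collected from an unpacked archive with a short script. The sketch below is an illustration, not part of TAREAN: it assumes the archive has been extracted to a local directory and that clusters without reconstructed consensus sequences may lack a =tarean/consensus.fasta= file; the function name is hypothetical.

#+BEGIN_SRC python
# Illustrative sketch only; paths follow the directory tree shown above.
from pathlib import Path


def find_cluster_consensus_files(archive_root):
    """Map cluster directory names (e.g. 'dir_CL0011') to consensus.fasta paths."""
    clusters_dir = Path(archive_root) / "seqclust" / "clustering" / "clusters"
    found = {}
    for path in sorted(clusters_dir.glob("dir_CL*/tarean/consensus.fasta")):
        cluster_dir_name = path.parent.parent.name
        found[cluster_dir_name] = path
    return found
#+END_SRC

Cluster directories without a matching file are simply skipped, so the result lists only clusters for which consensus variants are available.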