annotate PsiCLASS-1.0.2/README.md @ 0:903fc43d6227 draft default tip

Uploaded
author lsong10
date Fri, 26 Mar 2021 16:52:45 +0000
PsiCLASS
=======

Described in:

Song, L., Sabunciyan, S., Yang, G. and Florea, L. [A multi-sample approach increases the accuracy of transcript assembly](https://www.nature.com/articles/s41467-019-12990-0). *Nat Commun* 10, 5000 (2019)

Copyright (C) 2018- and GNU GPL by Li Song, Liliana Florea

Includes portions copyright from:

samtools - Copyright (C) 2008-, Genome Research Ltd, Heng Li

Commands, scripts and supporting data for the paper can be found [here](https://github.com/splicebox/PsiCLASS_paper/).

### What is PsiCLASS?

PsiCLASS is a reference-based transcriptome assembler for single or multiple RNA-seq samples. Unlike conventional methods that analyze each sample separately and then merge the outcomes to create a unified set of meta-annotations, PsiCLASS takes a multi-sample approach, simultaneously analyzing all RNA-seq data sets in an experiment. PsiCLASS is both a transcript assembler and a meta-assembler, producing separate transcript sets for the individual samples and a unified set of meta-annotations. The algorithmic underpinnings of PsiCLASS include a global subexon splice graph, statistical cross-sample feature (intron, subexon) selection methods, and an efficient dynamic programming algorithm that selects a subset of transcripts from among those encoded in the graph, based on the read support in each sample. Lastly, the set of meta-annotations is selected by voting from among the transcripts generated for individual samples. While PsiCLASS is highly accurate and efficient for medium-to-large collections of RNA-seq data, its accuracy is equally high for small RNA-seq data sets (2-10 samples) and is competitive with reference methods for single samples. Additionally, its performance is robust to the aggregation method used, including the built-in voting and assembly-based approaches such as StringTie-merge and TACO. Therefore, it can be used effectively as a multi-sample and as a single-sample assembler, as well as in conventional assemble-and-merge protocols.

### Install

1. Clone the [GitHub repo](https://github.com/splicebox/psiclass), e.g. with `git clone https://github.com/splicebox/psiclass.git`
2. Run `make` in the repo directory

You will find the executable files in the downloaded directory. To run PsiCLASS without specifying the directory, either add the PsiCLASS directory to the `PATH` environment variable or create a soft link (`ln -s`) to the file `psiclass` in a directory that is already in `PATH`.

PsiCLASS depends on [pthreads](http://en.wikipedia.org/wiki/POSIX_Threads), and samtools depends on [zlib](http://en.wikipedia.org/wiki/Zlib).
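
The two ways of running `psiclass` by name described above can be sketched as follows. This is a minimal illustration that assumes the repository was cloned and built in `$HOME/psiclass`; that location and the `$HOME/bin` link target are hypothetical choices, not requirements.

```shell
# Assumption: the repo was cloned and built in $HOME/psiclass; adjust as needed.
PSICLASS_DIR="$HOME/psiclass"

# Option 1: put the PsiCLASS directory on PATH for the current shell session.
PATH="$PSICLASS_DIR:$PATH"
export PATH

# Option 2 (alternative): soft-link the launcher into a directory already on PATH.
# $HOME/bin is a hypothetical choice of such a directory.
# ln -s "$PSICLASS_DIR/psiclass" "$HOME/bin/psiclass"
```

To make option 1 permanent, the same two `PATH` lines can be placed in the shell's startup file.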

### Usage

    Usage: ./psiclass [OPTIONS]
        Required:
            -b STRING: paths to the alignment BAM files; use commas to separate multiple BAM files
                or
            --lb STRING: path to the file listing the alignment BAM files
        Optional:
            -s STRING: path to the trusted splice file (default: not used)
            -o STRING: prefix of output files (default: ./psiclass)
            -p INT: number of threads (default: 1)
            -c FLOAT: only use the subexons with classifier score <= the given number (default: 0.05)
            --sa FLOAT: the minimum average number of supporting reads for retained introns (default: 0.5)
            --vd FLOAT: the minimum average coverage depth of a transcript to be reported in voting (default: 1.0)
            --maxDpConstraintSize INT: the number of subexons a constraint can cover in DP (default: 7; -1 for inf)
            --primaryParalog: use primary alignment to retain paralog genes (default: use unique alignments)
            --version: print version and exit
            --stage INT: (default: 0)
                0-start from the beginning - building the splice site file for each sample
                1-start from building the subexon file for each sample
                2-start from combining the subexon files across samples
                3-start from assembling the transcripts for each sample
                4-start from voting the consensus transcripts across samples
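
For larger experiments, the `--lb` form is usually more convenient than a long comma-separated `-b` argument. The helper below is a sketch that builds such a list file from a directory of BAM files; it assumes the list file holds one BAM path per line, and the function name `make_bam_list`, the directory `alignments`, and the prefix `./liver` are hypothetical.

```shell
# Write the paths of all *.bam files in a directory to a list file,
# one path per line (assumed layout of the file passed to --lb).
make_bam_list() {
    bam_dir=$1
    list_file=$2
    : > "$list_file"                  # create or truncate the list file
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue     # skip the unexpanded pattern if no BAMs exist
        printf '%s\n' "$bam" >> "$list_file"
    done
}

# Example invocation, using only options from the usage text above:
# make_bam_list alignments bamlist.txt
# ./psiclass --lb bamlist.txt -o ./liver -p 4
```

Because the shell expands the glob in sorted order, the sample indices in the output file names follow the sorted BAM file names.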

### Practical notes

*Alignment compatibility.* PsiCLASS has been tuned to run on alignments generated with the tools [HISAT](https://ccb.jhu.edu/software/hisat/index.shtml) and [STAR](https://github.com/alexdobin/STAR).

When running PsiCLASS with STAR alignments, run STAR with the option `--outSAMstrandField intronMotif`, which will include the XS field indicating the strand in the BAM alignments. Further, when including alignments with *non-canonical splice sites*, use the provided `addXS` executable to add the XS field:

    samtools view -h in.bam | ./addXS reference_genome.fa | samtools view -bS - > out.bam
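
When several BAM files need the XS field, the pipeline above can be wrapped in a small function. This is a sketch only: the function names `xs_name` and `add_xs`, the `.xs.bam` naming scheme, and the `aligned/` directory are assumptions made for illustration.

```shell
# Derive the output name for an XS-tagged copy: sample.bam -> sample.xs.bam
# (hypothetical naming scheme).
xs_name() {
    printf '%s.xs.bam\n' "${1%.bam}"
}

# Run the addXS pipeline shown above on one BAM file.
# Usage: add_xs in.bam reference_genome.fa
add_xs() {
    samtools view -h "$1" | ./addXS "$2" | samtools view -bS - > "$(xs_name "$1")"
}

# Example loop over a hypothetical directory of alignments:
# for bam in aligned/*.bam; do
#     case $bam in *.xs.bam) continue ;; esac   # do not re-process earlier output
#     add_xs "$bam" reference_genome.fa
# done
```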

*Trusted introns from other sources.* By default, PsiCLASS determines a set of trusted introns from the input spliced alignments, to use in building the global subexon graph. Alternatively, the user can supply an external set of trusted introns, for instance extracted from the GENCODE gene annotations or judiciously selected from the input data using a tool like [JULiP](https://github.com/Guangyu-Yang/JULiP). This file must contain three columns:

    chr_name start_site end_site
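
Before passing an external intron file to `-s`, it can help to sanity-check its layout. The check below is a sketch under stated assumptions: columns are whitespace-separated, both sites are non-negative integers, and `start_site` is smaller than `end_site`. The text above only fixes the column order, so treat the other checks as illustrative; the function name `check_trusted_introns` is hypothetical.

```shell
# Report lines that do not look like "chr_name start_site end_site".
# Returns 0 if every line passes, 1 otherwise.
check_trusted_introns() {
    awk '
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d looks malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use with a hypothetical file name:
# check_trusted_introns trusted_introns.txt && \
#     ./psiclass -b s1.bam,s2.bam -s trusted_introns.txt
```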

*Voting optimization.* The default voting parameters have been calibrated and perform near-optimally for a wide variety of data, including data with varying levels of coverage and different library construction protocols. If further optimization is desired, one can determine a better cutoff value by running the voting stage (see [Usage](#usage) above) with different parameter values and assessing the performance against a reference set of gene annotations, such as [GENCODE](https://www.gencodegenes.org). The program `grader`, included in the package, can be used for this purpose. Note that the per-sample sets of transcripts will remain unchanged.
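
One way to organize such a sweep is to generate one `--stage 4` command per candidate `--vd` value. The sketch below only prints the commands so they can be inspected first. It assumes that re-running the voting stage with the same inputs and prefix rewrites `<prefix>_vote.gtf`, so each result is copied aside under a new name; the function name `print_vd_sweep` and the copied file names are hypothetical. Evaluation with `grader` is left out because its arguments are not described here.

```shell
# Print (do not run) one voting-stage command per candidate --vd cutoff.
# Usage: print_vd_sweep BAM_LIST PREFIX CUTOFF [CUTOFF...]
print_vd_sweep() {
    bam_list=$1
    prefix=$2
    shift 2
    for vd in "$@"; do
        # Re-run voting only, then keep a copy of the vote file for this cutoff.
        printf './psiclass --lb %s -o %s --stage 4 --vd %s && cp %s_vote.gtf %s_vote_vd%s.gtf\n' \
            "$bam_list" "$prefix" "$vd" "$prefix" "$prefix" "$vd"
    done
}

# Example: inspect the commands, then pipe them to sh to execute them.
# print_vd_sweep bamlist.txt ./psiclass 0.5 1.0 2.0
# print_vd_sweep bamlist.txt ./psiclass 0.5 1.0 2.0 | sh
```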

*Add gene name.* For many applications, it is desirable to associate the known (annotated) gene name with each transcript. PsiCLASS provides the program `add-genename` for this purpose. `add-genename` takes as input a GTF file containing a reference set of gene annotations and a file listing the raw GTF files, and generates a new GTF file for each raw GTF file by appending the annotated gene names. If a gene is not found in the annotation, `add-genename` uses "novel_INT" as its gene name. The program can be run as:

    ./add-genename annotation.gtf gtflist
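
The `gtflist` argument can be assembled from the PsiCLASS output files. The helper below is a sketch that assumes the list holds one GTF path per line and that the outputs follow the `<prefix>_sample_N.gtf` and `<prefix>_vote.gtf` naming used by PsiCLASS; the function name `make_gtf_list` is hypothetical.

```shell
# Collect the per-sample and voted GTF files for a given output prefix
# into a list file, one path per line (assumed layout of gtflist).
make_gtf_list() {
    prefix=$1
    list_file=$2
    : > "$list_file"                  # create or truncate the list file
    for gtf in "$prefix"_sample_*.gtf "$prefix"_vote.gtf; do
        [ -e "$gtf" ] || continue     # skip patterns that matched nothing
        printf '%s\n' "$gtf" >> "$list_file"
    done
}

# Example with the default output prefix:
# make_gtf_list ./psiclass gtflist
# ./add-genename annotation.gtf gtflist
```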

### Input/Output

The primary input to PsiCLASS is a set of BAM alignment files, one for each RNA-seq sample in the analysis. The program calculates a set of subexon files and a set of splice (intron) files, for the individual samples. (Optionally, one may specify a path to an external file of trusted introns as explained [above](#practical-notes).) The output consists of one GTF file of transcripts for each sample, and the GTF file of meta-annotations produced by voting, stored in the output directory:

    Sample-wise GTF files: (psiclass)_sample_{0,1,...,n-1}.gtf
    Meta-assembly GTF file: (psiclass)_vote.gtf

where indices 0,1,...,n-1 match the order of the input BAM files.

Subexon and splice (intron) files, and other auxiliary files, are in the subdirectories:

    Intron files: splice/*
    Subexon graph files: subexon/*
    Log file: (psiclass)_classes.log
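
Since the sample-wise file names carry only an index, it can be useful to print which GTF file belongs to which BAM file. The sketch below applies the index rule stated above to a BAM list file; it assumes the list has one path per line with no blank lines and that line order defines the sample order. The function name `map_outputs` is hypothetical.

```shell
# Print "BAM path <TAB> expected sample GTF" for each line of a BAM list.
# Usage: map_outputs BAM_LIST PREFIX
map_outputs() {
    bam_list=$1
    prefix=$2
    # NR is 1-based, while the sample indices start at 0.
    awk -v prefix="$prefix" '{ printf "%s\t%s_sample_%d.gtf\n", $0, prefix, NR - 1 }' "$bam_list"
}

# Example with the list file shipped in the example directory:
# map_outputs example/slist psiclass
```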

### Example

The directory `./example` in this distribution contains two BAM files, along with an example of a BAM list file. Run PsiCLASS with:

    ./psiclass -b example/s1.bam,example/s2.bam

or

    ./psiclass --lb example/slist

The run will generate the files `psiclass_sample_0.gtf` for `s1.bam`, `psiclass_sample_1.gtf` for `s2.bam`, and the file `psiclass_vote.gtf` containing the meta-assemblies.
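
A quick way to confirm that the run produced sensible output is to count the transcripts in each GTF file. The sketch below assumes the files follow the usual GTF layout, with tab-separated fields and `transcript` in the third column for transcript records; the function name `count_transcripts` is hypothetical.

```shell
# Count records whose feature type (column 3) is "transcript",
# ignoring comment lines that start with "#".
count_transcripts() {
    awk -F '\t' '$1 !~ /^#/ && $3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

# Example over the three files generated by the run above:
# for gtf in psiclass_sample_0.gtf psiclass_sample_1.gtf psiclass_vote.gtf; do
#     printf '%s\t%s\n' "$gtf" "$(count_transcripts "$gtf")"
# done
```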

### Terms of use

This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received (LICENSE.txt) a copy of the GNU General
Public License along with this program; if not, you can obtain one from
http://www.gnu.org/licenses/gpl.txt or by writing to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

### Support

Create a [GitHub issue](https://github.com/splicebox/PsiCLASS/issues).