comparison NEUMA-1.2.1/README @ 0:c44c43d185ef draft default tip

NEUMA-1.2.1 Uploaded
author chawhwa
date Thu, 08 Aug 2013 00:46:13 -0400
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:c44c43d185ef
1 ###########################
2 # README for NEUMA v1.2.0 #
3 ###########################
4
5 ## Additional information and files can be obtained from the NEUMA website, http://neuma.kobic.re.kr.
6 ## Inquiries can be written to duplexa@gmail.com.
7
8 # This version has removed the extra tab in the iNIR file. (version 1.2.1)
9 # This version has fixed the problem of missing some reads at the 5'end in paired-end data. (Version 1.2.1)
10
11 # This version has the following changes.(version 1.2.0)
12 ## The intermediate cmbt and MA files are not produced and mapping stat, insertlendis and read counts are directly computed from bowtie output files. This drastically improves on memory and speed and uses less hard disk space.
13 ## A subdirectory named 'readcount' will be generated in which gNIR, iNIR and gReadcount files will be placed.
14 ## A new option of generating all read counts (not only gene-wise or isoform-wise informative reads) for each gene (--gReadcount option). It is possible to get only these numbers and not compute NIR (--noNIR), EUMA, FVKM and LVKM (--noNEUMA).
15 ## auto_NEUMA_PE.pl can be used either for initial run to determine maximum insert length (--only_init) for after-initial run (--skip_init) ,or both (default).
16 ## Mismatches are allowed (--mm), but the alignments are filtered for all best-matching alignments.
17
18
19 # This version can handle the newer fastq format for paired-end case. (version 1.1.5)
20 # This version can handle the newer fastq format with a space in the sequencd ID. (version 1.1.4)
21 # This version includes two distinct strand-sepcificity options (S for forward and R for reverse strand). (version 1.1.3)
22 # This version includes a script that generates merged LVKM file and NIR files that can be used for diffNEUMA, to identify differentially expression genes/isoforms. (version 1.1.2)
23 # This version runs dos2unix for gene2NM and gene2symbol files in the beginning. (version 1.1.1)
24 # This version handles strand-specific data. (version 1.1.1)
25 # This version allows multi-thread option for bowtie. (version 1.1.0)
26 # This version uses a different way to take argument. (usage has changed). (version 1.1.0)
27 # This version fixed a bug in reading mapping stat file in the Ensembl mode. (version 1.0.5)
28 # This version handles Ensembl data. (version 1.0.4)
29 # This version handles SOLiD colorspace data. (version 1.0.3)
30
31
32 **** Table of contents ****
33
34 1. Installation
35 2. Bowtie
36 3. Ingredients
37 4. How to run
38 5. Output files
39 6. Preliminary run
40 7. Mergin output files
41
42 ***************************
43
44
45
46
47
48 1. Installation
49
50 No Installation is required. Simply, create a directory (eg. neumadir) and extract the .tar.gz file in the directory. Then, make all the perl scripts executable by the following command:
51
52 chmod a+x *.pl
53
54
55 2. Bowtie
56
57 Please download and install bowtie from the bowtie website http://bowtie-bio.sourceforge.net/index.shtml, following the developers' instruction. The reference index file must be created, using the same fasta file used for generating the gU and iU tables (gEUMA and iEUMA tables). The fasta file can be obtained from the NEUMA website.
58
59
60
61 3. Ingredients
62
63 Before you run NEUMA, you need the following files.
64
65 * a fasta file containing raw sequences (single-end)
66 or a pair of fasta-files containing raw sequences, each mate pair having the same unique ID (paired-end).
67 * bowtie reference index file
68 * gEUMA and iEUMA tables (single-end)
69 or gU table and iU table (paired-end)
70 * gene2NM file
71 * gene2symbol file
72
73
74 Note that the gene2NM file must be matched to the initial reference fasta file that gU and iU tables (gEUMA and iEUMA tables) were created from. The gU and iU tables (gEUMA and iEUMA tables) along with gene2NM and gene2symbol files can be obtained from the NEUMA website.
75
76
77
78 4. How to run
79
80 Two scripts, auto_NEUMA_PE.pl (paired-end) and auto_NEUMA_SE.pl (single-end) are what you need and these scripts run the other scripts automatically.
81
82
83 ## paired-end case
84 usage: ./auto_NEUMA_PE.pl [options] -L=<read_length> -D=<maxdist> -1=<input_file1(mate1)> -2=<input_file2(mate2)> -U=<Utable_prefix(fullpath, before .gU.table or .iU.table)> --g2m=<gene2NM_file> --g2s=<gene2symbol_file> -b=<bowtie_dir(eg.bin/bowtie-0.12.5)> --bi=<bowtieindex> -o=<outputdir> -s=<sample_name>
85
86 The order of Arguments and options can be arbitrary.
87
88 ** required arguments **
89 * -L=<read_length> : read_length(eg.36) : sequenced length of a read mate (/not/ the insert length or sum of the two mate lengths) (no default : L must be specified)
90 * -D=<maxdist> : maxdist(eg.400) : maximum outer distance between mates (insert size). This must be identical to the maxdist used for generating gU and iU tables. (no default : D must be specified)
91 * -1=<input_file1(mate1)> : fasta or fastq file of a series of read mate 1
92 * -2=<input_file2(mate2)> : fasta or fastq file of a series of read mate 2. The ID of each mate 2 must be matched to that of mate 1 in case of fasta file.
93 * -U=<Utable_prefix(fullpath, before .gU.table or .iU.table)> : the path to the gU.table and iU.table files, except their extension. If you placed your gU.table and iU.table files in /home/paired-end/Utable/ and the file names are hg19.refMrna.L36.D250.gU.table and hg19.refMrna.L36.D250.iU.table, then the value for this argument is '/home/paired-end/Utable/hg19.refMrna.L36.D250'. The files must be matched to the read length, maxdist and reference transcriptome sequence.
94 * --g2m=<gene2NM_file> : gene2NM file, matched to the reference transcriptome model used for generating the gU and iU tables.
95 * --g2s=<gene2symbol_file> : gene2symbol file, containing at least all of the genes in the reference transcriptome sequence.
96 * -b=<bowtie_dir(eg.bin/bowtie-0.12.7)> : directory in which bowtie executable is installed. Either full path or relative path must be used and '~' must be avoided.
97 * --bi=<bowtieindex> : the index file prefix of the reference transcriptome sequence, created by bowtie (bowtie-build). It is the same string put as the reference index argument for the bowtie program. (see bowtie manual)
98 * -o=<outputdir> : a directory that will contain all the output files. If the directory does not exist, the program will create it automatically.
99 * -s=<sample_name> : name of the sample that will be used as the prefix of all the output files.
100
101 ** options **
102 * -f=<file_type> : fasta(f)_or_fastq(q) : f if input files are in fasta format; q if in fastq format. (default : q)
103 * -c=<coding_option> : nucleotide(n)_or_colorspace(c) : n if input files are in DNA sequence (A,C,G,T); c if in colorspace (SOLiD platform). Note that if colorspace is used, the colorspace version of bowtie index file must be used. (default : n)
104 * -t=<euma_cut> : EUMAcut(eg.50) : The cut off of EUMA that determines measurable genes and transcripts. (default : 50)
105 * -d=<data_type> : Refseq data(R) or Ensemble data(E). The R option uses the 'NM' and 'NR' prefices in RefSeq data to discriminate between mRNA and ncRNA. If your reference doesn't have these prefices, use the E option. (default : R)
106 * -p=<num_cpu> : number of cpu's to use for Bowtie. (default : 1)
107 * --str=<strand_specificity> : Strand-specific(S) vs Non-Strand-specific(N). For strand-specific data, strand-specific U tables must be used. (default : N)
108 * --mm=<number_of_mismatches_allowed> : setting number of mismatches allowed (default : 0)
109 * --gReadcount : Compute the number of all reads (not just gene-wise and isoform-wise informative reads) for each gene.
110 * --noNIR : do not compute NIR. (This option must be used together with --noNEUMA)
111 * --noNEUMA : do not compute EUMA, FVKM and LVKM (but compute NIR unless used with --noNIR).
112 * --only_init : run only the initial part of bowtie-mapping and insert-lendis calculation, preferably with a relatively large -D value (which is way over expected maximum, eg. 1000). A more detailed description can be found below in section 6.
113 * --skip_init : given the maximum insert length to be used is decided, skip the initial bowtie mapping and insert-lendis part and use the output files from the previous run. (No need to run bowtie again with the new -D value because length filtering will be done at the downstream steps.
114
115
116 ## single-end case
117 usage: ./auto_NEUMA_SE.pl [options] -L=<read_length> -i=<input_file> -U=<EUMA_prefix(fullpath, before .gEUMA or .iEUMA)> --g2m=<gene2NM_file> --g2s=<gene2symbol_file> -b=<bowtie_dir(eg.bin/bowtie-0.12.5)> --bi=<bowtieindex> -o=<outputdir> -s=<sample_name>
118
119 The order of Arguments and options can be arbitrary.
120
121 ** required arguments **
122 * -L=<read_length> : read_length(eg.36) : sequenced length of a read mate (/not/ the insert length or sum of the two mate lengths) (no default : L must be specified)
123 * -i=<input_file> : fasta file of a series of sequenced reads
124 * -U=<EUMA_prefix(fullpath, before .gEUMA or .iEUMA)> : the path to the gEUMA and tEUMA files, except their extension. If you placed your gEUMA and iEUMA files in /home/single-end/EUMA/ and the file names are hg19.refMrna.L36.single.gEUMA and hg19.refMrna.L36.single.iEUMA, then the value for this argument is '/home/single-end/EUMA/hg19.refMrna.L36.single'.
125 * --g2m=<gene2NM_file> : same as paired-end case
126 * --g2s=<gene2symbol_file> : same as paired-end case
127 * -b=<bowtiedir(eg.bin/bowtie-0.12.5)> : same as paired-end case
128 * --bi=<bowtieindex> : same as paired-end case
129 * -o=<outputdir> : same as paired-end case
130 * -s=<samplename> : same as paired-end case
131
132 ** options **
133 * -f=<file_type> : fasta(f)_or_fastq(q) : f if input files are in fasta format; q if in fastq format. (default : q)
134 * -c=<coding_option> : nucleotide(n)_or_colorspace(c) : n if input files are in DNA sequence (A,C,G,T); c if in colorspace (SOLiD platform). Note that if colorspace is used, the colorspace version of bowtie index file must be used. (default : n)
135 * -t=<euma_cut> : EUMAcut(eg.50) : The cut off of EUMA that determines measurable genes and transcripts. (default : 50)
136 * -d=<data_type> : Refseq data(R) or Ensemble data(E). The R option uses the 'NM' and 'NR' prefices in RefSeq data to discriminate between mRNA and ncRNA. If your reference doesn't have these prefices, use the E option. (default : R)
137 * -p=<num_cpu> : number of cpu's to use for Bowtie. (default : 1)
138 * --str=<strand_specificity> : Strand-specific, forward(S), strand-specific, reverse (R), and Non-Strand-specific(N). For strand-specific data, strand-specific EUMA tables must be used. (default : N)
139 * --mm=<number_of_mismatches_allowed> : setting number of mismatches allowed (default : 0)
140 * --gReadcount : Compute the number of all reads (not just gene-wise and isoform-wise informative reads) for each gene.
141 * --noNIR : do not compute NIR. (This option must be used together with --noNEUMA)
142 * --noNEUMA : do not compute EUMA, FVKM and LVKM (but compute NIR unless used with --noNIR).
143
144
145 Example usages are as follows:
146
147 # paired-end
148 neumadir/auto_NEUMA_PE.pl --only_init -f=f -L=36 -D=1000 --mm2 -1=MKN28.1.fa -2=MKN28.2.fa -U=U.tables/hg19.refMrna.L36.D250 --g2m=gene2NM.human.fastafiltered --g2s=gene2symbol.human -b=bin/bowtie-0.12.7 --bi=ebwt/hg19.RefmRNA -o=MKN28.hg19 -s=MKN28
149 # This command will run initial bowtie mapping, mapping stat and insert length distribution.
150 neumadir/auto_NEUMA_PE.pl --skip_init --gReadcount --noNEUMA --mm2 -f=f -L=36 -D=250 -1=MKN28.1.fa -2=MKN28.2.fa -U=U.tables/hg19.refMrna.L36.D250 --g2m=gene2NM.human.fastafiltered --g2s=gene2symbol.human -b=bin/bowtie-0.12.7 --bi=ebwt/hg19.RefmRNA -o=MKN28.hg19 -s=MKN28
151 # This command will skip bowtie-mapping and insert length distribution and use the files from the previous (eg. above) run, and reports counts of all reads (gReadcount), gNIR and iNIR, and will not compute EUMA, FVKM, LVKM.
152 # Only reads with insert length up to 250 will be used, although the bowtie output from the previous run contains larger insert lengths.
153
154
155 # single-end
156 auto_NEUMA_SE.pl --noNEUMA --mm2 -L=36 -t=30 -p=2 -i=MKN28.txt -U=hg19.refMrna.L36.single --g2m=gene2NM.human.fastafiltered --g2s=gene2symbol.human -b=../bin/bowtie-0.12.5 --bi=ebwt/hg19.RefmRNA -o=MKN28.hg19 -s=MKN28
157 # This command will only do bowtie-mapping and reports mapping stat and gNIR and iNIR.
158
159
160
161 5. OUTPUT Files
162
163 ## paired-end
164
165 For Refseq, the final log2(x+1) transformed values of FVKM (LVKM) can be found in outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.-EUMAcut.-NR.gLVKM (gene-wise) and outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.-EUMAcut.-NR.iLVKM (isoform-wise).
166
167 Files containing '-NR' excludes genes with no mRNA (eg. rRNA, tRNA, snRNA, snoRNA and other small RNA genes). If your sample is poly-A-selected, these RNAs cannot be quantified along with mRNAs. If your sample did not use any enrichment step and mRNAs and ncRNAs are represented in their exact proportion as in the cell, you can use the LVKM files without the '-NR' tag. The ncRNAs were removed at the last step because reads mapping to both an mRNA and an ncRNA must be considered a multi-read. For Ensembl, -NR version are not provided (ncRNAs are not removed).
168
169
170
171 The unadjusted and adjusted FVKM values (as 'FVK' and 'FVKM' for before and after sample-normalization, respectively) can be found in the LVKM files as well, along with gEUMA and iEUMA values and the number of isoforms and the number of measurable isoforms.
172
173 The mapping stat can be found in outputdir/mapping_stat.samplename.
174
175 The NIR and/or readcount values can be found in outputdir/readcount/samplename.ebwtname.maxinsMAX_DIST.mm0.gNIR, outputdir/readcount/samplename.ebwtname.maxinsMAX_DIST.mm0.iNIR and/or outputdir/readcount/samplename.ebwtname.maxinsMAX_DIST.mm0.gReadcount
176
177 The EUMA values can be found in outputdir/EUMA/samplename.ebwtname.maxinsMAX_DIST.mm0.gEUMA and outputdir/EUMA/samplename.ebwtname.maxinsMAX_DIST.mm0.iEUMA.
178
179 The insert length distribution can be found in outputdir/insertlendis/samplename.ebwtname.maxinsMAX_DIST.mm0.i.insertlendis.
180
181 If --mm option was used, replace 'mm0' in the above file names with 'mm1' or 'mm2'.
182
183
184 ## single-end
185
186 The formats for LVKM and NIR files(gMA & iMA) are the same as in the paired-end case. The LVKM file names are outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.single.-EUMAcut.-NR.gLVKM (gene-wise) and outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.single.-EUMAcut.-NR.iLVKM (isoform-wise). The NIR values can be found in outputdir/MA/samplename.ebwtname.maxinsMAX_DIST.mm0.single.all.gMA (the #uniq.common column) and outputdir/MA/samplename.ebwtname.maxins400.mm0.single.all.iMA (the #uniq column).
187
188 The mapping stat can be found in outputdir/mappint_stat.samplename.single.
189
190
191
192
193 6. Preliminary run
194
195 Given a new data set, users can run the first part of auto_NEUMA_PE.pl, without the U.tables, to find out the insert length distribution and mapping stats. Based on the length distribution, the user can determine a safe MAXDIST and request us to build U.tables (through the NEUMA website). After having the U.tables ready, the user can run the latter part of auto_NEUMA_PE.pl. Note that -U, --g2m and --g2s are not required for this run.
196
197 * Usages:
198
199 auto_NEUMA_PE.pl --only_init [options] -L=<read_length> -D=<maxdist> -1=<input_file1(mate1)> -2=<input_file2(mate2)> -b=<bowtie_dir(eg.bin/bowtie-0.12.5)> --bi=<bowtieindex> -o=<outputdir> -s=<sample_name>
200
201 ** required arguments **
202 * -L=<read_length> : read_length(eg.36) : sequenced length of a read mate (/not/ the insert length or sum of the two mate lengths) (no default : L must be specified)
203 * -D=<maxdist> : maxdist(eg.400) : maximum outer distance between mates (insert size). This must be identical to the maxdist used for generating gU and iU tables. (no default : D must be specified)
204 * -1=<input_file1(mate1)> : fasta or fastq file of a series of read mate 1
205 * -2=<input_file2(mate2)> : fasta or fastq file of a series of read mate 2. The ID of each mate 2 must be matched to that of mate 1 in case of fasta file.
206 * -b=<bowtie_dir(eg.bin/bowtie-0.12.7)> : directory in which bowtie executable is installed. Either full path or relative path must be used and '~' must be avoided.
207 * --bi=<bowtieindex> : the index file prefix of the reference transcriptome sequence, created by bowtie (bowtie-build). It is the same string put as the reference index argument for the bowtie program. (see bowtie manual)
208 * -o=<outputdir> : a directory that will contain all the output files. If the directory does not exist, the program will create it automatically.
209 * -s=<sample_name> : name of the sample that will be used as the prefix of all the output files.
210
211 ** options **
212 * -f=<file_type> : fasta(f)_or_fastq(q) : f if input files are in fasta format; q if in fastq format. (default : q)
213 * -c=<coding_option> : nucleotide(n)_or_colorspace(c) : n if input files are in DNA sequence (A,C,G,T); c if in colorspace (SOLiD platform). Note that if colorspace is used, the colorspace version of bowtie index file must be used. (default : n)
214 * -p=<num_cpu> : number of cpu's to use for Bowtie. (default : 1)
215 * --str=<strand_specificity> : Strand-specific(S) vs Non-Strand-specific(N). (default : N)
216 * --mm=<number_of_mismatches_allowed> : setting number of mismatches allowed (default : 0)
217
218 * The maxdist for this run can be set to something very large, eg. 1000. Then, for the real runs using the --skip_init option, the maxdist must be set identical to the one used to build the U.tables.
219
220 * For single-end reads, insert length distribution does not have to be pre-determined. Simply tell us the read length to request the EUMA tables for single-end data.
221
222 ** output **
223 The quickest way to check the output insert length distribution is to look at the insertlendis directory under outputdir.
224
225
226
227
228
229 7. Merging output files
230
231
232 1) merge_LVKM.pl
233
234 The script produces a text file that contains gLVKM or iLVKM values for all samples for all genes/isoforms.
235
236 usage: ./merge_LVKM.pl genewise/isoformwise[g/i] EUMAcut LVKM_out_dir NR(1/0) > output.gLVKM(iLVKM).merged
237
238 Example usage:
239 ./merge_LVKM.pl g 50 data/LVKM 0 > data/LVKM/all.gLVKM.merged
240 ./merge_LVKM.pl i 50 data/LVKM 0 > data/LVKM/all.iLVKM.merged
241
242 * genewise/isoformwise[g/i] : put g for gLVKM and i for iLVKM.
243 * EUMAcut : put the same EUMA cut off used to generate the LVKM files. This will be used to recognize the files.
244 * LVKM_out : This is usually your_basedir/LVKM, which contains all the LVKM files generated. All the LVKM files in this directory that matches the EUMAcut specified will be included in the merged table.
245 * NR(1/0) : 1 means the script must use the LVKM files generated after the noncoding RNAs starting with the NR prefix are removed. 0 is otherwise.
246
247
248 2) merge_LVKM_readcount.pl
249
250
251 This script produces a gNIR / iNIR file that contains the read counts for all samples for all genes/isoforms. The gNIR / iNIR files can be fed to diffNEUMA (http://neuma.kobic.re.kr), for identification of differentially expressed genes/isoforms. This script is identical to merge_NEUMA_readcount.pl in previous versions.
252
253 usage: ./merge_LVKM_readcount.pl genewise/isoformwise[g/i] EUMAcut LVKM_out_dir NR(1/0) > output.gLVKM(iLVKM).merged
254
255 Example usage:
256 ./merge_LVKM_readcount.pl g 10 data/LVKM 1 > data/LVKM/all.gNIR
257 ./merge_LVKM_readcount.pl i 10 data/LVKM 1 > data/LVKM/all.iNIR
258
259 * genewise/isoformwise[g/i] : put g for gLVKM and i for iLVKM.
260 * EUMAcut : put the same EUMA cut off used to generate the LVKM files. This will be used to recognize the files.
261 * LVKM_out : This is usually your_basedir/LVKM, which contains all the LVKM files generated. All the LVKM files in this directory that matches the EUMAcut specified will be included in the NIR file.
262 * NR(1/0) : 1 means the script must use the LVKM files generated after the noncoding RNAs starting with the NR prefix are removed. 0 is otherwise.
263
264
265
266 3) merge_readcount.pl
267
268
269 This script produces a gNIR.merged / iNIR.merged / gReadcount.merged file that contains the read counts for all samples. The gNIR / iNIR /gReadcount files can be fed to diffNEUMA (http://neuma.kobic.re.kr), for identification of differentially expressed genes/isoforms. The result is not filtered for EUMA cut or NR as in merge_LVKM_readcount.pl.
270
271 usage: ./merge_readcount.pl type[gNIR/iNIR/gReadcount] readcount_dir > output.gNIR(iNIR/gReadcount).merged
272
273 Example usage:
274 ./merge_readcount.pl gNIR data/readcount > data/readcount/all.gNIR.merged
275 ./merge_readcount.pl iNIR data/readcount > data/readcount/all.iNIR.merged
276
277 * type (gNIR/iNIR/gReadcount) : the type of the read counts to be merged
278 * readcount_dir : This is usually your_basedir/readcount, which contains all the readcount files generated.
279
280
281 //
282
283
284