Mercurial > repos > chawhwa > neuma

###########################
# README for NEUMA v1.2.0 #
###########################

## Additional information and files can be obtained from the NEUMA website, http://neuma.kobic.re.kr.
## Inquiries can be written to duplexa@gmail.com.

# This version has removed the extra tab in the iNIR file. (version 1.2.1)
# This version has fixed the problem of missing some reads at the 5'end in paired-end data. (Version 1.2.1)

# This version has the following changes.(version 1.2.0)
## The intermediate cmbt and MA files are not produced and mapping stat, insertlendis and read counts are directly computed from bowtie output files. This drastically improves on memory and speed and uses less hard disk space.
## A subdirectory named 'readcount' will be generated in which gNIR, iNIR and gReadcount files will be placed.
## A new option of generating all read counts (not only gene-wise or isoform-wise informative reads) for each gene (--gReadcount option). It is possible to get only these numbers and not compute NIR (--noNIR), EUMA, FVKM and LVKM (--noNEUMA).
## auto_NEUMA_PE.pl can be used either for initial run to determine maximum insert length (--only_init) for after-initial run (--skip_init) ,or both (default).
## Mismatches are allowed (--mm), but the alignments are filtered for all best-matching alignments.


# This version can handle the newer fastq format for paired-end case. (version 1.1.5)
# This version can handle the newer fastq format with a space in the sequencd ID. (version 1.1.4)
# This version includes two distinct strand-sepcificity options (S for forward and R for reverse strand). (version 1.1.3)
# This version includes a script that generates merged LVKM file and NIR files that can be used for diffNEUMA, to identify differentially expression genes/isoforms. (version 1.1.2)
# This version runs dos2unix for gene2NM and gene2symbol files in the beginning. (version 1.1.1)
# This version handles strand-specific data. (version 1.1.1)
# This version allows multi-thread option for bowtie. (version 1.1.0)
# This version uses a different way to take argument. (usage has changed). (version 1.1.0)
# This version fixed a bug in reading mapping stat file in the Ensembl mode. (version 1.0.5)
# This version handles Ensembl data. (version 1.0.4)
# This version handles SOLiD colorspace data. (version 1.0.3)


**** Table of contents ****

  1. Installation
  2. Bowtie
  3. Ingredients
  4. How to run
  5. Output files
  6. Preliminary run
  7. Mergin output files

***************************


1. Installation

 No Installation is required. Simply, create a directory (eg. neumadir) and extract the .tar.gz file in the directory. Then, make all the perl scripts executable by the following command:

chmod a+x *.pl


2. Bowtie

 Please download and install bowtie from the bowtie website http://bowtie-bio.sourceforge.net/index.shtml, following the developers' instruction. The reference index file must be created, using the same fasta file used for generating the gU and iU tables (gEUMA and iEUMA tables). The fasta file can be obtained from the NEUMA website.


3. Ingredients

 Before you run NEUMA, you need the following files.

    * a fasta file containing raw sequences (single-end)
       or a pair of fasta-files containing raw sequences, each mate pair having the same unique ID (paired-end).
    * bowtie reference index file
    * gEUMA and iEUMA tables (single-end)
       or gU table and iU table (paired-end)
    * gene2NM file
    * gene2symbol file


 Note that the gene2NM file must be matched to the initial reference fasta file that gU and iU tables (gEUMA and iEUMA tables) were created from. The gU and iU tables (gEUMA and iEUMA tables) along with gene2NM and gene2symbol files can be obtained from the NEUMA website.


4. How to run

Two scripts, auto_NEUMA_PE.pl (paired-end) and auto_NEUMA_SE.pl (single-end) are what you need and these scripts run the other scripts automatically.


  ## paired-end case
  usage: ./auto_NEUMA_PE.pl [options] -L=<read_length> -D=<maxdist> -1=<input_file1(mate1)> -2=<input_file2(mate2)> -U=<Utable_prefix(fullpath, before .gU.table or .iU.table)> --g2m=<gene2NM_file> --g2s=<gene2symbol_file> -b=<bowtie_dir(eg.bin/bowtie-0.12.5)> --bi=<bowtieindex> -o=<outputdir> -s=<sample_name>

    The order of Arguments and options can be arbitrary.

    ** required arguments **
    * -L=<read_length> : read_length(eg.36) : sequenced length of a read mate (/not/ the insert length or sum of the two mate lengths) (no default : L must be specified)
    * -D=<maxdist> : maxdist(eg.400) : maximum outer distance between mates (insert size). This must be identical to the maxdist used for generating gU and iU tables. (no default : D must be specified)
    * -1=<input_file1(mate1)>  : fasta or fastq file of a series of read mate 1
    * -2=<input_file2(mate2)>  : fasta or fastq file of a series of read mate 2. The ID of each mate 2 must be matched to that of mate 1 in case of fasta file.
    * -U=<Utable_prefix(fullpath, before .gU.table or .iU.table)> : the path to the gU.table and iU.table files, except their extension. If you placed your gU.table and iU.table files in /home/paired-end/Utable/ and the file names are hg19.refMrna.L36.D250.gU.table and hg19.refMrna.L36.D250.iU.table, then the value for this argument is '/home/paired-end/Utable/hg19.refMrna.L36.D250'. The files must be matched to the read length, maxdist and reference transcriptome sequence.
    * --g2m=<gene2NM_file> : gene2NM file, matched to the reference transcriptome model used for generating the gU and iU tables.
    * --g2s=<gene2symbol_file> : gene2symbol file, containing at least all of the genes in the reference transcriptome sequence.
    * -b=<bowtie_dir(eg.bin/bowtie-0.12.7)> : directory in which bowtie executable is installed. Either full path or relative path must be used and '~' must be avoided.
    * --bi=<bowtieindex> : the index file prefix of the reference transcriptome sequence, created by bowtie (bowtie-build). It is the same string put as the reference index argument for the bowtie program. (see bowtie manual)
    * -o=<outputdir> : a directory that will contain all the output files. If the directory does not exist, the program will create it automatically.
    * -s=<sample_name> : name of the sample that will be used as the prefix of all the output files.

    ** options **
    * -f=<file_type> : fasta(f)_or_fastq(q) : f if input files are in fasta format; q if in fastq format. (default : q)
    * -c=<coding_option> : nucleotide(n)_or_colorspace(c) : n if input files are in DNA sequence (A,C,G,T); c if in colorspace (SOLiD platform). Note that if colorspace is used, the colorspace version of bowtie index file must be used. (default : n)
    * -t=<euma_cut> : EUMAcut(eg.50) : The cut off of EUMA that determines measurable genes and transcripts. (default : 50)
    * -d=<data_type> : Refseq data(R) or Ensemble data(E). The R option uses the 'NM' and 'NR' prefices in RefSeq data to discriminate between mRNA and ncRNA. If your reference doesn't have these prefices, use the E option. (default : R)
    * -p=<num_cpu> : number of cpu's to use for Bowtie. (default : 1)
    * --str=<strand_specificity> : Strand-specific(S) vs Non-Strand-specific(N). For strand-specific data, strand-specific U tables must be used. (default : N)
    * --mm=<number_of_mismatches_allowed> : setting number of mismatches allowed (default : 0)
    * --gReadcount : Compute the number of all reads (not just gene-wise and isoform-wise informative reads) for each gene.
    * --noNIR : do not compute NIR. (This option must be used together with --noNEUMA)
    * --noNEUMA : do not compute EUMA, FVKM and LVKM (but compute NIR unless used with --noNIR).
    * --only_init : run only the initial part of bowtie-mapping and insert-lendis calculation, preferably with a relatively large -D value (which is way over expected maximum, eg. 1000). A more detailed description can be found below in section 6.
    * --skip_init : given the maximum insert length to be used is decided, skip the initial bowtie mapping and insert-lendis part and use the output files from the previous run. (No need to run bowtie again with the new -D value because length filtering will be done at the downstream steps.


  ## single-end case
  usage: ./auto_NEUMA_SE.pl [options] -L=<read_length> -i=<input_file> -U=<EUMA_prefix(fullpath, before .gEUMA or .iEUMA)> --g2m=<gene2NM_file> --g2s=<gene2symbol_file> -b=<bowtie_dir(eg.bin/bowtie-0.12.5)> --bi=<bowtieindex> -o=<outputdir> -s=<sample_name>

    The order of Arguments and options can be arbitrary.

    ** required arguments **
    * -L=<read_length> : read_length(eg.36) : sequenced length of a read mate (/not/ the insert length or sum of the two mate lengths) (no default : L must be specified)
    * -i=<input_file> : fasta file of a series of sequenced reads
    * -U=<EUMA_prefix(fullpath, before .gEUMA or .iEUMA)> : the path to the gEUMA and tEUMA files, except their extension. If you placed your gEUMA and iEUMA files in /home/single-end/EUMA/ and the file names are hg19.refMrna.L36.single.gEUMA and hg19.refMrna.L36.single.iEUMA, then the value for this argument is '/home/single-end/EUMA/hg19.refMrna.L36.single'.
    * --g2m=<gene2NM_file> : same as paired-end case
    * --g2s=<gene2symbol_file> : same as paired-end case
    * -b=<bowtiedir(eg.bin/bowtie-0.12.5)> : same as paired-end case
    * --bi=<bowtieindex> : same as paired-end case
    * -o=<outputdir> : same as paired-end case
    * -s=<samplename> : same as paired-end case

    ** options **
    * -f=<file_type> : fasta(f)_or_fastq(q) : f if input files are in fasta format; q if in fastq format. (default : q)
    * -c=<coding_option> : nucleotide(n)_or_colorspace(c) : n if input files are in DNA sequence (A,C,G,T); c if in colorspace (SOLiD platform). Note that if colorspace is used, the colorspace version of bowtie index file must be used. (default : n)
    * -t=<euma_cut> : EUMAcut(eg.50) : The cut off of EUMA that determines measurable genes and transcripts. (default : 50)
    * -d=<data_type> : Refseq data(R) or Ensemble data(E). The R option uses the 'NM' and 'NR' prefices in RefSeq data to discriminate between mRNA and ncRNA. If your reference doesn't have these prefices, use the E option. (default : R)
    * -p=<num_cpu> : number of cpu's to use for Bowtie. (default : 1)
    * --str=<strand_specificity> : Strand-specific, forward(S), strand-specific, reverse (R), and Non-Strand-specific(N). For strand-specific data, strand-specific EUMA tables must be used. (default : N)
    * --mm=<number_of_mismatches_allowed> : setting number of mismatches allowed (default : 0)
    * --gReadcount : Compute the number of all reads (not just gene-wise and isoform-wise informative reads) for each gene.
    * --noNIR : do not compute NIR. (This option must be used together with --noNEUMA)
    * --noNEUMA : do not compute EUMA, FVKM and LVKM (but compute NIR unless used with --noNIR).


Example usages are as follows:

# paired-end
neumadir/auto_NEUMA_PE.pl --only_init -f=f -L=36 -D=1000 --mm2 -1=MKN28.1.fa -2=MKN28.2.fa -U=U.tables/hg19.refMrna.L36.D250 --g2m=gene2NM.human.fastafiltered --g2s=gene2symbol.human -b=bin/bowtie-0.12.7 --bi=ebwt/hg19.RefmRNA -o=MKN28.hg19 -s=MKN28
# This command will run initial bowtie mapping, mapping stat and insert length distribution.
neumadir/auto_NEUMA_PE.pl --skip_init --gReadcount --noNEUMA --mm2 -f=f -L=36 -D=250 -1=MKN28.1.fa -2=MKN28.2.fa -U=U.tables/hg19.refMrna.L36.D250 --g2m=gene2NM.human.fastafiltered --g2s=gene2symbol.human -b=bin/bowtie-0.12.7 --bi=ebwt/hg19.RefmRNA -o=MKN28.hg19 -s=MKN28
# This command will skip bowtie-mapping and insert length distribution and use the files from the previous (eg. above) run, and reports counts of all reads (gReadcount), gNIR and iNIR, and will not compute EUMA, FVKM, LVKM.
# Only reads with insert length up to 250 will be used, although the bowtie output from the previous run contains larger insert lengths.


# single-end
auto_NEUMA_SE.pl --noNEUMA --mm2 -L=36 -t=30 -p=2 -i=MKN28.txt -U=hg19.refMrna.L36.single --g2m=gene2NM.human.fastafiltered --g2s=gene2symbol.human -b=../bin/bowtie-0.12.5 --bi=ebwt/hg19.RefmRNA -o=MKN28.hg19 -s=MKN28
# This command will only do bowtie-mapping and reports mapping stat and gNIR and iNIR.


5. OUTPUT Files

   ## paired-end

   For Refseq, the final log2(x+1) transformed values of FVKM (LVKM) can be found in  outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.-EUMAcut.-NR.gLVKM (gene-wise) and outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.-EUMAcut.-NR.iLVKM (isoform-wise).

   Files containing '-NR' excludes genes with no mRNA (eg. rRNA, tRNA, snRNA, snoRNA and other small RNA genes). If your sample is poly-A-selected, these RNAs cannot be quantified along with mRNAs. If your sample did not use any enrichment step and mRNAs and ncRNAs are represented in their exact proportion as in the cell, you can use the LVKM files without the '-NR' tag. The ncRNAs were removed at the last step because reads mapping to both an mRNA and an ncRNA must be considered a multi-read. For Ensembl, -NR version are not provided (ncRNAs are not removed).


   The unadjusted and adjusted FVKM values (as 'FVK' and 'FVKM' for before and after sample-normalization, respectively) can be found in the LVKM files as well, along with gEUMA and iEUMA values and the number of isoforms and the number of measurable isoforms.

   The mapping stat can be found in outputdir/mapping_stat.samplename.

   The NIR and/or readcount values can be found in outputdir/readcount/samplename.ebwtname.maxinsMAX_DIST.mm0.gNIR, outputdir/readcount/samplename.ebwtname.maxinsMAX_DIST.mm0.iNIR and/or outputdir/readcount/samplename.ebwtname.maxinsMAX_DIST.mm0.gReadcount

   The EUMA values can be found in outputdir/EUMA/samplename.ebwtname.maxinsMAX_DIST.mm0.gEUMA and outputdir/EUMA/samplename.ebwtname.maxinsMAX_DIST.mm0.iEUMA.

   The insert length distribution can be found in outputdir/insertlendis/samplename.ebwtname.maxinsMAX_DIST.mm0.i.insertlendis.

   If --mm option was used, replace 'mm0' in the above file names with 'mm1' or 'mm2'.


   ## single-end

   The formats for LVKM and NIR files(gMA & iMA) are the same as in the paired-end case. The LVKM file names are outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.single.-EUMAcut.-NR.gLVKM (gene-wise) and outputdir/LVKM/samplename.ebwtname.maxinsMAX_DIST.mm0.single.-EUMAcut.-NR.iLVKM (isoform-wise). The NIR values can be found in outputdir/MA/samplename.ebwtname.maxinsMAX_DIST.mm0.single.all.gMA (the #uniq.common column) and outputdir/MA/samplename.ebwtname.maxins400.mm0.single.all.iMA (the #uniq column).

   The mapping stat can be found in outputdir/mappint_stat.samplename.single.


6. Preliminary run

   Given a new data set, users can run the first part of auto_NEUMA_PE.pl, without the U.tables, to find out the insert length distribution and mapping stats. Based on the length distribution, the user can determine a safe MAXDIST and request us to build U.tables (through the NEUMA website). After having the U.tables ready, the user can run the latter part of auto_NEUMA_PE.pl. Note that -U, --g2m and --g2s are not required for this run.

   * Usages:

    auto_NEUMA_PE.pl --only_init [options] -L=<read_length> -D=<maxdist> -1=<input_file1(mate1)> -2=<input_file2(mate2)> -b=<bowtie_dir(eg.bin/bowtie-0.12.5)> --bi=<bowtieindex> -o=<outputdir> -s=<sample_name>

    ** required arguments **
    * -L=<read_length> : read_length(eg.36) : sequenced length of a read mate (/not/ the insert length or sum of the two mate lengths) (no default : L must be specified)
    * -D=<maxdist> : maxdist(eg.400) : maximum outer distance between mates (insert size). This must be identical to the maxdist used for generating gU and iU tables. (no default : D must be specified)
    * -1=<input_file1(mate1)>  : fasta or fastq file of a series of read mate 1
    * -2=<input_file2(mate2)>  : fasta or fastq file of a series of read mate 2. The ID of each mate 2 must be matched to that of mate 1 in case of fasta file.
    * -b=<bowtie_dir(eg.bin/bowtie-0.12.7)> : directory in which bowtie executable is installed. Either full path or relative path must be used and '~' must be avoided.
    * --bi=<bowtieindex> : the index file prefix of the reference transcriptome sequence, created by bowtie (bowtie-build). It is the same string put as the reference index argument for the bowtie program. (see bowtie manual)
    * -o=<outputdir> : a directory that will contain all the output files. If the directory does not exist, the program will create it automatically.
    * -s=<sample_name> : name of the sample that will be used as the prefix of all the output files.

    ** options **
    * -f=<file_type> : fasta(f)_or_fastq(q) : f if input files are in fasta format; q if in fastq format. (default : q)
    * -c=<coding_option> : nucleotide(n)_or_colorspace(c) : n if input files are in DNA sequence (A,C,G,T); c if in colorspace (SOLiD platform). Note that if colorspace is used, the colorspace version of bowtie index file must be used. (default : n)
    * -p=<num_cpu> : number of cpu's to use for Bowtie. (default : 1)
    * --str=<strand_specificity> : Strand-specific(S) vs Non-Strand-specific(N). (default : N)
    * --mm=<number_of_mismatches_allowed> : setting number of mismatches allowed (default : 0)

   * The maxdist for this run can be set to something very large, eg. 1000. Then, for the real runs using the --skip_init option, the maxdist must be set identical to the one used to build the U.tables.

   * For single-end reads, insert length distribution does not have to be pre-determined. Simply tell us the read length to request the EUMA tables for single-end data.

   ** output **
   The quickest way to check the output insert length distribution is to look at the insertlendis directory under outputdir.


7. Merging output files


   1) merge_LVKM.pl

   The script produces a text file that contains gLVKM or iLVKM values for all samples for all genes/isoforms.

   usage: ./merge_LVKM.pl genewise/isoformwise[g/i] EUMAcut LVKM_out_dir NR(1/0) > output.gLVKM(iLVKM).merged

   Example usage:
   ./merge_LVKM.pl g 50 data/LVKM 0 > data/LVKM/all.gLVKM.merged
   ./merge_LVKM.pl i 50 data/LVKM 0 > data/LVKM/all.iLVKM.merged

   * genewise/isoformwise[g/i] : put g for gLVKM and i for iLVKM.
   * EUMAcut : put the same EUMA cut off used to generate the LVKM files. This will be used to recognize the files.
   * LVKM_out : This is usually your_basedir/LVKM, which contains all the LVKM files generated. All the LVKM files in this directory that matches the EUMAcut specified will be included in the merged table.
   * NR(1/0) : 1 means the script must use the LVKM files generated after the noncoding RNAs starting with the NR prefix are removed. 0 is otherwise.


   2) merge_LVKM_readcount.pl


   This script produces a gNIR / iNIR file that contains the read counts for all samples for all genes/isoforms. The gNIR / iNIR files can be fed to diffNEUMA (http://neuma.kobic.re.kr), for identification of differentially expressed genes/isoforms. This script is identical to merge_NEUMA_readcount.pl in previous versions.

   usage: ./merge_LVKM_readcount.pl genewise/isoformwise[g/i] EUMAcut LVKM_out_dir NR(1/0) > output.gLVKM(iLVKM).merged

   Example usage:
   ./merge_LVKM_readcount.pl g 10 data/LVKM 1 > data/LVKM/all.gNIR
   ./merge_LVKM_readcount.pl i 10 data/LVKM 1 > data/LVKM/all.iNIR

   * genewise/isoformwise[g/i] : put g for gLVKM and i for iLVKM.
   * EUMAcut : put the same EUMA cut off used to generate the LVKM files. This will be used to recognize the files.
   * LVKM_out : This is usually your_basedir/LVKM, which contains all the LVKM files generated. All the LVKM files in this directory that matches the EUMAcut specified will be included in the NIR file.
   * NR(1/0) : 1 means the script must use the LVKM files generated after the noncoding RNAs starting with the NR prefix are removed. 0 is otherwise.


   3) merge_readcount.pl


   This script produces a gNIR.merged / iNIR.merged / gReadcount.merged file that contains the read counts for all samples. The gNIR / iNIR /gReadcount files can be fed to diffNEUMA (http://neuma.kobic.re.kr), for identification of differentially expressed genes/isoforms. The result is not filtered for EUMA cut or NR as in merge_LVKM_readcount.pl.

   usage: ./merge_readcount.pl type[gNIR/iNIR/gReadcount] readcount_dir > output.gNIR(iNIR/gReadcount).merged

   Example usage:
   ./merge_readcount.pl gNIR data/readcount > data/readcount/all.gNIR.merged
   ./merge_readcount.pl iNIR data/readcount > data/readcount/all.iNIR.merged

   * type (gNIR/iNIR/gReadcount) : the type of the read counts to be merged
   * readcount_dir : This is usually your_basedir/readcount, which contains all the readcount files generated.


//
author	chawhwa
date	Thu, 08 Aug 2013 00:46:13 -0400
parents
children