Mercurial > repos > weilong-guo > bs_seeker2
diff BSseeker2/README.md @ 1:8b26adf64adc draft default tip
V2.0.5
author | weilong-guo |
---|---|
date | Tue, 05 Nov 2013 01:55:39 -0500 |
parents | e6df770c0e58 |
children |
line wrap: on
line diff
--- a/BSseeker2/README.md Fri Jul 12 18:47:28 2013 -0400 +++ b/BSseeker2/README.md Tue Nov 05 01:55:39 2013 -0500 @@ -48,9 +48,10 @@ * Linux or Mac OS platform * One of the following Aligner - - [bowtie](http://bowtie-bio.sourceforge.net/) - - [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/) (Recommend) + - [bowtie](http://bowtie-bio.sourceforge.net/) (fast) + - [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/) (Default) - [soap](http://soap.genomics.org.cn/) + - [rmap](http://www.cmb.usc.edu/people/andrewds/rmap/) * [Python](http://www.python.org/download/) (Version 2.6 +) (It is normally pre-installed in Linux. Type " python -V" to see the installed version.) @@ -74,7 +75,7 @@ $ python FilterReads.py Usage: FilterReads.py -i <input> -o <output> [-k] - Author : Guo, Weilong; guoweilong@gmail.com; 2012-11-10 + Author : Guo, Weilong; 2012-11-10 Unique reads for qseq/fastq/fasta/sequencce, and filter low quality file in qseq file. @@ -105,16 +106,16 @@ -h, --help show this help message and exit -f FILE, --file=FILE Input your reference genome file (fasta) --aligner=ALIGNER Aligner program to perform the analysis: bowtie, - bowtie2, soap [Default: bowtie2] - -p PATH, --path=PATH Path to the aligner program. Defaults: - bowtie: /u/home/mcdb/weilong/install/bowtie-0.12.8 - bowtie2: - /u/home/mcdb/weilong/install/bowtie2-2.0.0-beta7 - soap: /u/home/mcdb/weilong/install/soap2.21release/ + bowtie2, soap, rmap [Default: bowtie] + -p PATH, --path=PATH Path to the aligner program. Detected: + bowtie: ~/install/bowtie + bowtie2: ~/install/bowtie2 + rmap: ~/install/rmap_/bin + soap: ~/install/soap/ -d DBPATH, --db=DBPATH Path to the reference genome library (generated in - preprocessing genome) [Default: /u/home/mcdb/weilong/i - nstall/BSseeker2/bs_utils/reference_genomes] + preprocessing genome) [Default: ~/install + /BSseeker2/bs_utils/reference_genomes] -v, --version show version of BS-Seeker2 Reduced Representation Bisulfite Sequencing Options: @@ -125,7 +126,7 @@ certain fragments will be masked. [Default: False] -l LOW_BOUND, --low=LOW_BOUND lower bound of fragment length (excluding recognition - sequence such as C-CGG) [Default: 40] + sequence such as C-CGG) [Default: 20] -u UP_BOUND, --up=UP_BOUND upper bound of fragment length (excluding recognition sequence such as C-CGG ends) [Default: 500] @@ -134,6 +135,7 @@ Mael:(C-TAG), double-enzyme MspI&Mael:(C-CGG,C-TAG). [Default: C-CGG] + ##Example * Build genome index for WGBS using bowtie, path of bowtie should be included in $PATH @@ -150,7 +152,7 @@ * Build genome index for RRBS library for double-enzyme : MspI (C-CGG) & ApeKI (G-CWGC, where W=A|T, see [IUPAC code](http://www.bioinformatics.org/sms/iupac.html)) - python bs_seeker2-build.py -f genome.fa -r -c C-CGG,G-CWGC + python bs_seeker2-build.py -f genome.fa -r -c C-CGG,G-CWGC --aligner=bowtie ##Tips: @@ -172,75 +174,73 @@ ##Usage : $ python ~/install/BSseeker2/bs_seeker2-align.py -h - Usage: bs_seeker2-align.py [options] + Usage: bs_seeker2-align.py {-i <single> | -1 <mate1> -2 <mate2>} -g <genome.fa> [options] Options: -h, --help show this help message and exit For single end reads: -i INFILE, --input=INFILE - Input your read file name (FORMAT: sequences, - fastq, qseq,fasta) + Input read file (FORMAT: sequences, qseq, fasta, + fastq). Ex: read.fa or read.fa.gz For pair end reads: -1 FILE, --input_1=FILE - Input your read file end 1 (FORMAT: sequences, - qseq, fasta, fastq) + Input read file, mate 1 (FORMAT: sequences, qseq, + fasta, fastq) -2 FILE, --input_2=FILE - Input your read file end 2 (FORMAT: sequences, - qseq, fasta, fastq) - --minins=MIN_INSERT_SIZE + Input read file, mate 2 (FORMAT: sequences, qseq, + fasta, fastq) + -I MIN_INSERT_SIZE, --minins=MIN_INSERT_SIZE The minimum insert size for valid paired-end - alignments [Default: -1] - --maxins=MAX_INSERT_SIZE + alignments [Default: 0] + -X MAX_INSERT_SIZE, --maxins=MAX_INSERT_SIZE The maximum insert size for valid paired-end - alignments [Default: 400] + alignments [Default: 500] Reduced Representation Bisulfite Sequencing Options: - -r, --rrbs Process reads from Reduced Representation Bisulfite - Sequencing experiments + -r, --rrbs Map reads to the Reduced Representation genome -c pattern, --cut-site=pattern Cutting sites of restriction enzyme. Ex: MspI(C-CGG), Mael:(C-TAG), double-enzyme MspI&Mael:(C-CGG,C-TAG). + [Default: C-CGG] -L RRBS_LOW_BOUND, --low=RRBS_LOW_BOUND - lower bound of fragment length (excluding C-CGG ends) - [Default: 40] + Lower bound of fragment length (excluding C-CGG ends) + [Default: 20] -U RRBS_UP_BOUND, --up=RRBS_UP_BOUND - upper bound of fragment length (excluding C-CGG ends) + Upper bound of fragment length (excluding C-CGG ends) [Default: 500] General options: -t TAG, --tag=TAG [Y]es for undirectional lib, [N]o for directional [Default: N] -s CUTNUMBER1, --start_base=CUTNUMBER1 - The first base of your read to be mapped [Default: 1] + The first cycle of the read to be mapped [Default: 1] -e CUTNUMBER2, --end_base=CUTNUMBER2 - The last cycle number of your read to be mapped - [Default: 200] + The last cycle of the read to be mapped [Default: 200] -a FILE, --adapter=FILE Input text file of your adaptor sequences (to be - trimed from the 3'end of the reads). Input 1 seq for - dir. lib., 2 seqs for undir. lib. One line per - sequence + trimmed from the 3'end of the reads, ). Input one seq + for dir. lib., twon seqs for undir. lib. One line per + sequence. Only the first 10bp will be used --am=ADAPTER_MISMATCH - Number of mismatches allowed in adaptor [Default: 1] + Number of mismatches allowed in adapter [Default: 0] -g GENOME, --genome=GENOME - Name of the reference genome (the same as the - reference genome file in the preprocessing step) [ex. - chr21_hg18.fa] - -m INT_NO_MISMATCHES, --mismatches=INT_NO_MISMATCHES + Name of the reference genome (should be the same as + "-f" in bs_seeker2-build.py ) [ex. chr21_hg18.fa] + -m NO_MISMATCHES, --mismatches=NO_MISMATCHES Number of mismatches in one read [Default: 4] - --aligner=ALIGNER Aligner program to perform the analisys: bowtie, - bowtie2, soap [Default: bowtie2] + --aligner=ALIGNER Aligner program for short reads mapping: bowtie, + bowtie2, soap, rmap [Default: bowtie] -p PATH, --path=PATH - Path to the aligner program. Defaults: - bowtie: /u/home/mcdb/weilong/install/bowtie-0.12.8 - bowtie2: - /u/home/mcdb/weilong/install/bowtie2-2.0.0-beta7 - soap: /u/home/mcdb/weilong/soap2.21release/ + Path to the aligner program. Detected: + bowtie: ~/install/bowtie + bowtie2: ~/install/bowtie2 + rmap: ~/install/rmap/bin + soap: ~/install/soap/ -d DBPATH, --db=DBPATH Path to the reference genome library (generated in - preprocessing genome) [Default: /u/home/mcdb/weilong/i + preprocessing genome) [Default: ~/i nstall/BSseeker2/bs_utils/reference_genomes] -l NO_SPLIT, --split_line=NO_SPLIT Number of lines per split (the read file will be split @@ -251,7 +251,7 @@ -f FORMAT, --output-format=FORMAT Output format: bam, sam, bs_seeker1 [Default: bam] --no-header Suppress SAM header lines [Default: False] - --temp_dir=PATH The path to your temporary directory [Default: /tmp] + --temp_dir=PATH The path to your temporary directory [Detected: /tmp] --XS=XS_FILTER Filter definition for tag XS, format X,Y. X=0.8 and y=5 indicate that for one read, if #(mCH sites)/#(all CH sites)>0.8 and #(mCH sites)>5, then tag XS=1; or @@ -263,31 +263,51 @@ Aligner Options: You may specify any additional options for the aligner. You just have to prefix them with --bt- for bowtie, --bt2- for bowtie2, --soap- for - soap, and BS Seeker will pass them on. For example: --bt-p 4 will - increase the number of threads for bowtie to 4, --bt--tryhard will - instruct bowtie to try as hard as possible to find valid alignments - when they exist, and so on. Be sure that you know what you are doing - when using these options! Also, we don't do any validation on the - values. - + soap, --rmap- for rmap, and BS Seeker will pass them on. For example: + --bt-p 4 will increase the number of threads for bowtie to 4, --bt-- + tryhard will instruct bowtie to try as hard as possible to find valid + alignments when they exist, and so on. Be sure that you know what you + are doing when using these options! Also, we don't do any validation + on the values. ##Examples : -* Align from fasta format with bowtie2 (local alignment) for whole genome, allowing 3 mismatches +* WGBS library ; alignment mode, bowtie ; map to WGBS index + + python bs_seeker2-align.py -i WGBS.fa --aligner=bowtie -o WGBS.bam -f bam -g genome.fa + +* WGBS library ; alignment mode, bowtie2-local ; map to WGBS index - python bs_seeker2-align.py -i WGBS.fa -m 3 --aligner=bowtie2 -o WGBS.bam -f bam -g genome.fa + python bs_seeker2-align.py -i WGBS.fa --aligner=bowtie2 -o WGBS.bam -f bam -g genome.fa + +* WGBS library ; alignment mode, bowtie2-end-to-end ; map to WGBS index -* Align from qseq format for RRBS with bowtie, default parameters for RRBS fragments + python bs_seeker2-align.py -i WGBS.fa -m 3 --aligner=bowtie2 -o WGBS.bam -f bam -g genome.fa --bt2--end-to-end + +* RRBS library ; alignment mode, bowtie ; map to RR index - python bs_seeker2-align.py -i RRBS.fa --aligner=bowtie -o RRBS.sam -f sam -g genome.fa -r -a adapter.txt + python bs_seeker2-align.py -i RRBS.fa --aligner=bowtie -o RRBS.bam -g genome.fa -r -a adapter.txt + +* RRBS library ; alignment mode, bowtie ; map to WG index + + python bs_seeker2-align.py -i RRBS.fa --aligner=bowtie -o RRBS.bam -g genome.fa -a adapter.txt -* Align from qseq format for RRBS with bowtie (end-to-end), specifying lengths of fragments ranging [40bp, 400bp] +* RRBS library ; alignment mode, bowtie2-end-to-end ; map to WG index + + python bs_seeker2-align.py -i RRBS.fa --aligner=bowtie -o RRBS.bam -g genome.fa -a adapter.txt --bt2--end-to-end - python bs_seeker2-align.py -i RRBS.qseq --aligner=bowtie2 --bt2--end-to-end -o RRBS.bam -f bam -g genome.fa -r --low=40 --up=400 -a adapter.txt +* Align from qseq format for RRBS with bowtie, specifying lengths of fragments ranging [40bp, 400bp] + + python bs_seeker2-align.py -i RRBS.qseq --aligner=bowtie -o RRBS.bam -f bam -g genome.fa -r --low=40 --up=400 -a adapter.txt The parameters '--low' and '--up' should be the same with corresponding parameters when building the genome index +* WGBS library ; alignment mode, bowtie ; map to WGBS index; use 8 threads for alignment + + python bs_seeker2-align.py -i WGBS.fa --aligner=bowtie -o WGBS.bam -f bam -g genome.fa --bt-p 4 + +BS-Seeker2 will run TWO bowtie instances in parallel. ##Input file: @@ -341,6 +361,10 @@ For example, if 95% of reads come from fragments with length range [50bp, 250bp], you'd better choose [40bp, 300bp]. +- Fewer mismatches for the 'local alignment' mode. + + As the 'local alignment', the bad sequenced bases are usually trimmed, and would not be considered by the parameter "-m". + It is suggested to user fewer mismatches for the 'local alignment' mode. (3) bs_seeker2-call_methylation.py ------------ @@ -351,16 +375,14 @@ ##Usage: $ python bs_seeker2-call_methylation.py -h - Usage: bs_seeker2-call_methylation.py [options] - Options: -h, --help show this help message and exit -i INFILE, --input=INFILE BAM output from bs_seeker2-align.py -d DBPATH, --db=DBPATH Path to the reference genome library (generated in - preprocessing genome) [Default: /u/home/mcdb/weilong/i - nstall/BSseeker2/bs_utils/reference_genomes] + preprocessing genome) [Default: ~/install + /BSseeker2/bs_utils/reference_genomes] -o OUTFILE, --output-prefix=OUTFILE The output prefix to create ATCGmap and wiggle files [INFILE] @@ -370,6 +392,7 @@ -x, --rm-SX Removed reads with tag 'XS:i:1', which would be considered as not fully converted by bisulfite treatment [Default: False] + --txt Show CGmap and ATCGmap in .gz [Default: False] -r READ_NO, --read-no=READ_NO The least number of reads covering one site to be shown in wig file [Default: 1] @@ -430,9 +453,9 @@ (3) position (4) context (CG/CHG/CHH) (5) dinucleotide-context (CA/CC/CG/CT) - (6) methyltion-level = #-of-C / (#-of-C + #-of-T) - (7) #-of-C (methylated) - (8) (#-ofC + #-of-T) (all cytosines) + (6) methylation-level = #_of_C / (#_of_C + #_of_T). + (7) #_of_C (methylated C, the count of reads showing C here) + (8) = #_of_C + #_of_T (all Cytosines, the count of reads showing C or T here) - ATCGmap file @@ -442,7 +465,7 @@ chr1 T 3009410 -- -- 0 10 0 0 0 0 0 0 0 0 na chr1 C 3009411 CHH CC 0 10 0 0 0 0 0 0 0 0 0.0 chr1 C 3009412 CHG CC 0 10 0 0 0 0 0 0 0 0 0.0 - chr1 C 3009413 CG CG 0 10 50 0 0 0 0 0 0 0 0.833333333333 + chr1 C 3009413 CG CG 0 10 50 0 0 0 0 0 0 0 0.83 Format descriptions: @@ -467,14 +490,15 @@ (14) # of reads from Crick strand mapped here, support G on Watson strand and C on Crick strand (15) # of reads from Crick strand mapped here, support N - (16) methylation_level = #C/(#C+#T) = (C8+C14)/(C7+C8+C11+C14); "nan" means none reads support C/T at this position. + (16) methylation_level = #C/(#C+#T) = C8/(C7+C8) for Watson strand, =C14/(C11+C14) for Crick strand; + "nan" means none reads support C/T at this position. Contact Information ============ -If you still have questions on BS-Seeker 2, or you find bugs when using BS-Seeker 2, or you have suggestions, please write email to guoweilong@gmail.com (Weilong Guo). +If you still have questions on BS-Seeker 2, or you find bugs when using BS-Seeker 2, or you have suggestions, please write email to [Weilong Guo](guoweilong@gmail.com). @@ -556,7 +580,7 @@ cd BSseeker2-master/ -(4)Run BS-Seeker2 +(4)Run BS-Seeker2 Q: Can I add the path of BS-Seeker2's *.py to the $PATH, so I can call BS-Seeker2 from anywhere? @@ -585,6 +609,69 @@ bs_seeker-align.py -h bs_seeker-call_methylation.py -h +(5) Unique alignment + +Q: If I want to only keep alignments that map uniquely, is this an argument I should supply directly +to Bowtie2 (via BS Seeker 2's command line option), or is this an option that's available in +BS Seeker 2 itself? + +A: BS-Seeker2 reports unique alignment by default already. If you want to know how many reads +could have multiple hits, run BS-Seeker2 with parameter "--multiple-hit". + + +(6) Output + +Q: In CGmap files, why some lines shown "--" but not a motif (CG/CHG/CHH), for example: + + chr01 C 4303711 -- CC 0.0 0 2 + chr01 C 4303712 -- CN 0.0 0 2 + +A: In this example, the site 4303713 is "N" in genome, thus we could not decide the explict pattern. + +(7) Algorithm to remove the adapter. + +Q: What's the algorithm to remove the adapter + +A: BS-Seeker2 has built-in algorithm for removing the adapter, which is developed by [Weilong Guo](http://bioinfo.au.tsinghua.edu.cn/member/wguo/index.html). + + First, if the adapter was provided as "AGATCGGAAGAGCACACGTC", only the first 10bp would be used. + Second, we use semi-local alignment strategy for removing the adapters. + Exmaple: + + Read : ACCGCGTTGATCGAGTACGTACGTGGGTC + Adapter : ....................ACGTGGGTCCCG + + no_mismatch : the maximum number allowed for mismatches + + Algorithm: (allowing 1 mismatch) + -Step 1: + ACCGCGTTGATCGAGTACGTACGTGGGTC + ||XX + ACGTGGGTCCCG + -Step 2: + ACCGCGTTGATCGAGTACGTACGTGGGTC + X||X + .ACGTGGGTCCCG + -Step 3: + ACCGCGTTGATCGAGTACGTACGTGGGTC + XX + ..ACGTGGGTCCCG + -Step ... + -Step N: + ACCGCGTTGATCGAGTACGTACGTGGGTC + ||||||||| + ....................ACGTGGGTCCCG + Success & return! + + Third, we also removed the synthesized bases at the end of RRBS fragments. + Take the "C-CGG" cutting site as example, + + - - C|U G G - - =>cut=> - - C =>add=> - - C|C G =>sequencing + - - G G C|C - - - - G G C - - G G C + + In our algorithm, the "CG" in "--CCG" (upper strand) was trimmed, in order to get accurate methyaltion level. + +