Mercurial > repos > siyuan > prada
comparison pyPRADA_1.2/tools/bwa-0.5.7-mh/bwa.1 @ 0:acc2ca1a3ba4
Uploaded
author | siyuan |
---|---|
date | Thu, 20 Feb 2014 00:44:58 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:acc2ca1a3ba4 |
---|---|
1 .TH bwa 1 "10 Feburuary 2010" "bwa-0.5.6" "Bioinformatics tools" | |
2 .SH NAME | |
3 .PP | |
4 bwa - Burrows-Wheeler Alignment Tool | |
5 .SH SYNOPSIS | |
6 .PP | |
7 bwa index -a bwtsw database.fasta | |
8 .PP | |
9 bwa aln database.fasta short_read.fastq > aln_sa.sai | |
10 .PP | |
11 bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam | |
12 .PP | |
13 bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam | |
14 .PP | |
15 bwa bwasw database.fasta long_read.fastq > aln.sam | |
16 | |
17 .SH DESCRIPTION | |
18 .PP | |
19 BWA is a fast light-weighted tool that aligns relatively short sequences | |
20 (queries) to a sequence database (targe), such as the human reference | |
21 genome. It implements two different algorithms, both based on | |
22 Burrows-Wheeler Transform (BWT). The first algorithm is designed for | |
23 short queries up to ~200bp with low error rate (<3%). It does gapped | |
24 global alignment w.r.t. queries, supports paired-end reads, and is one | |
25 of the fastest short read alignment algorithms to date while also | |
26 visiting suboptimal hits. The second algorithm, BWA-SW, is designed for | |
27 long reads with more errors. It performs heuristic Smith-Waterman-like | |
28 alignment to find high-scoring local hits (and thus chimera). On | |
29 low-error short queries, BWA-SW is slower and less accurate than the | |
30 first algorithm, but on long queries, it is better. | |
31 .PP | |
32 For both algorithms, the database file in the FASTA format must be | |
33 first indexed with the | |
34 .B `index' | |
35 command, which typically takes a few hours. The first algorithm is | |
36 implemented via the | |
37 .B `aln' | |
38 command, which finds the suffix array (SA) coordinates of good hits of | |
39 each individual read, and the | |
40 .B `samse/sampe' | |
41 command, which converts SA coordinates to chromosomal coordinate and | |
42 pairs reads (for `sampe'). The second algorithm is invoked by the | |
43 .B `dbtwsw' | |
44 command. It works for single-end reads only. | |
45 | |
46 .SH COMMANDS AND OPTIONS | |
47 .TP | |
48 .B index | |
49 bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta> | |
50 | |
51 Index database sequences in the FASTA format. | |
52 | |
53 .B OPTIONS: | |
54 .RS | |
55 .TP 10 | |
56 .B -c | |
57 Build color-space index. The input fast should be in nucleotide space. | |
58 .TP | |
59 .B -p STR | |
60 Prefix of the output database [same as db filename] | |
61 .TP | |
62 .B -a STR | |
63 Algorithm for constructing BWT index. Available options are: | |
64 .RS | |
65 .TP | |
66 .B is | |
67 IS linear-time algorithm for constructing suffix array. It requires | |
68 5.37N memory where N is the size of the database. IS is moderately fast, | |
69 but does not work with database larger than 2GB. IS is the default | |
70 algorithm due to its simplicity. The current codes for IS algorithm are | |
71 reimplemented by Yuta Mori. | |
72 .TP | |
73 .B bwtsw | |
74 Algorithm implemented in BWT-SW. This method works with the whole human | |
75 genome, but it does not work with database smaller than 10MB and it is | |
76 usually slower than IS. | |
77 .RE | |
78 .RE | |
79 | |
80 .TP | |
81 .B aln | |
82 bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i | |
83 nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] | |
84 [-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> > | |
85 <out.sai> | |
86 | |
87 Find the SA coordinates of the input reads. Maximum | |
88 .I maxSeedDiff | |
89 differences are allowed in the first | |
90 .I seedLen | |
91 subsequence and maximum | |
92 .I maxDiff | |
93 differences are allowed in the whole sequence. | |
94 | |
95 .B OPTIONS: | |
96 .RS | |
97 .TP 10 | |
98 .B -n NUM | |
99 Maximum edit distance if the value is INT, or the fraction of missing | |
100 alignments given 2% uniform base error rate if FLOAT. In the latter | |
101 case, the maximum edit distance is automatically chosen for different | |
102 read lengths. [0.04] | |
103 .TP | |
104 .B -o INT | |
105 Maximum number of gap opens [1] | |
106 .TP | |
107 .B -e INT | |
108 Maximum number of gap extensions, -1 for k-difference mode (disallowing | |
109 long gaps) [-1] | |
110 .TP | |
111 .B -d INT | |
112 Disallow a long deletion within INT bp towards the 3'-end [16] | |
113 .TP | |
114 .B -i INT | |
115 Disallow an indel within INT bp towards the ends [5] | |
116 .TP | |
117 .B -l INT | |
118 Take the first INT subsequence as seed. If INT is larger than the query | |
119 sequence, seeding will be disabled. For long reads, this option is | |
120 typically ranged from 25 to 35 for `-k 2'. [inf] | |
121 .TP | |
122 .B -k INT | |
123 Maximum edit distance in the seed [2] | |
124 .TP | |
125 .B -t INT | |
126 Number of threads (multi-threading mode) [1] | |
127 .TP | |
128 .B -M INT | |
129 Mismatch penalty. BWA will not search for suboptimal hits with a score | |
130 lower than (bestScore-misMsc). [3] | |
131 .TP | |
132 .B -O INT | |
133 Gap open penalty [11] | |
134 .TP | |
135 .B -E INT | |
136 Gap extension penalty [4] | |
137 .TP | |
138 .B -R INT | |
139 Proceed with suboptimal alignments if there are no more than INT equally | |
140 best hits. This option only affects paired-end mapping. Increasing this | |
141 threshold helps to improve the pairing accuracy at the cost of speed, | |
142 especially for short reads (~32bp). | |
143 .TP | |
144 .B -c | |
145 Reverse query but not complement it, which is required for alignment in | |
146 the color space. | |
147 .TP | |
148 .B -N | |
149 Disable iterative search. All hits with no more than | |
150 .I maxDiff | |
151 differences will be found. This mode is much slower than the default. | |
152 .TP | |
153 .B -q INT | |
154 Parameter for read trimming. BWA trims a read down to | |
155 argmax_x{\\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original | |
156 read length. [0] | |
157 .RE | |
158 | |
159 .TP | |
160 .B samse | |
161 bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam> | |
162 | |
163 Generate alignments in the SAM format given single-end reads. Repetitive | |
164 hits will be randomly chosen. | |
165 | |
166 .B OPTIONS: | |
167 .RS | |
168 .TP 10 | |
169 .B -n INT | |
170 Maximum number of alignments to output in the XA tag for reads paired | |
171 properly. If a read has more than INT hits, the XA tag will not be | |
172 written. [3] | |
173 .RE | |
174 | |
175 .TP | |
176 .B sampe | |
177 bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] | |
178 [-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam> | |
179 | |
180 Generate alignments in the SAM format given paired-end reads. Repetitive | |
181 read pairs will be placed randomly. | |
182 | |
183 .B OPTIONS: | |
184 .RS | |
185 .TP 8 | |
186 .B -a INT | |
187 Maximum insert size for a read pair to be considered being mapped | |
188 properly. Since 0.4.5, this option is only used when there are not | |
189 enough good alignment to infer the distribution of insert sizes. [500] | |
190 .TP | |
191 .B -o INT | |
192 Maximum occurrences of a read for pairing. A read with more occurrneces | |
193 will be treated as a single-end read. Reducing this parameter helps | |
194 faster pairing. [100000] | |
195 .TP | |
196 .B -P | |
197 Load the entire FM-index into memory to reduce disk operations | |
198 (base-space reads only). With this option, at least 1.25N bytes of | |
199 memory are required, where N is the length of the genome. | |
200 .TP | |
201 .B -n INT | |
202 Maximum number of alignments to output in the XA tag for reads paired | |
203 properly. If a read has more than INT hits, the XA tag will not be | |
204 written. [3] | |
205 .TP | |
206 .B -N INT | |
207 Maximum number of alignments to output in the XA tag for disconcordant | |
208 read pairs (excluding singletons). If a read has more than INT hits, the | |
209 XA tag will not be written. [10] | |
210 .RE | |
211 | |
212 .TP | |
213 .B bwasw | |
214 bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t | |
215 nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N | |
216 nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> | |
217 | |
218 Align query sequences in the <in.fq> file. | |
219 | |
220 .B OPTIONS: | |
221 .RS | |
222 .TP 10 | |
223 .B -a INT | |
224 Score of a match [1] | |
225 .TP | |
226 .B -b INT | |
227 Mismatch penalty [3] | |
228 .TP | |
229 .B -q INT | |
230 Gap open penalty [5] | |
231 .TP | |
232 .B -r INT | |
233 Gap extension penalty. The penalty for a contiguous gap of size k is | |
234 q+k*r. [2] | |
235 .TP | |
236 .B -t INT | |
237 Number of threads in the multi-threading mode [1] | |
238 .TP | |
239 .B -w INT | |
240 Band width in the banded alignment [33] | |
241 .TP | |
242 .B -T INT | |
243 Minimum score threshold divided by a [37] | |
244 .TP | |
245 .B -c FLOAT | |
246 Coefficient for threshold adjustment according to query length. Given an | |
247 l-long query, the threshold for a hit to be retained is | |
248 a*max{T,c*log(l)}. [5.5] | |
249 .TP | |
250 .B -z INT | |
251 Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1] | |
252 .TP | |
253 .B -s INT | |
254 Maximum SA interval size for initiating a seed. Higher -s increases | |
255 accuracy at the cost of speed. [3] | |
256 .TP | |
257 .B -N INT | |
258 Minimum number of seeds supporting the resultant alignment to skip | |
259 reverse alignment. [5] | |
260 .RE | |
261 | |
262 .SH SAM ALIGNMENT FORMAT | |
263 .PP | |
264 The output of the | |
265 .B `aln' | |
266 command is binary and designed for BWA use only. BWA outputs the final | |
267 alignment in the SAM (Sequence Alignment/Map) format. Each line consists | |
268 of: | |
269 | |
270 .TS | |
271 center box; | |
272 cb | cb | cb | |
273 n | l | l . | |
274 Col Field Description | |
275 _ | |
276 1 QNAME Query (pair) NAME | |
277 2 FLAG bitwise FLAG | |
278 3 RNAME Reference sequence NAME | |
279 4 POS 1-based leftmost POSition/coordinate of clipped sequence | |
280 5 MAPQ MAPping Quality (Phred-scaled) | |
281 6 CIAGR extended CIGAR string | |
282 7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME) | |
283 8 MPOS 1-based Mate POSistion | |
284 9 ISIZE Inferred insert SIZE | |
285 10 SEQ query SEQuence on the same strand as the reference | |
286 11 QUAL query QUALity (ASCII-33 gives the Phred base quality) | |
287 12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE | |
288 .TE | |
289 | |
290 .PP | |
291 Each bit in the FLAG field is defined as: | |
292 | |
293 .TS | |
294 center box; | |
295 cb | cb | cb | |
296 c | l | l . | |
297 Chr Flag Description | |
298 _ | |
299 p 0x0001 the read is paired in sequencing | |
300 P 0x0002 the read is mapped in a proper pair | |
301 u 0x0004 the query sequence itself is unmapped | |
302 U 0x0008 the mate is unmapped | |
303 r 0x0010 strand of the query (1 for reverse) | |
304 R 0x0020 strand of the mate | |
305 1 0x0040 the read is the first read in a pair | |
306 2 0x0080 the read is the second read in a pair | |
307 s 0x0100 the alignment is not primary | |
308 f 0x0200 QC failure | |
309 d 0x0400 optical or PCR duplicate | |
310 .TE | |
311 | |
312 .PP | |
313 The Please check <http://samtools.sourceforge.net> for the format | |
314 specification and the tools for post-processing the alignment. | |
315 | |
316 BWA generates the following optional fields. Tags starting with `X' are | |
317 specific to BWA. | |
318 | |
319 .TS | |
320 center box; | |
321 cb | cb | |
322 cB | l . | |
323 Tag Meaning | |
324 _ | |
325 NM Edit distance | |
326 MD Mismatching positions/bases | |
327 AS Alignment score | |
328 _ | |
329 X0 Number of best hits | |
330 X1 Number of suboptimal hits found by BWA | |
331 XN Number of ambiguous bases in the referenece | |
332 XM Number of mismatches in the alignment | |
333 XO Number of gap opens | |
334 XG Number of gap extentions | |
335 XT Type: Unique/Repeat/N/Mate-sw | |
336 XA Alternative hits; format: (chr,pos,CIGAR,NM;)* | |
337 _ | |
338 XS Suboptimal alignment score | |
339 XF Support from forward/reverse alignment | |
340 XE Number of supporting seeds | |
341 .TE | |
342 | |
343 .PP | |
344 Note that XO and XG are generated by BWT search while the CIGAR string | |
345 by Smith-Waterman alignment. These two tags may be inconsistent with the | |
346 CIGAR string. This is not a bug. | |
347 | |
348 .SH NOTES ON SHORT-READ ALIGNMENT | |
349 .SS Alignment Accuracy | |
350 .PP | |
351 When seeding is disabled, BWA guarantees to find an alignment | |
352 containing maximum | |
353 .I maxDiff | |
354 differences including | |
355 .I maxGapO | |
356 gap opens which do not occur within | |
357 .I nIndelEnd | |
358 bp towards either end of the query. Longer gaps may be found if | |
359 .I maxGapE | |
360 is positive, but it is not guaranteed to find all hits. When seeding is | |
361 enabled, BWA further requires that the first | |
362 .I seedLen | |
363 subsequence contains no more than | |
364 .I maxSeedDiff | |
365 differences. | |
366 .PP | |
367 When gapped alignment is disabled, BWA is expected to generate the same | |
368 alignment as Eland, the Illumina alignment program. However, as BWA | |
369 change `N' in the database sequence to random nucleotides, hits to these | |
370 random sequences will also be counted. As a consequence, BWA may mark a | |
371 unique hit as a repeat, if the random sequences happen to be identical | |
372 to the sequences which should be unqiue in the database. This random | |
373 behaviour will be avoided in future releases. | |
374 .PP | |
375 By default, if the best hit is no so repetitive (controlled by -R), BWA | |
376 also finds all hits contains one more mismatch; otherwise, BWA finds all | |
377 equally best hits only. Base quality is NOT considered in evaluating | |
378 hits. In paired-end alignment, BWA pairs all hits it found. It further | |
379 performs Smith-Waterman alignment for unmapped reads with mates mapped | |
380 to rescue mapped mates, and for high-quality anomalous pairs to fix | |
381 potential alignment errors. | |
382 | |
383 .SS Estimating Insert Size Distribution | |
384 .PP | |
385 BWA estimates the insert size distribution per 256*1024 read pairs. It | |
386 first collects pairs of reads with both ends mapped with a single-end | |
387 quality 20 or higher and then calculates median (Q2), lower and higher | |
388 quartile (Q1 and Q3). It estimates the mean and the variance of the | |
389 insert size distribution from pairs whose insert sizes are within | |
390 interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair | |
391 considered to be properly paired (SAM flag 0x2) is calculated by solving | |
392 equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the | |
393 standard error of the insert size distribution, L is the length of the | |
394 genome, p0 is prior of anomalous pair and Phi() is the standard | |
395 cumulative distribution function. For mapping Illumina short-insert | |
396 reads to the human genome, x is about 6-7 sigma away from the | |
397 mean. Quartiles, mean, variance and x will be printed to the standard | |
398 error output. | |
399 | |
400 .SS Memory Requirement | |
401 .PP | |
402 With bwtsw algorithm, 2.5GB memory is required for indexing the complete | |
403 human genome sequences. For short reads, the | |
404 .B `aln' | |
405 command uses ~2.3GB memory and the | |
406 .B `sampe' | |
407 command uses ~3.5GB. | |
408 | |
409 .SS Speed | |
410 .PP | |
411 Indexing the human genome sequences takes 3 hours with bwtsw | |
412 algorithm. Indexing smaller genomes with IS or divsufsort algorithms is | |
413 several times faster, but requires more memory. | |
414 .PP | |
415 Speed of alignment is largely determined by the error rate of the query | |
416 sequences (r). Firstly, BWA runs much faster for near perfect hits than | |
417 for hits with many differences, and it stops searching for a hit with | |
418 l+2 differences if a l-difference hit is found. This means BWA will be | |
419 very slow if r is high because in this case BWA has to visit hits with | |
420 many differences and looking for these hits is expensive. Secondly, the | |
421 alignment algorithm behind makes the speed sensitive to [k log(N)/m], | |
422 where k is the maximum allowed differences, N the size of database and m | |
423 the length of a query. In practice, we choose k w.r.t. r and therefore r | |
424 is the leading factor. I would not recommend to use BWA on data with | |
425 r>0.02. | |
426 .PP | |
427 Pairing is slower for shorter reads. This is mainly because shorter | |
428 reads have more spurious hits and converting SA coordinates to | |
429 chromosomal coordinates are very costly. | |
430 .PP | |
431 In a practical experiment, BWA is able to map 2 million 32bp reads to a | |
432 bacterial genome in several minutes, map the same amount of reads to | |
433 human X chromosome in 8-15 minutes and to the human genome in 15-25 | |
434 minutes. This result implies that the speed of BWA is insensitive to the | |
435 size of database and therefore BWA is more efficient when the database | |
436 is sufficiently large. On smaller genomes, hash based algorithms are | |
437 usually much faster. | |
438 | |
439 .SH NOTES ON LONG-READ ALIGNMENT | |
440 .PP | |
441 Command | |
442 .B `bwasw' | |
443 is designed for long-read alignment. The algorithm behind, BWA-SW, is | |
444 similar to BWT-SW, but does not guarantee to find all local hits due to | |
445 the heuristic acceleration. It tends to be faster and more accurate if | |
446 the resultant alignment is supported by more seeds, and therefore | |
447 BWA-SW usually performs better on long queries than on short ones. | |
448 | |
449 On 350-1000bp reads, BWA-SW is several to tens of times faster than the | |
450 existing programs. Its accuracy is comparable to SSAHA2, more accurate | |
451 than BLAT. Like BLAT, BWA-SW also finds chimera which may pose a | |
452 challenge to SSAHA2. On 10-100kbp queries where chimera detection is | |
453 important, BWA-SW is over 10X faster than BLAT while being more | |
454 sensitive. | |
455 | |
456 BWA-SW can also be used to align ~100bp reads, but it is slower than | |
457 the short-read algorithm. Its sensitivity and accuracy is lower than | |
458 SSAHA2 especially when the sequencing error rate is above 2%. This is | |
459 the trade-off of the 30X speed up in comparison to SSAHA2's -454 mode. | |
460 | |
461 .SH SEE ALSO | |
462 BWA website <http://bio-bwa.sourceforge.net>, Samtools website | |
463 <http://samtools.sourceforge.net> | |
464 | |
465 .SH AUTHOR | |
466 Heng Li at the Sanger Institute wrote the key source codes and | |
467 integrated the following codes for BWT construction: bwtsw | |
468 <http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at | |
469 the University of Hong Kong and IS | |
470 <http://yuta.256.googlepages.com/sais> originally proposed by Nong Ge | |
471 <http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and | |
472 implemented by Yuta Mori. | |
473 | |
474 .SH LICENSE AND CITATION | |
475 .PP | |
476 The full BWA package is distributed under GPLv3 as it uses source codes | |
477 from BWT-SW which is covered by GPL. Sorting, hash table, BWT and IS | |
478 libraries are distributed under the MIT license. | |
479 .PP | |
480 If you use the short-read alignment component, please cite the following | |
481 paper: | |
482 .PP | |
483 Li H. and Durbin R. (2009) Fast and accurate short read alignment with | |
484 Burrows-Wheeler transform. Bioinformatics, 25, 1754-60. [PMID: 19451168] | |
485 .PP | |
486 If you use the long-read component (BWA-SW), please cite: | |
487 .PP | |
488 Li H. and Durbin R. (2010) Fast and accurate long-read alignment with | |
489 Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505] | |
490 | |
491 .SH HISTORY | |
492 BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW | |
493 and mimics its binary file formats; BWA-SW resembles BWT-SW in several | |
494 ways. The initial idea about BWT-based alignment also came from the | |
495 group who developed BWT-SW. At the same time, BWA is different enough | |
496 from BWT-SW. The short-read alignment algorithm bears no similarity to | |
497 Smith-Waterman algorithm any more. While BWA-SW learns from BWT-SW, it | |
498 introduces heuristics that can hardly be applied to the original | |
499 algorithm. In all, BWA does not guarantee to find all local hits as what | |
500 BWT-SW is designed to do, but it is much faster than BWT-SW on both | |
501 short and long query sequences. | |
502 | |
503 I started to write the first piece of codes on 24 May 2008 and got the | |
504 initial stable version on 02 June 2008. During this period, I was | |
505 acquainted that Professor Tak-Wah Lam, the first author of BWT-SW paper, | |
506 was collaborating with Beijing Genomics Institute on SOAP2, the successor | |
507 to SOAP (Short Oligonucleotide Analysis Package). SOAP2 has come out in | |
508 November 2008. According to the SourceForge download page, the third | |
509 BWT-based short read aligner, bowtie, was first released in August | |
510 2008. At the time of writing this manual, at least three more BWT-based | |
511 short-read aligners are being implemented. | |
512 | |
513 The BWA-SW algorithm is a new component of BWA. It was conceived in | |
514 November 2008 and implemented ten months later. |