annotate pyPRADA_1.2/tools/bwa-0.5.7-mh/bwa.1 @ 0:acc2ca1a3ba4

Uploaded
author siyuan
date Thu, 20 Feb 2014 00:44:58 -0500
parents
children
rev   line source
0
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
.TH bwa 1 "10 February 2010" "bwa-0.5.6" "Bioinformatics tools"
.SH NAME
.PP
bwa - Burrows-Wheeler Alignment Tool
.SH SYNOPSIS
.PP
bwa index -a bwtsw database.fasta
.PP
bwa aln database.fasta short_read.fastq > aln_sa.sai
.PP
bwa samse database.fasta aln_sa.sai short_read.fastq > aln.sam
.PP
bwa sampe database.fasta aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln.sam
.PP
bwa bwasw database.fasta long_read.fastq > aln.sam

.SH DESCRIPTION
.PP
BWA is a fast, lightweight tool that aligns relatively short sequences
(queries) to a sequence database (target), such as the human reference
genome. It implements two different algorithms, both based on the
Burrows-Wheeler Transform (BWT). The first algorithm is designed for
short queries up to ~200bp with low error rate (<3%). It does gapped
global alignment w.r.t. queries, supports paired-end reads, and is one
of the fastest short read alignment algorithms to date while also
visiting suboptimal hits. The second algorithm, BWA-SW, is designed for
long reads with more errors. It performs heuristic Smith-Waterman-like
alignment to find high-scoring local hits (and thus chimera). On
low-error short queries, BWA-SW is slower and less accurate than the
first algorithm, but on long queries, it is better.
.PP
For both algorithms, the database file in the FASTA format must be
first indexed with the
.B `index'
command, which typically takes a few hours. The first algorithm is
implemented via the
.B `aln'
command, which finds the suffix array (SA) coordinates of good hits of
each individual read, and the
.B `samse/sampe'
command, which converts SA coordinates to chromosomal coordinates and
pairs reads (for `sampe'). The second algorithm is invoked by the
.B `bwasw'
command. It works for single-end reads only.
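The index, aln and samse/sampe steps described above chain together in a fixed order. The following is a minimal Python sketch of that ordering, not part of BWA: the function name short_read_steps and the intermediate file names are hypothetical, and nothing is executed. Each step is returned as an argument list plus the file its standard output would be redirected to, mirroring the SYNOPSIS.

```python
def short_read_steps(db, reads, index_algo="bwtsw", out_sam="aln.sam"):
    """Return (argv, stdout_file) pairs for index -> aln -> samse/sampe.

    `reads` holds one FASTQ path (single-end) or two (paired-end).
    Nothing is run here; the caller decides how to execute each step.
    """
    if len(reads) not in (1, 2):
        raise ValueError("expected one or two FASTQ files")

    # Step 1: index the FASTA database (output goes to index files, not stdout).
    steps = [(["bwa", "index", "-a", index_algo, db], None)]

    # Step 2: one `aln` run per read file, each writing a .sai file.
    sai_files = []
    for i, fastq in enumerate(reads, start=1):
        sai = f"aln_sa{i}.sai"  # hypothetical intermediate file name
        steps.append((["bwa", "aln", db, fastq], sai))
        sai_files.append(sai)

    # Step 3: convert SA coordinates to SAM; `sampe` also pairs the reads.
    subcommand = "samse" if len(reads) == 1 else "sampe"
    steps.append((["bwa", subcommand, db, *sai_files, *reads], out_sam))
    return steps
```

For paired-end input the last step lists both .sai files before both FASTQ files, in the same order as the sampe line of the SYNOPSIS.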

.SH COMMANDS AND OPTIONS
.TP
.B index
bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>

Index database sequences in the FASTA format.

.B OPTIONS:
.RS
.TP 10
.B -c
Build color-space index. The input FASTA should be in nucleotide space.
.TP
.B -p STR
Prefix of the output database [same as db filename]
.TP
.B -a STR
Algorithm for constructing BWT index. Available options are:
.RS
.TP
.B is
IS linear-time algorithm for constructing suffix array. It requires
5.37N memory where N is the size of the database. IS is moderately fast,
but does not work with databases larger than 2GB. IS is the default
algorithm due to its simplicity. The current code for the IS algorithm
was reimplemented by Yuta Mori.
.TP
.B bwtsw
Algorithm implemented in BWT-SW. This method works with the whole human
genome, but it does not work with databases smaller than 10MB and it is
usually slower than IS.
.RE
.RE
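The size limits stated for the two -a algorithms can be written as a small check. This is a rough Python sketch under loud assumptions: the database size is taken in bytes, "2GB" and "10MB" are read as decimal units, and the function names are hypothetical. BWA itself does not expose such a helper.

```python
IS_MAX_SIZE = 2 * 10**9       # "2GB" read as 2e9 bytes (assumption)
BWTSW_MIN_SIZE = 10 * 10**6   # "10MB" read as 1e7 bytes (assumption)


def usable_algorithms(db_size):
    """List the -a values whose stated size limits admit a database of db_size."""
    algos = []
    if db_size <= IS_MAX_SIZE:
        algos.append("is")
    if db_size >= BWTSW_MIN_SIZE:
        algos.append("bwtsw")
    return algos


def is_memory_estimate(db_size):
    """Memory needed by the IS algorithm, using the stated 5.37N figure."""
    return 5.37 * db_size
```

Under these assumptions a whole-genome database of about 3.1e9 bases leaves only bwtsw, which matches the -a bwtsw shown in the SYNOPSIS.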

.TP
.B aln
bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i
nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc]
[-O gapOsc] [-E gapEsc] [-q trimQual] <in.db.fasta> <in.query.fq> >
<out.sai>

Find the SA coordinates of the input reads. At most
.I maxSeedDiff
differences are allowed in the first
.I seedLen
bp of a read, and at most
.I maxDiff
differences are allowed in the whole sequence.

.B OPTIONS:
.RS
.TP 10
.B -n NUM
Maximum edit distance if the value is INT, or the fraction of missing
alignments given 2% uniform base error rate if FLOAT. In the latter
case, the maximum edit distance is automatically chosen for different
read lengths. [0.04]
.TP
.B -o INT
Maximum number of gap opens [1]
.TP
.B -e INT
Maximum number of gap extensions, -1 for k-difference mode (disallowing
long gaps) [-1]
.TP
.B -d INT
Disallow a long deletion within INT bp towards the 3'-end [16]
.TP
.B -i INT
Disallow an indel within INT bp towards the ends [5]
.TP
.B -l INT
Take the first INT bp as the seed. If INT is larger than the query
length, seeding will be disabled. For long reads, this option
typically ranges from 25 to 35 for `-k 2'. [inf]
.TP
.B -k INT
Maximum edit distance in the seed [2]
.TP
.B -t INT
Number of threads (multi-threading mode) [1]
.TP
.B -M INT
Mismatch penalty. BWA will not search for suboptimal hits with a score
lower than (bestScore-misMsc). [3]
.TP
.B -O INT
Gap open penalty [11]
.TP
.B -E INT
Gap extension penalty [4]
.TP
.B -R INT
Proceed with suboptimal alignments if there are no more than INT equally
best hits. This option only affects paired-end mapping. Increasing this
threshold helps to improve the pairing accuracy at the cost of speed,
especially for short reads (~32bp).
.TP
.B -c
Reverse query but not complement it, which is required for alignment in
the color space.
.TP
.B -N
Disable iterative search. All hits with no more than
.I maxDiff
differences will be found. This mode is much slower than the default.
.TP
.B -q INT
Parameter for read trimming. BWA trims a read down to
argmax_x{\\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original
read length. [0]
.RE
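The trimming rule given for -q can be read literally as follows. This Python sketch implements only the formula as written above (argmax over x of the summed INT-q_i from position x+1 to l, applied when q_l<INT); it is not BWA's source code, and the function name trimmed_length and the tie-breaking choice (keep the longer read) are assumptions.

```python
def trimmed_length(quals, threshold):
    """Return the read length after trimming, per the -q argmax formula.

    quals     -- Phred base qualities q_1..q_l in read order
    threshold -- the INT passed to -q
    """
    l = len(quals)
    # Trimming applies only when the last base is below the threshold.
    if l == 0 or quals[-1] >= threshold:
        return l

    best_x, best_sum, running = l, 0, 0  # x = l gives an empty sum of 0
    for x in range(l - 1, -1, -1):
        # quals[x] is q_{x+1} in the 1-based notation of the formula,
        # so `running` is now the sum over i = x+1 .. l of (INT - q_i).
        running += threshold - quals[x]
        if running > best_sum:  # strict: on ties keep the longer read
            best_sum, best_x = running, x
    return best_x
```

For qualities 30, 30, 30, 10, 5 and a threshold of 20, the partial sums from the 3' end are 15, 25, 15, 5, -5, so the maximum is reached at x=3 and the read is cut to its first three bases. With the default threshold of 0 no read is trimmed.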

.TP
.B samse
bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>

Generate alignments in the SAM format given single-end reads. Repetitive
hits will be randomly chosen.

.B OPTIONS:
.RS
.TP 10
.B -n INT
Maximum number of alignments to output in the XA tag for reads paired
properly. If a read has more than INT hits, the XA tag will not be
written. [3]
.RE

.TP
.B sampe
bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis]
[-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>

Generate alignments in the SAM format given paired-end reads. Repetitive
read pairs will be placed randomly.

.B OPTIONS:
.RS
.TP 8
.B -a INT
Maximum insert size for a read pair to be considered as mapped
properly. Since 0.4.5, this option is only used when there are not
enough good alignments to infer the distribution of insert sizes. [500]
.TP
.B -o INT
Maximum occurrences of a read for pairing. A read with more occurrences
will be treated as a single-end read. Reducing this parameter speeds up
pairing. [100000]
.TP
.B -P
Load the entire FM-index into memory to reduce disk operations
(base-space reads only). With this option, at least 1.25N bytes of
memory are required, where N is the length of the genome.
.TP
.B -n INT
Maximum number of alignments to output in the XA tag for reads paired
properly. If a read has more than INT hits, the XA tag will not be
written. [3]
.TP
.B -N INT
Maximum number of alignments to output in the XA tag for discordant
read pairs (excluding singletons). If a read has more than INT hits, the
XA tag will not be written. [10]
.RE

.TP
.B bwasw
bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t
nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N
nHspRev] [-c thresCoef] <in.db.fasta> <in.fq>

Align query sequences in the <in.fq> file.

.B OPTIONS:
.RS
.TP 10
.B -a INT
Score of a match [1]
.TP
.B -b INT
Mismatch penalty [3]
.TP
.B -q INT
Gap open penalty [5]
.TP
.B -r INT
Gap extension penalty. The penalty for a contiguous gap of size k is
q+k*r. [2]
.TP
.B -t INT
Number of threads in the multi-threading mode [1]
.TP
.B -w INT
Band width in the banded alignment [33]
.TP
.B -T INT
Minimum score threshold divided by a [37]
.TP
.B -c FLOAT
Coefficient for threshold adjustment according to query length. Given an
l-long query, the threshold for a hit to be retained is
a*max{T,c*log(l)}. [5.5]
.TP
.B -z INT
Z-best heuristics. Higher -z increases accuracy at the cost of speed. [1]
.TP
.B -s INT
Maximum SA interval size for initiating a seed. Higher -s increases
accuracy at the cost of speed. [3]
.TP
.B -N INT
Minimum number of seeds supporting the resultant alignment to skip
reverse alignment. [5]
.RE
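The two formulas in the bwasw options (the gap penalty q+k*r under -r and the hit threshold a*max{T,c*log(l)} under -c) can be evaluated with the documented defaults. This Python sketch assumes that log is the natural logarithm, which the text above does not state; the function names are hypothetical.

```python
import math


def gap_penalty(k, q=5, r=2):
    """Penalty for a contiguous gap of size k: q + k*r (defaults from -q, -r)."""
    return q + k * r


def hit_threshold(l, a=1, T=37, c=5.5):
    """Score a hit must reach to be retained for an l-long query.

    Implements a*max{T, c*log(l)} with defaults from -a, -T and -c.
    The natural logarithm is an assumption.
    """
    return a * max(T, c * math.log(l))
```

With the defaults, a gap of three bases costs 5+3*2=11. For a 100 bp query c*log(l) is about 25.3, so the threshold stays at T=37; under the natural-log assumption the length-dependent term takes over only for queries longer than roughly 830 bp.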

.SH SAM ALIGNMENT FORMAT
.PP
The output of the
.B `aln'
command is binary and designed for BWA use only. BWA outputs the final
alignment in the SAM (Sequence Alignment/Map) format. Each line consists
of:

.TS
center box;
cb | cb | cb
n | l | l .
Col	Field	Description
_
1	QNAME	Query (pair) NAME
2	FLAG	bitwise FLAG
3	RNAME	Reference sequence NAME
4	POS	1-based leftmost POSition/coordinate of clipped sequence
5	MAPQ	MAPping Quality (Phred-scaled)
6	CIGAR	extended CIGAR string
7	MRNM	Mate Reference sequence NaMe (`=' if same as RNAME)
8	MPOS	1-based Mate POSition
9	ISIZE	Inferred insert SIZE
10	SEQ	query SEQuence on the same strand as the reference
11	QUAL	query QUALity (ASCII-33 gives the Phred base quality)
12	OPT	variable OPTional fields in the format TAG:VTYPE:VALUE
.TE
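The column layout in the table above maps directly onto a tab-split of one alignment line. The following Python sketch is illustrative only: the function name parse_sam_line and the returned dictionary layout are hypothetical, and header lines (which start with `@') are not handled.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one SAM alignment line into the 11 fixed columns plus OPT."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")

    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value

    # Column 12 onwards: optional fields in the form TAG:VTYPE:VALUE.
    record["OPT"] = {}
    for field in cols[len(SAM_FIELDS):]:
        tag, vtype, value = field.split(":", 2)
        record["OPT"][tag] = int(value) if vtype == "i" else value

    # QUAL: ASCII code minus 33 gives the Phred base quality.
    qual = record["QUAL"]
    record["PHRED"] = [] if qual == "*" else [ord(ch) - 33 for ch in qual]
    return record
```

Splitting each optional field at most twice keeps any colon inside the VALUE part intact.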

.PP
Each bit in the FLAG field is defined as:

.TS
center box;
cb | cb | cb
c | l | l .
Chr	Flag	Description
_
p	0x0001	the read is paired in sequencing
P	0x0002	the read is mapped in a proper pair
u	0x0004	the query sequence itself is unmapped
U	0x0008	the mate is unmapped
r	0x0010	strand of the query (1 for reverse)
R	0x0020	strand of the mate
1	0x0040	the read is the first read in a pair
2	0x0080	the read is the second read in a pair
s	0x0100	the alignment is not primary
f	0x0200	QC failure
d	0x0400	optical or PCR duplicate
.TE
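Because FLAG is a bitwise field, a single integer combines several rows of the table above. A short Python sketch of decoding it into the one-character codes from the Chr column follows; the function name decode_flag is hypothetical.

```python
# (bit mask, one-character code) pairs, copied from the FLAG table.
FLAG_BITS = [
    (0x0001, "p"), (0x0002, "P"), (0x0004, "u"), (0x0008, "U"),
    (0x0010, "r"), (0x0020, "R"), (0x0040, "1"), (0x0080, "2"),
    (0x0100, "s"), (0x0200, "f"), (0x0400, "d"),
]


def decode_flag(flag):
    """Return the codes of all bits set in a SAM FLAG value, low bit first."""
    return "".join(code for mask, code in FLAG_BITS if flag & mask)
```

For example, 99 = 0x1 + 0x2 + 0x20 + 0x40 decodes to "pPR1": a paired read, mapped in a proper pair, with the mate on the reverse strand, and first in the pair.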

.PP
Please check <http://samtools.sourceforge.net> for the format
specification and the tools for post-processing the alignment.

BWA generates the following optional fields. Tags starting with `X' are
specific to BWA.

.TS
center box;
cb | cb
cB | l .
Tag	Meaning
_
NM	Edit distance
MD	Mismatching positions/bases
AS	Alignment score
_
X0	Number of best hits
X1	Number of suboptimal hits found by BWA
XN	Number of ambiguous bases in the reference
XM	Number of mismatches in the alignment
XO	Number of gap opens
XG	Number of gap extensions
XT	Type: Unique/Repeat/N/Mate-sw
XA	Alternative hits; format: (chr,pos,CIGAR,NM;)*
_
XS	Suboptimal alignment score
XF	Support from forward/reverse alignment
XE	Number of supporting seeds
.TE

.PP
Note that XO and XG are generated by BWT search while the CIGAR string
is generated by Smith-Waterman alignment. These two tags may be
inconsistent with the CIGAR string. This is not a bug.
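The XA value follows the repeated pattern (chr,pos,CIGAR,NM;)* given in the tag table. A minimal Python sketch of splitting it into alternative hits is shown here; the function name parse_xa is hypothetical, and pos is kept as a signed integer exactly as written in the tag.

```python
def parse_xa(value):
    """Split an XA tag value into (chr, pos, cigar, nm) tuples."""
    hits = []
    for entry in value.split(";"):
        if not entry:  # the trailing ';' leaves one empty piece
            continue
        # Split from the right so the chr part is taken whole.
        chrom, pos, cigar, nm = entry.rsplit(",", 3)
        hits.append((chrom, int(pos), cigar, int(nm)))
    return hits
```

An empty value yields an empty list, which matches the `*' (zero or more) in the format.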
347
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
348 .SH NOTES ON SHORT-READ ALIGNMENT
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
349 .SS Alignment Accuracy
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
350 .PP
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
351 When seeding is disabled, BWA guarantees to find an alignment
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
352 containing maximum
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
353 .I maxDiff
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
354 differences including
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
355 .I maxGapO
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
356 gap opens which do not occur within
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
357 .I nIndelEnd
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
358 bp towards either end of the query. Longer gaps may be found if
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
359 .I maxGapE
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
360 is positive, but it is not guaranteed to find all hits. When seeding is
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
361 enabled, BWA further requires that the first
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
362 .I seedLen
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
363 subsequence contains no more than
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
364 .I maxSeedDiff
acc2ca1a3ba4 Uploaded
siyuan
parents:
diff changeset
365 differences.
.PP
When gapped alignment is disabled, BWA is expected to generate the same
alignment as Eland, the Illumina alignment program. However, because BWA
changes `N' in the database sequence to random nucleotides, hits to these
random sequences will also be counted. As a consequence, BWA may mark a
unique hit as a repeat if the random sequences happen to be identical
to sequences that should be unique in the database. This random
behaviour will be avoided in future releases.
.PP
By default, if the best hit is not highly repetitive (controlled by -R), BWA
also finds all hits that contain one more mismatch; otherwise, BWA finds
only the equally best hits. Base quality is NOT considered in evaluating
hits. In paired-end alignment, BWA pairs all hits it found. It further
performs Smith-Waterman alignment for unmapped reads whose mates are mapped,
to rescue those reads, and for high-quality anomalous pairs, to fix
potential alignment errors.

.SS Estimating Insert Size Distribution
.PP
BWA estimates the insert size distribution per 256*1024 read pairs. It
first collects pairs of reads with both ends mapped with a single-end
quality of 20 or higher and then calculates the median (Q2) and the lower
and upper quartiles (Q1 and Q3). It estimates the mean and the variance of the
insert size distribution from pairs whose insert sizes are within the
interval [Q1-2(Q3-Q1), Q3+2(Q3-Q1)]. The maximum distance x for a pair
to be considered properly paired (SAM flag 0x2) is calculated by solving the
equation Phi((x-mu)/sigma)=x/L*p0, where mu is the mean, sigma is the
standard deviation of the insert size distribution, L is the length of the
genome, p0 is the prior probability of an anomalous pair and Phi() is the
standard normal cumulative distribution function. For mapping Illumina
short-insert reads to the human genome, x is about 6-7 sigma away from the
mean. Quartiles, mean, variance and x will be printed to the standard
error output.
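.PP
The estimation steps above can be sketched as follows. This is an
illustration under stated assumptions, not the code BWA runs: quartiles are
taken with Python's statistics.quantiles, the equation is solved by
bisection, and Phi is read as the upper-tail probability of the standard
normal distribution, since x is described as a maximum lying several sigma
above the mean. The function name, the sample insert sizes, the genome
length and the value of p0 are all hypothetical.
.nf
```python
import math
import statistics


def estimate_insert_size(sizes, genome_len, p0):
    """Return (q1, q2, q3, mean, sd, x) for a list of observed insert sizes."""
    q1, q2, q3 = statistics.quantiles(sizes, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 2 * iqr, q3 + 2 * iqr

    # Estimate mean and spread only from pairs inside the quartile fence.
    kept = [s for s in sizes if low <= s <= high]
    mean = statistics.fmean(kept)
    sd = statistics.pstdev(kept)

    def gap(x):
        # Upper-tail normal probability minus x / L * p0 (assumed reading).
        tail = 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))
        return tail - x / genome_len * p0

    # gap() decreases with x, is positive at the mean and negative far out,
    # so bisection between those two points converges on the root.
    lo, hi = mean, mean + 20.0 * sd
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return q1, q2, q3, mean, sd, hi


# Hypothetical data: a tight library around 200 bp plus one anomalous pair.
sizes = list(range(180, 221)) + [5000]
q1, q2, q3, mean, sd, x = estimate_insert_size(sizes, genome_len=3e9, p0=1e-5)
print(q1, q2, q3, mean, sd, x, (x - mean) / sd)
```
.fi
With these inputs the anomalous 5000 bp pair falls outside the quartile
fence and does not affect the mean or the spread.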

.SS Memory Requirement
.PP
With the bwtsw algorithm, 2.5GB of memory is required for indexing the
complete human genome sequences. For short reads, the
.B `aln'
command uses ~2.3GB of memory and the
.B `sampe'
command uses ~3.5GB.

.SS Speed
.PP
Indexing the human genome sequences takes 3 hours with the bwtsw
algorithm. Indexing smaller genomes with the IS or divsufsort algorithm is
several times faster, but requires more memory.
.PP
Speed of alignment is largely determined by the error rate of the query
sequences (r). Firstly, BWA runs much faster for near-perfect hits than
for hits with many differences, and it stops searching for a hit with
l+2 differences if an l-difference hit is found. This means BWA will be
very slow if r is high, because in this case BWA has to visit hits with
many differences and looking for these hits is expensive. Secondly, the
underlying alignment algorithm makes the speed sensitive to [k log(N)/m],
where k is the maximum number of allowed differences, N the size of the
database and m the length of a query. In practice, we choose k w.r.t. r and
therefore r is the leading factor. I would not recommend using BWA on data
with r>0.02.
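.PP
To get a feel for the [k log(N)/m] term, the short calculation below
evaluates it for two database sizes. The genome sizes are round, illustrative
figures, the function name is hypothetical, and the base of the logarithm is
an assumption (the text does not state it); the ratio between the two
databases does not depend on the base.
.nf
```python
import math


def search_cost_factor(k, n, m):
    """Return k * log(N) / m, using the natural logarithm (assumed base)."""
    return k * math.log(n) / m


# Same k and read length, databases of very different size.
bacterial = search_cost_factor(k=2, n=5e6, m=32)
human = search_cost_factor(k=2, n=3e9, m=32)
ratio = human / bacterial
print(round(bacterial, 3), round(human, 3), round(ratio, 2))
```
.fi
A database several hundred times larger raises the factor by well under
a factor of two, whereas the factor grows linearly with k.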
.PP
Pairing is slower for shorter reads. This is mainly because shorter
reads have more spurious hits and converting SA coordinates to
chromosomal coordinates is very costly.
.PP
In a practical experiment, BWA is able to map 2 million 32bp reads to a
bacterial genome in several minutes, map the same number of reads to the
human X chromosome in 8-15 minutes and to the human genome in 15-25
minutes. This result implies that the speed of BWA is insensitive to the
size of the database and therefore BWA is more efficient when the database
is sufficiently large. On smaller genomes, hash-based algorithms are
usually much faster.

.SH NOTES ON LONG-READ ALIGNMENT
.PP
The
.B `bwasw'
command is designed for long-read alignment. The underlying algorithm,
BWA-SW, is similar to BWT-SW, but is not guaranteed to find all local hits
because of its heuristic acceleration. It tends to be faster and more
accurate if the resultant alignment is supported by more seeds, and therefore
BWA-SW usually performs better on long queries than on short ones.

On 350-1000bp reads, BWA-SW is several to tens of times faster than the
existing programs. Its accuracy is comparable to that of SSAHA2 and higher
than that of BLAT. Like BLAT, BWA-SW also finds chimeras, which may pose a
challenge to SSAHA2. On 10-100kbp queries where chimera detection is
important, BWA-SW is over 10X faster than BLAT while being more
sensitive.

BWA-SW can also be used to align ~100bp reads, but it is slower than
the short-read algorithm. Its sensitivity and accuracy are lower than
those of SSAHA2, especially when the sequencing error rate is above 2%.
This is the trade-off for the 30X speedup in comparison to SSAHA2's -454 mode.

.SH SEE ALSO
BWA website <http://bio-bwa.sourceforge.net>, Samtools website
<http://samtools.sourceforge.net>

.SH AUTHOR
Heng Li at the Sanger Institute wrote the key source code and
integrated the following code for BWT construction: bwtsw
<http://i.cs.hku.hk/~ckwong3/bwtsw/>, implemented by Chi-Kwong Wong at
the University of Hong Kong, and IS
<http://yuta.256.googlepages.com/sais>, originally proposed by Nong Ge
<http://www.cs.sysu.edu.cn/nong/> at the Sun Yat-Sen University and
implemented by Yuta Mori.

.SH LICENSE AND CITATION
.PP
The full BWA package is distributed under GPLv3 as it uses source code
from BWT-SW, which is covered by GPL. Sorting, hash table, BWT and IS
libraries are distributed under the MIT license.
.PP
If you use the short-read alignment component, please cite the following
paper:
.PP
Li H. and Durbin R. (2009) Fast and accurate short read alignment with
Burrows-Wheeler transform. Bioinformatics, 25, 1754-60. [PMID: 19451168]
.PP
If you use the long-read component (BWA-SW), please cite:
.PP
Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505]

.SH HISTORY
BWA is largely influenced by BWT-SW. It uses source code from BWT-SW
and mimics its binary file formats; BWA-SW resembles BWT-SW in several
ways. The initial idea about BWT-based alignment also came from the
group who developed BWT-SW. At the same time, BWA is different enough
from BWT-SW. The short-read alignment algorithm no longer bears any
similarity to the Smith-Waterman algorithm. While BWA-SW learns from BWT-SW,
it introduces heuristics that can hardly be applied to the original
algorithm. In all, BWA is not guaranteed to find all local hits as
BWT-SW is designed to do, but it is much faster than BWT-SW on both
short and long query sequences.

I started to write the first piece of code on 24 May 2008 and got the
initial stable version on 02 June 2008. During this period, I learned
that Professor Tak-Wah Lam, the first author of the BWT-SW paper,
was collaborating with Beijing Genomics Institute on SOAP2, the successor
to SOAP (Short Oligonucleotide Analysis Package). SOAP2 came out in
November 2008. According to the SourceForge download page, the third
BWT-based short read aligner, bowtie, was first released in August
2008. At the time of writing this manual, at least three more BWT-based
short-read aligners are being implemented.

The BWA-SW algorithm is a new component of BWA. It was conceived in
November 2008 and implemented ten months later.