samtools(1)                  Bioinformatics tools                  samtools(1)



NAME
       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools tview aln.sorted.bam ref.fasta


DESCRIPTION
       Samtools is a set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format, sorts, merges and indexes alignments, and quickly retrieves
       reads from any region.

       Samtools is designed to work on a stream. It regards an input file `-'
       as the standard input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always writes warning and error messages to the standard error
       output (stderr).

       Samtools can also open a BAM (not SAM) file on a remote FTP or HTTP
       server if the BAM file name starts with `ftp://' or `http://'. Samtools
       checks the current working directory for the index file and downloads
       the index if it is absent. Samtools does not retrieve the entire
       alignment file unless it is asked to do so.


COMMANDS AND OPTIONS
       import    samtools import <in.ref_list> <in.sam> <out.bam>

                 Since 0.1.4, this command is an alias of:

                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>


       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

                 Sort alignments by leftmost coordinates. File
                 <out.prefix>.bam will be created. This command may also
                 create temporary files <out.prefix>.%d.bam when the whole
                 alignment cannot fit into memory (controlled by option -m).

                 OPTIONS:

                 -n      Sort by read names rather than by chromosomal
                         coordinates

                 -m INT  Approximate maximum memory required. [500000000]


       merge     samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments. The header reference lists
                 of all the input BAM files, and the @SQ headers of inh.sam,
                 if any, must all refer to the same set of reference
                 sequences. The header reference list and (unless overridden
                 by -h) `@' headers of in1.bam will be copied to out.bam, and
                 the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use the lines of FILE as `@' headers to be copied to
                         out.bam, replacing any header lines that would
                         otherwise be copied from in1.bam. (FILE is actually
                         in SAM format, though any alignment records it may
                         contain are ignored.)

                 -n      The input alignments are sorted by read names rather
                         than by chromosomal coordinates
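Merging inputs that are already coordinate-sorted amounts to a k-way merge. The following is a minimal Python sketch of that idea, not samtools' own implementation. The record layout `(ref_index, pos, name)` and the function name `merge_sorted` are hypothetical, chosen only for illustration.

```python
import heapq

def merge_sorted(*streams):
    """Merge already-sorted record streams into one sorted list.

    Each record is a hypothetical (ref_index, pos, name) tuple, where
    ref_index is the position of the reference in the shared header
    reference list. Comparing ref_index across files only makes sense
    if every input refers to the same set of reference sequences.
    """
    return list(heapq.merge(*streams, key=lambda rec: (rec[0], rec[1])))

# Two hypothetical coordinate-sorted inputs.
in1 = [(0, 100, "r1"), (0, 500, "r4"), (1, 20, "r5")]
in2 = [(0, 250, "r2"), (1, 10, "r3")]
merged = merge_sorted(in1, in2)
print([rec[2] for rec in merged])  # ['r1', 'r2', 'r4', 'r3', 'r5']
```

The sketch never re-sorts: it relies on each input being sorted, which is why merge requires sorted alignments.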


       index     samtools index <aln.bam>

                 Index sorted alignment for fast random access. Index file
                 <aln.bam>.bai will be created.


       view      samtools view [-bhuHS] [-t in.refList] [-o output] [-f
                 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r
                 readGroup] <in.bam>|<in.sam> [region1 [...]]

                 Extract and print all alignments, or a subset of them, in
                 SAM or BAM format. If no region is specified, all the
                 alignments will be printed; otherwise only alignments
                 overlapping the specified regions will be output. An
                 alignment may be output multiple times if it overlaps
                 several regions. A region can be given, for example, in one
                 of the following formats: `chr2', `chr2:1000000' or
                 `chr2:1,000,000-2,000,000'. Coordinates are 1-based.
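The three region formats can be handled by one small parser. The Python sketch below is illustrative only: `parse_region` is a hypothetical helper, and returning `None` for an omitted start or end is an assumption of the sketch, since the text does not say how samtools itself treats a missing end coordinate.

```python
import re

# name, optionally followed by :start, optionally followed by -end;
# coordinates may contain thousands separators.
_REGION = re.compile(r"^([^:]+)(?::([0-9,]+)(?:-([0-9,]+))?)?$")

def parse_region(region):
    """Return (name, start, end); start and end are 1-based or None."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError(f"unrecognised region: {region!r}")
    name, start, end = m.groups()

    def to_int(text):
        # Strip the commas used as thousands separators.
        return int(text.replace(",", "")) if text else None

    return name, to_int(start), to_int(end)

print(parse_region("chr2"))                      # ('chr2', None, None)
print(parse_region("chr2:1000000"))              # ('chr2', 1000000, None)
print(parse_region("chr2:1,000,000-2,000,000"))  # ('chr2', 1000000, 2000000)
```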

                 OPTIONS:

                 -b      Output in the BAM format.

                 -u      Output uncompressed BAM. This option saves time spent
                         on compression/decompression and is thus preferred
                         when the output is piped to another samtools command.

                 -h      Include the header in the output.

                 -H      Output the header only.

                 -S      Input is in SAM. If @SQ header lines are absent, the
                         `-t' option is required.

                 -t FILE This file is TAB-delimited. Each line must contain
                         the reference name and the length of the reference,
                         one line for each distinct reference; additional
                         fields are ignored. This file also defines the order
                         of the reference sequences in sorting. If you run
                         `samtools faidx <ref.fa>', the resultant index file
                         <ref.fa>.fai can be used as this <in.ref_list> file.

                 -o FILE Output file [stdout]

                 -f INT  Only output alignments with all bits in INT present
                         in the FLAG field. INT can be in hex in the format of
                         /^0x[0-9A-F]+/ [0]

                 -F INT  Skip alignments with any bits in INT present in the
                         FLAG field [0]

                 -q INT  Skip alignments with MAPQ smaller than INT [0]

                 -l STR  Only output reads in library STR [null]

                 -r STR  Only output reads in read group STR [null]
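The interaction of -f, -F and -q can be written as a single predicate. This is a sketch of the filtering logic as described above, not code taken from samtools. The function and parameter names are hypothetical, and treating -F as "skip if any listed bit is set" is the sketch's reading of that option.

```python
def keep_alignment(flag, mapq, req_flag=0, skip_flag=0, min_mapq=0):
    """Return True if an alignment passes the -f, -F and -q filters."""
    if (flag & req_flag) != req_flag:
        return False          # -f: some required bit is missing
    if flag & skip_flag:
        return False          # -F: at least one unwanted bit is set
    return mapq >= min_mapq   # -q: MAPQ smaller than the threshold is skipped

# Flag values may be given in hex, e.g. 0x63.
flag = int("0x63", 16)
print(keep_alignment(flag, 30, req_flag=0x2, skip_flag=0x4))  # True
print(keep_alignment(0x4, 0, skip_flag=0x4))                  # False
print(keep_alignment(flag, 10, min_mapq=20))                  # False
```

With the defaults of 0 for all three options, every alignment passes.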


       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence in the FASTA format or extract
                 subsequence from indexed reference sequence. If no region is
                 specified, faidx will index the file and create
                 <ref.fasta>.fai on the disk. If regions are specified, the
                 subsequences will be retrieved and printed to stdout in the
                 FASTA format. The input file can be compressed in the RAZF
                 format.
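Because the .fai index can double as the TAB-delimited reference list accepted by `view -t`, a reader of that list needs only the first two fields of each line. A minimal Python sketch follows; `parse_ref_list` is a hypothetical helper and the sample lines are made-up values, not output from a real reference.

```python
def parse_ref_list(lines):
    """Return [(name, length), ...] in file order.

    Only the first two TAB-delimited fields are used; additional
    fields, such as the extra columns of a .fai file, are ignored.
    The order of the result is the order of the references.
    """
    refs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines (an assumption of this sketch)
        fields = line.split("\t")
        refs.append((fields[0], int(fields[1])))
    return refs

# Hypothetical .fai-style lines: name, length, then extra columns.
sample = ["chr1\t1000\t6\t60\t61\n", "chr2\t500\t1029\t60\t61\n"]
print(parse_ref_list(sample))  # [('chr1', 1000), ('chr2', 500)]
```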


       pileup    samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l
                 in.site_list] [-iscgS2] [-T theta] [-N nHap] [-r
                 pairDiffRate] <in.bam>|<in.sam>

                 Print the alignment in the pileup format. In the pileup
                 format, each line represents a genomic position and consists
                 of chromosome name, coordinate, reference base, read bases,
                 read qualities and alignment mapping qualities. Information
                 on match, mismatch, indel, strand, mapping quality and the
                 start and end of a read is all encoded in the read base
                 column. In this column, a dot stands for a match to the
                 reference base on the forward strand, a comma for a match on
                 the reverse strand, `ACGTN' for a mismatch on the forward
                 strand and `acgtn' for a mismatch on the reverse strand. A
                 pattern `\+[0-9]+[ACGTNacgtn]+' indicates an insertion
                 between this reference position and the next reference
                 position. The length of the insertion is given by the
                 integer in the pattern, followed by the inserted sequence.
                 Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' represents a
                 deletion from the reference. The deleted bases will be
                 presented as `*' in the following lines. Also in the read
                 base column, a symbol `^' marks the start of a read segment,
                 which is a contiguous subsequence of the read separated by
                 `N/S/H' CIGAR operations. The ASCII code of the character
                 following `^' minus 33 gives the mapping quality. A symbol
                 `$' marks the end of a read segment.
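The read base column described above can be split into tokens with a short scanner. The Python sketch below is one possible reading of that description, not samtools' parser. The function name, the token labels and the sample string are all hypothetical. It reads exactly as many indel bases as the integer announces, so bases that follow an indel are not swallowed.

```python
def tokenize_read_bases(col):
    """Split a pileup read base column into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(col):
        ch = col[i]
        if ch == "^":
            # Start of a read segment; the next character encodes the
            # mapping quality as its ASCII code minus 33.
            tokens.append(("start", ord(col[i + 1]) - 33))
            i += 2
        elif ch == "$":
            tokens.append(("end", None))
            i += 1
        elif ch in "+-":
            # Indel: sign, decimal length, then that many bases.
            j = i + 1
            while j < len(col) and col[j].isdigit():
                j += 1
            length = int(col[i + 1:j])
            kind = "insertion" if ch == "+" else "deletion"
            tokens.append((kind, col[j:j + length]))
            i = j + length
        elif ch == "*":
            tokens.append(("deleted_base", None))
            i += 1
        elif ch == ".":
            tokens.append(("match_forward", None))
            i += 1
        elif ch == ",":
            tokens.append(("match_reverse", None))
            i += 1
        elif ch in "ACGTN":
            tokens.append(("mismatch_forward", ch))
            i += 1
        elif ch in "acgtn":
            tokens.append(("mismatch_reverse", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens

# A made-up column exercising every token kind.
tokens = tokenize_read_bases("^F.,A$-2ct*g+1T")
print(tokens[0])   # ('start', 37)
print(tokens[5])   # ('deletion', 'ct')
print(tokens[-1])  # ('insertion', 'T')
```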

                 If option -c is applied, the consensus base, consensus
                 quality, SNP quality and RMS mapping quality of the reads
                 covering the site will be inserted between the `reference
                 base' and the `read bases' columns. An indel occupies an
                 additional line. Each indel line consists of chromosome
                 name, coordinate, a star, the genotype, consensus quality,
                 SNP quality, RMS mapping quality, # covering reads, the
                 first allele, the second allele, # reads supporting the
                 first allele, # reads supporting the second allele and
                 # reads containing indels different from the top two
                 alleles.

                 OPTIONS:

                 -s        Print the mapping quality as the last column. This
                           option makes the output easier to parse, although
                           this format is not space efficient.

                 -S        The input file is in SAM.

                 -i        Only output pileup lines containing indels.

                 -f FILE   The reference sequence in the FASTA format. Index
                           file FILE.fai will be created if absent.

                 -M INT    Cap mapping quality at INT [60]

                 -t FILE   List of reference names and sequence lengths, in
                           the format described for the `-t' option of the
                           view command (which import is an alias of). If
                           this option is present, samtools assumes the input
                           alignment is in SAM format; otherwise it assumes
                           BAM format.

                 -l FILE   List of sites at which pileup is output. This file
                           is space delimited. The first two columns are
                           required to be chromosome and 1-based coordinate.
                           Additional columns are ignored. Using option -s
                           together with -l is recommended, because in the
                           default format the mapping quality may not be
                           known.

                 -c        Call the consensus sequence using the MAQ
                           consensus model. Options -T, -N, -I and -r are
                           only effective when -c or -g is in use.

                 -g        Generate genotype likelihood in the binary GLFv3
                           format. This option suppresses -c, -i and -s.

                 -T FLOAT  The theta parameter (error dependency coefficient)
                           in the MAQ consensus calling model [0.85]

                 -N INT    Number of haplotypes in the sample (>=2) [2]

                 -r FLOAT  Expected fraction of differences between a pair of
                           haplotypes [0.001]

                 -I INT    Phred probability of an indel in sequencing/prep.
                           [40]
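Two of the options above are easier to read with the arithmetic spelled out. A Phred-scaled value Q corresponds to a probability of 10^(-Q/10), so the default of 40 for -I means a probability of 0.0001, and -M replaces any mapping quality above INT with INT. The Python helpers below are a hypothetical sketch of that arithmetic; the function names are not part of samtools.

```python
import math

def phred_to_prob(q):
    """Convert a Phred-scaled value to a probability."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert a probability to a Phred-scaled value."""
    return -10 * math.log10(p)

def cap_mapq(mapq, cap=60):
    """Cap a mapping quality, as the -M option is described to do."""
    return min(mapq, cap)

print(phred_to_prob(40))  # about 0.0001, the default for -I
print(cap_mapq(99))       # 60
print(cap_mapq(37))       # 37
```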



       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In the
                 viewer, press `?' for help and press `g' to view the
                 alignment starting from a region given in a format like
                 `chr10:10,000,000'.


       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags from a
                 name-sorted alignment.


       rmdup     samtools rmdup <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs have
                 identical external coordinates, only retain the pair with
                 highest mapping quality. This command ONLY works with FR
                 orientation and requires ISIZE to be set correctly.
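The duplicate rule can be sketched as "group pairs by external coordinates, keep the best of each group". The Python below illustrates that rule only and is not how samtools implements rmdup. The dictionary layout, the single `mapq` value per pair and the function name are assumptions made for the example.

```python
def remove_duplicate_pairs(pairs):
    """Keep, for each set of external coordinates, the pair with the
    highest mapping quality.

    Each pair is a hypothetical dict with keys ref, start, end and mapq,
    where start and end are the outermost coordinates of the pair.
    """
    best = {}
    for pair in pairs:
        key = (pair["ref"], pair["start"], pair["end"])
        if key not in best or pair["mapq"] > best[key]["mapq"]:
            best[key] = pair
    return list(best.values())

pairs = [
    {"name": "p1", "ref": "chr1", "start": 100, "end": 350, "mapq": 20},
    {"name": "p2", "ref": "chr1", "start": 100, "end": 350, "mapq": 60},
    {"name": "p3", "ref": "chr1", "start": 120, "end": 350, "mapq": 30},
]
kept = remove_duplicate_pairs(pairs)
print([p["name"] for p in kept])  # ['p2', 'p3']
```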



       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>

                 Remove potential duplicates for single-ended reads. This
                 command will treat all reads as single-ended even if they
                 are in fact paired.


       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>

                 Generate the MD tag. If the MD tag is already present, this
                 command will give a warning if the MD tag generated is
                 different from the existing tag.

                 OPTIONS:

                 -e      Convert a read base to = if it is identical to the
                         aligned reference base. The indel caller does not
                         support = bases at the moment.
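The effect of -e on a read can be pictured with a toy example. The Python sketch below assumes an ungapped alignment, so that the read and the aligned reference stretch have equal length, and compares bases case-insensitively. Both are simplifying assumptions of the sketch, and `mask_matches` is a hypothetical name.

```python
def mask_matches(read, ref):
    """Replace each read base that equals the aligned reference base
    with '='; mismatching bases are left unchanged.

    Assumes an ungapped alignment: read and ref must be equally long.
    """
    if len(read) != len(ref):
        raise ValueError("read and reference stretch differ in length")
    return "".join(
        "=" if r.upper() == f.upper() else r
        for r, f in zip(read, ref)
    )

print(mask_matches("ACGTA", "ACCTA"))  # ==G==
```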



SAM FORMAT
       SAM is TAB-delimited. Apart from the header lines, which start with
       the `@' symbol, each alignment line consists of:


       +----+-------+----------------------------------------------------------+
       |Col | Field | Description                                              |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIGAR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSition                                    |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+
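The column layout above maps directly onto a small record type. The Python sketch below splits one alignment line on TABs and decodes QUAL by subtracting 33 from each character's ASCII code. The names `Alignment`, `parse_sam_line` and `base_qualities` are hypothetical, and the sample line is made up for illustration. Header lines starting with `@' are expected to be filtered out before calling the parser.

```python
from collections import namedtuple

Alignment = namedtuple(
    "Alignment",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual opt",
)

def parse_sam_line(line):
    """Parse one TAB-delimited alignment line (not a header line)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    return Alignment(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
        f[6], int(f[7]), int(f[8]), f[9], f[10],
        f[11:],  # optional TAG:VTYPE:VALUE fields, left unparsed
    )

def base_qualities(qual):
    """Decode QUAL: ASCII code minus 33 gives the Phred base quality."""
    return [ord(c) - 33 for c in qual]

line = "read1\t99\tchr2\t1000\t60\t4M\t=\t1200\t204\tACGT\tIIII\tNM:i:0"
aln = parse_sam_line(line)
print(aln.rname, aln.pos, aln.opt)  # chr2 1000 ['NM:i:0']
print(base_qualities(aln.qual))     # [40, 40, 40, 40]
```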

       Each bit in the FLAG field is defined as:


       +-------+--------------------------------------------------+
       | Flag  | Description                                      |
       +-------+--------------------------------------------------+
       |0x0001 | the read is paired in sequencing                 |
       |0x0002 | the read is mapped in a proper pair              |
       |0x0004 | the query sequence itself is unmapped            |
       |0x0008 | the mate is unmapped                             |
       |0x0010 | strand of the query (1 for reverse)              |
       |0x0020 | strand of the mate                               |
       |0x0040 | the read is the first read in a pair             |
       |0x0080 | the read is the second read in a pair            |
       |0x0100 | the alignment is not primary                     |
       |0x0200 | the read fails platform/vendor quality checks    |
       |0x0400 | the read is either a PCR or an optical duplicate |
       +-------+--------------------------------------------------+
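A FLAG value is the sum of the bits that are set, so decoding it is a matter of testing each bit from the table. The short labels in the Python sketch below are hypothetical names invented for the example. Labelling 0x0020 as "mate_reverse" assumes the mate strand bit uses the same encoding as 0x0010 (1 for reverse).

```python
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse"),
    (0x0020, "mate_reverse"),
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
]

def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]

# 99 = 0x0001 + 0x0002 + 0x0020 + 0x0040
print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
```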

LIMITATIONS
       o Unaligned words are used in bam_import.c, bam_endian.h, bam.c and
         bam_aux.c.

       o CIGAR operation P is not properly handled at the moment.

       o In merging, the input files are required to have the same number of
         reference sequences. The requirement can be relaxed. In addition,
         merging does not reconstruct the header dictionaries automatically.
         End users have to provide the correct header. Picard is better at
         merging.

       o Samtools' rmdup does not work for single-end data and does not remove
         duplicates across chromosomes. Picard is better.


AUTHOR
       Heng Li from the Sanger Institute wrote the C version of samtools. Bob
       Handsaker from the Broad Institute implemented the BGZF library and Jue
       Ruan from Beijing Genomics Institute wrote the RAZF library. Various
       people in the 1000 Genomes Project contributed to the SAM format
       specification.


SEE ALSO
       Samtools website: <http://samtools.sourceforge.net>



samtools-0.1.6                  2 September 2009                   samtools(1)