Mercurial > repos > lsong10 > psiclass
comparison PsiCLASS-1.0.2/samtools-0.1.19/samtools.1 @ 0:903fc43d6227 draft default tip
Uploaded
author | lsong10 |
---|---|
date | Fri, 26 Mar 2021 16:52:45 +0000 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:903fc43d6227 |
---|---|
1 .TH samtools 1 "15 March 2013" "samtools-0.1.19" "Bioinformatics tools" | |
2 .SH NAME | |
3 .PP | |
4 samtools - Utilities for the Sequence Alignment/Map (SAM) format | |
5 | |
6 bcftools - Utilities for the Binary Call Format (BCF) and VCF | |
7 .SH SYNOPSIS | |
8 .PP | |
9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz | |
10 .PP | |
11 samtools sort aln.bam aln.sorted | |
12 .PP | |
13 samtools index aln.sorted.bam | |
14 .PP | |
15 samtools idxstats aln.sorted.bam | |
16 .PP | |
17 samtools view aln.sorted.bam chr2:20,100,000-20,200,000 | |
18 .PP | |
19 samtools merge out.bam in1.bam in2.bam in3.bam | |
20 .PP | |
21 samtools faidx ref.fasta | |
22 .PP | |
23 samtools pileup -vcf ref.fasta aln.sorted.bam | |
24 .PP | |
25 samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam | |
26 .PP | |
27 samtools tview aln.sorted.bam ref.fasta | |
28 .PP | |
29 bcftools index in.bcf | |
30 .PP | |
31 bcftools view in.bcf chr2:100-200 > out.vcf | |
32 .PP | |
33 bcftools view -Nvm0.99 in.bcf > out.vcf 2> out.afs | |
34 | |
35 .SH DESCRIPTION | |
36 .PP | |
37 Samtools is a set of utilities that manipulate alignments in the BAM | |
38 format. It imports from and exports to the SAM (Sequence Alignment/Map) | |
39 format, does sorting, merging and indexing, and allows to retrieve reads | |
40 in any regions swiftly. | |
41 | |
42 Samtools is designed to work on a stream. It regards an input file `-' | |
43 as the standard input (stdin) and an output file `-' as the standard | |
44 output (stdout). Several commands can thus be combined with Unix | |
45 pipes. Samtools always output warning and error messages to the standard | |
46 error output (stderr). | |
47 | |
48 Samtools is also able to open a BAM (not SAM) file on a remote FTP or | |
49 HTTP server if the BAM file name starts with `ftp://' or `http://'. | |
50 Samtools checks the current working directory for the index file and | |
51 will download the index upon absence. Samtools does not retrieve the | |
52 entire alignment file unless it is asked to do so. | |
53 | |
54 .SH SAMTOOLS COMMANDS AND OPTIONS | |
55 | |
56 .TP 10 | |
57 .B view | |
58 samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F | |
59 skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]] | |
60 | |
61 Extract/print all or sub alignments in SAM or BAM format. If no region | |
62 is specified, all the alignments will be printed; otherwise only | |
63 alignments overlapping the specified regions will be output. An | |
64 alignment may be given multiple times if it is overlapping several | |
65 regions. A region can be presented, for example, in the following | |
66 format: `chr2' (the whole chr2), `chr2:1000000' (region starting from | |
67 1,000,000bp) or `chr2:1,000,000-2,000,000' (region between 1,000,000 and | |
68 2,000,000bp including the end points). The coordinate is 1-based. | |
69 | |
70 .B OPTIONS: | |
71 .RS | |
72 .TP 10 | |
73 .B -b | |
74 Output in the BAM format. | |
75 .TP | |
76 .BI -f \ INT | |
77 Only output alignments with all bits in INT present in the FLAG | |
78 field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0] | |
79 .TP | |
80 .BI -F \ INT | |
81 Skip alignments with bits present in INT [0] | |
82 .TP | |
83 .B -h | |
84 Include the header in the output. | |
85 .TP | |
86 .B -H | |
87 Output the header only. | |
88 .TP | |
89 .BI -l \ STR | |
90 Only output reads in library STR [null] | |
91 .TP | |
92 .BI -o \ FILE | |
93 Output file [stdout] | |
94 .TP | |
95 .BI -q \ INT | |
96 Skip alignments with MAPQ smaller than INT [0] | |
97 .TP | |
98 .BI -r \ STR | |
99 Only output reads in read group STR [null] | |
100 .TP | |
101 .BI -R \ FILE | |
102 Output reads in read groups listed in | |
103 .I FILE | |
104 [null] | |
105 .TP | |
106 .BI -s \ FLOAT | |
107 Fraction of templates/pairs to subsample; the integer part is treated as the | |
108 seed for the random number generator [-1] | |
109 .TP | |
110 .B -S | |
111 Input is in SAM. If @SQ header lines are absent, the | |
112 .B `-t' | |
113 option is required. | |
114 .TP | |
115 .B -c | |
116 Instead of printing the alignments, only count them and print the | |
117 total number. All filter options, such as | |
118 .B `-f', | |
119 .B `-F' | |
120 and | |
121 .B `-q' | |
122 , are taken into account. | |
123 .TP | |
124 .BI -t \ FILE | |
125 This file is TAB-delimited. Each line must contain the reference name | |
126 and the length of the reference, one line for each distinct reference; | |
127 additional fields are ignored. This file also defines the order of the | |
128 reference sequences in sorting. If you run `samtools faidx <ref.fa>', | |
129 the resultant index file | |
130 .I <ref.fa>.fai | |
131 can be used as this | |
132 .I <in.ref_list> | |
133 file. | |
134 .TP | |
135 .B -u | |
136 Output uncompressed BAM. This option saves time spent on | |
137 compression/decomprssion and is thus preferred when the output is piped | |
138 to another samtools command. | |
139 .RE | |
140 | |
141 .TP | |
142 .B tview | |
143 samtools tview | |
144 .RB [ \-p | |
145 .IR chr:pos ] | |
146 .RB [ \-s | |
147 .IR STR ] | |
148 .RB [ \-d | |
149 .IR display ] | |
150 .RI <in.sorted.bam> | |
151 .RI [ref.fasta] | |
152 | |
153 Text alignment viewer (based on the ncurses library). In the viewer, | |
154 press `?' for help and press `g' to check the alignment start from a | |
155 region in the format like `chr10:10,000,000' or `=10,000,000' when | |
156 viewing the same reference sequence. | |
157 | |
158 .B Options: | |
159 .RS | |
160 .TP 14 | |
161 .BI -d \ display | |
162 Output as (H)tml or (C)urses or (T)ext | |
163 .TP | |
164 .BI -p \ chr:pos | |
165 Go directly to this position | |
166 .TP | |
167 .BI -s \ STR | |
168 Display only reads from this sample or read group | |
169 .RE | |
170 | |
171 .TP | |
172 .B mpileup | |
173 samtools mpileup | |
174 .RB [ \-EBugp ] | |
175 .RB [ \-C | |
176 .IR capQcoef ] | |
177 .RB [ \-r | |
178 .IR reg ] | |
179 .RB [ \-f | |
180 .IR in.fa ] | |
181 .RB [ \-l | |
182 .IR list ] | |
183 .RB [ \-M | |
184 .IR capMapQ ] | |
185 .RB [ \-Q | |
186 .IR minBaseQ ] | |
187 .RB [ \-q | |
188 .IR minMapQ ] | |
189 .I in.bam | |
190 .RI [ in2.bam | |
191 .RI [ ... ]] | |
192 | |
193 Generate BCF or pileup for one or multiple BAM files. Alignment records | |
194 are grouped by sample identifiers in @RG header lines. If sample | |
195 identifiers are absent, each input file is regarded as one sample. | |
196 | |
197 In the pileup format (without | |
198 .BR -u or -g ), | |
199 each | |
200 line represents a genomic position, consisting of chromosome name, | |
201 coordinate, reference base, read bases, read qualities and alignment | |
202 mapping qualities. Information on match, mismatch, indel, strand, | |
203 mapping quality and start and end of a read are all encoded at the read | |
204 base column. At this column, a dot stands for a match to the reference | |
205 base on the forward strand, a comma for a match on the reverse strand, | |
206 a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward | |
207 strand and `acgtn' for a mismatch on the reverse strand. A pattern | |
208 `\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this | |
209 reference position and the next reference position. The length of the | |
210 insertion is given by the integer in the pattern, followed by the | |
211 inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' | |
212 represents a deletion from the reference. The deleted bases will be | |
213 presented as `*' in the following lines. Also at the read base column, a | |
214 symbol `^' marks the start of a read. The ASCII of the character | |
215 following `^' minus 33 gives the mapping quality. A symbol `$' marks the | |
216 end of a read segment. | |
217 | |
218 .B Input Options: | |
219 .RS | |
220 .TP 10 | |
221 .B -6 | |
222 Assume the quality is in the Illumina 1.3+ encoding. | |
223 .B -A | |
224 Do not skip anomalous read pairs in variant calling. | |
225 .TP | |
226 .B -B | |
227 Disable probabilistic realignment for the computation of base alignment | |
228 quality (BAQ). BAQ is the Phred-scaled probability of a read base being | |
229 misaligned. Applying this option greatly helps to reduce false SNPs | |
230 caused by misalignments. | |
231 .TP | |
232 .BI -b \ FILE | |
233 List of input BAM files, one file per line [null] | |
234 .TP | |
235 .BI -C \ INT | |
236 Coefficient for downgrading mapping quality for reads containing | |
237 excessive mismatches. Given a read with a phred-scaled probability q of | |
238 being generated from the mapped position, the new mapping quality is | |
239 about sqrt((INT-q)/INT)*INT. A zero value disables this | |
240 functionality; if enabled, the recommended value for BWA is 50. [0] | |
241 .TP | |
242 .BI -d \ INT | |
243 At a position, read maximally | |
244 .I INT | |
245 reads per input BAM. [250] | |
246 .TP | |
247 .B -E | |
248 Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt | |
249 specificity a little bit. | |
250 .TP | |
251 .BI -f \ FILE | |
252 The | |
253 .BR faidx -indexed | |
254 reference file in the FASTA format. The file can be optionally compressed by | |
255 .BR razip . | |
256 [null] | |
257 .TP | |
258 .BI -l \ FILE | |
259 BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null] | |
260 .TP | |
261 .BI -q \ INT | |
262 Minimum mapping quality for an alignment to be used [0] | |
263 .TP | |
264 .BI -Q \ INT | |
265 Minimum base quality for a base to be considered [13] | |
266 .TP | |
267 .BI -r \ STR | |
268 Only generate pileup in region | |
269 .I STR | |
270 [all sites] | |
271 .TP | |
272 .B Output Options: | |
273 | |
274 .TP | |
275 .B -D | |
276 Output per-sample read depth | |
277 .TP | |
278 .B -g | |
279 Compute genotype likelihoods and output them in the binary call format (BCF). | |
280 .TP | |
281 .B -S | |
282 Output per-sample Phred-scaled strand bias P-value | |
283 .TP | |
284 .B -u | |
285 Similar to | |
286 .B -g | |
287 except that the output is uncompressed BCF, which is preferred for piping. | |
288 | |
289 .TP | |
290 .B Options for Genotype Likelihood Computation (for -g or -u): | |
291 | |
292 .TP | |
293 .BI -e \ INT | |
294 Phred-scaled gap extension sequencing error probability. Reducing | |
295 .I INT | |
296 leads to longer indels. [20] | |
297 .TP | |
298 .BI -h \ INT | |
299 Coefficient for modeling homopolymer errors. Given an | |
300 .IR l -long | |
301 homopolymer | |
302 run, the sequencing error of an indel of size | |
303 .I s | |
304 is modeled as | |
305 .IR INT * s / l . | |
306 [100] | |
307 .TP | |
308 .B -I | |
309 Do not perform INDEL calling | |
310 .TP | |
311 .BI -L \ INT | |
312 Skip INDEL calling if the average per-sample depth is above | |
313 .IR INT . | |
314 [250] | |
315 .TP | |
316 .BI -o \ INT | |
317 Phred-scaled gap open sequencing error probability. Reducing | |
318 .I INT | |
319 leads to more indel calls. [40] | |
320 .TP | |
321 .BI -p | |
322 Apply -m and -F thresholds per sample to increase sensitivity of calling. | |
323 By default both options are applied to reads pooled from all samples. | |
324 .TP | |
325 .BI -P \ STR | |
326 Comma dilimited list of platforms (determined by | |
327 .BR @RG-PL ) | |
328 from which indel candidates are obtained. It is recommended to collect | |
329 indel candidates from sequencing technologies that have low indel error | |
330 rate such as ILLUMINA. [all] | |
331 .RE | |
332 | |
333 .TP | |
334 .B reheader | |
335 samtools reheader <in.header.sam> <in.bam> | |
336 | |
337 Replace the header in | |
338 .I in.bam | |
339 with the header in | |
340 .I in.header.sam. | |
341 This command is much faster than replacing the header with a | |
342 BAM->SAM->BAM conversion. | |
343 | |
344 .TP | |
345 .B cat | |
346 samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ] | |
347 | |
348 Concatenate BAMs. The sequence dictionary of each input BAM must be identical, | |
349 although this command does not check this. This command uses a similar trick | |
350 to | |
351 .B reheader | |
352 which enables fast BAM concatenation. | |
353 | |
354 .TP | |
355 .B sort | |
356 samtools sort [-nof] [-m maxMem] <in.bam> <out.prefix> | |
357 | |
358 Sort alignments by leftmost coordinates. File | |
359 .I <out.prefix>.bam | |
360 will be created. This command may also create temporary files | |
361 .I <out.prefix>.%d.bam | |
362 when the whole alignment cannot be fitted into memory (controlled by | |
363 option -m). | |
364 | |
365 .B OPTIONS: | |
366 .RS | |
367 .TP 8 | |
368 .B -o | |
369 Output the final alignment to the standard output. | |
370 .TP | |
371 .B -n | |
372 Sort by read names rather than by chromosomal coordinates | |
373 .TP | |
374 .B -f | |
375 Use | |
376 .I <out.prefix> | |
377 as the full output path and do not append | |
378 .I .bam | |
379 suffix. | |
380 .TP | |
381 .BI -m \ INT | |
382 Approximately the maximum required memory. [500000000] | |
383 .RE | |
384 | |
385 .TP | |
386 .B merge | |
387 samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam> <in1.bam> <in2.bam> [...] | |
388 | |
389 Merge multiple sorted alignments. | |
390 The header reference lists of all the input BAM files, and the @SQ headers of | |
391 .IR inh.sam , | |
392 if any, must all refer to the same set of reference sequences. | |
393 The header reference list and (unless overridden by | |
394 .BR -h ) | |
395 `@' headers of | |
396 .I in1.bam | |
397 will be copied to | |
398 .IR out.bam , | |
399 and the headers of other files will be ignored. | |
400 | |
401 .B OPTIONS: | |
402 .RS | |
403 .TP 8 | |
404 .B -1 | |
405 Use zlib compression level 1 to comrpess the output | |
406 .TP | |
407 .B -f | |
408 Force to overwrite the output file if present. | |
409 .TP 8 | |
410 .BI -h \ FILE | |
411 Use the lines of | |
412 .I FILE | |
413 as `@' headers to be copied to | |
414 .IR out.bam , | |
415 replacing any header lines that would otherwise be copied from | |
416 .IR in1.bam . | |
417 .RI ( FILE | |
418 is actually in SAM format, though any alignment records it may contain | |
419 are ignored.) | |
420 .TP | |
421 .B -n | |
422 The input alignments are sorted by read names rather than by chromosomal | |
423 coordinates | |
424 .TP | |
425 .BI -R \ STR | |
426 Merge files in the specified region indicated by | |
427 .I STR | |
428 [null] | |
429 .TP | |
430 .B -r | |
431 Attach an RG tag to each alignment. The tag value is inferred from file names. | |
432 .TP | |
433 .B -u | |
434 Uncompressed BAM output | |
435 .RE | |
436 | |
437 .TP | |
438 .B index | |
439 samtools index <aln.bam> | |
440 | |
441 Index sorted alignment for fast random access. Index file | |
442 .I <aln.bam>.bai | |
443 will be created. | |
444 | |
445 .TP | |
446 .B idxstats | |
447 samtools idxstats <aln.bam> | |
448 | |
449 Retrieve and print stats in the index file. The output is TAB delimited | |
450 with each line consisting of reference sequence name, sequence length, # | |
451 mapped reads and # unmapped reads. | |
452 | |
453 .TP | |
454 .B faidx | |
455 samtools faidx <ref.fasta> [region1 [...]] | |
456 | |
457 Index reference sequence in the FASTA format or extract subsequence from | |
458 indexed reference sequence. If no region is specified, | |
459 .B faidx | |
460 will index the file and create | |
461 .I <ref.fasta>.fai | |
462 on the disk. If regions are speficified, the subsequences will be | |
463 retrieved and printed to stdout in the FASTA format. The input file can | |
464 be compressed in the | |
465 .B RAZF | |
466 format. | |
467 | |
468 .TP | |
469 .B fixmate | |
470 samtools fixmate <in.nameSrt.bam> <out.bam> | |
471 | |
472 Fill in mate coordinates, ISIZE and mate related flags from a | |
473 name-sorted alignment. | |
474 | |
475 .TP | |
476 .B rmdup | |
477 samtools rmdup [-sS] <input.srt.bam> <out.bam> | |
478 | |
479 Remove potential PCR duplicates: if multiple read pairs have identical | |
480 external coordinates, only retain the pair with highest mapping quality. | |
481 In the paired-end mode, this command | |
482 .B ONLY | |
483 works with FR orientation and requires ISIZE is correctly set. It does | |
484 not work for unpaired reads (e.g. two ends mapped to different | |
485 chromosomes or orphan reads). | |
486 | |
487 .B OPTIONS: | |
488 .RS | |
489 .TP 8 | |
490 .B -s | |
491 Remove duplicate for single-end reads. By default, the command works for | |
492 paired-end reads only. | |
493 .TP 8 | |
494 .B -S | |
495 Treat paired-end reads and single-end reads. | |
496 .RE | |
497 | |
498 .TP | |
499 .B calmd | |
500 samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta> | |
501 | |
502 Generate the MD tag. If the MD tag is already present, this command will | |
503 give a warning if the MD tag generated is different from the existing | |
504 tag. Output SAM by default. | |
505 | |
506 .B OPTIONS: | |
507 .RS | |
508 .TP 8 | |
509 .B -A | |
510 When used jointly with | |
511 .B -r | |
512 this option overwrites the original base quality. | |
513 .TP 8 | |
514 .B -e | |
515 Convert a the read base to = if it is identical to the aligned reference | |
516 base. Indel caller does not support the = bases at the moment. | |
517 .TP | |
518 .B -u | |
519 Output uncompressed BAM | |
520 .TP | |
521 .B -b | |
522 Output compressed BAM | |
523 .TP | |
524 .B -S | |
525 The input is SAM with header lines | |
526 .TP | |
527 .BI -C \ INT | |
528 Coefficient to cap mapping quality of poorly mapped reads. See the | |
529 .B pileup | |
530 command for details. [0] | |
531 .TP | |
532 .B -r | |
533 Compute the BQ tag (without -A) or cap base quality by BAQ (with -A). | |
534 .TP | |
535 .B -E | |
536 Extended BAQ calculation. This option trades specificity for sensitivity, though the | |
537 effect is minor. | |
538 .RE | |
539 | |
540 .TP | |
541 .B targetcut | |
542 samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam> | |
543 | |
544 This command identifies target regions by examining the continuity of read depth, computes | |
545 haploid consensus sequences of targets and outputs a SAM with each sequence corresponding | |
546 to a target. When option | |
547 .B -f | |
548 is in use, BAQ will be applied. This command is | |
549 .B only | |
550 designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)]. | |
551 .RE | |
552 | |
553 .TP | |
554 .B phase | |
555 samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam> | |
556 | |
557 Call and phase heterozygous SNPs. | |
558 .B OPTIONS: | |
559 .RS | |
560 .TP 8 | |
561 .B -A | |
562 Drop reads with ambiguous phase. | |
563 .TP 8 | |
564 .BI -b \ STR | |
565 Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file | |
566 .BR STR .0.bam | |
567 and phase-1 reads in | |
568 .BR STR .1.bam. | |
569 Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads | |
570 with switch errors will be saved in | |
571 .BR STR .chimeric.bam. | |
572 [null] | |
573 .TP | |
574 .B -F | |
575 Do not attempt to fix chimeric reads. | |
576 .TP | |
577 .BI -k \ INT | |
578 Maximum length for local phasing. [13] | |
579 .TP | |
580 .BI -q \ INT | |
581 Minimum Phred-scaled LOD to call a heterozygote. [40] | |
582 .TP | |
583 .BI -Q \ INT | |
584 Minimum base quality to be used in het calling. [13] | |
585 .RE | |
586 | |
587 .SH BCFTOOLS COMMANDS AND OPTIONS | |
588 | |
589 .TP 10 | |
590 .B view | |
591 .B bcftools view | |
592 .RB [ \-AbFGNQSucgv ] | |
593 .RB [ \-D | |
594 .IR seqDict ] | |
595 .RB [ \-l | |
596 .IR listLoci ] | |
597 .RB [ \-s | |
598 .IR listSample ] | |
599 .RB [ \-i | |
600 .IR gapSNPratio ] | |
601 .RB [ \-t | |
602 .IR mutRate ] | |
603 .RB [ \-p | |
604 .IR varThres ] | |
605 .RB [ \-m | |
606 .IR varThres ] | |
607 .RB [ \-P | |
608 .IR prior ] | |
609 .RB [ \-1 | |
610 .IR nGroup1 ] | |
611 .RB [ \-d | |
612 .IR minFrac ] | |
613 .RB [ \-U | |
614 .IR nPerm ] | |
615 .RB [ \-X | |
616 .IR permThres ] | |
617 .RB [ \-T | |
618 .IR trioType ] | |
619 .I in.bcf | |
620 .RI [ region ] | |
621 | |
622 Convert between BCF and VCF, call variant candidates and estimate allele | |
623 frequencies. | |
624 | |
625 .RS | |
626 .TP | |
627 .B Input/Output Options: | |
628 .TP 10 | |
629 .B -A | |
630 Retain all possible alternate alleles at variant sites. By default, the view | |
631 command discards unlikely alleles. | |
632 .TP 10 | |
633 .B -b | |
634 Output in the BCF format. The default is VCF. | |
635 .TP | |
636 .BI -D \ FILE | |
637 Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null] | |
638 .TP | |
639 .B -F | |
640 Indicate PL is generated by r921 or before (ordering is different). | |
641 .TP | |
642 .B -G | |
643 Suppress all individual genotype information. | |
644 .TP | |
645 .BI -l \ FILE | |
646 List of sites at which information are outputted [all sites] | |
647 .TP | |
648 .B -N | |
649 Skip sites where the REF field is not A/C/G/T | |
650 .TP | |
651 .B -Q | |
652 Output the QCALL likelihood format | |
653 .TP | |
654 .BI -s \ FILE | |
655 List of samples to use. The first column in the input gives the sample names | |
656 and the second gives the ploidy, which can only be 1 or 2. When the 2nd column | |
657 is absent, the sample ploidy is assumed to be 2. In the output, the ordering of | |
658 samples will be identical to the one in | |
659 .IR FILE . | |
660 [null] | |
661 .TP | |
662 .B -S | |
663 The input is VCF instead of BCF. | |
664 .TP | |
665 .B -u | |
666 Uncompressed BCF output (force -b). | |
667 .TP | |
668 .B Consensus/Variant Calling Options: | |
669 .TP 10 | |
670 .B -c | |
671 Call variants using Bayesian inference. This option automatically invokes option | |
672 .BR -e . | |
673 .TP | |
674 .BI -d \ FLOAT | |
675 When | |
676 .B -v | |
677 is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0] | |
678 .TP | |
679 .B -e | |
680 Perform max-likelihood inference only, including estimating the site allele frequency, | |
681 testing Hardy-Weinberg equlibrium and testing associations with LRT. | |
682 .TP | |
683 .B -g | |
684 Call per-sample genotypes at variant sites (force -c) | |
685 .TP | |
686 .BI -i \ FLOAT | |
687 Ratio of INDEL-to-SNP mutation rate [0.15] | |
688 .TP | |
689 .BI -m \ FLOAT | |
690 New model for improved multiallelic and rare-variant calling. Another | |
691 ALT allele is accepted if P(chi^2) of LRT exceeds the FLOAT threshold. The | |
692 parameter seems robust and the actual value usually does not affect the results | |
693 much; a good value to use is 0.99. This is the recommended calling method. [0] | |
694 .TP | |
695 .BI -p \ FLOAT | |
696 A site is considered to be a variant if P(ref|D)<FLOAT [0.5] | |
697 .TP | |
698 .BI -P \ STR | |
699 Prior or initial allele frequency spectrum. If STR can be | |
700 .IR full , | |
701 .IR cond2 , | |
702 .I flat | |
703 or the file consisting of error output from a previous variant calling | |
704 run. | |
705 .TP | |
706 .BI -t \ FLOAT | |
707 Scaled muttion rate for variant calling [0.001] | |
708 .TP | |
709 .BI -T \ STR | |
710 Enable pair/trio calling. For trio calling, option | |
711 .B -s | |
712 is usually needed to be applied to configure the trio members and their ordering. | |
713 In the file supplied to the option | |
714 .BR -s , | |
715 the first sample must be the child, the second the father and the third the mother. | |
716 The valid values of | |
717 .I STR | |
718 are `pair', `trioauto', `trioxd' and `trioxs', where `pair' calls differences between two input samples, and `trioxd' (`trioxs') specifies that the input | |
719 is from the X chromosome non-PAR regions and the child is a female (male). [null] | |
720 .TP | |
721 .B -v | |
722 Output variant sites only (force -c) | |
723 .TP | |
724 .B Contrast Calling and Association Test Options: | |
725 .TP | |
726 .BI -1 \ INT | |
727 Number of group-1 samples. This option is used for dividing the samples into | |
728 two groups for contrast SNP calling or association test. | |
729 When this option is in use, the following VCF INFO will be outputted: | |
730 PC2, PCHI2 and QCHI2. [0] | |
731 .TP | |
732 .BI -U \ INT | |
733 Number of permutations for association test (effective only with | |
734 .BR -1 ) | |
735 [0] | |
736 .TP | |
737 .BI -X \ FLOAT | |
738 Only perform permutations for P(chi^2)<FLOAT (effective only with | |
739 .BR -U ) | |
740 [0.01] | |
741 .RE | |
742 | |
743 .TP | |
744 .B index | |
745 .B bcftools index | |
746 .I in.bcf | |
747 | |
748 Index sorted BCF for random access. | |
749 .RE | |
750 | |
751 .TP | |
752 .B cat | |
753 .B bcftools cat | |
754 .I in1.bcf | |
755 .RI [ "in2.bcf " [ ... "]]]" | |
756 | |
757 Concatenate BCF files. The input files are required to be sorted and | |
758 have identical samples appearing in the same order. | |
759 .RE | |
760 .SH SAM FORMAT | |
761 | |
762 Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which are started | |
763 with the `@' symbol, each alignment line consists of: | |
764 | |
765 .TS | |
766 center box; | |
767 cb | cb | cb | |
768 n | l | l . | |
769 Col Field Description | |
770 _ | |
771 1 QNAME Query template/pair NAME | |
772 2 FLAG bitwise FLAG | |
773 3 RNAME Reference sequence NAME | |
774 4 POS 1-based leftmost POSition/coordinate of clipped sequence | |
775 5 MAPQ MAPping Quality (Phred-scaled) | |
776 6 CIAGR extended CIGAR string | |
777 7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME) | |
778 8 MPOS 1-based Mate POSistion | |
779 9 TLEN inferred Template LENgth (insert size) | |
780 10 SEQ query SEQuence on the same strand as the reference | |
781 11 QUAL query QUALity (ASCII-33 gives the Phred base quality) | |
782 12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE | |
783 .TE | |
784 | |
785 .PP | |
786 Each bit in the FLAG field is defined as: | |
787 | |
788 .TS | |
789 center box; | |
790 cb | cb | cb | |
791 l | c | l . | |
792 Flag Chr Description | |
793 _ | |
794 0x0001 p the read is paired in sequencing | |
795 0x0002 P the read is mapped in a proper pair | |
796 0x0004 u the query sequence itself is unmapped | |
797 0x0008 U the mate is unmapped | |
798 0x0010 r strand of the query (1 for reverse) | |
799 0x0020 R strand of the mate | |
800 0x0040 1 the read is the first read in a pair | |
801 0x0080 2 the read is the second read in a pair | |
802 0x0100 s the alignment is not primary | |
803 0x0200 f the read fails platform/vendor quality checks | |
804 0x0400 d the read is either a PCR or an optical duplicate | |
805 .TE | |
806 | |
807 where the second column gives the string representation of the FLAG field. | |
808 | |
809 .SH VCF FORMAT | |
810 | |
811 The Variant Call Format (VCF) is a TAB-delimited format with each data line consists of the following fields: | |
812 .TS | |
813 center box; | |
814 cb | cb | cb | |
815 n | l | l . | |
816 Col Field Description | |
817 _ | |
818 1 CHROM CHROMosome name | |
819 2 POS the left-most POSition of the variant | |
820 3 ID unique variant IDentifier | |
821 4 REF the REFerence allele | |
822 5 ALT the ALTernate allele(s), separated by comma | |
823 6 QUAL variant/reference QUALity | |
824 7 FILTER FILTers applied | |
825 8 INFO INFOrmation related to the variant, separated by semi-colon | |
826 9 FORMAT FORMAT of the genotype fields, separated by colon (optional) | |
827 10+ SAMPLE SAMPLE genotypes and per-sample information (optional) | |
828 .TE | |
829 | |
830 .PP | |
831 The following table gives the | |
832 .B INFO | |
833 tags used by samtools and bcftools. | |
834 | |
835 .TS | |
836 center box; | |
837 cb | cb | cb | |
838 l | l | l . | |
839 Tag Format Description | |
840 _ | |
841 AF1 double Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele | |
842 DP int Raw read depth (without quality filtering) | |
843 DP4 int[4] # high-quality reference forward bases, ref reverse, alternate for and alt rev bases | |
844 FQ int Consensus quality. Positive: sample genotypes different; negative: otherwise | |
845 MQ int Root-Mean-Square mapping quality of covering reads | |
846 PC2 int[2] Phred probability of AF in group1 samples being larger (,smaller) than in group2 | |
847 PCHI2 double Posterior weighted chi^2 P-value between group1 and group2 samples | |
848 PV4 double[4] P-value for strand bias, baseQ bias, mapQ bias and tail distance bias | |
849 QCHI2 int Phred-scaled PCHI2 | |
850 RP int # permutations yielding a smaller PCHI2 | |
851 CLR int Phred log ratio of genotype likelihoods with and without the trio/pair constraint | |
852 UGT string Most probable genotype configuration without the trio constraint | |
853 CGT string Most probable configuration with the trio constraint | |
854 VDB float Tests variant positions within reads. Intended for filtering RNA-seq artifacts around splice sites | |
855 RPB float Mann-Whitney rank-sum test for tail distance bias | |
856 HWE float Hardy-Weinberg equilibrium test, Wigginton et al., PMID: 15789306 | |
857 .TE | |
858 | |
859 .SH EXAMPLES | |
860 .IP o 2 | |
861 Import SAM to BAM when | |
862 .B @SQ | |
863 lines are present in the header: | |
864 | |
865 samtools view -bS aln.sam > aln.bam | |
866 | |
867 If | |
868 .B @SQ | |
869 lines are absent: | |
870 | |
871 samtools faidx ref.fa | |
872 samtools view -bt ref.fa.fai aln.sam > aln.bam | |
873 | |
874 where | |
875 .I ref.fa.fai | |
876 is generated automatically by the | |
877 .B faidx | |
878 command. | |
879 | |
880 .IP o 2 | |
881 Attach the | |
882 .B RG | |
883 tag while merging sorted alignments: | |
884 | |
885 perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt | |
886 samtools merge -rh rg.txt merged.bam ga.bam 454.bam | |
887 | |
888 The value in a | |
889 .B RG | |
890 tag is determined by the file name the read is coming from. In this | |
891 example, in the | |
892 .IR merged.bam , | |
893 reads from | |
894 .I ga.bam | |
895 will be attached | |
896 .IR RG:Z:ga , | |
897 while reads from | |
898 .I 454.bam | |
899 will be attached | |
900 .IR RG:Z:454 . | |
901 | |
902 .IP o 2 | |
903 Call SNPs and short INDELs for one diploid individual: | |
904 | |
905 samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf | |
906 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf | |
907 | |
908 The | |
909 .B -D | |
910 option of varFilter controls the maximum read depth, which should be | |
911 adjusted to about twice the average read depth. One may consider to add | |
912 .B -C50 | |
913 to | |
914 .B mpileup | |
915 if mapping quality is overestimated for reads containing excessive | |
916 mismatches. Applying this option usually helps | |
917 .B BWA-short | |
918 but may not other mappers. | |
919 | |
920 .IP o 2 | |
921 Generate the consensus sequence for one diploid individual: | |
922 | |
923 samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq | |
924 | |
925 .IP o 2 | |
926 Call somatic mutations from a pair of samples: | |
927 | |
928 samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair - > var.bcf | |
929 | |
930 In the output INFO field, | |
931 .I CLR | |
932 gives the Phred-log ratio between the likelihood by treating the | |
933 two samples independently, and the likelihood by requiring the genotype to be identical. | |
934 This | |
935 .I CLR | |
936 is effectively a score measuring the confidence of somatic calls. The higher the better. | |
937 | |
938 .IP o 2 | |
939 Call de novo and somatic mutations from a family trio: | |
940 | |
941 samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -s samples.txt - > var.bcf | |
942 | |
943 File | |
944 .I samples.txt | |
945 should consist of three lines specifying the member and order of samples (in the order of child-father-mother). | |
946 Similarly, | |
947 .I CLR | |
948 gives the Phred-log likelihood ratio with and without the trio constraint. | |
949 .I UGT | |
950 shows the most likely genotype configuration without the trio constraint, and | |
951 .I CGT | |
952 gives the most likely genotype configuration satisfying the trio constraint. | |
953 | |
954 .IP o 2 | |
955 Phase one individual: | |
956 | |
957 samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out | |
958 | |
959 The | |
960 .B calmd | |
961 command is used to reduce false heterozygotes around INDELs. | |
962 | |
963 .IP o 2 | |
964 Call SNPs and short indels for multiple diploid individuals: | |
965 | |
966 samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf | |
967 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf | |
968 | |
969 Individuals are identified from the | |
970 .B SM | |
971 tags in the | |
972 .B @RG | |
973 header lines. Individuals can be pooled in one alignment file; one | |
974 individual can also be separated into multiple files. The | |
975 .B -P | |
976 option specifies that indel candidates should be collected only from | |
977 read groups with the | |
978 .B @RG-PL | |
979 tag set to | |
980 .IR ILLUMINA . | |
981 Collecting indel candidates from reads sequenced by an indel-prone | |
982 technology may affect the performance of indel calling. | |
983 | |
984 Note that there is a new calling model which can be invoked by | |
985 | |
986 bcftools view -m0.99 ... | |
987 | |
988 which fixes some severe limitations of the default method. | |
989 | |
990 For filtering, best results seem to be achieved by first applying the | |
991 .IR SnpGap | |
992 filter and then applying some machine learning approach | |
993 | |
994 vcf-annotate -f SnpGap=n | |
995 vcf filter ... | |
996 | |
997 Both can be found in the | |
998 .B vcftools | |
999 and | |
1000 .B htslib | |
1001 package (links below). | |
1002 | |
1003 .IP o 2 | |
1004 Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals: | |
1005 | |
1006 samtools mpileup -Igf ref.fa *.bam > all.bcf | |
1007 bcftools view -bl sites.list all.bcf > sites.bcf | |
1008 bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs | |
1009 bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs | |
1010 bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs | |
1011 ...... | |
1012 | |
1013 where | |
1014 .I sites.list | |
1015 contains the list of sites with each line consisting of the reference | |
1016 sequence name and position. The following | |
1017 .B bcftools | |
1018 commands estimate AFS by EM. | |
1019 | |
1020 .IP o 2 | |
1021 Dump BAQ applied alignment for other SNP callers: | |
1022 | |
1023 samtools calmd -bAr aln.bam > aln.baq.bam | |
1024 | |
1025 It adds and corrects the | |
1026 .B NM | |
1027 and | |
1028 .B MD | |
1029 tags at the same time. The | |
1030 .B calmd | |
1031 command also comes with the | |
1032 .B -C | |
1033 option, the same as the one in | |
1034 .B pileup | |
1035 and | |
1036 .BR mpileup . | |
1037 Apply if it helps. | |
1038 | |
1039 .SH LIMITATIONS | |
1040 .PP | |
1041 .IP o 2 | |
1042 Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. | |
1043 .IP o 2 | |
1044 Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan | |
1045 reads or ends mapped to different chromosomes). If this is a concern, | |
1046 please use Picard's MarkDuplicate which correctly handles these cases, | |
1047 although a little slower. | |
1048 | |
1049 .SH AUTHOR | |
1050 .PP | |
1051 Heng Li from the Sanger Institute wrote the C version of samtools. Bob | |
1052 Handsaker from the Broad Institute implemented the BGZF library and Jue | |
1053 Ruan from Beijing Genomics Institute wrote the RAZF library. John | |
1054 Marshall and Petr Danecek contribute to the source code and various | |
1055 people from the 1000 Genomes Project have contributed to the SAM format | |
1056 specification. | |
1057 | |
1058 .SH SEE ALSO | |
1059 .PP | |
1060 Samtools website: <http://samtools.sourceforge.net> | |
1061 .br | |
1062 Samtools latest source: <https://github.com/samtools/samtools> | |
1063 .br | |
1064 VCFtools website with stable link to VCF specification: <http://vcftools.sourceforge.net> | |
1065 .br | |
1066 HTSlib website: <https://github.com/samtools/htslib> |