Mercurial > repos > lsong10 > psiclass
diff PsiCLASS-1.0.2/samtools-0.1.19/samtools.1 @ 0:903fc43d6227 draft default tip
Uploaded
author | lsong10 |
---|---|
date | Fri, 26 Mar 2021 16:52:45 +0000 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/PsiCLASS-1.0.2/samtools-0.1.19/samtools.1 Fri Mar 26 16:52:45 2021 +0000 @@ -0,0 +1,1066 @@ +.TH samtools 1 "15 March 2013" "samtools-0.1.19" "Bioinformatics tools" +.SH NAME +.PP +samtools - Utilities for the Sequence Alignment/Map (SAM) format + +bcftools - Utilities for the Binary Call Format (BCF) and VCF +.SH SYNOPSIS +.PP +samtools view -bt ref_list.txt -o aln.bam aln.sam.gz +.PP +samtools sort aln.bam aln.sorted +.PP +samtools index aln.sorted.bam +.PP +samtools idxstats aln.sorted.bam +.PP +samtools view aln.sorted.bam chr2:20,100,000-20,200,000 +.PP +samtools merge out.bam in1.bam in2.bam in3.bam +.PP +samtools faidx ref.fasta +.PP +samtools pileup -vcf ref.fasta aln.sorted.bam +.PP +samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam +.PP +samtools tview aln.sorted.bam ref.fasta +.PP +bcftools index in.bcf +.PP +bcftools view in.bcf chr2:100-200 > out.vcf +.PP +bcftools view -Nvm0.99 in.bcf > out.vcf 2> out.afs + +.SH DESCRIPTION +.PP +Samtools is a set of utilities that manipulate alignments in the BAM +format. It imports from and exports to the SAM (Sequence Alignment/Map) +format, does sorting, merging and indexing, and allows to retrieve reads +in any regions swiftly. + +Samtools is designed to work on a stream. It regards an input file `-' +as the standard input (stdin) and an output file `-' as the standard +output (stdout). Several commands can thus be combined with Unix +pipes. Samtools always output warning and error messages to the standard +error output (stderr). + +Samtools is also able to open a BAM (not SAM) file on a remote FTP or +HTTP server if the BAM file name starts with `ftp://' or `http://'. +Samtools checks the current working directory for the index file and +will download the index upon absence. Samtools does not retrieve the +entire alignment file unless it is asked to do so. + +.SH SAMTOOLS COMMANDS AND OPTIONS + +.TP 10 +.B view +samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F +skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]] + +Extract/print all or sub alignments in SAM or BAM format. If no region +is specified, all the alignments will be printed; otherwise only +alignments overlapping the specified regions will be output. An +alignment may be given multiple times if it is overlapping several +regions. A region can be presented, for example, in the following +format: `chr2' (the whole chr2), `chr2:1000000' (region starting from +1,000,000bp) or `chr2:1,000,000-2,000,000' (region between 1,000,000 and +2,000,000bp including the end points). The coordinate is 1-based. + +.B OPTIONS: +.RS +.TP 10 +.B -b +Output in the BAM format. +.TP +.BI -f \ INT +Only output alignments with all bits in INT present in the FLAG +field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0] +.TP +.BI -F \ INT +Skip alignments with bits present in INT [0] +.TP +.B -h +Include the header in the output. +.TP +.B -H +Output the header only. +.TP +.BI -l \ STR +Only output reads in library STR [null] +.TP +.BI -o \ FILE +Output file [stdout] +.TP +.BI -q \ INT +Skip alignments with MAPQ smaller than INT [0] +.TP +.BI -r \ STR +Only output reads in read group STR [null] +.TP +.BI -R \ FILE +Output reads in read groups listed in +.I FILE +[null] +.TP +.BI -s \ FLOAT +Fraction of templates/pairs to subsample; the integer part is treated as the +seed for the random number generator [-1] +.TP +.B -S +Input is in SAM. If @SQ header lines are absent, the +.B `-t' +option is required. +.TP +.B -c +Instead of printing the alignments, only count them and print the +total number. All filter options, such as +.B `-f', +.B `-F' +and +.B `-q' +, are taken into account. +.TP +.BI -t \ FILE +This file is TAB-delimited. Each line must contain the reference name +and the length of the reference, one line for each distinct reference; +additional fields are ignored. This file also defines the order of the +reference sequences in sorting. If you run `samtools faidx <ref.fa>', +the resultant index file +.I <ref.fa>.fai +can be used as this +.I <in.ref_list> +file. +.TP +.B -u +Output uncompressed BAM. This option saves time spent on +compression/decomprssion and is thus preferred when the output is piped +to another samtools command. +.RE + +.TP +.B tview +samtools tview +.RB [ \-p +.IR chr:pos ] +.RB [ \-s +.IR STR ] +.RB [ \-d +.IR display ] +.RI <in.sorted.bam> +.RI [ref.fasta] + +Text alignment viewer (based on the ncurses library). In the viewer, +press `?' for help and press `g' to check the alignment start from a +region in the format like `chr10:10,000,000' or `=10,000,000' when +viewing the same reference sequence. + +.B Options: +.RS +.TP 14 +.BI -d \ display +Output as (H)tml or (C)urses or (T)ext +.TP +.BI -p \ chr:pos +Go directly to this position +.TP +.BI -s \ STR +Display only reads from this sample or read group +.RE + +.TP +.B mpileup +samtools mpileup +.RB [ \-EBugp ] +.RB [ \-C +.IR capQcoef ] +.RB [ \-r +.IR reg ] +.RB [ \-f +.IR in.fa ] +.RB [ \-l +.IR list ] +.RB [ \-M +.IR capMapQ ] +.RB [ \-Q +.IR minBaseQ ] +.RB [ \-q +.IR minMapQ ] +.I in.bam +.RI [ in2.bam +.RI [ ... ]] + +Generate BCF or pileup for one or multiple BAM files. Alignment records +are grouped by sample identifiers in @RG header lines. If sample +identifiers are absent, each input file is regarded as one sample. + +In the pileup format (without +.BR -u or -g ), +each +line represents a genomic position, consisting of chromosome name, +coordinate, reference base, read bases, read qualities and alignment +mapping qualities. Information on match, mismatch, indel, strand, +mapping quality and start and end of a read are all encoded at the read +base column. At this column, a dot stands for a match to the reference +base on the forward strand, a comma for a match on the reverse strand, +a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward +strand and `acgtn' for a mismatch on the reverse strand. A pattern +`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this +reference position and the next reference position. The length of the +insertion is given by the integer in the pattern, followed by the +inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' +represents a deletion from the reference. The deleted bases will be +presented as `*' in the following lines. Also at the read base column, a +symbol `^' marks the start of a read. The ASCII of the character +following `^' minus 33 gives the mapping quality. A symbol `$' marks the +end of a read segment. + +.B Input Options: +.RS +.TP 10 +.B -6 +Assume the quality is in the Illumina 1.3+ encoding. +.B -A +Do not skip anomalous read pairs in variant calling. +.TP +.B -B +Disable probabilistic realignment for the computation of base alignment +quality (BAQ). BAQ is the Phred-scaled probability of a read base being +misaligned. Applying this option greatly helps to reduce false SNPs +caused by misalignments. +.TP +.BI -b \ FILE +List of input BAM files, one file per line [null] +.TP +.BI -C \ INT +Coefficient for downgrading mapping quality for reads containing +excessive mismatches. Given a read with a phred-scaled probability q of +being generated from the mapped position, the new mapping quality is +about sqrt((INT-q)/INT)*INT. A zero value disables this +functionality; if enabled, the recommended value for BWA is 50. [0] +.TP +.BI -d \ INT +At a position, read maximally +.I INT +reads per input BAM. [250] +.TP +.B -E +Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt +specificity a little bit. +.TP +.BI -f \ FILE +The +.BR faidx -indexed +reference file in the FASTA format. The file can be optionally compressed by +.BR razip . +[null] +.TP +.BI -l \ FILE +BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null] +.TP +.BI -q \ INT +Minimum mapping quality for an alignment to be used [0] +.TP +.BI -Q \ INT +Minimum base quality for a base to be considered [13] +.TP +.BI -r \ STR +Only generate pileup in region +.I STR +[all sites] +.TP +.B Output Options: + +.TP +.B -D +Output per-sample read depth +.TP +.B -g +Compute genotype likelihoods and output them in the binary call format (BCF). +.TP +.B -S +Output per-sample Phred-scaled strand bias P-value +.TP +.B -u +Similar to +.B -g +except that the output is uncompressed BCF, which is preferred for piping. + +.TP +.B Options for Genotype Likelihood Computation (for -g or -u): + +.TP +.BI -e \ INT +Phred-scaled gap extension sequencing error probability. Reducing +.I INT +leads to longer indels. [20] +.TP +.BI -h \ INT +Coefficient for modeling homopolymer errors. Given an +.IR l -long +homopolymer +run, the sequencing error of an indel of size +.I s +is modeled as +.IR INT * s / l . +[100] +.TP +.B -I +Do not perform INDEL calling +.TP +.BI -L \ INT +Skip INDEL calling if the average per-sample depth is above +.IR INT . +[250] +.TP +.BI -o \ INT +Phred-scaled gap open sequencing error probability. Reducing +.I INT +leads to more indel calls. [40] +.TP +.BI -p +Apply -m and -F thresholds per sample to increase sensitivity of calling. +By default both options are applied to reads pooled from all samples. +.TP +.BI -P \ STR +Comma dilimited list of platforms (determined by +.BR @RG-PL ) +from which indel candidates are obtained. It is recommended to collect +indel candidates from sequencing technologies that have low indel error +rate such as ILLUMINA. [all] +.RE + +.TP +.B reheader +samtools reheader <in.header.sam> <in.bam> + +Replace the header in +.I in.bam +with the header in +.I in.header.sam. +This command is much faster than replacing the header with a +BAM->SAM->BAM conversion. + +.TP +.B cat +samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ] + +Concatenate BAMs. The sequence dictionary of each input BAM must be identical, +although this command does not check this. This command uses a similar trick +to +.B reheader +which enables fast BAM concatenation. + +.TP +.B sort +samtools sort [-nof] [-m maxMem] <in.bam> <out.prefix> + +Sort alignments by leftmost coordinates. File +.I <out.prefix>.bam +will be created. This command may also create temporary files +.I <out.prefix>.%d.bam +when the whole alignment cannot be fitted into memory (controlled by +option -m). + +.B OPTIONS: +.RS +.TP 8 +.B -o +Output the final alignment to the standard output. +.TP +.B -n +Sort by read names rather than by chromosomal coordinates +.TP +.B -f +Use +.I <out.prefix> +as the full output path and do not append +.I .bam +suffix. +.TP +.BI -m \ INT +Approximately the maximum required memory. [500000000] +.RE + +.TP +.B merge +samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam> <in1.bam> <in2.bam> [...] + +Merge multiple sorted alignments. +The header reference lists of all the input BAM files, and the @SQ headers of +.IR inh.sam , +if any, must all refer to the same set of reference sequences. +The header reference list and (unless overridden by +.BR -h ) +`@' headers of +.I in1.bam +will be copied to +.IR out.bam , +and the headers of other files will be ignored. + +.B OPTIONS: +.RS +.TP 8 +.B -1 +Use zlib compression level 1 to comrpess the output +.TP +.B -f +Force to overwrite the output file if present. +.TP 8 +.BI -h \ FILE +Use the lines of +.I FILE +as `@' headers to be copied to +.IR out.bam , +replacing any header lines that would otherwise be copied from +.IR in1.bam . +.RI ( FILE +is actually in SAM format, though any alignment records it may contain +are ignored.) +.TP +.B -n +The input alignments are sorted by read names rather than by chromosomal +coordinates +.TP +.BI -R \ STR +Merge files in the specified region indicated by +.I STR +[null] +.TP +.B -r +Attach an RG tag to each alignment. The tag value is inferred from file names. +.TP +.B -u +Uncompressed BAM output +.RE + +.TP +.B index +samtools index <aln.bam> + +Index sorted alignment for fast random access. Index file +.I <aln.bam>.bai +will be created. + +.TP +.B idxstats +samtools idxstats <aln.bam> + +Retrieve and print stats in the index file. The output is TAB delimited +with each line consisting of reference sequence name, sequence length, # +mapped reads and # unmapped reads. + +.TP +.B faidx +samtools faidx <ref.fasta> [region1 [...]] + +Index reference sequence in the FASTA format or extract subsequence from +indexed reference sequence. If no region is specified, +.B faidx +will index the file and create +.I <ref.fasta>.fai +on the disk. If regions are speficified, the subsequences will be +retrieved and printed to stdout in the FASTA format. The input file can +be compressed in the +.B RAZF +format. + +.TP +.B fixmate +samtools fixmate <in.nameSrt.bam> <out.bam> + +Fill in mate coordinates, ISIZE and mate related flags from a +name-sorted alignment. + +.TP +.B rmdup +samtools rmdup [-sS] <input.srt.bam> <out.bam> + +Remove potential PCR duplicates: if multiple read pairs have identical +external coordinates, only retain the pair with highest mapping quality. +In the paired-end mode, this command +.B ONLY +works with FR orientation and requires ISIZE is correctly set. It does +not work for unpaired reads (e.g. two ends mapped to different +chromosomes or orphan reads). + +.B OPTIONS: +.RS +.TP 8 +.B -s +Remove duplicate for single-end reads. By default, the command works for +paired-end reads only. +.TP 8 +.B -S +Treat paired-end reads and single-end reads. +.RE + +.TP +.B calmd +samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta> + +Generate the MD tag. If the MD tag is already present, this command will +give a warning if the MD tag generated is different from the existing +tag. Output SAM by default. + +.B OPTIONS: +.RS +.TP 8 +.B -A +When used jointly with +.B -r +this option overwrites the original base quality. +.TP 8 +.B -e +Convert a the read base to = if it is identical to the aligned reference +base. Indel caller does not support the = bases at the moment. +.TP +.B -u +Output uncompressed BAM +.TP +.B -b +Output compressed BAM +.TP +.B -S +The input is SAM with header lines +.TP +.BI -C \ INT +Coefficient to cap mapping quality of poorly mapped reads. See the +.B pileup +command for details. [0] +.TP +.B -r +Compute the BQ tag (without -A) or cap base quality by BAQ (with -A). +.TP +.B -E +Extended BAQ calculation. This option trades specificity for sensitivity, though the +effect is minor. +.RE + +.TP +.B targetcut +samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam> + +This command identifies target regions by examining the continuity of read depth, computes +haploid consensus sequences of targets and outputs a SAM with each sequence corresponding +to a target. When option +.B -f +is in use, BAQ will be applied. This command is +.B only +designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)]. +.RE + +.TP +.B phase +samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam> + +Call and phase heterozygous SNPs. +.B OPTIONS: +.RS +.TP 8 +.B -A +Drop reads with ambiguous phase. +.TP 8 +.BI -b \ STR +Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file +.BR STR .0.bam +and phase-1 reads in +.BR STR .1.bam. +Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads +with switch errors will be saved in +.BR STR .chimeric.bam. +[null] +.TP +.B -F +Do not attempt to fix chimeric reads. +.TP +.BI -k \ INT +Maximum length for local phasing. [13] +.TP +.BI -q \ INT +Minimum Phred-scaled LOD to call a heterozygote. [40] +.TP +.BI -Q \ INT +Minimum base quality to be used in het calling. [13] +.RE + +.SH BCFTOOLS COMMANDS AND OPTIONS + +.TP 10 +.B view +.B bcftools view +.RB [ \-AbFGNQSucgv ] +.RB [ \-D +.IR seqDict ] +.RB [ \-l +.IR listLoci ] +.RB [ \-s +.IR listSample ] +.RB [ \-i +.IR gapSNPratio ] +.RB [ \-t +.IR mutRate ] +.RB [ \-p +.IR varThres ] +.RB [ \-m +.IR varThres ] +.RB [ \-P +.IR prior ] +.RB [ \-1 +.IR nGroup1 ] +.RB [ \-d +.IR minFrac ] +.RB [ \-U +.IR nPerm ] +.RB [ \-X +.IR permThres ] +.RB [ \-T +.IR trioType ] +.I in.bcf +.RI [ region ] + +Convert between BCF and VCF, call variant candidates and estimate allele +frequencies. + +.RS +.TP +.B Input/Output Options: +.TP 10 +.B -A +Retain all possible alternate alleles at variant sites. By default, the view +command discards unlikely alleles. +.TP 10 +.B -b +Output in the BCF format. The default is VCF. +.TP +.BI -D \ FILE +Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null] +.TP +.B -F +Indicate PL is generated by r921 or before (ordering is different). +.TP +.B -G +Suppress all individual genotype information. +.TP +.BI -l \ FILE +List of sites at which information are outputted [all sites] +.TP +.B -N +Skip sites where the REF field is not A/C/G/T +.TP +.B -Q +Output the QCALL likelihood format +.TP +.BI -s \ FILE +List of samples to use. The first column in the input gives the sample names +and the second gives the ploidy, which can only be 1 or 2. When the 2nd column +is absent, the sample ploidy is assumed to be 2. In the output, the ordering of +samples will be identical to the one in +.IR FILE . +[null] +.TP +.B -S +The input is VCF instead of BCF. +.TP +.B -u +Uncompressed BCF output (force -b). +.TP +.B Consensus/Variant Calling Options: +.TP 10 +.B -c +Call variants using Bayesian inference. This option automatically invokes option +.BR -e . +.TP +.BI -d \ FLOAT +When +.B -v +is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0] +.TP +.B -e +Perform max-likelihood inference only, including estimating the site allele frequency, +testing Hardy-Weinberg equlibrium and testing associations with LRT. +.TP +.B -g +Call per-sample genotypes at variant sites (force -c) +.TP +.BI -i \ FLOAT +Ratio of INDEL-to-SNP mutation rate [0.15] +.TP +.BI -m \ FLOAT +New model for improved multiallelic and rare-variant calling. Another +ALT allele is accepted if P(chi^2) of LRT exceeds the FLOAT threshold. The +parameter seems robust and the actual value usually does not affect the results +much; a good value to use is 0.99. This is the recommended calling method. [0] +.TP +.BI -p \ FLOAT +A site is considered to be a variant if P(ref|D)<FLOAT [0.5] +.TP +.BI -P \ STR +Prior or initial allele frequency spectrum. If STR can be +.IR full , +.IR cond2 , +.I flat +or the file consisting of error output from a previous variant calling +run. +.TP +.BI -t \ FLOAT +Scaled muttion rate for variant calling [0.001] +.TP +.BI -T \ STR +Enable pair/trio calling. For trio calling, option +.B -s +is usually needed to be applied to configure the trio members and their ordering. +In the file supplied to the option +.BR -s , +the first sample must be the child, the second the father and the third the mother. +The valid values of +.I STR +are `pair', `trioauto', `trioxd' and `trioxs', where `pair' calls differences between two input samples, and `trioxd' (`trioxs') specifies that the input +is from the X chromosome non-PAR regions and the child is a female (male). [null] +.TP +.B -v +Output variant sites only (force -c) +.TP +.B Contrast Calling and Association Test Options: +.TP +.BI -1 \ INT +Number of group-1 samples. This option is used for dividing the samples into +two groups for contrast SNP calling or association test. +When this option is in use, the following VCF INFO will be outputted: +PC2, PCHI2 and QCHI2. [0] +.TP +.BI -U \ INT +Number of permutations for association test (effective only with +.BR -1 ) +[0] +.TP +.BI -X \ FLOAT +Only perform permutations for P(chi^2)<FLOAT (effective only with +.BR -U ) +[0.01] +.RE + +.TP +.B index +.B bcftools index +.I in.bcf + +Index sorted BCF for random access. +.RE + +.TP +.B cat +.B bcftools cat +.I in1.bcf +.RI [ "in2.bcf " [ ... "]]]" + +Concatenate BCF files. The input files are required to be sorted and +have identical samples appearing in the same order. +.RE +.SH SAM FORMAT + +Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which are started +with the `@' symbol, each alignment line consists of: + +.TS +center box; +cb | cb | cb +n | l | l . +Col Field Description +_ +1 QNAME Query template/pair NAME +2 FLAG bitwise FLAG +3 RNAME Reference sequence NAME +4 POS 1-based leftmost POSition/coordinate of clipped sequence +5 MAPQ MAPping Quality (Phred-scaled) +6 CIAGR extended CIGAR string +7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME) +8 MPOS 1-based Mate POSistion +9 TLEN inferred Template LENgth (insert size) +10 SEQ query SEQuence on the same strand as the reference +11 QUAL query QUALity (ASCII-33 gives the Phred base quality) +12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE +.TE + +.PP +Each bit in the FLAG field is defined as: + +.TS +center box; +cb | cb | cb +l | c | l . +Flag Chr Description +_ +0x0001 p the read is paired in sequencing +0x0002 P the read is mapped in a proper pair +0x0004 u the query sequence itself is unmapped +0x0008 U the mate is unmapped +0x0010 r strand of the query (1 for reverse) +0x0020 R strand of the mate +0x0040 1 the read is the first read in a pair +0x0080 2 the read is the second read in a pair +0x0100 s the alignment is not primary +0x0200 f the read fails platform/vendor quality checks +0x0400 d the read is either a PCR or an optical duplicate +.TE + +where the second column gives the string representation of the FLAG field. + +.SH VCF FORMAT + +The Variant Call Format (VCF) is a TAB-delimited format with each data line consists of the following fields: +.TS +center box; +cb | cb | cb +n | l | l . +Col Field Description +_ +1 CHROM CHROMosome name +2 POS the left-most POSition of the variant +3 ID unique variant IDentifier +4 REF the REFerence allele +5 ALT the ALTernate allele(s), separated by comma +6 QUAL variant/reference QUALity +7 FILTER FILTers applied +8 INFO INFOrmation related to the variant, separated by semi-colon +9 FORMAT FORMAT of the genotype fields, separated by colon (optional) +10+ SAMPLE SAMPLE genotypes and per-sample information (optional) +.TE + +.PP +The following table gives the +.B INFO +tags used by samtools and bcftools. + +.TS +center box; +cb | cb | cb +l | l | l . +Tag Format Description +_ +AF1 double Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele +DP int Raw read depth (without quality filtering) +DP4 int[4] # high-quality reference forward bases, ref reverse, alternate for and alt rev bases +FQ int Consensus quality. Positive: sample genotypes different; negative: otherwise +MQ int Root-Mean-Square mapping quality of covering reads +PC2 int[2] Phred probability of AF in group1 samples being larger (,smaller) than in group2 +PCHI2 double Posterior weighted chi^2 P-value between group1 and group2 samples +PV4 double[4] P-value for strand bias, baseQ bias, mapQ bias and tail distance bias +QCHI2 int Phred-scaled PCHI2 +RP int # permutations yielding a smaller PCHI2 +CLR int Phred log ratio of genotype likelihoods with and without the trio/pair constraint +UGT string Most probable genotype configuration without the trio constraint +CGT string Most probable configuration with the trio constraint +VDB float Tests variant positions within reads. Intended for filtering RNA-seq artifacts around splice sites +RPB float Mann-Whitney rank-sum test for tail distance bias +HWE float Hardy-Weinberg equilibrium test, Wigginton et al., PMID: 15789306 +.TE + +.SH EXAMPLES +.IP o 2 +Import SAM to BAM when +.B @SQ +lines are present in the header: + + samtools view -bS aln.sam > aln.bam + +If +.B @SQ +lines are absent: + + samtools faidx ref.fa + samtools view -bt ref.fa.fai aln.sam > aln.bam + +where +.I ref.fa.fai +is generated automatically by the +.B faidx +command. + +.IP o 2 +Attach the +.B RG +tag while merging sorted alignments: + + perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt + samtools merge -rh rg.txt merged.bam ga.bam 454.bam + +The value in a +.B RG +tag is determined by the file name the read is coming from. In this +example, in the +.IR merged.bam , +reads from +.I ga.bam +will be attached +.IR RG:Z:ga , +while reads from +.I 454.bam +will be attached +.IR RG:Z:454 . + +.IP o 2 +Call SNPs and short INDELs for one diploid individual: + + samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf + +The +.B -D +option of varFilter controls the maximum read depth, which should be +adjusted to about twice the average read depth. One may consider to add +.B -C50 +to +.B mpileup +if mapping quality is overestimated for reads containing excessive +mismatches. Applying this option usually helps +.B BWA-short +but may not other mappers. + +.IP o 2 +Generate the consensus sequence for one diploid individual: + + samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq + +.IP o 2 +Call somatic mutations from a pair of samples: + + samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair - > var.bcf + +In the output INFO field, +.I CLR +gives the Phred-log ratio between the likelihood by treating the +two samples independently, and the likelihood by requiring the genotype to be identical. +This +.I CLR +is effectively a score measuring the confidence of somatic calls. The higher the better. + +.IP o 2 +Call de novo and somatic mutations from a family trio: + + samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -s samples.txt - > var.bcf + +File +.I samples.txt +should consist of three lines specifying the member and order of samples (in the order of child-father-mother). +Similarly, +.I CLR +gives the Phred-log likelihood ratio with and without the trio constraint. +.I UGT +shows the most likely genotype configuration without the trio constraint, and +.I CGT +gives the most likely genotype configuration satisfying the trio constraint. + +.IP o 2 +Phase one individual: + + samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out + +The +.B calmd +command is used to reduce false heterozygotes around INDELs. + +.IP o 2 +Call SNPs and short indels for multiple diploid individuals: + + samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf + bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf + +Individuals are identified from the +.B SM +tags in the +.B @RG +header lines. Individuals can be pooled in one alignment file; one +individual can also be separated into multiple files. The +.B -P +option specifies that indel candidates should be collected only from +read groups with the +.B @RG-PL +tag set to +.IR ILLUMINA . +Collecting indel candidates from reads sequenced by an indel-prone +technology may affect the performance of indel calling. + +Note that there is a new calling model which can be invoked by + + bcftools view -m0.99 ... + +which fixes some severe limitations of the default method. + +For filtering, best results seem to be achieved by first applying the +.IR SnpGap +filter and then applying some machine learning approach + + vcf-annotate -f SnpGap=n + vcf filter ... + +Both can be found in the +.B vcftools +and +.B htslib +package (links below). + +.IP o 2 +Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals: + + samtools mpileup -Igf ref.fa *.bam > all.bcf + bcftools view -bl sites.list all.bcf > sites.bcf + bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs + bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs + bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs + ...... + +where +.I sites.list +contains the list of sites with each line consisting of the reference +sequence name and position. The following +.B bcftools +commands estimate AFS by EM. + +.IP o 2 +Dump BAQ applied alignment for other SNP callers: + + samtools calmd -bAr aln.bam > aln.baq.bam + +It adds and corrects the +.B NM +and +.B MD +tags at the same time. The +.B calmd +command also comes with the +.B -C +option, the same as the one in +.B pileup +and +.BR mpileup . +Apply if it helps. + +.SH LIMITATIONS +.PP +.IP o 2 +Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. +.IP o 2 +Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan +reads or ends mapped to different chromosomes). If this is a concern, +please use Picard's MarkDuplicate which correctly handles these cases, +although a little slower. + +.SH AUTHOR +.PP +Heng Li from the Sanger Institute wrote the C version of samtools. Bob +Handsaker from the Broad Institute implemented the BGZF library and Jue +Ruan from Beijing Genomics Institute wrote the RAZF library. John +Marshall and Petr Danecek contribute to the source code and various +people from the 1000 Genomes Project have contributed to the SAM format +specification. + +.SH SEE ALSO +.PP +Samtools website: <http://samtools.sourceforge.net> +.br +Samtools latest source: <https://github.com/samtools/samtools> +.br +VCFtools website with stable link to VCF specification: <http://vcftools.sourceforge.net> +.br +HTSlib website: <https://github.com/samtools/htslib>