diff SNV/SNVMix2_source/SNVMix2-v0.12.1-rc1/samtools-0.1.6/samtools.1 @ 0:74f5ea818cea
Uploaded
author | ryanmorin |
---|---|
date | Wed, 12 Oct 2011 19:50:38 -0400 |
parents | |
children | |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/SNV/SNVMix2_source/SNVMix2-v0.12.1-rc1/samtools-0.1.6/samtools.1	Wed Oct 12 19:50:38 2011 -0400
@@ -0,0 +1,445 @@
+.TH samtools 1 "2 September 2009" "samtools-0.1.6" "Bioinformatics tools"
+.SH NAME
+.PP
+samtools - Utilities for the Sequence Alignment/Map (SAM) format
+.SH SYNOPSIS
+.PP
+samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
+.PP
+samtools sort aln.bam aln.sorted
+.PP
+samtools index aln.sorted.bam
+.PP
+samtools view aln.sorted.bam chr2:20,100,000-20,200,000
+.PP
+samtools merge out.bam in1.bam in2.bam in3.bam
+.PP
+samtools faidx ref.fasta
+.PP
+samtools pileup -f ref.fasta aln.sorted.bam
+.PP
+samtools tview aln.sorted.bam ref.fasta
+
+.SH DESCRIPTION
+.PP
+Samtools is a set of utilities that manipulate alignments in the BAM
+format. It imports from and exports to the SAM (Sequence Alignment/Map)
+format, does sorting, merging and indexing, and allows reads in any
+region to be retrieved swiftly.
+
+Samtools is designed to work on a stream. It regards an input file `-'
+as the standard input (stdin) and an output file `-' as the standard
+output (stdout). Several commands can thus be combined with Unix
+pipes. Samtools always outputs warning and error messages to the standard
+error output (stderr).
+
+Samtools is also able to open a BAM (not SAM) file on a remote FTP or
+HTTP server if the BAM file name starts with `ftp://' or `http://'.
+Samtools checks the current working directory for the index file and
+will download the index upon absence. Samtools does not retrieve the
+entire alignment file unless it is asked to do so.
+
+.SH COMMANDS AND OPTIONS
+
+.TP 10
+.B import
+samtools import <in.ref_list> <in.sam> <out.bam>
+
+Since 0.1.4, this command is an alias of:
+
+samtools view -bt <in.ref_list> -o <out.bam> <in.sam>
+
+.TP
+.B sort
+samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>
+
+Sort alignments by leftmost coordinates. File
+.I <out.prefix>.bam
+will be created. This command may also create temporary files
+.I <out.prefix>.%d.bam
+when the whole alignment cannot be fitted into memory (controlled by
+option -m).
+
+.B OPTIONS:
+.RS
+.TP 8
+.B -n
+Sort by read names rather than by chromosomal coordinates
+.TP
+.B -m INT
+Approximately the maximum required memory. [500000000]
+.RE
+
+.TP
+.B merge
+samtools merge [-h inh.sam] [-n] <out.bam> <in1.bam> <in2.bam> [...]
+
+Merge multiple sorted alignments.
+The header reference lists of all the input BAM files, and the @SQ headers of
+.IR inh.sam ,
+if any, must all refer to the same set of reference sequences.
+The header reference list and (unless overridden by
+.BR -h )
+`@' headers of
+.I in1.bam
+will be copied to
+.IR out.bam ,
+and the headers of other files will be ignored.
+
+.B OPTIONS:
+.RS
+.TP 8
+.B -h FILE
+Use the lines of
+.I FILE
+as `@' headers to be copied to
+.IR out.bam ,
+replacing any header lines that would otherwise be copied from
+.IR in1.bam .
+.RI ( FILE
+is actually in SAM format, though any alignment records it may contain
+are ignored.)
+.TP
+.B -n
+The input alignments are sorted by read names rather than by chromosomal
+coordinates
+.RE
+
+.TP
+.B index
+samtools index <aln.bam>
+
+Index sorted alignment for fast random access. Index file
+.I <aln.bam>.bai
+will be created.
+
+.TP
+.B view
+samtools view [-bhuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
+skipFlag] [-q minMapQ] [-l library] [-r readGroup] <in.bam>|<in.sam> [region1 [...]]
+
+Extract/print all or sub alignments in SAM or BAM format. If no region
+is specified, all the alignments will be printed; otherwise only
+alignments overlapping the specified regions will be output. An
+alignment may be given multiple times if it is overlapping several
+regions. A region can be presented, for example, in the following
+format: `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000'. The
+coordinate is 1-based.
+
+.B OPTIONS:
+.RS
+.TP 8
+.B -b
+Output in the BAM format.
+.TP
+.B -u
+Output uncompressed BAM. This option saves time spent on
+compression/decompression and is thus preferred when the output is piped
+to another samtools command.
+.TP
+.B -h
+Include the header in the output.
+.TP
+.B -H
+Output the header only.
+.TP
+.B -S
+Input is in SAM. If @SQ header lines are absent, the
+.B `-t'
+option is required.
+.TP
+.B -t FILE
+This file is TAB-delimited. Each line must contain the reference name
+and the length of the reference, one line for each distinct reference;
+additional fields are ignored. This file also defines the order of the
+reference sequences in sorting. If you run `samtools faidx <ref.fa>',
+the resultant index file
+.I <ref.fa>.fai
+can be used as this
+.I <in.ref_list>
+file.
+.TP
+.B -o FILE
+Output file [stdout]
+.TP
+.B -f INT
+Only output alignments with all bits in INT present in the FLAG
+field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
+.TP
+.B -F INT
+Skip alignments with bits present in INT [0]
+.TP
+.B -q INT
+Skip alignments with MAPQ smaller than INT [0]
+.TP
+.B -l STR
+Only output reads in library STR [null]
+.TP
+.B -r STR
+Only output reads in read group STR [null]
+.RE
+
+.TP
+.B faidx
+samtools faidx <ref.fasta> [region1 [...]]
+
+Index reference sequence in the FASTA format or extract subsequence from
+indexed reference sequence. If no region is specified,
+.B faidx
+will index the file and create
+.I <ref.fasta>.fai
+on the disk. If regions are specified, the subsequences will be
+retrieved and printed to stdout in the FASTA format. The input file can
+be compressed in the
+.B RAZF
+format.
+
+.TP
+.B pileup
+samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l in.site_list]
+[-iscgS2] [-T theta] [-N nHap] [-r pairDiffRate] <in.bam>|<in.sam>
+
+Print the alignment in the pileup format. In the pileup format, each
+line represents a genomic position, consisting of chromosome name,
+coordinate, reference base, read bases, read qualities and alignment
+mapping qualities. Information on match, mismatch, indel, strand,
+mapping quality and start and end of a read are all encoded at the read
+base column. At this column, a dot stands for a match to the reference
+base on the forward strand, a comma for a match on the reverse strand,
+`ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch
+on the reverse strand. A pattern `\\+[0-9]+[ACGTNacgtn]+' indicates
+there is an insertion between this reference position and the next
+reference position. The length of the insertion is given by the integer
+in the pattern, followed by the inserted sequence. Similarly, a pattern
+`-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. The
+deleted bases will be presented as `*' in the following lines. Also at
+the read base column, a symbol `^' marks the start of a read segment
+which is a contiguous subsequence on the read separated by `N/S/H' CIGAR
+operations. The ASCII of the character following `^' minus 33 gives the
+mapping quality. A symbol `$' marks the end of a read segment.
+
+If option
+.B -c
+is applied, the consensus base, consensus quality, SNP quality and RMS
+mapping quality of the reads covering the site will be inserted between
+the `reference base' and the `read bases' columns. An indel occupies an
+additional line. Each indel line consists of chromosome name,
+coordinate, a star, the genotype, consensus quality, SNP quality, RMS
+mapping quality, # covering reads, the first allele, the second allele,
+# reads supporting the first allele, # reads supporting the second
+allele and # reads containing indels different from the top two alleles.
+
+.B OPTIONS:
+.RS
+
+.TP 10
+.B -s
+Print the mapping quality as the last column. This option makes the
+output easier to parse, although this format is not space efficient.
+
+.TP
+.B -S
+The input file is in SAM.
+
+.TP
+.B -i
+Only output pileup lines containing indels.
+
+.TP
+.B -f FILE
+The reference sequence in the FASTA format. Index file
+.I FILE.fai
+will be created if
+absent.
+
+.TP
+.B -M INT
+Cap mapping quality at INT [60]
+
+.TP
+.B -t FILE
+List of reference names and sequence lengths, in the format described
+for the
+.B import
+command. If this option is present, samtools assumes the input
+.I <in.alignment>
+is in SAM format; otherwise it assumes BAM format.
+
+.TP
+.B -l FILE
+List of sites at which pileup is output. This file is space
+delimited. The first two columns are required to be chromosome and
+1-based coordinate. Additional columns are ignored. It is
+recommended to use option
+.B -s
+together with
+.B -l
+as in the default format we may not know the mapping quality.
+
+.TP
+.B -c
+Call the consensus sequence using MAQ consensus model. Options
+.B -T,
+.B -N,
+.B -I
+and
+.B -r
+are only effective when
+.B -c
+or
+.B -g
+is in use.
+
+.TP
+.B -g
+Generate genotype likelihood in the binary GLFv3 format. This option
+suppresses -c, -i and -s.
+
+.TP
+.B -T FLOAT
+The theta parameter (error dependency coefficient) in the maq consensus
+calling model [0.85]
+
+.TP
+.B -N INT
+Number of haplotypes in the sample (>=2) [2]
+
+.TP
+.B -r FLOAT
+Expected fraction of differences between a pair of haplotypes [0.001]
+
+.TP
+.B -I INT
+Phred probability of an indel in sequencing/prep. [40]
+
+.RE
+
+.TP
+.B tview
+samtools tview <in.sorted.bam> [ref.fasta]
+
+Text alignment viewer (based on the ncurses library). In the viewer,
+press `?' for help and press `g' to check the alignment start from a
+region in the format like `chr10:10,000,000'.
+
+.RE
+
+.TP
+.B fixmate
+samtools fixmate <in.nameSrt.bam> <out.bam>
+
+Fill in mate coordinates, ISIZE and mate related flags from a
+name-sorted alignment.
+
+.TP
+.B rmdup
+samtools rmdup <input.srt.bam> <out.bam>
+
+Remove potential PCR duplicates: if multiple read pairs have identical
+external coordinates, only retain the pair with highest mapping quality.
+This command
+.B ONLY
+works with FR orientation and requires ISIZE to be correctly set.
+
+.RE
+
+.TP
+.B rmdupse
+samtools rmdupse <input.srt.bam> <out.bam>
+
+Remove potential duplicates for single-ended reads. This command will
+treat all reads as single-ended even if they are paired in fact.
+
+.RE
+
+.TP
+.B fillmd
+samtools fillmd [-e] <aln.bam> <ref.fasta>
+
+Generate the MD tag. If the MD tag is already present, this command will
+give a warning if the MD tag generated is different from the existing
+tag.
+
+.B OPTIONS:
+.RS
+.TP 8
+.B -e
+Convert a read base to = if it is identical to the aligned reference
+base. Indel caller does not support the = bases at the moment.
+
+.RE
+
+.SH SAM FORMAT
+
+SAM is TAB-delimited. Apart from the header lines, which are started
+with the `@' symbol, each alignment line consists of:
+
+.TS
+center box;
+cb | cb | cb
+n | l | l .
+Col	Field	Description
+_
+1	QNAME	Query (pair) NAME
+2	FLAG	bitwise FLAG
+3	RNAME	Reference sequence NAME
+4	POS	1-based leftmost POSition/coordinate of clipped sequence
+5	MAPQ	MAPping Quality (Phred-scaled)
+6	CIGAR	extended CIGAR string
+7	MRNM	Mate Reference sequence NaMe (`=' if same as RNAME)
+8	MPOS	1-based Mate POSition
+9	ISIZE	Inferred insert SIZE
+10	SEQ	query SEQuence on the same strand as the reference
+11	QUAL	query QUALity (ASCII-33 gives the Phred base quality)
+12	OPT	variable OPTional fields in the format TAG:VTYPE:VALUE
+.TE
+
+.PP
+Each bit in the FLAG field is defined as:
+
+.TS
+center box;
+cb | cb
+l | l .
+Flag	Description
+_
+0x0001	the read is paired in sequencing
+0x0002	the read is mapped in a proper pair
+0x0004	the query sequence itself is unmapped
+0x0008	the mate is unmapped
+0x0010	strand of the query (1 for reverse)
+0x0020	strand of the mate
+0x0040	the read is the first read in a pair
+0x0080	the read is the second read in a pair
+0x0100	the alignment is not primary
+0x0200	the read fails platform/vendor quality checks
+0x0400	the read is either a PCR or an optical duplicate
+.TE
+
+.SH LIMITATIONS
+.PP
+.IP o 2
+Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
+.IP o 2
+CIGAR operation P is not properly handled at the moment.
+.IP o 2
+In merging, the input files are required to have the same number of
+reference sequences. The requirement can be relaxed. In addition,
+merging does not reconstruct the header dictionaries
+automatically. Endusers have to provide the correct header. Picard is
+better at merging.
+.IP o 2
+Samtools' rmdup does not work for single-end data and does not remove
+duplicates across chromosomes. Picard is better.
+
+.SH AUTHOR
+.PP
+Heng Li from the Sanger Institute wrote the C version of samtools. Bob
+Handsaker from the Broad Institute implemented the BGZF library and Jue
+Ruan from Beijing Genomics Institute wrote the RAZF library. Various
+people in the 1000Genomes Project contributed to the SAM format
+specification.
+
+.SH SEE ALSO
+.PP
+Samtools website: <http://samtools.sourceforge.net>
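
Reviewer note: the man page added in this changeset describes two encodings that are easy to misread, the bitwise FLAG field and the pileup read-base column. The Python sketch below is not part of the commit and is not samtools code; it is a minimal illustration written only from the descriptions in the man page. The function names (`decode_flag`, `keep_alignment`, `parse_read_bases`) and the event labels are hypothetical, and the reading of `-F` as "skip if any listed bit is set" is an assumption taken from the option text.

```python
import re

# Bit meanings copied from the FLAG table in the man page.
FLAG_BITS = {
    0x0001: "the read is paired in sequencing",
    0x0002: "the read is mapped in a proper pair",
    0x0004: "the query sequence itself is unmapped",
    0x0008: "the mate is unmapped",
    0x0010: "strand of the query (1 for reverse)",
    0x0020: "strand of the mate",
    0x0040: "the read is the first read in a pair",
    0x0080: "the read is the second read in a pair",
    0x0100: "the alignment is not primary",
    0x0200: "the read fails platform/vendor quality checks",
    0x0400: "the read is either a PCR or an optical duplicate",
}


def decode_flag(flag):
    """Return the description of every bit set in a FLAG value."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]


def keep_alignment(flag, req_flag=0, skip_flag=0):
    """Assumed semantics of `view -f/-F`: keep an alignment when all
    req_flag bits are set and no skip_flag bit is set."""
    return (flag & req_flag) == req_flag and (flag & skip_flag) == 0


def parse_read_bases(column):
    """Split a pileup read-base column into (event, value) tuples.

    The event labels are hypothetical names chosen for this sketch.
    """
    events = []
    i = 0
    while i < len(column):
        c = column[i]
        if c == "^":
            # The character after '^' encodes the mapping quality (ASCII - 33).
            events.append(("segment_start", ord(column[i + 1]) - 33))
            i += 2
        elif c == "$":
            events.append(("segment_end", None))
            i += 1
        elif c in "+-":
            # '+' or '-' is followed by a length and then that many bases.
            length_text = re.match(r"[0-9]+", column[i + 1:]).group()
            start = i + 1 + len(length_text)
            seq = column[start:start + int(length_text)]
            events.append(("insertion" if c == "+" else "deletion", seq))
            i = start + int(length_text)
        elif c == ".":
            events.append(("match_forward", None))
            i += 1
        elif c == ",":
            events.append(("match_reverse", None))
            i += 1
        elif c == "*":
            events.append(("deleted_base", None))
            i += 1
        elif c in "ACGTN":
            events.append(("mismatch_forward", c))
            i += 1
        elif c in "acgtn":
            events.append(("mismatch_reverse", c))
            i += 1
        else:
            raise ValueError("unexpected character %r at offset %d" % (c, i))
    return events
```

For example, `decode_flag(99)` lists the four bits 0x0001, 0x0002, 0x0020 and 0x0040, and `keep_alignment(99, req_flag=0x2, skip_flag=0x4)` keeps a properly paired, mapped read. The parser walks the column character by character rather than using one large regular expression, because the length of an indel sequence is only known after its integer prefix has been read.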