view SNV/SNVMix2_source/SNVMix2-v0.12.1-rc1/samtools-0.1.6/samtools.txt @ 0:74f5ea818cea

Uploaded
author ryanmorin
date Wed, 12 Oct 2011 19:50:38 -0400
parents
children
line wrap: on
line source

samtools(1)                  Bioinformatics tools                  samtools(1)



NAME
       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools tview aln.sorted.bam ref.fasta


DESCRIPTION
       Samtools  is  a  set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format,  does  sorting,  merging  and  indexing, and allows to retrieve
       reads in any regions swiftly.

       Samtools is designed to work on a stream. It regards an input file  `-'
       as  the  standard  input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always output warning and error messages to the standard error
       output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote  FTP  or
       HTTP  server  if  the  BAM file name starts with `ftp://' or `http://'.
       Samtools checks the current working directory for the  index  file  and
       will  download  the  index upon absence. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.


COMMANDS AND OPTIONS
       import    samtools import <in.ref_list> <in.sam> <out.bam>

                 Since 0.1.4, this command is an alias of:

                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>


       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

                 Sort  alignments  by  leftmost  coordinates.  File  <out.pre-
                 fix>.bam will be created. This command may also create tempo-
                 rary files <out.prefix>.%d.bam when the whole alignment  can-
                 not be fitted into memory (controlled by option -m).

                 OPTIONS:

                 -n      Sort by read names rather than by chromosomal coordi-
                         nates

                 -m INT  Approximately   the    maximum    required    memory.
                         [500000000]


       merge     samtools   merge   [-h   inh.sam]  [-n]  <out.bam>  <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments.  The header reference lists
                 of  all  the input BAM files, and the @SQ headers of inh.sam,
                 if  any,  must  all  refer  to  the  same  set  of  reference
                 sequences.   The header reference list and (unless overridden
                 by -h) `@' headers of in1.bam will be copied to out.bam,  and
                 the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use  the lines of FILE as `@' headers to be copied to
                         out.bam, replacing any header lines that would other-
                         wise  be  copied  from in1.bam.  (FILE is actually in
                         SAM format, though any alignment records it may  con-
                         tain are ignored.)

                 -n      The  input alignments are sorted by read names rather
                         than by chromosomal coordinates


       index     samtools index <aln.bam>

                 Index sorted alignment for fast  random  access.  Index  file
                 <aln.bam>.bai will be created.


       view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
                 reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
                 Group] <in.bam>|<in.sam> [region1 [...]]

                 Extract/print  all or sub alignments in SAM or BAM format. If
                 no region is specified, all the alignments will  be  printed;
                 otherwise  only  alignments overlapping the specified regions
                 will be output. An alignment may be given multiple  times  if
                 it is overlapping several regions. A region can be presented,
                 for example, in the following format: `chr2',  `chr2:1000000'
                 or `chr2:1,000,000-2,000,000'. The coordinate is 1-based.

                 OPTIONS:

                 -b      Output in the BAM format.

                 -u      Output uncompressed BAM. This option saves time spent
                         on compression/decomprssion  and  is  thus  preferred
                         when the output is piped to another samtools command.

                 -h      Include the header in the output.

                 -H      Output the header only.

                 -S      Input is in SAM. If @SQ header lines are absent,  the
                         `-t' option is required.

                 -t FILE This  file  is  TAB-delimited. Each line must contain
                         the reference name and the length of  the  reference,
                         one  line  for  each  distinct  reference; additional
                         fields are ignored. This file also defines the  order
                         of  the  reference  sequences  in sorting. If you run
                         `samtools faidx <ref.fa>', the resultant  index  file
                         <ref.fa>.fai  can be used as this <in.ref_list> file.

                 -o FILE Output file [stdout]

                 -f INT  Only output alignments with all bits in  INT  present
                         in the FLAG field. INT can be in hex in the format of
                         /^0x[0-9A-F]+/ [0]

                 -F INT  Skip alignments with bits present in INT [0]

                 -q INT  Skip alignments with MAPQ smaller than INT [0]

                 -l STR  Only output reads in library STR [null]

                 -r STR  Only output reads in read group STR [null]


       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence in the FASTA format or extract  sub-
                 sequence  from  indexed  reference  sequence. If no region is
                 specified,   faidx   will   index   the   file   and   create
                 <ref.fasta>.fai  on the disk. If regions are speficified, the
                 subsequences will be retrieved and printed to stdout  in  the
                 FASTA  format.  The  input file can be compressed in the RAZF
                 format.


       pileup    samtools  pileup  [-f  in.ref.fasta]  [-t  in.ref_list]   [-l
                 in.site_list]    [-iscgS2]   [-T   theta]   [-N   nHap]   [-r
                 pairDiffRate] <in.bam>|<in.sam>

                 Print the alignment in the pileup format. In the pileup  for-
                 mat,  each  line represents a genomic position, consisting of
                 chromosome name, coordinate, reference base, read bases, read
                 qualities  and  alignment  mapping  qualities. Information on
                 match, mismatch, indel, strand, mapping quality and start and
                 end  of  a  read  are all encoded at the read base column. At
                 this column, a dot stands for a match to the  reference  base
                 on  the  forward  strand,  a comma for a match on the reverse
                 strand, `ACGTN' for a mismatch  on  the  forward  strand  and
                 `acgtn'  for  a  mismatch  on  the  reverse strand. A pattern
                 `\+[0-9]+[ACGTNacgtn]+'  indicates  there  is  an   insertion
                 between  this reference position and the next reference posi-
                 tion. The length of the insertion is given by the integer  in
                 the  pattern, followed by the inserted sequence. Similarly, a
                 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
                 reference.  The deleted bases will be presented as `*' in the
                 following lines. Also at the read base column, a  symbol  `^'
                 marks  the start of a read segment which is a contiguous sub-
                 sequence on the read separated by `N/S/H'  CIGAR  operations.
                 The  ASCII  of the character following `^' minus 33 gives the
                 mapping quality. A symbol `$' marks the end of  a  read  seg-
                 ment.

                 If  option -c is applied, the consensus base, consensus qual-
                 ity, SNP quality and RMS mapping quality of the reads  cover-
                 ing  the  site  will be inserted between the `reference base'
                 and the `read bases' columns. An indel occupies an additional
                 line.  Each  indel  line consists of chromosome name, coordi-
                 nate, a star, the genotype, consensus quality,  SNP  quality,
                 RMS mapping quality, # covering reads, the first alllele, the
                 second allele, # reads supporting the first allele,  #  reads
                 supporting  the  second  allele and # reads containing indels
                 different from the top two alleles.

                 OPTIONS:


                 -s        Print the mapping quality as the last column.  This
                           option  makes  the output easier to parse, although
                           this format is not space efficient.


                 -S        The input file is in SAM.


                 -i        Only output pileup lines containing indels.


                 -f FILE   The reference sequence in the FASTA  format.  Index
                           file FILE.fai will be created if absent.


                 -M INT    Cap mapping quality at INT [60]


                 -t FILE   List  of  reference  names ane sequence lengths, in
                           the format described for  the  import  command.  If
                           this  option is present, samtools assumes the input
                           <in.alignment>  is  in  SAM  format;  otherwise  it
                           assumes in BAM format.


                 -l FILE   List  of sites at which pileup is output. This file
                           is space  delimited.  The  first  two  columns  are
                           required  to  be chromosome and 1-based coordinate.
                           Additional columns are ignored. It  is  recommended
                           to use option -s together with -l as in the default
                           format we may not know the mapping quality.


                 -c        Call the consensus  sequence  using  MAQ  consensus
                           model. Options -T, -N, -I and -r are only effective
                           when -c or -g is in use.


                 -g        Generate genotype likelihood in  the  binary  GLFv3
                           format. This option suppresses -c, -i and -s.


                 -T FLOAT  The  theta parameter (error dependency coefficient)
                           in the maq consensus calling model [0.85]


                 -N INT    Number of haplotypes in the sample (>=2) [2]


                 -r FLOAT  Expected fraction of differences between a pair  of
                           haplotypes [0.001]


                 -I INT    Phred  probability  of an indel in sequencing/prep.
                           [40]



       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In  the
                 viewer,  press `?' for help and press `g' to check the align-
                 ment   start   from   a   region   in   the    format    like
                 `chr10:10,000,000'.



       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags from a
                 name-sorted alignment.


       rmdup     samtools rmdup <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs  have
                 identical  external  coordinates,  only  retain the pair with
                 highest mapping quality.  This command  ONLY  works  with  FR
                 orientation and requires ISIZE is correctly set.



       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>

                 Remove potential duplicates for single-ended reads. This com-
                 mand will treat all reads as single-ended even  if  they  are
                 paired in fact.



       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>

                 Generate  the  MD tag. If the MD tag is already present, this
                 command will give a warning if the MD tag generated  is  dif-
                 ferent from the existing tag.

                 OPTIONS:

                 -e      Convert  a  the  read base to = if it is identical to
                         the aligned reference base.  Indel  caller  does  not
                         support the = bases at the moment.



SAM FORMAT
       SAM  is  TAB-delimited.  Apart from the header lines, which are started
       with the `@' symbol, each alignment line consists of:


       +----+-------+----------------------------------------------------------+
       |Col | Field |                       Description                        |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIAGR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSistion                                   |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+

       Each bit in the FLAG field is defined as:


             +-------+--------------------------------------------------+
             | Flag  |                   Description                    |
             +-------+--------------------------------------------------+
             |0x0001 | the read is paired in sequencing                 |
             |0x0002 | the read is mapped in a proper pair              |
             |0x0004 | the query sequence itself is unmapped            |
             |0x0008 | the mate is unmapped                             |
             |0x0010 | strand of the query (1 for reverse)              |
             |0x0020 | strand of the mate                               |
             |0x0040 | the read is the first read in a pair             |
             |0x0080 | the read is the second read in a pair            |
             |0x0100 | the alignment is not primary                     |
             |0x0200 | the read fails platform/vendor quality checks    |
             |0x0400 | the read is either a PCR or an optical duplicate |
             +-------+--------------------------------------------------+

LIMITATIONS
       o Unaligned  words  used  in  bam_import.c,  bam_endian.h,  bam.c   and
         bam_aux.c.

       o CIGAR operation P is not properly handled at the moment.

       o In  merging,  the input files are required to have the same number of
         reference sequences. The requirement can  be  relaxed.  In  addition,
         merging  does  not reconstruct the header dictionaries automatically.
         Endusers have to provide the correct  header.  Picard  is  better  at
         merging.

       o Samtools' rmdup does not work for single-end data and does not remove
         duplicates across chromosomes. Picard is better.


AUTHOR
       Heng Li from the Sanger Institute wrote the C version of samtools.  Bob
       Handsaker from the Broad Institute implemented the BGZF library and Jue
       Ruan from Beijing Genomics Institute wrote the  RAZF  library.  Various
       people  in the 1000Genomes Project contributed to the SAM format speci-
       fication.


SEE ALSO
       Samtools website: <http://samtools.sourceforge.net>



samtools-0.1.6                 2 September 2009                    samtools(1)