diff ezBAMQC/src/htslib/faidx.5 @ 0:dfa3745e5fd8

Uploaded
author youngkim
date Thu, 24 Mar 2016 17:12:52 -0400
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ezBAMQC/src/htslib/faidx.5	Thu Mar 24 17:12:52 2016 -0400
@@ -0,0 +1,147 @@
+'\" t
+.TH faidx 5 "August 2013" "htslib" "Bioinformatics formats"
+.SH NAME
+faidx \- an index enabling random access to FASTA files
+.\"
+.\" Copyright (C) 2013 Genome Research Ltd.
+.\"
+.\" Author: John Marshall <jm18@sanger.ac.uk>
+.\"
+.\" Permission is hereby granted, free of charge, to any person obtaining a
+.\" copy of this software and associated documentation files (the "Software"),
+.\" to deal in the Software without restriction, including without limitation
+.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
+.\" and/or sell copies of the Software, and to permit persons to whom the
+.\" Software is furnished to do so, subject to the following conditions:
+.\"
+.\" The above copyright notice and this permission notice shall be included in
+.\" all copies or substantial portions of the Software.
+.\"
+.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+.\" DEALINGS IN THE SOFTWARE.
+.\"
+.SH SYNOPSIS
+.IR file.fa .fai,
+.IR file.fasta .fai
+.SH DESCRIPTION
+Using an \fBfai index\fP file in conjunction with a FASTA file containing
+reference sequences enables efficient access to arbitrary regions within
+those reference sequences.
+The index file typically has the same filename as the corresponding FASTA
+file, with \fB.fai\fP appended.
+.P
+An \fBfai index\fP file is a text file consisting of lines each with
+five TAB-delimited columns:
+.TS
+lbl.
+NAME	Name of this reference sequence
+LENGTH	Total length of this reference sequence, in bases
+OFFSET	Offset within the FASTA file of this sequence's first base
+LINEBASES	The number of bases on each line
+LINEWIDTH	The number of bytes in each line, including the newline
+.TE
+.P
+The \fBNAME\fP and \fBLENGTH\fP columns contain the same
+data as would appear in the \fBSN\fP and \fBLN\fP fields of a
+SAM \fB@SQ\fP header for the same reference sequence.
+.P
+The \fBOFFSET\fP column contains the offset within the FASTA file, in bytes
+starting from zero, of the first base of this reference sequence, i.e., of
+the character following the newline at the end of the "\fB>\fP" header line.
+Typically the lines of a \fBfai index\fP file appear in the order in which the
+reference sequences appear in the FASTA file, so \fB.fai\fP files are typically
+sorted according to this column.
+.P
+The \fBLINEBASES\fP column contains the number of bases in each of the sequence
+lines that form the body of this reference sequence, apart from the final line
+which may be shorter.
+The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of
+the sequence lines (except perhaps the final line), thus differing from
+\fBLINEBASES\fP in that it also counts the bytes forming the line terminator.
+.SS FASTA Files
+In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text
+file of the form
+.LP
+.RS
+.RI > name
+.RI [ description ...]
+.br
+ATGCATGCATGCATGCATGCATGCATGCAT
+.br
+GCATGCATGCATGCATGCATGCATGCATGC
+.br
+ATGCAT
+.br
+.RI > name
+.RI [ description ...]
+.br
+ATGCATGCATGCAT
+.br
+GCATGCATGCATGC
+.br
+[...]
+.RE
+.LP
+In particular, each reference sequence must be "well-formatted", i.e., all
+of its sequence lines must be the same length, apart from the final sequence
+line which may be shorter.
+(While this sequence line length must be the same within each sequence,
+it may vary between different reference sequences in the same FASTA file.)
+.P
+This also means that although the FASTA file may have Unix- or Windows-style
+or other line termination, the newline characters present must be consistent,
+at least within each reference sequence.
+.P
+The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header
+line text (i.e., up to the first whitespace character) as the \fBNAME\fP column.
+At present, there may be no whitespace between the
+">" character and the \fIname\fP.
+.SH EXAMPLE
+For example, given this FASTA file
+.LP
+.RS
+>one
+.br
+ATGCATGCATGCATGCATGCATGCATGCAT
+.br
+GCATGCATGCATGCATGCATGCATGCATGC
+.br
+ATGCAT
+.br
+>two another chromosome
+.br
+ATGCATGCATGCAT
+.br
+GCATGCATGCATGC
+.br
+.RE
+.LP
+formatted with Unix-style (LF) line termination, the corresponding fai index
+would be
+.RS
+.TS
+lnnnn.
+one	66	5	30	31
+two	28	98	14	15
+.TE
+.RE
+.LP
+If the FASTA file were formatted with Windows-style (CR-LF) line termination,
+the fai index would be
+.RS
+.TS
+lnnnn.
+one	66	6	30	32
+two	28	103	14	16
+.TE
+.RE
+.SH SEE ALSO
+.IR samtools (1)
+.TP
+http://en.wikipedia.org/wiki/FASTA_format
+Further description of the FASTA format