'\" t
.TH faidx 5 "August 2013" "htslib" "Bioinformatics formats"
.SH NAME
faidx \- an index enabling random access to FASTA files
.\"
.\" Copyright (C) 2013 Genome Research Ltd.
.\"
.\" Author: John Marshall <jm18@sanger.ac.uk>
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.SH SYNOPSIS
.IR file.fa .fai,
.IR file.fasta .fai
.SH DESCRIPTION
Using an \fBfai index\fP file in conjunction with a FASTA file containing
reference sequences enables efficient access to arbitrary regions within
those reference sequences.
The index file typically has the same filename as the corresponding FASTA
file, with \fB.fai\fP appended.
.P
An \fBfai index\fP file is a text file consisting of lines each with
five TAB-delimited columns:
.TS
lbl.
NAME	Name of this reference sequence
LENGTH	Total length of this reference sequence, in bases
OFFSET	Offset within the FASTA file of this sequence's first base
LINEBASES	The number of bases on each line
LINEWIDTH	The number of bytes in each line, including the newline
.TE
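.P
As an illustration of the five-column layout above, the following minimal
Python sketch parses one index line into a record. The names FaiRecord and
parse_fai_line are hypothetical and are not part of samtools or htslib; the
sketch assumes exactly five TAB-delimited columns as described here.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    """One line of an fai index (hypothetical helper type)."""
    name: str
    length: int      # total bases in the sequence
    offset: int      # byte offset of the first base
    linebases: int   # bases per sequence line
    linewidth: int   # bytes per sequence line, terminator included


def parse_fai_line(line: str) -> FaiRecord:
    """Split one TAB-delimited index line into its five columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-delimited columns, got {len(fields)}")
    name, *numbers = fields
    length, offset, linebases, linewidth = (int(x) for x in numbers)
    return FaiRecord(name, length, offset, linebases, linewidth)


record = parse_fai_line("one\t66\t5\t30\t31\n")
print(record)
# FaiRecord(name='one', length=66, offset=5, linebases=30, linewidth=31)
```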
.P
The \fBNAME\fP and \fBLENGTH\fP columns contain the same
data as would appear in the \fBSN\fP and \fBLN\fP fields of a
SAM \fB@SQ\fP header for the same reference sequence.
.P
The \fBOFFSET\fP column contains the offset within the FASTA file, in bytes
starting from zero, of the first base of this reference sequence, i.e., of
the character following the newline at the end of the "\fB>\fP" header line.
Typically the lines of a \fBfai index\fP file appear in the order in which the
reference sequences appear in the FASTA file, so \fB.fai\fP files are typically
sorted according to this column.
.P
The \fBLINEBASES\fP column contains the number of bases in each of the sequence
lines that form the body of this reference sequence, apart from the final line
which may be shorter.
The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of
the sequence lines (except perhaps the final line), thus differing from
\fBLINEBASES\fP in that it also counts the bytes forming the line terminator.
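.P
The three numeric columns OFFSET, LINEBASES and LINEWIDTH are what make random
access possible: the file position of any base follows from simple arithmetic.
The sketch below shows one way to do that calculation in Python. The function
names byte_offset and fetch are hypothetical, positions are assumed to be
0-based, the region is assumed to be half-open, and LINEBASES is assumed to be
greater than zero. It illustrates the arithmetic only and is not the samtools
implementation.

```python
def byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """File offset of the 0-based base `pos` within one reference sequence."""
    full_lines = pos // linebases          # complete sequence lines skipped
    return offset + full_lines * linewidth + pos % linebases


def fetch(fasta: bytes, length: int, offset: int, linebases: int,
          linewidth: int, start: int, end: int) -> str:
    """Return bases [start, end) of one sequence from raw FASTA bytes."""
    if not 0 <= start <= end <= length:
        raise ValueError("region lies outside the sequence")
    if start == end:
        return ""
    first = byte_offset(offset, linebases, linewidth, start)
    last = byte_offset(offset, linebases, linewidth, end - 1) + 1
    chunk = fasta[first:last]
    # Drop any line terminators that fall inside the requested span.
    return chunk.replace(b"\r", b"").replace(b"\n", b"").decode("ascii")


fasta = (b">one\n"
         b"ATGCATGCATGCATGCATGCATGCATGCAT\n"
         b"GCATGCATGCATGCATGCATGCATGCATGC\n"
         b"ATGCAT\n"
         b">two another chromosome\n"
         b"ATGCATGCATGCAT\n"
         b"GCATGCATGCATGC\n")

# Index columns for "one": LENGTH=66 OFFSET=5 LINEBASES=30 LINEWIDTH=31
print(fetch(fasta, 66, 5, 30, 31, 28, 32))   # ATGC (spans a line break)
```
.P
In a real program the bytes would be read with a seek to the computed offset
rather than from an in-memory copy of the whole file.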
.SS FASTA Files
In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text
file of the form
.LP
.RS
.RI > name
.RI [ description ...]
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
.RI > name
.RI [ description ...]
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
[...]
.RE
.LP
In particular, each reference sequence must be "well-formatted", i.e., all
of its sequence lines must be the same length, apart from the final sequence
line which may be shorter.
(While this sequence line length must be the same within each sequence,
it may vary between different reference sequences in the same FASTA file.)
.P
This also means that although the FASTA file may have Unix- or Windows-style
or other line termination, the newline characters present must be consistent,
at least within each reference sequence.
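.P
The two constraints above (uniform line length and consistent line
termination within a sequence) can be checked mechanically. The following
Python sketch is one possible check, not the test that samtools performs; the
names split_terminator and is_well_formatted are hypothetical, and the input
is assumed to be the raw sequence lines of a single reference sequence with
their terminators still attached. Allowing the final line to lack a terminator
is also an assumption of this sketch.

```python
def split_terminator(line: bytes) -> tuple[bytes, bytes]:
    """Separate the bases of a raw line from its trailing CR/LF bytes."""
    bases = line.rstrip(b"\r\n")
    return bases, line[len(bases):]


def is_well_formatted(lines: list[bytes]) -> bool:
    """Check one sequence: equal-length lines, one terminator style."""
    if not lines:
        return True
    first_bases, first_term = split_terminator(lines[0])
    # Every line except the last must match the first line exactly in
    # base count and in terminator bytes.
    for line in lines[1:-1]:
        bases, term = split_terminator(line)
        if len(bases) != len(first_bases) or term != first_term:
            return False
    if len(lines) > 1:
        bases, term = split_terminator(lines[-1])
        # The final line may be shorter, never longer, and may lack a
        # terminator at end of file.
        if len(bases) > len(first_bases) or term not in (b"", first_term):
            return False
    return True


print(is_well_formatted([b"ATGC\n", b"ATGC\n", b"AT\n"]))      # True
print(is_well_formatted([b"ATGC\n", b"AT\n", b"ATGC\n"]))      # False
print(is_well_formatted([b"ATGC\n", b"ATGC\r\n", b"AT\n"]))    # False
```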
.P
The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header
line text (i.e., up to the first whitespace character) as the \fBNAME\fP column.
At present, whitespace is not permitted between the
">" character and the \fIname\fP.
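.P
The naming rule can be sketched in a few lines of Python. The function name
sequence_name is hypothetical, and raising an error for a header with
whitespace directly after the ">" is a choice made for this sketch rather
than documented samtools behaviour.

```python
def sequence_name(header_line: str) -> str:
    """Return the first word after '>' in a FASTA header line."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line")
    text = header_line[1:]
    if not text or text[0].isspace():
        raise ValueError("the name must directly follow the '>' character")
    # split(None, 1) cuts at the first run of whitespace of any kind.
    return text.split(None, 1)[0]


print(sequence_name(">two another chromosome\n"))   # two
```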
.SH EXAMPLE
For example, given this FASTA file
.LP
.RS
>one
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
>two another chromosome
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
.RE
.LP
formatted with Unix-style (LF) line termination, the corresponding fai index
would be
.RS
.TS
lnnnn.
one	66	5	30	31
two	28	98	14	15
.TE
.RE
.LP
If the FASTA file were formatted with Windows-style (CR-LF) line termination,
the fai index would be
.RS
.TS
lnnnn.
one	66	6	30	32
two	28	103	14	16
.TE
.RE
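.LP
Both index tables in this section can be reproduced with a short Python
sketch that walks the FASTA bytes once. The function name build_fai is
hypothetical; the sketch assumes the input is already well-formatted and that
every header carries a name, so it performs no validation, and it is not the
samtools implementation.

```python
def build_fai(data: bytes) -> list[tuple[str, int, int, int, int]]:
    """Compute (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) per sequence."""
    records = []
    current = None   # [name, length, offset, linebases, linewidth]
    pos = 0          # byte offset of the start of the current line
    for line in data.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            name = line[1:].split(None, 1)[0].decode("ascii")
            # The first base sits just past the header's line terminator.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:
                # The first sequence line fixes LINEBASES and LINEWIDTH.
                current[3], current[4] = bases, len(line)
            current[1] += bases
        pos += len(line)
    if current is not None:
        records.append(tuple(current))
    return records


lf = (b">one\n"
      b"ATGCATGCATGCATGCATGCATGCATGCAT\n"
      b"GCATGCATGCATGCATGCATGCATGCATGC\n"
      b"ATGCAT\n"
      b">two another chromosome\n"
      b"ATGCATGCATGCAT\n"
      b"GCATGCATGCATGC\n")
crlf = lf.replace(b"\n", b"\r\n")

for name, *columns in build_fai(lf) + build_fai(crlf):
    print(name, *columns, sep="\t")
```
.LP
The first two lines printed match the LF table and the last two match the
CR-LF table: only OFFSET and LINEWIDTH change, because they count bytes
rather than bases.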
.SH SEE ALSO
.IR samtools (1)
.TP
http://en.wikipedia.org/wiki/FASTA_format
Further description of the FASTA format