comparison ezBAMQC/src/htslib/faidx.5 @ 0:dfa3745e5fd8
Uploaded
author | youngkim |
---|---|
date | Thu, 24 Mar 2016 17:12:52 -0400 |
parents | |
children |
-1:000000000000 | 0:dfa3745e5fd8 |
---|---|
1 '\" t | |
2 .TH faidx 5 "August 2013" "htslib" "Bioinformatics formats" | |
3 .SH NAME | |
4 faidx \- an index enabling random access to FASTA files | |
5 .\" | |
6 .\" Copyright (C) 2013 Genome Research Ltd. | |
7 .\" | |
8 .\" Author: John Marshall <jm18@sanger.ac.uk> | |
9 .\" | |
10 .\" Permission is hereby granted, free of charge, to any person obtaining a | |
11 .\" copy of this software and associated documentation files (the "Software"), | |
12 .\" to deal in the Software without restriction, including without limitation | |
13 .\" the rights to use, copy, modify, merge, publish, distribute, sublicense, | |
14 .\" and/or sell copies of the Software, and to permit persons to whom the | |
15 .\" Software is furnished to do so, subject to the following conditions: | |
16 .\" | |
17 .\" The above copyright notice and this permission notice shall be included in | |
18 .\" all copies or substantial portions of the Software. | |
19 .\" | |
20 .\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |
21 .\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |
22 .\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL | |
23 .\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |
24 .\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING | |
25 .\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER | |
26 .\" DEALINGS IN THE SOFTWARE. | |
27 .\" | |
28 .SH SYNOPSIS | |
29 .IR file.fa .fai, | |
30 .IR file.fasta .fai | |
31 .SH DESCRIPTION | |
32 Using an \fBfai index\fP file in conjunction with a FASTA file containing | |
33 reference sequences enables efficient access to arbitrary regions within | |
34 those reference sequences. | |
35 The index file typically has the same filename as the corresponding FASTA | |
36 file, with \fB.fai\fP appended. | |
37 .P | |
38 An \fBfai index\fP file is a text file consisting of lines each with | |
39 five TAB-delimited columns: | |
40 .TS | |
41 lbl. | |
42 NAME	Name of this reference sequence | |
43 LENGTH	Total length of this reference sequence, in bases | |
44 OFFSET	Offset within the FASTA file of this sequence's first base, in bytes | |
45 LINEBASES	The number of bases on each sequence line | |
46 LINEWIDTH	The number of bytes on each sequence line, including the line terminator | |
47 .TE | |
48 .P | |
49 The \fBNAME\fP and \fBLENGTH\fP columns contain the same | |
50 data as would appear in the \fBSN\fP and \fBLN\fP fields of a | |
51 SAM \fB@SQ\fP header for the same reference sequence. | |
52 .P | |
53 The \fBOFFSET\fP column contains the offset within the FASTA file, in bytes | |
54 starting from zero, of the first base of this reference sequence, i.e., of | |
55 the character following the newline at the end of the "\fB>\fP" header line. | |
56 The lines of an \fBfai index\fP file usually appear in the same order as the | |
57 reference sequences in the FASTA file, so \fB.fai\fP files are typically | |
58 sorted by this column. | |
59 .P | |
60 The \fBLINEBASES\fP column contains the number of bases in each of the sequence | |
61 lines that form the body of this reference sequence, apart from the final line | |
62 which may be shorter. | |
63 The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of | |
64 the sequence lines (except perhaps the final line), thus differing from | |
65 \fBLINEBASES\fP in that it also counts the bytes forming the line terminator. | |
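The five columns are enough to translate a base position into a byte position in the FASTA file. The following is a minimal Python sketch of that arithmetic, not code taken from htslib; the names FaiRecord, parse_fai_line, and byte_offset are hypothetical, and base positions are assumed to be zero-based.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    """One line of an fai index (hypothetical helper type)."""
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int


def parse_fai_line(line: str) -> FaiRecord:
    """Split one TAB-delimited index line into its five columns."""
    name, length, offset, linebases, linewidth = line.rstrip("\r\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset(rec: FaiRecord, pos: int) -> int:
    """Return the file offset, in bytes, of the zero-based base `pos`."""
    if not 0 <= pos < rec.length:
        raise IndexError(f"position {pos} is outside {rec.name} (length {rec.length})")
    # Each complete line before `pos` costs LINEWIDTH bytes (bases plus
    # terminator); the remainder is the column within the current line.
    full_lines, column = divmod(pos, rec.linebases)
    return rec.offset + full_lines * rec.linewidth + column
```

For a record with OFFSET 5, LINEBASES 30, and LINEWIDTH 31, base 30 is the first base of the second sequence line, so the sketch places it at byte 5 + 31 = 36.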
66 .SS FASTA Files | |
67 In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text | |
68 file of the form | |
69 .LP | |
70 .RS | |
71 .RI > name | |
72 .RI [ description ...] | |
73 .br | |
74 ATGCATGCATGCATGCATGCATGCATGCAT | |
75 .br | |
76 GCATGCATGCATGCATGCATGCATGCATGC | |
77 .br | |
78 ATGCAT | |
79 .br | |
80 .RI > name | |
81 .RI [ description ...] | |
82 .br | |
83 ATGCATGCATGCAT | |
84 .br | |
85 GCATGCATGCATGC | |
86 .br | |
87 [...] | |
88 .RE | |
89 .LP | |
90 In particular, each reference sequence must be "well-formatted", i.e., all | |
91 of its sequence lines must be the same length, apart from the final sequence | |
92 line which may be shorter. | |
93 (While this sequence line length must be the same within each sequence, | |
94 it may vary between different reference sequences in the same FASTA file.) | |
95 .P | |
96 This also means that although the FASTA file may use Unix-style, Windows-style, | |
97 or other line termination, the line terminators must be consistent, | |
98 at least within each reference sequence. | |
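The two requirements above (equal line lengths except for the final line, and consistent line termination within each sequence) can be checked before indexing. This is a rough Python sketch under stated assumptions, not the check that samtools performs: the function names are hypothetical, a missing terminator is tolerated only on the last line of the file, and a blank line is treated as a short line that ends the sequence.

```python
from typing import Tuple


def split_terminator(raw: bytes) -> Tuple[bytes, bytes]:
    """Separate a line into its content and its trailing CR/LF bytes."""
    body = raw.rstrip(b"\r\n")
    return body, raw[len(body):]


def check_well_formatted(fasta: bytes) -> None:
    """Raise ValueError if a sequence has uneven lines or mixed terminators."""
    name = "(no header)"
    bases = None    # bases per line, taken from the first sequence line
    term = None     # line terminator, taken from the first sequence line
    ended = False   # True once a short (final) line has been seen
    for raw in fasta.splitlines(keepends=True):
        body, end = split_terminator(raw)
        if body.startswith(b">"):
            # A header starts a new sequence, so the per-sequence state resets.
            name = body[1:].decode(errors="replace")
            bases, term, ended = None, None, False
            continue
        if ended:
            raise ValueError(f"{name}: sequence line follows a shorter line")
        if bases is None:
            bases, term = len(body), end
        if end not in (term, b""):
            raise ValueError(f"{name}: inconsistent line terminator")
        if len(body) > bases:
            raise ValueError(f"{name}: sequence line longer than the first line")
        if len(body) < bases or end == b"":
            ended = True
```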
99 .P | |
100 The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header | |
101 line text (i.e., up to the first whitespace character) as the \fBNAME\fP column. | |
102 At present, no whitespace is permitted between the | |
103 ">" character and the \fIname\fP. | |
104 .SH EXAMPLE | |
105 For example, given this FASTA file | |
106 .LP | |
107 .RS | |
108 >one | |
109 .br | |
110 ATGCATGCATGCATGCATGCATGCATGCAT | |
111 .br | |
112 GCATGCATGCATGCATGCATGCATGCATGC | |
113 .br | |
114 ATGCAT | |
115 .br | |
116 >two another chromosome | |
117 .br | |
118 ATGCATGCATGCAT | |
119 .br | |
120 GCATGCATGCATGC | |
121 .br | |
122 .RE | |
123 .LP | |
124 formatted with Unix-style (LF) line termination, the corresponding fai index | |
125 would be | |
126 .RS | |
127 .TS | |
128 lnnnn. | |
129 one	66	5	30	31 | |
130 two	28	98	14	15 | |
131 .TE | |
132 .RE | |
133 .LP | |
134 If the FASTA file were formatted with Windows-style (CR-LF) line termination, | |
135 the fai index would be | |
136 .RS | |
137 .TS | |
138 lnnnn. | |
139 one	66	6	30	32 | |
140 two	28	103	14	16 | |
141 .TE | |
142 .RE | |
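The index values in this example can be reproduced with a short script. The following Python sketch is an illustration only and is not how samtools faidx is implemented; build_fai is a hypothetical name, and the input is assumed to be well-formatted, with a non-empty name on every header line.

```python
from typing import List, Tuple


def build_fai(fasta: bytes) -> List[Tuple[str, int, int, int, int]]:
    """Return (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) rows for `fasta`."""
    records = []
    current = None   # mutable [name, length, offset, linebases, linewidth]
    position = 0     # byte offset of the line being examined
    for raw in fasta.splitlines(keepends=True):
        if raw.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            # NAME is the first word after ">"; OFFSET is the byte that
            # follows the header line's terminator.
            name = raw[1:].split()[0].decode()
            current = [name, 0, position + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if current[3] == 0:
                # The first sequence line defines LINEBASES and LINEWIDTH.
                current[3], current[4] = bases, len(raw)
            current[1] += bases
        position += len(raw)
    if current is not None:
        records.append(tuple(current))
    return records
```

Applied to the two-sequence file above with LF termination, the sketch yields the rows of the first table; replacing every LF with CR-LF yields the rows of the second.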
143 .SH SEE ALSO | |
144 .IR samtools (1) | |
145 .TP | |
146 http://en.wikipedia.org/wiki/FASTA_format | |
147 Further description of the FASTA format |