annotate corebio/ssearch_io/__init__.py @ 15:981eb8c3a756 default tip

Uploaded
author davidmurphy
date Sat, 31 Mar 2012 16:07:07 -0400
parents c55bdc2fb9fa
children
rev   line source
0
c55bdc2fb9fa Uploaded
davidmurphy
parents:
diff changeset

# Copyright (c) 2006 John Gilman
#
# This software is distributed under the MIT Open Source License.
# <http://www.opensource.org/licenses/mit-license.html>
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

""" Parse the output of BLAST and similar sequence search analysis reports.

The result of a sequence database search is represented by the Report class.
    o Each Report contains one or more Results, one for each database query.
    o Each Result contains one or more Hits.
    o Each Hit may contain one or more Alignments (high-scoring segment pairs).

CoreBio is often capable of guessing the correct format:
>>> from corebio import ssearch_io
>>> afile = open("test_corebio/data/ssearch/ssearch_out.txt")
>>> report = ssearch_io.read(afile)
>>> print(report)

Alternatively, each report type has a separate module. Each module defines a
read(fin) function that can parse that report format.

>>> from corebio.ssearch_io import fasta
>>> report = fasta.read(open("test_corebio/data/ssearch/ssearch_out.txt"))
>>> print(report)

    Module          Application         Comments
    ---------------------------------------------------------------------------
    fasta           FASTA / SSEARCH     Default (-m 1) or compact (-m 9 -d 0)
    blastxml        NCBI Blast          NCBI XML format

Status: Beta
"""
# Dev. References:
#   Inspired by BioPerl's SearchIO system
#       http://www.bioperl.org/wiki/HOWTO:SearchIO

__all__ = ['read', 'Report', 'Result',
           'Hit', 'Annotation', 'Alignment']


from corebio.utils import stdrepr

def read(fin):
    """ Read and parse an analysis report.

    returns :
        A database search Report.
    raises :
        ValueError - If the file cannot be parsed
    """
    # Imported here, using the package path already shown in the module
    # docstring, so that the parser modules can import this module in turn.
    from corebio.ssearch_io import fasta, blastxml

    parsers = (fasta, blastxml)
    for p in parsers:
        try:
            return p.read(fin)
        except ValueError:
            # This parser did not recognise the format; try the next one.
            pass
        # Rewind so the next parser sees the input from the beginning.
        fin.seek(0)  # FIXME: Non-seekable stdin?

    raise ValueError("Cannot parse sequence file: Tried fasta and blastxml")


class Report(object):
    """The results of a database search. The Report contains a list of one or
    more Results, one for each query. Each query Result contains a list of Hits.
    Each Hit contains a list of HSPs (high-scoring segment pairs).

    The structure of the report will vary somewhat depending on the source.

    algorithm           -- e.g. 'BLASTX'
    algorithm_version   -- e.g. '2.2.4 [Aug-26-2002]'
    algorithm_reference --
    database_name       -- e.g. 'test.fa'
    database_letters    -- number of residues in database, e.g. 1291
    database_entries    -- number of database entries

    parameters          -- Dictionary of parameters used in search

    results             -- A list of Results, one per query
    """
    __slots__ = ['algorithm', 'algorithm_version', 'algorithm_reference',
                 'database_name', 'database_letters', 'database_entries',
                 'parameters', 'results']

    def __init__(self):
        # Default every slot to None, then give the containers fresh instances.
        for name in self.__slots__:
            setattr(self, name, None)
        self.parameters = {}
        self.results = []

    def __repr__(self):
        return stdrepr(self)


class Result(object):
    """ The result from searching a database with a single query sequence.

    query       -- Information about the query sequence
    statistics  -- A dictionary of search statistics
    hits        -- A list of Hits
    """
    __slots__ = ['query', 'statistics', 'hits']

    def __init__(self):
        for name in self.__slots__:
            setattr(self, name, None)
        self.query = Annotation()
        self.statistics = {}
        self.hits = []

    def __repr__(self):
        return stdrepr(self)


class Hit(object):
    """ A search hit between a query sequence and a subject sequence.
    Each hit may have one or more Alignments.

    target       -- Information about the target sequence.
    raw_score    -- Typically the significance of the hit in bits, e.g. 92.0
    significance -- Typically the E-value, e.g. '2e-022'
    alignments   -- A list of alignments between query and target
    """
    __slots__ = ['target', 'raw_score', 'bit_score', 'significance',
                 'alignments']

    def __init__(self):
        for name in self.__slots__:
            setattr(self, name, None)
        self.target = Annotation()
        self.alignments = []

    def __repr__(self):
        return stdrepr(self)

class Annotation(object):
    """ Information about a subject or query sequence.

    name        -- subject sequence name, e.g. '443893|124775'
    description -- e.g. 'LaForas sequence'
    length      -- subject sequence length, e.g. 331
    locus       -- e.g. '124775'
    accession   -- e.g. '443893'
    """
    # FIXME: change into generic sequence annotation class?
    __slots__ = ['name', 'description', 'length', 'locus', 'accession']

    def __init__(self):
        for name in self.__slots__:
            setattr(self, name, None)

    def __repr__(self):
        return stdrepr(self)

class Alignment(object):
    """An alignment between query and subject sequences.
    For BLAST, these are high-scoring segment pairs (HSPs).

    raw_score    -- Typically the significance of the hit in bits, e.g. 92.0
    significance -- Typically the E-value, e.g. '2e-022'

    similar      -- number of conserved residues  # FIXME: either fraction or number
    identical    -- number of identical residues
    gaps         -- number of gaps
    length       -- length of the alignment

    query_seq    -- query string from alignment
    target_seq   -- hit string from alignment
    mid_seq      --

    query_start  --
    query_frame  --

    target_start --
    target_frame --

    """
    __slots__ = ['raw_score', 'bit_score', 'significance', 'similar',
                 'identical', 'gaps', 'length', 'query_seq', 'target_seq',
                 'mid_seq', 'query_start', 'query_frame', 'target_start',
                 'target_frame']

    def __init__(self):
        for name in self.__slots__:
            setattr(self, name, None)

    def __repr__(self):
        return stdrepr(self)
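
The Report, Result, Hit and Alignment classes above form a simple nested hierarchy, so a parsed report can be walked with three nested loops. The following is a minimal sketch of such a traversal, not part of the module itself. The `Record` class and the `summarize` function are hypothetical stand-ins introduced only so the example runs without corebio installed; the attribute names (`results`, `hits`, `alignments`, `query`, `target`, `name`, `identical`, `length`) follow the docstrings above, and the sketch assumes `identical` holds a residue count rather than a fraction.

```python
class Record(object):
    """Hypothetical stand-in for the classes above: keyword args become attributes."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def summarize(report):
    """Return one (query name, target name, fraction identical) tuple per alignment."""
    rows = []
    for result in report.results:          # one Result per query
        for hit in result.hits:            # one Hit per matched target sequence
            for aln in hit.alignments:     # one Alignment (HSP) per aligned region
                # Assumes 'identical' is a count of residues, not a fraction.
                fraction = float(aln.identical) / aln.length
                rows.append((result.query.name, hit.target.name, fraction))
    return rows


# Build a tiny report by hand: one query, one hit, one alignment.
aln = Record(identical=45, length=90)
hit = Record(target=Record(name="443893|124775"), alignments=[aln])
result = Record(query=Record(name="query1"), hits=[hit])
report = Record(results=[result])

rows = summarize(report)
print(rows)
```

With a real report returned by `ssearch_io.read`, the same loops should apply, although some attributes may be None depending on which report format was parsed.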
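
The read() function in this module guesses the format by trying each parser in turn and rewinding the input with seek(0) between attempts, which its own FIXME notes may not work for non-seekable input such as stdin. One possible way around that, sketched here under stated assumptions and written for Python 3, is to read the input once and hand each parser a fresh in-memory copy. The names `read_with_fallback`, `parse_int` and `parse_words` are hypothetical, and the parsers are plain callables rather than the `fasta` and `blastxml` modules.

```python
import io


def read_with_fallback(fin, parsers):
    """Try each parser on the same text; raise ValueError if none succeeds."""
    data = fin.read()  # buffer once, so the input never has to be rewound
    for parse in parsers:
        try:
            return parse(io.StringIO(data))
        except ValueError:
            continue  # this parser rejected the input; try the next one
    raise ValueError("Cannot parse input: tried %d parsers" % len(parsers))


def parse_int(stream):
    """Toy parser: accepts input that is a single integer."""
    return int(stream.read())  # int() raises ValueError on anything else


def parse_words(stream):
    """Toy parser: accepts any non-empty, whitespace-separated text."""
    words = stream.read().split()
    if not words:
        raise ValueError("empty input")
    return words


# parse_int fails on this input, so parse_words gets a fresh copy of the text.
result = read_with_fallback(io.StringIO("hello world"), [parse_int, parse_words])
print(result)
```

The trade-off is memory: the whole report is held as a string, which may be unsuitable for very large search outputs.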