annotate corebio/data.py @ 13:cd6c4bd14718

Uploaded
author davidmurphy
date Fri, 24 Feb 2012 09:26:11 -0500
parents c55bdc2fb9fa
# Copyright (c) 2006, The Regents of the University of California, through
# Lawrence Berkeley National Laboratory (subject to receipt of any required
# approvals from the U.S. Dept. of Energy). All rights reserved.

# This software is distributed under the new BSD Open Source License.
# <http://www.opensource.org/licenses/bsd-license.html>
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
# (1) Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
#
# (2) Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and or other materials provided with the distribution.
#
# (3) Neither the name of the University of California, Lawrence Berkeley
# National Laboratory, U.S. Dept. of Energy nor the names of its contributors
# may be used to endorse or promote products derived from this software
# without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

"""
Standard information used in computational biology.


To convert a property dictionary to a list:
>>> comp = [ amino_acid_composition[k] for k in amino_acid_letters]


Resources: (Various standard data files.)


BLOSUM Scoring Matrices
    Source: ftp://ftp.ncbi.nih.gov/repository/blocks/unix/blosum
    These are all new blast style with 1/3 bit scaling
    - blosum35
    - blosum45
    - blosum62
    - blosum40
    - blosum50
    - blosum80
    - blosum100

Other substitution scoring matrices:
    - dist20_comp
    - pam250
    - pam120


Status: Beta (Data needs to be proof checked.)
"""
# TODO: add this datafile?
# Description of database cross references :
# - dbxref.txt (http://www.expasy.org/cgi-bin/lists?dbxref.txt)


# FIXME: Move documentation of data to docstring above. docstrings
# after variables don't work.


# The ExPASy ProtScale tool is a great source of amino acid properties.
# http://au.expasy.org/cgi-bin/protscale.pl

from StringIO import StringIO
from corebio._future import resource_string, resource_stream, resource_filename
from corebio import utils

# Explicitly list the set of available data resources. We want to be able to
# access these resources in, for example, a webapp, without inadvertently
# allowing unrestricted read access to the local file system.

resource_names = [
    'blosum35',
    'blosum45',
    'blosum62',
    'blosum40',
    'blosum50',
    'blosum80',
    'blosum100',
    'dist20_comp',
    'pam250',
    'pam120',
    ]

_resource_filenames = {
    'blosum35': 'data/blosum35.mat',
    'blosum45': 'data/blosum45.mat',
    'blosum62': 'data/blosum62.mat',
    'blosum40': 'data/blosum40.mat',
    'blosum50': 'data/blosum50.mat',
    'blosum80': 'data/blosum80.mat',
    'blosum100': 'data/blosum100.mat',
    'dist20_comp': 'data/dist20_comp.mat',
    'pam250': 'data/pam250.mat',
    'pam120': 'data/pam120.mat',
    }

# TODO: Substitution matrix parser, SeqMatrix.read
_resource_parsers = {}

def data_string( name ):
    """Return the contents of the named data resource as a string."""
    fn = _resource_filenames[name]
    return resource_string(__name__, fn, __file__)

def data_stream( name ):
    """Return the named data resource as an open, file-like stream."""
    fn = _resource_filenames[name]
    return resource_stream(__name__, fn, __file__)

def data_filename( name ):
    """Return a file system path for the named data resource."""
    fn = _resource_filenames[name]
    return resource_filename(__name__, fn, __file__)

def data_object( name, parser = None) :
    """Return the named data resource, passed through a parser."""
    # Use the supplied parser, else the parser registered for this
    # resource, else fall back to str.
    if parser is None :
        if name in _resource_parsers :
            parser = _resource_parsers[name]
        else :
            parser = str
    return parser( data_stream(name) )
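
# The whitelist idea behind resource_names and _resource_filenames can be
# sketched on its own. This is a minimal illustration, not the module's
# code: the in-memory dictionary and the demo_* and is_available names are
# hypothetical stand-ins for the packaged .mat files and the data_*
# functions above. The point is only that a lookup keyed on a fixed set of
# names cannot be steered towards an arbitrary path.

```python
import io

# Hypothetical in-memory stand-ins for the packaged data files.
_demo_resources = {
    "blosum62": "# demo matrix text\n",
    "pam250": "# demo matrix text\n",
}


def is_available(name):
    """True if `name` is one of the explicitly listed resources."""
    return name in _demo_resources


def demo_data_string(name):
    """Return the text of a listed resource; unknown names raise KeyError."""
    return _demo_resources[name]


def demo_data_stream(name):
    """Return a file-like object over a listed resource."""
    return io.StringIO(demo_data_string(name))
```

# A caller (for example a web handler) can test is_available(name) first,
# or simply let the KeyError from an unlisted name propagate.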


amino_acid_letters = "ACDEFGHIKLMNPQRSTVWY"
"""Standard codes for the 20 canonical amino acids, in alphabetic order."""

amino_acid_alternative_letters = "ARNDCQEGHILKMFPSTWYV"
"""Amino acid one letter codes, alphabetic by three letter codes."""

amino_acid_extended_letters = "ACDEFGHIKLMNOPQRSTUVWYBJZX*-"


dna_letters = "GATC"
dna_extended_letters = "GATCRYWSMKHBVDN"

rna_letters = "GAUC"
rna_extended_letters = "GAUCRYWSMKHBVDN"


dna_ambiguity = {
    "A": "A",
    "C": "C",
    "G": "G",
    "T": "T",
    "M": "AC",
    "R": "AG",
    "W": "AT",
    "S": "CG",
    "Y": "CT",
    "K": "GT",
    "V": "ACG",
    "H": "ACT",
    "D": "AGT",
    "B": "CGT",
    "X": "GATC",
    "N": "GATC",
}

rna_ambiguity = {
    "A": "A",
    "C": "C",
    "G": "G",
    "U": "U",
    "M": "AC",
    "R": "AG",
    "W": "AU",
    "S": "CG",
    "Y": "CU",
    "K": "GU",
    "V": "ACG",
    "H": "ACU",
    "D": "AGU",
    "B": "CGU",
    "X": "GAUC",
    "N": "GAUC",
}
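
# One possible use of an ambiguity table is to turn a motif written with
# ambiguity codes into a regular expression. The sketch below repeats
# dna_ambiguity so that it runs on its own; the ambiguity_regex helper is
# hypothetical and not part of this module.

```python
import re

# Copy of the dna_ambiguity table above.
dna_ambiguity = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "GATC", "N": "GATC",
}


def ambiguity_regex(motif, ambiguity=dna_ambiguity):
    """Compile a motif with ambiguity codes into a regular expression.

    Each code becomes a character class listing the bases it stands for;
    unambiguous letters are emitted as-is. Unknown letters raise KeyError.
    """
    parts = []
    for letter in motif.upper():
        bases = ambiguity[letter]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))


site = ambiguity_regex("GANTC")
```

# Here site.pattern is "GA[GATC]TC", so site.search("TTGACTCAA") finds a
# match while site.search("GATTA") does not.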

amino_acid_ambiguity = {
    "A": "A",
    "B": "ND",
    "C": "C",
    "D": "D",
    "E": "E",
    "F": "F",
    "G": "G",
    "H": "H",
    "I": "I",
    "K": "K",
    "L": "L",
    "M": "M",
    "N": "N",
    "P": "P",
    "Q": "Q",
    "R": "R",
    "S": "S",
    "T": "T",
    "V": "V",
    "W": "W",
    "X": "ACDEFGHIKLMNPQRSTVWY",
    "Y": "Y",
    "Z": "QE",
    "J": "IL",
    'U': 'U',
    'O': 'O',
}


# Monomer isotopically averaged molecular mass
# Data Checked GEC Nov 2006
amino_acid_mass = {
    "A": 89.09,
    "B": 132.66,  # Averaged proportional to amino_acid_composition
    "C": 121.16,
    "D": 133.10,
    "E": 147.13,
    "F": 165.19,
    "G": 75.07,
    "H": 155.16,
    "I": 131.18,
    "J": 131.18,
    "K": 146.19,
    "L": 131.18,
    "M": 149.21,
    "N": 132.12,
    # "O" : ???, # TODO
    "P": 115.13,
    "Q": 146.15,
    "R": 174.20,
    "S": 105.09,
    "T": 119.12,
    "U": 168.05,
    "V": 117.15,
    "W": 204.23,
    "X": 129.15,  # Averaged proportional to amino_acid_composition
    "Y": 181.19,
    "Z": 146.76,  # Averaged proportional to amino_acid_composition
}
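
# A rough sketch of how a monomer mass table might be used to estimate
# the mass of a linear peptide. It assumes the tabulated values are masses
# of the free amino acids (the alanine entry, 89.09, is consistent with
# that), so one water is subtracted per peptide bond. The peptide_mass
# helper and the WATER_MASS constant are illustrative assumptions, not
# part of this module.

```python
# Excerpt of the amino_acid_mass table above.
amino_acid_mass = {"A": 89.09, "G": 75.07, "S": 105.09, "K": 146.19}

# Assumption: average mass of water, rounded to two decimals like the table.
WATER_MASS = 18.02


def peptide_mass(sequence, masses=amino_acid_mass, water=WATER_MASS):
    """Estimate the average mass of a linear peptide.

    Sums the free amino acid masses and removes one water for each of the
    len(sequence) - 1 peptide bonds. An empty sequence has mass 0.0.
    """
    if not sequence:
        return 0.0
    total = sum(masses[letter] for letter in sequence)
    return total - (len(sequence) - 1) * water


glycylalanine = peptide_mass("GA")  # 75.07 + 89.09 - 18.02
```

# Letters missing from the table (such as "O", whose mass is still a TODO
# above) raise KeyError rather than being silently skipped.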

dna_mass = {
    "A": 347.,
    "C": 323.,
    "G": 363.,
    "T": 322.,
    }

rna_mass = {
    "A": 363.,
    "C": 319.,
    "G": 379.,
    "U": 340.,
}

one_to_three = {
    'A':'Ala', 'B':'Asx', 'C':'Cys', 'D':'Asp',
    'E':'Glu', 'F':'Phe', 'G':'Gly', 'H':'His',
    'I':'Ile', 'K':'Lys', 'L':'Leu', 'M':'Met',
    'N':'Asn', 'P':'Pro', 'Q':'Gln', 'R':'Arg',
    'S':'Ser', 'T':'Thr', 'V':'Val', 'W':'Trp',
    'Y':'Tyr', 'Z':'Glx', 'X':'Xaa',
    'U':'Sec', 'J':'Xle', 'O':'Pyl'
    }
""" Map between standard 1 letter amino acid codes and standard three letter codes.

Ref: http://www.ebi.ac.uk/RESID/faq.html
"""

standard_three_to_one = utils.invert_dict(one_to_three)
""" Map between standard three letter amino acid codes and standard one letter codes.

Ref: http://www.ebi.ac.uk/RESID/faq.html
"""


extended_three_to_one= {
    '2as':'D', '3ah':'H', '5hp':'E', 'Acl':'R', 'Agm':'R', 'Aib':'A', 'Ala':'A', 'Alm':'A',
    'Alo':'T', 'Aly':'K', 'Arg':'R', 'Arm':'R', 'Asa':'D', 'Asb':'D', 'Ask':'D', 'Asl':'D',
    'Asn':'N', 'Asp':'D', 'Asq':'D', 'Asx':'B', 'Aya':'A', 'Bcs':'C', 'Bhd':'D', 'Bmt':'T',
    'Bnn':'A', 'Buc':'C', 'Bug':'L', 'C5c':'C', 'C6c':'C', 'Ccs':'C', 'Cea':'C', 'Cgu':'E',
    'Chg':'A', 'Cle':'L', 'Cme':'C', 'Csd':'A', 'Cso':'C', 'Csp':'C', 'Css':'C', 'Csw':'C',
    'Csx':'C', 'Cxm':'M', 'Cy1':'C', 'Cy3':'C', 'Cyg':'C', 'Cym':'C', 'Cyq':'C', 'Cys':'C',
    'Dah':'F', 'Dal':'A', 'Dar':'R', 'Das':'D', 'Dcy':'C', 'Dgl':'E', 'Dgn':'Q', 'Dha':'A',
    'Dhi':'H', 'Dil':'I', 'Div':'V', 'Dle':'L', 'Dly':'K', 'Dnp':'A', 'Dpn':'F', 'Dpr':'P',
    'Dsn':'S', 'Dsp':'D', 'Dth':'T', 'Dtr':'W', 'Dty':'Y', 'Dva':'V', 'Efc':'C', 'Fla':'A',
    'Fme':'M', 'Ggl':'E', 'Gl3':'G', 'Gln':'Q', 'Glu':'E', 'Glx':'Z', 'Gly':'G', 'Glz':'G',
    'Gma':'E', 'Gsc':'G', 'Hac':'A', 'Har':'R', 'Hic':'H', 'Hip':'H', 'His':'H', 'Hmr':'R',
    'Hpq':'F', 'Htr':'W', 'Hyp':'P', 'Iil':'I', 'Ile':'I', 'Iyr':'Y', 'Kcx':'K', 'Leu':'L',
    'Llp':'K', 'Lly':'K', 'Ltr':'W', 'Lym':'K', 'Lys':'K', 'Lyz':'K', 'Maa':'A', 'Men':'N',
    'Met':'M', 'Mhs':'H', 'Mis':'S', 'Mle':'L', 'Mpq':'G', 'Msa':'G', 'Mse':'M', 'Mva':'V',
    'Nem':'H', 'Nep':'H', 'Nle':'L', 'Nln':'L', 'Nlp':'L', 'Nmc':'G', 'Oas':'S', 'Ocs':'C',
    'Omt':'M', 'Paq':'Y', 'Pca':'E', 'Pec':'C', 'Phe':'F', 'Phi':'F', 'Phl':'F', 'Pr3':'C',
    'Pro':'P', 'Prr':'A', 'Ptr':'Y', 'Pyl':'O', 'Sac':'S', 'Sar':'G', 'Sch':'C', 'Scs':'C',
    'Scy':'C', 'Sec':'U', 'Sel':'U', 'Sep':'S', 'Ser':'S', 'Set':'S', 'Shc':'C', 'Shr':'K',
    'Smc':'C', 'Soc':'C', 'Sty':'Y', 'Sva':'S', 'Ter':'*', 'Thr':'T', 'Tih':'A', 'Tpl':'W',
    'Tpo':'T', 'Tpq':'A', 'Trg':'K', 'Tro':'W', 'Trp':'W', 'Tyb':'Y', 'Tyq':'Y', 'Tyr':'Y',
    'Tys':'Y', 'Tyy':'Y', 'Unk':'X', 'Val':'V', 'Xaa':'X', 'Xer':'X', 'Xle':'J'}

""" Map between three letter amino acid codes and standard one letter codes.
This map contains many nonstandard three letter codes, used, for example,
to specify chemically modified amino acids in PDB files.

Ref: http://astral.berkeley.edu/
Ref: http://www.ebi.ac.uk/RESID/faq.html
"""
# Initial table is from the ASTRAL RAF release notes.
# added UNK
# Extra IUPAC: Xle, Xaa, Sec, Pyl
# The following have been seen in biopython code.
# Ter : '*' Termination
# Sel : 'U' A typo for Sec, selenocysteine?
# Xer : 'X' Another alternative for unknown?
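
# A minimal sketch of how a three-to-one table like this might be applied
# to residue names read from a structure file. It uses a small excerpt of
# the table so that it runs on its own. The to_one_letter helper, the
# case normalisation and the 'X' fallback for unlisted codes are
# assumptions made for illustration, not behaviour defined by this module.

```python
# Excerpt of extended_three_to_one; keys use the same capitalisation.
extended_three_to_one = {
    'Ala': 'A', 'Gly': 'G', 'Mse': 'M', 'Sep': 'S', '5hp': 'E', 'Unk': 'X',
}


def to_one_letter(residues, table=extended_three_to_one, unknown='X'):
    """Convert an iterable of three letter codes to a one letter string.

    Codes are normalised with str.capitalize(), so 'MSE' and 'mse' both
    look up 'Mse' and '5HP' looks up '5hp'. Codes missing from the table
    map to `unknown`.
    """
    return ''.join(table.get(code.capitalize(), unknown) for code in residues)


sequence = to_one_letter(['ALA', 'MSE', 'SEP', 'GLY', 'FOO'])
```

# With this excerpt, sequence is "AMSGX": selenomethionine (Mse) maps to M,
# phosphoserine (Sep) to S, and the unlisted code falls back to X.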


amino_acid_names = {
    'A' : 'alanine',
    'M' : 'methionine',
    'C' : 'cysteine',
    'N' : 'asparagine',
    'D' : 'aspartic acid',
    'P' : 'proline',
    'E' : 'glutamic acid',
    'Q' : 'glutamine',
    'F' : 'phenylalanine',
    'R' : 'arginine',
    'G' : 'glycine',
    'S' : 'serine',
    'H' : 'histidine',
    'T' : 'threonine',
    'I' : 'isoleucine',
    'V' : 'valine',
    'K' : 'lysine',
    'W' : 'tryptophan',
    'L' : 'leucine',
    'Y' : 'tyrosine',
    'B' : 'aspartic acid or asparagine',
    'J' : 'leucine or isoleucine',
    'X' : 'unknown',
    'Z' : 'glutamic acid or glutamine',
    'U' : 'selenocysteine',
    'O' : 'pyrrolysine',
    '*' : 'translation stop',
    '-' : 'gap'
    }

amino_acid_composition = dict(
    A = .082, R = .057, N = .044, D = .053, C = .017,
    Q = .040, E = .062, G = .072, H = .022, I = .052,
    L = .090, K = .057, M = .024, F = .039, P = .051,
    S = .069, T = .058, W = .013, Y = .032, V = .066 )

"""
Overall amino acid composition of proteins.
Ref: McCaldon P., Argos P. Proteins 4:99-122 (1988).
"""
# FIXME : Proof these values

kyte_doolittle_hydrophobicity = dict(
    A=1.8, R=-4.5, N=-3.5, D=-3.5, C=2.5,
    Q=-3.5, E=-3.5, G=-0.4, H=-3.2, I=4.5,
    L=3.8, K=-3.9, M=1.9, F=2.8, P=-1.6,
    S=-0.8, T=-0.7, W=-0.9, Y=-1.3, V=4.2 )
"""
Kyte-Doolittle hydrophobicity scale.
Ref: Kyte J., Doolittle R.F. J. Mol. Biol. 157:105-132 (1982)
"""
# FIXME : Proof these values
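
# A per-residue scale such as this one is commonly averaged over a sliding
# window to give a hydropathy profile along a sequence. The sketch below
# repeats the scale so that it runs on its own; the hydropathy_profile
# helper is hypothetical, and the default window of 7 is only an
# illustrative choice.

```python
# Copy of the kyte_doolittle_hydrophobicity table above.
kyte_doolittle_hydrophobicity = dict(
    A=1.8, R=-4.5, N=-3.5, D=-3.5, C=2.5,
    Q=-3.5, E=-3.5, G=-0.4, H=-3.2, I=4.5,
    L=3.8, K=-3.9, M=1.9, F=2.8, P=-1.6,
    S=-0.8, T=-0.7, W=-0.9, Y=-1.3, V=4.2)


def hydropathy_profile(sequence, window=7,
                       scale=kyte_doolittle_hydrophobicity):
    """Return the mean scale value of every full window along `sequence`.

    The result has len(sequence) - window + 1 entries, or is empty when
    the sequence is shorter than the window. Letters missing from the
    scale raise KeyError.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [scale[letter] for letter in sequence]
    return [
        sum(values[i:i + window]) / float(window)
        for i in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("MKILVAAGG", window=3)
```

# Positive averages mark hydrophobic stretches on this scale; ambiguity
# letters such as B, Z or X are not in the table and would need handling
# by the caller.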


nucleotide_names = {
    'A' : 'Adenosine',
    'C' : 'Cytidine',
    'G' : 'Guanine',
    'T' : 'Thymidine',
    'U' : 'Uracil',
    'R' : 'G A (puRine)',
    'Y' : 'T C (pYrimidine)',
    'K' : 'G T (Ketone)',
    'M' : 'A C (aMino group)',
    'S' : 'G C (Strong interaction)',
    'W' : 'A T (Weak interaction)',
    'B' : 'G T C (not A) (B comes after A)',
    'D' : 'G A T (not C) (D comes after C)',
    'H' : 'A C T (not G) (H comes after G)',
    'V' : 'G C A (not T, not U) (V comes after U)',
    'N' : 'A G C T (aNy)',
    '-' : 'gap',
}