annotate corebio/seq_io/phylip_io.py @ 14:778f03497adb

Uploaded
author davidmurphy
date Fri, 24 Feb 2012 11:37:26 -0500
parents c55bdc2fb9fa
children
#!/usr/bin/env python

# Copyright (c) 2005 David D. Ding <dding@berkeley.edu>
#
# This software is distributed under the MIT Open Source License.
# <http://www.opensource.org/licenses/mit-license.html>
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
#

"""Read sequences in interleaved Phylip format (not sequential) and return a
list of sequences. Phylip is a very common sequence format for phylogeny
generation, with the following traits:
1) The first line contains the number of species and the number of characters
in each species' sequence. Options may follow, and they can be spaced or
unspaced. Options are simply letters such as A and W after the number of
characters.
2) Options do not have to contain U in order for a usertree to appear.
3) If there are options, then the option lines appear first, then the
sequences. For the first iteration of sequences, the first ten characters of
each line are reserved for the names of options and species; the rest is for
sequence data.
4) For the second and following iterations the names are removed; only the
sequence data appears.
5) At the end of the file a usertree may appear. First there is a number that
indicates the number of lines the usertree will take, and then the usertree
lines follow.

Examples:
6 50 W
W         0101001111 0101110101 01011
dmras1    GTCGTCGTTG GACCTGGAGG CGTGG
hschras   GTGGTGGTGG GCGCCGGCCG TGTGG
ddrasa    GTTATTGTTG GTGGTGGTGG TGTCG
spras     GTAGTTGTAG GAGATGGTGG TGTTG
scras1    GTAGTTGTCG GTGGAGGTGG CGTTG
scras2    GTCGTCGTTG GTGGTGGTGG TGTTG

0101001111 0101110101 01011
GTCGTCGTTG GACCTGGAGG CGTGG
GTGGTGGTGG GCGCCGGCCG TGTGG
GTTATTGTTG GTGGTGGTGG TGTCG
GTAGTTGTAG GAGATGGTGG TGTTG
GTAGTTGTCG GTGGAGGTGG CGTTG
GTCGTCGTTG GTGGTGGTGG TGTTG

1
((dmras1,ddrasa),((hschras,spras),(scras1,scras2)));

"""

from corebio.seq import *

names = ('phylip',)
extensions = ('phy',)

def iterseq(fin, alphabet=None):
    """Iterate over the sequences in the file."""
    # Default implementation
    return iter(read(fin, alphabet))


# read() takes an open Phylip file object, reads and processes it, and
# returns a SeqList.
def read(fin, alphabet=None):

    sequence = []           # where sequences are stored
    idents = []             # species names, in file order
    num_seq = 0             # number of sequences declared in the header
    num_total_seq = 0       # length of the sequence of one species
    tracker = 0             # tracks which sequence the current line belongs to
    usertree_tracker = 0    # tracks remaining usertree lines
    options = ''            # option letters from the header
    num_options = 0         # number of option lines expected (options minus U)

    line = fin.readline()
    while line:
        # Whitespace-split view of the line; not used in every branch.
        s_line = line.split()

        if s_line == []:    # blank line: nothing to do
            pass

        # A lone number after all sequences are complete announces the usertree.
        # The `sequence` check avoids an IndexError when no sequence has been
        # read yet.
        elif (s_line[0].isdigit() and len(s_line) == 1 and sequence
              and len(sequence) == num_seq
              and len(sequence[0]) == num_total_seq):
            usertree_tracker = int(s_line[0])

        elif num_options > 0:
            if len(sequence) < num_seq:
                if s_line[0][0] in options:
                    num_options -= 1
                else:
                    raise ValueError('Not an option, but it should be one')
            else:
                num_options -= 1

        elif usertree_tracker > 0:  # basically skip the usertree
            if len(sequence[num_seq-1]) == num_total_seq:
                usertree_tracker -= 1
            else:
                raise ValueError('User Tree in Wrong Place')

        # Remaining digit-led lines: the header line, or an unexpected parse error.
        elif s_line[0].isdigit():
            if len(s_line) >= 2 and len(sequence) == 0:  # first line of file
                num_seq = int(s_line[0])        # number of sequences
                num_total_seq = int(s_line[1])  # length of sequences
                if len(s_line) > 2:             # takes care of the options
                    options = ''.join(s_line[2:])
                    num_options = len(options) - options.count('U')
            else:
                raise ValueError('parse error')

        # When the option lines end, this takes care of the sequences.
        elif num_options == 0:
            if num_seq == 0:
                raise ValueError("Empty File, or possibly wrong file")
            elif tracker < num_seq:
                if num_seq > len(sequence):
                    # First block: the first ten characters hold the species name.
                    sequence.append(''.join(line[10:].split()))
                    idents.append(line[0:10].strip())
                    tracker += 1
                else:
                    sequence[tracker] += ''.join(s_line)
                    tracker += 1

                if tracker == num_seq:
                    tracker = 0
                    num_options = len(options) - options.count('U')

        line = fin.readline()

    if len(sequence) != len(idents) or len(sequence) != num_seq:
        raise ValueError("Number of different sequences wrong")

    seqs = []
    for i in range(0, len(idents)):
        if len(sequence[i]) == num_total_seq:
            seqs.append(Seq(sequence[i], alphabet, idents[i]))
        else:
            raise ValueError("extra sequence in list")

    return SeqList(seqs)
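
The header and name-field conventions that read() relies on can be tried in isolation. The following is a minimal standalone sketch, not part of corebio: the helper names parse_header and split_first_block_line are hypothetical, and the ten-character name width is taken from the module docstring.

```python
def parse_header(line):
    """Split a Phylip header line into (num_seq, seq_len, options).

    Hypothetical helper: mirrors how read() treats the first line, where
    any fields after the two counts are joined into one option string.
    """
    fields = line.split()
    if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
        raise ValueError("parse error")
    options = "".join(fields[2:])
    return int(fields[0]), int(fields[1]), options


def split_first_block_line(line, name_width=10):
    """Split a first-block line into (name, residues).

    The first `name_width` characters are the name field; whitespace
    inside the residue columns is discarded.
    """
    name = line[:name_width].strip()
    residues = "".join(line[name_width:].split())
    return name, residues


print(parse_header("6 50 W"))
print(split_first_block_line("dmras1    GTCGTCGTTG GACCTGGAGG CGTGG"))
```

Options may be written spaced or unspaced, so "6 50 A W" and "6 50 AW" yield the same option string in this sketch.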
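
The interleaving itself, where later blocks carry no names and are appended to the sequences in file order, can be illustrated with a simplified sketch. It assumes a file with no option lines and no usertree, so it is not a replacement for read(); the function name merge_interleaved and the sample data are made up for illustration.

```python
def merge_interleaved(lines, num_seq, name_width=10):
    """Merge interleaved blocks into a list of (name, sequence) pairs.

    Simplifying assumptions (not true of read() in general): the header
    line has already been consumed, and there are no option lines and no
    usertree.
    """
    if num_seq <= 0:
        raise ValueError("num_seq must be positive")
    names = []
    seqs = []
    tracker = 0  # index of the sequence the current line belongs to
    for line in lines:
        if not line.split():
            continue  # blank lines separate blocks
        if len(names) < num_seq:
            # First block: a fixed-width name field precedes the residues.
            names.append(line[:name_width].strip())
            seqs.append("".join(line[name_width:].split()))
        else:
            # Later blocks: residues only, in the same order as the names.
            seqs[tracker] += "".join(line.split())
        tracker = (tracker + 1) % num_seq
    return list(zip(names, seqs))


example = [
    "alpha     AAAA CCCC",
    "beta      GGGG TTTT",
    "",
    "ACGT",
    "TGCA",
]
print(merge_interleaved(example, num_seq=2))
```

The modulo counter plays the role of the tracker variable in read(), which is reset to zero once a full block of num_seq lines has been consumed.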