annotate tools/protein_analysis/predictnls.py @ 0:6e26c5a48e9a draft

Uploaded v0.0.4, first public release.
author peterjc
date Wed, 20 Feb 2013 11:39:06 -0500
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
2
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
3 #Copyright 2011-2013 by Peter Cock, James Hutton Institute (formerly SCRI), UK
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
4 #
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
5 #Licenced under the GPL (GNU General Public Licence) version 3.
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
6 #
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
7 #Based on Perl script predictNLS v1.3, copyright 2001-2005 and the later
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
8 #versions up to predictnls v1.0.20 (copright 2012), by Rajesh Nair
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
9 #(nair@rostlab.org) and Burkhard Rost (rost@rostlab.org), Rost Lab,
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
10 #Columbia University http://rostlab.org/
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
11
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
12 """Batch mode predictNLS, for finding nuclear localization signals
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
13
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
14 This is a Python script re-implementing the predictNLS method, originally
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
15 written in Perl, described here:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
16
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
17 Murat Cokol, Rajesh Nair, and Burkhard Rost.
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
18 Finding nuclear localization signals.
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
19 EMBO reports 1(5), 411-415, 2000
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
20
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
21 http://dx.doi.org/10.1093/embo-reports/kvd092
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
22
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
23 The original Perl script was designed to work on a single sequence at a time,
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
24 but offers quite detailed output, including HTML (webpage).
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
25
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
26 This Python version is designed to work on a single FASTA file containing
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
27 multiple sequences, and produces a single tabular output file, with one line
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
28 per NLS found (i.e. zero or more rows per query sequence).
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
29
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
30 It takes either two or three command line arguments:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
31
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
32 predictNLS_batch input_file output_file [nls_motif_file]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
33
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
34 The input file should be protein sequences in FASTA format, the output file
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
35 is tab separated plain text, and the NLS motif file defaults to using the
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
36 plain text My_NLS_list file located next to the script file, or in a data
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
37 subdirectory.
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
38
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
39 For convience if using this outside Galaxy, the input filename can be '-'
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
40 to mean stdin, and likewise the output filename can be '-' to mean stdout.
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
41
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
42 Tested with the My_NLS_list file included with predictnls-1.0.7.tar.gz to
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
43 predictnls-1.0.20.tar.gz inclusive (the list was extended in v1.0.7 in
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
44 August 2010, see the change log included in those tar-balls).
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
45
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
46 The Rost Lab provide source code tar balls for predictNLS on the FTP site
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
47 ftp://rostlab.org/predictnls/ but for Debian or RedHat based Linux they
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
48 recommend their package repository instead,
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
49 https://rostlab.org/owiki/index.php/Packages
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
50 """
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
51
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
52 import os
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
53 import sys
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
54 import re
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
55
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
56 def stop_err(msg, return_code=1):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
57 sys.stderr.write(msg.rstrip() + "\n")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
58 sys.exit(return_code)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
59
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
60 if len(sys.argv) == 4:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
61 fasta_filename, tabular_filename, re_filename = sys.argv[1:]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
62 elif len(sys.argv) == 3:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
63 fasta_filename, tabular_filename = sys.argv[1:]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
64 #Use os.path.realpath(...) to handle being called via a symlink
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
65 #Try under subdirectory data:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
66 re_filename = os.path.join(os.path.dirname(os.path.realpath(sys.argv[0])),
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
67 "data", "My_NLS_list")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
68 if not os.path.isfile(re_filename):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
69 #Try in same directory as this script:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
70 re_filename = os.path.join(os.path.dirname(os.path.realpath(sys.argv[0])),
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
71 "My_NLS_list")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
72 else:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
73 stop_err("Expect 2 or 3 arguments: input FASTA file, output tabular file, and NLS motif file")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
74
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
75 if not os.path.isfile(fasta_filename):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
76 stop_err("Could not find FASTA input file: %s" % fasta_filename)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
77
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
78 if not os.path.isfile(re_filename):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
79 stop_err("Could not find NLS motif file: %s" % re_filename)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
80
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
81 def load_re(filename):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
82 """Parse the 5+ column tabular NLS motif file."""
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
83 handle = open(filename, "rU")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
84 for line in handle:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
85 line = line.rstrip("\n")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
86 if not line:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
87 continue
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
88 parts = line.split("\t")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
89 assert 5 <= len(parts), parts
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
90 regex, evidence, p_count, percent_nuc, precent_non_nuc = parts[0:5]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
91 try:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
92 regex = re.compile(regex)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
93 p_count = int(p_count)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
94 except ValueError:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
95 stop_err("Bad data in line: %s" % line)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
96 if 6 <= len(parts):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
97 proteins = parts[5]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
98 assert p_count == len(proteins.split(",")), line
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
99 else:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
100 proteins = ""
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
101 assert p_count == 0
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
102 if 7 <= len(parts):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
103 domains = parts[6]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
104 assert int(p_count) == len(domains.split(",")), line
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
105 else:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
106 domains = ""
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
107 assert p_count == 0
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
108 #There can be further columns (DNA binding?), but we don't use them.
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
109 yield regex, evidence, p_count, percent_nuc, proteins, domains
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
110 handle.close()
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
111
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
112 def fasta_iterator(filename):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
113 """Simple FASTA parser yielding tuples of (name, upper case sequence)."""
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
114 if filename == "-":
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
115 handle = sys.stdin
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
116 else:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
117 handle = open(filename)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
118 name, seq = "", ""
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
119 for line in handle:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
120 if line.startswith(">"):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
121 if name:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
122 yield name, seq
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
123 #Take the first word only as the name:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
124 name = line[1:].rstrip().split(None,1)[0]
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
125 seq = ""
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
126 elif name:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
127 #Simple way would leave in any internal white space,
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
128 #seq += line.strip().upper()
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
129 seq += "".join(line.strip().upper().split())
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
130 elif not line.strip():
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
131 #Ignore blank lines before first record
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
132 pass
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
133 else:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
134 raise ValueError("Bad FASTA line %r" % line)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
135 if filename != "-":
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
136 handle.close()
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
137 if name:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
138 yield name, seq
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
139 raise StopIteration
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
140
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
141 motifs = list(load_re(re_filename))
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
142 print "Looking for %i NLS motifs" % len(motifs)
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
143
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
144 if tabular_filename == "-":
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
145 out_handle = sys.stdout
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
146 else:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
147 out_handle = open(tabular_filename, "w")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
148 out_handle.write("#ID\tNLS start\tNLS seq\tNLS pattern\tType\tProtCount\t%NucProt\tProtList\tProtLoci\n")
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
149 count = 0
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
150 nls = 0
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
151 for idn, seq in fasta_iterator(fasta_filename):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
152 for regex, evidence, p_count, percent_nuc_prot, proteins, domains in motifs:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
153 #Perl predictnls v1.0.17 (and older) take right most hit only, Bug #40
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
154 #This has been fixed (v1.0.18 onwards, June 2011), so we return all the matches
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
155 for match in regex.finditer(seq):
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
156 #Perl predictnls v1.0.17 (and older) return NLS start position with zero
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
157 #but changed to one based counting in v1.0.18 (June 2011) onwards, Bug #38
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
158 #We therefore also use one based couting, hence the start+1 here:
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
159 out_handle.write("%s\t%i\t%s\t%s\t%s\t%i\t%s\t%s\t%s\n" \
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
160 % (idn, match.start()+1, match.group(),
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
161 regex.pattern, evidence, p_count,
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
162 percent_nuc_prot, proteins, domains))
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
163 nls += 1
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
164 count += 1
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
165 if tabular_filename != "-":
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
166 out_handle.close()
6e26c5a48e9a Uploaded v0.0.4, first public release.
peterjc
parents:
diff changeset
167 print "Found %i NLS motifs in %i sequences" % (nls, count)