annotate blast_tool.py @ 0:ec8fe63afdb8 draft

Uploaded
author nedias
date Wed, 12 Oct 2016 00:03:16 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
1 """
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
2 Util class of BLAST API
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
3 Use Biopython's BLAST interface to access BLAST API
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
4 BLAST call may take more than 5 min to complete.
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
5
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
6 Author Nedias Sept, 2016
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
7
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
8 """
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
9
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
10
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
11 # BLAST Web api, use to calling BLAST using url
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
12 from Bio.Blast import NCBIWWW
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
13 # BLAST XML util, use to read data from BLAST's response
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
14 from Bio.Blast import NCBIXML
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
15
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
16
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
17 def exec_tool(options):
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
18 if options.format and options.format == "fasta":
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
19 blast_call(options.input, options.output)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
20
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
21
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
22
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
23
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
24 # Call BLAST protein API
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
25 # input: an fasta format file with polypeptide data
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
26 # output: the searching result file from BLAST in XML format
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
27 # return: execute status code(developing)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
28 def blast_call(in_file, out_file):
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
29
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
30 # Open input file(read only)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
31 input_file = open(in_file, "rU")
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
32 # Create an output file(create or override)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
33 output_file = open(out_file, "w+")
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
34
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
35 # Read all data from the input fasta file
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
36 fasta_string = input_file.read()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
37
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
38 # Call BLAST to search similar protein data
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
39 # blastp's "p" means using the protein search API
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
40 # "nr" means not specific database, which will perform a search base on all possible database
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
41 # TODO:May consider parametrize these options
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
42 result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
43
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
44 # Read result form responded data and save them in output file
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
45 result = result_handle.read()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
46 output_file.write(result)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
47
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
48 return 0
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
49
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
50
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
51 # Call BLAST protein API (text version)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
52 # input: polypeptide data text
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
53 # return: the searching result from BLAST in StringIO format
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
54 def blast_call_text(in_text):
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
55
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
56 # Call BLAST to search similar protein data
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
57 result_handle = NCBIWWW.qblast("blastp", "nr", in_text)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
58
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
59 return result_handle
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
60
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
61
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
62 # Abstract data from BLAST responded XML
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
63 # input: the searching result file from BLAST in XML format
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
64 # return: an dictionary include all data from BLAST result.
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
65 # the structure of the dictionary is:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
66 # dict -- hits: Integer, number of total hits during searching
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
67 # -- alignments: List, alignments for all polypeptide chains(there could be many chains in a single file)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
68 # | (TODO:May consider change to dictionary and use the tag of every polypeptide chain as entry)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
69 # |
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
70 # -- hsps: List, all high-scoring pairs(alignment) for one polypeptide chain
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
71 # |
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
72 # hps: Dictionary, contains all data of the high-scoring pair
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
73 def read_blast_result(blast_xml):
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
74
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
75 # Open the BLAST result
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
76 result_file = open(blast_xml)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
77 # Read all result using biopython
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
78 blast_record = NCBIXML.read(result_file)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
79 # E-value limit, every data with a less-than 96% e-value will be filtered
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
80 E_VALUE_THRESH = 0.04
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
81
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
82 # Initialize the result dictionary
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
83 result_dict = dict()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
84
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
85 # Store the hits number
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
86 result_dict['hits': len(blast_record.alignments)]
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
87 # Initialize the alignment list
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
88 aligns = list()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
89 # Scan through all alignments
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
90 for alignment in blast_record.alignments:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
91 # Initialize the high-score pairs list
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
92 hsps = list()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
93 # Scan through all hsps
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
94 for hsp in alignment.hsps:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
95 # filter all hsp has less e-value than E_VALUE_THRESH
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
96 if hsp.expect < E_VALUE_THRESH:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
97 # Initialize high-score pair dictionary
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
98 hsp_dict = dict()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
99 # Store all data obtained from BLAST into the dictionary
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
100 hsp_dict['seq'] = alignment.title
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
101 hsp_dict['len'] = alignment.length
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
102 hsp_dict['eval'] = hsp.expect
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
103 # Identity is calculated as the percentage of identities in the alignment
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
104 hsp_dict['ident'] = int(round(hsp.identities * float(100) / alignment.length))
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
105 hsp_dict['query'] = hsp.query
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
106 hsp_dict['match'] = hsp.match
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
107 hsp_dict['sbjct'] = hsp.sbjct
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
108 hsps.append(hsp_dict)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
109 # Store the hsps into the alignment data
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
110 aligns.append(hsps)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
111
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
112 # Store alignment data into the dictionary
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
113 result_dict['alignments', aligns]
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
114 return result_dict
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
115
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
116
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
117 # Abstract data from BLAST responded XML (text version)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
118 # input: the searching result file from BLAST in StringIO
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
119 # return: an dictionary include all data from BLAST result.
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
120 def read_blast_result_text(blast_handle):
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
121
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
122 blast_record = NCBIXML.read(blast_handle)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
123
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
124 E_VALUE_THRESH = 0.04
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
125
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
126 result_dict = dict()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
127
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
128 result_dict['hits'] = len(blast_record.alignments)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
129 aligns = list()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
130 for alignment in blast_record.alignments:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
131 hsps = list()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
132 for hsp in alignment.hsps:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
133
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
134 if hsp.expect < E_VALUE_THRESH:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
135 hsp_dict = dict()
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
136 hsp_dict['seq'] = alignment.title
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
137 hsp_dict['len'] = alignment.length
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
138 hsp_dict['eval'] = hsp.expect
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
139 hsp_dict['ident'] = int(round(hsp.identities * float(100) / alignment.length))
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
140 hsp_dict['query'] = hsp.query
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
141 hsp_dict['match'] = hsp.match
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
142 hsp_dict['sbjct'] = hsp.sbjct
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
143 hsps.append(hsp_dict)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
144 aligns.append(hsps)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
145
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
146 result_dict['alignments'] = aligns
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
147 return result_dict
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
148
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
149
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
150 # Sample of usage (text version)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
151 def text():
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
152 result = blast_call_text('MTSQTTINKPGPGDGSPAGSAPAKGWRGHPWVTLVTVAVGVMMVALDGTIVAIANPAIQD\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
153 DLDASLADVQWITNAYFLALAVALITAGKLGDRFGHRQTFLIGVAGFAASSGAIGLSGSI\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
154 AAVIVFRVFQGLFGALLMPAALGLLRATFPAEKLNMAIGIWGMVIGASTAGGPILGGVLV\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
155 EHVNWQSVFFINVPVGIVAVVLGVMILLDHRAANAPRSFDVVGIVLLSASMFALVWALIK\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
156 APEWGWGSGQTWVYIGGSVVGFVLFSVWETKVKEPLIPLAMFRSVPLSAGVVLMVLMAIA\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
157 FMGGLFFVTFYLQNVHGMSPVDAGLHLLPLTGMMIVASPLAGAMITKVGPRIPLAGGMVC\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
158 TAVAMFGISTLETDTGSGLMSIWFGLLGLGLAPVMVGATEVIVGNAPMELSGVAGGLQQA\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
159 AMQIGGSLGTAVLGAVMASKVDSDLAGNWKDAGLPELTPQQADQASEAVRVGVPPVAPGT\
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
160 PAEVAGKITDVAHDTFISGMSLASLVAAGVAVVAVFVAFLTKRGENAEAGAGVGHI'
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
161 )
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
162 re_dict = read_blast_result_text(result)
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
163 print(re_dict['hits'])
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
164 for alignments in re_dict['alignments']:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
165 for hsp in alignments:
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
166 print(hsp['seq'])
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
167 print(hsp['len'])
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
168 print(hsp['eval'])
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
169 print(hsp['ident'])
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
170 print(hsp['query'][0:75] + "...")
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
171 print(hsp['match'][0:75] + "...")
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
172 print(hsp['sbjct'][0:75] + "...")
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
173
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
174
ec8fe63afdb8 Uploaded
nedias
parents:
diff changeset
175