annotate gbk_to_gff.py @ 11:5c6f33e20fcc default tip

requirement tag added
author vipints <vipin@cbio.mskcc.org>
date Fri, 24 Apr 2015 18:04:27 -0400
parents c42c69aa81f8
children
rev   line source
10
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
#!/usr/bin/env python
"""
Convert data from Genbank format to GFF.

Usage:
python gbk_to_gff.py in.gbk > out.gff

Requirements:
    BioPython:- http://biopython.org/
    helper.py:- https://github.com/vipints/GFFtools-GX/blob/master/helper.py

Copyright (C)
    2009-2012 Friedrich Miescher Laboratory of the Max Planck Society, Tubingen, Germany.
    2012-2015 Memorial Sloan Kettering Cancer Center New York City, USA.
"""

import os
import re
import sys
import helper
import collections
from Bio import SeqIO

def feature_table(chr_id, source, orient, genes, transcripts, cds, exons, unk):
    """
    Write the feature information
    """
    for gname, ginfo in genes.items():
        line = [str(chr_id),
                'gbk2gff',
                ginfo[3],
                str(ginfo[0]),
                str(ginfo[1]),
                '.',
                ginfo[2],
                '.',
                'ID=%s;Name=%s' % (str(gname), str(gname))]
        sys.stdout.write('\t'.join(line)+"\n")
        ## construct the transcript line if it is not defined in the original file
        t_line = [str(chr_id), 'gbk2gff', source, 0, 1, '.', ginfo[2], '.']

        if not transcripts:
            t_line.append('ID=Transcript:%s;Parent=%s' % (str(gname), str(gname)))

            if exons: ## get the entire transcript region from the defined feature
                t_line[3] = str(exons[gname][0][0])
                t_line[4] = str(exons[gname][0][-1])
            elif cds:
                t_line[3] = str(cds[gname][0][0])
                t_line[4] = str(cds[gname][0][-1])

            if not cds:
                t_line[2] = 'transcript'
            else:
                t_line[2] = 'mRNA'
            sys.stdout.write('\t'.join(t_line)+"\n")

            if exons:
                exon_line_print(t_line, exons[gname], 'Transcript:'+str(gname), 'exon')

            if cds:
                exon_line_print(t_line, cds[gname], 'Transcript:'+str(gname), 'CDS')
                if not exons:
                    exon_line_print(t_line, cds[gname], 'Transcript:'+str(gname), 'exon')

        else: ## transcript is defined
            for idx in transcripts[gname]:
                t_line[2] = idx[3]
                t_line[3] = str(idx[0])
                t_line[4] = str(idx[1])
                ## replace the attribute column instead of appending, so that a
                ## second transcript of the same gene still yields nine columns
                t_line[8:] = ['ID='+str(idx[2])+';Parent='+str(gname)]
                sys.stdout.write('\t'.join(t_line)+"\n")

                ## feature line print call
                if exons:
                    exon_line_print(t_line, exons[gname], str(idx[2]), 'exon')
                if cds:
                    exon_line_print(t_line, cds[gname], str(idx[2]), 'CDS')
                    if not exons:
                        exon_line_print(t_line, cds[gname], str(idx[2]), 'exon')

    if len(genes) == 0: ## feature entry with fragment information

        line = [str(chr_id), 'gbk2gff', source, 0, 1, '.', orient, '.']
        fStart = fStop = None

        for eid, ex in cds.items():
            fStart = ex[0][0]
            fStop = ex[0][-1]

        for eid, ex in exons.items():
            fStart = ex[0][0]
            fStop = ex[0][-1]

        if fStart or fStop: ## the original tested fStart twice

            line[2] = 'gene'
            line[3] = str(fStart)
            line[4] = str(fStop)
            line.append('ID=Unknown_Gene_' + str(unk) + ';Name=Unknown_Gene_' + str(unk))
            sys.stdout.write('\t'.join(line)+"\n")

            if not cds:
                line[2] = 'transcript'
            else:
                line[2] = 'mRNA'

            line[8] = 'ID=Unknown_Transcript_' + str(unk) + ';Parent=Unknown_Gene_' + str(unk)
            sys.stdout.write('\t'.join(line)+"\n")

            if exons: ## exon rows come from the exon records, not from cds
                exon_line_print(line, exons[None], 'Unknown_Transcript_' + str(unk), 'exon')

            if cds:
                exon_line_print(line, cds[None], 'Unknown_Transcript_' + str(unk), 'CDS')
                if not exons:
                    exon_line_print(line, cds[None], 'Unknown_Transcript_' + str(unk), 'exon')

            unk += 1

    return unk


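The gene, transcript and exon rows written by feature_table form a parent/child hierarchy linked through the ID and Parent attributes. The sketch below illustrates that layout on made-up coordinates. The helper names gff_row and gene_block are hypothetical and are not part of gbk_to_gff.py; the block returns the rows instead of writing them to stdout, and it assumes each exon is a (start, end) pair.

```python
def gff_row(seqid, ftype, start, end, strand, attributes, source="gbk2gff"):
    """Return one tab-separated, nine-column GFF row (hypothetical helper)."""
    columns = [seqid, source, ftype, start, end, ".", strand, ".", attributes]
    # Every column is converted to str, so integer coordinates are safe to join.
    return "\t".join(str(col) for col in columns)


def gene_block(seqid, gene_id, start, end, strand, exons):
    """Return rows for a gene, one synthetic mRNA, and its exons."""
    tx_id = "Transcript:%s" % gene_id
    rows = [
        gff_row(seqid, "gene", start, end, strand,
                "ID=%s;Name=%s" % (gene_id, gene_id)),
        gff_row(seqid, "mRNA", start, end, strand,
                "ID=%s;Parent=%s" % (tx_id, gene_id)),
    ]
    for ex_start, ex_end in exons:
        # Exons point at the transcript, which in turn points at the gene.
        rows.append(gff_row(seqid, "exon", ex_start, ex_end, strand,
                            "Parent=%s" % tx_id))
    return rows


# Made-up example data, for illustration only.
rows = gene_block("chr1", "geneA", 101, 500, "+", [(101, 200), (301, 500)])
print("\n".join(rows))
```

Building each row from scratch avoids carrying state from one row to the next, at the cost of repeating the shared columns in every call.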
def exon_line_print(temp_line, trx_exons, parent, ftype):
    """
    Print one feature line of type ftype (exon or CDS) for each segment
    in trx_exons. temp_line is modified in place and must already hold
    nine columns.
    """
    for ex in trx_exons:
        temp_line[2] = ftype
        temp_line[3] = str(ex[0])
        temp_line[4] = str(ex[1])
        temp_line[8] = 'Parent=%s' % parent
        sys.stdout.write('\t'.join(temp_line)+"\n")


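exon_line_print overwrites columns 2, 3, 4 and 8 of the list it receives, so the caller's transcript row no longer holds its own type, coordinates or ID afterwards. A non-mutating variant might look like the sketch below. The name exon_rows is hypothetical and not part of the script; it assumes the same input shape (a nine-column template and segments indexed as ex[0], ex[1]) and returns the rows rather than printing them.

```python
def exon_rows(template, trx_exons, parent, ftype):
    """Hypothetical non-mutating counterpart of exon_line_print."""
    rows = []
    for ex in trx_exons:
        row = list(template)  # copy, so the caller's list stays untouched
        row[2] = ftype
        row[3] = str(ex[0])
        row[4] = str(ex[1])
        row[8] = "Parent=%s" % parent
        rows.append("\t".join(row))
    return rows


# Made-up transcript row and segments, for illustration only.
template = ["chr1", "gbk2gff", "mRNA", "101", "500", ".", "+", ".",
            "ID=tx1;Parent=geneA"]
rows = exon_rows(template, [(101, 200), (301, 500)], "tx1", "CDS")
print("\n".join(rows))
```

After the call, template still describes the mRNA row, so it can be reused for a second feature type without rebuilding it.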
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
136 def gbk_parse(fname):
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
137 """
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
138 Extract genome annotation recods from genbank format
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
139
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
140 @args fname: gbk file name
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
141 @type fname: str
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
142 """
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
143 fhand = helper.open_file(gbkfname)
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
144 unk = 1
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
145
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
146 for record in SeqIO.parse(fhand, "genbank"):
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
147 gene_tags = dict()
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
148 tx_tags = collections.defaultdict(list)
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
149 exon = collections.defaultdict(list)
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
150 cds = collections.defaultdict(list)
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
151 mol_type, chr_id = None, None
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
152
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
153 for rec in record.features:
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
154
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
155 if rec.type == 'source':
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
156 try:
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
157 mol_type = rec.qualifiers['mol_type'][0]
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
158 except:
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
159 mol_type = '.'
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
160 pass
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
161 try:
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
162 chr_id = rec.qualifiers['chromosome'][0]
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
163 except:
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
164 chr_id = record.name
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
165 continue
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
166
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
167 strand='-'
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
168 strand='+' if rec.strand>0 else strand
c42c69aa81f8 fixed manually the upload of version 2.1.0 - deleted accidentally added files to the repo
vipints <vipin@cbio.mskcc.org>
parents:
diff changeset
169
            # gene name, used as the key that groups features of one gene
            fid = None
            try:
                fid = rec.qualifiers['gene'][0]
            except (KeyError, IndexError):
                pass

            # transcript identifier, present only on transcript-level features
            transcript_id = None
            try:
                transcript_id = rec.qualifiers['transcript_id'][0]
            except (KeyError, IndexError):
                pass

            # Biopython positions are 0-based and end-exclusive;
            # GFF wants 1-based inclusive, so only the start shifts by one
            start = int(rec.location.start) + 1
            end = int(rec.location.end)

            if re.search(r'gene', rec.type):
                gene_tags[fid] = (start, end, strand, rec.type)
            elif rec.type == 'exon':
                exon[fid].append((start, end))
            elif rec.type == 'CDS':
                cds[fid].append((start, end))
            elif transcript_id:
                # any other feature carrying a transcript_id is a transcript
                tx_tags[fid].append((start, end, transcript_id, rec.type))
        # record extracted, generate feature table
        unk = feature_table(chr_id, mol_type, strand, gene_tags, tx_tags, cds, exon, unk)

    fhand.close()


if __name__ == '__main__':
    try:
        gbkfname = sys.argv[1]
    except IndexError:
        # no input file given: show the usage text and exit with an error
        print(__doc__)
        sys.exit(-1)

    # extract gbk records
    gbk_parse(gbkfname)
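
The coordinate and strand handling inside gbk_parse can be sketched in isolation. This is a minimal illustration and not part of the original script: the helper names to_gff_interval and to_gff_strand are hypothetical, and the sketch assumes Biopython-style feature coordinates (0-based start, end-exclusive) and numeric strands (+1, -1, 0 or None).

```python
def to_gff_interval(start, end):
    """Convert a 0-based, end-exclusive interval to 1-based, inclusive.

    Only the start moves: the exclusive end of the input already equals
    the inclusive end of the output, so the feature length is unchanged.
    (Hypothetical helper, named here for illustration only.)
    """
    return start + 1, end


def to_gff_strand(strand):
    """Map a numeric strand to a GFF symbol.

    Mirrors the script's rule: only the forward strand (+1) becomes '+',
    every other value (including None) is reported as '-'.
    (Hypothetical helper, named here for illustration only.)
    """
    return '+' if strand == 1 else '-'
```

For example, to_gff_interval(99, 200) gives (100, 200), a feature of 101 bases in both representations, and to_gff_strand(-1) gives '-'.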