annotate fml_gff_groomer/scripts/gff_id_mapper.py @ 0:79726c328621 default tip

Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
author vipints
date Tue, 07 Jun 2011 17:29:24 -0400
#!/usr/bin/env python
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# Written (W) 2010 Vipin T Sreedharan, Friedrich Miescher Laboratory of the Max Planck Society
# Copyright (C) 2010 Max Planck Society
#

# Description : Provides feature to sub feature identifier mapping in a given GFF3 file.

import re, sys
import collections
import urllib
import time

def _gff_line_map(line):
    """Parses a line of GFF into a dictionary.
    Given an input line from a GFF file, this:
        - breaks it into component elements
        - determines the type of attribute (flat, parent, child or annotation)
        - generates a dictionary of GFF info
    """
    gff3_kw_pat = re.compile(r"\w+=")
    def _split_keyvals(keyval_str):
        """Split key-value pairs in a GFF2, GTF and GFF3 compatible way.

        GFF3 has key value pairs like:
          count=9;gene=amx-2;sequence=SAGE:aacggagccg
        GFF2 and GTF have:
          Sequence "Y74C9A" ; Note "Clone Y74C9A; Genbank AC024206"
          name "fgenesh1_pg.C_chr_1000003"; transcriptId 869
        """
        quals = collections.defaultdict(list)
        if keyval_str is None:
            # return a (quals, is_gff2) pair here as well, because the
            # caller always unpacks two values
            return quals, False
        # ensembl GTF has a stray semi-colon at the end
        if keyval_str[-1] == ';':
            keyval_str = keyval_str[:-1]
        # GFF2/GTF has a semi-colon with at least one space after it.
        # It can have spaces on both sides; wormbase does this.
        # GFF3 works with no spaces.
        # Split at the first one we can recognize as working
        parts = keyval_str.split(" ; ")
        if len(parts) == 1:
            parts = keyval_str.split("; ")
            if len(parts) == 1:
                parts = keyval_str.split(";")
        # check if we have GFF3 style key-vals (with =)
        is_gff2 = True
        if gff3_kw_pat.match(parts[0]):
            is_gff2 = False
            # split on the first '=' only, so a value containing '='
            # does not break the key/value unpacking below
            key_vals = [p.split('=', 1) for p in parts]
        # otherwise, we are separated by a space with a key as the first item
        else:
            pieces = []
            for p in parts:
                # fix misplaced semi-colons in keys in some GFF2 files
                if p and p[0] == ';':
                    p = p[1:]
                pieces.append(p.strip().split(" "))
            key_vals = [(p[0], " ".join(p[1:])) for p in pieces]
        for key, val in key_vals:
            # remove quotes in GFF2 files
            if (len(val) > 0 and val[0] == '"' and val[-1] == '"'):
                val = val[1:-1]
            if val:
                quals[key].extend(val.split(','))
            # if we don't have a value, make this a key=True/False style
            # attribute
            else:
                quals[key].append('true')
        for key, vals in quals.items():
            quals[key] = [urllib.unquote(v) for v in vals]
        return quals, is_gff2

    def _nest_gff2_features(gff_parts):
        """Provide nesting of GFF2 transcript parts with transcript IDs.

        exons and coding sequences are mapped to a parent with a transcript_id
        in GFF2. This is implemented differently at different genome centers
        and this function attempts to resolve that and map things to the GFF3
        way of doing them.
        """
        # map protein or transcript ids to a parent
        for transcript_id in ["transcript_id", "transcriptId", "proteinId"]:
            try:
                gff_parts["quals"]["Parent"] = \
                        gff_parts["quals"][transcript_id]
                break
            except KeyError:
                pass
        # case for WormBase GFF -- everything labelled as Transcript or CDS
        for flat_name in ["Transcript", "CDS"]:
            if flat_name in gff_parts["quals"]:
                # parent types
                if gff_parts["type"] in [flat_name]:
                    if not gff_parts["id"]:
                        gff_parts["id"] = gff_parts["quals"][flat_name][0]
                        gff_parts["quals"]["ID"] = [gff_parts["id"]]
                # children types
                elif gff_parts["type"] in ["intron", "exon", "three_prime_UTR",
                        "coding_exon", "five_prime_UTR", "CDS", "stop_codon",
                        "start_codon"]:
                    gff_parts["quals"]["Parent"] = gff_parts["quals"][flat_name]
                break

        return gff_parts

    line = line.strip()
    # blank lines may be present
    if line == '':
        return [('directive', line)]
    # FASTA header line
    if line[0] == '>':
        return [('directive', '')]
    if line[0] == "#":
        return [('directive', line[2:])]
    elif line:
        parts = line.split('\t')
        # GFF files with embedded FASTA sequence
        if len(parts) == 1 and re.search(r'\w+', parts[0]):
            return [('directive', '')]
        assert len(parts) == 9, line
        gff_parts = [(None if p == '.' else p) for p in parts]
        gff_info = dict()

        # collect all of the base qualifiers for this item
        quals, is_gff2 = _split_keyvals(gff_parts[8])

        gff_info["is_gff2"] = is_gff2

        if gff_parts[1]:
            quals["source"].append(gff_parts[1])
        gff_info['quals'] = dict(quals)

        # if we are describing a location, then we are a feature
        if gff_parts[3] and gff_parts[4]:
            gff_info['type'] = gff_parts[2]
            gff_info['id'] = quals.get('ID', [''])[0]

            if is_gff2:
                gff_info = _nest_gff2_features(gff_info)
            # features that have parents need to link so we can pick up
            # the relationship
            if 'Parent' in gff_info['quals']:
                final_key = 'child'
            elif gff_info['id']:
                final_key = 'parent'
            # Handle flat features
            else:
                final_key = 'feature'
        # otherwise, associate these annotations with the full record
        else:
            final_key = 'annotation'
        return [(final_key, gff_info)]

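The attribute splitting and the parent/child classification performed by `_gff_line_map` can be sketched in isolation. What follows is a minimal Python 3 sketch for the GFF3 case only, not the script's own code: `split_gff3_attributes` and `classify` are hypothetical names introduced here for illustration, and `urllib.parse.unquote` is assumed as the Python 3 counterpart of the Python 2 `urllib.unquote` call used in the listing.

```python
import collections
from urllib.parse import unquote


def split_gff3_attributes(attr_column):
    """Split a GFF3 attribute column (column 9) into a dict of value lists.

    Hypothetical helper; it mirrors only the GFF3 branch of _split_keyvals.
    """
    quals = collections.defaultdict(list)
    if not attr_column or attr_column == ".":
        return dict(quals)
    # Drop a stray trailing semicolon, then split into key=value parts.
    for part in attr_column.rstrip(";").split(";"):
        if not part:
            continue
        # Split on the first '=' only, so values containing '=' stay intact.
        key, _, val = part.partition("=")
        if val:
            # Comma-separated values become a list; then undo URL escaping.
            quals[key].extend(unquote(v) for v in val.split(","))
        else:
            # A key without a value is treated as a boolean-style flag.
            quals[key].append("true")
    return dict(quals)


def classify(quals, has_location=True):
    """Return the line category, using the same labels as _gff_line_map."""
    if not has_location:
        return "annotation"
    if "Parent" in quals:
        return "child"
    if quals.get("ID", [""])[0]:
        return "parent"
    return "feature"


if __name__ == "__main__":
    attrs = split_gff3_attributes("ID=exon1;Parent=mRNA1,mRNA2;Note=a%20b;")
    print(attrs, classify(attrs))
```

A feature that carries both `ID` and `Parent` is classified as a child here, which matches the order of the checks in the listing; `parent_child_id_map` then additionally records such a feature as a potential parent.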
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
152 def parent_child_id_map(gff_handle):
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
153 """Provide a mapping of parent to child relationships in the file.
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
154
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
155 Gives a dictionary of parent child relationships:
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
156
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
157 keys -- tuple of (source, type) for each parent
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
158 values -- tuple of (source, type) as children of that parent"""
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
159
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
160 # collect all of the parent and child types mapped to IDs
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
161 parent_sts = dict()
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
162 child_sts = collections.defaultdict(list)
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
163
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
164 for line in gff_handle:
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
165 line_type, line_info = _gff_line_map(line)[0]
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
166 if (line_type == 'parent' or (line_type == 'child' and line_info['id'])):
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
167 parent_sts[line_info['id']] = (line_info['quals']['source'][0], line_info['type'])
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
168 if line_type == 'child':
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
169 for parent_id in line_info['quals']['Parent']:
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
170 child_sts[parent_id].append((line_info['quals']['source'][0], line_info['type']))
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
171 gff_handle.close()
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
172
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
173 # generate a dictionary of the unique final type relationships
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
174 pc_map = collections.defaultdict(list)
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
175 for parent_id, parent_type in parent_sts.items():
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
176 for child_type in child_sts[parent_id]:
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
177 pc_map[parent_type].append(child_type)
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
178 pc_final_map = dict()
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
179 for ptype, ctypes in pc_map.items():
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
180 unique_ctypes = list(set(ctypes))
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
181 unique_ctypes.sort()
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
182 pc_final_map[ptype] = unique_ctypes
79726c328621 Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
vipints
parents:
diff changeset
183
    # Check for parent-child relations
    level1, level2, level3, sec_level_mis = {}, {}, {}, {}
    for etype, fchild in pc_final_map.items():
        level2_flag = 0
        for kp, vp in pc_final_map.items():
            if etype in vp: # etype is a child of some parent: second level feature
                level2_flag = 1
                level2[etype] = 1
        if level2_flag == 0: # first level features
            level1[etype] = 1
            for eachfch in fchild: # every child of a level1 feature should itself be a level2 key
                if eachfch not in pc_final_map: # record the missing level2 objects
                    if etype in sec_level_mis:
                        sec_level_mis[etype].append(eachfch)
                    else:
                        sec_level_mis[etype] = [eachfch]
        if level2_flag == 1: # children of a second level feature are third level features
            level3[str(fchild)] = 1
    # display the result
    if level1 == level2 == level3 == {} and sec_level_mis == {}:
        print
        print 'ONLY FIRST level feature(s):'
        source_type = dict()
        # re-open the file by name, since the handle passed in has already been read
        gff_handle = open(gff_handle.name, 'rU')
        for line in gff_handle:
            line = line.strip('\n\r')
            if not line or line[0] == '#': # skip blank lines and comments
                continue
            parts = line.split('\t')
            if parts[-1] == '': # drop the empty field left by a trailing tab
                parts.pop()
            assert len(parts) == 9, line
            source_type[(parts[1], parts[2])] = 1 # (source, type) columns
        gff_handle.close()
        for ele in source_type:
            print '\t' + str(ele)
        print
    else:
        print
        print '===Report on different level features from GFF file==='
        print
        print 'FIRST level feature(s):'
        for ele in level1:
            print '\t' + str(ele)
        print
        print 'SECOND level feature(s):'
        for ele in level2:
            print '\t' + str(ele)
        print
        print 'THIRD level feature(s):'
        for ele in level3: # keys are str(list); strip the surrounding brackets
            print '\t' + str(ele[1:-1])
        print
        # describe features that are mapped to the wrong level
        for wf, wfv in sec_level_mis.items():
            if wf[1] == 'gene':
                print 'GFF parsing modules from publicly available packages such as Biopython and BioPerl depend heavily on feature identifier mapping.'
                print 'Some features here seem to be wrongly mapped to their children, which in turn causes problems when extracting annotation by feature identifier.\n'
                for ehv in wfv:
                    if ehv[1] in ('exon', 'intron', 'CDS', 'three_prime_UTR', 'five_prime_UTR'):
                        print 'Error in ID mapping: Level1 feature ' + str(wf) + ' maps to Level3 feature ' + str(ehv)

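The level classification performed in `parent_child_id_map` can be tried on its own with a small hand-made map. The following is a minimal Python 3 sketch, not the script's own code: the function name `classify_levels`, the list-based return values, and the toy `(source, type)` tuples are assumptions made for illustration, whereas the script itself stores its results in dictionaries and targets Python 2.

```python
def classify_levels(pc_map):
    """Split feature types into first/second/third level.

    pc_map maps a parent (source, type) tuple to a list of child
    (source, type) tuples. Name and return shape are illustrative only.
    """
    level1, level2, level3, missing = [], [], [], {}
    for etype, children in pc_map.items():
        # A type that appears among any parent's children is second level.
        is_child = any(etype in kids for kids in pc_map.values())
        if is_child:
            level2.append(etype)
            # Children of a second level type are reported as third level.
            level3.append(children)
        else:
            level1.append(etype)
            # Children of a first level type that are never parents themselves
            # skip the second level; record them as suspicious mappings.
            orphans = [c for c in children if c not in pc_map]
            if orphans:
                missing[etype] = orphans
    return level1, level2, level3, missing


# Hypothetical input: 'exon' is attached directly to 'gene' as well as to 'mRNA'.
toy_map = {
    ('src', 'gene'): [('src', 'exon'), ('src', 'mRNA')],
    ('src', 'mRNA'): [('src', 'CDS'), ('src', 'exon')],
}
level1, level2, level3, missing = classify_levels(toy_map)
```

With this toy input, `missing` flags the `exon` that hangs directly off `gene`, which is the kind of mapping the script reports as a Level1 to Level3 error.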
if __name__ == '__main__':

    stime = time.asctime(time.localtime(time.time()))
    print '-------------------------------------------------------'
    print 'GFFExamine started on ' + stime
    print '-------------------------------------------------------'

    try:
        gff_handle = open(sys.argv[1], 'rU')
    except (IndexError, IOError): # missing argument or unreadable file
        sys.stderr.write("Can't open the GFF3 file, cannot continue...\n")
        sys.stderr.write("USAGE: gff_id_mapper.py <gff3 file>\n")
        sys.exit(-1)
    parent_child_id_map(gff_handle)

    stime = time.asctime(time.localtime(time.time()))
    print '-------------------------------------------------------'
    print 'GFFExamine finished at ' + stime
    print '-------------------------------------------------------'
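
When no parent-child relations are found, the script falls back to listing the `(source, type)` pairs taken from columns 2 and 3 of each feature line. A minimal Python 3 sketch of that column extraction follows; the helper name `collect_source_types` and the sample lines are hypothetical, and it raises `ValueError` where the script uses `assert`.

```python
def collect_source_types(lines):
    """Return the set of (source, type) pairs found in GFF3 feature lines."""
    pairs = set()
    for line in lines:
        line = line.strip('\n\r')
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith('#'):
            continue
        parts = line.split('\t')
        # A trailing tab leaves an empty last field; drop it.
        if parts[-1] == '':
            parts.pop()
        if len(parts) != 9:
            raise ValueError('expected 9 tab-separated columns: %r' % line)
        pairs.add((parts[1], parts[2]))
    return pairs


# Hypothetical sample lines, not taken from any real annotation file.
sample = [
    '##gff-version 3\n',
    'chr1\tdemo_src\tgene\t1\t100\t.\t+\t.\tID=g1\n',
    '\n',
    'chr1\tdemo_src\tgene\t200\t300\t.\t-\t.\tID=g2\t\n',
]
pairs = collect_source_types(sample)
```

A set is used here in place of the script's dictionary with dummy values, since only the distinct pairs are needed.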