annotate ensembl_longest_cds_per_gene.py @ 0:4dba69135845 draft

planemo upload for repository https://github.com/TGAC/earlham-galaxytools/tree/master/tools/ensembl_longest_cds_per_gene commit 26c70aecb56c19099455bb5a432615b09ad322d1
author earlhaminst
date Tue, 07 Mar 2017 05:54:30 -0500
parents
children 6cf9f7f6509c
"""
This script reads a CDS FASTA file from Ensembl and outputs a FASTA file with
only the longest CDS sequence for each gene. The header of the sequences in the
output file will be the transcript id without version.
"""
from __future__ import print_function

import collections
import optparse
import sys

Sequence = collections.namedtuple('Sequence', ['header', 'sequence'])


def FASTAReader_gen(fasta_filename):
    with open(fasta_filename) as fasta_file:
        line = fasta_file.readline()
        while True:
            if not line:
                return
            assert line.startswith('>'), "FASTA headers must start with >"
            header = line.rstrip()
            sequence_parts = []
            line = fasta_file.readline()
            while line and line[0] != '>':
                sequence_parts.append(line.rstrip())
                line = fasta_file.readline()
            sequence = "\n".join(sequence_parts)
            yield Sequence(header, sequence)


def remove_id_version(s):
    """
    Remove the optional '.VERSION' from an Ensembl id.
    """
    return s.split('.')[0]


parser = optparse.OptionParser()
parser.add_option('-f', '--fasta', dest="input_fasta_filename",
                  help='CDS file in FASTA format from Ensembl')
parser.add_option('-o', '--output', dest="output_fasta_filename",
                  help='Output FASTA file name')
options, args = parser.parse_args()

if options.input_fasta_filename is None:
    raise Exception('-f option must be specified')
if options.output_fasta_filename is None:
    raise Exception('-o option must be specified')

gene_transcripts_dict = dict()

for entry in FASTAReader_gen(options.input_fasta_filename):
    transcript_id, rest = entry.header[1:].split(' ', 1)
    transcript_id = remove_id_version(transcript_id)
    gene_id = None
    for s in rest.split(' '):
        if s.startswith('gene:'):
            gene_id = remove_id_version(s[5:])
            break
    else:
        print("Gene id not found in header '%s'" % entry.header, file=sys.stderr)
        continue
    if gene_id in gene_transcripts_dict:
        gene_transcripts_dict[gene_id].append((transcript_id, len(entry.sequence)))
    else:
        gene_transcripts_dict[gene_id] = [(transcript_id, len(entry.sequence))]

# For each gene, select the transcript with the longest sequence.
# If more than one transcript has the same longest length for a gene, the
# first one to appear in the FASTA file is selected.
# A set keeps the membership test in the output loop fast.
selected_transcript_ids = set(
    max(transcript_id_lengths, key=lambda _: _[1])[0]
    for transcript_id_lengths in gene_transcripts_dict.values()
)

with open(options.output_fasta_filename, 'w') as output_fasta_file:
    for entry in FASTAReader_gen(options.input_fasta_filename):
        transcript_id = remove_id_version(entry.header[1:].split(' ')[0])
        if transcript_id in selected_transcript_ids:
            output_fasta_file.write(">%s\n%s\n" % (transcript_id, entry.sequence))
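
The selection rule used by the script (longest CDS per gene, first transcript wins on ties, ids written without version) can be illustrated on in-memory records. This is only a sketch under stated assumptions: the function name `select_longest`, the `records` list and all the ids in it are hypothetical and made up for illustration, and the headers are reduced to the two parts the script actually reads (the leading transcript id and the `gene:` field).

```python
def remove_id_version(s):
    """Remove the optional '.VERSION' suffix from an Ensembl id."""
    return s.split('.')[0]


def select_longest(records):
    """Hypothetical helper: keep the longest sequence per gene.

    `records` is a list of (header, sequence) pairs, where each header
    starts with '>' followed by the transcript id and contains a
    'gene:<id>' field. Returns (transcript_id, sequence) pairs in input
    order, with the version suffix stripped from the transcript id.
    """
    best = {}  # gene id -> (transcript id, sequence length)
    for header, sequence in records:
        fields = header[1:].split(' ')
        transcript_id = remove_id_version(fields[0])
        gene_id = None
        for field in fields[1:]:
            if field.startswith('gene:'):
                gene_id = remove_id_version(field[len('gene:'):])
                break
        if gene_id is None:
            continue  # no gene id in the header: skip this record
        # Strict '>' keeps the first transcript seen when lengths tie.
        if gene_id not in best or len(sequence) > best[gene_id][1]:
            best[gene_id] = (transcript_id, len(sequence))

    selected = set(transcript_id for transcript_id, _ in best.values())
    result = []
    for header, sequence in records:
        transcript_id = remove_id_version(header[1:].split(' ')[0])
        if transcript_id in selected:
            result.append((transcript_id, sequence))
    return result


# Made-up records: two genes with two transcripts each.
records = [
    ('>ENST0001.2 cds gene:ENSG0001.5', 'ATGAAA'),
    ('>ENST0002.1 cds gene:ENSG0001.5', 'ATGAAACCCTAA'),
    ('>ENST0003.1 cds gene:ENSG0002.1', 'ATGTAA'),
    ('>ENST0004.1 cds gene:ENSG0002.1', 'ATGTGA'),
]

for transcript_id, sequence in select_longest(records):
    print(">%s\n%s" % (transcript_id, sequence))
```

For the first gene the longer transcript `ENST0002` is kept. For the second gene both sequences have the same length, so `ENST0003`, which appears first, is kept.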