annotate aggregate_linelisting.py @ 0:515c0c885f5d draft default tip

planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
author public-health-bioinformatics
date Thu, 04 Jul 2019 19:40:13 -0400
parents
children
#!/usr/bin/env python
'''Reads in a fasta file of antigenic maps and one with the reference antigenic map as
protein SeqRecords. Compares amino acids of sample antigenic maps to corresponding sites
in the reference and masks identical amino acids with dots. Writes headers (including
amino acid position numbers read from the respective index array), the reference amino
acid sequence and column headings required for both non-aggregated and aggregated line lists.
Outputs all headers and modified (i.e. dotted) sample sequences to a csv file.'''

'''Author: Diane Eisler, Molecular Microbiology & Genomics, BCCDC Public Health Laboratory, Jan 2018'''

#NOTE: Bio.Alphabet was removed in Biopython 1.78, so these imports need an earlier release
import sys, string, os, time, Bio, re, argparse
from Bio import Seq, SeqIO, SeqUtils, Alphabet, SeqRecord
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq

inputAntigenicMaps = sys.argv[1] #batch fasta file with antigenic map sequences
refAntigenicMap = sys.argv[2] #fasta file of reference antigenic map sequence
antigenicSiteIndexArray = sys.argv[3] #antigenic site index array csv file
cladeDefinitionFile = sys.argv[4] #clade definition csv file
outFileHandle = sys.argv[5] #user-specified output filename

agg_lineListFile = open(outFileHandle,'w') #open a writable output file

indicesLine = "" #comma-separated antigenic site positions
cladeList = [] #list of clade names read from clade definition file
ref_seq = "" #reference antigenic map (protein sequence)
seqList = [] #list of aa sequences to compare to reference

BC_list = [] #empty list for BC samples
AB_list = [] #empty list for AB samples
ON_list = [] #empty list for ON samples
QC_list = [] #empty list for QC samples
nonprov_list = [] #empty list for samples not in above 4 provinces
#dictionary for location-separated sequence lists
prov_lists = {'1_BC':BC_list,'2_AB':AB_list,'3_ON':ON_list,'4_QC': QC_list, '5_nonprov': nonprov_list}

def replace_matching_aa_with_dot(record):
    """Compare amino acids in record to reference, mask identical symbols with dots, and return modified record."""
    orig_seq = str(record.seq) #sequence string from SeqRecord
    mod_seq = ""
    #replace only those aa's matching the reference with dots
    for i in range(0, len(orig_seq)):
        if (orig_seq[i] == ref_seq[i]):
            mod_seq = mod_seq + '.'
        else:
            mod_seq = mod_seq + orig_seq[i]
    #assign modified sequence to new SeqRecord and return it
    rec = SeqRecord(Seq(mod_seq,IUPAC.protein), id = record.id, name = "", description = "")
    return rec
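The masking step above can be illustrated on plain strings, without Biopython. This is a minimal sketch, not the script's own code: `mask_matches` is a hypothetical helper, and the explicit length check is an assumption added here (the original indexes into the global `ref_seq` and would raise an IndexError on a longer sample).

```python
def mask_matches(sample, reference):
    """Return sample with every residue identical to reference replaced by '.'."""
    # Assumption: both sequences cover the same antigenic sites, so lengths must agree.
    if len(sample) != len(reference):
        raise ValueError("sample and reference must be the same length")
    return "".join("." if s == r else s for s, r in zip(sample, reference))


# Positions 2 and 5 differ from the reference, so only those residues are kept.
print(mask_matches("KTSNA", "KASNG"))  # .T..A
```

Building the result with `join` avoids repeated string concatenation, but the output is the same as the character-by-character loop.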

def extract_clade(record):
    """Extract clade name (or 'No_Match') from sequence name and return as clade name. """
    if record.id.endswith('No_Match'):
        clade_name = 'No_Match'
    else: #otherwise look for a known clade name at the end of the sequence name
        #NOTE: if several clade names match, the last one in cladeList wins; if none
        #match, clade_name is never assigned and the return raises UnboundLocalError
        for clade in cladeList:
            if record.id.endswith(clade):
                clade_name = clade
    return clade_name
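A suffix lookup like the one above has two edge cases: no clade name matches, or several do (one clade name being a suffix of another). The following is a hedged sketch of one way to handle both, not the script's behaviour: `extract_clade_safe` is a hypothetical name, and falling back to 'No_Match' and preferring the longest match are assumptions made for illustration.

```python
def extract_clade_safe(seq_id, clade_names, default="No_Match"):
    """Return the longest clade name that seq_id ends with, or default if none does."""
    if seq_id.endswith(default):
        return default
    matches = [clade for clade in clade_names if seq_id.endswith(clade)]
    # Assumption: the longest suffix is the most specific clade name.
    return max(matches, key=len) if matches else default


clades = ["2a1", "3C.2a1", "3C.3a"]
print(extract_clade_safe("A/Ontario/07/2018_3C.2a1", clades))  # 3C.2a1
print(extract_clade_safe("A/Ontario/08/2018", clades))         # No_Match
```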

def extract_sample_name(record, clade):
    """Extract sample name from sequence name and return sample name. """
    end_index = record.id.index(clade)
    sample_name = record.id[:end_index -1]
    #return sample name as sequence name minus underscore and clade name
    return sample_name
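Because `index` returns the first occurrence of the clade string, a sequence name that happens to contain the clade text earlier on would be cut short. A minimal sketch of a suffix-based alternative follows; `strip_clade_suffix` is a hypothetical helper, and the underscore separator is taken from the comment in the function above.

```python
def strip_clade_suffix(seq_id, clade):
    """Return seq_id without its trailing '_<clade>' suffix."""
    suffix = "_" + clade
    if not seq_id.endswith(suffix):
        raise ValueError("%r does not end with %r" % (seq_id, suffix))
    return seq_id[:-len(suffix)]


print(strip_clade_suffix("Flu-BC-001_3C.2a", "3C.2a"))  # Flu-BC-001
```

Slicing from the end only ever removes the final suffix, wherever else the clade text appears in the name.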

def sort_by_location(record):
    """Search sequence name for province name or 2-letter province code and add SeqRecord to
    province-specific dictionary."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        BC_list.append(record) #add Sequence record to BC_list
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        AB_list.append(record) #add Sequence record to AB_list
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        ON_list.append(record) #add Sequence record to ON_list
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        QC_list.append(record) #add Sequence record to QC_list
    else:
        nonprov_list.append(record) #add Sequence record to nonprov_list
    return

def extract_province(record):
    """Search sequence name for province name or 2-letter province code and return province."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        province = 'British Columbia'
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        province = 'Alberta'
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        province = 'Ontario'
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        province = 'Quebec'
    else:
        province = "other"
    return province
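sort_by_location and extract_province repeat the same marker strings, so the two can drift apart. One way to keep them in sync is a single lookup table; the sketch below is illustrative only, and `PROVINCE_MARKERS` and `lookup_province` are hypothetical names that do not appear in the script.

```python
# Each entry pairs a province name with the substrings that identify it in a sequence name.
PROVINCE_MARKERS = [
    ("British Columbia", ("-BC-", "/British_Columbia/")),
    ("Alberta", ("-AB-", "/Alberta/")),
    ("Ontario", ("-ON-", "/Ontario/")),
    ("Quebec", ("-QC-", "/Quebec/")),
]


def lookup_province(seq_id):
    """Return the first province whose marker occurs in seq_id, or 'other'."""
    for province, markers in PROVINCE_MARKERS:
        if any(marker in seq_id for marker in markers):
            return province
    return "other"


print(lookup_province("A/Alberta/12/2018_3C.2a"))  # Alberta
print(lookup_province("Flu-QC-77_No_Match"))       # Quebec
print(lookup_province("A/Texas/50/2012_3C.3a"))    # other
```

The table is checked in order, which preserves the BC, AB, ON, QC precedence of the if/elif chains.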

def get_sequence_length(record):
    """Return length of sequence in a SeqRecord."""
    sequenceLength = len(str((record.seq)))
    return sequenceLength

def get_antigenic_site_substitutions(record):
    """Count number of non-dotted amino acids in SeqRecord sequence and return as substitutions."""
    sequenceLength = get_sequence_length(record)
    seqString = str(record.seq)
    matches = seqString.count('.')
    substitutions = sequenceLength - matches
    return substitutions

def calculate_percent_id(record, substitutions):
    """Calculate percent sequence identity to reference sequence, based on substitutions
    and sequence length, and return percent id as a ratio (i.e. 0.90, not 90%)."""
    sequenceLength = get_sequence_length(record)
    percentID = (1.00 - (float(substitutions)/float(sequenceLength)))
    return percentID
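As a worked example of the arithmetic above: a masked sequence of 4 sites with 3 dots has 4 - 3 = 1 substitution, giving an identity of 1 - 1/4 = 0.75. The sketch below reproduces that on plain strings; `count_substitutions` and `identity_ratio` are hypothetical names, and the empty-sequence check is an assumption added to avoid dividing by zero.

```python
def count_substitutions(masked):
    """Count residues in a dot-masked sequence that differ from the reference."""
    return len(masked) - masked.count(".")


def identity_ratio(masked):
    """Return identity to the reference as a ratio between 0 and 1."""
    if not masked:
        raise ValueError("masked sequence is empty")
    return 1.0 - count_substitutions(masked) / float(len(masked))


print(count_substitutions("..T."))  # 1
print(identity_ratio("..T."))       # 0.75
```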

def output_aggregated_linelist(a_list):
    """Output aggregated line list of SeqRecords in csv format."""
    sequevars = {} #dict of sequevar: SeqRecord list
    firstRecordID = None
    #examine dotted/masked sequences in list and assign unique ones as dict keys
    for rec in a_list:
        rec = replace_matching_aa_with_dot(rec)
        sequence = str(rec.seq)
        #if the sequence is a key in the dict, add SeqRecord to list
        if sequence in sequevars:
            #if sequence already in dict as a key, append the SeqRecord to its list
            sequevars[sequence].append(rec)
        else:
            #if sequence not in dict, add it as new key with list of 1 SeqRecord
            sequevars[sequence] = [rec]
    #get list of sorted unique sequence keys
    sorted_unique_seq_keys = sorted(sequevars.keys())
    #process each list of SeqRecords sharing a unique sequence
    for u in sorted_unique_seq_keys:
        #access list of sequences by unique sequence
        listOfSeqs = sequevars[u]
        #sort this list of SeqRecords by record.id (i.e. name)
        listOfSeqs = sorted(listOfSeqs, key=lambda x: x.id)
        N = len(listOfSeqs)
        #output details of first SeqRecord to csv
        firstRecord = listOfSeqs[0]
        province = extract_province(firstRecord)
        clade = extract_clade(firstRecord)
        substitutions = get_antigenic_site_substitutions(firstRecord)
        percentID = calculate_percent_id(firstRecord,substitutions)
        name = extract_sample_name(firstRecord, clade)
        name_part = name.rstrip() + ','
        N_part = str(N) + ','
        clade_part = clade + ','
        substitutions_part = str(substitutions) + ','
        percID_part = str(percentID) + ','
        col = " ," #empty column
        sequence = str(firstRecord.seq).strip()
        csv_seq = ",".join(sequence) + ","
        comma_sep_output = name_part + N_part + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
        #write first member of unique sequence list to csv
        agg_lineListFile.write(comma_sep_output)
        #print sequence records in sequevar to console
        print("\n\t\t%i SeqRecords matching Sequevar: %s" % (len(listOfSeqs), u))

        #to uncollapse sequevar group, print each member of the sequevar list to csv output
        '''for i in range(1,len(listOfSeqs)):
            currentRec = listOfSeqs[i]
            province = extract_province(currentRec)
            clade = extract_clade(currentRec)
            substitutions = get_antigenic_site_substitutions(currentRec)
            percentID = calculate_percent_id(currentRec,substitutions)
            name_part = (currentRec.id).rstrip() + ','
            N_part = "n/a" + ','
            clade_part = clade + ','
            substitutions_part = str(substitutions) + ','
            percID_part = str(percentID) + ','
            col = " ," #empty column
            sequence = str(currentRec.seq).strip()
            csv_seq = ",".join(sequence) + ","
            comma_sep_output = name_part + N_part + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
            agg_lineListFile.write(comma_sep_output) '''
    return
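
# A minimal sketch of the aggregation step above, assuming records are plain
# (record_id, masked_sequence) tuples instead of Biopython SeqRecords. The name
# group_by_masked_sequence is hypothetical, and the record id stands in for the
# value that extract_sample_name() would return.

```python
def group_by_masked_sequence(records):
    """Collapse records that share an identical masked sequence into one row each."""
    groups = {}
    for record_id, masked_seq in records:
        # the masked sequence is the dict key; ids with that sequence accumulate in a list
        groups.setdefault(masked_seq, []).append(record_id)
    rows = []
    for masked_seq in sorted(groups):
        ids = sorted(groups[masked_seq])
        # report the first id (alphabetically) as the representative, plus the group size N
        rows.append((ids[0], len(ids), masked_seq))
    return rows


example_rows = group_by_masked_sequence([
    ("sampleB", "..K."),
    ("sampleA", "..K."),
    ("sampleC", "...."),
])
```

# Sorting both the keys and the ids within each group keeps the output order
# reproducible regardless of the order of records in the input fasta.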

with open(antigenicSiteIndexArray,'r') as siteIndices:
    """Read amino acid positions from antigenic site index array and print as header after one empty row."""
    col = "," #empty column
    #read amino acid positions and remove trailing whitespace
    for line in siteIndices:
        #remove whitespace from the end of each line
        indicesLine = line.rstrip()
        row1 = "\n"
        #add comma-separated AA positions to header line
        row2 = col + col + col + col + indicesLine + "\n"
        #write first (empty) and 2nd (amino acid position) lines to output file
        agg_lineListFile.write(row1)
        agg_lineListFile.write(row2)

with open(refAntigenicMap,'r') as refMapFile:
    """Read reference antigenic map from fasta and output amino acids, followed by column headers."""
    #read sequences from fasta to SeqRecord, uppercase, and store sequence string to ref_seq
    record = SeqIO.read(refMapFile,"fasta",alphabet=IUPAC.protein)
    record = record.upper()
    ref_seq = str(record.seq).strip() #store sequence in variable for comparison to sample seqs
    col = "," #empty column
    name_part = (record.id).rstrip() + ','
    sequence = str(record.seq).strip()
    csv_seq = ",".join(sequence)
    #output row with reference sequence displayed above sample sequences
    row3 = name_part + col + col + col + csv_seq + "\n"
    agg_lineListFile.write(row3)
    positions = indicesLine.split(',')
    numPos = len(positions)
    empty_indicesLine = ',' * numPos
    #print column headers for sample sequences
    row4 = "Sequence Name,N,Clade,Extra Substitutions," + empty_indicesLine + "Number of Amino Acid Substitutions in Antigenic Sites,% Identity of Antigenic Site Residues\n"
    agg_lineListFile.write(row4)
    print("\nREFERENCE ANTIGENIC MAP: '%s' (%i amino acids)" % (record.id, len(record)))
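
# A minimal sketch of how the three header rows above line up column by column,
# assuming the index line and reference sequence are passed in as plain strings.
# The name build_header_rows is hypothetical, and the two summary headers are
# abbreviated here for readability.

```python
def build_header_rows(indices_line, ref_name, ref_seq):
    """Build the position, reference, and column-header rows as CSV strings."""
    positions = indices_line.split(",")
    # four leading cells (name, N, clade, extra substitutions) precede the site columns
    position_row = ",,,," + indices_line
    reference_row = ref_name + ",,,," + ",".join(ref_seq)
    # one empty cell per antigenic site pushes the summary headers to the right
    header_row = ("Sequence Name,N,Clade,Extra Substitutions,"
                  + "," * len(positions)
                  + "Substitutions,Identity")
    return position_row, reference_row, header_row


position_row, reference_row, header_row = build_header_rows("145,155,156", "REF", "KTH")
```

# The padding of one comma per site is what places the substitution count and
# identity headers directly above the matching values in each sample row.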

with open(cladeDefinitionFile,'r') as cladeFile:
    """Read clade definition file and store clade names in list."""
    #remove whitespace from the end of each line and split elements at commas
    for line in cladeFile:
        elementList = line.rstrip().split(',')
        name = elementList[0] #move 1st element to name field
        cladeList.append(name)

with open(inputAntigenicMaps,'r') as extrAntigMapFile:
    """Read antigenic maps as protein SeqRecords and add to list."""
    #read sequences from fasta file, uppercase and add to seqList
    for record in SeqIO.parse(extrAntigMapFile, "fasta", alphabet=IUPAC.protein):
        record = record.upper()
        seqList.append(record) #add SeqRecord to list of sequences

515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
234 #print number of sequences to be processed as user check
515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
235 print("\nCOMPARING %i flu antigenic map sequences to the reference..." % len(seqList))
515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
236 for record in seqList:
515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
237 #assign SeqRecords to province-specific dictionaries
515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
238 sort_by_location(record)
515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
239
515c0c885f5d planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff changeset
#access prov segregated lists in order
sorted_prov_keys = sorted(prov_lists.keys())
print("\nSequence Lists Sorted by Province: ")
for prov in sorted_prov_keys:
    current_list = prov_lists[prov]
    #mask AA's identical to reference sequence with dot
    masked_list = [] # empty temporary list to park masked sequences
    for record in current_list:
        masked_rec = replace_matching_aa_with_dot(record)
        masked_list.append(masked_rec)
    prov_lists[prov] = masked_list #replace original SeqRecord list with masked list

#group sequences in province-sorted list into clades
for prov in sorted_prov_keys:
    prov_list = prov_lists[prov]
    by_clades_dict = {} #empty dict for clade:seqRecord list groups
    print("\n'%s' List (Amino Acids identical to Reference are Masked): " % (prov))
    for rec in prov_list:
        clade = extract_clade(rec)
        if clade in by_clades_dict:
            #if clade already in dict as key, append record to list (value)
            by_clades_dict[clade].append(rec)
        else: #add clade as key to dict, value is list of 1 SeqRecord
            by_clades_dict[clade] = [rec]
    #get list of alphabetically sorted clade keys
    sorted_clade_keys = sorted(by_clades_dict.keys())
    #use % formatting so the count prints the same under Python 2 and 3
    print("\tNumber of clades: %i" % len(by_clades_dict))
    #group each list of sequences in clade by sequevars
    for key in sorted_clade_keys:
        print("\n\tCLADE: %s Number of Members: %i" % (key, len(by_clades_dict[key])))
        a_list = by_clades_dict[key]
        for seqrec in a_list:
            print("\t %s: %s" % (seqrec.id, str(seqrec.seq)))
        #output the list to csv as aggregated linelist
        output_aggregated_linelist(a_list)

#closing quote moved before the newline so the file name is quoted on one line
print("Aggregated Linelist written to file: '%s'\n" % (outFileHandle))
extrAntigMapFile.close()
refMapFile.close()
agg_lineListFile.close()
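
The clade-grouping step in the listing above (test whether the clade key exists, then append or create a one-item list) can also be written with `collections.defaultdict`. The following is only a minimal sketch of that idea, not the script's own code: `Rec` is a hypothetical stand-in for a Biopython SeqRecord, `group_by_clade` and its `get_clade` argument are illustrative names, and the sample ids, sequences, and clade labels are made up.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a SeqRecord; only the fields this sketch needs.
Rec = namedtuple("Rec", ["id", "seq", "clade"])


def group_by_clade(records, get_clade):
    """Return {clade: [records]}, keeping input order within each clade."""
    groups = defaultdict(list)
    for rec in records:
        # A missing key starts as an empty list, so no if/else is needed.
        groups[get_clade(rec)].append(rec)
    return dict(groups)


# Made-up masked sequences for illustration only.
records = [
    Rec("A/BC/001", "..K..", "3C.2a"),
    Rec("A/BC/002", ".....", "3C.3a"),
    Rec("A/BC/003", "N....", "3C.2a"),
]

by_clade = group_by_clade(records, lambda r: r.clade)
for clade in sorted(by_clade):
    print("CLADE: %s Number of Members: %i" % (clade, len(by_clade[clade])))
```

In the script itself, `extract_clade` would play the role of `get_clade`; whether the swap is worthwhile depends on how much of the surrounding per-clade printing is kept.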