annotate aggregate_linelisting.py @ 0:515c0c885f5d draft default tip
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
| author | public-health-bioinformatics | 
|---|---|
| date | Thu, 04 Jul 2019 19:40:13 -0400 | 
| parents | |
| children | 
#!/usr/bin/env python
'''Reads in a fasta file of antigenic maps and one with the reference antigenic map as
protein SeqRecords. Compares amino acids of sample antigenic maps to corresponding sites
in the reference and masks identical amino acids with dots. Writes headers (including
amino acid position numbers read from the respective index array), the reference amino
acid sequence and column headings required for both non-aggregated and aggregated line lists.
Outputs all headers and modified (i.e. dotted) sample sequences to a csv file.'''

'''Author: Diane Eisler, Molecular Microbiology & Genomics, BCCDC Public Health Laboratory, Jan 2018'''

#NOTE: Bio.Alphabet was removed in Biopython 1.78, so these imports need an earlier release
import sys, string, os, time, Bio, re, argparse
from Bio import Seq, SeqIO, SeqUtils, Alphabet, SeqRecord
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq

inputAntigenicMaps = sys.argv[1] #batch fasta file with antigenic map sequences
refAntigenicMap = sys.argv[2] #fasta file of reference antigenic map sequence
antigenicSiteIndexArray = sys.argv[3] #antigenic site index array csv file
cladeDefinitionFile = sys.argv[4] #clade definition csv file
outFileHandle = sys.argv[5] #user-specified output filename

agg_lineListFile = open(outFileHandle,'w') #open a writable output file
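
The script imports argparse but reads its five inputs positionally from sys.argv. As a hedged sketch only, the same five positional inputs could be declared with argparse so that a missing argument produces a usage message. The function name `parse_args` and the argument names below are hypothetical and do not come from the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the five positional inputs (argument names are illustrative assumptions)."""
    parser = argparse.ArgumentParser(
        description="Build an aggregated line list from antigenic map sequences."
    )
    parser.add_argument("antigenic_maps", help="batch fasta file with antigenic map sequences")
    parser.add_argument("ref_antigenic_map", help="fasta file of reference antigenic map sequence")
    parser.add_argument("index_array", help="antigenic site index array csv file")
    parser.add_argument("clade_definitions", help="clade definition csv file")
    parser.add_argument("output", help="user-specified output filename")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of the real command line.
args = parse_args(["samples.fasta", "ref.fasta", "sites.csv", "clades.csv", "out.csv"])
print(args.output)
```

The order of the positionals matches the sys.argv indices 1 to 5 used above, so a caller would not need to change how the script is invoked.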

indicesLine = "" #comma-separated antigenic site positions
cladeList = [] #list of clade names read from clade definition file
ref_seq = "" #reference antigenic map (protein sequence)
seqList = [] #list of aa sequences to compare to reference

BC_list = [] #empty list for BC samples
AB_list = [] #empty list for AB samples
ON_list = [] #empty list for ON samples
QC_list = [] #empty list for QC samples
nonprov_list = [] #empty list for samples not in above 4 provinces
#dictionary for location-separated sequence lists
prov_lists = {'1_BC':BC_list,'2_AB':AB_list,'3_ON':ON_list,'4_QC': QC_list, '5_nonprov': nonprov_list}

def replace_matching_aa_with_dot(record):
    """Compare amino acids in record to reference, mask identical symbols with dots, and return modified record."""
    orig_seq = str(record.seq) #sequence string from SeqRecord
    mod_seq = ""
    #replace only those aa's matching the reference with dots
    for i in range(0, len(orig_seq)):
        if (orig_seq[i] == ref_seq[i]):
            mod_seq = mod_seq + '.'
        else:
            mod_seq = mod_seq + orig_seq[i]
    #assign modified sequence to new SeqRecord and return it
    rec = SeqRecord(Seq(mod_seq,IUPAC.protein), id = record.id, name = "", description = "")
    return rec
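
The dot-masking step can be illustrated on plain strings, without Biopython. This is a minimal sketch, not the script's own code: `mask_matches` is a hypothetical helper, and the two short sequences are made-up examples. Unlike the loop above, it rejects sequences whose length differs from the reference instead of indexing past the end.

```python
def mask_matches(sample, reference, mask_char="."):
    """Return sample with every residue identical to reference replaced by mask_char."""
    if len(sample) != len(reference):
        # Assumption: antigenic maps are expected to be the same length as the reference.
        raise ValueError("sample and reference must be the same length")
    return "".join(mask_char if s == r else s for s, r in zip(sample, reference))


reference = "KNSTGAR"  # made-up reference antigenic map
sample = "KNSAGAR"     # differs from the reference at one site
print(mask_matches(sample, reference))  # ...A...
```

Only the substituted residue survives, so differences from the reference stand out when many masked sequences are stacked in a line list.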

def extract_clade(record):
    """Extract clade name (or 'No_Match') from sequence name and return as clade name. """
    #assumes every record id ends with 'No_Match' or a name in cladeList;
    #otherwise clade_name is never assigned
    if record.id.endswith('No_Match'):
        clade_name = 'No_Match'
    else:
        for clade in cladeList:
            if record.id.endswith(clade):
                clade_name = clade
    return clade_name

def extract_sample_name(record, clade):
    """Extract sample name from sequence name and return sample name. """
    end_index = record.id.index(clade)
    sample_name = record.id[:end_index -1]
    #return sample name as sequence name minus underscore and clade name
    return sample_name
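
extract_sample_name relies on str.index, which finds the first occurrence of the clade string, so an ID that happens to contain the clade text earlier would be cut too short. A hedged alternative is to strip the "_clade" suffix explicitly. `strip_clade_suffix` and the sample IDs below are hypothetical, and the underscore separator is taken from the comment in the function above.

```python
def strip_clade_suffix(seq_id, clade):
    """Return seq_id without its trailing '_<clade>' suffix."""
    suffix = "_" + clade
    if not seq_id.endswith(suffix):
        raise ValueError("sequence id %r does not end with %r" % (seq_id, suffix))
    return seq_id[:-len(suffix)]


print(strip_clade_suffix("Flu-BC-001_3C.2a", "3C.2a"))        # Flu-BC-001
print(strip_clade_suffix("3C.2a-like-BC-7_3C.2a", "3C.2a"))   # 3C.2a-like-BC-7
```

The second call shows the case where the clade text also appears at the start of the ID; only the trailing suffix is removed.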

def sort_by_location(record):
    """Search sequence name for province name or 2-letter province code and add SeqRecord to
    province-specific dictionary."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        BC_list.append(record) #add Sequence record to BC_list
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        AB_list.append(record) #add Sequence record to AB_list
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        ON_list.append(record) #add Sequence record to ON_list
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        QC_list.append(record) #add Sequence record to QC_list
    else:
        nonprov_list.append(record) #add Sequence record to nonprov_list
    return

def extract_province(record):
    """Search sequence name for province name or 2-letter province code and return province."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        province = 'British Columbia'
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        province = 'Alberta'
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        province = 'Ontario'
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        province = 'Quebec'
    else:
        province = "other"
    return province
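
sort_by_location and extract_province repeat the same substring tests. One possible way to keep the markers in a single place is a lookup table, sketched below. `PROVINCE_MARKERS`, `lookup_province`, and the example sequence names are hypothetical and not part of the original script.

```python
# Each entry pairs a province name with the substrings that identify it in a sequence name.
PROVINCE_MARKERS = [
    ("British Columbia", ("-BC-", "/British_Columbia/")),
    ("Alberta", ("-AB-", "/Alberta/")),
    ("Ontario", ("-ON-", "/Ontario/")),
    ("Quebec", ("-QC-", "/Quebec/")),
]


def lookup_province(seq_name):
    """Return the first province whose marker occurs in seq_name, else 'other'."""
    for province, markers in PROVINCE_MARKERS:
        if any(marker in seq_name for marker in markers):
            return province
    return "other"


print(lookup_province("A/Alberta/12/2018_3C.2a"))  # Alberta
print(lookup_province("Flu-QC-77_No_Match"))       # Quebec
print(lookup_province("A/Texas/50/2012_3C.3a"))    # other
```

Because the list is checked in order, the precedence matches the if/elif chains: British Columbia first, then Alberta, Ontario, and Quebec.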

def get_sequence_length(record):
    """Return length of sequence in a SeqRecord."""
    sequenceLength = len(str((record.seq)))
    return sequenceLength

def get_antigenic_site_substitutions(record):
    """Count number of non-dotted amino acids in SeqRecord sequence and return as substitutions."""
    sequenceLength = get_sequence_length(record)
    seqString = str(record.seq)
    matches = seqString.count('.')
    substitutions = sequenceLength - matches
    return substitutions

def calculate_percent_id(record, substitutions):
    """Calculate percent sequence identity to reference sequence, based on substitutions
    and sequence length, and return percent id as a ratio (i.e. 0.90, not 90%)."""
    sequenceLength = get_sequence_length(record)
    percentID = (1.00 - (float(substitutions)/float(sequenceLength)))
    return percentID
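
The two helpers above reduce to simple arithmetic on a dot-masked string. The following is a minimal sketch of that arithmetic, assuming plain strings instead of SeqRecords; the names `count_substitutions` and `percent_identity` and the sample string are hypothetical and do not appear in the script.

```python
def count_substitutions(masked_seq):
    """Count residues that differ from the reference in a dot-masked string."""
    # Dots mark positions identical to the reference; every other
    # character is counted as a substitution.
    return len(masked_seq) - masked_seq.count('.')


def percent_identity(masked_seq):
    """Return identity to the reference as a ratio (0.90, not 90%)."""
    # Assumes a non-empty sequence; an empty string would divide by zero.
    return 1.0 - float(count_substitutions(masked_seq)) / float(len(masked_seq))


masked = "..K....N.."  # hypothetical 10-residue masked antigenic map
print(count_substitutions(masked))  # 2
print(percent_identity(masked))
```

Two substitutions in ten residues give a ratio of about 0.8, which matches the "ratio, not percentage" convention stated in the docstring.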

def output_aggregated_linelist(a_list):
    """Output aggregated line list of SeqRecords in csv format."""
    sequevars = {} #dict of sequevar: SeqRecord list
    firstRecordID = None
    #examine dotted/masked sequences in list and assign unique ones as dict keys
    for rec in a_list:
        rec = replace_matching_aa_with_dot(rec)
        sequence = str(rec.seq)
        #check whether the masked sequence is already a key in the dict
        if sequence in sequevars:
            #if sequence already in dict as a key, append SeqRecord to its list
            sequevars[sequence].append(rec)
        else:
            #if sequence not in dict, add it as new key with list of 1 SeqRecord
            sequevars[sequence] = [rec]
    #get list of sorted unique sequence keys
    sorted_unique_seq_keys = sorted(sequevars.keys())
    #process each list of SeqRecords sharing a unique sequence
    for u in sorted_unique_seq_keys:
        #access list of sequences by unique sequence
        listOfSeqs = sequevars[u]
        #sort this list of SeqRecords by record.id (i.e. name)
        listOfSeqs = sorted(listOfSeqs, key=lambda x: x.id)
        N = len(listOfSeqs)
        #output details of first SeqRecord to csv
        firstRecord = listOfSeqs[0]
        province = extract_province(firstRecord)
        clade = extract_clade(firstRecord)
        substitutions = get_antigenic_site_substitutions(firstRecord)
        percentID = calculate_percent_id(firstRecord, substitutions)
        name = extract_sample_name(firstRecord, clade)
        name_part = name.rstrip() + ','
        N_part = str(N) + ','
        clade_part = clade + ','
        substitutions_part = str(substitutions) + ','
        percID_part = str(percentID) + ','
        col = " ," #empty column
        sequence = str(firstRecord.seq).strip()
        csv_seq = ",".join(sequence) + ","
        comma_sep_output = name_part + N_part + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
        #write first member of unique sequence list to csv
        agg_lineListFile.write(comma_sep_output)
        #print number of SeqRecords in this sequevar to console
        print("\n\t\t%i SeqRecords matching Sequevar: %s" % (len(listOfSeqs), u))

        #to uncollapse sequevar group, print each member of the sequevar list to csv output
        '''for i in range(1,len(listOfSeqs)):
            currentRec = listOfSeqs[i]
            province = extract_province(currentRec)
            clade = extract_clade(currentRec)
            substitutions = get_antigenic_site_substitutions(currentRec)
            percentID = calculate_percent_id(currentRec,substitutions)
            name_part = (currentRec.id).rstrip() + ','
            N_part = "n/a" + ','
            clade_part = clade + ','
            substitutions_part = str(substitutions) + ','
            percID_part = str(percentID) + ','
            col = " ," #empty column
            sequence = str(currentRec.seq).strip()
            csv_seq = ",".join(sequence) +","
            comma_sep_output = name_part + N_part + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
            agg_lineListFile.write(comma_sep_output) '''
    return
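
The aggregation step in `output_aggregated_linelist` groups records by their masked sequence, then reports the first record (by id) of each group together with the group size N. The following is a rough sketch of that grouping only, assuming a `namedtuple` as a stand-in for Biopython's SeqRecord; `Record`, `group_by_sequence` and the sample ids are hypothetical.

```python
from collections import namedtuple

# Stand-in for a SeqRecord; only the two fields used here (assumption).
Record = namedtuple("Record", ["id", "seq"])


def group_by_sequence(records):
    """Map each distinct masked sequence to its records, sorted by id."""
    groups = {}
    for rec in records:
        # Create the list on first sight of a sequence, then append.
        groups.setdefault(rec.seq, []).append(rec)
    return {seq: sorted(recs, key=lambda r: r.id) for seq, recs in groups.items()}


records = [
    Record("sampleB", "..K."),
    Record("sampleA", "..K."),
    Record("sampleC", "...."),
]
groups = group_by_sequence(records)
# One output line per unique sequence: representative id, group size, sequence.
for seq in sorted(groups):
    first = groups[seq][0]
    print("%s,%i,%s" % (first.id, len(groups[seq]), seq))
```

Sorting the keys and then the records within each group keeps the output order deterministic, independent of the order of records in the input fasta.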

with open(antigenicSiteIndexArray, 'r') as siteIndices:
    """Read amino acid positions from antigenic site index array and print as header after one empty row."""
    col = "," #empty column
    #read amino acid positions and remove trailing whitespace
    for line in siteIndices:
        #remove whitespace from the end of each line
        indicesLine = line.rstrip()
        row1 = "\n"
        #add comma-separated AA positions to header line
        row2 = col + col + col + col + indicesLine + "\n"
        #write first (empty) and 2nd (amino acid position) lines to output file
        agg_lineListFile.write(row1)
        agg_lineListFile.write(row2)

with open(refAntigenicMap, 'r') as refMapFile:
    """Read reference antigenic map from fasta and output amino acids, followed by column headers."""
    #read sequence from fasta to SeqRecord, uppercase, and store sequence string to ref_seq
    #note: the alphabet argument and Bio.Alphabet.IUPAC require Biopython < 1.78
    record = SeqIO.read(refMapFile, "fasta", alphabet=IUPAC.protein)
    record = record.upper()
    ref_seq = str(record.seq).strip() #store sequence in variable for comparison to sample seqs
    col = "," #empty column
    name_part = (record.id).rstrip() + ','
    sequence = str(record.seq).strip()
    csv_seq = ",".join(sequence)
    #output row with reference sequence displayed above sample sequences
    row3 = name_part + col + col + col + csv_seq + "\n"
    agg_lineListFile.write(row3)
    positions = indicesLine.split(',')
    numPos = len(positions)
    empty_indicesLine = ',' * numPos
    #print column headers for sample sequences
    row4 = "Sequence Name,N,Clade,Extra Substitutions," + empty_indicesLine + "Number of Amino Acid Substitutions in Antigenic Sites,% Identity of Antigenic Site Residues\n"
    agg_lineListFile.write(row4)
    print("\nREFERENCE ANTIGENIC MAP: '%s' (%i amino acids)" % (record.id, len(record)))
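
The position row, reference row and column-header row written above have to line up: four leading columns, then one column per antigenic site. The following sketch shows that layout using lists of cells instead of string concatenation. The function name `build_header_rows`, the position values and the reference name are hypothetical, chosen only for illustration.

```python
def build_header_rows(indices_line, ref_name, ref_seq):
    """Return the position row, reference row and column-header row as lists of cells."""
    positions = indices_line.rstrip().split(',')
    # Four empty leading cells, then one cell per antigenic site position.
    position_row = [''] * 4 + positions
    # Reference name, three empty cells, then one residue per cell.
    reference_row = [ref_name, '', '', ''] + list(ref_seq)
    # Four named columns, empty cells over the residues, then two summary columns.
    header_row = (['Sequence Name', 'N', 'Clade', 'Extra Substitutions']
                  + [''] * len(positions)
                  + ['Number of Amino Acid Substitutions in Antigenic Sites',
                     '% Identity of Antigenic Site Residues'])
    return position_row, reference_row, header_row


rows = build_header_rows("122,124,126\n", "ref_map", "NKT")
for row in rows:
    print(','.join(row))
```

Building each row as a list makes the column count explicit, so a mismatch between the index array and the reference sequence length would show up as rows of different lengths.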

with open(cladeDefinitionFile, 'r') as cladeFile:
    """Read clade definition file and store clade names in list."""
    #remove whitespace from the end of each line and split elements at commas
    for line in cladeFile:
        elementList = line.rstrip().split(',')
        name = elementList[0] #move 1st element to name field
        cladeList.append(name)
| 
 
515c0c885f5d
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
 
public-health-bioinformatics 
parents:  
diff
changeset
 | 
226 | 
| 
 
515c0c885f5d
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
 
public-health-bioinformatics 
parents:  
diff
changeset
 | 
227 with open(inputAntigenicMaps,'r') as extrAntigMapFile: | 
228 """Read antigenic maps as protein SeqRecords and add to list.""" | 
229 #read Sequences from fasta file, uppercase and add to seqList | 
230 for record in SeqIO.parse(extrAntigMapFile, "fasta", alphabet=IUPAC.protein): | 
231 record = record.upper() | 
232 seqList.append(record) #add Seq to list of Sequences | 
233 | 
234 #print number of sequences to be processed as user check | 
235 print("\nCOMPARING %i flu antigenic map sequences to the reference..." % len(seqList)) | 
236 for record in seqList: | 
237 #assign SeqRecords to province-specific dictionaries | 
238 sort_by_location(record) | 
239 | 
240 #access prov segregated lists in order | 
241 sorted_prov_keys = sorted(prov_lists.keys()) | 
242 print("\nSequence Lists Sorted by Province: ") | 
243 for prov in sorted_prov_keys: | 
244 current_list = prov_lists[prov] | 
245 #mask AA's identical to reference sequence with dot | 
246 masked_list = [] # empty temporary list to park masked sequences | 
247 for record in current_list: | 
248 masked_rec = replace_matching_aa_with_dot(record) | 
249 masked_list.append(masked_rec) | 
250 prov_lists[prov] = masked_list #replace original SeqRecord list with masked list | 
251 | 
252 #group sequences in province-sorted list into clades | 
253 for prov in sorted_prov_keys: | 
254 prov_list = prov_lists[prov] | 
255 by_clades_dict = {} #empty dict for clade:seqRecord list groups | 
256 print("\n'%s' List (Amino Acids identical to Reference are Masked): " % (prov)) | 
257 for rec in prov_list: | 
258 clade = extract_clade(rec) | 
259 if clade in by_clades_dict: | 
260 #if clade already in dict as key, append record to list (value) | 
261 by_clades_dict[clade].append(rec) | 
262 else: #add clade as key to dict, value is list of 1 SeqRecord | 
263 by_clades_dict[clade] = [rec] | 
264 #get list of alphabetically sorted clade keys | 
265 sorted_clade_keys = sorted(by_clades_dict.keys()) | 
266 print("\tNumber of clades: ", len(by_clades_dict)) | 
267 #group each list of sequences in clade by sequevars | 
268 for key in sorted_clade_keys: | 
269 print("\n\tCLADE: %s Number of Members: %i" % (key, len(by_clades_dict[key]))) | 
270 a_list = by_clades_dict[key] | 
271 for seqrec in a_list: | 
272 print("\t %s: %s" %(seqrec.id,str(seqrec.seq))) | 
273 #output the list to csv as aggregated linelist | 
274 output_aggregated_linelist(a_list) | 
275 | 
276 print("Aggregated Linelist written to file: '%s'\n" % (outFileHandle)) | 
277 extrAntigMapFile.close() | 
278 refMapFile.close() | 
279 agg_lineListFile.close() | 
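
The listing above groups masked records first by province and then by clade, using explicit `if clade in by_clades_dict` checks. The same two-level grouping can be sketched more compactly with `collections.defaultdict`. This is only an illustration under stated assumptions: the function name `group_by_province_then_clade`, the `(record_id, province, clade)` tuple layout, and the sample identifiers are hypothetical and made up for the example. The script itself works on Biopython SeqRecords and derives province and clade through its own helpers (`sort_by_location`, `extract_clade`).

```python
from collections import defaultdict


def group_by_province_then_clade(records):
    """Group (record_id, province, clade) tuples into nested dicts.

    The tuple layout is an assumption made for this sketch; it stands in
    for the SeqRecord objects handled by the original script.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for record_id, province, clade in records:
        grouped[province][clade].append(record_id)
    # Return plain dicts with provinces and clades in sorted order,
    # mirroring the sorted() calls on the dictionary keys in the listing.
    return {
        prov: {clade: grouped[prov][clade] for clade in sorted(grouped[prov])}
        for prov in sorted(grouped)
    }


# Made-up sample identifiers, for illustration only.
example = [
    ("sample_1", "BC", "3C.2a1"),
    ("sample_2", "ON", "3C.3a"),
    ("sample_3", "BC", "3C.2a1"),
    ("sample_4", "BC", "3C.3a"),
]
result = group_by_province_then_clade(example)
for prov, clades in result.items():
    for clade, ids in clades.items():
        print(prov, clade, len(ids))
```

Using `defaultdict` removes the need for the membership test before appending; whether that is preferable to the explicit branch in the listing is a matter of style.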
