annotate aggregate_linelisting.py @ 0:515c0c885f5d draft default tip
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
author | public-health-bioinformatics |
---|---|
date | Thu, 04 Jul 2019 19:40:13 -0400 |
parents | |
children |
rev | line source |
---|---|
0
515c0c885f5d
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff
changeset
|
#!/usr/bin/env python
'''Reads in a fasta file of antigenic maps and one with the reference antigenic map as
protein SeqRecords. Compares amino acids of sample antigenic maps to corresponding sites
in the reference and masks identical amino acids with dots. Writes headers (including
amino acid position numbers read from the respective index array), the reference amino
acid sequence and column headings required for both non-aggregated and aggregated line lists.
Outputs all headers and modified (i.e. dotted) sample sequences to a csv file.'''

'''Author: Diane Eisler, Molecular Microbiology & Genomics, BCCDC Public Health Laboratory, Jan 2018'''

#NOTE: Bio.Alphabet was removed in Biopython 1.78; these imports require an earlier release
import sys,string,os, time, Bio, re, argparse
from Bio import Seq, SeqIO, SeqUtils, Alphabet, SeqRecord
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq

inputAntigenicMaps = sys.argv[1] #batch fasta file with antigenic map sequences
refAntigenicMap = sys.argv[2] #fasta file of reference antigenic map sequence
antigenicSiteIndexArray = sys.argv[3] #antigenic site index array csv file
cladeDefinitionFile = sys.argv[4] #clade definition csv file
outFileHandle = sys.argv[5] #user-specified output filename
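The script imports `argparse` but reads its five inputs positionally from `sys.argv`. As a hedged sketch only, the same five inputs could be declared with `argparse`. The function name `build_parser`, the argument names, and the file names in the usage line are hypothetical and do not appear in the original script.

```python
import argparse


def build_parser():
    """Declare the five positional inputs the script reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Write an aggregated line list from antigenic map sequences.")
    parser.add_argument("input_antigenic_maps",
                        help="batch fasta file with antigenic map sequences")
    parser.add_argument("ref_antigenic_map",
                        help="fasta file of reference antigenic map sequence")
    parser.add_argument("antigenic_site_index_array",
                        help="antigenic site index array csv file")
    parser.add_argument("clade_definition_file",
                        help="clade definition csv file")
    parser.add_argument("output_file",
                        help="user-specified output filename")
    return parser


# Hypothetical file names, passed explicitly so the example does not touch sys.argv.
args = build_parser().parse_args(
    ["maps.fasta", "ref.fasta", "index.csv", "clades.csv", "out.csv"])
print(args.output_file)  # out.csv
```

Named arguments give a usage message for free when an input is missing, which positional `sys.argv` indexing does not.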

agg_lineListFile = open(outFileHandle,'w') #open a writable output file

indicesLine = "" #comma-separated antigenic site positions
cladeList = [] #list of clade names read from clade definition file
ref_seq = "" #reference antigenic map (protein sequence)
seqList = [] #list of aa sequences to compare to reference

BC_list = [] #empty list for BC samples
AB_list = [] #empty list for AB samples
ON_list = [] #empty list for ON samples
QC_list = [] #empty list for QC samples
nonprov_list = [] #empty list for samples not in above 4 provinces
#dictionary for location-separated sequence lists
prov_lists = {'1_BC':BC_list,'2_AB':AB_list,'3_ON':ON_list,'4_QC': QC_list, '5_nonprov': nonprov_list}

def replace_matching_aa_with_dot(record):
    """Compare amino acids in record to reference, mask identical symbols with dots, and return modified record."""
    orig_seq = str(record.seq) #sequence string from SeqRecord
    mod_seq = ""
    #replace only those aa's matching the reference with dots
    for i in range(0, len(orig_seq)):
        if (orig_seq[i] == ref_seq[i]):
            mod_seq = mod_seq + '.'
        else:
            mod_seq = mod_seq + orig_seq[i]
    #assign modified sequence to new SeqRecord and return it
    rec = SeqRecord(Seq(mod_seq,IUPAC.protein), id = record.id, name = "", description = "")
    return rec
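A minimal sketch of the masking step in `replace_matching_aa_with_dot`, using plain strings instead of SeqRecords so that it runs without Biopython. The function name `mask_matches` and the example sequences are made up for illustration. Unlike the original, the sketch takes the reference as a parameter instead of reading the global `ref_seq`, and it assumes both sequences have the same length.

```python
def mask_matches(sample, reference):
    """Return sample with every residue that matches reference replaced by '.'."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must be the same length")
    # Walk both sequences in step and keep only the residues that differ.
    return "".join("." if s == r else s for s, r in zip(sample, reference))


# Hypothetical 8-residue antigenic maps that differ at position 5 only.
masked = mask_matches("KTSSDRQK", "KTSSNRQK")
print(masked)  # ....D...
```

Passing the reference explicitly makes the function testable on its own; the length check is an added safeguard, since indexing the reference by sample position fails when the reference is shorter.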

def extract_clade(record):
    """Extract clade name (or 'No_Match') from sequence name and return as clade name. """
    if record.id.endswith('No_Match'):
        clade_name = 'No_Match'
    else:
        #NOTE: clade_name stays unassigned (UnboundLocalError at the return) if no clade in cladeList matches
        for clade in cladeList:
            if record.id.endswith(clade):
                clade_name = clade
    return clade_name

def extract_sample_name(record, clade):
    """Extract sample name from sequence name and return sample name. """
    end_index = record.id.index(clade)
    sample_name = record.id[:end_index -1]
    #return sample name as sequence name minus underscore and clade name
    return sample_name
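The two functions above assume sequence names of the form `<sample>_<clade>`. The sketch below shows one way to split such a name in a single pass, under stated assumptions: the function name `split_sequence_name`, the clade names, and the sample ids are hypothetical, and the sketch matches on `"_" + clade` and tries longer clade names first, which is a deliberate variation rather than the original logic.

```python
def split_sequence_name(seq_id, clades):
    """Split '<sample>_<clade>' into (sample_name, clade_name).

    Returns (seq_id, None) when no known clade suffix is present.
    """
    candidates = list(clades) + ["No_Match"]
    # Longest names first, so a clade that is a suffix of another cannot win by accident.
    for clade in sorted(candidates, key=len, reverse=True):
        suffix = "_" + clade
        if seq_id.endswith(suffix):
            return seq_id[:-len(suffix)], clade
    return seq_id, None


clades = ["3C.2a", "3C.2a1"]  # hypothetical clade names
print(split_sequence_name("A/British_Columbia/001/2018_3C.2a1", clades))
# ('A/British_Columbia/001/2018', '3C.2a1')
print(split_sequence_name("sample-BC-01_No_Match", clades))
# ('sample-BC-01', 'No_Match')
```

Slicing from the end avoids a wrong split when the clade string also occurs earlier in the name, and returning `None` makes the no-match case explicit.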

def sort_by_location(record):
    """Search sequence name for province name or 2-letter province code and add SeqRecord to
    province-specific dictionary."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        BC_list.append(record) #add Sequence record to BC_list
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        AB_list.append(record) #add Sequence record to AB_list
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        ON_list.append(record) #add Sequence record to ON_list
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        QC_list.append(record) #add Sequence record to QC_list
    else:
        nonprov_list.append(record) #add Sequence record to nonprov_list
    return

def extract_province(record):
    """Search sequence name for province name or 2-letter province code and return province."""
    seq_name = record.id
    if ('-BC-' in seq_name) or ('/British_Columbia/' in seq_name):
        province = 'British Columbia'
    elif ('-AB-' in seq_name) or ('/Alberta/' in seq_name):
        province = 'Alberta'
    elif ('-ON-' in seq_name) or ('/Ontario/' in seq_name):
        province = 'Ontario'
    elif ('-QC-' in seq_name) or ('/Quebec/' in seq_name):
        province = 'Quebec'
    else:
        province = "other"
    return province
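`sort_by_location` and `extract_province` both look for the same kind of substring markers in the sequence name. As a hedged sketch, the markers could live in one lookup table so that the two functions cannot drift apart. The names `PROVINCE_MARKERS` and `province_from_name` and the example sequence names are hypothetical.

```python
# Province name paired with the substrings that identify it in a sequence name.
PROVINCE_MARKERS = [
    ("British Columbia", ("-BC-", "/British_Columbia/")),
    ("Alberta", ("-AB-", "/Alberta/")),
    ("Ontario", ("-ON-", "/Ontario/")),
    ("Quebec", ("-QC-", "/Quebec/")),
]


def province_from_name(seq_name):
    """Return the first province whose marker occurs in seq_name, else 'other'."""
    for province, markers in PROVINCE_MARKERS:
        if any(marker in seq_name for marker in markers):
            return province
    return "other"


print(province_from_name("A/Alberta/12/2018_3C.2a"))  # Alberta
print(province_from_name("flu-QC-0042_No_Match"))     # Quebec
print(province_from_name("A/Texas/50/2012"))          # other
```

The table is checked in order, which preserves the precedence of the original `if`/`elif` chain.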

def get_sequence_length(record):
    """Return length of sequence in a SeqRecord."""
    sequenceLength = len(str(record.seq))
    return sequenceLength

def get_antigenic_site_substitutions(record):
    """Count number of non-dotted amino acids in SeqRecord sequence and return as substitutions."""
    sequenceLength = get_sequence_length(record)
    seqString = str(record.seq)
    matches = seqString.count('.')
    substitutions = sequenceLength - matches
    return substitutions

def calculate_percent_id(record, substitutions):
    """Calculate percent sequence identity to reference sequence, based on substitutions
    and sequence length and return percent id as a ratio (i.e. 0.90 not 90%)."""
    sequenceLength = get_sequence_length(record)
    percentID = (1.00 - (float(substitutions)/float(sequenceLength)))
    return percentID
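A small worked example of the two calculations above, on a plain masked string rather than a SeqRecord. The function names `count_substitutions` and `percent_identity` and the masked sequence are invented for illustration, and the sketch assumes Python 3 division and a non-empty sequence.

```python
def count_substitutions(masked):
    """Number of positions not masked with '.', i.e. that differ from the reference."""
    return len(masked) - masked.count(".")


def percent_identity(masked):
    """Identity to the reference as a ratio between 0 and 1 (0.75, not 75%)."""
    return 1.0 - count_substitutions(masked) / len(masked)


masked = "....D.K."  # hypothetical masked map: 8 sites, 2 substitutions
print(count_substitutions(masked))  # 2
print(percent_identity(masked))     # 0.75
```

With 2 substitutions over 8 sites, the identity is 1 - 2/8 = 0.75.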

def output_aggregated_linelist(a_list):
    """Output aggregated line list of SeqRecords in csv format."""
    sequevars = {} #dict of sequevar: SeqRecord list
    firstRecordID = None
    #examine dotted/masked sequences in list and assign unique ones as dict keys
    for rec in a_list:
        rec = replace_matching_aa_with_dot(rec)
        sequence = str(rec.seq)
        #group SeqRecords by their masked sequence
        if sequence in sequevars:
            #if sequence already in dict as a key, append the SeqRecord to its list
            sequevars[sequence].append(rec)
        else:
            #if sequence not in dict, add it as new key with list of 1 SeqRecord
            sequevars[sequence] = [rec]
    #get list of sorted unique sequence keys
    sorted_unique_seq_keys = sorted(sequevars.keys())
    #process each list of SeqRecords sharing a unique sequence
    for u in sorted_unique_seq_keys:
        #access list of sequences by unique sequence
        listOfSeqs = sequevars[u]
        #sort this list of SeqRecords by record.id (i.e. name)
        listOfSeqs = sorted(listOfSeqs, key=lambda x: x.id)
        N = len(listOfSeqs)
        #output details of first SeqRecord to csv
        firstRecord = listOfSeqs[0]
        province = extract_province(firstRecord)
        clade = extract_clade(firstRecord)
        substitutions = get_antigenic_site_substitutions(firstRecord)
        percentID = calculate_percent_id(firstRecord,substitutions)
        name = extract_sample_name(firstRecord, clade)
        name_part = name.rstrip() + ','
        N_part = str(N) + ','
        clade_part = clade + ','
        substitutions_part = str(substitutions) + ','
        percID_part = str(percentID) + ','
        col = " ," #empty column
        sequence = str(firstRecord.seq).strip()
        csv_seq = ",".join(sequence) +","
        comma_sep_output = name_part + N_part + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
        #write first member of unique sequence list to csv
        agg_lineListFile.write(comma_sep_output)
        #print sequence records in sequevar to console
        print("\n\t\t%i SeqRecords matching Sequevar: %s" % (len(listOfSeqs), u))

        #to uncollapse sequevar group, print each member of the sequevar list to csv output
        '''for i in range(1,len(listOfSeqs)):
            currentRec = listOfSeqs[i]
            province = extract_province(currentRec)
            clade = extract_clade(currentRec)
            substitutions = get_antigenic_site_substitutions(currentRec)
            percentID = calculate_percent_id(currentRec,substitutions)
            name_part = (currentRec.id).rstrip() + ','
            N_part = "n/a" + ','
            clade_part = clade + ','
            substitutions_part = str(substitutions) + ','
            percID_part = str(percentID) + ','
            col = " ," #empty column
            sequence = str(currentRec.seq).strip()
            csv_seq = ",".join(sequence) +","
            comma_sep_output = name_part + N_part + clade_part + col + csv_seq + substitutions_part + percID_part + "\n"
            agg_lineListFile.write(comma_sep_output) '''
    return
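The grouping step in output_aggregated_linelist (one dict key per distinct masked sequence, members sorted by id, first member reported with the group size N) can be sketched in isolation. This is an illustration under assumptions, not the script's own code: `Rec` is a hypothetical stand-in for a Biopython SeqRecord, `group_by_sequence` is a name invented here, and `collections.defaultdict` replaces the explicit if/else membership test.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a SeqRecord: only the id and seq attributes are used.
Rec = namedtuple("Rec", ["id", "seq"])


def group_by_sequence(records):
    """Map each distinct masked sequence to its records, sorted by record id."""
    groups = defaultdict(list)
    for rec in records:
        groups[str(rec.seq)].append(rec)
    # Build a plain dict whose keys are inserted in sorted order.
    return {seq: sorted(recs, key=lambda r: r.id)
            for seq, recs in sorted(groups.items())}


# Made-up masked sequences: dots mark residues identical to the reference.
records = [
    Rec("sample_C", "..K.."),
    Rec("sample_A", "....."),
    Rec("sample_B", "..K.."),
]
for seq, members in group_by_sequence(records).items():
    # The first member (by id) represents the group; len(members) is N.
    print(members[0].id, len(members), seq)
```

The sketch relies on dicts preserving insertion order, which holds for Python 3.7 and later; the script instead iterates over an explicitly sorted list of keys.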

with open (antigenicSiteIndexArray,'r') as siteIndices:
    """Read amino acid positions from antigenic site index array and print as header after one empty row."""
    col = "," #empty column
    #read amino acid positions and remove trailing whitespace
    for line in siteIndices:
        #remove whitespace from the end of each line
        indicesLine = line.rstrip()
        row1 = "\n"
        #add comma-separated AA positions to header line
        row2 = col + col + col + col + indicesLine + "\n"
        #write first (empty) and 2nd (amino acid position) lines to output file
        agg_lineListFile.write(row1)
        agg_lineListFile.write(row2)

with open (refAntigenicMap,'r') as refMapFile:
    """Read reference antigenic map from fasta and output amino acids, followed by column headers."""
    #read sequences from fasta to SeqRecord, uppercase, and store sequence string to ref_seq
    record = SeqIO.read(refMapFile,"fasta",alphabet=IUPAC.protein)
    record = record.upper()
    ref_seq = str(record.seq).strip() #store sequence in variable for comparison to sample seqs
    col = "," #empty column
    name_part = (record.id).rstrip() + ','
    sequence = str(record.seq).strip()
    csv_seq = ",".join(sequence)
    #output row with reference sequence displayed above sample sequences
    row3 = name_part + col + col + col + csv_seq + "\n"
    agg_lineListFile.write(row3)
    positions = indicesLine.split(',')
    numPos = len(positions)
    empty_indicesLine = ',' * numPos
    #print column headers for sample sequences
    row4 = "Sequence Name,N,Clade,Extra Substitutions," + empty_indicesLine + "Number of Amino Acid Substitutions in Antigenic Sites,% Identity of Antigenic Site Residues\n"
    agg_lineListFile.write(row4)
    print("\nREFERENCE ANTIGENIC MAP: '%s' (%i amino acids)" % (record.id, len(record)))
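The header layout written above (a row of site positions, a row with the reference residues, then the column titles, each offset by four leading columns) can be sketched with the standard csv module rather than string concatenation. This is only an illustration: `header_rows`, the reference name, the residues and the positions are all hypothetical, and the leading empty row that the script writes first is left out.

```python
import csv
import io


def header_rows(ref_name, ref_seq, positions):
    """Build the position row, reference row and column-title row as lists of cells."""
    lead = ["", "", "", ""]  # columns for name, N, clade and extra substitutions
    position_row = lead + [str(p) for p in positions]
    reference_row = [ref_name, "", "", ""] + list(ref_seq)
    column_row = (["Sequence Name", "N", "Clade", "Extra Substitutions"]
                  + [""] * len(positions)
                  + ["Number of Amino Acid Substitutions in Antigenic Sites",
                     "% Identity of Antigenic Site Residues"])
    return [position_row, reference_row, column_row]


out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
# Hypothetical reference name, residues and antigenic site positions.
writer.writerows(header_rows("ref_map", "KTS", [121, 122, 124]))
print(out.getvalue())
```

Building rows as lists keeps the column alignment visible in one place and leaves quoting of any field that contains a comma to the csv writer.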

with open(cladeDefinitionFile,'r') as cladeFile:
    """Read clade definition file and store clade names in list."""
    #remove whitespace from the end of each line and split elements at commas
    for line in cladeFile:
        elementList = line.rstrip().split(',')
        name = elementList[0] #move 1st element to name field
        cladeList.append(name)
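The clade-name loop can be tried out on in-memory data. The sketch below assumes only what the loop itself shows, namely that each line is comma-separated with the clade name first; the clade names and remaining fields in the example are made up, and `read_clade_names` is a name invented for this illustration. Unlike the loop above, it skips blank lines, which would otherwise be stored as empty names.

```python
import io


def read_clade_names(handle):
    """Return the first comma-separated field of every non-empty line."""
    names = []
    for line in handle:
        fields = line.rstrip().split(",")
        if fields[0]:  # a blank line splits to [''] and is skipped here
            names.append(fields[0])
    return names


# Hypothetical clade definition content: a clade name followed by other fields.
example = io.StringIO("3C.2a,1,K160T\n3C.3a,1,A138S\n\n")
print(read_clade_names(example))
```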

with open(inputAntigenicMaps,'r') as extrAntigMapFile:
    """Read antigenic maps as protein SeqRecords and add to list."""
    #read Sequences from fasta file, uppercase and add to seqList
    for record in SeqIO.parse(extrAntigMapFile, "fasta", alphabet=IUPAC.protein):
        record = record.upper()
        seqList.append(record) #add Seq to list of Sequences
515c0c885f5d
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff
changeset
|
233 |
515c0c885f5d
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff
changeset
|
234 #print number of sequences to be processed as user check |
515c0c885f5d
planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit b96b6e06f6eaa6ae8ef4c24630dbb72a4aed7dbe
public-health-bioinformatics
parents:
diff
changeset
|
235 print("\nCOMPARING %i flu antigenic map sequences to the reference..." % len(seqList)) |
for record in seqList:
    #assign SeqRecords to province-specific dictionaries
    sort_by_location(record)

#access prov segregated lists in order
sorted_prov_keys = sorted(prov_lists.keys())
print("\nSequence Lists Sorted by Province: ")
for prov in sorted_prov_keys:
    current_list = prov_lists[prov]
    #mask AA's identical to reference sequence with dot
    masked_list = [] # empty temporary list to park masked sequences
    for record in current_list:
        masked_rec = replace_matching_aa_with_dot(record)
        masked_list.append(masked_rec)
    prov_lists[prov] = masked_list #replace original SeqRecord list with masked list

#group sequences in province-sorted list into clades
for prov in sorted_prov_keys:
    prov_list = prov_lists[prov]
    by_clades_dict = {} #empty dict for clade:seqRecord list groups
    print("\n'%s' List (Amino Acids identical to Reference are Masked): " % (prov))
    for rec in prov_list:
        clade = extract_clade(rec)
        if clade in by_clades_dict:
            #if clade already in dict as key, append record to list (value)
            by_clades_dict[clade].append(rec)
        else: #add clade as key to dict, value is list of 1 SeqRecord
            by_clades_dict[clade] = [rec]
    #get list of alphabetically sorted clade keys
    sorted_clade_keys = sorted(by_clades_dict.keys())
    print("\tNumber of clades: ", len(by_clades_dict))
    #group each list of sequences in clade by sequevars
    for key in sorted_clade_keys:
        print("\n\tCLADE: %s Number of Members: %i" % (key, len(by_clades_dict[key])))
        a_list = by_clades_dict[key]
        for seqrec in a_list:
            print("\t %s: %s" %(seqrec.id,str(seqrec.seq)))
        #output the list to csv as aggregated linelist
        output_aggregated_linelist(a_list)

#closing quote moved before the newline so it prints next to the file name
print("Aggregated Linelist written to file: '%s'\n" % (outFileHandle))
extrAntigMapFile.close()
refMapFile.close()
agg_lineListFile.close()
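
The clade grouping in the script uses an explicit `if clade in by_clades_dict` check. As a minimal sketch only, the same grouping pattern can be written with `collections.defaultdict`. The helper name `group_by`, the `(id, clade)` tuples, and the clade labels below are hypothetical stand-ins for the script's SeqRecords and `extract_clade`, not part of the original code.

```python
from collections import defaultdict


def group_by(records, key_func):
    """Group records into a dict of lists keyed by key_func(record)."""
    groups = defaultdict(list)
    for rec in records:
        # A missing key starts as an empty list, so no membership test is needed.
        groups[key_func(rec)].append(rec)
    return dict(groups)


# Hypothetical (id, clade) pairs standing in for masked SeqRecords.
records = [("s1", "3C.2a"), ("s2", "3C.3a"), ("s3", "3C.2a")]

# key_func plays the role that extract_clade plays in the script.
by_clade = group_by(records, key_func=lambda rec: rec[1])

# Iterate clades alphabetically, as the script does with sorted_clade_keys.
for clade in sorted(by_clade):
    print(clade, [rec_id for rec_id, _ in by_clade[clade]])
```

Within each group the records stay in input order, which matches the behaviour of appending to a plain dict of lists.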