annotate MDtag_filter.py @ 1:f3d024fd9bf4 draft

Uploaded python script
author boris
date Tue, 24 Apr 2012 12:14:56 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
1
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
1 #!/usr/bin/env python
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
2 # Boris Rebolledo Jaramillo
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
3 # berebolledo@gmail.com
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
4 # v1.0.2
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
5
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
6 """Filter mapped reads on MD tag string.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
7
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
8 This tool reads the MD tag of mapped reads (see SAM format specification).
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
9 The user defines the 5' and 3' windows n and m (in bp), respectively.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
10 The mapped read is discarded if it contains any number of mismatches within n
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
11 bases of the read 5' end and within m bases of the read 3' end.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
12 Option: save discarded reads in an additional SAM file.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
13
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
14 usage: %prog in.sam n m out.sam saveDiscarded?[yes/no] [discarded.sam]
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
15
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
16 """
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
17
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
18 import sys,re
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
19
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
20 if len(sys.argv) < 6:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
21 sys.exit('ERROR! Usage: %prog <in.sam> <n> <m> <out.sam> '
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
22 '<saveDiscarded?[yes/no]> [<discarded.sam>]')
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
23 if sys.argv[5] == 'yes' and len(sys.argv) < 7:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
24 sys.exit('ERROR! Usage: %prog <in.sam> <n> <m> <out.sam> '
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
25 '<saveDiscarded?[yes/no]> [<discarded.sam>]')
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
26
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
27 sam_file=list(open(sys.argv[1]))
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
28 sam_mdfiltered=open(sys.argv[4],'w+')
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
29 if sys.argv[5] == 'yes':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
30 sam_discarded=open(sys.argv[6],'w+')
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
31
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
32
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
33 # The MD tag is generated out of the alignment operations, regardless the strand
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
34 # the read was mapped to. It describes the alignment from the leftmost aligned
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
35 # base, which might be either the 5' or 3' end of the original read.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
36 # Luckily, the read orientation in the alignment can be extracted from the
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
37 # alignment flag.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
38 # The code used here to identify the mapped reads orientation was taken from
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
39 # the 'explain_sam_flags.py' script of the Picard-Tools.
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
40 # It defines the meaning of the sam flags in plain English, so they can be used
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
41 # to identify reads mapped to the reverse strand: flag = 16
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
42 # http://picard-tools.sourcearchive.com/documentation/1.25-1/
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
43
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
44 lstFlags = [
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
45 ("read paired", 0x1),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
46 ("read mapped in proper pair", 0x2),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
47 ("read unmapped", 0x4),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
48 ("mate unmapped", 0x8),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
49 ("read reverse strand", 0x10),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
50 ("mate reverse strand", 0x20),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
51 ("first in pair", 0x40),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
52 ("second in pair", 0x80),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
53 ("not primary alignment", 0x100),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
54 ("read fails platform/vendor quality checks", 0x200),
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
55 ("read is PCR or optical duplicate", 0x400)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
56 ]
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
57
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
58 for line in sam_file:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
59 if line.split("\t")[0].startswith('@'):
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
60 if sys.argv[5] == 'yes':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
61 sam_discarded.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
62 sam_mdfiltered.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
63 else:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
64 sam_mdfiltered.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
65 else:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
66 if len(line.split("\t")) > 11: # Any SAM optional fields?
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
67 for field in line.split("\t")[11:]:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
68 if re.match('MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*', field): # is MD tag field present?
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
69 flag = int(line.split("\t")[1])
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
70 for strFlagName, iMask in lstFlags:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
71 if flag & iMask:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
72 if iMask == 16:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
73 strand = 'reverse'
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
74 break
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
75 else:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
76 strand = 'forward'
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
77 else: # flag = 0 is not part of the definitions,
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
78 # but it corresponds to forward strand mapping
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
79 strand = 'forward'
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
80 # The position of the optional fields in the SAM format is variable. Finds the location of the MD tag:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
81 mdtag_idx = [i for i, item in enumerate(line.split("\t")[11:]) if re.match('MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*', item)][0]
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
82 mdtag=line.split("\t")[mdtag_idx+11].strip().split(":")[2]
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
83 md_list = re.split(r'(\D)',mdtag)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
84 if strand == 'forward':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
85 if int(md_list[0]) >= int(sys.argv[2]) and int(md_list[-1]) >= int(sys.argv[3]):
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
86 sam_mdfiltered.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
87 else:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
88 if sys.argv[5] == 'yes':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
89 sam_discarded.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
90 elif strand == 'reverse':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
91 if int(md_list[0]) >= int(sys.argv[3]) and int(md_list[-1]) >= int(sys.argv[2]):
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
92 sam_mdfiltered.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
93 else:
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
94 if sys.argv[5] == 'yes':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
95 sam_discarded.write(line)
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
96 break
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
97
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
98 sam_mdfiltered.close()
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
99
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
100 if sys.argv[5] == 'yes':
f3d024fd9bf4 Uploaded python script
boris
parents:
diff changeset
101 sam_discarded.close()