annotate tools/fastq/fastq_paired_unpaired.py @ 3:528ba9c896e0 draft

Uploaded v0.0.8, MIT licence and reST for README, citation information, development moved to GitHub
author peterjc
date Wed, 18 Sep 2013 06:13:27 -0400
parents 95a632a71951
children
rev   line source
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
2 """Divides a FASTQ into paired and single (orphan reads) as separate files.
3
4 The input file should be a valid FASTQ file which has been sorted so that
5 any partner forward+reverse reads are consecutive. The output files all
6 preserve this sort order. Pairs are recognised based on standard name
7 suffixes. See below or run the tool with no arguments for more details.
8
9 Note that the FASTQ variant is unimportant (Sanger, Solexa, Illumina, or even
10 Color Space should all work equally well).
11
3
528ba9c896e0 Uploaded v0.0.8, MIT licence and reST for README, citation information, development moved to GitHub
peterjc
parents: 2
diff changeset
12 This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
13 (formerly SCRI), Scotland, UK. All rights reserved.
14
3
528ba9c896e0 Uploaded v0.0.8, MIT licence and reST for README, citation information, development moved to GitHub
peterjc
parents: 2
diff changeset
15 See accompanying text file for licence details (MIT licence).
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
16 """
17 import os
18 import sys
19 import re
20 from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
21
2
95a632a71951 Uploaded v0.0.6, adds unit test
peterjc
parents: 1
diff changeset
22 if "-v" in sys.argv or "--version" in sys.argv:
3
528ba9c896e0 Uploaded v0.0.8, MIT licence and reST for README, citation information, development moved to GitHub
peterjc
parents: 2
diff changeset
23     print "Version 0.0.8"
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
24     sys.exit(0)
25
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
26 def stop_err(msg, err=1):
27     sys.stderr.write(msg.rstrip() + "\n")
28     sys.exit(err)
29
30 msg = """Expect either 4 or 5 arguments: the FASTQ variant, then FASTQ filenames.
31
32 If you want two output files, use four arguments:
33 - FASTQ variant (e.g. sanger, solexa, illumina or cssanger)
34 - Sorted input FASTQ filename,
35 - Output paired FASTQ filename (forward then reverse interleaved),
36 - Output singles FASTQ filename (orphan reads)
37
38 If you want three output files, use five arguments:
39 - FASTQ variant (e.g. sanger, solexa, illumina or cssanger)
40 - Sorted input FASTQ filename,
41 - Output forward paired FASTQ filename,
42 - Output reverse paired FASTQ filename,
43 - Output singles FASTQ filename (orphan reads)
44
45 The input file should be a valid FASTQ file which has been sorted so that
46 any partner forward+reverse reads are consecutive. The output files all
47 preserve this sort order.
48
49 Any reads where the forward/reverse naming suffix used is not recognised
50 are treated as orphan reads. The tool supports the /1 and /2 convention
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
51 originally used by Illumina, the .f and .r convention, and the Sanger
52 convention (see http://staden.sourceforge.net/manual/pregap4_unix_50.html
53 for details), and the new Illumina convention where the reads have the
54 same identifier with the fragment at the start of the description, e.g.
55
56 @HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA
57 @HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 2:N:0:TGNCCA
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
58
59 Note that this does support multiple forward and reverse reads per template
60 (which is quite common with Sanger sequencing), e.g. this which is sorted
61 alphabetically:
62
63 WTSI_1055_4p17.p1kapIBF
64 WTSI_1055_4p17.p1kpIBF
65 WTSI_1055_4p17.q1kapIBR
66 WTSI_1055_4p17.q1kpIBR
67
68 or this where the reads already come in pairs:
69
70 WTSI_1055_4p17.p1kapIBF
71 WTSI_1055_4p17.q1kapIBR
72 WTSI_1055_4p17.p1kpIBF
73 WTSI_1055_4p17.q1kpIBR
74
75 both become:
76
77 WTSI_1055_4p17.p1kapIBF paired with WTSI_1055_4p17.q1kapIBR
78 WTSI_1055_4p17.p1kpIBF paired with WTSI_1055_4p17.q1kpIBR
79 """
80
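The usage message says that both example orderings end up as the same two pairs. A minimal sketch of that buffering idea, working on bare read names rather than FASTQ records, might look like the following. The helper `pair_names` and its return shape are hypothetical; the two patterns are copied from the script's `re_f` and `re_r`; and the handling of unrecognised names and of reads still buffered at the end of the input is an assumption based on the usage text, not a copy of the script.

```python
import re

# Suffix patterns copied from the script's re_f / re_r definitions.
RE_F = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
RE_R = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")


def pair_names(names):
    """Return (pairs, singles) for names sorted so partners are adjacent.

    Hypothetical helper for illustration only.
    """
    pairs, singles = [], []
    last_template, buffered = None, []
    for name in names:
        fwd = RE_F.search(name)
        rev = None if fwd else RE_R.search(name)
        if fwd:
            template = name[:fwd.start()]
            if template == last_template:
                buffered.append(name)
            else:
                # Forward reads left over from an earlier template are orphans.
                singles.extend(buffered)
                buffered = [name]
                last_template = template
        elif rev and name[:rev.start()] == last_template and buffered:
            # Pair this reverse read with the oldest buffered forward read.
            pairs.append((buffered.pop(0), name))
        else:
            # Assumption: an unmatched reverse read or an unrecognised name
            # is an orphan, and so is anything still waiting in the buffer.
            singles.extend(buffered)
            singles.append(name)
            buffered, last_template = [], None
    # Assumption: reads still buffered at end of input are orphans.
    singles.extend(buffered)
    return pairs, singles


sorted_names = [
    "WTSI_1055_4p17.p1kapIBF",
    "WTSI_1055_4p17.p1kpIBF",
    "WTSI_1055_4p17.q1kapIBR",
    "WTSI_1055_4p17.q1kpIBR",
]
already_paired = [sorted_names[0], sorted_names[2], sorted_names[1], sorted_names[3]]
print(pair_names(sorted_names)[0] == pair_names(already_paired)[0])
```

Popping from the front of the buffer is what makes the alphabetically sorted input and the pre-paired input agree: the first forward read seen is matched with the first reverse read seen.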
81 if len(sys.argv) == 5:
82     format, input_fastq, pairs_fastq, singles_fastq = sys.argv[1:]
83 elif len(sys.argv) == 6:
84     pairs_fastq = None
85     format, input_fastq, pairs_f_fastq, pairs_r_fastq, singles_fastq = sys.argv[1:]
86 else:
87     stop_err(msg)
88
89 format = format.replace("fastq", "").lower()
90 if not format:
91     format="sanger" #safe default
92 elif format not in ["sanger","solexa","illumina","cssanger"]:
93     stop_err("Unrecognised format %s" % format)
94
95 def f_match(name):
96     if name.endswith("/1") or name.endswith(".f"):
97         return True
98
99 #Cope with three widely used suffix naming conventions,
100 #Illumina: /1 or /2
101 #Forward/reverse: .f or .r
102 #Sanger, e.g. .p1k and .q1k
103 #See http://staden.sourceforge.net/manual/pregap4_unix_50.html
104 re_f = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
105 re_r = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
106
107 assert re_f.search("demo/1")
108 assert re_f.search("demo.f")
109 assert re_f.search("demo.s1")
110 assert re_f.search("demo.f1k")
111 assert re_f.search("demo.p1")
112 assert re_f.search("demo.p1k")
113 assert re_f.search("demo.p1lk")
114 assert re_r.search("demo/2")
115 assert re_r.search("demo.r")
116 assert re_r.search("demo.q1")
117 assert re_r.search("demo.q1lk")
118 assert not re_r.search("demo/1")
119 assert not re_r.search("demo.f")
120 assert not re_r.search("demo.p")
121 assert not re_f.search("demo/2")
122 assert not re_f.search("demo.r")
123 assert not re_f.search("demo.q")
124
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
125 re_illumina_f = re.compile(r"^@[a-zA-Z0-9_:-]+ 1:.*$")
126 re_illumina_r = re.compile(r"^@[a-zA-Z0-9_:-]+ 2:.*$")
127 assert re_illumina_f.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA")
128 assert re_illumina_r.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 2:N:0:TGNCCA")
129 assert not re_illumina_f.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 2:N:0:TGNCCA")
130 assert not re_illumina_r.match("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA")
131
132
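The script recognises two kinds of naming: a suffix on the read name itself, and the Illumina 1.8 style where the direction sits at the start of the description. As a rough illustration of how one identifier line could be reduced to a template name and a direction, here is a small sketch. The function `classify` and its return values are hypothetical names chosen for this example; the four patterns are copied from the script, and the leading `@` is kept because the script's own assertions expect the Galaxy parser to include it in the identifier.

```python
import re

# Patterns copied from the script (re_f, re_r, re_illumina_f, re_illumina_r).
RE_F = re.compile(r"(/1|\.f|\.[sfp]\d\w*)$")
RE_R = re.compile(r"(/2|\.r|\.[rq]\d\w*)$")
RE_ILLUMINA_F = re.compile(r"^@[a-zA-Z0-9_:-]+ 1:.*$")
RE_ILLUMINA_R = re.compile(r"^@[a-zA-Z0-9_:-]+ 2:.*$")


def classify(identifier):
    """Return (template, direction) for a FASTQ title line.

    direction is "forward", "reverse" or None (hypothetical labels).
    """
    # The name is everything before the first whitespace.
    name = identifier.split(None, 1)[0]
    checks = (
        (RE_F, RE_ILLUMINA_F, "forward"),
        (RE_R, RE_ILLUMINA_R, "reverse"),
    )
    for suffix_re, illumina_re, direction in checks:
        suffix = suffix_re.search(name)
        if suffix:
            # Suffix style: the template is the name with the suffix removed.
            return name[:suffix.start()], direction
        if illumina_re.match(identifier):
            # Illumina 1.8 style: the whole name is the template.
            return name, direction
    return name, None


print(classify("@HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 1:N:0:TGNCCA"))
print(classify("@WTSI_1055_4p17.q1kapIBR"))
```

Note that the suffix patterns use `search` with a trailing `$` anchor, while the Illumina patterns use `match` against the full identifier line, since the direction is in the description rather than in the name.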
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
133 count, forward, reverse, neither, pairs, singles = 0, 0, 0, 0, 0, 0
134 in_handle = open(input_fastq)
135 if pairs_fastq:
136     pairs_f_writer = fastqWriter(open(pairs_fastq, "w"), format)
137     pairs_r_writer = pairs_f_writer
138 else:
139     pairs_f_writer = fastqWriter(open(pairs_f_fastq, "w"), format)
140     pairs_r_writer = fastqWriter(open(pairs_r_fastq, "w"), format)
141 singles_writer = fastqWriter(open(singles_fastq, "w"), format)
142 last_template, buffered_reads = None, []
143
144 for record in fastqReader(in_handle, format):
145     count += 1
146     name = record.identifier.split(None,1)[0]
147     assert name[0]=="@", record.identifier #Quirk of the Galaxy parser
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
148     is_forward = False
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
149     suffix = re_f.search(name)
150     if suffix:
151         #============
152         #Forward read
153         #============
154         template = name[:suffix.start()]
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
155         is_forward = True
156     elif re_illumina_f.match(record.identifier):
157         template = name #No suffix
158         is_forward = True
159     if is_forward:
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
160         #print name, "forward", template
161         forward += 1
162         if last_template == template:
163             buffered_reads.append(record)
164         else:
165             #Any old buffered reads are orphans
166             for old in buffered_reads:
167                 singles_writer.write(old)
168                 singles += 1
169             #Save this read in buffer
170             buffered_reads = [record]
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
171             last_template = template
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
172     else:
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
173         is_reverse = False
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
174         suffix = re_r.search(name)
175         if suffix:
176             #============
177             #Reverse read
178             #============
179             template = name[:suffix.start()]
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
180             is_reverse = True
181         elif re_illumina_r.match(record.identifier):
182             template = name #No suffix
183             is_reverse = True
184         if is_reverse:
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
185             #print name, "reverse", template
186             reverse += 1
187             if last_template == template and buffered_reads:
188                 #We have a pair!
189                 #If there are multiple buffered forward reads, want to pick
190                 #the first one (although we could try and do something more
191                 #clever looking at the suffix to match them up...)
192                 old = buffered_reads.pop(0)
193                 pairs_f_writer.write(old)
194                 pairs_r_writer.write(record)
195                 pairs += 2
196             else:
197                 #As this is a reverse read, this and any buffered read(s) are
198                 #all orphans
199                 for old in buffered_reads:
200                     singles_writer.write(old)
201                     singles += 1
202                 buffered_reads = []
203                 singles_writer.write(record)
204                 singles += 1
205                 last_template = None
206         else:
207             #===========================
208             #Neither forward nor reverse
209             #===========================
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
210 singles_writer.write(record)
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
211 singles += 1
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
212 neither += 1
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
213 for old in buffered_reads:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
214 singles_writer.write(old)
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
215 singles += 1
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
216 buffered_reads = []
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
217 last_template = None
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
218 if last_template:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
219 #Left over singles...
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
220 for old in buffered_reads:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
221 singles_writer.write(old)
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
222 singles += 1
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
223 in_handle.close()
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
224 singles_writer.close()
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
225 if pairs_fastq:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
226 pairs_f_writer.close()
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
227 assert pairs_r_writer.file.closed
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
228 else:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
229 pairs_f_writer.close()
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
230 pairs_r_writer.close()
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
231
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
232 if neither:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
233 print "%i reads (%i forward, %i reverse, %i neither), %i in pairs, %i as singles" \
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
234 % (count, forward, reverse, neither, pairs, singles)
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
235 else:
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
236 print "%i reads (%i forward, %i reverse), %i in pairs, %i as singles" \
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
237 % (count, forward, reverse, pairs, singles)
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
238
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
239 assert count == pairs + singles == forward + reverse + neither, \
1
7ed81e36fc1c Uploaded v0.0.5 which handles Illumina 1.8 style pair naming.
peterjc
parents: 0
diff changeset
240 "%i vs %i+%i=%i vs %i+%i+%i=%i" \
0
72e9fcaec61f Migrated tool version 0.0.4 from old tool shed archive to new tool shed repository
peterjc
parents:
diff changeset
241 % (count, pairs, singles, pairs + singles, forward, reverse, neither, forward + reverse + neither)
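
The buffering logic above can be hard to follow through the per-line annotations, so here is a minimal sketch of the same idea working on read names only. It is an illustration under stated assumptions, not the script's own code: `split_reads` is a hypothetical helper, only the `/1` and `/2` suffixes are recognised (the real tool accepts more naming styles), and the forward-read branch is assumed because it is not shown in this part of the listing. The reverse and "neither" branches mirror the order of operations in the source lines above.

```python
def split_reads(names):
    """Split sorted read names into (pairs, singles).

    Hypothetical helper for illustration; assumes only /1 and /2 suffixes.
    """
    pairs, singles = [], []
    buffered = []         # forward reads still waiting for a partner
    last_template = None  # template name of the buffered forward reads
    for name in names:
        if name.endswith("/1"):
            # Forward read (assumed branch): buffer it, flushing any
            # older buffered reads from a different template as orphans.
            template = name[:-2]
            if last_template == template:
                buffered.append(name)
            else:
                singles.extend(buffered)
                buffered = [name]
                last_template = template
        elif name.endswith("/2"):
            template = name[:-2]
            if last_template == template and buffered:
                # We have a pair: take the first buffered forward read.
                pairs.append((buffered.pop(0), name))
            else:
                # This reverse read and any buffered reads are orphans.
                singles.extend(buffered)
                buffered = []
                singles.append(name)
                last_template = None
        else:
            # Neither forward nor reverse: the record is written first,
            # then the buffered reads, as in the listing above.
            singles.append(name)
            singles.extend(buffered)
            buffered = []
            last_template = None
    # Left over buffered forward reads are singles.
    singles.extend(buffered)
    return pairs, singles


if __name__ == "__main__":
    names = ["a/1", "a/2", "b/1", "c/2", "d", "e/1"]
    pairs, singles = split_reads(names)
    # Same invariant as the final assert in the script:
    # every read ends up either in a pair or as a single.
    assert len(names) == 2 * len(pairs) + len(singles)
```

Returning lists instead of writing to file handles keeps the sketch easy to test. The real script streams records to its writers, so memory use stays bounded by the number of buffered forward reads.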