annotate tools/sr_mapping/lastz_paired_reads_wrapper.py @ 0:9071e359b9a3

Uploaded
author xuebing
date Fri, 09 Mar 2012 19:37:19 -0500
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
1 #!/usr/bin/env python
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
2
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
3 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
4 Runs Lastz paired read alignment process
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
5 Written for Lastz v. 1.02.00.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
6
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
7 # Author(s): based on various scripts written by Bob Harris (rsharris@bx.psu.edu),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
8 # then tweaked to this form by Greg Von Kuster (greg@bx.psu.edu)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
9
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
10 This tool takes the following input:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
11 a. A collection of 454 paired end reads ( a fasta file )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
12 b. A linker sequence ( a very small fasta file )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
13 c. A reference genome ( nob, 2bit or fasta )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
14
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
15 and uses the following process:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
16 1. Split reads into mates: the input to this step is the read file XXX.fasta, and the output is three
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
17 files; XXX.short.fasta, XXX.long.fasta and XXX.mapping. The mapping file records the information necessary
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
18 to convert mate coordinates back into the original read, which is needed later in the process.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
19
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
20 2. Align short mates to the reference: this runs lastz against every chromosome. The input is XXX.short.fasta
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
21 and the reference genome, and the output is a SAM file, XXX.short.sam.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
22
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
23 3. Align long mates to the reference: this runs lastz against every chromosome. The input is XXX.long.fasta
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
24 and the reference genome, and the output is a SAM file, XXX.long.sam.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
25
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
26 4. Combine, and convert mate coordinates back to read coordinates. The input is XXX.mapping, XXX.short.sam and
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
27 XXX.long.sam, and the output is XXX.sam.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
28
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
29 usage: lastz_paired_reads_wrapper.py [options]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
30 --ref_name: The reference name to change all output matches to
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
31 --ref_source: The reference is cached or from the history
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
32 --source_select: Use pre-set or cached reference file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
33 --input1: The name of the reference file if using history or reference base name if using cached
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
34 --input2: The reads file to align
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
35 --input3: The sequencing linker file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
36 --input4: The base quality score 454 file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
37 --ref_sequences: The number of sequences in the reference file if using one from history
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
38 --output: The name of the output file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
39 --lastz_seqs_file_dir: Directory of local lastz_seqs.loc file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
40 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
41 import optparse, os, subprocess, shutil, sys, tempfile, time
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
42 from string import maketrans
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
43
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
44 from galaxy import eggs
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
45 import pkg_resources
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
46 pkg_resources.require( 'bx-python' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
47 from bx.seq.twobit import *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
48 from bx.seq.fasta import FastaReader
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
49 from galaxy.util.bunch import Bunch
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
50 from galaxy.util import string_as_bool
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
51
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
52 # Column indexes for SAM required fields
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
53 SAM_QNAME_COLUMN = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
54 SAM_FLAG_COLUMN = 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
55 SAM_RNAME_COLUMN = 2
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
56 SAM_POS_COLUMN = 3
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
57 SAM_MAPQ_COLUMN = 4
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
58 SAM_CIGAR_COLUMN = 5
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
59 SAM_MRNM_COLUMN = 6
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
60 SAM_MPOS_COLUMN = 7
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
61 SAM_ISIZE_COLUMN = 8
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
62 SAM_SEQ_COLUMN = 9
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
63 SAM_QUAL_COLUMN = 10
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
64 SAM_MIN_COLUMNS = 11
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
65 # SAM bit-encoded flags
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
66 BAM_FPAIRED = 1 # the read is paired in sequencing, no matter whether it is mapped in a pair
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
67 BAM_FPROPER_PAIR = 2 # the read is mapped in a proper pair
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
68 BAM_FUNMAP = 4 # the read itself is unmapped; conflictive with BAM_FPROPER_PAIR
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
69 BAM_FMUNMAP = 8 # the mate is unmapped
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
70 BAM_FREVERSE = 16 # the read is mapped to the reverse strand
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
71 BAM_FMREVERSE = 32 # the mate is mapped to the reverse strand
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
72 BAM_FREAD1 = 64 # this is read1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
73 BAM_FREAD2 = 128 # this is read2
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
74 BAM_FSECONDARY = 256 # not primary alignment
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
75 BAM_FQCFAIL = 512 # QC failure
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
76 BAM_FDUP = 1024 # optical or PCR duplicate
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
77
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
78 # Keep track of all created temporary files so they can be deleted
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
79 global tmp_file_names
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
80 tmp_file_names = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
81 # The values in the skipped_lines dict are tuples consisting of:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
82 # - the number of skipped lines for that error
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
83 # If not a sequence error:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
84 # - the 1st line number on which the error was found
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
85 # - the text of the 1st line on which the error was found
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
86 # If a sequence error:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
87 # - The number of the sequence in the file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
88 # - the sequence name on which the error occurred
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
89 # We may need to improve dealing with file position and text as
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
90 # much of it comes from temporary files that are created from the
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
91 # inputs, and not the inputs themselves, so this could be confusing
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
92 # to the user.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
93 global skipped_lines
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
94 skipped_lines = dict( bad_interval=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
95 inconsistent_read_lengths=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
96 inconsistent_reads=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
97 inconsistent_sizes=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
98 missing_mate=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
99 missing_quals=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
100 missing_seq=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
101 multiple_seqs=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
102 no_header=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
103 num_fields=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
104 reads_paired=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
105 sam_flag=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
106 sam_headers=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
107 sam_min_columns=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
108 two_mate_names=( 0, 0, '' ),
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
109 wrong_seq_len=( 0, 0, '' ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
110 global total_skipped_lines
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
111 total_skipped_lines = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
112
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
113 def stop_err( msg ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
114 sys.stderr.write( "%s" % msg )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
115 sys.exit()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
116
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
117 def skip_line( error_key, position, text ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
118 if not skipped_lines[ error_key ][2]:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
119 skipped_lines[ error_key ][1] = position
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
120 skipped_lines[ error_key ][2] = text
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
121 skipped_lines[ error_key ][0] += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
122 total_skipped_lines += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
123
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
124 def get_tmp_file_name( dir=None, suffix=None ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
125 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
126 Return a unique temporary file name that can be managed. The
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
127 file must be manually removed after use.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
128 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
129 if dir and suffix:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
130 tmp_fd, tmp_name = tempfile.mkstemp( dir=dir, suffix=suffix )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
131 elif dir:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
132 tmp_fd, tmp_name = tempfile.mkstemp( dir=dir )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
133 elif suffix:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
134 tmp_fd, tmp_name = tempfile.mkstemp( suffix=suffix )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
135 os.close( tmp_fd )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
136 tmp_file_names.append( tmp_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
137 return tmp_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
138
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
139 def run_command( command ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
140 proc = subprocess.Popen( args=command, shell=True, stderr=subprocess.PIPE, )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
141 proc.wait()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
142 stderr = proc.stderr.read()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
143 proc.wait()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
144 if stderr:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
145 stop_err( stderr )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
146
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
147 def split_paired_reads( input2, combined_linker_file_name ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
148 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
149 Given a fasta file of allegedly paired end reads ( input2 ), and a list of intervals
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
150 showing where the linker is on each read ( combined_linker_file_name ), split the reads into left and right
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
151 halves.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
152
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
153 The input intervals look like this. Note that they may include multiple intervals for the same read
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
154 ( which should overlap ), and we use the union of them as the linker interval. Non-overlaps are
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
155 reported to the user, and those reads are not processed. Starts are origin zero.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
156
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
157 #name strand start len size
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
158 FG3OYDA05FTEES + 219 42 283
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
159 FG3OYDA05FVOLL + 263 41 416
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
160 FG3OYDA05FFL7J + 81 42 421
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
161 FG3OYDA05FOQWE + 55 42 332
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
162 FG3OYDA05FV4DW + 297 42 388
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
163 FG3OYDA05FWAQV + 325 42 419
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
164 FG3OYDA05FVLGA + 90 42 367
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
165 FG3OYDA05FWJ71 + 58 42 276
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
166
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
167 The output gives each half-sequence on a separate line, like this. This allows easy sorting of the
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
168 sequences by length, after the fact.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
169
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
170 219 FG3OYDA05FTEES_L TTTAGTTACACTTAACTCACTTCCATCCTCTAAATACGTGATTACCTTTC...
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
171 22 FG3OYDA05FTEES_R CCTTCCTTAAGTCCTAAAACTG
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
172 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
173 # Bob says these should be hard-coded.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
174 seq_len_lower_threshold = 17
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
175 short_mate_cutoff = 50
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
176 # We need to pass the name of this file back to the caller.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
177 tmp_mates_file_name = get_tmp_file_name( suffix='mates.txt' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
178 mates_file = file( tmp_mates_file_name, "w+b" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
179 # Read the linker intervals
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
180 combined_linker_file = file( combined_linker_file_name, "rb" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
181 read_to_linker_dict = {}
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
182 i = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
183 for i, line in enumerate( combined_linker_file ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
184 line = line.strip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
185 if line.startswith( "#" ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
186 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
187 if line.find( '#' ) >= 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
188 line = line.split( "#", 1 )[0].rstrip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
189 fields = line.split()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
190 if len( fields ) != 4:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
191 skip_line( 'num_fields', i+1, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
192 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
193 name, start, length, size = fields
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
194 start = int( start )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
195 length = int( length )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
196 size = int( size )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
197 end = start + length
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
198 if end > size:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
199 skip_line[ 'bad_interval' ] += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
200 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
201 if name not in read_to_linker_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
202 read_to_linker_dict[ name ] = ( start, end, size )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
203 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
204 if read_to_linker_dict[ name ] == None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
205 # Read previously marked as non-overlapping intervals, so skip this sequence - see below
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
206 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
207 ( s, e, sz ) = read_to_linker_dict[ name ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
208 if sz != size:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
209 skip_line( 'inconsistent_sizes', i+1, name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
210 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
211 if s > end or e < start:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
212 # Non-overlapping intervals, so skip this sequence
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
213 read_to_linker_dict[ name ] = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
214 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
215 read_to_linker_dict[ name ] = ( min( s, start ), max( e, end ), size )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
216 combined_linker_file.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
217 # We need to pass the name of this file back to the caller.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
218 tmp_mates_mapping_file_name = get_tmp_file_name( suffix='mates.mapping' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
219 mates_mapping_file = file( tmp_mates_mapping_file_name, 'w+b' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
220 # Process the sequences
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
221 seqs = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
222 fasta_reader = FastaReader( file( input2, 'rb' ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
223 while True:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
224 seq = fasta_reader.next()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
225 if not seq:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
226 break
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
227 seqs += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
228 if seq.name not in read_to_linker_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
229 if seq.length > seq_len_lower_threshold:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
230 mates_file.write( "%-3d %s %s\n" % ( seq.length, seq.name, seq.text ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
231 read_to_linker_dict[ seq.name ] = ""
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
232 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
233 if read_to_linker_dict[ seq.name ] == "":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
234 skip_line( 'multiple_seqs', seqs, seq.name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
235 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
236 if read_to_linker_dict[ seq.name ] == None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
237 # Read previously marked as non-overlapping intervals, so skip this sequence - see above
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
238 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
239 ( start, end, size ) = read_to_linker_dict[ seq.name ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
240 if seq.length != size:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
241 skip_line( 'wrong_seq_len', seqs, seq.name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
242 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
243 left = seq.text[ :start ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
244 right = seq.text[ end: ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
245 left_is_small = len( left ) <= seq_len_lower_threshold
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
246 right_is_small = len( right ) <= seq_len_lower_threshold
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
247 if left_is_small and right_is_small:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
248 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
249 if not left_is_small:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
250 mates_file.write( "%-3d %s %s\n" % ( len( left ), seq.name + "_L", left ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
251 mates_mapping_file.write( "%s %s %s %s\n" % ( seq.name + "_L", seq.name, 0, size - start ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
252 if not right_is_small:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
253 mates_file.write( "%-3d %s %s\n" % ( len( right ), seq.name + "_R", right ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
254 mates_mapping_file.write( "%s %s %s %s\n" % ( seq.name + "_R", seq.name, end, 0 ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
255 read_to_linker_dict[ seq.name ] = ""
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
256 combined_linker_file.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
257 mates_file.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
258 mates_mapping_file.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
259 # Create temporary files for short and long mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
260 tmp_mates_short_file_name = get_tmp_file_name( suffix='mates.short' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
261 tmp_mates_long_file_name = get_tmp_file_name( suffix='mates.long' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
262 tmp_mates_short = open( tmp_mates_short_file_name, 'w+b' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
263 tmp_mates_long = open( tmp_mates_long_file_name, 'w+b' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
264 i = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
265 for i, line in enumerate( file( tmp_mates_file_name, 'rb' ) ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
266 fields = line.split()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
267 seq_len = int( fields[0] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
268 seq_name = fields[1]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
269 seq_text = fields[2]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
270 if seq_len <= short_mate_cutoff:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
271 tmp_mates_short.write( ">%s\n%s\n" % ( seq_name, seq_text ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
272 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
273 tmp_mates_long.write( ">%s\n%s\n" % ( seq_name, seq_text ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
274 tmp_mates_short.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
275 tmp_mates_long.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
276 return tmp_mates_mapping_file_name, tmp_mates_file_name, tmp_mates_short_file_name, tmp_mates_long_file_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
277
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
278 def align_mates( input1, ref_source, ref_name, ref_sequences, tmp_mates_short_file_name, tmp_mates_long_file_name ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
279 tmp_align_file_names = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
280 if ref_source == 'history':
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
281 # Reference is a fasta dataset from the history
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
282 # Create temporary files to contain the output from lastz executions
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
283 tmp_short_file_name = get_tmp_file_name( suffix='short_out' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
284 tmp_align_file_names.append( tmp_short_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
285 tmp_long_file_name = get_tmp_file_name( suffix='long_out' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
286 tmp_align_file_names.append( tmp_long_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
287 seqs = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
288 fasta_reader = FastaReader( open( input1 ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
289 while True:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
290 # Read the next sequence from the reference dataset. Note that if the reference contains
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
291 # a small number of chromosomes this loop is ok, but in many cases the genome has a bunch
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
292 # of small straggler scaffolds and contigs and it is a computational waste to do each one
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
293 # of these in its own run. There is an I/O down side to running by subsets (even if they are
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
294 # one sequence per subset), compared to splitting the reference into sizes of 250 mb. With
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
295 # the subset action, lastz still has to read and parse the entire file for every run (this
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
296 # is true for fasta, but for .2bit files it can access each sequence directly within the file,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
297 # so the overhead is minimal).
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
298 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
299 :> output_file (this creates the output file, empty)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
300 while there are more sequences to align
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
301 find the next sequences that add up to 250M, put their names in farf.names
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
302 lastz ${refFile}[subset=farf.names][multi][unmask] ${matesPath}/${matesFile} ...
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
303 >> output_file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
304 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
305 seq = fasta_reader.next()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
306 if not seq:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
307 break
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
308 seqs += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
309 # Create a temporary file to contain the current sequence as input to lastz.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
310 # We're doing this a bit differently here since we could be generating a huge
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
311 # number of temporary files.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
312 tmp_in_fd, tmp_in_file_name = tempfile.mkstemp( suffix='seq_%d_in' % seqs )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
313 tmp_in_file = os.fdopen( tmp_in_fd, 'w+b' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
314 tmp_in_file.write( '>%s\n%s\n' % ( seq.name, seq.text ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
315 tmp_in_file.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
316 # Align short mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
317 command = 'lastz %s[unmask]%s %s ' % ( tmp_in_file_name, ref_name, tmp_mates_short_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
318 command += 'Z=1 --seed=1111111011111 --notrans --maxwordcount=90% --match=1,3 O=1 E=3 X=15 K=10 Y=12 L=18 --ambiguousn --noytrim --identity=95 --coverage=80 --continuity=95 --format=softsam- '
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
319 command += '>> %s' % tmp_short_file_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
320 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
321 # Align long mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
322 command = 'lastz %s[unmask]%s %s ' % ( tmp_in_file_name, ref_name, tmp_mates_long_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
323 command += 'Z=15 W=13 --notrans --exact=18 --maxwordcount=90% --match=1,3 O=1 E=3 Y=10 L=18 --ambiguousn --noytrim --identity=95 --coverage=90 --continuity=95 --format=softsam- '
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
324 command += '>> %s' % tmp_long_file_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
325 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
326 # Remove the temporary file that contains the current sequence
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
327 os.remove( tmp_in_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
328 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
329 # Reference is a locally cached 2bit file, split lastz calls across number of chroms in 2bit file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
330 tbf = TwoBitFile( open( input1, 'rb' ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
331 for chrom in tbf.keys():
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
332 # Align short mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
333 tmp_short_file_name = get_tmp_file_name( suffix='short_vs_%s' % chrom )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
334 tmp_align_file_names.append( tmp_short_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
335 command = 'lastz %s/%s[unmask]%s %s ' % ( input1, chrom, ref_name, tmp_mates_short_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
336 command += 'Z=1 --seed=1111111011111 --notrans --maxwordcount=90% --match=1,3 O=1 E=3 X=15 K=10 Y=12 L=18 --ambiguousn --noytrim --identity=95 --coverage=80 --continuity=95 --format=softsam- '
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
337 command += '> %s' % tmp_short_file_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
338 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
339 # Align long mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
340 tmp_long_file_name = get_tmp_file_name( suffix='long_vs_%s' % chrom )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
341 tmp_align_file_names.append( tmp_long_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
342 command = 'lastz %s/%s[unmask]%s %s ' % ( input1, chrom, ref_name, tmp_mates_long_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
343 command += 'Z=15 W=13 --notrans --exact=18 --maxwordcount=90% --match=1,3 O=1 E=3 Y=10 L=18 --ambiguousn --noytrim --identity=95 --coverage=90 --continuity=95 --format=softsam- '
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
344 command += '> %s' % tmp_long_file_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
345 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
346 return tmp_align_file_names
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
347
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
348 def paired_mate_unmapper( input2, input4, tmp_mates_mapping_file_name, tmp_align_file_name_list, output ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
349 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
350 Given a SAM file corresponding to alignments of *subsegments* of paired 'reads' to a reference sequence,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
351 convert the positions on the subsegments to positions on the reads. Also (optionally) add quality values.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
352
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
353 The input file is in SAM format, as shown below. Each line represents the alignment of a part of a read
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
354 to a reference sequence. Read pairs are indicated by suffixes in their names. Normally, the suffixes _L
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
355 and _R indicate the left and right mates of reads (this can be overridden with the --left and --right
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
356 options). Reads that were not mates have no suffix.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
357
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
358 (SAM header lines omitted)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
359 F2YP0BU02G7LK5_R 16 chr21 15557360 255 40M * 0 0 ATTTTATTCTCTTTGAAGCAATTGTGAATGGGAGTTTACT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
360 F2YP0BU02HXV58_L 16 chr21 15952091 255 40M6S * 0 0 GCAAATTGTGCTGCTTTAAACATGCGTGTGCAAGTATCTTtttcat *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
361 F2YP0BU02HREML_R 0 chr21 16386077 255 33M5S * 0 0 CCAAAGTTCTGGGATTACAGGCGTGAGCCATCGcgccc *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
362 F2YP0BU02IOF1F_L 0 chr21 17567321 255 7S28M * 0 0 taaagagAAGAATTCTCAACCCAGAATTTCATATC *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
363 F2YP0BU02IKX84_R 16 chr21 18491628 255 22M1D18M9S * 0 0 GTCTCTACCAAAAAATACAAAAATTAGCCGGGCGTGGTGGcatgtctgt *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
364 F2YP0BU02GW5VA_L 16 chr21 20255344 255 6S32M * 0 0 caagaaCAAACACATTCAAAAGCTAGTAGAAGGCAAGA *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
365 F2YP0BU02JIMJ4_R 0 chr21 22383051 255 19M * 0 0 CCCTTTATCATTTTTTATT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
366 F2YP0BU02IXZGF_L 16 chr21 23094798 255 13M1I18M * 0 0 GCAAGCTCCACTTCCCGGGTTCACGCCATTCT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
367 F2YP0BU02IODR5_L 0 chr21 30935325 255 37M * 0 0 GAAATAAAGGGTATTCAATTAGGAAAAGAGGAAGTCA *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
368 F2YP0BU02IMZBL_L 16 chr21 31603486 255 28M1D1M * 0 0 ATACAAAAATTAGCCGGGCACAGTGGCAG *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
369 F2YP0BU02JA9PR_L 16 chr21 31677159 255 23M * 0 0 CACACCTGTAACCCCAGCACTTT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
370 F2YP0BU02HKC61_R 0 chr21 31678718 255 40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
371 F2YP0BU02HKC61_R 0 chr21 31678718 255 40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
372 F2YP0BU02HVA88 16 chr21 31703558 255 1M1D35M8S * 0 0 TGGGATTACAGGCGTGAGCTACCACACCCAGCCAGAgttcaaat *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
373 F2YP0BU02JDCF1_L 0 chr21 31816600 255 38M * 0 0 AGGAGAATCGCTTGAACCCAGGAGGCAGAGGTTGCGGT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
374 F2YP0BU02GZ1GO_R 0 chr21 33360122 255 6S38M * 0 0 cctagaCTTCACACACACACACACACACACACACACACACACAC *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
375 F2YP0BU02FX387_L 16 chr22 14786201 255 26M * 0 0 TGGATGAAGCTGGAAACCATCATTCT *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
376 F2YP0BU02IF2NE_R 0 chr22 16960842 255 40M10S * 0 0 TGGCATGCACCTGTAGTCTCAGCTACTTGGGAGGCTGAGGtgggaggatc *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
377 F2YP0BU02F4TVA 0 chr22 19200522 255 49M * 0 0 CCTGGGAGGCGGAGGTTGCAGTGAGCCGAGATCACGCCATTGCACTCCA *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
378 F2YP0BU02HKC61_R 16 chr22 29516998 255 8S32M * 0 0 agacagagTCTTGCTTTGTCACCCAGGCTGGAGTGCAGTG *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
379 F2YP0BU02FS4EM_R 0 chr22 30159364 255 29M * 0 0 CTCCTGCCTCAGCCTCCCGAGTAGTTGGG *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
380 F2YP0BU02G197P_L 0 chr22 32044496 255 40M10S * 0 0 TTGTTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTgaataatgcc *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
381 F2YP0BU02FIING 16 chr22 45959944 255 3M1I11M1I26M * 0 0 AGCTATGGTACTGGCTATGAAAGCAGACACATAGACCAATGG *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
382 F2YP0BU02GUB9L_L 16 chr22 49198404 255 16M1I20M * 0 0 CACCACGCTCGGCTAATTTTTGTATTTTTAGTAGAGA *
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
383
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
384 The user must provide a mapping file (which might better be called an unmapping file). This file is usually
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
385 created by split_paired_reads, and tells us how to map the subsegments back to original coordinates in a single
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
386 read (this means the left and right mates were part of a single read). The mapping file contains four columns.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
387 The first two give the mates's name (including the suffix) and the read name. The last two columns describe how
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
388 much of the full original sequence is missing from the mate. For example, in the read below, the left mate is
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
389 missing 63 on the right (42 for the linker and 21 for the right half). The right mate is missing 339 on the left.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
390
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
391 left half: TTTCAACATATGCAAATCAATAAATGTAATCCAGCATATAAACAGAACCA
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
392 AAGACAAAAACCACATGATTATCTCAATAGATGCAGAAAAGGCCTTCGGC
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
393 AAAATTCAACAAAACTCCATGCTAAAACTCTCAATAAGGTATTGATGGGA
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
394 CATGCCGCATAATAATAAGACATATCTATGACAAACCCACAGCCAATATC
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
395 ATGCTGAATGCACAAAAATTGGAAGCATTCCCTTTGAAAACTGGCACAAG
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
396 ACTGGGATGCCCTCTCTCACAACTCCTATTCAACATAGTGTTGGAAG
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
397 linker: CGTAATAACTTCGTATAGCATACATTATACGAAGTCATACGA
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
398 right half: CTCCTGCCTCAGCCTCCCGAG
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
399
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
400 mate_name read_name offset_to_start offset_from_end
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
401 F2YP0BU02FS4EM_L F2YP0BU02FS4EM 0 71
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
402 F2YP0BU02FS4EM_R F2YP0BU02FS4EM 339 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
403
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
404 The user can also specify a quality scores file, which should look something like this. Quality values are presumed
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
405 to be PHRED scores, written in space-delimited decimal.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
406
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
407 >F2YP0BU02FS4EM
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
408 38 38 38 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 38 21 21 21 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
409 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 33
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
410 32 32 40 40 40 21 21 18 18 21 34 34 31 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
411 40 40 40 40 40 40 40 40 40 40 40 32 32 32 32 40 40 40 40 40 40 40 34 34 35
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
412 31 31 28 28 33 33 33 36 36 36 17 17 17 19 26 36 36 36 40 40 40 40 40 33 34
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
413 34 34 39 39 39 40 40 40 40 40 33 33 34 34 40 40 40 40 40 40 40 39 39 39 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
414 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
415 40 40 40 40 40 40 40 39 39 39 39 39 39 40 40 40 39 39 39 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
416 40 40 40 40 40 40 40 40 40 40 40 40 40 26 26 26 26 26 40 40 38 38 37 35 33
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
417 36 40 19 17 17 17 17 19 19 23 30 20 20 20 23 35 40 36 36 36 36 36 36 36 36
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
418 39 40 34 20 27 27 35 39 40 37 40 40 40 40 40 40 40 40 40 40 34 34 35 39 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
419 40 40 40 40 40 40 39 39 39 40 40 40 40 36 36 32 32 28 28 29 30 36 40 30 26
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
420 26 26 34 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
421 40 39 35 34 34 40 40 40 40 30 30 30 35 40 40 40 40 40 39 39 36 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
422 39 39 39 39 30 30 28 35 35 39 40 40 40 40 40 35 35 35
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
423 >F2YP0BU02G197P
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
424 40 40 40 40 40 40 40 40 40 40 39 39 39 39 39 39 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
425 40 40 40 40 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
426 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
427 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
428 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 34 34 34 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
429 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
430 40 40 40 40 40 40 40 34 34 34 34 40 40 40 40 34 34 34 34 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
431 40 40 40 40 40 39 39 39 34 34 34 34 40 40 40 40 39 39 25 25 26 39 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
432 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
433 33 33 33 33 40 35 21 21 21 30 38 40 40 40 40 40 40 40 40 35 35 30 30 30 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
434 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
435 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
436 40 40 40 39 39 39 40 40
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
437 >F2YP0BU02FIING
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
438 32 32 32 25 25 25 25 24 25 30 31 30 27 27 27 28 28 21 19 19 13 13 13 14 19
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
439 19 17 19 16 16 25 28 22 21 17 17 18 25 24 25 25 25
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
440
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
441 The output file is also SAM:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
442
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
443 (SAM header lines omitted)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
444 F2YP0BU02G7LK5 81 chr21 15557360 255 40M303H * 0 0 ATTTTATTCTCTTTGAAGCAATTGTGAATGGGAGTTTACT D>>>>IIIIIIHHG???IIIIIIIIIHHHFFEIH999HII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
445 F2YP0BU02HXV58 145 chr21 15952091 255 226H40M6S * 0 0 GCAAATTGTGCTGCTTTAAACATGCGTGTGCAAGTATCTTtttcat AA===DDDDAAAAD???:::ABBBBBAAA:888ECF;F>>>?8??@
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
446 F2YP0BU02HREML 65 chr21 16386077 255 320H33M5S * 0 0 CCAAAGTTCTGGGATTACAGGCGTGAGCCATCGcgccc HH???HHIIIHFHIIIIIIICDDHHIIIIIIHHHHHHH
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
447 F2YP0BU02IOF1F 129 chr21 17567321 255 7S28M409H * 0 0 taaagagAAGAATTCTCAACCCAGAATTTCATATC 4100<<A>4113:<EFGGGFFFHHHHHHDFFFFED
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
448 F2YP0BU02IKX84 81 chr21 18491628 255 22M1D18M9S341H * 0 0 GTCTCTACCAAAAAATACAAAAATTAGCCGGGCGTGGTGGcatgtctgt ;;;=7@.55------?2?11112GGB=CCCCDIIIIIIIIIHHHHHHII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
449 F2YP0BU02GW5VA 145 chr21 20255344 255 286H6S32M * 0 0 caagaaCAAACACATTCAAAAGCTAGTAGAAGGCAAGA IIIIIIIHHHIIIIIIICCCCIIIIIIIIIIIIIIIII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
450 F2YP0BU02JIMJ4 65 chr21 22383051 255 208H19M * 0 0 CCCTTTATCATTTTTTATT 555544E?GE113344I22
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
451 F2YP0BU02IXZGF 145 chr21 23094798 255 291H13M1I18M * 0 0 GCAAGCTCCACTTCCCGGGTTCACGCCATTCT IIIIIIIIIIIGG;;;GGHIIIIIGGGIIIII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
452 F2YP0BU02IODR5 129 chr21 30935325 255 37M154H * 0 0 GAAATAAAGGGTATTCAATTAGGAAAAGAGGAAGTCA 6...7/--..,30;9<<>@BFFFAAAAHIIIIIH@@@
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
453 F2YP0BU02IMZBL 145 chr21 31603486 255 342H28M1D1M * 0 0 ATACAAAAATTAGCCGGGCACAGTGGCAG BB1552222<<>9==8;;?AA=??A???A
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
454 F2YP0BU02JA9PR 145 chr21 31677159 255 229H23M * 0 0 CACACCTGTAACCCCAGCACTTT IIIIIIIIIIICCCCIIIIIHHH
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
455 F2YP0BU02HKC61 65 chr21 31678718 255 300H40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT AA@BD:::==AAA@A?8888:<90004<>>?><<<<4442
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
456 F2YP0BU02HKC61 65 chr21 31678718 255 300H40M * 0 0 CACTGCACTCCAGCCTGGGTGACAAAGCAAGACTCTGTCT AA@BD:::==AAA@A?8888:<90004<>>?><<<<4442
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
457 F2YP0BU02HVA88 16 chr21 31703558 255 1M1D35M8S * 0 0 TGGGATTACAGGCGTGAGCTACCACACCCAGCCAGAgttcaaat >8888DFFHHGFHHHH@@?@?DDC96666HIIIFFFFFFFFFFF
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
458 F2YP0BU02JDCF1 129 chr21 31816600 255 38M103H * 0 0 AGGAGAATCGCTTGAACCCAGGAGGCAGAGGTTGCGGT IIIIIIIIIIIHHHIIHHHIIIIIIIIIIIIIIIIIII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
459 F2YP0BU02GZ1GO 65 chr21 33360122 255 76H6S38M * 0 0 cctagaCTTCACACACACACACACACACACACACACACACACAC BBBBD?:688CFFFFFFFFFFFFFFFFFFFFFFFFFFDDBBB51
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
460 F2YP0BU02FX387 145 chr22 14786201 255 201H26M * 0 0 TGGATGAAGCTGGAAACCATCATTCT IIHHHHHHHHHHHHHFFFFFFFFFFF
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
461 F2YP0BU02IF2NE 65 chr22 16960842 255 209H40M10S * 0 0 TGGCATGCACCTGTAGTCTCAGCTACTTGGGAGGCTGAGGtgggaggatc BAAADDDDFDDDDDDBBA889<A?4444000@<>AA?9444;;8>77<7-
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
462 F2YP0BU02F4TVA 0 chr22 19200522 255 49M * 0 0 CCTGGGAGGCGGAGGTTGCAGTGAGCCGAGATCACGCCATTGCACTCCA FFF???FFFFFIIIIIIIIIIIIIIIIIIIIIIIHHIIFHFFFGDDB=5
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
463 F2YP0BU02HKC61 81 chr22 29516998 255 8S32M300H * 0 0 agacagagTCTTGCTTTGTCACCCAGGCTGGAGTGCAGTG 2444<<<<>?>><40009<:8888?A@AAA==:::DB@AA
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
464 F2YP0BU02FS4EM 65 chr22 30159364 255 339H29M * 0 0 CTCCTGCCTCAGCCTCCCGAGTAGTTGGG IIIIHHEIIIIHHHH??=DDHIIIIIDDD
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
465 F2YP0BU02G197P 129 chr22 32044496 255 40M10S258H * 0 0 TTGTTGGACATTTGGGTTGGTTCCAAGTCTTTGCTATTGTgaataatgcc IIIIIIIIIIHHHHHHIIIIIIIIIIIII;;;IIIIIIIIIIIIIIIIII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
466 F2YP0BU02FIING 16 chr22 45959944 255 3M1I11M1I26M * 0 0 AGCTATGGTACTGGCTATGAAAGCAGACACATAGACCAATGG :::9:32267=:114244/...446==<<<?@?:9::::AAA
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
467 F2YP0BU02GUB9L 145 chr22 49198404 255 176H16M1I20M * 0 0 CACCACGCTCGGCTAATTTTTGTATTTTTAGTAGAGA IIIIIIIIIHAAC;<</////@4F5778;IIIIIIII
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
468
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
469 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
470 left_suffix = "_L"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
471 right_suffix = "_R"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
472 # Read the mapping
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
473 mate_to_read_dict = {}
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
474 i = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
475 for i, line in enumerate( file( tmp_mates_mapping_file_name, 'rb' ) ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
476 line = line.strip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
477 if not line.startswith( "#" ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
478 fields = line.split()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
479 if len( fields ) != 4:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
480 skip_line( "num_fields", i+1, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
481 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
482 mate_name, read_name, s_offset, e_offset = fields
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
483 if mate_name in mate_to_read_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
484 skip_line( 'two_mate_names', i+1, mate_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
485 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
486 mate_to_read_dict[ mate_name ] = ( read_name, int( s_offset ), int( e_offset ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
487 # Read sequence data
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
488 read_to_nucs_dict = {}
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
489 seqs = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
490 fasta_reader = FastaReader( file( input2, 'rb' ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
491 while True:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
492 seq = fasta_reader.next()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
493 if not seq:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
494 break
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
495 seqs += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
496 seq_text_upper = seq.text.upper()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
497 if seq.name in read_to_nucs_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
498 if seq_text_upper != read_to_nucs_dict[ seq.name ]:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
499 skip_line( 'inconsistent_reads', seqs, seq.name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
500 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
501 read_to_nucs_dict[ seq.name ] = seq_text_upper
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
502 # Read quality data
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
503 def quality_sequences( f ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
504 seq_name = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
505 seq_quals = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
506 line_number = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
507 for line in f:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
508 line_number += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
509 line = line.strip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
510 if line.startswith( ">" ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
511 if seq_name != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
512 yield ( seq_name, seq_quals, seq_line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
513 seq_name = sequence_name( line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
514 seq_line = line_number
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
515 seq_quals = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
516 elif seq_name is None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
517 skip_line( 'no_header', line_number, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
518 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
519 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
520 seq_quals += [ int( q ) for q in line.split() ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
521 if seq_name is not None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
522 yield ( seq_name, seq_quals, seq_line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
523 def sequence_name( s ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
524 s = s[ 1: ].strip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
525 if not s:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
526 return ""
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
527 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
528 return s.split()[ 0 ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
529 read_to_quals_dict = {}
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
530 # TODO: should we use Dan's fastaNamedReader here?
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
531 for seq_name, quals, line_number in quality_sequences( file( input4 ) ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
532 quals = samify_phred_scores( quals )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
533 if seq_name in read_to_quals_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
534 if quals != read_to_quals_dict[ seq_name ]:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
535 skip_line( 'inconsistent_reads', line_number, seq_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
536 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
537 if len( quals ) != len( read_to_nucs_dict[ seq_name ] ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
538 skip_line( 'inconsistent_read_lengths', line_number, seq_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
539 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
540 read_to_quals_dict[ seq_name ] = quals
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
541 # process the SAM file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
542 tmp_align_file_names = ' '.join( tmp_align_file_name_list )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
543 combined_chrom_file_name = get_tmp_file_name( suffix='combined_chrom' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
544 command = 'cat %s | grep -v "^@" | sort -k 1 > %s' % ( tmp_align_file_names, combined_chrom_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
545 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
546 fout = file( output, 'w+b' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
547 has_non_header = False
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
548 i = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
549 for i, line in enumerate( file( combined_chrom_file_name, 'rb' ) ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
550 line = line.strip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
551 if line.startswith( "@" ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
552 if has_non_header:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
553 skip_line( 'sam_headers', i+1, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
554 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
555 fout.write( "%s\n" % line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
556 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
557 has_non_header = True
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
558 fields = line.split()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
559 num_fields = len( fields )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
560 if num_fields < SAM_MIN_COLUMNS:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
561 skip_line( 'sam_min_columns', i+1, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
562 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
563 # Set flags for mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
564 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
565 flag = int( fields[ SAM_FLAG_COLUMN ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
566 except ValueError:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
567 skip_line( 'sam_flag', i+1, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
568 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
569 if not( flag & ( BAM_FPAIRED + BAM_FREAD1 + BAM_FREAD2 ) == 0 ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
570 skip_line( 'reads_paired', i+1, line )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
571 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
572 mate_name = fields[ SAM_QNAME_COLUMN ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
573 unmap_it = False
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
574 half = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
575 if mate_name.endswith( left_suffix ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
576 flag += BAM_FPAIRED + BAM_FREAD2
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
577 fields[ SAM_FLAG_COLUMN ] = "%d" % flag
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
578 unmap_it = True
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
579 half = "L"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
580 elif mate_name.endswith( right_suffix ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
581 flag += BAM_FPAIRED + BAM_FREAD1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
582 fields[ SAM_FLAG_COLUMN ] = "%d" % flag
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
583 unmap_it = True
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
584 half = "R"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
585 on_plus_strand = ( flag & BAM_FREVERSE == 0 )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
586 # Convert position from mate to read by adding clipping to cigar
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
587 if not unmap_it:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
588 read_name = mate_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
589 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
590 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
591 read_name, s_offset, e_offset = mate_to_read_dict[ mate_name ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
592 except KeyError:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
593 skip_line( 'missing_mate', i+1, mate_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
594 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
595 cigar = fields[ SAM_CIGAR_COLUMN ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
596 cigar_prefix = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
597 cigar_suffix = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
598 if half == "L":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
599 if on_plus_strand:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
600 if s_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
601 cigar_prefix = ( s_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
602 if e_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
603 cigar_suffix = ( e_offset, "H" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
604 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
605 if e_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
606 cigar_prefix = ( e_offset, "H" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
607 if s_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
608 cigar_suffix = ( s_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
609 elif half == "R":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
610 if on_plus_strand:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
611 if s_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
612 cigar_prefix = ( s_offset, "H" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
613 if e_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
614 cigar_suffix = ( e_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
615 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
616 if e_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
617 cigar_prefix = ( e_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
618 if s_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
619 cigar_suffix = ( s_offset, "H" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
620 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
621 if on_plus_strand:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
622 if s_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
623 cigar_prefix = ( s_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
624 if e_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
625 cigar_suffix = ( e_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
626 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
627 if e_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
628 cigar_prefix = ( e_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
629 if s_offset > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
630 cigar_suffix = ( s_offset, "S" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
631 if cigar_prefix != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
632 count, op = cigar_prefix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
633 cigar = prefix_cigar( "%d%s" % ( count, op ), cigar )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
634 if op == "S":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
635 refPos = int( fields[ SAM_POS_COLUMN ] ) - count
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
636 fields[ SAM_POS_COLUMN ] = "%d" % refPos
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
637 if cigar_suffix != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
638 count, op = cigar_suffix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
639 cigar = suffix_cigar( cigar,"%d%s" % ( count, op) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
640 fields[ SAM_QNAME_COLUMN ] = read_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
641 fields[ SAM_CIGAR_COLUMN ] = cigar
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
642 # Fetch sequence and quality values, and flip/clip them
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
643 if read_name not in read_to_nucs_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
644 skip_line( 'missing_seq', i+1, read_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
645 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
646 nucs = read_to_nucs_dict[ read_name ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
647 if not on_plus_strand:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
648 nucs = reverse_complement( nucs )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
649 quals = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
650 if read_to_quals_dict != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
651 if read_name not in read_to_quals_dict:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
652 skip_line( 'missing_quals', i+1, read_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
653 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
654 quals = read_to_quals_dict[ read_name ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
655 if not on_plus_strand:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
656 quals = reverse_string( quals )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
657 cigar = split_cigar( fields[ SAM_CIGAR_COLUMN ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
658 nucs, quals = clip_for_cigar( cigar, nucs, quals )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
659 fields[ SAM_SEQ_COLUMN ] = nucs
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
660 if quals != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
661 fields[ SAM_QUAL_COLUMN ] = quals
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
662 # Output the line
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
663 fout.write( "%s\n" % "\t".join( fields ) )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
664 fout.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
665
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
666 def prefix_cigar( prefix, cigar ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
667 ix = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
668 while cigar[ ix ].isdigit():
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
669 ix += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
670 if cigar[ ix ] != prefix[ -1 ]:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
671 return prefix + cigar
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
672 count = int( prefix[ :-1 ] ) + int( cigar[ :ix ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
673 return "%d%s%s" % ( count, prefix[ -1 ], cigar[ ix+1: ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
674
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
675 def suffix_cigar( cigar, suffix ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
676 if cigar[ -1 ] != suffix[ -1 ]:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
677 return cigar + suffix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
678 ix = len( cigar ) - 2
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
679 while cigar[ix].isdigit():
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
680 ix -= 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
681 ix += 1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
682 count = int( cigar[ ix:-1 ] ) + int( suffix[ :-1 ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
683 return "%s%d%s" % ( cigar[ :ix ], count, suffix[ -1 ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
684
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
685 def split_cigar( text ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
686 fields = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
687 field = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
688 for ch in text:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
689 if ch not in "MIDHS":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
690 field += ch
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
691 continue
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
692 if field == []:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
693 raise ValueError
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
694 fields += [ ( int( "".join( field ) ), ch ) ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
695 field = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
696 if field != []:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
697 raise ValueError
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
698 return fields
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
699
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
700 def clip_for_cigar( cigar, nucs, quals ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
701 # Hard clip prefix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
702 count, op = cigar[0]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
703 if op == "H":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
704 nucs = nucs[ count: ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
705 if quals != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
706 quals = quals[ count: ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
707 count, op = cigar[ 1 ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
708 # Soft clip prefix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
709 if op == "S":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
710 nucs = nucs[ :count ].lower() + nucs[ count: ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
711 # Hard clip suffix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
712 count,op = cigar[ -1 ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
713 if op == "H":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
714 nucs = nucs[ :-count ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
715 if quals != None:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
716 quals = quals[ :-count ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
717 count, op = cigar[ -2 ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
718 # Soft clip suffix
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
719 if op == "S":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
720 nucs = nucs[ :-count ] + nucs[ -count: ].lower()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
721 return nucs, quals
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
722
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
723 def samify_phred_scores( quals ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
724 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
725 Convert a decimal list of phred base-quality scores to a sam quality string.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
726 Note that if a quality is outside the dynamic range of sam's ability to
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
727 represent it, we clip the value to the max allowed. SAM quality scores
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
728 range from chr(33) to chr(126).
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
729 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
730 if min( quals ) >= 0 and max( quals ) <= 126-33:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
731 return "".join( [ chr( 33 + q ) for q in quals ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
732 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
733 return "".join( [ chr( max( 33, min( 126, 33+q ) ) ) for q in quals ] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
734
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
735 def reverse_complement( nucs ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
736 complementMap = maketrans( "ACGTacgt", "TGCAtgca" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
737 return nucs[ ::-1 ].translate( complementMap )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
738
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
739 def reverse_string( s ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
740 return s[ ::-1 ]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
741
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
742 def __main__():
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
743 # Parse command line
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
744 # input1: a reference genome ( 2bit or fasta )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
745 # input2: a collection of 454 paired end reads ( a fasta file )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
746 # input3: a linker sequence ( a very small fasta file )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
747 # input4: a base quality score 454 file ( qual454 )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
748 parser = optparse.OptionParser()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
749 parser.add_option( '', '--ref_name', dest='ref_name', help='The reference name to change all output matches to' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
750 parser.add_option( '', '--ref_source', dest='ref_source', help='The reference is cached or from the history' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
751 parser.add_option( '', '--ref_sequences', dest='ref_sequences', help='Number of sequences in the reference dataset' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
752 parser.add_option( '', '--source_select', dest='source_select', help='Use pre-set or cached reference file' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
753 parser.add_option( '', '--input1', dest='input1', help='The name of the reference file if using history or reference base name if using cached' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
754 parser.add_option( '', '--input2', dest='input2', help='The 454 reads file to align' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
755 parser.add_option( '', '--input3', dest='input3', help='The sequencing linker file' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
756 parser.add_option( '', '--input4', dest='input4', help='The base quality score 454 file' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
757 parser.add_option( '', '--output', dest='output', help='The output file' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
758 parser.add_option( '', '--lastz_seqs_file_dir', dest='lastz_seqs_file_dir', help='Directory of local lastz_seqs.loc file' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
759 ( options, args ) = parser.parse_args()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
760
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
761 # output version # of tool
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
762 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
763 tmp = tempfile.NamedTemporaryFile().name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
764 tmp_stdout = open( tmp, 'wb' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
765 proc = subprocess.Popen( args='lastz -v', shell=True, stdout=tmp_stdout )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
766 tmp_stdout.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
767 returncode = proc.wait()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
768 stdout = None
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
769 for line in open( tmp_stdout.name, 'rb' ):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
770 if line.lower().find( 'version' ) >= 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
771 stdout = line.strip()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
772 break
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
773 if stdout:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
774 sys.stdout.write( '%s\n' % stdout )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
775 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
776 raise Exception
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
777 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
778 sys.stdout.write( 'Could not determine Lastz version\n' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
779
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
780 if options.ref_name:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
781 ref_name = '[nickname=%s]' % options.ref_name
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
782 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
783 ref_name = ''
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
784 if options.ref_source == 'history':
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
785 # Reference is a fasta dataset from the history
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
786 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
787 # Ensure there is at least 1 sequence in the dataset ( this may not be necessary ).
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
788 error_msg = "The reference dataset is missing metadata, click the pencil icon in the history item and 'auto-detect' the metadata attributes."
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
789 ref_sequences = int( options.ref_sequences )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
790 if ref_sequences < 1:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
791 stop_err( error_msg )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
792 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
793 stop_err( error_msg )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
794 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
795 ref_sequences = 0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
796 tmp_w12_name = get_tmp_file_name( suffix='vs_linker.W12' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
797 tmp_T1_name = get_tmp_file_name( suffix='vs_linker.T1' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
798 # Run lastz twice ( with different options ) on the linker sequence and paired end reads,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
799 # looking for the linker ( each run finds some the other doesn't )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
800 command = 'lastz %s %s W=12 --notrans --exact=18 --match=1,3 O=1 E=3 Y=10 L=18 --ambiguousn --coverage=85 --format=general-:name2,zstart2+,length2,size2 > %s' % \
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
801 ( options.input3, options.input2, tmp_w12_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
802 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
803 command = 'lastz %s %s T=1 --match=1,2 O=1 E=2 X=15 K=10 Y=15 L=18 --ambiguousn --coverage=85 --format=general-:name2,zstart2+,length2,size2 > %s' % \
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
804 ( options.input3, options.input2, tmp_T1_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
805 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
806 # Combine the alignment output from the two lastz runs
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
807 tmp_combined_linker_file_name = get_tmp_file_name( suffix='vs_linker' )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
808 command = 'cat %s %s | sort -u > %s' % ( tmp_w12_name, tmp_T1_name, tmp_combined_linker_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
809 run_command( command )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
810 # Use the alignment info to split reads into left and right mates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
811 tmp_mates_mapping_file_name, tmp_mates_file_name, tmp_mates_short_file_name, tmp_mates_long_file_name = split_paired_reads( options.input2, tmp_combined_linker_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
812 # Align mates to the reference - tmp_align_file_names is a list of file names created by align_mates()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
813 tmp_align_file_name_list = align_mates( options.input1, options.ref_source, ref_name, ref_sequences, tmp_mates_short_file_name, tmp_mates_long_file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
814 # Combine and convert mate coordinates back to read coordinates
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
815 paired_mate_unmapper( options.input2, options.input4, tmp_mates_mapping_file_name, tmp_align_file_name_list, options.output )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
816 # Delete all temporary files
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
817 for file_name in tmp_file_names:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
818 os.remove( file_name )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
819 # Handle any invalid lines in the input data
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
820 if total_skipped_lines:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
821 msgs = dict( bad_interval="Bad interval in line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
822 inconsistent_read_lengths="Inconsistent read/quality lengths for seq #",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
823 inconsistent_reads="Inconsistent reads for seq #",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
824 inconsistent_sizes="Inconsistent sizes for seq #",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
825 missing_mate="Mapping file does not include mate on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
826 missing_quals="Missing quality values for name on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
827 missing_seq="Missing sequence for name on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
828 multiple_seqs="Multiple names for seq #",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
829 no_header="First quality sequence has no header",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
830 num_fields="Must have 4 fields in line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
831 reads_paired="SAM flag indicates reads already paired on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
832 sam_flag="Bad SAM flag on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
833 sam_headers="SAM headers on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
834 sam_min_columns="Need 11 columns on line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
835 two_mate_names="Mate name already seen, line",
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
836 wrong_seq_len="Size differs from length of seq #" )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
837 print "Skipped %d invalid lines: "
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
838 msg = ""
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
839 for k, v in skipped_lines.items():
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
840 if v[0]:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
841 # v[0] is the number of times the error occurred
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
842 # v[1] is the position of the line or sequence in the file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
843 # v[2] is the name of the sequence or the text of the line
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
844 msg += "(%d)%s %d:%s. " % ( v[0], msgs[k], v[1], v[2] )
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
845 print msg
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
846
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
847 if __name__=="__main__": __main__()