Mercurial > repos > davidvanzessen > argalaxy_tools
comparison igblastparser/igparse.pl @ 0:afe85eb6572e draft
Uploaded
author | davidvanzessen |
---|---|
date | Mon, 29 Aug 2016 05:41:20 -0400 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:afe85eb6572e |
---|---|
1 #!/usr/bin/perl | |
2 =head1 IGBLAST_simple.pl | |
3 | |
4 This version (1.4) has been heavily adapted since the original program was first created back in October 2012. | |
5 Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed with minor - though important - code changes. | |
6 | |
7 From V 1.2 onwards a 'Change Log' is included at the end of the program | |
8 | |
9 =head2 Usage | |
10 | |
11 Requires no modules in general use; the Data::Dumper (supplied as part of the Perl Core module set) might be useful for debugging/adjustment | |
12 as it allows inspection of the data stores. | |
13 | |
14 The program takes a text file of the | |
15 | |
16 ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process-> | |
17 | |
18 Supply the text version of the igBLAST report in the format as in the example below. | |
19 The extra command line arugment is the record number (aka. BLAST report) to process. | |
20 If 0 or absent all are processed, if supplied that record (base 1) is processed and the program dies afterwards. | |
21 | |
22 =head2 Example Input | |
23 | |
24 A standard igBLAST record or set of them in a file; this being typical: | |
25 | |
26 BLASTN 2.2.27+ | |
27 | |
28 | |
29 Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. | |
30 Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. | |
31 Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of | |
32 protein database search programs", Nucleic Acids Res. 25:3389-3402. | |
33 | |
34 | |
35 | |
36 Database: human_gl_V; human_gl_D; human_gl_J | |
37 674 sequences; 179,480 total letters | |
38 | |
39 | |
40 | |
41 Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1 | |
42 run=R_2012_04_10_11_57_56_ | |
43 | |
44 Length=433 | |
45 Score E | |
46 Sequences producing significant alignments: (Bits) Value | |
47 | |
48 lcl|IGHV3-30*04 330 2e-92 | |
49 lcl|IGHV3-30-3*01 330 2e-92 | |
50 lcl|IGHV3-30*01 327 2e-91 | |
51 lcl|IGHD3-16*01 14.4 11 | |
52 lcl|IGHD3-16*02 14.4 11 | |
53 lcl|IGHD1-14*01 12.4 43 | |
54 lcl|IGHJ4*02 78.3 1e-18 | |
55 lcl|IGHJ5*02 70.3 4e-16 | |
56 lcl|IGHJ4*01 68.3 2e-15 | |
57 | |
58 | |
59 Domain classification requested: imgt | |
60 | |
61 | |
62 V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand): | |
63 IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame + | |
64 | |
65 V(D)J junction details (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself | |
66 AGAGA TATGAGCCCCATCATGACA ACGTTTG CCGGAA ACTAC | |
67 | |
68 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity) | |
69 FWR1 27 38 12 11 1 0 91.7 | |
70 CDR1 39 62 24 22 2 0 91.7 | |
71 FWR2 63 113 51 50 1 0 98 | |
72 CDR2 114 137 24 23 1 0 95.8 | |
73 FWR3 138 251 114 109 5 0 95.6 | |
74 CDR3 (V region only) 252 259 8 7 1 0 87.5 | |
75 Total N/A N/A 233 222 11 0 95.3 | |
76 | |
77 | |
78 Alignments | |
79 | |
80 <----FWR1--><----------CDR1--------><-----------------------FWR2------ | |
81 W A A S G F T F N T Y A V H W V R Q A P G K G | |
82 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 | |
83 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 | |
84 C A A S G F T F S S Y A M H W V R Q A P G K G | |
85 V 95.7% (221/231) IGHV3-30-3*01 64 ..T......................G..G.......A................................. 133 | |
86 V 94.8% (221/233) IGHV3-30*01 64 ..T......................G..G.......A................................. 133 | |
87 | |
88 ----------------><----------CDR2--------><---------------------------- | |
89 L E W V A V I S Y D G S N K N Y A D S V K G R F | |
90 Query_1 97 TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT 166 | |
91 V 95.3% (222/233) IGHV3-30*04 134 ..................................T......T............................ 203 | |
92 L E W V A V I S Y D G S N K Y Y A D S V K G R F | |
93 V 95.7% (221/231) IGHV3-30-3*01 134 .........................................T............................ 203 | |
94 V 94.8% (221/233) IGHV3-30*01 134 .A................................T......T............................ 203 | |
95 | |
96 ---------------------------FWR3--------------------------------------- | |
97 T I S R D N S K N T L Y L Q M N S L R V E D T | |
98 Query_1 167 CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG 236 | |
99 V 95.3% (222/233) IGHV3-30*04 204 ...............................C.G.........................C.......... 273 | |
100 T I S R D N S K N T L Y L Q M N S L R A E D T | |
101 V 95.7% (221/231) IGHV3-30-3*01 204 ...............................C.G.........................C.......... 273 | |
102 V 94.8% (221/233) IGHV3-30*01 204 ...............................C.G.........................C.......... 273 | |
103 | |
104 --------------> | |
105 A V Y Y C T R D M S P I M T T F A G N Y W G Q | |
106 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306 | |
107 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296 | |
108 A V Y Y C A R | |
109 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294 | |
110 V 94.8% (221/233) IGHV3-30*01 274 .....G.........G.......----------------------------------------------- 296 | |
111 D 100.0% (7/7) IGHD3-16*01 12 ------------------------------------------.......--------------------- 18 | |
112 D 100.0% (7/7) IGHD3-16*02 12 ------------------------------------------.......--------------------- 18 | |
113 D 100.0% (6/6) IGHD1-14*01 8 -------------------------------------------------......--------------- 13 | |
114 J 100.0% (39/39) IGHJ4*02 10 -------------------------------------------------------............... 24 | |
115 J 100.0% (35/35) IGHJ5*02 17 -----------------------------------------------------------........... 27 | |
116 J 97.4% (38/39) IGHJ4*01 10 -------------------------------------------------------.............A. 24 | |
117 | |
118 | |
119 G T L V T V S S | |
120 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330 | |
121 J 100.0% (39/39) IGHJ4*02 25 ........................ 48 | |
122 J 100.0% (35/35) IGHJ5*02 28 ........................ 51 | |
123 J 97.4% (38/39) IGHJ4*01 25 ........................ 48 | |
124 | |
125 | |
126 Lambda K H | |
127 1.10 0.333 0.549 | |
128 | |
129 Gapped | |
130 Lambda K H | |
131 1.08 0.280 0.540 | |
132 | |
133 Effective search space used: 64847385 | |
134 | |
135 | |
136 Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1 | |
137 run=R_2012_04_10_11_57_56_ | |
138 | |
139 | |
140 ...etc... | |
141 | |
142 =head2 Example Output | |
143 | |
144 | |
145 Example output from the data above sent: | |
146 $ ./IGBLAST_simple.pl igBLASTOutput.txt 1 | |
147 D: Request to process just record '1' received | |
148 D: printOUTPUTData: Running | |
149 D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How' | |
150 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How | |
151 D: ID is: 'HL67IUI01D26LR' | |
152 D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259) | |
153 D: Starting Search for CDR3 | |
154 D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence) | |
155 D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27' | |
156 D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309) | |
157 D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +' | |
158 D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +' | |
159 D: printOUTPUTData: Running | |
160 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH | |
161 | |
162 =head4 Usage notes: | |
163 | |
164 Designed to be easy to "grep -v D:" or "grep OUTPUT:" for to select the parts you need: | |
165 | |
166 ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT: | |
167 | |
168 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How | |
169 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH | |
170 OUTPUT: HL67IUI01EQMLY In-frame IGHV4-39*01 IGHD2-8*01 IGHJ3*02 GGSISSSSYY 29 IYHSGST 20 CARDATYYSNGFDIWGQG 53 MOTIF_FOUND_IN_BOTH | |
171 OUTPUT: HL67IUI01CDCLP Out-of-frame IGHV3-23*01 IGHD3-3*01 IGHJ4*02 FSNYAM 16 SGSGDRTY 23 AKAD*FLEWLFRIGDGERLLGPGN 72 MOTIF_FOUND_IN_DNA | |
172 OUTPUT: HL67IUI01AHRNH N/A IGHV3-33*01 N/A N/A WIHLQ*LW 23 YGMMEVI 23 NOT_FOUND | |
173 OUTPUT: HL67IUI01DZZ1V Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASG 20 LYCASEGDIVASELLSTGARV 62 MOTIF_FOUND_IN_DNA | |
174 OUTPUT: HL67IUI01DTR2Y Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*VLGPG 59 MOTIF_FOUND_IN_DNA | |
175 OUTPUT: HL67IUI01EQL3S In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH | |
176 OUTPUT: HL67IUI01AFG46 In-frame IGLV2-34*01 N/A IGHJ4*02 NOT_FOUND | |
177 OUTPUT: HL67IUI01EFFKO In-frame IGHV3-11*01 IGHD6-6*01 IGHJ4*02 GFTFSDYY 23 ISYSGGTI 23 CARASGAARHRPLDYWGQG 56 MOTIF_FOUND_IN_BOTH | |
178 OUTPUT: HL67IUI01B18SG In-frame IGHV3-33*01 IGHD5-12*01 IGHJ4*02 VRQA 11 KYYANSVK 23 RLGGFDYWGQGTLVTVSS 53 MOTIF_FOUND_IN_BOTH | |
179 OUTPUT: HL67IUI01D6LER In-frame IGHV1-24*01 IGHD3-22*01 IGHJ4*02 GYSLNELS 23 PDPEDDE 23 TVQPSRITMMAVVITRIHWGASGARE 76 MOTIF_FOUND_IN_DNA | |
180 OUTPUT: HL67IUI01CYCLF N/A IGHV4-39*01 N/A N/A GGSISSSSYY 29 IYYSGST 20 NOT_FOUND | |
181 OUTPUT: HL67IUI01B4LEE In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH | |
182 OUTPUT: HL67IUI01A4KW4 Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*IWGQG 58 MOTIF_FOUND_IN_BOTH | |
183 OUTPUT: HL67IUI01E05BV In-frame IGHV1-24*01 IGHD3-22*01 IGHJ2*01 GYSLNELS 23 PDPEDDE 23 NOT_FOUND | |
184 OUTPUT: HL67IUI01CVVKY In-frame IGHV1-3*01 IGHD2-15*01 IGHJ1*01 NOT_FOUND | |
185 OUTPUT: HL67IUI01CN5P2 In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH | |
186 OUTPUT: HL67IUI01DUUJ5 In-frame IGHV3-21*01 IGHD1-7*01 IGHJ4*02 GYTFSTYS 23 ISSSSAYR 23 CARDIRLELRDWGQG 44 MOTIF_FOUND_IN_BOTH | |
187 OUTPUT: HL67IUI01E1AIR Out-of-frame IGHV4-39*01 N/A IGHJ3*01 WGLHRRW**L 29 FVS*RAPR 23 NOT_FOUND | |
188 OUTPUT: HL67IUI01CCZ8D Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASGR 20 YCASEGDIVASELLSTGARE 58 MOTIF_FOUND_IN_DNA | |
189 OUTPUT: HL67IUI01BT9IR N/A IGHV3-21*02 N/A N/A NOT_FOUND | |
190 OUTPUT: HL67IUI01COTO0 Out-of-frame IGHV4-39*01 N/A IGHJ3*01 GGFIGGGDNF 29 LYHDGRPA 23 NOT_FOUND | |
191 OUTPUT: HL67IUI01D994O In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH | |
192 OUTPUT: HL67IUI01A08CJ In-frame IGHV4-39*01 IGHD6-13*01 IGHJ5*02 GGSISSSSYY 29 IYYTWEH 21 CERARRGSSWGQLVRPLGPG 62 MOTIF_FOUN | |
193 | |
194 | |
195 | |
196 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How | |
197 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH | |
198 ...etc... | |
199 | |
200 =head4 Also, combined grep & sed: | |
201 | |
202 $ ./IGBLAST_simple.pl igBLASTOutput.txt | grep OUTPUT: | sed 's/OUTPUT:\t//' | |
203 | |
204 =cut | |
205 | |
206 =head3 CDR3 Patterns: | |
207 | |
208 We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly: | |
209 | |
210 my $DNACDR3_Pat = "TGGGG....GG."; | |
211 my $AASequenceMotifPattern = "WG.G"; | |
212 | |
213 They are treated as regex's when tested (so use "." to mean any DNA base, rather than 'N' or 'X'). | |
214 | |
215 [NB: These are original patterns used for testing, check the code for the current ones.] | |
216 | |
217 =cut | |
218 | |
219 my $DNACDR3_Pat = "TGGGG....GG."; | |
220 my $AACDR3_Pat = "WG.G"; | |
221 | |
222 use strict; | |
223 use Data::Dumper; | |
224 # Set this as to number of the result (aka "record") you want to process or 0 for all: | |
225 my $ProcessRecord =0; | |
226 if (defined $ARGV[1]) { $ProcessRecord = pop @ARGV; } #Also accept from the command line: | |
227 if ($ProcessRecord != 0) { print "D: Request to process just record '$ProcessRecord' received\n"; } | |
228 | |
229 #Adjust the record separator: | |
230 $/="Query= "; | |
231 my $Record=0; # A simple counter, that we might not use. | |
232 #Force-loaded header / version information: | |
233 my $Header = <>; | |
234 #At the moment we don't use this - so dump it immediately: | |
235 $Header = undef; | |
236 #print "D: Force-loaded header / version information: '$Header'\n"; | |
237 | |
238 #Print the Header for the output line (we need this once, at the start) | |
239 print &printOUTPUTData ({"HEADER" => 1})."\n"; | |
240 | |
241 while (<>) | |
242 { | |
243 =head4 First check - should we be processing this record at all? | |
244 | |
245 =cut | |
246 $Record++; #Increment the record counter: | |
247 #Do we process this record - or all records? | |
248 if ($ProcessRecord != $Record && $ProcessRecord != 0) | |
249 { next; } #We need to increment the record counter before we increment | |
250 | |
251 =head4 Setup the output line storage and print the header: | |
252 | |
253 We enter this initially and work to change it: | |
254 | |
255 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND"; | |
256 | |
257 =cut | |
258 | |
259 my %OUTPUT_Data; #To collect data for the output line in | |
260 #Assume the first and work to find better: | |
261 $OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND"; | |
262 #The whole record - one per read - is now stored in $_ | |
263 my @Lines =split (/[\r\n]+/,$_); # split on windows/linux/mac new lines | |
264 | |
265 #If you are interested enable either of the next lines depending on how curious you are as to how the splitting went: | |
266 #print "D: Record #$Record\n"; print $_; print "\n---------\n"; | |
267 print "D: ''$Lines[0]'\nD: ...etc...'\nD: ############\n"; | |
268 | |
269 =head3 Get the ID | |
270 | |
271 Quite easy: the first field on the first line: | |
272 | |
273 Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1 | |
274 | |
275 =cut | |
276 | |
277 (my $ID) = $Lines[0]=~ m/^(\S+)/; | |
278 unless (defined $ID && $ID ne "") | |
279 { # So a near total failure...? | |
280 $OUTPUT_Data{"ID"} = "Unknown"; | |
281 print &printOUTPUTData (\%OUTPUT_Data)."\n"; | |
282 next; #No ID is terminal for this record | |
283 } | |
284 else | |
285 { | |
286 print "D: ID is: '$ID'\n"; | |
287 $OUTPUT_Data{"ID"} = $ID; | |
288 } | |
289 =head3 Declare the variables we will need here in the next few sections to store data | |
290 | |
291 =cut | |
292 | |
293 my $CurrentRegion; | |
294 my $RegionMarkup; | |
295 | |
296 #So we can sync the coordinated of the alignment up to the domains found: | |
297 my $Query_Start = -1; my $Query_End = -1; | |
298 | |
299 #Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop? | |
300 my $ThisQueryStart =-1; my $ThisQueryEnd =-1; #Think $ThisQueryEnd isn't used at the moment. | |
301 my $DNAQuerySequence =""; #The actual DNA Query sequence... | |
302 my $AAQuerySequence = ""; | |
303 | |
304 #As this changes with the alleles identified: | |
305 my $CurrentAASequence; | |
306 #The main storage variables | |
307 | |
308 my %Alginments; my %Alleles; | |
309 my %DomainBoundaries; | |
310 | |
311 =head2 Stanza 1: Get the general structure of the sequence identified | |
312 | |
313 =head3 Method 1: Use the table supplied | |
314 | |
315 Technically this valid for the top hit...realistically this is the only information we have reported to us | |
316 so we use this or nothing. This is fine for the top hit which is likely what we are interested in....but for the 2nd or 3rd? Who knows! | |
317 | |
318 Targets this block: | |
319 | |
320 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity) | |
321 FWR1 167 240 75 72 2 1 96 | |
322 CDR1 241 264 24 20 4 0 83.3 | |
323 FWR2 265 315 51 48 3 0 94.1 | |
324 CDR2 316 336 24 15 6 3 62.5 | |
325 FWR3 337 450 114 106 8 0 93 | |
326 CDR3 (V region only) 451 454 4 4 0 0 100 | |
327 Total N/A N/A 292 265 23 4 90.8 | |
328 | |
329 Then we split out the lines inside it in a second scanning step - less optimal but easier to read: | |
330 | |
331 FWR1 167 240 75 72 2 1 96 | |
332 CDR1 241 264 24 20 4 0 83.3 | |
333 FWR2 265 315 51 48 3 0 94.1 | |
334 CDR2 316 336 24 15 6 3 62.5 | |
335 FWR3 337 450 114 106 8 0 93 | |
336 CDR3 (V region only) 451 454 4 4 0 0 100 | |
337 | |
338 into: | |
339 | |
340 (Section, from, to, length, matches, mismatches, gaps, percent identity) | |
341 | |
342 =head3 Method 2: Use the table supplied | |
343 | |
344 The other way to do this is to split the graphical markup out of the alignment. | |
345 This works for _any_ reported alignment, not just the top hits: | |
346 | |
347 In the main alignment table processing section collect the information, collect the information: | |
348 | |
349 #Is region mark-up: | |
350 if ($#InfoColumns == -1 && $#AlignmentColumns ==0) | |
351 { | |
352 # print ": Region Markup detected\n"; | |
353 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record | |
354 next; | |
355 } | |
356 | |
357 Then afterwards when all the region was collected, process it like this: | |
358 #Pad the CDER3 region: | |
359 | |
360 #Remove the trailing spaces: | |
361 $RegionMarkup =~ s/ *$//g; | |
362 #Calculate the length of the CDR3 region so we can add it in: | |
363 my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1; | |
364 #Build up the CDR3 region, the 'x' operator is very helpful here (implict foreach loop): | |
365 $RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">"; | |
366 #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n"; | |
367 | |
368 #Now really process it: | |
369 my $C_Pos = 0; | |
370 my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup); # | |
371 foreach my $C_Domain (@Domains) | |
372 { | |
373 if (length ($C_Domain) <=0) {next;} | |
374 my $DomainStart= $C_Pos; | |
375 my $DomainEnd = $DomainStart + length ($C_Domain)-1; | |
376 my ($DomainType) = $C_Domain =~ m/(...[123])/; | |
377 # print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n"; | |
378 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart; | |
379 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd; | |
380 $C_Pos = $DomainEnd+1; | |
381 } | |
382 | |
383 The two pieces of code are interchangable; the table version as used below, is neater, easier to understand and works nicely. | |
384 Why stress? | |
385 | |
386 | |
387 =head3 The end of the FWR3 is the start of CDR3? | |
388 | |
389 This is an assumption made. Hence the two variables: | |
390 | |
391 my $MaxDomainReported =0 ; # In nts / bps | |
392 my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially. | |
393 | |
394 $MaxDomainBaseFound | |
395 | |
396 =cut | |
397 my $MaxDomainBaseFound =0 ; # In nts / bps | |
398 my $AlignmentStart ; # In nts /bp #Alternative name would be: '$MinDomainBaseFound'; set to null until primed | |
399 # my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially. | |
400 | |
401 (my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" ); | |
402 #Enable the next line if you want the raw data we are going to parse in this section: | |
403 #print Dumper @StructureSummaryTable; | |
404 foreach my $C_Section (@StructureSummaryTable) | |
405 { | |
406 my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section); | |
407 #print "D: Domain type: '$DomainType'\n"; | |
408 #$DomainType =~ s/ .*$//g; | |
409 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart; | |
410 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd; | |
411 | |
412 #So we can do a reality check on the length / start of the CDR3 if we have to go looking: | |
413 if ($MaxDomainBaseFound <= $DomainEnd) | |
414 { $MaxDomainBaseFound = $DomainEnd; } #Store the maximum base found | |
415 if ($AlignmentStart eq undef or $AlignmentStart >= $DomainStart) | |
416 { $AlignmentStart = $DomainStart; } | |
417 } | |
418 #print Dumper %DomainBoundaries; | |
419 #die "HIT BLOCK\n"; | |
420 | |
421 =head3 Did we find the CDR3 region specifically? | |
422 | |
423 If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up. | |
424 | |
425 =cut | |
426 print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n"; | |
427 | |
428 #my @WantedSections = qw (V D J); | |
429 | |
430 =head2 Second Stanza: Parse the main Alignment Table | |
431 | |
432 =head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels. | |
433 | |
434 As this is a little involved and comparamentalises nicely we sub-contract this to two functions"" | |
435 | |
436 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" ); | |
437 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter? | |
438 | |
439 =cut | |
440 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" ); | |
441 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter? | |
442 #If you are interested, enable this line: | |
443 # print "D: The info panel was detected at: '$splitPoint'\n"; | |
444 | |
445 =head3 | |
446 | |
447 =cut | |
448 | |
449 | |
450 foreach my $C_Line (0..$#Table) | |
451 { | |
452 | |
453 =head3 Call the line type we find: There are 4: | |
454 | |
455 These are distinguished by the number of fields (one or mores spacer is a field separator) in the Info & Alignment Panels (see values in brackets) | |
456 | |
457 | <- This split is ~40 chars. from the start of the line | |
458 * InfoPanel * | * Alignment Panel * | |
459 : is a "Blank" line (-1,-1) | |
460 <----FWR1--><----------CDR1--------><-----------------------FWR2------ : is "Region Markup" (-1,0) | |
461 W A A S G F T F N T Y A V H W V R Q A P G K G : is "AA Sequence" (-1, >=0) | |
462 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 : is "DNA Sequence" (2,1) | |
463 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 : is "" " | |
464 | |
465 So we split 40 chars in and then the two parts on spaces. | |
466 | |
467 | |
468 =cut | |
469 | |
470 # print "D: (sub) Line in parsed table: '$C_Line': \n"; | |
471 | |
472 my ($InfoPanel, $AlignmentPanel) = $Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/; | |
473 | |
474 my @InfoColumns = split (/\s+/,$InfoPanel); | |
475 my @AlignmentColumns = split (/\s+/,$AlignmentPanel); | |
476 | |
477 #If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first | |
478 # print "D: Line: $C_Line/t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n"; | |
479 # print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n"; | |
480 | |
481 #Populate this so we can step through it | |
482 | |
483 =head4 Is a blank line: | |
484 =cut | |
485 if ($#InfoColumns == -1 && $#AlignmentColumns == -1) | |
486 { | |
487 # print ": Blank\n"; | |
488 next; | |
489 } #For now I think we just skip - is not needed (though might be implict mark-up) | |
490 | |
491 =head4 Is region mark-up: | |
492 =cut | |
493 if ($#InfoColumns == -1 && $#AlignmentColumns ==0) | |
494 { | |
495 # print ": Region Markup detected\n"; | |
496 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record | |
497 next; | |
498 } | |
499 =head4 Is query DNA Sequence: | |
500 =cut | |
501 if ($#InfoColumns == 2 && $#AlignmentColumns ==1) | |
502 { | |
503 # print ": DNA Query Sequence\n"; | |
504 #Detect the two coordinatates of alignment against the query sequence: (last two numbers of the two 'panels') | |
505 ($ThisQueryStart) = $InfoPanel =~ / (\d+) *$/; | |
506 ($ThisQueryEnd) = $AlignmentPanel =~ / (\d+) *$/; | |
507 my ($ThisDNASeq) = $AlignmentPanel =~ /^(.*?) /; | |
508 #If you want to know what we just found: | |
509 #print "D: This DNA Sequence: '$ThisDNASeq'\n"; | |
510 $DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq; #Add it on to whatever we already have. | |
511 #Move the needle if there are smaller / greater; otherwise prime the 'needles': | |
512 if ($ThisQueryStart < $Query_Start or $Query_Start == -1) | |
513 { $Query_Start = $ThisQueryStart; } | |
514 if ($ThisQueryEnd > $Query_End or $Query_End == -1) | |
515 { $Query_End = $ThisQueryEnd; } | |
516 # print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n"; | |
517 next; | |
518 } | |
519 =head4 Is AA Sequence: | |
520 | |
521 This is complicated as it Need to decide whether this is the sequence of the read or that of the original V / D / J regions: | |
522 --------------> | |
523 A V Y Y C T R D M S P I M T T F A G N Y W G Q << Want this | |
524 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306 | |
525 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296 | |
526 A V Y Y C A R | |
527 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294 | |
528 | |
529 ...etc... | |
530 G T L V T V S S << Want this | |
531 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330 | |
532 | |
533 To solve this we peak at the next line that it has the tag "Query" in it (we assume the line exists...) | |
534 | |
535 =cut | |
536 | |
537 if ($#InfoColumns == -1 && $#AlignmentColumns >=-1) | |
538 { | |
539 unless ($Table[$C_Line+1] =~ /Query/) { next; } #Is the next line the DNA sequence ? | |
540 # | |
541 # print ": AA sequence\n"; | |
542 | |
543 | |
544 $CurrentAASequence = $AlignmentPanel; | |
545 #print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n"; | |
546 $CurrentAASequence =~ s/^ {$PanelSplitPoint}//; | |
547 #print "D: '$AAQuerySequence'\n"; | |
548 # print "D: Current AA Sequence: \t'$CurrentAASequence'\n"; | |
549 $AAQuerySequence = $AAQuerySequence.$CurrentAASequence; #Store the elongating AA Sequence as well | |
550 next; | |
551 } | |
552 =head4 Is Alignment: | |
553 =cut | |
554 if ($#InfoColumns == 4 && $#AlignmentColumns ==1) | |
555 { | |
556 #Not acutally interesting to us for this version of the parser. Delete ultimately? | |
557 next; | |
558 } | |
559 | |
560 #Is weird! Don't recognise it! | |
561 | |
562 warn "Weird! Don't recongnise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",$Lines[$C_Line],15,"...'\n"; | |
563 } #End main iteration loop for alignment parsing. | |
564 | |
565 | |
566 =head2 The CDR3 is noted as problematic. Can we identify it? | |
567 | |
568 =cut | |
569 print "D: Starting Search for CDR3\n"; | |
570 #Do have the end of the FWR3 but not the CDR3? If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point | |
571 if (exists ($DomainBoundaries{"FWR3"}{"End"}) | |
572 && $AlignmentStart !=0 | |
573 && not (exists $DomainBoundaries{"CDR3"}{"End"}) ) #Guess we need to go looking for the end then... | |
574 { | |
575 #print "D: Placing call to markUpCDR3\n"; | |
576 my ($CDR3_Start, my $CDR3_End, my $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence, | |
577 $DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart, | |
578 $DNACDR3_Pat, $AACDR3_Pat); | |
579 if ($CDR3_Start !=0 && $CDR3_End !=0) | |
580 { | |
581 $DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start; | |
582 $DomainBoundaries{"CDR3"}{"End"} = $CDR3_End ; | |
583 $DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag; | |
584 print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n"; | |
585 } | |
586 else | |
587 { print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n"; | |
588 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND"; | |
589 } | |
590 } | |
591 else | |
592 { #Was reported by igBLAST | |
593 print "D: Found the FWR3 from the Domain Boundary Table\n"; | |
594 $DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE"; | |
595 } | |
596 | |
597 #print Dumper %DomainBoundaries; | |
598 | |
599 =head2 Get the top VDJ regions: | |
600 | |
601 =cut | |
602 | |
603 =head2 Extract General Features: | |
604 | |
605 =cut | |
606 (my $TopHit) = $_ =~ m/V-J Frame, Strand\):\n(.*?)\n/s; | |
607 print "D: Top Hits (raw)= '$TopHit' \n"; | |
608 my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit); | |
609 print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n"; | |
610 | |
611 =head2 Store the V / D / J Genes used | |
612 | |
613 =cut | |
614 | |
615 if (defined $Top_V_gene_match && $Top_V_gene_match ne "") | |
616 { $OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match; } | |
617 | |
618 if (defined $Top_D_gene_match && $Top_D_gene_match ne "") | |
619 { $OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match; } | |
620 | |
621 if (defined $Top_J_gene_match && $Top_J_gene_match ne "") | |
622 { $OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match; } | |
623 | |
624 if (defined $Strand && $Strand ne "") | |
625 { $OUTPUT_Data{"Strand"} = $Strand;} | |
626 | |
627 =head4 Preamble: ID, Frame, and V / D / J used: | |
628 | |
629 =cut | |
630 #Do a reality check: if we didn't get an ID, then skip: | |
631 unless (defined (defined $ID) && $ID ne "" && | |
632 defined $VJFrame && $VJFrame ne "") | |
633 { | |
634 print &printOUTPUTData (\%OUTPUT_Data)."\n"; | |
635 next; | |
636 } | |
637 | |
638 #Ok, so we have data...most likely: | |
639 #print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match); | |
640 | |
641 if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "") | |
642 { $OUTPUT_Data{"VDJ Frame"} = $VJFrame;} | |
643 else | |
644 { | |
645 print &printOUTPUTData (\%OUTPUT_Data)."\n"; | |
646 next; | |
647 }#REALLY? We didn't find anything? Oh well, move to next record | |
648 | |
649 =head4 CDR1 | |
650 | |
651 =cut | |
652 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we substract this off all future AA (& DNA coordinates) | |
653 | |
654 my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"}; | |
655 | |
656 # print "D: AA Seqeunce is: '$AAQuerySequence'\n"; | |
657 if (exists $DomainBoundaries{"CDR1"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start | |
658 { | |
659 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience.... | |
660 my $CDR1Start = $DomainBoundaries{"CDR1"}{"Start"}; | |
661 my $CDR1End = $DomainBoundaries{"CDR1"}{"End"}; | |
662 my $CDR1_Length = $CDR1End - $CDR1Start; | |
663 # print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n"; | |
664 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read | |
665 my $CDR1_Seq_AA = substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length); | |
666 # print "D: '$CDR1_Seq_AA'\n"; | |
667 $CDR1_Seq_AA =~ s/ //g; | |
668 my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA); | |
669 #Add this data to the output store specifically: | |
670 $OUTPUT_Data{"CDR1 Seq"} = $CDR1_Seq_AA; | |
671 $OUTPUT_Data{"CDR1 Length"} = $CDR1_Length; | |
672 } | |
673 #What happens if there is no CDR1 found? Leave blank - the output routine can handle this | |
674 | |
675 =head4 CDR2 | |
676 | |
677 =cut | |
678 | |
679 if (exists $DomainBoundaries{"CDR2"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start | |
680 { | |
681 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience.... | |
682 my $CDR2Start = $DomainBoundaries{"CDR2"}{"Start"}; | |
683 my $CDR2End = $DomainBoundaries{"CDR2"}{"End"}; | |
684 my $CDR2_Length = $CDR2End - $CDR2Start; | |
685 my $CDR2_Seq_AA = substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length); | |
686 $CDR2_Seq_AA =~ s/ //g; | |
687 my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA); | |
688 #Add this data to the output store specifically: | |
689 $OUTPUT_Data{"CDR2 Seq"} = $CDR2_Seq_AA; | |
690 $OUTPUT_Data{"CDR2 Length"} = $CDR2_Length; | |
691 } | |
692 #What happens if there is no CDR2 found? Leave blank - the output routine can handle this. | |
693 | |
694 =head4 CDR3 | |
695 | |
696 =cut | |
697 if (exists $DomainBoundaries{"CDR3"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start | |
698 { | |
699 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience.... | |
700 my $CDR3Start = $DomainBoundaries{"CDR3"}{"Start"}; | |
701 my $CDR3End = $DomainBoundaries{"CDR3"}{"End"}; | |
702 my $CDR3_Length = $CDR3End - $CDR3Start; # This variable isn't used - delete it when safe to do so | |
703 my $CDR3_Seq_AA = substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length); | |
704 my $CDR3_Seq_DNA = substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length); | |
705 $CDR3_Seq_AA =~ s/ //g; | |
706 $CDR3_Seq_DNA =~ s/ //g; | |
707 my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA); | |
708 my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA); | |
709 #Add this data to the output store specifically: | |
710 $OUTPUT_Data{"CDR3 Seq"} = $CDR3_Seq_AA; | |
711 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length; | |
712 $OUTPUT_Data{"CDR3 Seq DNA"} = $CDR3_Seq_DNA; | |
713 $OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length; | |
714 #And in the case of the CDR3 how we found it: | |
715 $OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"}; | |
716 } | |
717 #What happens if there is no CDR3 found? Leave blank - the output routine can handle this. | |
718 #die "HIT BLOCK\n"; | |
719 #End of the record; output the data we have collected and move on. | |
720 print &printOUTPUTData (\%OUTPUT_Data)."\n"; | |
721 } | |
722 | |
723 | |
724 | |
725 ############ | |
726 sub returnLinesBetween { | |
727 =head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section}) | |
728 | |
729 When passed a reference to an array and two strings - interpreted as REGEX's - will return the lines of the Array | |
730 that are bounded by these tags. | |
731 | |
732 If either of the tags are not found - or are found in the wrong order - then a null list is returned. | |
733 | |
734 =cut | |
735 | |
736 my ($Text_ref, $TopTag, $BotTag) = @_; | |
737 | |
738 my @Table; | |
739 #The two boundary conditions at which we will cut the table: | |
740 #print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n"; | |
741 #How we record these: | |
742 my $AlignmentLine_Top=0; my $AlignmentLine_Bot=0; | |
743 | |
744 my $LineIndex=-1; #-1 As the loop increments this line counter first, then does its checks. | |
745 #If you care: | |
746 #print "D: Lines of text passed: $$#Lines\n"; | |
747 | |
748 #Iterate through until we find what we are looking for or run out of text to search: | |
749 while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex <=$#{$Text_ref}) | |
750 { | |
751 $LineIndex++; | |
752 #Enable if you need to care: | |
753 # print "D: Line Index = $LineIndex\n"; | |
754 | |
755 if ($$Text_ref[$LineIndex] =~ m/$TopTag/) | |
756 { | |
757 $AlignmentLine_Top = $LineIndex; | |
758 # print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested | |
759 } | |
760 if ($$Text_ref[$LineIndex] =~ m/$BotTag/) | |
761 { | |
762 $AlignmentLine_Bot = $LineIndex; | |
763 # print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested | |
764 } | |
765 } | |
766 #Reality check: did we find anything? If not then we return null. | |
767 if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0) | |
768 { return; } | |
769 #Again, enable if you care: | |
770 #print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n"; | |
771 | |
772 #We want the lines one down and one up - so polish these. | |
773 $AlignmentLine_Top++; $AlignmentLine_Bot--; | |
774 | |
775 #Return as an array slice: | |
776 return (@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]); | |
777 } | |
778 ############ | |
779 | |
780 sub findSplitPoint | |
781 { | |
782 =head2 sub: $PanelBoundaryCahracter = findSplitPoint (\@Table) | |
783 | |
784 When passed a table with the alignment in it makes an educated guess as to the precise split point to | |
785 spearate the 'info' and 'alignment' panels. | |
786 This is a right olde faff because the field / panel boundaries change. | |
787 | |
788 ' Query_6 167 GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG 235' | |
789 ' Query_6 236 CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG 305' | |
790 ' Query_6 306 GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC 372' | |
791 ' Query_6 373 TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT 442' | |
792 ' Query_6 443 ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC 512' | |
793 MOTIF_FOUND_IN_AA | |
794 i.e to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alingment proper) and have 4 fields in total | |
795 | |
796 Returns either -1 or the location of the panel boundary, issues a warning and returns -1 if is the most frequent boundary | |
797 because the pattern match has been failing more often that it suceeded. | |
798 | |
799 =cut | |
800 #A rough guess is 38 for normal sequences, 48 for reversed ones: | |
801 | |
802 my $SplitPos = 0; | |
803 | |
804 (my $Table_ref) = @_; #Get the reference to the table | |
805 my @DNALines; #We populate this for mining in the next section | |
806 foreach my $C_Line (@{$Table_ref}) | |
807 { | |
808 #print "D: $C_Line\n"; | |
809 # (my $SplitLine) = $C_Line; | |
810 #Split on consecutive tabs or spaces: | |
811 my @LineFields = split (/[\t\s]+/,$C_Line); | |
812 #print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n"; | |
813 unless ( $LineFields[3] =~ m/[^\.]/ | |
814 && $LineFields[3] =~ m/[ATGCX]{20,}/ | |
815 && $#LineFields==4) | |
816 { next; } | |
817 #Enable if you want to know the lines we think are the DNA Query strings: | |
818 #print "D: DNA Line: '$C_Line'\n"; | |
819 push @DNALines, $C_Line; #Note it down | |
820 } | |
821 | |
822 my %PanelBounds; #Will contain the positions of the panel boundaries | |
823 | |
824 foreach my $C_DNALine (@DNALines) | |
825 { | |
826 #print "D: '$C_DNALine'\n"; | |
827 $C_DNALine =~ m/[ATGC-]+ \d+$/; #Match the DNA string and the indexingMOTIF_FOUND_IN_AA numbers afterwards, allow gap characters. | |
828 my $MatchPos = $-[0]; #This is the position of the start of the last match because we can't get the index() function to work | |
829 #(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0); | |
830 #print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n"; | |
831 $PanelBounds{$MatchPos}++; | |
832 } | |
833 #Sort the hash values in order and then return the most frequent (will offer some resistance to the occasion pattern failure) | |
834 #The brackets around "($SplitPos)" are really necessary it seems. | |
835 ($SplitPos) = (sort { $a <=> $b } keys %PanelBounds); | |
836 #If you want | |
837 #print Dumper %PanelBounds; | |
838 #Tell people if we are having difficultlty: | |
839 if ($SplitPos == -1) { warn "Couldn't identify the panel boundaries\n"; } | |
840 #print "D: $SplitPos: Returning the split position of: '$SplitPos'\n"; | |
841 return $SplitPos; | |
842 } | |
843 | |
844 | |
845 ## | |
846 # | |
847 # | |
848 ### | |
849 | |
850 | |
851 | |
852 | |
853 | |
854 ##### | |
855 # | |
856 # | |
857 ##### | |
858 sub markUpCDR3 | |
859 { | |
860 =head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex) | |
861 | |
862 Tries to identify the end of the CDR3 using the DNA and RNA Sequence patterns MOTIF_FOUND_IN_AAsupplied. The CDR3 is assumed to start | |
863 at the end of the FWR3. | |
864 To reduce FP matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern. | |
865 The position of the first matching pattern is reported. | |
866 | |
867 =head4 Fuller Usage: | |
868 | |
869 my ($CDR3_Start, my $CDR3_End) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence, | |
870 $DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"}, | |
871 $DNACDR3_Pat, $AACDR3_Pat); | |
872 | |
873 | |
874 | |
875 =head4 Returned Values | |
876 | |
877 If the CDR3 was found then we we signal like this: | |
878 | |
879 $MotifFound ==0 : Nope, didn't find either motif | |
880 $MotifFound ==1 : Found at the DNA level, not the AA level | |
881 $MotifFound ==2 : Found at the the AA level, not the DNA level | |
882 $MotifFound ==3 : Found at the the AA level & the DNA level | |
883 | |
884 (Also remember that if the FWR3 region couldn't be identified in the sequence there is a 4th option: not tested; this routine isn't called therefore) | |
885 | |
886 The Start and Ends returned are from the first sucessful match (MotifFound==3): though hopefully they are the same. | |
887 Formally the test order is: | |
888 | |
889 1) DNA | |
890 2) AA | |
891 | |
892 i.e. DNA bp locations have priority. | |
893 | |
894 Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match). | |
895 See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation. | |
896 | |
897 =head3 Manipulation of AA patternsMOTIF_FOUND_IN_AA | |
898 | |
899 Note that patterns are assumed to require white space inserting in them between the letters. | |
900 This could be a serious limitation | |
901 | |
902 | |
903 =cut | |
904 | |
905 #Get the parameters passed: | |
906 my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat) = @_; | |
907 print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n"; | |
908 | |
909 | |
910 #Setup our return values: | |
911 my $Start = 0; my $End =0; my $MotifFound = 0; | |
912 my $How; #Literally How the motif was found (or not if blank) | |
913 | |
914 | |
915 =head4 Prepare the sequences and the patterns for use | |
916 | |
917 Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs | |
918 | |
919 Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working. | |
920 | |
921 | |
922 We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on | |
923 in position calculations. | |
924 | |
925 | |
926 =cut | |
927 | |
928 #Because igBLAST doesn't always report from the start of the read (primers and things are upstream): | |
929 | |
930 my $CropPoint = $FWR3_End - $FWR1_Start - 1 ; | |
931 #print "D: markUpCDR3: Crop point is: '$CropPoint'\n"; | |
932 | |
933 #print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n"; | |
934 #We trim off the parts we expect to find the CDR3 motifs in leaving at extra 3nts on to allow for base miss-calling: | |
935 | |
936 my $AA_Trimmed = substr ($AA, $CropPoint); | |
937 my $DNA_Trimmed = substr ($DNA ,$CropPoint); | |
938 #print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n"; | |
939 #print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n"; | |
940 | |
941 #This lovely hack is to account for the spaces in the AA sequence and we can't get the "$Var =~ m/$AAPat/x" working | |
942 my $AAPat_Spaced; | |
943 foreach my $C_Char (0..length($AAPat)-1) #The -1 is because we don't want trailing spaces until the next nt -> AA translation. | |
944 { $AAPat_Spaced = $AAPat_Spaced.'\s+'.substr ($AAPat,$C_Char,1); } | |
945 #And write this back into the main pattern we were passed: | |
946 $AAPat = $AAPat_Spaced; | |
947 | |
948 #temp hack: | |
949 #$AA_Trimmed = $AA; | |
950 my $MotifFound=0; #So we can record which patterns we found | |
951 my $MotifPositionDNA =-1; | |
952 my $MotifPositionAA =-1; | |
953 | |
954 #print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n"; | |
955 =head4 At DNA level: "TGG GGx xxx GGx" [+1] | |
956 | |
957 =cut | |
958 | |
959 #print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n"; | |
960 | |
961 if ($DNA_Trimmed =~ m/$DNAPat/) | |
962 { | |
963 $MotifPositionDNA = $+[0]; #Just the easiest way to do this in Perl | |
964 # print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n"; | |
965 $MotifFound = $MotifFound + 1; | |
966 #Any more matches further on? | |
967 my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA); | |
968 # print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n"; | |
969 # print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n"; | |
970 # print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n"; | |
971 # print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n"; | |
972 if ($LaterString =~ m/$DNAPat/) | |
973 { print "D: markUPCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+ [0], " - which might be worrying\n"; } | |
974 } | |
975 | |
976 =head4 At AA level: "WGxG" [+2] | |
977 | |
978 =cut | |
979 | |
980 if ($AA_Trimmed=~ m/$AAPat/) | |
981 { | |
982 $MotifPositionAA = $+[0]; #Just the easiest way to do this in Perl | |
983 $MotifFound = $MotifFound + 2; | |
984 # print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n"; | |
985 (my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA); | |
986 # print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n"; | |
987 | |
988 } | |
989 | |
990 =head4 Assess the results of motif position finding | |
991 | |
992 =cut | |
993 | |
994 #print "D: markUpCDR3: MotifFound = '$MotifFound'\n"; | |
995 | |
996 if ($MotifFound ==0) | |
997 { return ($Start, $End, $MotifFound); } #The easy one really: return we didn't find the CDR3 | |
998 | |
999 # | |
1000 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3: | |
1001 #Just found in DNA: | |
1002 if ($MotifFound ==1) | |
1003 { | |
1004 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3: | |
1005 $End = $MotifPositionDNA; | |
1006 $How = "MOTIF_FOUND_IN_DNA"; | |
1007 } | |
1008 #Just found in AA: | |
1009 if ($MotifFound ==2) | |
1010 { | |
1011 $End = $MotifPositionAA; | |
1012 $How = "MOTIF_FOUND_IN_AA"; | |
1013 } | |
1014 | |
1015 #Found in both, DNA has priority: | |
1016 if ($MotifFound ==3) | |
1017 { | |
1018 $Start = $FWR3_End ; #We assume the end of the FWR3 is the start of CDR3: | |
1019 $End = $MotifPositionDNA; | |
1020 $How = "MOTIF_FOUND_IN_BOTH"; | |
1021 } | |
1022 | |
1023 #print "D: markUpCDR3: Motif found = $MotifFound\n"; | |
1024 | |
1025 =head4 These next few lines are for testing / diagnostics only - disable for general use | |
1026 | |
1027 If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that | |
1028 the start of FWR1 is unlikely to be at nt 1. | |
1029 | |
1030 =cut | |
1031 | |
1032 $Start = $FWR3_End - $FWR1_Start -1; | |
1033 $End = $End + $CropPoint; | |
1034 my $CDR3_RegionLength = $End - $Start; | |
1035 #print "D: markUpCDR3: CDR3 Length= $Start - $End = '$CDR3_RegionLength'\n"; | |
1036 (my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength); | |
1037 | |
1038 #Add onto the coordinates what we trimmed off: | |
1039 | |
1040 | |
1041 #print "D: markUpCDR3: Seq ='$CDR3_seq'\n"; | |
1042 | |
1043 print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of :'+ $FWR1_Start'\n"; | |
1044 #die "HIT BLOCK\n"; | |
1045 return ($Start + $FWR1_Start, $End + $FWR1_Start, $How); | |
1046 } | |
1047 | |
1048 | |
1049 sub printOUTPUTData { | |
1050 =head2 sub: $OutputDataString = printOUTPUTData {\%OutputData} | |
1051 | |
1052 When passed an array containing the appropriate CDR, Top V / D/ J genes and the seqeunce ID. | |
1053 This prepared and then returned as a text string that can then be printed to STDOUT: | |
1054 | |
1055 print (printOUTPUTData (\%OutputData)); | |
1056 | |
1057 Any missing data in the Hash array it polietly ignored and a null string printed in place. | |
1058 The text field is tab delimited; there are no extra trailing tabs or carriage returns in place. | |
1059 | |
1060 Actually the fields printed out are stored in an index array. | |
1061 | |
1062 =head3 Header output | |
1063 | |
1064 If the routine is passed a key 'HEADER' then the header columns are returned as that string. | |
1065 This is tested first - so don't add this unless you mean to. | |
1066 | |
1067 =cut | |
1068 | |
1069 my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene", | |
1070 "CDR1 Seq", "CDR1 Length", | |
1071 "CDR2 Seq", "CDR2 Length", | |
1072 "CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand", | |
1073 "CDR3 Found How"); | |
1074 | |
1075 my $OutputString = "OUTPUT:"; #What we are going to build the output into. | |
1076 | |
1077 =head4 Print Header & Exit? | |
1078 | |
1079 =cut | |
1080 | |
1081 my ($Data_ref) = @_; | |
1082 #print "D: printOUTPUTData: Running\n"; | |
1083 | |
1084 if (exists $$Data_ref {"HEADER"}) | |
1085 { | |
1086 $OutputString .= "\t"; | |
1087 for(my $n = 0; $n <= $#HeaderFields; $n++) | |
1088 { | |
1089 $OutputString .= $HeaderFields[$n]; | |
1090 $OutputString .= "\t" if($n < $#HeaderFields); | |
1091 } | |
1092 | |
1093 # foreach my $C_Header (@HeaderFields) | |
1094 # { $OutputString .= "$C_Header"; } # | |
1095 | |
1096 print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n"; | |
1097 return ($OutputString); | |
1098 } | |
1099 | |
1100 =head3 Assemble whatever data we have - and tab delimit the null fields | |
1101 | |
1102 =cut | |
1103 #print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref; | |
1104 foreach my $C_Header (@HeaderFields) | |
1105 { | |
1106 | |
1107 if (exists ($$Data_ref {$C_Header})) | |
1108 { $OutputString .= "\t". $$Data_ref{$C_Header}; } #We have data to print out | |
1109 else | |
1110 { $OutputString .="\t"; } #Add a trailing space | |
1111 } # | |
1112 | |
1113 return ($OutputString); | |
1114 } | |
1115 | |
1116 | |
1117 ######################################### Code Junk ######################## | |
1118 | |
1119 | |
1120 =head2 Code Junk Attic | |
1121 | |
1122 =head3 Demonstrates how to reverse translate an amino acid sequence into DNA: | |
1123 | |
1124 use Bio::Tools::CodonTable; | |
1125 use Bio::Seq; | |
1126 | |
1127 # print possible codon tables | |
1128 my $tables = Bio::Tools::CodonTable->tables; | |
1129 while ( (my $id, my $name) = each %{$tables} ) { | |
1130 print "$id = $name\n"; | |
1131 } | |
1132 my $CodonTable = Bio::Tools::CodonTable->new(); | |
1133 | |
1134 my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n"; | |
1135 | |
1136 | |
1137 my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq); | |
1138 print "D: '$rvSeq'\n"; | |
1139 die "TEST OVER\n"; | |
1140 | |
1141 =cut | |
1142 | |
1143 | |
1144 =head3 For processing the 'Alignment lines' section of the alginment table | |
1145 | |
1146 #If we are ever interested; then enable the code below: | |
1147 # print ": Alignment\n"; | |
1148 # $InfoPanel =~ s/^ +//; $InfoPanel =~ s/ +$//; #Clean off trailing spaces | |
1149 # my ($Germclass, $PID, $PID_Counts, $Allele) = split (/\s+/,$InfoPanel); #Split on spaces | |
1150 ##Enable if you need to know what we just found: | |
1151 # #print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n"; | |
1152 # #A reality check: we should have an Allele - or some text here. | |
1153 # unless (defined $Allele && $Allele ne "") | |
1154 # { warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n"; } | |
1155 # if (exists ($Alginments {$Germclass}{$Allele})) | |
1156 # { $Alginments {$Germclass}{$Allele} = $Alginments {$Germclass}{$Allele}.$CurrentAASequence; } #Carry on adding | |
1157 # else #more work needed as we need to 'pad' the sequence with fake gap characters) | |
1158 # { | |
1159 ##Do we still need this padding? I don't think so | |
1160 # | |
1161 # | |
1162 # my $PaddingChars = ($ThisQueryStart-$Query_Start); | |
1163 # print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n"; | |
1164 # #To help testing, calculate this first: | |
1165 # my $PaddingString = " "x $PaddingChars; | |
1166 # $Alginments {$Germclass}{$Allele} = $CurrentAASequence; | |
1167 # } | |
1168 # next | |
1169 | |
1170 =head3 Demonstration of Pattern match positions | |
1171 | |
1172 my $Text = "12345TTT TTAAAAA"; | |
1173 my $TestPat = "TTT\\s+TT"; | |
1174 (my $Result)= $Text =~ m/$TestPat/; | |
1175 print "D: Two vars are: - = ",$-[0], " & + =", $+[0]," for test pattern '$TestPat'\n"; | |
1176 | |
1177 sub printCDR3 { | |
1178 | |
1179 =head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence); | |
1180 | |
1181 ???? IS THIS FUNCTION IN USE ????? | |
1182 | |
1183 Handles the printing of the output when passed information about the CDR3 region. | |
1184 | |
1185 | |
1186 The result is sent returned as a text string in this version hence use it like this if you want to send it to STDOUT: | |
1187 | |
1188 print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n"; | |
1189 | |
1190 #=cut | |
1191 | |
1192 #Despite the similarity in names, these are all local copies passed to us: | |
1193 | |
1194 my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_; | |
1195 | |
1196 #For DNA: | |
1197 my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End); | |
1198 my ($CDR_DNA_Length) = length ($CDR_DNA_Seq); | |
1199 | |
1200 #For AA: | |
1201 my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End); | |
1202 my ($CDR_AA_Length) = length ($CDR_AA_Seq); | |
1203 | |
1204 my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed: | |
1205 print "D: SUB: printCDR3: As returned: '$ReturnString'\n"; | |
1206 return ($ReturnString); | |
1207 | |
1208 } | |
1209 | |
1210 =cut | |
1211 | |
1212 | |
1213 | |
1214 =head2 Change Log | |
1215 | |
1216 =head3 Version 1.2 | |
1217 | |
1218 1) Fixed the 'Process recrod request' feature' [was failed increment in $Record] | |
1219 2) Deleted / Deactivated the function 'printCDR3' [wasn't in used; kept if useful for parts]. | |
1220 This function is replaced by the more general printOUTPUTData() | |
1221 3) A tag for the CDR3 status is now output for every record / read. | |
1222 Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found. | |
1223 | |
1224 =head4 Version 1.3 | |
1225 | |
1226 1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”, | |
1227 which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly. | |
1228 - found by Bas Horsman. | |
1229 | |
1230 =head4 Version 1.3a | |
1231 | |
1232 1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to addition error to the $MotifFound var (never could == 3) | |
1233 | |
1234 =cut | |
1235 | |
1236 =head4 Version 1.4 | |
1237 | |
1238 1) Now processes files using Mac/Unix/MS-DOS newline characters: | |
1239 | |
1240 $_ =~ s/\r\n/\n/g; #In case line ends are MS-DOS | |
1241 $_ =~ s/\r/\n/g; #In case line ends are Mac | |
1242 #The whole record - one per read - is now stored in $_ | |
1243 my @Lines =split (/\R/,$_); #Split on new lines | |
1244 | |
1245 =head4 Version 1.4a | |
1246 | |
1247 1) Fixed the length of the CDR3 AA string being reported correctly: | |
1248 | |
1249 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length; | |
1250 to: | |
1251 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length; | |
1252 |