comparison igblastparser/igparse.pl @ 0:afe85eb6572e draft

Uploaded
author davidvanzessen
date Mon, 29 Aug 2016 05:41:20 -0400
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:afe85eb6572e
1 #!/usr/bin/perl
2 =head1 IGBLAST_simple.pl
3
4 This version (1.4) has been heavily adapted since the original program was first created back in October 2012.
5 Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed with minor - though important - code changes.
6
7 From V 1.2 onwards a 'Change Log' is included at the end of the program
8
9 =head2 Usage
10
11 Requires no modules in general use; the Data::Dumper (supplied as part of the Perl Core module set) might be useful for debugging/adjustment
12 as it allows inspection of the data stores.
13
14 The program takes a text file of the
15
16 ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process->
17
18 Supply the text version of the igBLAST report in the format as in the example below.
19 The extra command line arugment is the record number (aka. BLAST report) to process.
20 If 0 or absent all are processed, if supplied that record (base 1) is processed and the program dies afterwards.
21
22 =head2 Example Input
23
24 A standard igBLAST record or set of them in a file; this being typical:
25
26 BLASTN 2.2.27+
27
28
29 Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
30 Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
31 Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
32 protein database search programs", Nucleic Acids Res. 25:3389-3402.
33
34
35
36 Database: human_gl_V; human_gl_D; human_gl_J
37 674 sequences; 179,480 total letters
38
39
40
41 Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1
42 run=R_2012_04_10_11_57_56_
43
44 Length=433
45 Score E
46 Sequences producing significant alignments: (Bits) Value
47
48 lcl|IGHV3-30*04 330 2e-92
49 lcl|IGHV3-30-3*01 330 2e-92
50 lcl|IGHV3-30*01 327 2e-91
51 lcl|IGHD3-16*01 14.4 11
52 lcl|IGHD3-16*02 14.4 11
53 lcl|IGHD1-14*01 12.4 43
54 lcl|IGHJ4*02 78.3 1e-18
55 lcl|IGHJ5*02 70.3 4e-16
56 lcl|IGHJ4*01 68.3 2e-15
57
58
59 Domain classification requested: imgt
60
61
62 V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand):
63 IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +
64
65 V(D)J junction details (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself
66 AGAGA TATGAGCCCCATCATGACA ACGTTTG CCGGAA ACTAC
67
68 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
69 FWR1 27 38 12 11 1 0 91.7
70 CDR1 39 62 24 22 2 0 91.7
71 FWR2 63 113 51 50 1 0 98
72 CDR2 114 137 24 23 1 0 95.8
73 FWR3 138 251 114 109 5 0 95.6
74 CDR3 (V region only) 252 259 8 7 1 0 87.5
75 Total N/A N/A 233 222 11 0 95.3
76
77
78 Alignments
79
80 <----FWR1--><----------CDR1--------><-----------------------FWR2------
81 W A A S G F T F N T Y A V H W V R Q A P G K G
82 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96
83 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133
84 C A A S G F T F S S Y A M H W V R Q A P G K G
85 V 95.7% (221/231) IGHV3-30-3*01 64 ..T......................G..G.......A................................. 133
86 V 94.8% (221/233) IGHV3-30*01 64 ..T......................G..G.......A................................. 133
87
88 ----------------><----------CDR2--------><----------------------------
89 L E W V A V I S Y D G S N K N Y A D S V K G R F
90 Query_1 97 TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT 166
91 V 95.3% (222/233) IGHV3-30*04 134 ..................................T......T............................ 203
92 L E W V A V I S Y D G S N K Y Y A D S V K G R F
93 V 95.7% (221/231) IGHV3-30-3*01 134 .........................................T............................ 203
94 V 94.8% (221/233) IGHV3-30*01 134 .A................................T......T............................ 203
95
96 ---------------------------FWR3---------------------------------------
97 T I S R D N S K N T L Y L Q M N S L R V E D T
98 Query_1 167 CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG 236
99 V 95.3% (222/233) IGHV3-30*04 204 ...............................C.G.........................C.......... 273
100 T I S R D N S K N T L Y L Q M N S L R A E D T
101 V 95.7% (221/231) IGHV3-30-3*01 204 ...............................C.G.........................C.......... 273
102 V 94.8% (221/233) IGHV3-30*01 204 ...............................C.G.........................C.......... 273
103
104 -------------->
105 A V Y Y C T R D M S P I M T T F A G N Y W G Q
106 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306
107 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296
108 A V Y Y C A R
109 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294
110 V 94.8% (221/233) IGHV3-30*01 274 .....G.........G.......----------------------------------------------- 296
111 D 100.0% (7/7) IGHD3-16*01 12 ------------------------------------------.......--------------------- 18
112 D 100.0% (7/7) IGHD3-16*02 12 ------------------------------------------.......--------------------- 18
113 D 100.0% (6/6) IGHD1-14*01 8 -------------------------------------------------......--------------- 13
114 J 100.0% (39/39) IGHJ4*02 10 -------------------------------------------------------............... 24
115 J 100.0% (35/35) IGHJ5*02 17 -----------------------------------------------------------........... 27
116 J 97.4% (38/39) IGHJ4*01 10 -------------------------------------------------------.............A. 24
117
118
119 G T L V T V S S
120 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330
121 J 100.0% (39/39) IGHJ4*02 25 ........................ 48
122 J 100.0% (35/35) IGHJ5*02 28 ........................ 51
123 J 97.4% (38/39) IGHJ4*01 25 ........................ 48
124
125
126 Lambda K H
127 1.10 0.333 0.549
128
129 Gapped
130 Lambda K H
131 1.08 0.280 0.540
132
133 Effective search space used: 64847385
134
135
136 Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1
137 run=R_2012_04_10_11_57_56_
138
139
140 ...etc...
141
142 =head2 Example Output
143
144
145 Example output from the data above sent:
146 $ ./IGBLAST_simple.pl igBLASTOutput.txt 1
147 D: Request to process just record '1' received
148 D: printOUTPUTData: Running
149 D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How'
150 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
151 D: ID is: 'HL67IUI01D26LR'
152 D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259)
153 D: Starting Search for CDR3
154 D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence)
155 D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27'
156 D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309)
157 D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +'
158 D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +'
159 D: printOUTPUTData: Running
160 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
161
162 =head4 Usage notes:
163
164 Designed to be easy to "grep -v D:" or "grep OUTPUT:" for to select the parts you need:
165
166 ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT:
167
168 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
169 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
170 OUTPUT: HL67IUI01EQMLY In-frame IGHV4-39*01 IGHD2-8*01 IGHJ3*02 GGSISSSSYY 29 IYHSGST 20 CARDATYYSNGFDIWGQG 53 MOTIF_FOUND_IN_BOTH
171 OUTPUT: HL67IUI01CDCLP Out-of-frame IGHV3-23*01 IGHD3-3*01 IGHJ4*02 FSNYAM 16 SGSGDRTY 23 AKAD*FLEWLFRIGDGERLLGPGN 72 MOTIF_FOUND_IN_DNA
172 OUTPUT: HL67IUI01AHRNH N/A IGHV3-33*01 N/A N/A WIHLQ*LW 23 YGMMEVI 23 NOT_FOUND
173 OUTPUT: HL67IUI01DZZ1V Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASG 20 LYCASEGDIVASELLSTGARV 62 MOTIF_FOUND_IN_DNA
174 OUTPUT: HL67IUI01DTR2Y Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*VLGPG 59 MOTIF_FOUND_IN_DNA
175 OUTPUT: HL67IUI01EQL3S In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH
176 OUTPUT: HL67IUI01AFG46 In-frame IGLV2-34*01 N/A IGHJ4*02 NOT_FOUND
177 OUTPUT: HL67IUI01EFFKO In-frame IGHV3-11*01 IGHD6-6*01 IGHJ4*02 GFTFSDYY 23 ISYSGGTI 23 CARASGAARHRPLDYWGQG 56 MOTIF_FOUND_IN_BOTH
178 OUTPUT: HL67IUI01B18SG In-frame IGHV3-33*01 IGHD5-12*01 IGHJ4*02 VRQA 11 KYYANSVK 23 RLGGFDYWGQGTLVTVSS 53 MOTIF_FOUND_IN_BOTH
179 OUTPUT: HL67IUI01D6LER In-frame IGHV1-24*01 IGHD3-22*01 IGHJ4*02 GYSLNELS 23 PDPEDDE 23 TVQPSRITMMAVVITRIHWGASGARE 76 MOTIF_FOUND_IN_DNA
180 OUTPUT: HL67IUI01CYCLF N/A IGHV4-39*01 N/A N/A GGSISSSSYY 29 IYYSGST 20 NOT_FOUND
181 OUTPUT: HL67IUI01B4LEE In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH
182 OUTPUT: HL67IUI01A4KW4 Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*IWGQG 58 MOTIF_FOUND_IN_BOTH
183 OUTPUT: HL67IUI01E05BV In-frame IGHV1-24*01 IGHD3-22*01 IGHJ2*01 GYSLNELS 23 PDPEDDE 23 NOT_FOUND
184 OUTPUT: HL67IUI01CVVKY In-frame IGHV1-3*01 IGHD2-15*01 IGHJ1*01 NOT_FOUND
185 OUTPUT: HL67IUI01CN5P2 In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH
186 OUTPUT: HL67IUI01DUUJ5 In-frame IGHV3-21*01 IGHD1-7*01 IGHJ4*02 GYTFSTYS 23 ISSSSAYR 23 CARDIRLELRDWGQG 44 MOTIF_FOUND_IN_BOTH
187 OUTPUT: HL67IUI01E1AIR Out-of-frame IGHV4-39*01 N/A IGHJ3*01 WGLHRRW**L 29 FVS*RAPR 23 NOT_FOUND
188 OUTPUT: HL67IUI01CCZ8D Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASGR 20 YCASEGDIVASELLSTGARE 58 MOTIF_FOUND_IN_DNA
189 OUTPUT: HL67IUI01BT9IR N/A IGHV3-21*02 N/A N/A NOT_FOUND
190 OUTPUT: HL67IUI01COTO0 Out-of-frame IGHV4-39*01 N/A IGHJ3*01 GGFIGGGDNF 29 LYHDGRPA 23 NOT_FOUND
191 OUTPUT: HL67IUI01D994O In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH
192 OUTPUT: HL67IUI01A08CJ In-frame IGHV4-39*01 IGHD6-13*01 IGHJ5*02 GGSISSSSYY 29 IYYTWEH 21 CERARRGSSWGQLVRPLGPG 62 MOTIF_FOUN
193
194
195
196 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
197 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
198 ...etc...
199
200 =head4 Also, combined grep & sed:
201
202 $ ./IGBLAST_simple.pl igBLASTOutput.txt | grep OUTPUT: | sed 's/OUTPUT:\t//'
203
204 =cut
205
206 =head3 CDR3 Patterns:
207
208 We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly:
209
210 my $DNACDR3_Pat = "TGGGG....GG.";
211 my $AASequenceMotifPattern = "WG.G";
212
213 They are treated as regex's when tested (so use "." to mean any DNA base, rather than 'N' or 'X').
214
215 [NB: These are original patterns used for testing, check the code for the current ones.]
216
217 =cut
218
219 my $DNACDR3_Pat = "TGGGG....GG.";
220 my $AACDR3_Pat = "WG.G";
221
222 use strict;
223 use Data::Dumper;
224 # Set this as to number of the result (aka "record") you want to process or 0 for all:
225 my $ProcessRecord =0;
226 if (defined $ARGV[1]) { $ProcessRecord = pop @ARGV; } #Also accept from the command line:
227 if ($ProcessRecord != 0) { print "D: Request to process just record '$ProcessRecord' received\n"; }
228
229 #Adjust the record separator:
230 $/="Query= ";
231 my $Record=0; # A simple counter, that we might not use.
232 #Force-loaded header / version information:
233 my $Header = <>;
234 #At the moment we don't use this - so dump it immediately:
235 $Header = undef;
236 #print "D: Force-loaded header / version information: '$Header'\n";
237
238 #Print the Header for the output line (we need this once, at the start)
239 print &printOUTPUTData ({"HEADER" => 1})."\n";
240
241 while (<>)
242 {
243 =head4 First check - should we be processing this record at all?
244
245 =cut
246 $Record++; #Increment the record counter:
247 #Do we process this record - or all records?
248 if ($ProcessRecord != $Record && $ProcessRecord != 0)
249 { next; } #We need to increment the record counter before we increment
250
251 =head4 Setup the output line storage and print the header:
252
253 We enter this initially and work to change it:
254
255 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
256
257 =cut
258
259 my %OUTPUT_Data; #To collect data for the output line in
260 #Assume the first and work to find better:
261 $OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND";
262 #The whole record - one per read - is now stored in $_
263 my @Lines =split (/[\r\n]+/,$_); # split on windows/linux/mac new lines
264
265 #If you are interested enable either of the next lines depending on how curious you are as to how the splitting went:
266 #print "D: Record #$Record\n"; print $_; print "\n---------\n";
267 print "D: ''$Lines[0]'\nD: ...etc...'\nD: ############\n";
268
269 =head3 Get the ID
270
271 Quite easy: the first field on the first line:
272
273 Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1
274
275 =cut
276
277 (my $ID) = $Lines[0]=~ m/^(\S+)/;
278 unless (defined $ID && $ID ne "")
279 { # So a near total failure...?
280 $OUTPUT_Data{"ID"} = "Unknown";
281 print &printOUTPUTData (\%OUTPUT_Data)."\n";
282 next; #No ID is terminal for this record
283 }
284 else
285 {
286 print "D: ID is: '$ID'\n";
287 $OUTPUT_Data{"ID"} = $ID;
288 }
289 =head3 Declare the variables we will need here in the next few sections to store data
290
291 =cut
292
293 my $CurrentRegion;
294 my $RegionMarkup;
295
296 #So we can sync the coordinated of the alignment up to the domains found:
297 my $Query_Start = -1; my $Query_End = -1;
298
299 #Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop?
300 my $ThisQueryStart =-1; my $ThisQueryEnd =-1; #Think $ThisQueryEnd isn't used at the moment.
301 my $DNAQuerySequence =""; #The actual DNA Query sequence...
302 my $AAQuerySequence = "";
303
304 #As this changes with the alleles identified:
305 my $CurrentAASequence;
306 #The main storage variables
307
308 my %Alginments; my %Alleles;
309 my %DomainBoundaries;
310
311 =head2 Stanza 1: Get the general structure of the sequence identified
312
313 =head3 Method 1: Use the table supplied
314
315 Technically this valid for the top hit...realistically this is the only information we have reported to us
316 so we use this or nothing. This is fine for the top hit which is likely what we are interested in....but for the 2nd or 3rd? Who knows!
317
318 Targets this block:
319
320 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
321 FWR1 167 240 75 72 2 1 96
322 CDR1 241 264 24 20 4 0 83.3
323 FWR2 265 315 51 48 3 0 94.1
324 CDR2 316 336 24 15 6 3 62.5
325 FWR3 337 450 114 106 8 0 93
326 CDR3 (V region only) 451 454 4 4 0 0 100
327 Total N/A N/A 292 265 23 4 90.8
328
329 Then we split out the lines inside it in a second scanning step - less optimal but easier to read:
330
331 FWR1 167 240 75 72 2 1 96
332 CDR1 241 264 24 20 4 0 83.3
333 FWR2 265 315 51 48 3 0 94.1
334 CDR2 316 336 24 15 6 3 62.5
335 FWR3 337 450 114 106 8 0 93
336 CDR3 (V region only) 451 454 4 4 0 0 100
337
338 into:
339
340 (Section, from, to, length, matches, mismatches, gaps, percent identity)
341
342 =head3 Method 2: Use the table supplied
343
344 The other way to do this is to split the graphical markup out of the alignment.
345 This works for _any_ reported alignment, not just the top hits:
346
347 In the main alignment table processing section collect the information, collect the information:
348
349 #Is region mark-up:
350 if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
351 {
352 # print ": Region Markup detected\n";
353 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record
354 next;
355 }
356
357 Then afterwards when all the region was collected, process it like this:
358 #Pad the CDER3 region:
359
360 #Remove the trailing spaces:
361 $RegionMarkup =~ s/ *$//g;
362 #Calculate the length of the CDR3 region so we can add it in:
363 my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1;
364 #Build up the CDR3 region, the 'x' operator is very helpful here (implict foreach loop):
365 $RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">";
366 #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n";
367
368 #Now really process it:
369 my $C_Pos = 0;
370 my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup); #
371 foreach my $C_Domain (@Domains)
372 {
373 if (length ($C_Domain) <=0) {next;}
374 my $DomainStart= $C_Pos;
375 my $DomainEnd = $DomainStart + length ($C_Domain)-1;
376 my ($DomainType) = $C_Domain =~ m/(...[123])/;
377 # print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n";
378 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
379 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd;
380 $C_Pos = $DomainEnd+1;
381 }
382
383 The two pieces of code are interchangable; the table version as used below, is neater, easier to understand and works nicely.
384 Why stress?
385
386
387 =head3 The end of the FWR3 is the start of CDR3?
388
389 This is an assumption made. Hence the two variables:
390
391 my $MaxDomainReported =0 ; # In nts / bps
392 my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially.
393
394 $MaxDomainBaseFound
395
396 =cut
397 my $MaxDomainBaseFound =0 ; # In nts / bps
398 my $AlignmentStart ; # In nts /bp #Alternative name would be: '$MinDomainBaseFound'; set to null until primed
399 # my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially.
400
401 (my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" );
402 #Enable the next line if you want the raw data we are going to parse in this section:
403 #print Dumper @StructureSummaryTable;
404 foreach my $C_Section (@StructureSummaryTable)
405 {
406 my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section);
407 #print "D: Domain type: '$DomainType'\n";
408 #$DomainType =~ s/ .*$//g;
409 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
410 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd;
411
412 #So we can do a reality check on the length / start of the CDR3 if we have to go looking:
413 if ($MaxDomainBaseFound <= $DomainEnd)
414 { $MaxDomainBaseFound = $DomainEnd; } #Store the maximum base found
415 if ($AlignmentStart eq undef or $AlignmentStart >= $DomainStart)
416 { $AlignmentStart = $DomainStart; }
417 }
418 #print Dumper %DomainBoundaries;
419 #die "HIT BLOCK\n";
420
421 =head3 Did we find the CDR3 region specifically?
422
423 If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up.
424
425 =cut
426 print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n";
427
428 #my @WantedSections = qw (V D J);
429
430 =head2 Second Stanza: Parse the main Alignment Table
431
432 =head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels.
433
434 As this is a little involved and comparamentalises nicely we sub-contract this to two functions""
435
436 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
437 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter?
438
439 =cut
440 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
441 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter?
442 #If you are interested, enable this line:
443 # print "D: The info panel was detected at: '$splitPoint'\n";
444
445 =head3
446
447 =cut
448
449
450 foreach my $C_Line (0..$#Table)
451 {
452
453 =head3 Call the line type we find: There are 4:
454
455 These are distinguished by the number of fields (one or mores spacer is a field separator) in the Info & Alignment Panels (see values in brackets)
456
457 | <- This split is ~40 chars. from the start of the line
458 * InfoPanel * | * Alignment Panel *
459 : is a "Blank" line (-1,-1)
460 <----FWR1--><----------CDR1--------><-----------------------FWR2------ : is "Region Markup" (-1,0)
461 W A A S G F T F N T Y A V H W V R Q A P G K G : is "AA Sequence" (-1, >=0)
462 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 : is "DNA Sequence" (2,1)
463 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 : is "" "
464
465 So we split 40 chars in and then the two parts on spaces.
466
467
468 =cut
469
470 # print "D: (sub) Line in parsed table: '$C_Line': \n";
471
472 my ($InfoPanel, $AlignmentPanel) = $Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/;
473
474 my @InfoColumns = split (/\s+/,$InfoPanel);
475 my @AlignmentColumns = split (/\s+/,$AlignmentPanel);
476
477 #If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first
478 # print "D: Line: $C_Line/t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n";
479 # print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n";
480
481 #Populate this so we can step through it
482
483 =head4 Is a blank line:
484 =cut
485 if ($#InfoColumns == -1 && $#AlignmentColumns == -1)
486 {
487 # print ": Blank\n";
488 next;
489 } #For now I think we just skip - is not needed (though might be implict mark-up)
490
491 =head4 Is region mark-up:
492 =cut
493 if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
494 {
495 # print ": Region Markup detected\n";
496 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record
497 next;
498 }
499 =head4 Is query DNA Sequence:
500 =cut
501 if ($#InfoColumns == 2 && $#AlignmentColumns ==1)
502 {
503 # print ": DNA Query Sequence\n";
504 #Detect the two coordinatates of alignment against the query sequence: (last two numbers of the two 'panels')
505 ($ThisQueryStart) = $InfoPanel =~ / (\d+) *$/;
506 ($ThisQueryEnd) = $AlignmentPanel =~ / (\d+) *$/;
507 my ($ThisDNASeq) = $AlignmentPanel =~ /^(.*?) /;
508 #If you want to know what we just found:
509 #print "D: This DNA Sequence: '$ThisDNASeq'\n";
510 $DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq; #Add it on to whatever we already have.
511 #Move the needle if there are smaller / greater; otherwise prime the 'needles':
512 if ($ThisQueryStart < $Query_Start or $Query_Start == -1)
513 { $Query_Start = $ThisQueryStart; }
514 if ($ThisQueryEnd > $Query_End or $Query_End == -1)
515 { $Query_End = $ThisQueryEnd; }
516 # print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n";
517 next;
518 }
519 =head4 Is AA Sequence:
520
521 This is complicated as it Need to decide whether this is the sequence of the read or that of the original V / D / J regions:
522 -------------->
523 A V Y Y C T R D M S P I M T T F A G N Y W G Q << Want this
524 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306
525 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296
526 A V Y Y C A R
527 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294
528
529 ...etc...
530 G T L V T V S S << Want this
531 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330
532
533 To solve this we peak at the next line that it has the tag "Query" in it (we assume the line exists...)
534
535 =cut
536
537 if ($#InfoColumns == -1 && $#AlignmentColumns >=-1)
538 {
539 unless ($Table[$C_Line+1] =~ /Query/) { next; } #Is the next line the DNA sequence ?
540 #
541 # print ": AA sequence\n";
542
543
544 $CurrentAASequence = $AlignmentPanel;
545 #print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n";
546 $CurrentAASequence =~ s/^ {$PanelSplitPoint}//;
547 #print "D: '$AAQuerySequence'\n";
548 # print "D: Current AA Sequence: \t'$CurrentAASequence'\n";
549 $AAQuerySequence = $AAQuerySequence.$CurrentAASequence; #Store the elongating AA Sequence as well
550 next;
551 }
552 =head4 Is Alignment:
553 =cut
554 if ($#InfoColumns == 4 && $#AlignmentColumns ==1)
555 {
556 #Not acutally interesting to us for this version of the parser. Delete ultimately?
557 next;
558 }
559
560 #Is weird! Don't recognise it!
561
562 warn "Weird! Don't recongnise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",$Lines[$C_Line],15,"...'\n";
563 } #End main iteration loop for alignment parsing.
564
565
566 =head2 The CDR3 is noted as problematic. Can we identify it?
567
568 =cut
569 print "D: Starting Search for CDR3\n";
570 #Do have the end of the FWR3 but not the CDR3? If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point
571 if (exists ($DomainBoundaries{"FWR3"}{"End"})
572 && $AlignmentStart !=0
573 && not (exists $DomainBoundaries{"CDR3"}{"End"}) ) #Guess we need to go looking for the end then...
574 {
575 #print "D: Placing call to markUpCDR3\n";
576 my ($CDR3_Start, my $CDR3_End, my $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
577 $DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart,
578 $DNACDR3_Pat, $AACDR3_Pat);
579 if ($CDR3_Start !=0 && $CDR3_End !=0)
580 {
581 $DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start;
582 $DomainBoundaries{"CDR3"}{"End"} = $CDR3_End ;
583 $DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag;
584 print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n";
585 }
586 else
587 { print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n";
588 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
589 }
590 }
591 else
592 { #Was reported by igBLAST
593 print "D: Found the FWR3 from the Domain Boundary Table\n";
594 $DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE";
595 }
596
597 #print Dumper %DomainBoundaries;
598
599 =head2 Get the top VDJ regions:
600
601 =cut
602
603 =head2 Extract General Features:
604
605 =cut
606 (my $TopHit) = $_ =~ m/V-J Frame, Strand\):\n(.*?)\n/s;
607 print "D: Top Hits (raw)= '$TopHit' \n";
608 my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit);
609 print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n";
610
611 =head2 Store the V / D / J Genes used
612
613 =cut
614
615 if (defined $Top_V_gene_match && $Top_V_gene_match ne "")
616 { $OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match; }
617
618 if (defined $Top_D_gene_match && $Top_D_gene_match ne "")
619 { $OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match; }
620
621 if (defined $Top_J_gene_match && $Top_J_gene_match ne "")
622 { $OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match; }
623
624 if (defined $Strand && $Strand ne "")
625 { $OUTPUT_Data{"Strand"} = $Strand;}
626
627 =head4 Preamble: ID, Frame, and V / D / J used:
628
629 =cut
630 #Do a reality check: if we didn't get an ID, then skip:
631 unless (defined (defined $ID) && $ID ne "" &&
632 defined $VJFrame && $VJFrame ne "")
633 {
634 print &printOUTPUTData (\%OUTPUT_Data)."\n";
635 next;
636 }
637
638 #Ok, so we have data...most likely:
639 #print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match);
640
641 if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "")
642 { $OUTPUT_Data{"VDJ Frame"} = $VJFrame;}
643 else
644 {
645 print &printOUTPUTData (\%OUTPUT_Data)."\n";
646 next;
647 }#REALLY? We didn't find anything? Oh well, move to next record
648
649 =head4 CDR1
650
651 =cut
652 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we substract this off all future AA (& DNA coordinates)
653
654 my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"};
655
656 # print "D: AA Seqeunce is: '$AAQuerySequence'\n";
657 if (exists $DomainBoundaries{"CDR1"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
658 {
659 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
660 my $CDR1Start = $DomainBoundaries{"CDR1"}{"Start"};
661 my $CDR1End = $DomainBoundaries{"CDR1"}{"End"};
662 my $CDR1_Length = $CDR1End - $CDR1Start;
663 # print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n";
664 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read
665 my $CDR1_Seq_AA = substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length);
666 # print "D: '$CDR1_Seq_AA'\n";
667 $CDR1_Seq_AA =~ s/ //g;
668 my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA);
669 #Add this data to the output store specifically:
670 $OUTPUT_Data{"CDR1 Seq"} = $CDR1_Seq_AA;
671 $OUTPUT_Data{"CDR1 Length"} = $CDR1_Length;
672 }
673 #What happens if there is no CDR1 found? Leave blank - the output routine can handle this
674
675 =head4 CDR2
676
677 =cut
678
679 if (exists $DomainBoundaries{"CDR2"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
680 {
681 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
682 my $CDR2Start = $DomainBoundaries{"CDR2"}{"Start"};
683 my $CDR2End = $DomainBoundaries{"CDR2"}{"End"};
684 my $CDR2_Length = $CDR2End - $CDR2Start;
685 my $CDR2_Seq_AA = substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length);
686 $CDR2_Seq_AA =~ s/ //g;
687 my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA);
688 #Add this data to the output store specifically:
689 $OUTPUT_Data{"CDR2 Seq"} = $CDR2_Seq_AA;
690 $OUTPUT_Data{"CDR2 Length"} = $CDR2_Length;
691 }
692 #What happens if there is no CDR2 found? Leave blank - the output routine can handle this.
693
694 =head4 CDR3
695
696 =cut
697 if (exists $DomainBoundaries{"CDR3"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
698 {
699 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
700 my $CDR3Start = $DomainBoundaries{"CDR3"}{"Start"};
701 my $CDR3End = $DomainBoundaries{"CDR3"}{"End"};
702 my $CDR3_Length = $CDR3End - $CDR3Start; # This variable isn't used - delete it when safe to do so
703 my $CDR3_Seq_AA = substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
704 my $CDR3_Seq_DNA = substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
705 $CDR3_Seq_AA =~ s/ //g;
706 $CDR3_Seq_DNA =~ s/ //g;
707 my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA);
708 my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA);
709 #Add this data to the output store specifically:
710 $OUTPUT_Data{"CDR3 Seq"} = $CDR3_Seq_AA;
711 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
712 $OUTPUT_Data{"CDR3 Seq DNA"} = $CDR3_Seq_DNA;
713 $OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length;
714 #And in the case of the CDR3 how we found it:
715 $OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"};
716 }
717 #What happens if there is no CDR3 found? Leave blank - the output routine can handle this.
718 #die "HIT BLOCK\n";
719 #End of the record; output the data we have collected and move on.
720 print &printOUTPUTData (\%OUTPUT_Data)."\n";
721 }
722
723
724
725 ############
726 sub returnLinesBetween {
727 =head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section})
728
729 When passed a reference to an array and two strings - interpreted as REGEX's - will return the lines of the Array
730 that are bounded by these tags.
731
732 If either of the tags are not found - or are found in the wrong order - then a null list is returned.
733
734 =cut
735
736 my ($Text_ref, $TopTag, $BotTag) = @_;
737
738 my @Table;
739 #The two boundary conditions at which we will cut the table:
740 #print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n";
741 #How we record these:
742 my $AlignmentLine_Top=0; my $AlignmentLine_Bot=0;
743
744 my $LineIndex=-1; #-1 As the loop increments this line counter first, then does its checks.
745 #If you care:
746 #print "D: Lines of text passed: $$#Lines\n";
747
748 #Iterate through until we find what we are looking for or run out of text to search:
749 while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex <=$#{$Text_ref})
750 {
751 $LineIndex++;
752 #Enable if you need to care:
753 # print "D: Line Index = $LineIndex\n";
754
755 if ($$Text_ref[$LineIndex] =~ m/$TopTag/)
756 {
757 $AlignmentLine_Top = $LineIndex;
758 # print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested
759 }
760 if ($$Text_ref[$LineIndex] =~ m/$BotTag/)
761 {
762 $AlignmentLine_Bot = $LineIndex;
763 # print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested
764 }
765 }
766 #Reality check: did we find anything? If not then we return null.
767 if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0)
768 { return; }
769 #Again, enable if you care:
770 #print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n";
771
772 #We want the lines one down and one up - so polish these.
773 $AlignmentLine_Top++; $AlignmentLine_Bot--;
774
775 #Return as an array slice:
776 return (@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]);
777 }
778 ############
779
780 sub findSplitPoint
781 {
782 =head2 sub: $PanelBoundaryCahracter = findSplitPoint (\@Table)
783
784 When passed a table with the alignment in it makes an educated guess as to the precise split point to
785 spearate the 'info' and 'alignment' panels.
786 This is a right olde faff because the field / panel boundaries change.
787
788 ' Query_6 167 GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG 235'
789 ' Query_6 236 CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG 305'
790 ' Query_6 306 GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC 372'
791 ' Query_6 373 TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT 442'
792 ' Query_6 443 ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC 512'
793 MOTIF_FOUND_IN_AA
794 i.e to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alingment proper) and have 4 fields in total
795
796 Returns either -1 or the location of the panel boundary, issues a warning and returns -1 if is the most frequent boundary
797 because the pattern match has been failing more often that it suceeded.
798
799 =cut
800 #A rough guess is 38 for normal sequences, 48 for reversed ones:
801
802 my $SplitPos = 0;
803
804 (my $Table_ref) = @_; #Get the reference to the table
805 my @DNALines; #We populate this for mining in the next section
806 foreach my $C_Line (@{$Table_ref})
807 {
808 #print "D: $C_Line\n";
809 # (my $SplitLine) = $C_Line;
810 #Split on consecutive tabs or spaces:
811 my @LineFields = split (/[\t\s]+/,$C_Line);
812 #print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n";
813 unless ( $LineFields[3] =~ m/[^\.]/
814 && $LineFields[3] =~ m/[ATGCX]{20,}/
815 && $#LineFields==4)
816 { next; }
817 #Enable if you want to know the lines we think are the DNA Query strings:
818 #print "D: DNA Line: '$C_Line'\n";
819 push @DNALines, $C_Line; #Note it down
820 }
821
822 my %PanelBounds; #Will contain the positions of the panel boundaries
823
824 foreach my $C_DNALine (@DNALines)
825 {
826 #print "D: '$C_DNALine'\n";
827 $C_DNALine =~ m/[ATGC-]+ \d+$/; #Match the DNA string and the indexingMOTIF_FOUND_IN_AA numbers afterwards, allow gap characters.
828 my $MatchPos = $-[0]; #This is the position of the start of the last match because we can't get the index() function to work
829 #(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0);
830 #print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n";
831 $PanelBounds{$MatchPos}++;
832 }
833 #Sort the hash values in order and then return the most frequent (will offer some resistance to the occasion pattern failure)
834 #The brackets around "($SplitPos)" are really necessary it seems.
835 ($SplitPos) = (sort { $a <=> $b } keys %PanelBounds);
836 #If you want
837 #print Dumper %PanelBounds;
838 #Tell people if we are having difficultlty:
839 if ($SplitPos == -1) { warn "Couldn't identify the panel boundaries\n"; }
840 #print "D: $SplitPos: Returning the split position of: '$SplitPos'\n";
841 return $SplitPos;
842 }
843
844
845 ##
846 #
847 #
848 ###
849
850
851
852
853
854 #####
855 #
856 #
857 #####
858 sub markUpCDR3
859 {
860 =head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex)
861
862 Tries to identify the end of the CDR3 using the DNA and RNA Sequence patterns MOTIF_FOUND_IN_AAsupplied. The CDR3 is assumed to start
863 at the end of the FWR3.
864 To reduce FP matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern.
865 The position of the first matching pattern is reported.
866
867 =head4 Fuller Usage:
868
869 my ($CDR3_Start, my $CDR3_End) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
870 $DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"},
871 $DNACDR3_Pat, $AACDR3_Pat);
872
873
874
875 =head4 Returned Values
876
877 If the CDR3 was found then we we signal like this:
878
879 $MotifFound ==0 : Nope, didn't find either motif
880 $MotifFound ==1 : Found at the DNA level, not the AA level
881 $MotifFound ==2 : Found at the the AA level, not the DNA level
882 $MotifFound ==3 : Found at the the AA level & the DNA level
883
884 (Also remember that if the FWR3 region couldn't be identified in the sequence there is a 4th option: not tested; this routine isn't called therefore)
885
886 The Start and Ends returned are from the first sucessful match (MotifFound==3): though hopefully they are the same.
887 Formally the test order is:
888
889 1) DNA
890 2) AA
891
892 i.e. DNA bp locations have priority.
893
894 Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match).
895 See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation.
896
897 =head3 Manipulation of AA patternsMOTIF_FOUND_IN_AA
898
899 Note that patterns are assumed to require white space inserting in them between the letters.
900 This could be a serious limitation
901
902
903 =cut
904
905 #Get the parameters passed:
906 my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat) = @_;
907 print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n";
908
909
910 #Setup our return values:
911 my $Start = 0; my $End =0; my $MotifFound = 0;
912 my $How; #Literally How the motif was found (or not if blank)
913
914
915 =head4 Prepare the sequences and the patterns for use
916
917 Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs
918
919 Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working.
920
921
922 We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on
923 in position calculations.
924
925
926 =cut
927
928 #Because igBLAST doesn't always report from the start of the read (primers and things are upstream):
929
930 my $CropPoint = $FWR3_End - $FWR1_Start - 1 ;
931 #print "D: markUpCDR3: Crop point is: '$CropPoint'\n";
932
933 #print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n";
934 #We trim off the parts we expect to find the CDR3 motifs in leaving at extra 3nts on to allow for base miss-calling:
935
936 my $AA_Trimmed = substr ($AA, $CropPoint);
937 my $DNA_Trimmed = substr ($DNA ,$CropPoint);
938 #print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n";
939 #print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n";
940
941 #This lovely hack is to account for the spaces in the AA sequence and we can't get the "$Var =~ m/$AAPat/x" working
942 my $AAPat_Spaced;
943 foreach my $C_Char (0..length($AAPat)-1) #The -1 is because we don't want trailing spaces until the next nt -> AA translation.
944 { $AAPat_Spaced = $AAPat_Spaced.'\s+'.substr ($AAPat,$C_Char,1); }
945 #And write this back into the main pattern we were passed:
946 $AAPat = $AAPat_Spaced;
947
948 #temp hack:
949 #$AA_Trimmed = $AA;
950 my $MotifFound=0; #So we can record which patterns we found
951 my $MotifPositionDNA =-1;
952 my $MotifPositionAA =-1;
953
954 #print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n";
955 =head4 At DNA level: "TGG GGx xxx GGx" [+1]
956
957 =cut
958
959 #print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
960
961 if ($DNA_Trimmed =~ m/$DNAPat/)
962 {
963 $MotifPositionDNA = $+[0]; #Just the easiest way to do this in Perl
964 # print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n";
965 $MotifFound = $MotifFound + 1;
966 #Any more matches further on?
967 my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA);
968 # print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n";
969 # print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n";
970 # print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
971 # print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n";
972 if ($LaterString =~ m/$DNAPat/)
973 { print "D: markUPCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+ [0], " - which might be worrying\n"; }
974 }
975
976 =head4 At AA level: "WGxG" [+2]
977
978 =cut
979
980 if ($AA_Trimmed=~ m/$AAPat/)
981 {
982 $MotifPositionAA = $+[0]; #Just the easiest way to do this in Perl
983 $MotifFound = $MotifFound + 2;
984 # print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n";
985 (my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA);
986 # print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n";
987
988 }
989
990 =head4 Assess the results of motif position finding
991
992 =cut
993
994 #print "D: markUpCDR3: MotifFound = '$MotifFound'\n";
995
996 if ($MotifFound ==0)
997 { return ($Start, $End, $MotifFound); } #The easy one really: return we didn't find the CDR3
998
999 #
1000 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3:
1001 #Just found in DNA:
1002 if ($MotifFound ==1)
1003 {
1004 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3:
1005 $End = $MotifPositionDNA;
1006 $How = "MOTIF_FOUND_IN_DNA";
1007 }
1008 #Just found in AA:
1009 if ($MotifFound ==2)
1010 {
1011 $End = $MotifPositionAA;
1012 $How = "MOTIF_FOUND_IN_AA";
1013 }
1014
1015 #Found in both, DNA has priority:
1016 if ($MotifFound ==3)
1017 {
1018 $Start = $FWR3_End ; #We assume the end of the FWR3 is the start of CDR3:
1019 $End = $MotifPositionDNA;
1020 $How = "MOTIF_FOUND_IN_BOTH";
1021 }
1022
1023 #print "D: markUpCDR3: Motif found = $MotifFound\n";
1024
1025 =head4 These next few lines are for testing / diagnostics only - disable for general use
1026
1027 If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that
1028 the start of FWR1 is unlikely to be at nt 1.
1029
1030 =cut
1031
1032 $Start = $FWR3_End - $FWR1_Start -1;
1033 $End = $End + $CropPoint;
1034 my $CDR3_RegionLength = $End - $Start;
1035 #print "D: markUpCDR3: CDR3 Length= $Start - $End = '$CDR3_RegionLength'\n";
1036 (my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength);
1037
1038 #Add onto the coordinates what we trimmed off:
1039
1040
1041 #print "D: markUpCDR3: Seq ='$CDR3_seq'\n";
1042
1043 print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of :'+ $FWR1_Start'\n";
1044 #die "HIT BLOCK\n";
1045 return ($Start + $FWR1_Start, $End + $FWR1_Start, $How);
1046 }
1047
1048
1049 sub printOUTPUTData {
1050 =head2 sub: $OutputDataString = printOUTPUTData {\%OutputData}
1051
1052 When passed an array containing the appropriate CDR, Top V / D/ J genes and the seqeunce ID.
1053 This prepared and then returned as a text string that can then be printed to STDOUT:
1054
1055 print (printOUTPUTData (\%OutputData));
1056
1057 Any missing data in the Hash array it polietly ignored and a null string printed in place.
1058 The text field is tab delimited; there are no extra trailing tabs or carriage returns in place.
1059
1060 Actually the fields printed out are stored in an index array.
1061
1062 =head3 Header output
1063
1064 If the routine is passed a key 'HEADER' then the header columns are returned as that string.
1065 This is tested first - so don't add this unless you mean to.
1066
1067 =cut
1068
1069 my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene",
1070 "CDR1 Seq", "CDR1 Length",
1071 "CDR2 Seq", "CDR2 Length",
1072 "CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand",
1073 "CDR3 Found How");
1074
1075 my $OutputString = "OUTPUT:"; #What we are going to build the output into.
1076
1077 =head4 Print Header & Exit?
1078
1079 =cut
1080
1081 my ($Data_ref) = @_;
1082 #print "D: printOUTPUTData: Running\n";
1083
1084 if (exists $$Data_ref {"HEADER"})
1085 {
1086 $OutputString .= "\t";
1087 for(my $n = 0; $n <= $#HeaderFields; $n++)
1088 {
1089 $OutputString .= $HeaderFields[$n];
1090 $OutputString .= "\t" if($n < $#HeaderFields);
1091 }
1092
1093 # foreach my $C_Header (@HeaderFields)
1094 # { $OutputString .= "$C_Header"; } #
1095
1096 print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n";
1097 return ($OutputString);
1098 }
1099
1100 =head3 Assemble whatever data we have - and tab delimit the null fields
1101
1102 =cut
1103 #print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref;
1104 foreach my $C_Header (@HeaderFields)
1105 {
1106
1107 if (exists ($$Data_ref {$C_Header}))
1108 { $OutputString .= "\t". $$Data_ref{$C_Header}; } #We have data to print out
1109 else
1110 { $OutputString .="\t"; } #Add a trailing space
1111 } #
1112
1113 return ($OutputString);
1114 }
1115
1116
1117 ######################################### Code Junk ########################
1118
1119
1120 =head2 Code Junk Attic
1121
1122 =head3 Demonstrates how to reverse translate an amino acid sequence into DNA:
1123
1124 use Bio::Tools::CodonTable;
1125 use Bio::Seq;
1126
1127 # print possible codon tables
1128 my $tables = Bio::Tools::CodonTable->tables;
1129 while ( (my $id, my $name) = each %{$tables} ) {
1130 print "$id = $name\n";
1131 }
1132 my $CodonTable = Bio::Tools::CodonTable->new();
1133
1134 my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n";
1135
1136
1137 my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq);
1138 print "D: '$rvSeq'\n";
1139 die "TEST OVER\n";
1140
1141 =cut
1142
1143
1144 =head3 For processing the 'Alignment lines' section of the alginment table
1145
1146 #If we are ever interested; then enable the code below:
1147 # print ": Alignment\n";
1148 # $InfoPanel =~ s/^ +//; $InfoPanel =~ s/ +$//; #Clean off trailing spaces
1149 # my ($Germclass, $PID, $PID_Counts, $Allele) = split (/\s+/,$InfoPanel); #Split on spaces
1150 ##Enable if you need to know what we just found:
1151 # #print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n";
1152 # #A reality check: we should have an Allele - or some text here.
1153 # unless (defined $Allele && $Allele ne "")
1154 # { warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n"; }
1155 # if (exists ($Alginments {$Germclass}{$Allele}))
1156 # { $Alginments {$Germclass}{$Allele} = $Alginments {$Germclass}{$Allele}.$CurrentAASequence; } #Carry on adding
1157 # else #more work needed as we need to 'pad' the sequence with fake gap characters)
1158 # {
1159 ##Do we still need this padding? I don't think so
1160 #
1161 #
1162 # my $PaddingChars = ($ThisQueryStart-$Query_Start);
1163 # print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n";
1164 # #To help testing, calculate this first:
1165 # my $PaddingString = " "x $PaddingChars;
1166 # $Alginments {$Germclass}{$Allele} = $CurrentAASequence;
1167 # }
1168 # next
1169
1170 =head3 Demonstration of Pattern match positions
1171
1172 my $Text = "12345TTT TTAAAAA";
1173 my $TestPat = "TTT\\s+TT";
1174 (my $Result)= $Text =~ m/$TestPat/;
1175 print "D: Two vars are: - = ",$-[0], " & + =", $+[0]," for test pattern '$TestPat'\n";
1176
1177 sub printCDR3 {
1178
1179 =head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence);
1180
1181 ???? IS THIS FUNCTION IN USE ?????
1182
1183 Handles the printing of the output when passed information about the CDR3 region.
1184
1185
1186 The result is sent returned as a text string in this version hence use it like this if you want to send it to STDOUT:
1187
1188 print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n";
1189
1190 #=cut
1191
1192 #Despite the similarity in names, these are all local copies passed to us:
1193
1194 my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_;
1195
1196 #For DNA:
1197 my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End);
1198 my ($CDR_DNA_Length) = length ($CDR_DNA_Seq);
1199
1200 #For AA:
1201 my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End);
1202 my ($CDR_AA_Length) = length ($CDR_AA_Seq);
1203
1204 my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed:
1205 print "D: SUB: printCDR3: As returned: '$ReturnString'\n";
1206 return ($ReturnString);
1207
1208 }
1209
1210 =cut
1211
1212
1213
1214 =head2 Change Log
1215
1216 =head3 Version 1.2
1217
1218 1) Fixed the 'Process recrod request' feature' [was failed increment in $Record]
1219 2) Deleted / Deactivated the function 'printCDR3' [wasn't in used; kept if useful for parts].
1220 This function is replaced by the more general printOUTPUTData()
1221 3) A tag for the CDR3 status is now output for every record / read.
1222 Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found.
1223
1224 =head4 Version 1.3
1225
1226 1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”,
1227 which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly.
1228 - found by Bas Horsman.
1229
1230 =head4 Version 1.3a
1231
1232 1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to addition error to the $MotifFound var (never could == 3)
1233
1234 =cut
1235
1236 =head4 Version 1.4
1237
1238 1) Now processes files using Mac/Unix/MS-DOS newline characters:
1239
1240 $_ =~ s/\r\n/\n/g; #In case line ends are MS-DOS
1241 $_ =~ s/\r/\n/g; #In case line ends are Mac
1242 #The whole record - one per read - is now stored in $_
1243 my @Lines =split (/\R/,$_); #Split on new lines
1244
1245 =head4 Version 1.4a
1246
1247 1) Fixed the length of the CDR3 AA string being reported correctly:
1248
1249 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length;
1250 to:
1251 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
1252