annotate igblastparser/igparse.pl @ 13:d3ebaa2d2fe0 draft

Uploaded
author davidvanzessen
date Tue, 20 Dec 2016 06:02:44 -0500
parents afe85eb6572e
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1 #!/usr/bin/perl
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
2 =head1 IGBLAST_simple.pl
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
4 This version (1.4) has been heavily adapted since the original program was first created back in October 2012.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
5 Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed with minor - though important - code changes.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
6
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
7 From V 1.2 onwards a 'Change Log' is included at the end of the program
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
8
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
9 =head2 Usage
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
10
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
11 Requires no modules in general use; the Data::Dumper (supplied as part of the Perl Core module set) might be useful for debugging/adjustment
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
12 as it allows inspection of the data stores.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
13
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
14 The program takes a text file of the
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
15
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
16 ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process->
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
17
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
18 Supply the text version of the igBLAST report in the format as in the example below.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
19 The extra command line arugment is the record number (aka. BLAST report) to process.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
20 If 0 or absent all are processed, if supplied that record (base 1) is processed and the program dies afterwards.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
21
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
22 =head2 Example Input
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
23
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
24 A standard igBLAST record or set of them in a file; this being typical:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
25
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
26 BLASTN 2.2.27+
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
27
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
28
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
29 Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
30 Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
31 Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
32 protein database search programs", Nucleic Acids Res. 25:3389-3402.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
33
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
34
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
35
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
36 Database: human_gl_V; human_gl_D; human_gl_J
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
37 674 sequences; 179,480 total letters
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
38
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
39
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
40
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
41 Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
42 run=R_2012_04_10_11_57_56_
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
43
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
44 Length=433
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
45 Score E
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
46 Sequences producing significant alignments: (Bits) Value
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
47
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
48 lcl|IGHV3-30*04 330 2e-92
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
49 lcl|IGHV3-30-3*01 330 2e-92
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
50 lcl|IGHV3-30*01 327 2e-91
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
51 lcl|IGHD3-16*01 14.4 11
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
52 lcl|IGHD3-16*02 14.4 11
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
53 lcl|IGHD1-14*01 12.4 43
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
54 lcl|IGHJ4*02 78.3 1e-18
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
55 lcl|IGHJ5*02 70.3 4e-16
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
56 lcl|IGHJ4*01 68.3 2e-15
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
57
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
58
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
59 Domain classification requested: imgt
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
60
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
61
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
62 V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand):
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
63 IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
64
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
65 V(D)J junction details (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
66 AGAGA TATGAGCCCCATCATGACA ACGTTTG CCGGAA ACTAC
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
67
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
68 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
69 FWR1 27 38 12 11 1 0 91.7
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
70 CDR1 39 62 24 22 2 0 91.7
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
71 FWR2 63 113 51 50 1 0 98
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
72 CDR2 114 137 24 23 1 0 95.8
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
73 FWR3 138 251 114 109 5 0 95.6
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
74 CDR3 (V region only) 252 259 8 7 1 0 87.5
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
75 Total N/A N/A 233 222 11 0 95.3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
76
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
77
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
78 Alignments
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
79
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
80 <----FWR1--><----------CDR1--------><-----------------------FWR2------
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
81 W A A S G F T F N T Y A V H W V R Q A P G K G
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
82 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
83 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
84 C A A S G F T F S S Y A M H W V R Q A P G K G
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
85 V 95.7% (221/231) IGHV3-30-3*01 64 ..T......................G..G.......A................................. 133
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
86 V 94.8% (221/233) IGHV3-30*01 64 ..T......................G..G.......A................................. 133
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
87
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
88 ----------------><----------CDR2--------><----------------------------
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
89 L E W V A V I S Y D G S N K N Y A D S V K G R F
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
90 Query_1 97 TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT 166
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
91 V 95.3% (222/233) IGHV3-30*04 134 ..................................T......T............................ 203
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
92 L E W V A V I S Y D G S N K Y Y A D S V K G R F
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
93 V 95.7% (221/231) IGHV3-30-3*01 134 .........................................T............................ 203
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
94 V 94.8% (221/233) IGHV3-30*01 134 .A................................T......T............................ 203
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
95
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
96 ---------------------------FWR3---------------------------------------
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
97 T I S R D N S K N T L Y L Q M N S L R V E D T
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
98 Query_1 167 CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG 236
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
99 V 95.3% (222/233) IGHV3-30*04 204 ...............................C.G.........................C.......... 273
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
100 T I S R D N S K N T L Y L Q M N S L R A E D T
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
101 V 95.7% (221/231) IGHV3-30-3*01 204 ...............................C.G.........................C.......... 273
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
102 V 94.8% (221/233) IGHV3-30*01 204 ...............................C.G.........................C.......... 273
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
103
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
104 -------------->
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
105 A V Y Y C T R D M S P I M T T F A G N Y W G Q
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
106 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
107 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
108 A V Y Y C A R
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
109 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
110 V 94.8% (221/233) IGHV3-30*01 274 .....G.........G.......----------------------------------------------- 296
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
111 D 100.0% (7/7) IGHD3-16*01 12 ------------------------------------------.......--------------------- 18
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
112 D 100.0% (7/7) IGHD3-16*02 12 ------------------------------------------.......--------------------- 18
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
113 D 100.0% (6/6) IGHD1-14*01 8 -------------------------------------------------......--------------- 13
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
114 J 100.0% (39/39) IGHJ4*02 10 -------------------------------------------------------............... 24
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
115 J 100.0% (35/35) IGHJ5*02 17 -----------------------------------------------------------........... 27
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
116 J 97.4% (38/39) IGHJ4*01 10 -------------------------------------------------------.............A. 24
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
117
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
118
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
119 G T L V T V S S
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
120 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
121 J 100.0% (39/39) IGHJ4*02 25 ........................ 48
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
122 J 100.0% (35/35) IGHJ5*02 28 ........................ 51
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
123 J 97.4% (38/39) IGHJ4*01 25 ........................ 48
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
124
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
125
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
126 Lambda K H
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
127 1.10 0.333 0.549
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
128
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
129 Gapped
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
130 Lambda K H
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
131 1.08 0.280 0.540
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
132
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
133 Effective search space used: 64847385
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
134
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
135
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
136 Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
137 run=R_2012_04_10_11_57_56_
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
138
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
139
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
140 ...etc...
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
141
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
142 =head2 Example Output
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
143
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
144
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
145 Example output from the data above sent:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
146 $ ./IGBLAST_simple.pl igBLASTOutput.txt 1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
147 D: Request to process just record '1' received
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
148 D: printOUTPUTData: Running
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
149 D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
150 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
151 D: ID is: 'HL67IUI01D26LR'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
152 D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
153 D: Starting Search for CDR3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
154 D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
155 D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
156 D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
157 D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
158 D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
159 D: printOUTPUTData: Running
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
160 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
161
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
162 =head4 Usage notes:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
163
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
164 Designed to be easy to "grep -v D:" or "grep OUTPUT:" for to select the parts you need:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
165
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
166 ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
167
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
168 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
169 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
170 OUTPUT: HL67IUI01EQMLY In-frame IGHV4-39*01 IGHD2-8*01 IGHJ3*02 GGSISSSSYY 29 IYHSGST 20 CARDATYYSNGFDIWGQG 53 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
171 OUTPUT: HL67IUI01CDCLP Out-of-frame IGHV3-23*01 IGHD3-3*01 IGHJ4*02 FSNYAM 16 SGSGDRTY 23 AKAD*FLEWLFRIGDGERLLGPGN 72 MOTIF_FOUND_IN_DNA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
172 OUTPUT: HL67IUI01AHRNH N/A IGHV3-33*01 N/A N/A WIHLQ*LW 23 YGMMEVI 23 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
173 OUTPUT: HL67IUI01DZZ1V Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASG 20 LYCASEGDIVASELLSTGARV 62 MOTIF_FOUND_IN_DNA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
174 OUTPUT: HL67IUI01DTR2Y Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*VLGPG 59 MOTIF_FOUND_IN_DNA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
175 OUTPUT: HL67IUI01EQL3S In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
176 OUTPUT: HL67IUI01AFG46 In-frame IGLV2-34*01 N/A IGHJ4*02 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
177 OUTPUT: HL67IUI01EFFKO In-frame IGHV3-11*01 IGHD6-6*01 IGHJ4*02 GFTFSDYY 23 ISYSGGTI 23 CARASGAARHRPLDYWGQG 56 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
178 OUTPUT: HL67IUI01B18SG In-frame IGHV3-33*01 IGHD5-12*01 IGHJ4*02 VRQA 11 KYYANSVK 23 RLGGFDYWGQGTLVTVSS 53 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
179 OUTPUT: HL67IUI01D6LER In-frame IGHV1-24*01 IGHD3-22*01 IGHJ4*02 GYSLNELS 23 PDPEDDE 23 TVQPSRITMMAVVITRIHWGASGARE 76 MOTIF_FOUND_IN_DNA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
180 OUTPUT: HL67IUI01CYCLF N/A IGHV4-39*01 N/A N/A GGSISSSSYY 29 IYYSGST 20 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
181 OUTPUT: HL67IUI01B4LEE In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
182 OUTPUT: HL67IUI01A4KW4 Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*IWGQG 58 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
183 OUTPUT: HL67IUI01E05BV In-frame IGHV1-24*01 IGHD3-22*01 IGHJ2*01 GYSLNELS 23 PDPEDDE 23 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
184 OUTPUT: HL67IUI01CVVKY In-frame IGHV1-3*01 IGHD2-15*01 IGHJ1*01 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
185 OUTPUT: HL67IUI01CN5P2 In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
186 OUTPUT: HL67IUI01DUUJ5 In-frame IGHV3-21*01 IGHD1-7*01 IGHJ4*02 GYTFSTYS 23 ISSSSAYR 23 CARDIRLELRDWGQG 44 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
187 OUTPUT: HL67IUI01E1AIR Out-of-frame IGHV4-39*01 N/A IGHJ3*01 WGLHRRW**L 29 FVS*RAPR 23 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
188 OUTPUT: HL67IUI01CCZ8D Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASGR 20 YCASEGDIVASELLSTGARE 58 MOTIF_FOUND_IN_DNA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
189 OUTPUT: HL67IUI01BT9IR N/A IGHV3-21*02 N/A N/A NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
190 OUTPUT: HL67IUI01COTO0 Out-of-frame IGHV4-39*01 N/A IGHJ3*01 GGFIGGGDNF 29 LYHDGRPA 23 NOT_FOUND
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
191 OUTPUT: HL67IUI01D994O In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
192 OUTPUT: HL67IUI01A08CJ In-frame IGHV4-39*01 IGHD6-13*01 IGHJ5*02 GGSISSSSYY 29 IYYTWEH 21 CERARRGSSWGQLVRPLGPG 62 MOTIF_FOUN
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
193
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
194
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
195
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
196 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
197 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
198 ...etc...
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
199
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
200 =head4 Also, combined grep & sed:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
201
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
202 $ ./IGBLAST_simple.pl igBLASTOutput.txt | grep OUTPUT: | sed 's/OUTPUT:\t//'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
203
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
204 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
205
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
206 =head3 CDR3 Patterns:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
207
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
208 We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
209
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
210 my $DNACDR3_Pat = "TGGGG....GG.";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
211 my $AASequenceMotifPattern = "WG.G";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
212
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
213 They are treated as regex's when tested (so use "." to mean any DNA base, rather than 'N' or 'X').
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
214
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
215 [NB: These are original patterns used for testing, check the code for the current ones.]
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
216
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
217 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
218
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
219 my $DNACDR3_Pat = "TGGGG....GG.";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
220 my $AACDR3_Pat = "WG.G";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
221
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
222 use strict;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
223 use Data::Dumper;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
224 # Set this as to number of the result (aka "record") you want to process or 0 for all:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
225 my $ProcessRecord =0;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
226 if (defined $ARGV[1]) { $ProcessRecord = pop @ARGV; } #Also accept from the command line:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
227 if ($ProcessRecord != 0) { print "D: Request to process just record '$ProcessRecord' received\n"; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
228
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
229 #Adjust the record separator:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
230 $/="Query= ";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
231 my $Record=0; # A simple counter, that we might not use.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
232 #Force-loaded header / version information:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
233 my $Header = <>;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
234 #At the moment we don't use this - so dump it immediately:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
235 $Header = undef;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
236 #print "D: Force-loaded header / version information: '$Header'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
237
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
238 #Print the Header for the output line (we need this once, at the start)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
239 print &printOUTPUTData ({"HEADER" => 1})."\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
240
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
241 while (<>)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
242 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
243 =head4 First check - should we be processing this record at all?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
244
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
245 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
246 $Record++; #Increment the record counter:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
247 #Do we process this record - or all records?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
248 if ($ProcessRecord != $Record && $ProcessRecord != 0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
249 { next; } #We need to increment the record counter before we increment
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
250
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
251 =head4 Setup the output line storage and print the header:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
252
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
253 We enter this initially and work to change it:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
254
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
255 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
256
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
257 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
258
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
259 my %OUTPUT_Data; #To collect data for the output line in
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
260 #Assume the first and work to find better:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
261 $OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
262 #The whole record - one per read - is now stored in $_
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
263 my @Lines =split (/[\r\n]+/,$_); # split on windows/linux/mac new lines
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
264
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
265 #If you are interested enable either of the next lines depending on how curious you are as to how the splitting went:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
266 #print "D: Record #$Record\n"; print $_; print "\n---------\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
267 print "D: ''$Lines[0]'\nD: ...etc...'\nD: ############\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
268
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
269 =head3 Get the ID
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
270
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
271 Quite easy: the first field on the first line:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
272
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
273 Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
274
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
275 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
276
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
277 (my $ID) = $Lines[0]=~ m/^(\S+)/;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
278 unless (defined $ID && $ID ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
279 { # So a near total failure...?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
280 $OUTPUT_Data{"ID"} = "Unknown";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
281 print &printOUTPUTData (\%OUTPUT_Data)."\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
282 next; #No ID is terminal for this record
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
283 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
284 else
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
285 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
286 print "D: ID is: '$ID'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
287 $OUTPUT_Data{"ID"} = $ID;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
288 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
289 =head3 Declare the variables we will need here in the next few sections to store data
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
290
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
291 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
292
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
293 my $CurrentRegion;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
294 my $RegionMarkup;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
295
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
296 #So we can sync the coordinated of the alignment up to the domains found:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
297 my $Query_Start = -1; my $Query_End = -1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
298
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
299 #Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
300 my $ThisQueryStart =-1; my $ThisQueryEnd =-1; #Think $ThisQueryEnd isn't used at the moment.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
301 my $DNAQuerySequence =""; #The actual DNA Query sequence...
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
302 my $AAQuerySequence = "";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
303
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
304 #As this changes with the alleles identified:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
305 my $CurrentAASequence;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
306 #The main storage variables
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
307
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
308 my %Alginments; my %Alleles;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
309 my %DomainBoundaries;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
310
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
311 =head2 Stanza 1: Get the general structure of the sequence identified
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
312
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
313 =head3 Method 1: Use the table supplied
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
314
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
315 Technically this valid for the top hit...realistically this is the only information we have reported to us
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
316 so we use this or nothing. This is fine for the top hit which is likely what we are interested in....but for the 2nd or 3rd? Who knows!
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
317
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
318 Targets this block:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
319
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
320 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
321 FWR1 167 240 75 72 2 1 96
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
322 CDR1 241 264 24 20 4 0 83.3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
323 FWR2 265 315 51 48 3 0 94.1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
324 CDR2 316 336 24 15 6 3 62.5
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
325 FWR3 337 450 114 106 8 0 93
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
326 CDR3 (V region only) 451 454 4 4 0 0 100
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
327 Total N/A N/A 292 265 23 4 90.8
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
328
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
329 Then we split out the lines inside it in a second scanning step - less optimal but easier to read:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
330
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
331 FWR1 167 240 75 72 2 1 96
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
332 CDR1 241 264 24 20 4 0 83.3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
333 FWR2 265 315 51 48 3 0 94.1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
334 CDR2 316 336 24 15 6 3 62.5
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
335 FWR3 337 450 114 106 8 0 93
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
336 CDR3 (V region only) 451 454 4 4 0 0 100
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
337
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
338 into:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
339
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
340 (Section, from, to, length, matches, mismatches, gaps, percent identity)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
341
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
342 =head3 Method 2: Use the table supplied
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
343
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
344 The other way to do this is to split the graphical markup out of the alignment.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
345 This works for _any_ reported alignment, not just the top hits:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
346
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
347 In the main alignment table processing section collect the information, collect the information:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
348
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
349 #Is region mark-up:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
350 if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
351 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
352 # print ": Region Markup detected\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
353 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
354 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
355 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
356
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
357 Then afterwards when all the region was collected, process it like this:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
358 #Pad the CDER3 region:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
359
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
360 #Remove the trailing spaces:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
361 $RegionMarkup =~ s/ *$//g;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
362 #Calculate the length of the CDR3 region so we can add it in:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
363 my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
364 #Build up the CDR3 region, the 'x' operator is very helpful here (implict foreach loop):
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
365 $RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
366 #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
367
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
368 #Now really process it:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
369 my $C_Pos = 0;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
370 my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup); #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
371 foreach my $C_Domain (@Domains)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
372 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
373 if (length ($C_Domain) <=0) {next;}
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
374 my $DomainStart= $C_Pos;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
375 my $DomainEnd = $DomainStart + length ($C_Domain)-1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
376 my ($DomainType) = $C_Domain =~ m/(...[123])/;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
377 # print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
378 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
379 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
380 $C_Pos = $DomainEnd+1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
381 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
382
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
383 The two pieces of code are interchangable; the table version as used below, is neater, easier to understand and works nicely.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
384 Why stress?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
385
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
386
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
387 =head3 The end of the FWR3 is the start of CDR3?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
388
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
389 This is an assumption made. Hence the two variables:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
390
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
391 my $MaxDomainReported =0 ; # In nts / bps
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
392 my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
393
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
394 $MaxDomainBaseFound
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
395
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
396 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
397 my $MaxDomainBaseFound =0 ; # In nts / bps
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
398 my $AlignmentStart ; # In nts /bp #Alternative name would be: '$MinDomainBaseFound'; set to null until primed
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
399 # my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
400
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
401 (my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" );
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
402 #Enable the next line if you want the raw data we are going to parse in this section:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
403 #print Dumper @StructureSummaryTable;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
404 foreach my $C_Section (@StructureSummaryTable)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
405 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
406 my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
407 #print "D: Domain type: '$DomainType'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
408 #$DomainType =~ s/ .*$//g;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
409 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
410 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
411
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
412 #So we can do a reality check on the length / start of the CDR3 if we have to go looking:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
413 if ($MaxDomainBaseFound <= $DomainEnd)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
414 { $MaxDomainBaseFound = $DomainEnd; } #Store the maximum base found
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
415 if ($AlignmentStart eq undef or $AlignmentStart >= $DomainStart)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
416 { $AlignmentStart = $DomainStart; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
417 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
418 #print Dumper %DomainBoundaries;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
419 #die "HIT BLOCK\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
420
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
421 =head3 Did we find the CDR3 region specifically?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
422
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
423 If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
424
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
425 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
426 print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
427
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
428 #my @WantedSections = qw (V D J);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
429
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
430 =head2 Second Stanza: Parse the main Alignment Table
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
431
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
432 =head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
433
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
434 As this is a little involved and comparamentalises nicely we sub-contract this to two functions""
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
435
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
436 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
437 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
438
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
439 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
440 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
441 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
442 #If you are interested, enable this line:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
443 # print "D: The info panel was detected at: '$splitPoint'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
444
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
445 =head3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
446
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
447 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
448
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
449
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
450 foreach my $C_Line (0..$#Table)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
451 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
452
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
453 =head3 Call the line type we find: There are 4:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
454
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
455 These are distinguished by the number of fields (one or mores spacer is a field separator) in the Info & Alignment Panels (see values in brackets)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
456
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
457 | <- This split is ~40 chars. from the start of the line
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
458 * InfoPanel * | * Alignment Panel *
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
459 : is a "Blank" line (-1,-1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
460 <----FWR1--><----------CDR1--------><-----------------------FWR2------ : is "Region Markup" (-1,0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
461 W A A S G F T F N T Y A V H W V R Q A P G K G : is "AA Sequence" (-1, >=0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
462 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 : is "DNA Sequence" (2,1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
463 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 : is "" "
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
464
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
465 So we split 40 chars in and then the two parts on spaces.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
466
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
467
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
468 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
469
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
470 # print "D: (sub) Line in parsed table: '$C_Line': \n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
471
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
472 my ($InfoPanel, $AlignmentPanel) = $Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
473
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
474 my @InfoColumns = split (/\s+/,$InfoPanel);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
475 my @AlignmentColumns = split (/\s+/,$AlignmentPanel);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
476
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
477 #If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
478 # print "D: Line: $C_Line/t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
479 # print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
480
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
481 #Populate this so we can step through it
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
482
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
483 =head4 Is a blank line:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
484 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
485 if ($#InfoColumns == -1 && $#AlignmentColumns == -1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
486 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
487 # print ": Blank\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
488 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
489 } #For now I think we just skip - is not needed (though might be implict mark-up)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
490
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
491 =head4 Is region mark-up:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
492 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
493 if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
494 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
495 # print ": Region Markup detected\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
496 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
497 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
498 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
499 =head4 Is query DNA Sequence:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
500 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
501 if ($#InfoColumns == 2 && $#AlignmentColumns ==1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
502 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
503 # print ": DNA Query Sequence\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
504 #Detect the two coordinatates of alignment against the query sequence: (last two numbers of the two 'panels')
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
505 ($ThisQueryStart) = $InfoPanel =~ / (\d+) *$/;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
506 ($ThisQueryEnd) = $AlignmentPanel =~ / (\d+) *$/;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
507 my ($ThisDNASeq) = $AlignmentPanel =~ /^(.*?) /;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
508 #If you want to know what we just found:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
509 #print "D: This DNA Sequence: '$ThisDNASeq'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
510 $DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq; #Add it on to whatever we already have.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
511 #Move the needle if there are smaller / greater; otherwise prime the 'needles':
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
512 if ($ThisQueryStart < $Query_Start or $Query_Start == -1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
513 { $Query_Start = $ThisQueryStart; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
514 if ($ThisQueryEnd > $Query_End or $Query_End == -1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
515 { $Query_End = $ThisQueryEnd; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
516 # print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
517 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
518 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
519 =head4 Is AA Sequence:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
520
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
521 This is complicated as it Need to decide whether this is the sequence of the read or that of the original V / D / J regions:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
522 -------------->
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
523 A V Y Y C T R D M S P I M T T F A G N Y W G Q << Want this
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
524 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
525 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
526 A V Y Y C A R
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
527 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
528
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
529 ...etc...
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
530 G T L V T V S S << Want this
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
531 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
532
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
533 To solve this we peak at the next line that it has the tag "Query" in it (we assume the line exists...)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
534
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
535 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
536
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
537 if ($#InfoColumns == -1 && $#AlignmentColumns >=-1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
538 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
539 unless ($Table[$C_Line+1] =~ /Query/) { next; } #Is the next line the DNA sequence ?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
540 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
541 # print ": AA sequence\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
542
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
543
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
544 $CurrentAASequence = $AlignmentPanel;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
545 #print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
546 $CurrentAASequence =~ s/^ {$PanelSplitPoint}//;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
547 #print "D: '$AAQuerySequence'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
548 # print "D: Current AA Sequence: \t'$CurrentAASequence'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
549 $AAQuerySequence = $AAQuerySequence.$CurrentAASequence; #Store the elongating AA Sequence as well
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
550 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
551 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
552 =head4 Is Alignment:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
553 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
554 if ($#InfoColumns == 4 && $#AlignmentColumns ==1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
555 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
556 #Not acutally interesting to us for this version of the parser. Delete ultimately?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
557 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
558 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
559
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
560 #Is weird! Don't recognise it!
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
561
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
562 warn "Weird! Don't recongnise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",$Lines[$C_Line],15,"...'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
563 } #End main iteration loop for alignment parsing.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
564
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
565
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
566 =head2 The CDR3 is noted as problematic. Can we identify it?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
567
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
568 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
569 print "D: Starting Search for CDR3\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
570 #Do have the end of the FWR3 but not the CDR3? If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
571 if (exists ($DomainBoundaries{"FWR3"}{"End"})
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
572 && $AlignmentStart !=0
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
573 && not (exists $DomainBoundaries{"CDR3"}{"End"}) ) #Guess we need to go looking for the end then...
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
574 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
575 #print "D: Placing call to markUpCDR3\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
576 my ($CDR3_Start, my $CDR3_End, my $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
577 $DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart,
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
578 $DNACDR3_Pat, $AACDR3_Pat);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
579 if ($CDR3_Start !=0 && $CDR3_End !=0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
580 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
581 $DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
582 $DomainBoundaries{"CDR3"}{"End"} = $CDR3_End ;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
583 $DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
584 print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
585 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
586 else
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
587 { print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
588 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
589 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
590 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
591 else
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
592 { #Was reported by igBLAST
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
593 print "D: Found the FWR3 from the Domain Boundary Table\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
594 $DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
595 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
596
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
597 #print Dumper %DomainBoundaries;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
598
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
599 =head2 Get the top VDJ regions:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
600
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
601 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
602
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
603 =head2 Extract General Features:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
604
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
605 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
606 (my $TopHit) = $_ =~ m/V-J Frame, Strand\):\n(.*?)\n/s;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
607 print "D: Top Hits (raw)= '$TopHit' \n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
608 my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
609 print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
610
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
611 =head2 Store the V / D / J Genes used
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
612
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
613 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
614
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
615 if (defined $Top_V_gene_match && $Top_V_gene_match ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
616 { $OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
617
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
618 if (defined $Top_D_gene_match && $Top_D_gene_match ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
619 { $OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
620
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
621 if (defined $Top_J_gene_match && $Top_J_gene_match ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
622 { $OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
623
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
624 if (defined $Strand && $Strand ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
625 { $OUTPUT_Data{"Strand"} = $Strand;}
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
626
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
627 =head4 Preamble: ID, Frame, and V / D / J used:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
628
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
629 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
630 #Do a reality check: if we didn't get an ID, then skip:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
631 unless (defined (defined $ID) && $ID ne "" &&
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
632 defined $VJFrame && $VJFrame ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
633 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
634 print &printOUTPUTData (\%OUTPUT_Data)."\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
635 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
636 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
637
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
638 #Ok, so we have data...most likely:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
639 #print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
640
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
641 if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
642 { $OUTPUT_Data{"VDJ Frame"} = $VJFrame;}
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
643 else
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
644 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
645 print &printOUTPUTData (\%OUTPUT_Data)."\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
646 next;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
647 }#REALLY? We didn't find anything? Oh well, move to next record
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
648
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
649 =head4 CDR1
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
650
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
651 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
652 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we substract this off all future AA (& DNA coordinates)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
653
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
654 my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
655
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
656 # print "D: AA Seqeunce is: '$AAQuerySequence'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
657 if (exists $DomainBoundaries{"CDR1"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
658 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
659 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
660 my $CDR1Start = $DomainBoundaries{"CDR1"}{"Start"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
661 my $CDR1End = $DomainBoundaries{"CDR1"}{"End"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
662 my $CDR1_Length = $CDR1End - $CDR1Start;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
663 # print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
664 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
665 my $CDR1_Seq_AA = substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
666 # print "D: '$CDR1_Seq_AA'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
667 $CDR1_Seq_AA =~ s/ //g;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
668 my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
669 #Add this data to the output store specifically:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
670 $OUTPUT_Data{"CDR1 Seq"} = $CDR1_Seq_AA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
671 $OUTPUT_Data{"CDR1 Length"} = $CDR1_Length;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
672 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
673 #What happens if there is no CDR1 found? Leave blank - the output routine can handle this
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
674
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
675 =head4 CDR2
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
676
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
677 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
678
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
679 if (exists $DomainBoundaries{"CDR2"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
680 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
681 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
682 my $CDR2Start = $DomainBoundaries{"CDR2"}{"Start"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
683 my $CDR2End = $DomainBoundaries{"CDR2"}{"End"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
684 my $CDR2_Length = $CDR2End - $CDR2Start;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
685 my $CDR2_Seq_AA = substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
686 $CDR2_Seq_AA =~ s/ //g;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
687 my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
688 #Add this data to the output store specifically:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
689 $OUTPUT_Data{"CDR2 Seq"} = $CDR2_Seq_AA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
690 $OUTPUT_Data{"CDR2 Length"} = $CDR2_Length;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
691 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
692 #What happens if there is no CDR2 found? Leave blank - the output routine can handle this.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
693
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
694 =head4 CDR3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
695
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
696 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
697 if (exists $DomainBoundaries{"CDR3"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
698 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
699 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
700 my $CDR3Start = $DomainBoundaries{"CDR3"}{"Start"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
701 my $CDR3End = $DomainBoundaries{"CDR3"}{"End"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
702 my $CDR3_Length = $CDR3End - $CDR3Start; # This variable isn't used - delete it when safe to do so
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
703 my $CDR3_Seq_AA = substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
704 my $CDR3_Seq_DNA = substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
705 $CDR3_Seq_AA =~ s/ //g;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
706 $CDR3_Seq_DNA =~ s/ //g;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
707 my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
708 my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
709 #Add this data to the output store specifically:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
710 $OUTPUT_Data{"CDR3 Seq"} = $CDR3_Seq_AA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
711 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
712 $OUTPUT_Data{"CDR3 Seq DNA"} = $CDR3_Seq_DNA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
713 $OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
714 #And in the case of the CDR3 how we found it:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
715 $OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"};
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
716 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
717 #What happens if there is no CDR3 found? Leave blank - the output routine can handle this.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
718 #die "HIT BLOCK\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
719 #End of the record; output the data we have collected and move on.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
720 print &printOUTPUTData (\%OUTPUT_Data)."\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
721 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
722
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
723
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
724
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
725 ############
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
726 sub returnLinesBetween {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
727 =head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section})
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
728
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
729 When passed a reference to an array and two strings - interpreted as REGEX's - will return the lines of the Array
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
730 that are bounded by these tags.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
731
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
732 If either of the tags are not found - or are found in the wrong order - then a null list is returned.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
733
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
734 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
735
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
736 my ($Text_ref, $TopTag, $BotTag) = @_;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
737
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
738 my @Table;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
739 #The two boundary conditions at which we will cut the table:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
740 #print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
741 #How we record these:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
742 my $AlignmentLine_Top=0; my $AlignmentLine_Bot=0;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
743
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
744 my $LineIndex=-1; #-1 As the loop increments this line counter first, then does its checks.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
745 #If you care:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
746 #print "D: Lines of text passed: $$#Lines\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
747
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
748 #Iterate through until we find what we are looking for or run out of text to search:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
749 while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex <=$#{$Text_ref})
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
750 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
751 $LineIndex++;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
752 #Enable if you need to care:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
753 # print "D: Line Index = $LineIndex\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
754
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
755 if ($$Text_ref[$LineIndex] =~ m/$TopTag/)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
756 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
757 $AlignmentLine_Top = $LineIndex;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
758 # print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
759 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
760 if ($$Text_ref[$LineIndex] =~ m/$BotTag/)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
761 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
762 $AlignmentLine_Bot = $LineIndex;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
763 # print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
764 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
765 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
766 #Reality check: did we find anything? If not then we return null.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
767 if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
768 { return; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
769 #Again, enable if you care:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
770 #print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
771
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
772 #We want the lines one down and one up - so polish these.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
773 $AlignmentLine_Top++; $AlignmentLine_Bot--;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
774
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
775 #Return as an array slice:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
776 return (@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
777 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
778 ############
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
779
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
780 sub findSplitPoint
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
781 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
782 =head2 sub: $PanelBoundaryCahracter = findSplitPoint (\@Table)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
783
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
784 When passed a table with the alignment in it makes an educated guess as to the precise split point to
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
785 spearate the 'info' and 'alignment' panels.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
786 This is a right olde faff because the field / panel boundaries change.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
787
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
788 ' Query_6 167 GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG 235'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
789 ' Query_6 236 CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG 305'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
790 ' Query_6 306 GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC 372'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
791 ' Query_6 373 TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT 442'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
792 ' Query_6 443 ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC 512'
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
793 MOTIF_FOUND_IN_AA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
794 i.e to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alingment proper) and have 4 fields in total
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
795
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
796 Returns either -1 or the location of the panel boundary, issues a warning and returns -1 if is the most frequent boundary
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
797 because the pattern match has been failing more often that it suceeded.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
798
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
799 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
800 #A rough guess is 38 for normal sequences, 48 for reversed ones:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
801
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
802 my $SplitPos = 0;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
803
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
804 (my $Table_ref) = @_; #Get the reference to the table
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
805 my @DNALines; #We populate this for mining in the next section
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
806 foreach my $C_Line (@{$Table_ref})
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
807 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
808 #print "D: $C_Line\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
809 # (my $SplitLine) = $C_Line;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
810 #Split on consecutive tabs or spaces:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
811 my @LineFields = split (/[\t\s]+/,$C_Line);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
812 #print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
813 unless ( $LineFields[3] =~ m/[^\.]/
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
814 && $LineFields[3] =~ m/[ATGCX]{20,}/
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
815 && $#LineFields==4)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
816 { next; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
817 #Enable if you want to know the lines we think are the DNA Query strings:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
818 #print "D: DNA Line: '$C_Line'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
819 push @DNALines, $C_Line; #Note it down
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
820 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
821
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
822 my %PanelBounds; #Will contain the positions of the panel boundaries
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
823
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
824 foreach my $C_DNALine (@DNALines)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
825 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
826 #print "D: '$C_DNALine'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
827 $C_DNALine =~ m/[ATGC-]+ \d+$/; #Match the DNA string and the indexingMOTIF_FOUND_IN_AA numbers afterwards, allow gap characters.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
828 my $MatchPos = $-[0]; #This is the position of the start of the last match because we can't get the index() function to work
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
829 #(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
830 #print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
831 $PanelBounds{$MatchPos}++;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
832 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
833 #Sort the hash values in order and then return the most frequent (will offer some resistance to the occasion pattern failure)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
834 #The brackets around "($SplitPos)" are really necessary it seems.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
835 ($SplitPos) = (sort { $a <=> $b } keys %PanelBounds);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
836 #If you want
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
837 #print Dumper %PanelBounds;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
838 #Tell people if we are having difficultlty:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
839 if ($SplitPos == -1) { warn "Couldn't identify the panel boundaries\n"; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
840 #print "D: $SplitPos: Returning the split position of: '$SplitPos'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
841 return $SplitPos;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
842 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
843
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
844
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
845 ##
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
846 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
847 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
848 ###
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
849
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
850
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
851
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
852
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
853
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
854 #####
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
855 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
856 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
857 #####
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
858 sub markUpCDR3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
859 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
860 =head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
861
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
862 Tries to identify the end of the CDR3 using the DNA and RNA Sequence patterns MOTIF_FOUND_IN_AAsupplied. The CDR3 is assumed to start
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
863 at the end of the FWR3.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
864 To reduce FP matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
865 The position of the first matching pattern is reported.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
866
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
867 =head4 Fuller Usage:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
868
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
869 my ($CDR3_Start, my $CDR3_End) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
870 $DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"},
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
871 $DNACDR3_Pat, $AACDR3_Pat);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
872
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
873
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
874
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
875 =head4 Returned Values
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
876
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
877 If the CDR3 was found then we we signal like this:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
878
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
879 $MotifFound ==0 : Nope, didn't find either motif
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
880 $MotifFound ==1 : Found at the DNA level, not the AA level
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
881 $MotifFound ==2 : Found at the the AA level, not the DNA level
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
882 $MotifFound ==3 : Found at the the AA level & the DNA level
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
883
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
884 (Also remember that if the FWR3 region couldn't be identified in the sequence there is a 4th option: not tested; this routine isn't called therefore)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
885
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
886 The Start and Ends returned are from the first sucessful match (MotifFound==3): though hopefully they are the same.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
887 Formally the test order is:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
888
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
889 1) DNA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
890 2) AA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
891
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
892 i.e. DNA bp locations have priority.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
893
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
894 Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match).
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
895 See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
896
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
897 =head3 Manipulation of AA patternsMOTIF_FOUND_IN_AA
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
898
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
899 Note that patterns are assumed to require white space inserting in them between the letters.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
900 This could be a serious limitation
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
901
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
902
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
903 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
904
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
905 #Get the parameters passed:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
906 my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat) = @_;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
907 print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
908
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
909
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
910 #Setup our return values:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
911 my $Start = 0; my $End =0; my $MotifFound = 0;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
912 my $How; #Literally How the motif was found (or not if blank)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
913
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
914
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
915 =head4 Prepare the sequences and the patterns for use
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
916
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
917 Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
918
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
919 Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
920
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
921
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
922 We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
923 in position calculations.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
924
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
925
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
926 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
927
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
928 #Because igBLAST doesn't always report from the start of the read (primers and things are upstream):
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
929
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
930 my $CropPoint = $FWR3_End - $FWR1_Start - 1 ;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
931 #print "D: markUpCDR3: Crop point is: '$CropPoint'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
932
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
933 #print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
934 #We trim off the parts we expect to find the CDR3 motifs in leaving at extra 3nts on to allow for base miss-calling:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
935
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
936 my $AA_Trimmed = substr ($AA, $CropPoint);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
937 my $DNA_Trimmed = substr ($DNA ,$CropPoint);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
938 #print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
939 #print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
940
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
941 #This lovely hack is to account for the spaces in the AA sequence and we can't get the "$Var =~ m/$AAPat/x" working
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
942 my $AAPat_Spaced;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
943 foreach my $C_Char (0..length($AAPat)-1) #The -1 is because we don't want trailing spaces until the next nt -> AA translation.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
944 { $AAPat_Spaced = $AAPat_Spaced.'\s+'.substr ($AAPat,$C_Char,1); }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
945 #And write this back into the main pattern we were passed:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
946 $AAPat = $AAPat_Spaced;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
947
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
948 #temp hack:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
949 #$AA_Trimmed = $AA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
950 my $MotifFound=0; #So we can record which patterns we found
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
951 my $MotifPositionDNA =-1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
952 my $MotifPositionAA =-1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
953
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
954 #print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
955 =head4 At DNA level: "TGG GGx xxx GGx" [+1]
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
956
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
957 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
958
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
959 #print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
960
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
961 if ($DNA_Trimmed =~ m/$DNAPat/)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
962 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
963 $MotifPositionDNA = $+[0]; #Just the easiest way to do this in Perl
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
964 # print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
965 $MotifFound = $MotifFound + 1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
966 #Any more matches further on?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
967 my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
968 # print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
969 # print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
970 # print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
971 # print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
972 if ($LaterString =~ m/$DNAPat/)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
973 { print "D: markUPCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+ [0], " - which might be worrying\n"; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
974 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
975
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
976 =head4 At AA level: "WGxG" [+2]
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
977
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
978 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
979
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
980 if ($AA_Trimmed=~ m/$AAPat/)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
981 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
982 $MotifPositionAA = $+[0]; #Just the easiest way to do this in Perl
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
983 $MotifFound = $MotifFound + 2;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
984 # print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
985 (my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
986 # print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
987
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
988 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
989
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
990 =head4 Assess the results of motif position finding
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
991
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
992 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
993
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
994 #print "D: markUpCDR3: MotifFound = '$MotifFound'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
995
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
996 if ($MotifFound ==0)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
997 { return ($Start, $End, $MotifFound); } #The easy one really: return we didn't find the CDR3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
998
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
999 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1000 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1001 #Just found in DNA:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1002 if ($MotifFound ==1)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1003 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1004 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1005 $End = $MotifPositionDNA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1006 $How = "MOTIF_FOUND_IN_DNA";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1007 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1008 #Just found in AA:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1009 if ($MotifFound ==2)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1010 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1011 $End = $MotifPositionAA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1012 $How = "MOTIF_FOUND_IN_AA";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1013 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1014
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1015 #Found in both, DNA has priority:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1016 if ($MotifFound ==3)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1017 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1018 $Start = $FWR3_End ; #We assume the end of the FWR3 is the start of CDR3:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1019 $End = $MotifPositionDNA;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1020 $How = "MOTIF_FOUND_IN_BOTH";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1021 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1022
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1023 #print "D: markUpCDR3: Motif found = $MotifFound\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1024
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1025 =head4 These next few lines are for testing / diagnostics only - disable for general use
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1026
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1027 If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1028 the start of FWR1 is unlikely to be at nt 1.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1029
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1030 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1031
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1032 $Start = $FWR3_End - $FWR1_Start -1;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1033 $End = $End + $CropPoint;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1034 my $CDR3_RegionLength = $End - $Start;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1035 #print "D: markUpCDR3: CDR3 Length= $Start - $End = '$CDR3_RegionLength'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1036 (my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1037
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1038 #Add onto the coordinates what we trimmed off:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1039
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1040
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1041 #print "D: markUpCDR3: Seq ='$CDR3_seq'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1042
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1043 print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of :'+ $FWR1_Start'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1044 #die "HIT BLOCK\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1045 return ($Start + $FWR1_Start, $End + $FWR1_Start, $How);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1046 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1047
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1048
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1049 sub printOUTPUTData {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1050 =head2 sub: $OutputDataString = printOUTPUTData {\%OutputData}
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1051
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1052 When passed an array containing the appropriate CDR, Top V / D/ J genes and the seqeunce ID.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1053 This prepared and then returned as a text string that can then be printed to STDOUT:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1054
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1055 print (printOUTPUTData (\%OutputData));
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1056
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1057 Any missing data in the Hash array it polietly ignored and a null string printed in place.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1058 The text field is tab delimited; there are no extra trailing tabs or carriage returns in place.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1059
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1060 Actually the fields printed out are stored in an index array.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1061
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1062 =head3 Header output
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1063
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1064 If the routine is passed a key 'HEADER' then the header columns are returned as that string.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1065 This is tested first - so don't add this unless you mean to.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1066
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1067 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1068
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1069 my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene",
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1070 "CDR1 Seq", "CDR1 Length",
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1071 "CDR2 Seq", "CDR2 Length",
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1072 "CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand",
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1073 "CDR3 Found How");
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1074
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1075 my $OutputString = "OUTPUT:"; #What we are going to build the output into.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1076
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1077 =head4 Print Header & Exit?
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1078
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1079 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1080
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1081 my ($Data_ref) = @_;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1082 #print "D: printOUTPUTData: Running\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1083
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1084 if (exists $$Data_ref {"HEADER"})
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1085 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1086 $OutputString .= "\t";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1087 for(my $n = 0; $n <= $#HeaderFields; $n++)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1088 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1089 $OutputString .= $HeaderFields[$n];
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1090 $OutputString .= "\t" if($n < $#HeaderFields);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1091 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1092
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1093 # foreach my $C_Header (@HeaderFields)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1094 # { $OutputString .= "$C_Header"; } #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1095
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1096 print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1097 return ($OutputString);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1098 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1099
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1100 =head3 Assemble whatever data we have - and tab delimit the null fields
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1101
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1102 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1103 #print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1104 foreach my $C_Header (@HeaderFields)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1105 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1106
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1107 if (exists ($$Data_ref {$C_Header}))
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1108 { $OutputString .= "\t". $$Data_ref{$C_Header}; } #We have data to print out
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1109 else
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1110 { $OutputString .="\t"; } #Add a trailing space
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1111 } #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1112
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1113 return ($OutputString);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1114 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1115
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1116
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1117 ######################################### Code Junk ########################
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1118
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1119
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1120 =head2 Code Junk Attic
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1121
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1122 =head3 Demonstrates how to reverse translate an amino acid sequence into DNA:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1123
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1124 use Bio::Tools::CodonTable;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1125 use Bio::Seq;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1126
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1127 # print possible codon tables
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1128 my $tables = Bio::Tools::CodonTable->tables;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1129 while ( (my $id, my $name) = each %{$tables} ) {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1130 print "$id = $name\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1131 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1132 my $CodonTable = Bio::Tools::CodonTable->new();
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1133
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1134 my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1135
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1136
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1137 my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1138 print "D: '$rvSeq'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1139 die "TEST OVER\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1140
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1141 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1142
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1143
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1144 =head3 For processing the 'Alignment lines' section of the alginment table
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1145
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1146 #If we are ever interested; then enable the code below:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1147 # print ": Alignment\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1148 # $InfoPanel =~ s/^ +//; $InfoPanel =~ s/ +$//; #Clean off trailing spaces
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1149 # my ($Germclass, $PID, $PID_Counts, $Allele) = split (/\s+/,$InfoPanel); #Split on spaces
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1150 ##Enable if you need to know what we just found:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1151 # #print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1152 # #A reality check: we should have an Allele - or some text here.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1153 # unless (defined $Allele && $Allele ne "")
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1154 # { warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n"; }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1155 # if (exists ($Alginments {$Germclass}{$Allele}))
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1156 # { $Alginments {$Germclass}{$Allele} = $Alginments {$Germclass}{$Allele}.$CurrentAASequence; } #Carry on adding
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1157 # else #more work needed as we need to 'pad' the sequence with fake gap characters)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1158 # {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1159 ##Do we still need this padding? I don't think so
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1160 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1161 #
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1162 # my $PaddingChars = ($ThisQueryStart-$Query_Start);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1163 # print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1164 # #To help testing, calculate this first:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1165 # my $PaddingString = " "x $PaddingChars;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1166 # $Alginments {$Germclass}{$Allele} = $CurrentAASequence;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1167 # }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1168 # next
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1169
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1170 =head3 Demonstration of Pattern match positions
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1171
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1172 my $Text = "12345TTT TTAAAAA";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1173 my $TestPat = "TTT\\s+TT";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1174 (my $Result)= $Text =~ m/$TestPat/;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1175 print "D: Two vars are: - = ",$-[0], " & + =", $+[0]," for test pattern '$TestPat'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1176
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1177 sub printCDR3 {
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1178
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1179 =head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1180
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1181 ???? IS THIS FUNCTION IN USE ?????
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1182
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1183 Handles the printing of the output when passed information about the CDR3 region.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1184
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1185
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1186 The result is sent returned as a text string in this version hence use it like this if you want to send it to STDOUT:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1187
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1188 print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1189
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1190 #=cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1191
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1192 #Despite the similarity in names, these are all local copies passed to us:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1193
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1194 my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1195
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1196 #For DNA:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1197 my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1198 my ($CDR_DNA_Length) = length ($CDR_DNA_Seq);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1199
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1200 #For AA:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1201 my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1202 my ($CDR_AA_Length) = length ($CDR_AA_Seq);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1203
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1204 my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1205 print "D: SUB: printCDR3: As returned: '$ReturnString'\n";
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1206 return ($ReturnString);
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1207
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1208 }
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1209
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1210 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1211
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1212
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1213
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1214 =head2 Change Log
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1215
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1216 =head3 Version 1.2
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1217
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1218 1) Fixed the 'Process recrod request' feature' [was failed increment in $Record]
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1219 2) Deleted / Deactivated the function 'printCDR3' [wasn't in used; kept if useful for parts].
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1220 This function is replaced by the more general printOUTPUTData()
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1221 3) A tag for the CDR3 status is now output for every record / read.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1222 Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1223
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1224 =head4 Version 1.3
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1225
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1226 1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”,
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1227 which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1228 - found by Bas Horsman.
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1229
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1230 =head4 Version 1.3a
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1231
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1232 1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to addition error to the $MotifFound var (never could == 3)
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1233
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1234 =cut
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1235
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1236 =head4 Version 1.4
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1237
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1238 1) Now processes files using Mac/Unix/MS-DOS newline characters:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1239
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1240 $_ =~ s/\r\n/\n/g; #In case line ends are MS-DOS
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1241 $_ =~ s/\r/\n/g; #In case line ends are Mac
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1242 #The whole record - one per read - is now stored in $_
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1243 my @Lines =split (/\R/,$_); #Split on new lines
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1244
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1245 =head4 Version 1.4a
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1246
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1247 1) Fixed the length of the CDR3 AA string being reported correctly:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1248
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1249 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1250 to:
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1251 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
afe85eb6572e Uploaded
davidvanzessen
parents:
diff changeset
1252