#!/usr/bin/perl
=head1 IGBLAST_simple.pl

This version (1.4) has been heavily adapted since the original program was first created back in October 2012.
Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed minor, though important, code changes.

From V 1.2 onwards a 'Change Log' is included at the end of the program

=head2 Usage

No non-core modules are required.  Data::Dumper (supplied as part of the Perl core module set) is loaded because it is useful for debugging/adjustment:
it allows inspection of the data stores.

The program takes a text file of the igBLAST report:

 ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process->

Supply the text version of the igBLAST report in the format as in the example below.
The extra command line argument is the record number (aka. BLAST report) to process.
If it is 0 or absent, all records are processed; if supplied, only that record (base 1) is processed and the program dies afterwards.

=head2 Example Input

A standard igBLAST record or set of them in a file; this being typical:

 BLASTN 2.2.27+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Database: human_gl_V; human_gl_D; human_gl_J
           674 sequences; 179,480 total letters


Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1
run=R_2012_04_10_11_57_56_

Length=433
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

lcl|IGHV3-30*04                                                        330    2e-92
lcl|IGHV3-30-3*01                                                      330    2e-92
lcl|IGHV3-30*01                                                        327    2e-91
lcl|IGHD3-16*01                                                       14.4       11
lcl|IGHD3-16*02                                                       14.4       11
lcl|IGHD1-14*01                                                       12.4       43
lcl|IGHJ4*02                                                          78.3    1e-18
lcl|IGHJ5*02                                                          70.3    4e-16
lcl|IGHJ4*01                                                          68.3    2e-15


Domain classification requested: imgt


V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand):
IGHV3-30*04	IGHD3-16*01	IGHJ4*02	VH	In-frame	+

V(D)J junction details (V end, V-D junction, D region, D-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself
AGAGA	TATGAGCCCCATCATGACA	ACGTTTG	CCGGAA	ACTAC

Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
FWR1	27	38	12	11	1	0	91.7
CDR1	39	62	24	22	2	0	91.7
FWR2	63	113	51	50	1	0	98
CDR2	114	137	24	23	1	0	95.8
FWR3	138	251	114	109	5	0	95.6
CDR3 (V region only)	252	259	8	7	1	0	87.5
Total	N/A	N/A	233	222	11	0	95.3


Alignments

                                        <----FWR1--><----------CDR1--------><-----------------------FWR2------
                                         W  A  A  S  G  F  T  F  N  T  Y  A  V  H  W  V  R  Q  A  P  G  K  G
                    Query_1        27   TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC  96
V  95.3% (222/233)  IGHV3-30*04    64   ..T......................G..G.......A.................................  133
                                         C  A  A  S  G  F  T  F  S  S  Y  A  M  H  W  V  R  Q  A  P  G  K  G
V  95.7% (221/231)  IGHV3-30-3*01  64   ..T......................G..G.......A.................................  133
V  94.8% (221/233)  IGHV3-30*01    64   ..T......................G..G.......A.................................  133

                                        ----------------><----------CDR2--------><----------------------------
                                        L  E  W  V  A  V  I  S  Y  D  G  S  N  K  N  Y  A  D  S  V  K  G  R  F
                    Query_1        97   TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT  166
V  95.3% (222/233)  IGHV3-30*04    134  ..................................T......T............................  203
                                        L  E  W  V  A  V  I  S  Y  D  G  S  N  K  Y  Y  A  D  S  V  K  G  R  F
V  95.7% (221/231)  IGHV3-30-3*01  134  .........................................T............................  203
V  94.8% (221/233)  IGHV3-30*01    134  .A................................T......T............................  203

                                        ---------------------------FWR3---------------------------------------
                                          T  I  S  R  D  N  S  K  N  T  L  Y  L  Q  M  N  S  L  R  V  E  D  T
                    Query_1        167  CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG  236
V  95.3% (222/233)  IGHV3-30*04    204  ...............................C.G.........................C..........  273
                                          T  I  S  R  D  N  S  K  N  T  L  Y  L  Q  M  N  S  L  R  A  E  D  T
V  95.7% (221/231)  IGHV3-30-3*01  204  ...............................C.G.........................C..........  273
V  94.8% (221/233)  IGHV3-30*01    204  ...............................C.G.........................C..........  273

                                        -------------->
                                         A  V  Y  Y  C  T  R  D  M  S  P  I  M  T  T  F  A  G  N  Y  W  G  Q
                    Query_1        237  GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG  306
V  95.3% (222/233)  IGHV3-30*04    274  .....G.........G.......-----------------------------------------------  296
                                         A  V  Y  Y  C  A  R
V  95.7% (221/231)  IGHV3-30-3*01  274  .....G.........G.....-------------------------------------------------  294
V  94.8% (221/233)  IGHV3-30*01    274  .....G.........G.......-----------------------------------------------  296
D  100.0% (7/7)     IGHD3-16*01    12   ------------------------------------------.......---------------------  18
D  100.0% (7/7)     IGHD3-16*02    12   ------------------------------------------.......---------------------  18
D  100.0% (6/6)     IGHD1-14*01    8    -------------------------------------------------......---------------  13
J  100.0% (39/39)   IGHJ4*02       10   -------------------------------------------------------...............  24
J  100.0% (35/35)   IGHJ5*02       17   -----------------------------------------------------------...........  27
J  97.4% (38/39)    IGHJ4*01       10   -------------------------------------------------------.............A.  24


                                        G  T  L  V  T  V  S  S
                    Query_1        307  GAACCCTGGTCACCGTCTCCTCAG  330
J  100.0% (39/39)   IGHJ4*02       25   ........................  48
J  100.0% (35/35)   IGHJ5*02       28   ........................  51
J  97.4% (38/39)    IGHJ4*01       25   ........................  48


Lambda      K        H
    1.10    0.333    0.549

Gapped
Lambda      K        H
    1.08    0.280    0.540

Effective search space used: 64847385


Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1
run=R_2012_04_10_11_57_56_


...etc...

=head2 Example Output


Example output for the data above:
 $ ./IGBLAST_simple.pl igBLASTOutput.txt 1
 D: Request to process just record '1' received
 D: printOUTPUTData: Running
 D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How'
 OUTPUT: # ID    VDJ Frame       Top V Gene      Top D Gene      Top J Gene      CDR1 Seq        CDR1 Length     CDR2 Seq        CDR2 Length     CDR3 Seq        CDR3 Length     CDR3 Found How
 D: ID is: 'HL67IUI01D26LR'
 D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259)
 D: Starting Search for CDR3
 D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence)
 D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27'
 D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309)
 D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01     IGHJ4*02        VH      In-frame        +'
 D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +'
 D: printOUTPUTData: Running
 OUTPUT: HL67IUI01D26LR  In-frame        IGHV3-30*04     IGHD3-16*01     IGHJ4*02        GFTFNTYA        23      ISYDGSNK        23      CTRDMSPIMTTFAGNYWGQG    59      MOTIF_FOUND_IN_BOTH

=head4 Usage notes:

The output is designed to be easy to filter with "grep -v D:" or "grep OUTPUT:" to select the parts you need:

 ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT:

 OUTPUT: # ID    VDJ Frame       Top V Gene      Top D Gene      Top J Gene      CDR1 Seq        CDR1 Length     CDR2 Seq        CDR2 Length     CDR3 Seq        CDR3 Length     CDR3 Found How
 OUTPUT: HL67IUI01D26LR  In-frame        IGHV3-30*04     IGHD3-16*01     IGHJ4*02        GFTFNTYA        23      ISYDGSNK        23      CTRDMSPIMTTFAGNYWGQG    59      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01EQMLY  In-frame        IGHV4-39*01     IGHD2-8*01      IGHJ3*02        GGSISSSSYY      29      IYHSGST 20      CARDATYYSNGFDIWGQG      53      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01CDCLP  Out-of-frame    IGHV3-23*01     IGHD3-3*01      IGHJ4*02        FSNYAM  16      SGSGDRTY        23      AKAD*FLEWLFRIGDGERLLGPGN        72      MOTIF_FOUND_IN_DNA
 OUTPUT: HL67IUI01AHRNH  N/A     IGHV3-33*01     N/A     N/A     WIHLQ*LW        23      YGMMEVI 23                      NOT_FOUND
 OUTPUT: HL67IUI01DZZ1V  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        GFTFDKYA        23      ILASG   20      LYCASEGDIVASELLSTGARV   62      MOTIF_FOUND_IN_DNA
 OUTPUT: HL67IUI01DTR2Y  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        LDSPLTNM        23      LYLPVV  20      TVRVRGT*WLRSF*VLGPG     59      MOTIF_FOUND_IN_DNA
 OUTPUT: HL67IUI01EQL3S  In-frame        IGHV7-4-1*02    IGHD6-19*01     IGHJ6*02        GYTFRTFT        23      INTNTGTP        23      CAKESGTGSAHFFYGMDVWGQG  65      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01AFG46  In-frame        IGLV2-34*01     N/A     IGHJ4*02                                                        NOT_FOUND
 OUTPUT: HL67IUI01EFFKO  In-frame        IGHV3-11*01     IGHD6-6*01      IGHJ4*02        GFTFSDYY        23      ISYSGGTI        23      CARASGAARHRPLDYWGQG     56      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01B18SG  In-frame        IGHV3-33*01     IGHD5-12*01     IGHJ4*02        VRQA    11      KYYANSVK        23      RLGGFDYWGQGTLVTVSS      53      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01D6LER  In-frame        IGHV1-24*01     IGHD3-22*01     IGHJ4*02        GYSLNELS        23      PDPEDDE 23      TVQPSRITMMAVVITRIHWGASGARE      76      MOTIF_FOUND_IN_DNA
 OUTPUT: HL67IUI01CYCLF  N/A     IGHV4-39*01     N/A     N/A     GGSISSSSYY      29      IYYSGST 20                      NOT_FOUND
 OUTPUT: HL67IUI01B4LEE  In-frame        IGHV7-4-1*02    IGHD6-19*01     IGHJ6*02        GYTFRTFT        23      INTNTGTP        23      CAKESGTGSAHFFYGMDVWGQG  65      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01A4KW4  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        LDSPLTNM        23      LYLPVV  20      TVRVRGT*WLRSF*IWGQG     58      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01E05BV  In-frame        IGHV1-24*01     IGHD3-22*01     IGHJ2*01        GYSLNELS        23      PDPEDDE 23                      NOT_FOUND
 OUTPUT: HL67IUI01CVVKY  In-frame        IGHV1-3*01      IGHD2-15*01     IGHJ1*01                                                        NOT_FOUND
 OUTPUT: HL67IUI01CN5P2  In-frame        IGHV7-4-1*02    IGHD2-21*02     IGHJ5*02        GYSITDYG        23      LNTRTGNP        23      CAVKDARDFVSWGQG 44      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01DUUJ5  In-frame        IGHV3-21*01     IGHD1-7*01      IGHJ4*02        GYTFSTYS        23      ISSSSAYR        23      CARDIRLELRDWGQG 44      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01E1AIR  Out-of-frame    IGHV4-39*01     N/A     IGHJ3*01        WGLHRRW**L      29      FVS*RAPR        23                      NOT_FOUND
 OUTPUT: HL67IUI01CCZ8D  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        GFTFDKYA        23      ILASGR  20      YCASEGDIVASELLSTGARE    58      MOTIF_FOUND_IN_DNA
 OUTPUT: HL67IUI01BT9IR  N/A     IGHV3-21*02     N/A     N/A                                                     NOT_FOUND
 OUTPUT: HL67IUI01COTO0  Out-of-frame    IGHV4-39*01     N/A     IGHJ3*01        GGFIGGGDNF      29      LYHDGRPA        23                      NOT_FOUND
 OUTPUT: HL67IUI01D994O  In-frame        IGHV7-4-1*02    IGHD2-21*02     IGHJ5*02        GYSITDYG        23      LNTRTGNP        23      CAVKDARDFVSWGQG 44      MOTIF_FOUND_IN_BOTH
 OUTPUT: HL67IUI01A08CJ  In-frame        IGHV4-39*01     IGHD6-13*01     IGHJ5*02        GGSISSSSYY      29      IYYTWEH 21      CERARRGSSWGQLVRPLGPG    62      MOTIF_FOUN


 OUTPUT: # ID    VDJ Frame       Top V Gene      Top D Gene      Top J Gene      CDR1 Seq        CDR1 Length     CDR2 Seq        CDR2 Length     CDR3 Seq        CDR3 Length     CDR3 Found How
 OUTPUT: HL67IUI01D26LR  In-frame        IGHV3-30*04     IGHD3-16*01     IGHJ4*02        GFTFNTYA        23      ISYDGSNK        23      CTRDMSPIMTTFAGNYWGQG    59      MOTIF_FOUND_IN_BOTH
 ...etc...

=head4 Also, combined grep & sed:

 $ ./IGBLAST_simple.pl igBLASTOutput.txt  | grep OUTPUT: | sed 's/OUTPUT:\t//'
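
If the OUTPUT: lines are to be consumed by another Perl script rather than by grep and sed, a minimal sketch might look like the following.
It assumes (this is not confirmed by the text above) that each line starts with "OUTPUT:" and a tab, that fields are tab-separated and that the header line starts with "# ".
The sample lines and all variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample lines, rebuilt with explicit tabs (assumed layout, see above)
my @lines = (
    join("\t", "OUTPUT:", "# ID", "VDJ Frame", "Top V Gene"),
    "D: ID is: 'HL67IUI01D26LR'",
    join("\t", "OUTPUT:", "HL67IUI01D26LR", "In-frame", "IGHV3-30*04"),
);

my (@header, @rows);
for my $line (@lines) {
    next unless $line =~ s/^OUTPUT:\t//;     # skip the "D:" debug lines
    my @fields = split /\t/, $line, -1;      # -1 keeps trailing empty fields
    next unless @fields;
    if ($fields[0] =~ s/^#\s*//) {           # header line: remember the column names
        @header = @fields;
        next;
    }
    my %row;
    @row{@header} = @fields;                 # hash slice: column name => value
    push @rows, \%row;
}

print "$_->{'ID'} uses $_->{'Top V Gene'}\n" for @rows;
```

Keeping each row as a hash keyed by column name means the consumer does not depend on the column order.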

=cut

=head3 CDR3 Patterns:

We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly:

 my $DNACDR3_Pat = "TGGGG....GG.";
 my $AASequenceMotifPattern = "WG.G";

They are treated as regexes when tested (so use "." to mean any DNA base, rather than 'N' or 'X').

[NB: These are original patterns used for testing, check the code for the current ones.]
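
As a rough illustration of how the two patterns behave as regexes, the sketch below applies them to the end of the example read and its translation (both copied from the Example Input).
The variable names other than the two pattern names are invented here, and this is not the code used by markUpCDR3.

```perl
use strict;
use warnings;

# The two patterns quoted above
my $DNACDR3_Pat = "TGGGG....GG.";
my $AACDR3_Pat  = "WG.G";

# End of the example read (DNA) and its translation
my $dna = "AACTACTGGGGCCAGGGAACCCTGGTCACC";
my $aa  = "NYWGQGTLVT";

# Each pattern is interpolated into a regex, so "." matches any single character.
# $-[1] holds the start offset of the first capture group of the last match.
my ($dna_motif) = $dna =~ /($DNACDR3_Pat)/;
my $dna_offset  = defined $dna_motif ? $-[1] : -1;
my ($aa_motif)  = $aa =~ /($AACDR3_Pat)/;
my $aa_offset   = defined $aa_motif ? $-[1] : -1;

print "DNA motif: $dna_motif at offset $dna_offset\n";
print "AA motif:  $aa_motif at offset $aa_offset\n";
```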

=cut

my $DNACDR3_Pat = "TGGGG....GG.";
my $AACDR3_Pat =  "WG.G";

use strict;
use Data::Dumper;
# Set this as to number of the result (aka "record") you want to process or 0 for all:
my $ProcessRecord =0;
if (defined $ARGV[1])	{	$ProcessRecord = pop @ARGV;	}	#Also accept from the command line:
if ($ProcessRecord != 0)	{	print "D: Request to process just record '$ProcessRecord' received\n";	}

#Adjust the record separator:
$/="Query= ";
my $Record=0;	# A simple counter, that we might not use.
#Force-loaded header / version information:
my $Header = <>;
#At the moment we don't use this - so dump it immediately:
$Header = undef;
#print "D: Force-loaded header / version information: '$Header'\n";

#Print the Header for the output line (we need this once, at the start)
print &printOUTPUTData ({"HEADER" => 1})."\n";

while (<>)
	{
=head4 First check - should we be processing this record at all?

=cut
	$Record++;	#Increment the record counter:
	#Do we process this record - or all records?
	if ($ProcessRecord != $Record && $ProcessRecord != 0)
		{	next; }	#The record counter must be incremented before we skip

=head4 Setup the output line storage and print the header:

We set this initially and then work to improve on it:

 $OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND";

=cut

	my %OUTPUT_Data;	#To collect data for the output line in
	#Assume the first and work to find better:
	$OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND";
	#The whole record - one per read - is now stored in $_
	my @Lines =split (/[\r\n]+/,$_);	# split on windows/linux/mac new lines

	#If you are interested, enable either of the next lines depending on how curious you are as to how the splitting went:
	#print "D: Record #$Record\n";	print $_;	print "\n---------\n";
	#print "D:  '$Lines[0]'\nD:  ...etc...'\nD:  ############\n";

=head3 Get the ID

Quite easy: the first field on the first line:

 Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1
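
Because the record separator is set to "Query= ", each record read starts directly with the ID.
The sketch below shows the idea on a cut-down report held in memory; the report text is sample data and the in-memory filehandle is only a convenience for the example (the script itself reads from the file named on the command line).

```perl
use strict;
use warnings;

# A cut-down report held in memory (sample text, not a complete igBLAST record)
my $report = join "\n",
    "BLASTN 2.2.27+",
    "",
    "Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1",
    "Length=433",
    "",
    "Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1",
    "Length=609",
    "";

open my $fh, '<', \$report or die "Cannot open in-memory handle: $!";
my @ids;
{
    local $/ = "Query= ";      # same record separator as the script
    my $header = <$fh>;        # text before the first record; discarded
    while (my $record = <$fh>) {
        my ($id) = $record =~ m/^(\S+)/;   # first field of the first line
        push @ids, defined $id ? $id : "Unknown";
    }
}
close $fh;
print "$_\n" for @ids;
```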

=cut

	(my $ID) = $Lines[0]=~ m/^(\S+)/;
	unless (defined $ID && $ID ne "")
		{			# So a near total failure...?
		$OUTPUT_Data{"ID"} = "Unknown";
		print &printOUTPUTData (\%OUTPUT_Data)."\n";
		next;	#No ID is terminal for this record
		}
	else
		{
		print "D: ID is: '$ID'\n";
		$OUTPUT_Data{"ID"} = $ID;
		}
=head3 Declare the variables we will need here in the next few sections to store data

=cut

	my $CurrentRegion;
	my $RegionMarkup;

	#So we can sync the coordinates of the alignment up to the domains found:
	my $Query_Start = -1; my $Query_End = -1;

	#Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop?
	my $ThisQueryStart =-1; my $ThisQueryEnd =-1;	#Think $ThisQueryEnd isn't used at the moment.
	my $DNAQuerySequence ="";		#The actual DNA Query sequence...
	my $AAQuerySequence = "";

	#As this changes with the alleles identified:
	my $CurrentAASequence;
	#The main storage variables

	my %Alginments;  my %Alleles;
	my %DomainBoundaries;

=head2 Stanza 1: Get the general structure of the sequence identified

=head3 Method 1: Use the table supplied

Technically this is only valid for the top hit... realistically it is the only information reported to us,
so we use this or nothing.  This is fine for the top hit, which is likely what we are interested in, but for the 2nd or 3rd?  Who knows!

Targets this block:

 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
 FWR1	167	240	75	72	2	1	96
 CDR1	241	264	24	20	4	0	83.3
 FWR2	265	315	51	48	3	0	94.1
 CDR2	316	336	24	15	6	3	62.5
 FWR3	337	450	114	106	8	0	93
 CDR3 (V region only)	451	454	4	4	0	0	100
 Total	N/A	N/A	292	265	23	4	90.8

Then we split out the lines inside it in a second scanning step - less optimal but easier to read:

 FWR1	167	240	75	72	2	1	96
 CDR1	241	264	24	20	4	0	83.3
 FWR2	265	315	51	48	3	0	94.1
 CDR2	316	336	24	15	6	3	62.5
 FWR3	337	450	114	106	8	0	93
 CDR3 (V region only)	451	454	4	4	0	0	100

into:

 (Section, from, to, length, matches, mismatches, gaps, percent identity)
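
A minimal sketch of that second scanning step, assuming the rows are tab-separated as in the block above.
The hash name %boundaries and the sample rows are for illustration only; the script's own store is %DomainBoundaries.

```perl
use strict;
use warnings;

# Three rows of the summary table, rebuilt with explicit tabs
my @rows = (
    join("\t", "FWR1", 167, 240, 75, 72, 2, 1, 96),
    join("\t", "CDR1", 241, 264, 24, 20, 4, 0, 83.3),
    join("\t", "CDR3 (V region only)", 451, 454, 4, 4, 0, 0, 100),
);

my %boundaries;
for my $row (@rows) {
    # (Section, from, to, length, matches, mismatches, gaps, percent identity)
    my ($section, $from, $to, $length, $matches, $mismatches, $gaps, $pid) = split /\t+/, $row;
    $boundaries{$section} = { Start => $from, End => $to };
}

# Note that the CDR3 row is stored under its full label, not under "CDR3"
for my $section (sort keys %boundaries) {
    print "$section: $boundaries{$section}{Start}-$boundaries{$section}{End}\n";
}
```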

=head3 Method 2: Use the region markup in the alignment

The other way to do this is to split the graphical markup out of the alignment.
This works for _any_ reported alignment, not just the top hits:

In the main alignment table processing section, collect the information:

 #Is region mark-up:
	if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
		{
#		print ": Region Markup detected\n";
		$RegionMarkup = $RegionMarkup.$AlignmentPanel;	#Collect the information, then re-synthesise it at the end of record
		next;
		}

Then afterwards, when all of the region markup has been collected, process it like this:

 #Pad the CDR3 region:

 #Remove the trailing spaces:
	$RegionMarkup =~ s/ *$//g;
 #Calculate the length of the CDR3 region so we can add it in:
	my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1;
 #Build up the CDR3 region, the 'x' operator is very helpful here (implicit foreach loop):
	$RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">";
 #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n";

 #Now really process it:
 	my $C_Pos = 0;
	my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup);	#
	foreach my $C_Domain (@Domains)
		{
			if (length ($C_Domain) <=0)	{next;}
		my $DomainStart= $C_Pos;
		my $DomainEnd = $DomainStart + length ($C_Domain)-1;
		my ($DomainType) = $C_Domain =~ m/(...[123])/;
#		print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n";
		$DomainBoundaries{$DomainType}{"Start"} 	= $DomainStart;
		$DomainBoundaries{$DomainType}{"End"} 		= $DomainEnd;
		$C_Pos = $DomainEnd+1;
		}

The two pieces of code are interchangeable; the table version, as used below, is neater, easier to understand and works nicely.
Why stress?


=head3 The end of the FWR3 is the start of CDR3?

This is an assumption made.  Hence the two variables:

	my $MaxDomainBaseFound 	=0 	;	# In nts / bps
	my $AlignmentStart			;	# In nts / bps; undefined until primed

=cut
	my $MaxDomainBaseFound 	=0 	;	# In nts / bps
	my $AlignmentStart			;	# In nts /bp	#Alternative name would be: '$MinDomainBaseFound'; set to null until primed
#	my $FWR3_Found_Flag = 0;	# Did we find the end of the FWR3 - which is the start of the CDR3.  Set to 'false' initially.

	(my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" );
#Enable the next line if you want the raw data we are going to parse in this section:
	#print Dumper @StructureSummaryTable;
	foreach my $C_Section (@StructureSummaryTable)
		{
		my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section);
		#print "D: Domain type: '$DomainType'\n";
		#$DomainType =~ s/ .*$//g;
		$DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
		$DomainBoundaries{$DomainType}{"End"} 	= $DomainEnd;

#So we can do a reality check on the length / start of the CDR3 if we have to go looking:
		if ($MaxDomainBaseFound <= $DomainEnd)
			{ $MaxDomainBaseFound = $DomainEnd; }	#Store the maximum base found
		if (!defined $AlignmentStart or $AlignmentStart >= $DomainStart)	#Prime on first use, then keep the minimum
			{	$AlignmentStart = $DomainStart;	}
		}
#print Dumper %DomainBoundaries;
#die "HIT BLOCK\n";

=head3 Did we find the CDR3 region specifically?

If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up.

=cut
	print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n";

#my @WantedSections = qw (V D J);

=head2 Second Stanza: Parse the main Alignment Table

=head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels.

As this is a little involved and compartmentalises nicely we sub-contract this to two functions:

	(my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
	my $PanelSplitPoint = findSplitPoint (\@Table);	#Why can't they just use a fixed field width or a tab as a delimiter?

=cut
	(my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
	my $PanelSplitPoint = findSplitPoint (\@Table);	#Why can't they just use a fixed field width or a tab as a delimiter?
#If you are interested, enable this line:
#	print "D: The info panel was detected at: '$PanelSplitPoint'\n";

=head3

=cut


foreach my $C_Line (0..$#Table)
	{

=head3 Call the line type we find: There are 4:

These are distinguished by the number of fields (one or more spaces is a field separator) in the Info & Alignment Panels (see values in brackets)

									     | <- This split is ~40 chars. from the start of the line
	* InfoPanel *						 |    * Alignment Panel *
																														: is a "Blank" line  		(-1,-1)
                                         <----FWR1--><----------CDR1--------><-----------------------FWR2------			: is "Region Markup"		(-1,0)
                                          W  A  A  S  G  F  T  F  N  T  Y  A  V  H  W  V  R  Q  A  P  G  K  G  			: is "AA Sequence"			(-1, >=0)
                     Query_1        27   TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC  96		: is "DNA Sequence"			(2,1)
 V  95.3% (222/233)  IGHV3-30*04    64   ..T......................G..G.......A.................................  133	: is "Alignment"			(4,1)

So we split 40 chars in and then the two parts on spaces.
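
The classification can be sketched as a small function that follows the field counts the script tests for.
The function name classify_line, the fixed split point of 40 and the shortened sample lines are assumptions made for this example; the script detects the split point with findSplitPoint and does the tests inline.

```perl
use strict;
use warnings;

# Classify one line of the alignment table by the highest field index ($#array)
# found in each panel after splitting on whitespace.
sub classify_line {
    my ($line, $split_point) = @_;
    my $info = substr($line, 0, $split_point);
    my $aln  = length($line) > $split_point ? substr($line, $split_point) : "";
    my @info_cols = split /\s+/, $info;
    my @aln_cols  = split /\s+/, $aln;

    return "Blank"         if $#info_cols == -1 && $#aln_cols == -1;
    return "Region Markup" if $#info_cols == -1 && $#aln_cols == 0;
    return "AA Sequence"   if $#info_cols == -1;   # must come after the markup test
    return "DNA Sequence"  if $#info_cols == 2 && $#aln_cols == 1;
    return "Alignment"     if $#info_cols == 4 && $#aln_cols == 1;
    return "Unknown";
}

my $split = 40;    # assumed here
my @samples = (
    "",
    (" " x $split) . "<----FWR1--><----------CDR1-------->",
    (" " x $split) . " W  A  A  S  G  F  T  F",
    sprintf("%-${split}s%s", (" " x 20) . "Query_1        27", "TGGGCAGCCTCTGGATTCACCTTC  50"),
    sprintf("%-${split}s%s", "V  95.3% (222/233)  IGHV3-30*04    64", "..T.....................  87"),
);
my @types = map { classify_line($_, $split) } @samples;
print "$_\n" for @types;
```

An all-space panel splits into an empty list, which is why the info panel of the markup and AA lines reports -1.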


=cut

#	print "D: (sub) Line in parsed table: '$C_Line': \n";

	my ($InfoPanel, $AlignmentPanel) 	=	$Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/;

	my @InfoColumns = split (/\s+/,$InfoPanel);
	my @AlignmentColumns = split (/\s+/,$AlignmentPanel);

#If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first
#	print "D: Line: $C_Line\t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n";
#	print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n";

#Populate this so we can step through it

=head4 Is a blank line:
=cut
		if ($#InfoColumns == -1 && $#AlignmentColumns == -1)
		{
#		print ": Blank\n";
		next;
		}	#For now I think we just skip - is not needed (though might be implicit mark-up)

=head4 Is region mark-up:
=cut
	if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
		{
#		print ": Region Markup detected\n";
		$RegionMarkup = $RegionMarkup.$AlignmentPanel;	#Collect the information, then re-synthesise it at the end of record
		next;
		}
=head4 Is query DNA Sequence:
=cut
	if ($#InfoColumns == 2 && $#AlignmentColumns ==1)
		{
#		print ": DNA Query Sequence\n";
		#Detect the two coordinates of alignment against the query sequence: (last two numbers of the two 'panels')
		($ThisQueryStart)  	= $InfoPanel 			=~ / (\d+) *$/;
		($ThisQueryEnd)	 	= $AlignmentPanel   	=~ / (\d+) *$/;
		my ($ThisDNASeq) 	= $AlignmentPanel 		=~ /^(.*?) /;
	#If you want to know what we just found:
		#print "D: This DNA Sequence: '$ThisDNASeq'\n";
		$DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq;	#Add it on to whatever we already have.
		#Move the needle if there are smaller / greater; otherwise prime the 'needles':
		if ($ThisQueryStart < $Query_Start or $Query_Start == -1)
			{	$Query_Start = $ThisQueryStart;	}
		if ($ThisQueryEnd > $Query_End or $Query_End == -1)
			{	$Query_End = $ThisQueryEnd;	}
#		print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n";
		next;
		}
=head4 Is AA Sequence:

This is complicated, as we need to decide whether this is the sequence of the read or that of the original V / D / J regions:
                                       -------------->
                                         A  V  Y  Y  C  T  R  D  M  S  P  I  M  T  T  F  A  G  N  Y  W  G  Q  					<< Want this
                    Query_1        237  GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG  306
 V  95.3% (222/233)  IGHV3-30*04    274  .....G.........G.......-----------------------------------------------  296
                                         A  V  Y  Y  C  A  R
 V  95.7% (221/231)  IGHV3-30-3*01  274  .....G.........G.....-------------------------------------------------  294

 ...etc...
                                        G  T  L  V  T  V  S  S  																<< Want this
                    Query_1        307  GAACCCTGGTCACCGTCTCCTCAG  330

To solve this we peek at the next line to check that it has the tag "Query" in it (we assume the line exists...)
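
A minimal sketch of this look-ahead, using shortened sample lines.
The AA-line test here is a simplistic regex invented for the example (the script uses the panel field counts instead), and the sketch strips the spaces, whereas the script keeps the spacing of the AA line.

```perl
use strict;
use warnings;

# Shortened sample lines: only the first translation belongs to the read
my @table = (
    "  A  V  Y  Y  C  T  R",
    "Query_1  237  GCTGTTTATTACTGTACGAGA  257",
    "V  IGHV3-30*04  274  .....G.........G.....  294",
    "  A  V  Y  Y  C  A  R",
    "V  IGHV3-30-3*01  274  .....G.........G.....  294",
);

my $aa = "";
for my $i (0 .. $#table) {
    # Simplistic AA-line test (assumption): indented, single letters separated by spaces
    next unless $table[$i] =~ /^\s+[A-Z*](?:\s+[A-Z*])*\s*$/;
    # Keep the translation only if the following line is the query DNA
    next unless defined $table[$i + 1] && $table[$i + 1] =~ /Query/;
    (my $residues = $table[$i]) =~ s/\s+//g;
    $aa .= $residues;
}
print "Read translation: $aa\n";
```

The defined test guards the case that the original assumes away, namely an AA line that is the last line of the table.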

=cut

		if ($#InfoColumns == -1 && $#AlignmentColumns >=-1)
		{
		unless ($Table[$C_Line+1]	=~	/Query/)	{	next;	}	#Is the next line the DNA sequence ?
		#
#		print ": AA sequence\n";


		$CurrentAASequence = $AlignmentPanel;
		#print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n";
		$CurrentAASequence =~ s/^ {$PanelSplitPoint}//;
		#print "D: '$AAQuerySequence'\n";
#		print "D: Current AA Sequence: \t'$CurrentAASequence'\n";
		$AAQuerySequence	=	$AAQuerySequence.$CurrentAASequence;	#Store the elongating AA Sequence as well
		next;
		}
=head4 Is Alignment:
=cut
		if ($#InfoColumns == 4 && $#AlignmentColumns ==1)
		{
		#Not actually interesting to us for this version of the parser.  Delete ultimately?
		next;
		}

#Is weird!  Don't recognise it!

	warn "Weird!  Don't recognise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",substr ($Table[$C_Line],0,15),"...'\n";
	}	#End main iteration loop for alignment parsing.


=head2  The CDR3 is noted as problematic.  Can we identify it?

=cut
	print "D: Starting Search for CDR3\n";
	#Do we have the end of the FWR3 but not the CDR3?  If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point
	if (exists ($DomainBoundaries{"FWR3"}{"End"})
		&& $AlignmentStart !=0
		&& not (exists $DomainBoundaries{"CDR3"}{"End"}) )	#Guess we need to go looking for the end then...
		{
		#print "D: Placing call to markUpCDR3\n";
		my ($CDR3_Start, $CDR3_End, $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
			$DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart,
			$DNACDR3_Pat, $AACDR3_Pat);
		if ($CDR3_Start !=0 && $CDR3_End !=0)
			{
			$DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start;
			$DomainBoundaries{"CDR3"}{"End"} = $CDR3_End 	;
			$DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag;
			print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n";
			}
		else
			{	print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n";
				$DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
			}
		}
	else
		{	#Was reported by igBLAST
		print "D: Found the FWR3 from the Domain Boundary Table\n";
		$DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE";
		}

#print Dumper %DomainBoundaries;

=head2 Get the top VDJ regions:

=cut

=head2 Extract General Features:
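
The top hits are taken from the line that follows the "V(D)J rearrangement summary" heading and are split on tabs.
A minimal sketch, assuming the six tab-separated fields shown in the Example Input; the lower-case variable names are for illustration only:

```perl
use strict;
use warnings;

# The summary heading and its data line, rebuilt with explicit tabs
my $record = join "\n",
    "V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand):",
    join("\t", "IGHV3-30*04", "IGHD3-16*01", "IGHJ4*02", "VH", "In-frame", "+"),
    "",
    "";

# Capture the single line after the heading (non-greedy, up to the next newline)
my ($top_hit) = $record =~ m/V-J Frame, Strand\):\n(.*?)\n/s;
die "No rearrangement summary found\n" unless defined $top_hit;

my ($v, $d, $j, $chain, $frame, $strand) = split /\t/, $top_hit;
print "V=$v D=$d J=$j chain=$chain frame=$frame strand=$strand\n";
```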

=cut
	(my $TopHit) = $_ =~  m/V-J Frame, Strand\):\n(.*?)\n/s;
	print "D: Top Hits (raw)= '$TopHit' \n";
	my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit);
	print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n";

=head2 Store the V / D / J Genes used

=cut

	if (defined $Top_V_gene_match && $Top_V_gene_match ne "")
	{	$OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match;		}

	if (defined $Top_D_gene_match && $Top_D_gene_match ne "")
	{	$OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match;		}

	if (defined $Top_J_gene_match && $Top_J_gene_match ne "")
	{	$OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match;		}

	if (defined $Strand && $Strand ne "")
	{	$OUTPUT_Data{"Strand"} = $Strand;}

=head4 Preamble: ID, Frame, and V / D / J used:

=cut
	#Do a reality check: if we didn't get an ID, then skip:
	unless (defined $ID && $ID ne "" &&
			defined $VJFrame && $VJFrame ne "")
			{
			print &printOUTPUTData (\%OUTPUT_Data)."\n";
			next;
			}

#Ok, so we have data...most likely:
	#print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match);

	if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "")
		{	$OUTPUT_Data{"VDJ Frame"} = $VJFrame;}
		else
		{
		print &printOUTPUTData (\%OUTPUT_Data)."\n";
		 next;
		}#REALLY?  We didn't find anything?  Oh well, move to next record

=head4 CDR1
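
The AA line in the report uses three characters per residue, so nucleotide coordinates (less the alignment offset) can index straight into the collected AA string.
The sketch below reproduces this for the CDR1 of the example record, using the coordinates from its summary table (FWR1 starts at 27, CDR1 runs 39 to 62); the variable names are invented for the example.

```perl
use strict;
use warnings;

# Start of the AA line for the example read: one leading space, then one
# residue every three characters, as laid out in the report
my $aa_row = " " . join("  ", qw(W A A S G F T F N T Y A V H W));

my ($fwr1_start, $cdr1_start, $cdr1_end) = (27, 39, 62);
my $cdr1_length = $cdr1_end - $cdr1_start;     # 23, computed as in the script

# Shift the nucleotide coordinate by the alignment offset, then cut and tidy
my $cdr1 = substr($aa_row, $cdr1_start - $fwr1_start, $cdr1_length);
$cdr1 =~ s/ //g;

print "CDR1: $cdr1 (length reported: $cdr1_length)\n";
```

This agrees with the "GFTFNTYA" and 23 shown for this record in the Example Output.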

=cut
	#Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we subtract this off all future AA (& DNA coordinates)

	my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"};

#	print "D: AA Sequence is: '$AAQuerySequence'\n";
	if (exists $DomainBoundaries{"CDR1"}{"Start"})	#It is very possible that it doesn't; assume the End does though if we find the Start
		{
#		my $VRegion = $Alginments{"V"}{$C_VRegion};	#Convenience....
		my $CDR1Start 		= $DomainBoundaries{"CDR1"}{"Start"};
		my $CDR1End 		= $DomainBoundaries{"CDR1"}{"End"};
		my $CDR1_Length 	= $CDR1End - $CDR1Start;
#		print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n";
	#Remember that the alignment starts at the FWR1 start, not nt =0 on the read
		my $CDR1_Seq_AA 	= substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length);
#		print "D: '$CDR1_Seq_AA'\n";
		$CDR1_Seq_AA 	=~ s/ //g;
		my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA);
		#Add this data to the output store specifically:
		$OUTPUT_Data{"CDR1 Seq"} 	= $CDR1_Seq_AA;
		$OUTPUT_Data{"CDR1 Length"} = $CDR1_Length;
		}
	#What happens if there is no CDR1 found?  Leave blank - the output routine can handle this

=head4 CDR2

=cut

	if (exists $DomainBoundaries{"CDR2"}{"Start"})	#It is very possible that it doesn't; assume the End does though if we find the Start
		{
#		my $VRegion = $Alginments{"V"}{$C_VRegion};	#Convenience....
		my $CDR2Start 		= $DomainBoundaries{"CDR2"}{"Start"};
		my $CDR2End 		= $DomainBoundaries{"CDR2"}{"End"};
		my $CDR2_Length 	= $CDR2End - $CDR2Start;
		my $CDR2_Seq_AA 	= substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length);
		 $CDR2_Seq_AA 	=~ s/ //g;
		my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA);
		#Add this data to the output store specifically:
		$OUTPUT_Data{"CDR2 Seq"} 	= $CDR2_Seq_AA;
		$OUTPUT_Data{"CDR2 Length"} = $CDR2_Length;
		}
	#What happens if there is no CDR2 found?  Leave blank - the output routine can handle this.

=head4 CDR3

=cut
	if (exists $DomainBoundaries{"CDR3"}{"Start"})	#It is very possible that it doesn't; assume the End does though if we find the Start
		{
#		my $VRegion = $Alginments{"V"}{$C_VRegion};	#Convenience....
		my $CDR3Start 		= $DomainBoundaries{"CDR3"}{"Start"};
		my $CDR3End 		= $DomainBoundaries{"CDR3"}{"End"};
		my $CDR3_Length 	= $CDR3End - $CDR3Start; # Still needed by the two substr calls below; it is no longer what is reported as "CDR3 Length"
		my $CDR3_Seq_AA 	= substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
		my $CDR3_Seq_DNA	= substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
		$CDR3_Seq_AA 	=~ s/ //g;
		$CDR3_Seq_DNA 	=~ s/ //g;
		my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA);
		my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA);
		#Add this data to the output store specifically:
		$OUTPUT_Data{"CDR3 Seq"} 	= $CDR3_Seq_AA;
		$OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
		$OUTPUT_Data{"CDR3 Seq DNA"}     = $CDR3_Seq_DNA;
		$OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length;
		#And in the case of the CDR3 how we found it:
		$OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"};
		}
	#What happens if there is no CDR3 found?  Leave blank - the output routine can handle this.
#die "HIT BLOCK\n";
#End of the record; output the data we have collected and move on.
print &printOUTPUTData (\%OUTPUT_Data)."\n";
}


############
sub returnLinesBetween {
=head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section})

When passed a reference to an array and two strings - interpreted as regexes - will return the lines of the array
that are bounded by these tags.

If either of the tags is not found - or they are found in the wrong order - then a null list is returned.
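
As a rough illustration of the behaviour described above (and not the routine itself), here is a minimal sketch. The helper name lines_between and the sample report lines are made up for the example, and it assumes the bottom tag has to appear after the top tag.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines strictly between the first line
# matching $top_re and the first later line matching $bot_re.
# Returns an empty list if either tag is missing.
sub lines_between {
    my ($text_ref, $top_re, $bot_re) = @_;
    my ($top, $bot);
    for my $i (0 .. $#{$text_ref}) {
        if (!defined $top) {
            $top = $i if $text_ref->[$i] =~ m/$top_re/;
        }
        elsif ($text_ref->[$i] =~ m/$bot_re/) {
            $bot = $i;
            last;
        }
    }
    return unless defined $top && defined $bot;
    return @{$text_ref}[$top + 1 .. $bot - 1];
}

# Made-up report lines, purely for demonstration:
my @report = ("header", "Alignments", "row 1", "row 2", "Lambda", "footer");
my @table  = lines_between(\@report, '^Alignments', '^Lambda');
print join("|", @table), "\n";
```

Using undef rather than 0 as the "not found" marker means a tag on the very first line is still recognised.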

=cut

my ($Text_ref, $TopTag, $BotTag) = @_;

my @Table;
#The two boundary conditions at which we will cut the table:
#print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n";
#How we record these:
my $AlignmentLine_Top=0;  my $AlignmentLine_Bot=0;

my $LineIndex=-1;	#-1 As the loop increments this line counter first, then does its checks.
#If you care:
#print "D: Lines of text passed: $$#Lines\n";

#Iterate through until we find what we are looking for or run out of text to search:
while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex < $#{$Text_ref})	#Strictly less than: the index is incremented before it is used
	{
	$LineIndex++;
	#Enable if you need to care:
#	print "D: Line Index = $LineIndex\n";

	if ($$Text_ref[$LineIndex] =~ m/$TopTag/)
		{
		$AlignmentLine_Top = $LineIndex;
#		print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n";	#Enable if you are interested
		}
	if ($$Text_ref[$LineIndex] =~ m/$BotTag/)
		{
		$AlignmentLine_Bot = $LineIndex;
#		print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n";	#Enable if you are interested
		}
	}
#Reality check: did we find anything?  If not then we return null.
if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0)
		{	return;	}
#Again, enable if you care:
#print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n";

#We want the lines one down and one up - so polish these.
$AlignmentLine_Top++; $AlignmentLine_Bot--;

#Return as an array slice:
return 	(@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]);
}
############

sub findSplitPoint
{
=head2 sub: $PanelBoundaryCharacter = findSplitPoint (\@Table)

When passed a table with the alignment in it, makes an educated guess as to the precise split point that
separates the 'info' and 'alignment' panels.
This is a right olde faff because the field / panel boundaries change.

 '                    Query_6      167  GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG  235'
 '                    Query_6      236  CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG  305'
 '                    Query_6      306  GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC  372'
 '                    Query_6      373  TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT  442'
 '                    Query_6      443  ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC  512'

Query lines such as these are picked out by their DNA field, i.e. it has to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alignment proper), and the line has to have 4 fields in total.

Returns either -1 or the location of the panel boundary; issues a warning and returns -1 if that is the most frequent boundary
because the pattern match has been failing more often than it succeeded.
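
The "most frequent boundary" idea can be sketched as below. This is only an illustration: the helper name most_frequent_key and the counts are invented, and breaking ties in favour of the smaller position is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: return the key with the highest count, or -1 for
# an empty tally. Ties go to the numerically smaller key so that the
# result is deterministic.
sub most_frequent_key {
    my ($count_ref) = @_;
    my ($best) = sort { $count_ref->{$b} <=> $count_ref->{$a} || $a <=> $b }
                 keys %{$count_ref};
    return defined $best ? $best : -1;
}

# Made-up tally: position 38 seen 5 times, 48 once, 2 failed matches (-1)
my %panel_bounds = (38 => 5, 48 => 1, -1 => 2);
print most_frequent_key(\%panel_bounds), "\n";
```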

=cut
#A rough guess is 38 for normal sequences, 48 for reversed ones:

my $SplitPos = 0;

(my $Table_ref) = @_;	#Get the reference to the table
my @DNALines;	#We populate this for mining in the next section
foreach  my $C_Line 	(@{$Table_ref})
	{
	#print "D: $C_Line\n";
#	(my $SplitLine) = $C_Line;
	#Split on consecutive tabs or spaces:
	my @LineFields = split (/[\t\s]+/,$C_Line);
	#print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n";
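	unless (	$#LineFields==4				#Check the field count first so that $LineFields[3] is known to exist
			&& 	$LineFields[3] !~ m/\./		#No "." characters: those belong to the alignment proper
			&& 	$LineFields[3] =~ m/[ATGCX]{20,}/)
				{	next;	}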
#Enable if you want to know the lines we think are the DNA Query strings:
	#print "D: DNA Line:          '$C_Line'\n";
	push @DNALines, $C_Line;		#Note it down
	}

my %PanelBounds;	#Will contain the positions of the panel boundaries

foreach my $C_DNALine (@DNALines)
	{
	#print "D: '$C_DNALine'\n";
	#Match the DNA string and the indexing numbers afterwards, allow gap characters; record -1 if the pattern fails:
	my $MatchPos = ($C_DNALine =~ m/[ATGCX-]+  \d+$/) ? $-[0] : -1;	#$-[0] is the start of the match; used because we can't get the index() function to work
	#(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0);
	#print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n";
	$PanelBounds{$MatchPos}++;
	}
#Rank the boundary positions by how often they were seen and take the most frequent (will offer some resistance to the occasional pattern failure)
#The brackets around "($SplitPos)" are necessary: they give list context so that the first element of the sorted list is assigned.
($SplitPos) = (sort { $PanelBounds{$b} <=> $PanelBounds{$a} } keys %PanelBounds);
$SplitPos = -1 unless defined $SplitPos;	#No DNA lines were found at all
#If you want
#print Dumper  %PanelBounds;
#Tell people if we are having difficulty:
if ($SplitPos == -1)	{	warn "Couldn't identify the panel boundaries\n";	}
#print "D: $SplitPos: Returning the split position of: '$SplitPos'\n";
return $SplitPos;
}


##
#
#
###


#####
#
#
#####
sub markUpCDR3
{
=head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex)

Tries to identify the end of the CDR3 using the DNA and AA sequence patterns supplied.  The CDR3 is assumed to start
at the end of the FWR3.
To reduce false-positive matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern.
The position of the first matching pattern is reported.

=head4 Fuller Usage:

 my ($CDR3_Start, $CDR3_End, $CDR3_FoundHow) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
 			$DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"},
 			$DNACDR3_Pat, $AACDR3_Pat);


=head4 Returned Values

If the CDR3 was found then we signal like this:

 $MotifFound ==0 	: Nope, didn't find either motif
 $MotifFound ==1 	: Found at the DNA level, not the AA level
 $MotifFound ==2 	: Found at the AA level, not the DNA level
 $MotifFound ==3 	: Found at the AA level & the DNA level

(Also remember that if the FWR3 region couldn't be identified in the sequence there is a further option: not tested; this routine isn't called therefore)

The Start and End returned are from the first successful match; if both levels match (MotifFound==3) the DNA one is used, though hopefully they are the same.
Formally the test order is:

 1) DNA
 2) AA

i.e. DNA bp locations have priority.

Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match).
See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation.
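
For orientation, a small sketch of how the match-position arrays behave. The helper name match_span and the sequence are invented for the example; @- and @+ themselves are standard Perl variables.

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, end) offsets of the first match,
# or an empty list if the pattern does not match.
# $-[0] is the offset where the match starts, $+[0] the offset just past
# its end.
sub match_span {
    my ($text, $pattern) = @_;
    return unless $text =~ m/$pattern/;
    return ($-[0], $+[0]);
}

my $dna = "GCGAGATGGGGCCAGGGA";    # made-up sequence
my ($match_start, $match_end) = match_span($dna, 'TGGGG');
print "match spans $match_start to $match_end\n";
print "text up to the end of the match: ", substr($dna, 0, $match_end), "\n";
```

Because the end offset counts from the start of the string that was tested, anything trimmed off beforehand has to be added back on afterwards.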

=head3 Manipulation of AA patterns

Note that patterns are assumed to need white space inserted between their letters.
This could be a serious limitation.
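
A minimal sketch of the spacing idea, to show where the limitation comes from. The helper name space_out_pattern, the pattern 'WG.G' and the spaced-out translation line are all made up, and the sketch assumes every character of the pattern is a complete regex atom on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: put '\s+' in front of every character of the
# pattern. Assumes each character is a complete regex atom (a literal
# residue or '.'), so a character class such as [ST] would be broken up.
sub space_out_pattern {
    my ($pattern) = @_;
    my $spaced = '';
    $spaced .= '\s+' . $_ for split //, $pattern;
    return $spaced;
}

my $spaced   = space_out_pattern('WG.G');      # gives \s+W\s+G\s+.\s+G
my $aa_track = ' A  R  D  Y  W  G  Q  G  T ';  # made-up translation line

if ($aa_track =~ m/$spaced/) {
    print "motif ends at offset ", $+[0], "\n";
}
```

Anything longer than one character in the pattern (classes, quantifiers, escapes) gets split apart by this approach, which is why it only suits simple motifs.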


=cut

#Get the parameters passed:
my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat)	=	@_;
print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n";


#Setup our return values:
my $Start = 0; my  $End =0;	my $MotifFound = 0;
my $How;	#Literally How the motif was found (or not if blank)


=head4 Prepare the sequences and the patterns for use

Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs

Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working.


We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on
in position calculations.


=cut

#Because igBLAST doesn't always report from the start of the read (primers and things are upstream):

my $CropPoint = $FWR3_End - $FWR1_Start - 1 ;
#print "D: markUpCDR3: Crop point is: '$CropPoint'\n";

#print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n";
#We trim down to the part we expect to find the CDR3 motifs in, leaving an extra 3 nts on to allow for base mis-calling:

my $AA_Trimmed 		= substr ($AA, $CropPoint);
my $DNA_Trimmed 	= substr ($DNA ,$CropPoint);
#print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n";
#print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n";

#This lovely hack is to account for the spaces in the AA sequence, as we can't get "$Var =~ m/$AAPat/x" working
my $AAPat_Spaced = "";	#Start from an empty string so the first concatenation doesn't touch an undefined value
foreach my $C_Char (0..length($AAPat)-1)	#The -1 is because we don't want trailing spaces until the next nt ->  AA translation.
	{	$AAPat_Spaced .= '\s+'.substr ($AAPat,$C_Char,1);	}
#And write this back into the main pattern we were passed:
$AAPat = $AAPat_Spaced;

#temp hack:
#$AA_Trimmed = $AA;
$MotifFound=0;		#So we can record which patterns we found (already declared with the return values above)
my $MotifPositionDNA	=-1;
my $MotifPositionAA		=-1;

#print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n";
=head4 At DNA level: "TGG GGx xxx GGx" [+1]

=cut

#print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";

if ($DNA_Trimmed =~ m/$DNAPat/)
	{
	$MotifPositionDNA = $+[0];	#Just the easiest way to do this in Perl
#	print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n";
	$MotifFound = $MotifFound + 1;
	#Any more matches further on?
	my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA);
#	print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n";
#	print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n";
#	print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
#	print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n";
	if ($LaterString =~ m/$DNAPat/)
		{	print "D: markUpCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+[0], " - which might be worrying\n";		}
	}

=head4 At AA level: "WGxG" [+2]

=cut

if ($AA_Trimmed=~ m/$AAPat/)
	{
	$MotifPositionAA = $+[0];	#Just the easiest way to do this in Perl
	$MotifFound = $MotifFound + 2;
#	print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n";
	(my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA);
#	print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n";

	}

=head4 Assess the results of motif position finding

=cut

#print "D: markUpCDR3: MotifFound  = '$MotifFound'\n";

if ($MotifFound ==0)
	{	return ($Start, $End, $MotifFound);		}	#The easy one really: return we didn't find the CDR3

#
$Start = $FWR3_End;	#We assume the end of the FWR3 is the start of CDR3:
#Just found in DNA:
if ($MotifFound ==1)
	{
	$Start = $FWR3_End;	#We assume the end of the FWR3 is the start of CDR3:
	$End = $MotifPositionDNA;
	$How = "MOTIF_FOUND_IN_DNA";
	}
#Just found in AA:
if ($MotifFound ==2)
	{
	$End = $MotifPositionAA;
	$How = "MOTIF_FOUND_IN_AA";
	}

#Found in both, DNA has priority:
if ($MotifFound ==3)
	{
	$Start = $FWR3_End ;	#We assume the end of the FWR3 is the start of CDR3:
	$End = $MotifPositionDNA;
	$How = "MOTIF_FOUND_IN_BOTH";
	}

#print "D: markUpCDR3: Motif found = $MotifFound\n";

=head4 These next few lines are for testing / diagnostics only - disable for general use

If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that
the start of FWR1 is unlikely to be at nt 1.

=cut

$Start = $FWR3_End - $FWR1_Start -1;
$End = $End + $CropPoint;
my $CDR3_RegionLength = $End - $Start;
#print "D: markUpCDR3: CDR3 Length=  $Start - $End = '$CDR3_RegionLength'\n";
(my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength);

#Add onto the coordinates what we trimmed off:


#print "D: markUpCDR3: Seq ='$CDR3_seq'\n";

print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of '+ $FWR1_Start']\n";
#die "HIT BLOCK\n";
return ($Start + $FWR1_Start, $End + $FWR1_Start, $How);
}


sub printOUTPUTData {
=head2 sub: $OutputDataString = printOUTPUTData (\%OutputData)

When passed a reference to a hash containing the appropriate CDR data, the top V / D / J genes and the sequence ID,
this is prepared and then returned as a text string that can then be printed to STDOUT:

	print (printOUTPUTData (\%OutputData));

Any missing data in the hash is politely ignored and a null string printed in its place.
The text field is tab delimited; there are no extra trailing tabs or carriage returns in place.

The fields printed out, and their order, are stored in an index array (@HeaderFields).
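
A cut-down sketch of the row building, for illustration only. The helper name format_row, the three-column field list and the sample record are invented, and the sketch tests with defined where the routine itself may test differently.

```perl
use strict;
use warnings;

# Hypothetical, cut-down field list; the real routine prints more columns.
my @fields = ("ID", "VDJ Frame", "CDR3 Seq");

# Build one row: a leading tag, then one tab-separated slot per field.
# A missing field leaves its slot empty, so the column count never changes.
sub format_row {
    my ($data_ref) = @_;
    return join "\t", "OUTPUT:",
        map { defined $data_ref->{$_} ? $data_ref->{$_} : "" } @fields;
}

# Made-up record with no "VDJ Frame" entry:
my %record = ("ID" => "read_1", "CDR3 Seq" => "ARDY");
print format_row(\%record), "\n";
```

Keeping the empty slot is what lets downstream tools rely on a fixed column position for each field.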

=head3 Header output

If the routine is passed a key 'HEADER' then the header columns are returned as that string.
This is tested first - so don't add this unless you mean to.

=cut

my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene",
					"CDR1 Seq", "CDR1 Length",
					"CDR2 Seq", "CDR2 Length",
					"CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand",
					"CDR3 Found How");

my $OutputString = "OUTPUT:";	#What we are going to build the output into.

=head4 Print Header & Exit?

=cut

my ($Data_ref) = @_;
#print "D: printOUTPUTData: Running\n";

if (exists $$Data_ref {"HEADER"})
	{
	$OutputString .= "\t";
	for(my $n = 0; $n <= $#HeaderFields; $n++)
	{
		$OutputString .= $HeaderFields[$n];
		$OutputString .= "\t" if($n < $#HeaderFields);
	}

	# foreach my $C_Header (@HeaderFields)
		# {	$OutputString .= "$C_Header";		}	#

	print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n";
	return ($OutputString);
	}

=head3 Assemble whatever data we have - and tab delimit the null fields

=cut
#print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref;
foreach my $C_Header (@HeaderFields)
	{

	if (exists ($$Data_ref {$C_Header}))
		{	$OutputString .= "\t". $$Data_ref{$C_Header};	}	#We have data to print out
		else
		{	$OutputString .="\t";	}	#No data: add just the tab so that the columns stay aligned
	}	#

return ($OutputString);
}


######################################### Code Junk ########################


=head2 Code Junk Attic

=head3 Demonstrates how to reverse translate an amino acid sequence into DNA:

use Bio::Tools::CodonTable;
use Bio::Seq;

# print possible codon tables
  my $tables = Bio::Tools::CodonTable->tables;
  while ( (my $id, my $name) = each %{$tables} ) {
    print "$id = $name\n";
  }
 my $CodonTable   = Bio::Tools::CodonTable->new();

 my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n";


my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq);
print "D: '$rvSeq'\n";
die "TEST OVER\n";

=cut


=head3 For processing the 'Alignment lines' section of the alignment table

		#If we are ever interested; then enable the code below:
#		print ": Alignment\n";
#		$InfoPanel =~ s/^ +//;	$InfoPanel =~ s/ +$//;		#Clean off trailing spaces
#		my ($Germclass, $PID, $PID_Counts, $Allele) 	=	split (/\s+/,$InfoPanel);	#Split on spaces
##Enable if you need to know what we just found:
#		#print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n";
#		#A reality check: we should have an Allele - or some text here.
#		unless (defined $Allele && $Allele ne "")
#			{	warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n";	}
#		if (exists ($Alginments {$Germclass}{$Allele}))
#			{	$Alginments {$Germclass}{$Allele}	=	$Alginments {$Germclass}{$Allele}.$CurrentAASequence;	}	#Carry on adding
#		else	#more work needed as we need to 'pad' the sequence with fake gap characters)
#			{
##Do we still need this padding?  I don't think so
#
#
#			my $PaddingChars = ($ThisQueryStart-$Query_Start);
#			print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n";
#			#To help testing, calculate this first:
#			my $PaddingString = " "x $PaddingChars;
#			$Alginments {$Germclass}{$Allele}	=	$CurrentAASequence;
#			}
#		next

=head3 Demonstration of Pattern match positions

my $Text = "12345TTT   TTAAAAA";
my $TestPat = "TTT\\s+TT";
(my $Result)= $Text =~ m/$TestPat/;
print "D: Two vars are: - = ",$-[0], " &  + =", $+[0]," for test pattern '$TestPat'\n";

sub printCDR3 {

=head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence);

???? IS THIS FUNCTION IN USE ?????

Handles the printing of the output when passed information about the CDR3 region.


The result is returned as a text string in this version, hence use it like this if you want to send it to STDOUT:

 print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n";

#=cut

#Despite the similarity in names, these are all local copies passed to us:

my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_;

#For DNA:
my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End);
my ($CDR_DNA_Length) = length ($CDR_DNA_Seq);

#For AA:
my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End);
my ($CDR_AA_Length) = length ($CDR_AA_Seq);

my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed:
print "D: SUB: printCDR3: As returned: '$ReturnString'\n";
return ($ReturnString);

}

=cut


=head2 Change Log

=head3 Version 1.2

 1) Fixed the 'Process record request' feature [was failed increment in $Record]
 2) Deleted / Deactivated the function 'printCDR3' [wasn't in use; kept if useful for parts].
 	This function is replaced by the more general printOUTPUTData()
 3) A tag for the CDR3 status is now output for every record / read.
    Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found.

=head4 Version 1.3

 1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”,
  which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly.
  - found by Bas Horsman.

=head4 Version 1.3a

 1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to an addition error on the $MotifFound var: it never could == 3)

=cut

=head4 Version 1.4

 1) Now processes files using Mac/Unix/MS-DOS newline characters:

  $_ =~ s/\r\n/\n/g;		#In case line ends are MS-DOS
  $_ =~ s/\r/\n/g;		#In case line ends are Mac
  #The whole record - one per read - is now stored in $_
  my @Lines =split (/\R/,$_);	#Split on new lines

=head4 Version 1.4a

 1) Fixed the length of the CDR3 AA string being reported incorrectly:

 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length;
 to:
 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
author	davidvanzessen
date	Tue, 14 Mar 2017 06:51:38 -0400