annotate fastx_barcode_splitter.pl @ 0:84bbf4fd24c3 draft

Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
author lparsons
date Fri, 08 Nov 2013 09:53:39 -0500
parents
children e7b7cdc1834d
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
1 #!/usr/bin/perl
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
2
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
3 # FASTX-toolkit - FASTA/FASTQ preprocessing tools.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
4 # Copyright (C) 2009 A. Gordon (gordon@cshl.edu)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
5 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
6 # Lance Parsons (lparsons@princeton.edu)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
7 # 3/21/2011 - Modified to accept separate index file for barcodes
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
8 # 4/6/2011 - Modified to cleanup bad barcode identifiers (esp. useful for Galaxy)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
9 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
10 # This program is free software: you can redistribute it and/or modify
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
11 # it under the terms of the GNU Affero General Public License as
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
12 # published by the Free Software Foundation, either version 3 of the
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
13 # License, or (at your option) any later version.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
14 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
15 # This program is distributed in the hope that it will be useful,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
16 # but WITHOUT ANY WARRANTY; without even the implied warranty of
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
17 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
18 # GNU Affero General Public License for more details.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
19 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
20 # You should have received a copy of the GNU Affero General Public License
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
21 # along with this program. If not, see <http://www.gnu.org/licenses/>.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
22
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
23 use strict;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
24 use warnings;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
25 use IO::Handle;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
26 use Data::Dumper;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
27 use Getopt::Long;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
28 use Carp;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
29
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
30 ##
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
31 ## This program splits a FASTQ/FASTA file into several smaller files,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
32 ## Based on barcode matching.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
33 ##
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
34 ## run with "--help" for usage information
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
35 ##
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
36 ## Assaf Gordon <gordon@cshl.edu> , 11sep2008
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
37
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
38 # Forward declarations
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
39 sub load_barcode_file ($);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
40 sub parse_command_line ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
41 sub match_sequences ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
42 sub mismatch_count($$) ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
43 sub print_results;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
44 sub open_and_detect_input_format;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
45 sub open_index_and_detect_input_format($);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
46 sub read_index_record;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
47 sub read_record;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
48 sub write_record($);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
49 sub usage();
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
50
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
51 # Global flags and arguments,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
52 # Set by command line argumens
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
53 my $barcode_file ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
54 my $barcodes_at_eol = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
55 my $barcodes_at_bol = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
56 my $index_read_file ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
57 my $exact_match = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
58 my $allow_partial_overlap = 0;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
59 my $allowed_mismatches = 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
60 my $newfile_suffix = '';
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
61 my $newfile_prefix ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
62 my $quiet = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
63 my $debug = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
64 my $fastq_format = 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
65 my $index_fastq_format = 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
66 my $read_id_check_strip_characters = 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
67
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
68 # Global variables
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
69 # Populated by 'create_output_files'
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
70 my %filenames;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
71 my %files;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
72 my %counts = ( 'unmatched' => 0 );
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
73 my $barcodes_length;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
74 my @barcodes;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
75 my $input_file_io;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
76
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
77
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
78 # The Four lines per record in FASTQ format.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
79 # (when using FASTA format, only the first two are used)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
80 my $seq_name;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
81 my $seq_bases;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
82 my $seq_name2;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
83 my $seq_qualities;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
84
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
85 # Values used for index read file
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
86 my $index_seq_name;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
87 my $index_seq_bases;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
88 my $index_seq_name2;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
89 my $index_seq_qualities;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
90
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
91
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
92
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
93 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
94 # Start of Program
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
95 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
96 parse_command_line ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
97
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
98 load_barcode_file ( $barcode_file ) ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
99
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
100 open_and_detect_input_format;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
101
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
102 if (defined $index_read_file) {open_index_and_detect_input_format ( $index_read_file );}
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
103
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
104 match_sequences ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
105
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
106 print_results unless $quiet;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
107
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
108 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
109 # End of program
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
110 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
111
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
112
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
113
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
114
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
115
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
116
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
117
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
118
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
119 sub parse_command_line {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
120 my $help;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
121
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
122 usage() if (scalar @ARGV==0);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
123
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
124 my $result = GetOptions ( "bcfile=s" => \$barcode_file,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
125 "eol" => \$barcodes_at_eol,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
126 "bol" => \$barcodes_at_bol,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
127 "idxfile=s" => \$index_read_file,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
128 "idxidstrip=i" => \$read_id_check_strip_characters,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
129 "exact" => \$exact_match,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
130 "prefix=s" => \$newfile_prefix,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
131 "suffix=s" => \$newfile_suffix,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
132 "quiet" => \$quiet,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
133 "partial=i" => \$allow_partial_overlap,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
134 "debug" => \$debug,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
135 "mismatches=i" => \$allowed_mismatches,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
136 "help" => \$help
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
137 ) ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
138
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
139 usage() if ($help);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
140
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
141 die "Error: barcode file not specified (use '--bcfile [FILENAME]')\n" unless defined $barcode_file;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
142 die "Error: prefix path/filename not specified (use '--prefix [PATH]')\n" unless defined $newfile_prefix;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
143
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
144 if (! defined $index_read_file) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
145 if ($barcodes_at_bol == $barcodes_at_eol) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
146 die "Error: can't specify both --eol & --bol\n" if $barcodes_at_eol;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
147 die "Error: must specify either --eol or --bol or --idxfile\n" ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
148 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
149 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
150 elsif ($barcodes_at_bol || $barcodes_at_eol) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
151 die "Error: Must specify only one of --idxfile, --eol, or --bol";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
152 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
153
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
154 die "Error: invalid for value partial matches (valid values are 0 or greater)\n" if $allow_partial_overlap<0;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
155
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
156 $allowed_mismatches = 0 if $exact_match;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
157
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
158 die "Error: invalid value for mismatches (valid values are 0 or more)\n" if ($allowed_mismatches<0);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
159
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
160 die "Error: partial overlap value ($allow_partial_overlap) bigger than " .
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
161 "max. allowed mismatches ($allowed_mismatches)\n" if ($allow_partial_overlap > $allowed_mismatches);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
162
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
163
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
164 exit unless $result;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
165 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
166
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
167
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
168
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
169 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
170 # Read the barcode file
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
171 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
172 sub load_barcode_file ($) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
173 my $filename = shift or croak "Missing barcode file name";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
174
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
175 open BCFILE,"<$filename" or die "Error: failed to open barcode file ($filename)\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
176 while (<BCFILE>) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
177 next if m/^#/;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
178 chomp;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
179 my ($ident, $barcode) = split('\t') ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
180
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
181 $barcode = uc($barcode);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
182
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
183 # Sanity checks on the barcodes
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
184 die "Error: bad data at barcode file ($filename) line $.\n" unless defined $barcode;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
185 die "Error: bad barcode value ($barcode) at barcode file ($filename) line $.\n"
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
186 unless $barcode =~ m/^[AGCT]+$/;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
187
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
188 # Cleanup Identifiers (only allow alphanumeric, replace others with dash '-')
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
189 $ident =~ s/[^A-Za-z0-9]/-/g;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
190 die "Error: bad identifier value ($ident) at barcode file ($filename) line $. (must be alphanumeric)\n"
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
191 unless $ident =~ m/^[A-Za-z0-9-]+$/;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
192
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
193 die "Error: badcode($ident, $barcode) is shorter or equal to maximum number of " .
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
194 "mismatches ($allowed_mismatches). This makes no sense. Specify fewer mismatches.\n"
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
195 if length($barcode)<=$allowed_mismatches;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
196
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
197 $barcodes_length = length($barcode) unless defined $barcodes_length;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
198 die "Error: found barcodes in different lengths. this feature is not supported yet.\n"
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
199 unless $barcodes_length == length($barcode);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
200
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
201 push @barcodes, [$ident, $barcode];
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
202
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
203 if ($allow_partial_overlap>0) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
204 foreach my $i (1 .. $allow_partial_overlap) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
205 substr $barcode, ($barcodes_at_bol)?0:-1, 1, '';
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
206 push @barcodes, [$ident, $barcode];
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
207 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
208 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
209 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
210 close BCFILE;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
211
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
212 if ($debug) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
213 print STDERR "barcode\tsequence\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
214 foreach my $barcoderef (@barcodes) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
215 my ($ident, $seq) = @{$barcoderef};
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
216 print STDERR $ident,"\t", $seq ,"\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
217 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
218 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
219 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
220
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
221 # Create one output file for each barcode.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
222 # (Also create a file for the dummy 'unmatched' barcode)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
223 sub create_output_files {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
224 my %barcodes = map { $_->[0] => 1 } @barcodes; #generate a uniq list of barcode identifiers;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
225 $barcodes{'unmatched'} = 1 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
226
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
227 foreach my $ident (keys %barcodes) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
228 my $new_filename = $newfile_prefix . $ident . $newfile_suffix;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
229 $filenames{$ident} = $new_filename;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
230 open my $file, ">$new_filename" or die "Error: failed to create output file ($new_filename)\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
231 $files{$ident} = $file ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
232 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
233 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
234
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
235 sub match_sequences {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
236
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
237 my %barcodes = map { $_->[0] => 1 } @barcodes; #generate a uniq list of barcode identifiers;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
238 $barcodes{'unmatched'} = 1 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
239
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
240 #reset counters
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
241 foreach my $ident ( keys %barcodes ) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
242 $counts{$ident} = 0;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
243 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
244
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
245 create_output_files;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
246
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
247 # Read file FASTQ file
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
248 # split accotding to barcodes
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
249 while ( read_record ) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
250 chomp $seq_name;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
251 chomp $seq_bases;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
252 if (defined $index_read_file) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
253 read_index_record() or die "Error: Unable to read index sequence for sequence name ($seq_name), check to make sure the file lengths match.\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
254 chomp $index_seq_name;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
255 chomp $index_seq_bases;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
256
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
257 # Assert that the read ids match
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
258 my $seq_name_match = &strip_read_id($seq_name);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
259 my $index_seq_name_match = &strip_read_id($index_seq_name);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
260 if ($seq_name_match ne $index_seq_name_match) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
261 die "Error: Index sequence name ($index_seq_name) does not match sequence name ($seq_name)\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
262 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
263
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
264 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
265
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
266 print STDERR "sequence $seq_bases: \n" if $debug;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
267
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
268 my $best_barcode_mismatches_count = $barcodes_length;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
269 my $best_barcode_ident = undef;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
270
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
271 #Try all barcodes, find the one with the lowest mismatch count
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
272 foreach my $barcoderef (@barcodes) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
273 my ($ident, $barcode) = @{$barcoderef};
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
274
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
275 # Get DNA fragment (in the length of the barcodes)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
276 # The barcode will be tested only against this fragment
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
277 # (no point in testing the barcode against the whole sequence)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
278 my $sequence_fragment;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
279 if ($barcodes_at_bol) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
280 $sequence_fragment = substr $seq_bases, 0, $barcodes_length;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
281 } elsif ($barcodes_at_eol) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
282 $sequence_fragment = substr $seq_bases, - $barcodes_length;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
283 } else {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
284 $sequence_fragment = substr $index_seq_bases, 0, $barcodes_length;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
285 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
286
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
287 my $mm = mismatch_count($sequence_fragment, $barcode) ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
288
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
289 # if this is a partial match, add the non-overlap as a mismatch
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
290 # (partial barcodes are shorter than the length of the original barcodes)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
291 $mm += ($barcodes_length - length($barcode));
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
292
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
293 if ( $mm < $best_barcode_mismatches_count ) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
294 $best_barcode_mismatches_count = $mm ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
295 $best_barcode_ident = $ident ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
296 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
297 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
298
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
299 $best_barcode_ident = 'unmatched'
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
300 if ( (!defined $best_barcode_ident) || $best_barcode_mismatches_count>$allowed_mismatches) ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
301
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
302 print STDERR "sequence $seq_bases matched barcode: $best_barcode_ident\n" if $debug;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
303
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
304 $counts{$best_barcode_ident}++;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
305
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
306 #get the file associated with the matched barcode.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
307 #(note: there's also a file associated with 'unmatched' barcode)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
308 my $file = $files{$best_barcode_ident};
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
309
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
310 write_record($file);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
311 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
312 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
313
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
314 # Strip end of readids when matching to avoid mismatch between read 1, 2, 3, etc.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
315 sub strip_read_id {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
316 my $read_id = shift;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
317 my $stripped_read_id = $read_id;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
318 if ($read_id_check_strip_characters) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
319 if ($read_id =~ /@([a-zA-Z0-9_-]+):([0-9]+):([a-zA-Z0-9]+):([0-9]+):([0-9]+):([0-9]+):([0-9]+) ([0-9]+):([YN]):([0-9]+):([ACGT]+){0,1}/) { # CASAVA 1.8+
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
320 my @parts = split(/ /,$read_id);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
321 $stripped_read_id = $parts[0];
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
322 } else { # CASAVA 1.7 and earlier
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
323 $stripped_read_id = substr($read_id, 0, length($read_id)-$read_id_check_strip_characters);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
324 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
325 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
326 return $stripped_read_id;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
327 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
328
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
329
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
330 #Quickly calculate hamming distance between two strings
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
331 #
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
332 #NOTE: Strings must be same length.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
333 # returns number of different characters.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
334 #see http://www.perlmonks.org/?node_id=500235
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
335 sub mismatch_count($$) { length( $_[ 0 ] ) - ( ( $_[ 0 ] ^ $_[ 1 ] ) =~ tr[\0][\0] ) }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
336
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
337
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
338
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
339 sub print_results
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
340 {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
341 print "Barcode\tCount\tLocation\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
342 my $total = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
343 foreach my $ident (sort keys %counts) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
344 print $ident, "\t", $counts{$ident},"\t",$filenames{$ident},"\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
345 $total += $counts{$ident};
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
346 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
347 print "total\t",$total,"\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
348 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
349
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
350
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
351 sub read_record
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
352 {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
353 $seq_name = $input_file_io->getline();
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
354
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
355 return undef unless defined $seq_name; # End of file?
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
356
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
357 $seq_bases = $input_file_io->getline();
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
358 die "Error: bad input file, expecting line with sequences\n" unless defined $seq_bases;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
359
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
360 # If using FASTQ format, read two more lines
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
361 if ($fastq_format) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
362 $seq_name2 = $input_file_io->getline();
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
363 die "Error: bad input file, expecting line with sequence name2\n" unless defined $seq_name2;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
364
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
365 $seq_qualities = $input_file_io->getline();
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
366 die "Error: bad input file, expecting line with quality scores\n" unless defined $seq_qualities;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
367 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
368 return 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
369 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
370
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
371 sub write_record($)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
372 {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
373 my $file = shift;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
374
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
375 croak "Bad file handle" unless defined $file;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
376
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
377 print $file $seq_name,"\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
378 print $file $seq_bases,"\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
379
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
380 #if using FASTQ format, write two more lines
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
381 if ($fastq_format) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
382 print $file $seq_name2;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
383 print $file $seq_qualities;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
384 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
385 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
386
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
387 sub open_and_detect_input_format
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
388 {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
389 $input_file_io = new IO::Handle;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
390 die "Failed to open STDIN " unless $input_file_io->fdopen(fileno(STDIN),"r");
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
391
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
392 # Get the first characeter, and push it back
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
393 my $first_char = $input_file_io->getc();
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
394 $input_file_io->ungetc(ord $first_char);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
395
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
396 if ($first_char eq '>') {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
397 # FASTA format
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
398 $fastq_format = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
399 print STDERR "Detected FASTA format\n" if $debug;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
400 } elsif ($first_char eq '@') {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
401 # FASTQ format
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
402 $fastq_format = 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
403 print STDERR "Detected FASTQ format\n" if $debug;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
404 } else {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
405 die "Error: unknown file format. First character = '$first_char' (expecting > or \@)\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
406 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
407 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
408
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
409
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
410 sub open_index_and_detect_input_format($) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
411 my $filename = shift or croak "Missing index read file name";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
412
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
413 open IDXFILE,"<$filename" or die "Error: failed to open index read file ($filename)\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
414
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
415 # Get the first line, and reset file pointer
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
416 my $first_line = <IDXFILE>;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
417 my $first_char = substr($first_line, 0, 1);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
418 seek(IDXFILE, 0, 0);
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
419
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
420 if ($first_char eq '>') {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
421 # FASTA format
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
422 $index_fastq_format = 0 ;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
423 print STDERR "Detected FASTA format for index file\n" if $debug;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
424 } elsif ($first_char eq '@') {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
425 # FASTQ format
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
426 $index_fastq_format = 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
427 print STDERR "Detected FASTQ format for index file\n" if $debug;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
428 } else {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
429 die "Error: unknown index file format. First character = '$first_char' (expecting > or \@)\n";
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
430 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
431 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
432
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
433 sub read_index_record
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
434 {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
435 $index_seq_name = <IDXFILE>;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
436
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
437 return undef unless defined $index_seq_name; # End of file?
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
438
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
439 $index_seq_bases = <IDXFILE>;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
440 die "Error: bad input file, expecting line with sequences\n" unless defined $index_seq_bases;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
441
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
442 # If using FASTQ format, read two more lines
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
443 if ($index_fastq_format) {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
444 $index_seq_name2 = <IDXFILE>;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
445 die "Error: bad input file, expecting line with sequence name2\n" unless defined $index_seq_name2;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
446
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
447 $index_seq_qualities = <IDXFILE>;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
448 die "Error: bad input file, expecting line with quality scores\n" unless defined $index_seq_qualities;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
449 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
450 return 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
451 }
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
452
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
453 sub usage()
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
454 {
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
455 print<<EOF;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
456 Barcode Splitter, by Assaf Gordon (gordon\@cshl.edu), 11sep2008
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
457
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
458 This program reads FASTA/FASTQ file and splits it into several smaller files,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
459 Based on barcode matching.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
460 FASTA/FASTQ data is read from STDIN (format is auto-detected.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
461 Output files will be writen to disk.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
462 Summary will be printed to STDOUT.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
463
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
464 usage: $0 --bcfile FILE --prefix PREFIX [--suffix SUFFIX] [--bol|--eol|--idxfile]
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
465 [--mismatches N] [--exact] [--partial N] [--idxidstrip N]
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
466 [--help] [--quiet] [--debug]
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
467
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
468 Arguments:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
469
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
470 --bcfile FILE - Barcodes file name. (see explanation below.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
471 --prefix PREFIX - File prefix. will be added to the output files. Can be used
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
472 to specify output directories.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
473 --suffix SUFFIX - File suffix (optional). Can be used to specify file
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
474 extensions.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
475 --bol - Try to match barcodes at the BEGINNING of sequences.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
476 (What biologists would call the 5' end, and programmers
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
477 would call index 0.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
478 --eol - Try to match barcodes at the END of sequences.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
479 (What biologists would call the 3' end, and programmers
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
480 would call the end of the string.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
481 --idxfile FILE - Read barcodes from separate index file (fasta or fastq)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
482 NOTE: one of --bol, --eol, --idxfile must be specified,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
483 but not more than one.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
484 --idxidstrip N - When using index file, strip this number of characters
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
485 from the end of the sequence id before matching.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
486 Automatically detects CASAVA 1.8 format and strips at a
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
487 space in the id, use 0 to disable this.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
488 (Default is 1).
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
489 --mismatches N - Max. number of mismatches allowed. default is 1.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
490 --exact - Same as '--mismatches 0'. If both --exact and --mismatches
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
491 are specified, '--exact' takes precedence.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
492 --partial N - Allow partial overlap of barcodes. (see explanation below.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
493 (Default is not partial matching)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
494 --quiet - Don't print counts and summary at the end of the run.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
495 (Default is to print.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
496 --debug - Print lots of useless debug information to STDERR.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
497 --help - This helpful help screen.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
498
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
499 Example (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
500 the barcodes file):
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
501
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
502 \$ cat s_2_100.txt | $0 --bcfile mybarcodes.txt --bol --mismatches 2 \\
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
503 --prefix /tmp/bla_ --suffix ".txt"
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
504
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
505 Barcode file format
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
506 -------------------
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
507 Barcode files are simple text files. Each line should contain an identifier
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
508 (descriptive name for the barcode), and the barcode itself (A/C/G/T),
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
509 separated by a TAB character. Example:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
510
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
511 #This line is a comment (starts with a 'number' sign)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
512 BC1 GATCT
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
513 BC2 ATCGT
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
514 BC3 GTGAT
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
515 BC4 TGTCT
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
516
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
517 For each barcode, a new FASTQ file will be created (with the barcode's
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
518 identifier as part of the file name). Sequences matching the barcode
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
519 will be stored in the appropriate file.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
520
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
521 Running the above example (assuming "mybarcodes.txt" contains the above
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
522 barcodes), will create the following files:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
523 /tmp/bla_BC1.txt
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
524 /tmp/bla_BC2.txt
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
525 /tmp/bla_BC3.txt
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
526 /tmp/bla_BC4.txt
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
527 /tmp/bla_unmatched.txt
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
528 The 'unmatched' file will contain all sequences that didn't match any barcode.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
529
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
530 Barcode matching
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
531 ----------------
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
532
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
533 ** Without partial matching:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
534
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
535 Count mismatches between the FASTA/Q sequences and the barcodes.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
536 The barcode which matched with the lowest mismatches count (providing the
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
537 count is small or equal to '--mismatches N') 'gets' the sequences.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
538
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
539 Example (using the above barcodes):
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
540 Input Sequence:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
541 GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
542
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
543 Matching with '--bol --mismatches 1':
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
544 GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
545 GATCT (1 mismatch, BC1)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
546 ATCGT (4 mismatches, BC2)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
547 GTGAT (3 mismatches, BC3)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
548 TGTCT (3 mismatches, BC4)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
549
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
550 This sequence will be classified as 'BC1' (it has the lowest mismatch count).
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
551 If '--exact' or '--mismatches 0' were specified, this sequence would be
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
552 classified as 'unmatched' (because, although BC1 had the lowest mismatch count,
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
553 it is above the maximum allowed mismatches).
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
554
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
555 Matching with '--eol' (end of line) does the same, but from the other side
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
556 of the sequence.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
557
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
558 ** With partial matching (very similar to indels):
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
559
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
560 Same as above, with the following addition: barcodes are also checked for
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
561 partial overlap (number of allowed non-overlapping bases is '--partial N').
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
562
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
563 Example:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
564 Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
565 (Same as above, but note the missing 'G' at the beginning.)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
566
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
567 Matching (without partial overlapping) against BC1 yields 4 mismatches:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
568 ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
569 GATCT (4 mismatches)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
570
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
571 Partial overlapping would also try the following match:
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
572 -ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
573 GATCT (1 mismatch)
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
574
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
575 Note: scoring counts a missing base as a mismatch, so the final
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
576 mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
577 If running with '--mismatches 2' (meaning allowing upto 2 mismatches) - this
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
578 seqeunce will be classified as BC1.
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
579
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
580 EOF
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
581
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
582 exit 1;
84bbf4fd24c3 Initial toolshed version with support for separate index reads and automatic loading of results into Galaxy history.
lparsons
parents:
diff changeset
583 }