FASTX-Toolkit
=============


Short Summary
=============

The FASTX-Toolkit is a collection of command-line tools for preprocessing
short-read FASTA/FASTQ files.


More Details
============

Next-generation sequencing machines usually produce FASTA or FASTQ files
containing multiple short-read sequences (possibly with quality information).

The main processing of such FASTA/FASTQ files is mapping (aka aligning)
the sequences to reference genomes or other databases using specialized
programs.

Examples of such mapping programs are:
    Blat (http://www.kentinformatics.com/index.asp),
    SHRiMP (http://compbio.cs.toronto.edu/shrimp),
    LastZ (http://www.bx.psu.edu/miller_lab),
    MAQ (http://maq.sourceforge.net/),
    and many others.

However, it is sometimes more productive to preprocess the FASTA/FASTQ files
before mapping the sequences to the genome, manipulating the sequences to
produce better mapping results.

The FASTX-Toolkit tools perform some of these preprocessing tasks.


Available Tools
===============

FASTQ-to-FASTA - Converts a FASTQ file to a FASTA file.
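As an illustration of what this conversion involves, here is a minimal Python sketch. It is not the toolkit's implementation. It assumes the simple four-line FASTQ layout (header, sequence, "+" line, quality line), and the function name `fastq_to_fasta` is hypothetical.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA text (illustrative only)."""
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta_lines = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@'")
        # Keep the read name, swap '@' for '>', and drop the quality lines.
        fasta_lines.append(">" + header[1:])
        fasta_lines.append(sequence)
    return "\n".join(fasta_lines) + "\n"


example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n"
print(fastq_to_fasta(example), end="")
```

The quality information is discarded, so the conversion cannot be reversed.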

FASTQ-Statistics - Scans a FASTQ file and produces statistics about the
    quality scores and the sequences in the file.

FASTQ-Quality-BoxPlot and
FASTQ-Nucleotides-Distribution - Generate charts based on the statistics
    produced by FASTQ-Statistics. These charts give a quick view of
    the quality of the sequenced library.

FASTQ-Quality-Converter - Converts quality scores from ASCII to numeric form.
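The idea behind this conversion can be sketched in a few lines of Python. This is an illustration only, not the tool's code. The function names are hypothetical, and the ASCII offset is an assumption about the input: 33 is the Sanger convention, while some older Illumina files use 64.

```python
def ascii_to_numeric(quality, offset=33):
    """Decode an ASCII quality string into a list of integer scores.

    The offset is an assumption about the input encoding.
    """
    return [ord(char) - offset for char in quality]


def numeric_to_text(scores):
    """Format integer scores as a space-separated line."""
    return " ".join(str(score) for score in scores)


print(numeric_to_text(ascii_to_numeric("II5+!")))
```

With the default offset, the quality string "II5+!" decodes to the scores 40 40 20 10 0.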

FASTQ-Quality-Filter - Removes low-quality sequences from FASTQ files.
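One possible filtering rule is sketched below in Python. The rule itself is an assumption made for illustration (keep a read when at least `min_percent` of its bases reach `min_quality`). The function names, the default thresholds and the offset of 33 are likewise hypothetical, and none of this is taken from the tool's source.

```python
def passes_quality_filter(quality, min_quality=20, min_percent=80, offset=33):
    """Return True if enough bases in the read reach the minimum quality."""
    scores = [ord(char) - offset for char in quality]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_quality)
    # Integer arithmetic avoids rounding problems at the threshold.
    return good * 100 >= min_percent * len(scores)


def filter_records(records, **thresholds):
    """Keep only (name, sequence, quality) records that pass the filter."""
    return [rec for rec in records if passes_quality_filter(rec[2], **thresholds)]


records = [
    ("read1", "ACGTA", "IIIII"),  # all five scores are 40
    ("read2", "ACGTA", "II!!!"),  # only two of five scores reach 20
]
print([name for name, _, _ in filter_records(records)])
```

Here only "read1" is kept, because "read2" has too few bases at or above the quality threshold.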

FASTX-Artifacts-Filter - Removes some sequencing artifacts from FASTA/Q files.

FASTX-Barcode-Splitter - A common practice is to sequence multiple biological
    samples in the same library (marking each sample with a dedicated
    barcode). The resulting FASTA/Q file contains intermixed sequences
    from those samples. This tool splits a FASTA/Q file into several
    individual files, based on the barcodes.
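The splitting step can be pictured with the following Python sketch. It is a simplified illustration and not the tool's algorithm: it assumes the barcode sits at the start of each read and must match exactly, and the names `split_by_barcode`, `sampleA` and `sampleB` are hypothetical.

```python
from collections import defaultdict


def split_by_barcode(reads, barcodes):
    """Group (name, sequence) reads by the barcode found at the sequence start.

    `barcodes` maps a sample label to its barcode. Reads that match no
    barcode are collected under the label "unmatched".
    """
    groups = defaultdict(list)
    for name, sequence in reads:
        label = "unmatched"
        for sample, barcode in barcodes.items():
            if sequence.startswith(barcode):
                label = sample
                break
        groups[label].append((name, sequence))
    return dict(groups)


barcodes = {"sampleA": "ACGT", "sampleB": "TTAG"}
reads = [("r1", "ACGTGGGG"), ("r2", "TTAGCCCC"), ("r3", "GGGGGGGG")]
for label, members in split_by_barcode(reads, barcodes).items():
    print(label, [name for name, _ in members])
```

Each group would then be written to its own output file. Reads without a recognizable barcode are kept in a separate group rather than silently dropped.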

FASTX-Clipper - Adapters (aka linkers) are added to the library (before
    sequencing), and should be removed from the resulting FASTA/Q file.
    This tool removes (clips) adapters.
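Clipping can be illustrated with a short Python sketch. It assumes the adapter follows the insert (towards the 3'-end) and appears as an exact, complete match; a real clipper would typically also have to cope with partial or mismatched adapters. The function name and the example sequences are hypothetical.

```python
def clip_adapter(sequence, adapter):
    """Return the sequence truncated at the first exact adapter occurrence."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence         # adapter not found: leave the read unchanged
    return sequence[:position]  # drop the adapter and everything after it


print(clip_adapter("ACGTACGTCTGTAGGCACC", "CTGTAGG"))
```

For a FASTQ record, the quality string would be cut at the same position so that it stays aligned with the clipped sequence.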

FASTA-Clipping-Histogram - After clipping a FASTA file, this tool generates a
    chart showing the lengths of the clipped sequences.

FASTX-Reverse-Complement - Produces the reverse complement of each sequence
    in a FASTA/Q file. If a FASTQ file is given, the quality scores are
    also reversed.
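A minimal Python sketch of the operation follows. It is an illustration rather than the tool's code, the function names are hypothetical, and it handles only the bases A, C, G, T and N.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def reverse_complement_fastq(sequence, quality):
    """Reverse-complement the bases and reverse the quality string with them."""
    return reverse_complement(sequence), quality[::-1]


print(reverse_complement("AACGT"))
print(reverse_complement_fastq("AACGT", "I5+!#"))
```

Reversing the quality string keeps every score attached to the base it was measured for.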

FASTX-Trimmer - Extracts sub-sequences from a FASTA/Q file. Two examples are:
    removing barcodes from the 5'-end of all sequences in a FASTQ file;
    cutting 7 nucleotides from the 3'-end of all sequences in a FASTA file.
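The two examples above can be sketched in Python as follows. This is an illustration, not the tool's implementation. The function `trim`, its 1-based `first` and `last` parameters, and the 4-nucleotide barcode length are all assumptions.

```python
def trim(sequence, first=1, last=None):
    """Keep bases `first` through `last` (1-based, inclusive).

    `last=None` keeps everything up to the end of the sequence.
    """
    if first < 1:
        raise ValueError("first must be at least 1")
    if last is not None and last < first:
        raise ValueError("last must not be smaller than first")
    return sequence[first - 1:last]


read = "ACGTGGGGCCCC"

# Remove a 4-nucleotide barcode from the 5'-end.
print(trim(read, first=5))

# Cut 7 nucleotides from the 3'-end.
print(trim(read, last=len(read) - 7))
```

For FASTQ input the same slice would be applied to the quality string.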


Galaxy
======

Galaxy (http://g2.bx.psu.edu) is a web-based framework for computational biology.

While the programs in the FASTX-Toolkit are command-line based, the package
includes the files necessary to integrate the tools into a Galaxy server,
allowing users to run these tools from their web browser.

If you run your own local mirror of a Galaxy server, you can integrate the
FASTX-Toolkit into your Galaxy server.


Software Requirements
=====================

1. GCC is required to compile most tools.

2. The FASTA-Clipping-Histogram tool requires Perl and the "PerlIO::gzip"
   and "GD::Graph::bars" modules.

   The Perl modules can be installed by running:

       $ sudo cpan 'PerlIO::gzip'
       $ sudo cpan 'GD::Graph::bars'

3. FASTX-Barcode-Splitter requires the GNU Sed program.

4. FASTQ-Quality-BoxPlot and FASTQ-Nucleotides-Distribution require the
   'gnuplot' program.


Installation
============

To compile the tools, run:

    $ ./configure
    $ make

To install the tools, run (as root):

    $ sudo make install

This will install the tools into /usr/local/bin.
To install the tools to a different location, change the 'configure' step to:

    $ ./configure --prefix=/DESTINATION/DIRECTORY


Command Line Usage
==================

Most tools support the "-h" argument, which shows a short help screen.
More complete documentation is not available at this time.
Some more details and examples are available in the <help> section
of the XML tool files (in the 'galaxy' subdirectory).


Galaxy Installation
===================

Galaxy installation should be done manually, and requires a technical
understanding of the Galaxy framework.

1. Build and install the command-line tools (as described above).

2. Make a backup of your Galaxy installation (better safe than sorry).

3. Run the 'install_galaxy_files.sh' script,
   and specify the Galaxy root directory.
   This script copies the files from the 'galaxy' sub-directory into
   your Galaxy mirror directory.

4. Manually add the content of the ./galaxy/fastx_toolkit_conf.xml file
   to your Galaxy's tool_conf.xml.

5. Edit the [YOUR-GALAXY]/tool-data/fastx_clipper_sequences.txt file
   and add your custom adapters/linkers.

6. Modify "fastx_barcode_splitter_galaxy_wrapper.sh" as explained
   below (see section "Special configuration for Barcode-Splitter").

7. Restart Galaxy.

Always make a backup of your Galaxy server files before trying to install
the FASTX-Toolkit.


Galaxy Testing
==============

The following tools support Galaxy's functional testing
(run from Galaxy's main directory):

    $ sh run_functional_tests.sh -id cshl_fastq_qual_conv
    $ sh run_functional_tests.sh -id cshl_fastq_to_fasta
    $ sh run_functional_tests.sh -id cshl_fastq_qual_stat
    $ sh run_functional_tests.sh -id cshl_fastx_trimmer
    $ sh run_functional_tests.sh -id cshl_fastx_reverse_complement
    $ sh run_functional_tests.sh -id cshl_fastx_artifacts_filter
    $ sh run_functional_tests.sh -id cshl_fasta_collapser
    $ sh run_functional_tests.sh -id cshl_fastx_clipper


Special configuration for Barcode-Splitter
==========================================

When running the barcode-splitter tool from the command line, you specify a
prefix directory; the output files are written to that directory (similar
to the usage of GNU's split program).

Running the barcode-splitter inside Galaxy requires a special hack because
Galaxy can't create (or I don't know how to make it create) a variable number
of output datasets. The number of required output files is determined by the
tool only AFTER reading the barcodes description file.

The Galaxy version of Barcode-Splitter works like this:

1. A FASTA/FASTQ file and a barcode description file are fed to the tool.
2. The tool produces a single output dataset (inside Galaxy). This output
   is an HTML file containing links to the split FASTA files.
3. Users can use the links to get the split FASTA files.
   (Since Galaxy's 'upload data' tool accepts URLs, this is not a real problem.)

4. As the Galaxy administrator, you'll have to edit the
   'fastx_barcode_splitter_galaxy_wrapper.sh' script and change BASEPATH and
   PUBLICURL to point to a publicly accessible path on your server.

Example:

fastx_barcode_splitter_galaxy_wrapper.sh contains:

    BASEPATH="/media/sdb1/galaxy/barcode_splits/"
    PUBLICURL="http://tango.cshl.edu/barcode_splits/"

When a user runs the barcode splitter tool, the FASTA files will be generated in
"/media/sdb1/galaxy/barcode_splits/".
The URL "http://tango.cshl.edu/barcode_splits" is set (in an Apache server) to
serve files from "/media/sdb1/galaxy/barcode_splits/", with the following
configuration:

    Alias /barcode_splits "/media/sdb1/galaxy/barcode_splits/"
    <Directory "/media/sdb1/galaxy/barcode_splits/">
        AllowOverride None
        Order allow,deny
        Allow from all
    </Directory>
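The relationship between BASEPATH, PUBLICURL and the HTML output can be sketched in Python as below. This is only an illustration of the mapping, not the wrapper script itself (which is a shell script whose contents are not shown here). The function names and the file name `run1_BC1.fa` are hypothetical.

```python
import html
import posixpath
from urllib.parse import quote

# Values taken from the example configuration above.
BASEPATH = "/media/sdb1/galaxy/barcode_splits/"
PUBLICURL = "http://tango.cshl.edu/barcode_splits/"


def public_url(file_path, basepath=BASEPATH, publicurl=PUBLICURL):
    """Translate a file path under `basepath` into its public URL."""
    if not file_path.startswith(basepath):
        raise ValueError("file is outside the published directory")
    relative = file_path[len(basepath):]
    return publicurl.rstrip("/") + "/" + quote(relative)


def links_page(file_paths):
    """Build a small HTML page with one link per split file."""
    items = []
    for path in file_paths:
        url = html.escape(public_url(path), quote=True)
        name = html.escape(posixpath.basename(path))
        items.append('<li><a href="%s">%s</a></li>' % (url, name))
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"


print(links_page([BASEPATH + "run1_BC1.fa"]), end="")
```

Because the links are plain URLs, they only work if the web server really serves BASEPATH at PUBLICURL, which is what the Alias directive above sets up.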


Licenses
========

FASTX-Toolkit is distributed under the Affero GPL version 3 or later (AGPLv3),

EXCEPT

all files under the 'galaxy' sub-directory, which are distributed under the
same license as Galaxy itself (an MIT-style license).


While IANAL, these licenses basically mean that:

1. You're free to use FASTX-Toolkit.

2. You're free to integrate FASTX-Toolkit into your Galaxy mirror server
   (or any other server).

3. You're free to modify the files under 'galaxy'
   without making your modifications public.

4. If you modify the FASTX-Toolkit tools and make those modifications
   publicly available (as downloadable tools, as part of another product,
   or as a web-based server), you must make the modified source code freely
   available (free as in speech).

See the COPYING file for the full Affero GPL.
See the GALAXY-LICENSE file for Galaxy's license.

Please remember:
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.


=============
Please send all comments, suggestions, and bug reports (or better yet, bug fixes)
to gordon@cshl.edu.