annotate fastqFilter.xml @ 0:14e7247c1fa0 draft

Uploaded
author czlab
date Thu, 17 May 2018 21:31:10 -0400
<tool id="fastqFilter" name="Filter FASTQ files">
  <description></description>
  <command>
    fastq_filter.pl -v
    #if $sampleIndex.filterBySampleIndex == "yes":
      -index $sampleIndex.sequence
    #end if
    -maxN $maxN -if sanger -f $filterString -of $outputFormat $inputfile $outputfile
  </command>

  <inputs>
    <param name="inputfile" format="fastq" type="data" label="Input Sanger FASTQ file (.gz file accepted; see help below for more information)" />

    <conditional name="sampleIndex">
      <param name="filterBySampleIndex" type="select" label="Filter by sample index (see help below for parameter suggestion)">
        <option value="yes">Yes</option>
        <option value="no" selected="true">No</option>
      </param>
      <when value="yes">
        <param name="sequence" type="text" value="" label="Index position and sequence (position:sequence, e.g. 0:GTCA)" />
      </when>
      <when value="no">
      </when>
    </conditional>

    <param name="filterString" type="text" value="" label="Quality score filter string; format: Method:Start-End:Score (zero-based; see help below for parameter suggestion)" />
    <param name="maxN" type="integer" value="-1" label="Max number of Ns in sequence (off by default; any value less than 0 disables this filter)" />
    <param name="outputFormat" type="select" label="Output data type">
      <option value="fastq">FASTQ</option>
      <option value="fasta">FASTA</option>
    </param>

  </inputs>

  <outputs>
    <data name="outputfile" format="fastq" label="Read quality filtering on ${on_string}">
      <change_format>
        <when input="outputFormat" value="fasta" format="fasta" />
      </change_format>
    </data>
  </outputs>

  <help>

.. class:: infomark

**What this tool does**

This tool extracts reads that pass quality filters.

It takes Sanger FASTQ files as input and outputs FASTQ or FASTA files of the filtered reads.

-----

**FASTQ format**

Check the quality scores in the FASTQ file to confirm that they use the right encoding (this tool expects Sanger format).

Reference https://en.wikipedia.org/wiki/FASTQ_format#Quality :

* Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126.
* Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126.

See http://www.asciitable.com/ for an ASCII table.
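
To make the two encodings concrete, here is a minimal Python sketch (an illustration only, not part of fastq_filter.pl) that converts a quality string into integer scores. The function names and offset constants are assumptions introduced for this example; the Solexa offset of 64 follows from ASCII 59 encoding a score of -5.

```python
SANGER_OFFSET = 33  # ASCII 33 ('!') encodes Phred score 0
SOLEXA_OFFSET = 64  # ASCII 59 (';') encodes Solexa score -5


def decode_quality(quality_line, offset=SANGER_OFFSET):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def must_be_sanger(quality_line):
    """True if any character is below ASCII 59, the lowest Solexa/Illumina 1.0 character."""
    return any(59 > ord(ch) for ch in quality_line)


print(decode_quality("!5I~"))                # [0, 20, 40, 93]
print(decode_quality(";h~", SOLEXA_OFFSET))  # [-5, 40, 62]
print(must_be_sanger("!5I~"))                # True
```

A file whose quality characters never drop below ASCII 59 could be in either encoding, so this check can only rule out Solexa/Illumina 1.0; it cannot confirm Sanger.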

-----

**Filter by sample index (optional)**

Use this option if you are starting from a FASTQ file that contains reads from multiple libraries.

For example:

If you have six samples with indexes GTCA, GCAT, ACTG, AGCT, GCAT, TCGA, you can extract the reads for each library by giving the index sequence and its starting position in the read. With the indexes starting at position 0, you would specify 0:GTCA for the first library, and so on for the others.
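
The index option can be pictured with the following Python sketch. It assumes the value has the form position:sequence and that matching is exact; how fastq_filter.pl treats mismatches or letter case is not stated here, and the function names are hypothetical.

```python
def parse_index_spec(spec):
    """Split a 'position:sequence' string such as '0:GTCA'."""
    position, sequence = spec.split(":", 1)
    return int(position), sequence.upper()


def read_has_index(read_sequence, spec):
    """True if the read carries the index at the given 0-based position."""
    position, index = parse_index_spec(spec)
    # Assumption: exact match only, no mismatches tolerated.
    return read_sequence.upper().startswith(index, position)


print(read_has_index("GTCAACGTTT", "0:GTCA"))  # True
print(read_has_index("AGTCAACGTT", "0:GTCA"))  # False
print(read_has_index("AGTCAACGTT", "1:GTCA"))  # True
```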

-----

**How to set the filter**

You can apply multiple filtering criteria based on the quality scores of each read. Separate the criteria with commas (e.g. method1:start1-end1:score1,method2:start2-end2:score2).

Each criterion is composed of four components:

1. Method: min or mean, which sets a requirement on the minimum or mean score of a region
2. Start: the first nucleotide to consider (0-based)
3. End: the last nucleotide to consider (0-based, inclusive)
4. Score: the threshold required
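
The four components can be read as in the Python sketch below. It is an illustration under stated assumptions, not the code of fastq_filter.pl: the function names are hypothetical, a score equal to the threshold is assumed to pass, and a read shorter than the requested region is assumed to fail.

```python
def parse_filter_string(filter_string):
    """Parse 'method:start-end:score,...' into (method, start, end, score) tuples."""
    criteria = []
    for item in filter_string.split(","):
        method, region, score = item.strip().split(":")
        start, end = (int(x) for x in region.split("-"))
        if method not in ("min", "mean"):
            raise ValueError("unknown method: " + method)
        if start > end:
            raise ValueError("start must not exceed end: " + region)
        criteria.append((method, start, end, float(score)))
    return criteria


def passes_filters(scores, filter_string):
    """True if the list of per-base scores satisfies every criterion."""
    for method, start, end, threshold in parse_filter_string(filter_string):
        if end + 1 > len(scores):
            return False  # assumption: a read shorter than the region fails
        region = scores[start:end + 1]  # End is 0-based and inclusive
        value = min(region) if method == "min" else sum(region) / len(region)
        if threshold > value:
            return False  # assumption: a score equal to the threshold passes
    return True


scores = [30] * 25 + [10] * 5
print(passes_filters(scores, "mean:0-29:20"))              # True, mean is about 26.7
print(passes_filters(scores, "min:0-29:20"))               # False, minimum is 10
print(passes_filters(scores, "mean:0-29:20,min:0-24:30"))  # True
```

A read must satisfy every comma-separated criterion to be kept, so adding criteria can only make the filter stricter.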

**Parameter suggestion**

For example:

* For standard CLIP protocol filtering: mean:0-29:20 (this specifies a mean score of 20 or above in the first 30 bases, which include 5 positions with sample indexes and the random barcode, followed by 25 positions with the actual CLIP tag).
* For iCLIP/BrdU CLIP filtering: mean:0-38:20 (this specifies a mean score of 20 or above in the first 39 bases, which include 14 positions with sample indexes and the random barcode, followed by 25 positions with the actual CLIP tag).
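
The End values in these suggestions come from simple arithmetic: the index and barcode length plus the tag length, minus one because positions are 0-based. The hypothetical helper below (its name and parameters are assumptions made for this example) reproduces both strings.

```python
def mean_filter_string(prefix_length, tag_length, min_mean_score):
    """Build a mean filter over the index/barcode prefix plus the CLIP tag."""
    region_length = prefix_length + tag_length
    end = region_length - 1  # End is 0-based and inclusive
    return "mean:0-{}:{}".format(end, min_mean_score)


print(mean_filter_string(5, 25, 20))   # mean:0-29:20
print(mean_filter_string(14, 25, 20))  # mean:0-38:20
```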

Filtering this way matters because low-quality reads can introduce mapping errors and background, and they inflate the number of unique tags after removal of PCR duplicates.

  </help>
</tool>