comparison rgPicardLibComplexity.xml @ 4:f4d018471628 draft default tip

Uploaded
author jpruab
date Tue, 13 Aug 2013 12:09:14 -0400
3:08b477977410 4:f4d018471628
1 <tool name="Estimate Library Complexity" id="rgEstLibComp" version="1.56.0">
2 <requirements><requirement type="package" version="1.56.0">picard</requirement></requirements>
3 <command interpreter="python">
4 picard_wrapper.py -i "${input_file}" -n "${out_prefix}" --tmpdir "${__new_file_path__}" --minid "${minIDbases}"
5 --maxdiff "${maxDiff}" --minmeanq "${minMeanQ}" --readregex "${readRegex}" --optdupdist "${optDupeDist}"
6 -j "\$JAVA_JAR_PATH/EstimateLibraryComplexity.jar" -d "${html_file.files_path}" -t "${html_file}"
7 </command>
8 <inputs>
9 <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset"
10 help="If empty, upload or import a SAM/BAM dataset."/>
11 <param name="out_prefix" value="Library Complexity" type="text"
12 label="Title for the output file" help="Use this to remind you what the job was for." size="80" />
13 <param name="minIDbases" value="5" type="integer" label="Minimum identical bases at starts of reads for grouping" size="5"
14 help="Roughly total_reads / 4^N reads (where N is this value) are compared at a time. Lower values give more accurate results but need exponentially more time and memory." />
15 <param name="maxDiff" value="0.03" type="float"
16 label="Maximum difference rate for identical reads" size="5"
17 help="The maximum rate of differences between two reads to call them identical" />
18 <param name="minMeanQ" value="20" type="integer"
19 label="Minimum mean quality" size="5"
20 help="The minimum mean quality of bases in a read pair. Reads with lower average quality are filtered out of all calculations." />
21 <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" type="text" size="120"
22 label="Regular expression that can be used to parse read names in the incoming SAM file"
23 help="Names are parsed to extract: tile/region, x coordinate and y coordinate, to estimate optical duplication rate" >
24 <sanitizer>
25 <valid initial="string.printable">
26 <remove value="&apos;"/>
27 </valid>
28 <mapping initial="none">
29 <add source="&apos;" target="__sq__"/>
30 </mapping>
31 </sanitizer>
32 </param>
33 <param name="optDupeDist" value="100" type="text"
34 label="The maximum offset between two duplicate clusters in order to consider them optical duplicates." size="5"
35 help="e.g. 5-10 pixels. Later Illumina software versions multiply pixel values by 10, in which case 50-100" />
36
37 </inputs>
38 <outputs>
39 <data format="html" name="html_file" label="${out_prefix}_lib_complexity.html"/>
40 </outputs>
41 <tests>
42 <test>
43 <param name="input_file" value="picard_input_tiny.sam" />
44 <param name="out_prefix" value="Library Complexity" />
45 <param name="minIDbases" value="5" />
46 <param name="maxDiff" value="0.03" />
47 <param name="minMeanQ" value="20" />
48 <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" />
49 <param name="optDupeDist" value="100" />
50 <output name="html_file" file="picard_output_estlibcomplexity_tinysam.html" ftype="html" lines_diff="30" />
51 </test>
52 </tests>
53 <help>
54
55 .. class:: infomark
56
57 **Purpose**
58
59 Attempts to estimate library complexity from sequence alone.
60 Does so by sorting all reads by the first N bases (5 by default) of each read and then
61 comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be
62 duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
63
64 Reads of poor quality are filtered out so as to provide a more accurate estimate.
65 The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than
66 MIN_MEAN_QUALITY across either the first or second read.
67
68 The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes these in the
69 calculation of library size. Also, since there is no alignment to screen out technical reads, one
70 further filter is applied on the data. After examining all reads a histogram is built of
71 [#reads in duplicate set -> #of duplicate sets]; all bins that contain exactly one duplicate set are
72 then removed from the histogram as outliers before library size is estimated.
73
74 **Picard documentation**
75
76 This is a Galaxy wrapper for EstimateLibraryComplexity, a part of the external package Picard-tools_.
77
78 .. _Picard-tools: http://www.google.com/search?q=picard+samtools
79
80 -----
81
82 .. class:: infomark
83
84 **Inputs, outputs, and parameters**
85
86 Picard documentation says (reformatted for Galaxy):
87
88 .. csv-table::
89 :header-rows: 1
90
91 "Option","Description"
92 "INPUT=File","One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped. This option may be specified 0 or more times."
93 "OUTPUT=File","Output file to write per-library metrics to. Required."
94 "MIN_IDENTICAL_BASES=Integer","The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU. Default value: 5."
95 "MAX_DIFF_RATE=Double","The maximum rate of differences between two reads to call them identical. Default value: 0.03. "
96 "MIN_MEAN_QUALITY=Integer","The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations. Default value: 20."
97 "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to clear the default value."
98 "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
99 "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. "
100
101 .. class:: warningmark
102
103 **Warning on SAM/BAM quality**
104
105 Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
106 flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
107 to be the only way to deal with SAM/BAM that cannot be parsed.
108
109 .. class:: infomark
110
111 **Note on the Regular Expression**
112
113 (from the Picard docs)
114 This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file.
115 These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size.
116 The regular expression should contain three capture groups for the three variables, in order.
117 Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
118
119
120 </help>
121 </tool>
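
The grouping step described under **Purpose** in the help text can be illustrated with a rough Python sketch. It is not Picard's implementation: it works on single read sequences rather than read pairs, it leaves out the quality filter, and the greedy clustering against the first member of each set is an assumption made for brevity. The function names `mismatch_rate` and `group_duplicates` are hypothetical.

```python
from collections import defaultdict


def mismatch_rate(a, b):
    """Fraction of differing positions, compared without gaps over the shorter read."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x != y) / n


def group_duplicates(reads, min_identical_bases=5, max_diff_rate=0.03):
    """Bucket reads by their first N bases, then cluster within each bucket."""
    buckets = defaultdict(list)
    for seq in reads:
        prefix = seq[:min_identical_bases]
        if "N" in prefix:
            continue  # reads with a no-call in the first N bases are dropped
        buckets[prefix].append(seq)

    duplicate_sets = []
    for bucket in buckets.values():
        sets_in_bucket = []
        for seq in bucket:
            # Greedy assumption: compare only against the first read of each set.
            for dup_set in sets_in_bucket:
                if mismatch_rate(seq, dup_set[0]) <= max_diff_rate:
                    dup_set.append(seq)
                    break
            else:
                sets_in_bucket.append([seq])
        duplicate_sets.extend(sets_in_bucket)
    return duplicate_sets


reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTATTTTT", "ANGTACGTAC", "TTTTTCCCCC"]
set_sizes = sorted(len(s) for s in group_duplicates(reads))
```

Only reads that share a prefix are ever compared, which is why a smaller prefix length means larger buckets and more pairwise comparisons.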
122
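
The final filter mentioned in the help text (building a histogram of duplicate-set sizes and dropping bins that hold exactly one duplicate set) might look like the sketch below. The helper names are hypothetical, and the library-size estimate that Picard computes afterwards is deliberately not reproduced here.

```python
from collections import Counter


def duplication_histogram(set_sizes):
    """Map 'number of reads in a duplicate set' to 'number of such sets'."""
    return Counter(set_sizes)


def drop_singleton_bins(histogram):
    """Remove bins that contain exactly one duplicate set, treated as outliers."""
    return {size: n_sets for size, n_sets in histogram.items() if n_sets != 1}


# Three sets of size 1, two sets of size 2, and a lone set of size 7.
histogram = duplication_histogram([1, 1, 1, 2, 2, 7])
filtered = drop_singleton_bins(histogram)
```

In this toy input the single set of seven reads is the kind of outlier the filter is meant to discard, for example a technical sequence that was never screened out by alignment.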
123
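
To check a read-name regular expression before submitting a job, the default pattern from the help text can be tried in Python. This is a sketch only: Picard uses Java regular expressions, the example read names are made up, and the rule in `is_optical_duplicate` (same tile, with x and y offsets each within the pixel distance) is an assumption about how the offset is applied.

```python
import re

# Default READ_NAME_REGEX as quoted in the help text.
DEFAULT_READ_NAME_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_read_name(name, pattern=DEFAULT_READ_NAME_REGEX):
    """Return (tile, x, y) as integers, or None if the name does not match."""
    match = re.fullmatch(pattern, name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


def is_optical_duplicate(a, b, max_pixel_distance=100):
    """Assumed rule: same tile, and both coordinate offsets within the distance."""
    return (
        a[0] == b[0]
        and abs(a[1] - b[1]) <= max_pixel_distance
        and abs(a[2] - b[2]) <= max_pixel_distance
    )


first = parse_read_name("EAS139:6:73:941:1973#0/1")
second = parse_read_name("EAS139:6:73:990:2010#0/1")
other_tile = parse_read_name("EAS139:6:74:941:1973#0/1")
unparsed = parse_read_name("read_1")
```

A name that returns `None` here is a hint that the regular expression needs adjusting for that dataset's read-name format.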