comparison weeder2_wrapper.xml @ 0:496bc4eff47e draft

Initial version.
author pjbriggs
date Wed, 19 Nov 2014 07:56:27 -0500
parents
children 571cb77ab9e7
comparison
-1:000000000000 0:496bc4eff47e
1 <tool id="motiffinding_weeder2" name="Weeder2" version="2.0.0">
2 <description>Motif discovery in sequences from coregulated genes of a single species</description>
3 <command interpreter="bash">weeder2_wrapper.sh
4 $sequence_file $species_code
5 $output_motifs_file $output_matrix_file
6 $strands
7 #if $chipseq.use_chipseq
8 -chipseq -top $chipseq.top
9 #end if
10 #if str( $advanced_options.advanced_options_selector ) == "on"
11 -maxm $advanced_options.n_motifs_report
12 -b $advanced_options.n_motifs_build
13 -sim $advanced_options.sim_threshold
14 -em $advanced_options.em_cycles
15 #end if
16 </command>
17 <requirements>
18 <requirement type="package" version="2.0">weeder</requirement>
19 </requirements>
20 <inputs>
21 <param name="sequence_file" type="data" format="fasta" label="Input sequence" />
22 <param name="species_code" type="select" label="Species to use for background comparison">
23 <!-- Hard code options for now
24 See weeder's "organisms.txt" for full list
25 -->
26 <option value="HS">Homo sapiens (HS)</option>
27 <option value="MM">Mus musculus (MM)</option>
28 <option value="DM">Drosophila melanogaster (DM)</option>
29 <option value="SC">Saccharomyces cerevisiae (SC)</option>
30 <option value="AT">Arabidopsis thaliana (AT)</option>
31 </param>
32 <param name="strands" label="Use both strands of sequence" type="boolean"
33 truevalue="" falsevalue="-ss" checked="True"
34 help="If not checked then use -ss option" />
35 <conditional name="chipseq">
36 <param name="use_chipseq" type="boolean"
37 label="Use the ChIP-seq heuristic"
38 help="Speeds up the computation (-chipseq)"
39 truevalue="yes" falsevalue="no" checked="on" />
40 <when value="yes">
41 <param name="top" type="integer" value="100"
42 label="Number of top input sequences with oligos to scan for"
43 help="Increase this value to improve the chance of finding motifs enriched only in a subset of your input sequences (-top)" />
44 </when>
45 <when value="no"></when>
46 </conditional>
47 <conditional name="advanced_options">
48 <param name="advanced_options_selector" type="select"
49 label="Display advanced options">
50 <option value="off">Hide</option>
51 <option value="on">Display</option>
52 </param>
53 <when value="on">
54 <param name="n_motifs_report" type="integer" value="25"
55 label="Number of discovered motifs to report" help="(-maxm)" />
56 <param name="n_motifs_build" type="integer" value="50"
57 label="Number of top scoring motifs to build occurrences matrix profiles and outputs for"
58 help="(-b)" />
59 <param name="sim_threshold" type="float" min="0.0" max="1.0" value="0.95"
60 label="Similarity threshold for the redundancy filter"
61 help="Remove motifs that are too similar, with lower values imposing a stricter filter. Must be between 0.0 and 1.0 (-sim)" />
62 <param name="em_cycles" type="integer" min="0" max="100" value="1"
63 label="Number of expectation maximization (EM) cycles to perform"
64 help="Number of cycles must be between 0 and 100 (-em)" />
65 </when>
66 <when value="off">
67 </when>
68 </conditional>
69 </inputs>
70 <outputs>
71 <data name="output_motifs_file" format="txt" label="Weeder2 on ${on_string} (motifs)" />
72 <data name="output_matrix_file" format="txt" label="Weeder2 on ${on_string} (matrix)" />
73 </outputs>
74 <tests>
75 <test>
76 <param name="sequence_file" value="weeder_in.fa" ftype="fasta" />
77 <param name="species_code" value="MM" />
78 <output name="output_motifs_file" file="weeder2_motifs.out" lines_diff="2" />
79 <output name="output_matrix_file" file="weeder2_matrix.out" />
80 </test>
81 </tests>
82 <help>
83
84 .. class:: infomark
85
86 **What it does**
87
88 Weeder2 is a program for finding novel motifs (transcription factor binding sites)
89 conserved in a set of regulatory regions of related genes.
90
91 -------------
92
93 .. class:: infomark
94
95 **Usage advice**
96
97 Detailed guidelines for this tool are given in Zambelli et al. 2014 (see link
98 below); what follows is a brief guide. Note that a **motif** is a model or matrix
99 describing a set of sequences that may differ in base composition, whereas
100 **oligos** are specific sequences found within the input sequences or genomic
101 background.
102
103 **Input sequence** (in FASTA format) should be short (100-200 bp) and reasonably
104 expected to contain one or more enriched motifs. This is not generally an issue for
105 transcription factor ChIP-seq sequences centred on the summits of binding regions,
106 which are expected to contain a dominant motif and possibly secondary motifs.
107
108 There is **no need to mask repetitive sequence**, as factors may
109 legitimately bind repetitive sequence.
110
111 **Use both strands of sequence** by default, unless there is a specific reason not
112 to do so.
113
114 **Species to use for background comparison** should match the genome used to
115 generate the **input sequence**. The background motif frequencies are
116 computed from the promoter regions of annotated genes and have been shown to be a
117 good background for both promoter and other regulatory regions.
118
119 **Use the ChIP-seq heuristic** (-chipseq) when there is a large number of
120 input sequences (hundreds or thousands). With -chipseq, Weeder builds motifs
121 only from oligos in the first 100 sequences (the default value of -top) and then
122 scans all of the input sequences with them. This reduces computation time without
123 much risk of losing important motifs. Although not strictly necessary, it is advisable
124 to order input sequences by significance, e.g. fold enrichment or P-value. For
125 large data sets, -top should be set to at least 10 to 20% of the number of
126 input sequences (as recommended by the authors).
127
128 **Number of discovered motifs to report** (-maxm) caps the number of reported
129 motifs, even if more are discovered. **Number of top scoring motifs to build
130 occurrences matrix profiles and outputs for** (-b) sets the number of top
131 scoring motifs of length 6, 8 and 10 for which the occurrence matrix is built.
132 Increasing -b may yield more reported motifs, potentially including more of low
133 significance, and increases computation time. If increasing -b does not yield
134 more motifs in your results, then either the additional motifs were removed
135 by the redundancy filter or the maximum number of reported motifs
136 set by -maxm has been reached.
137
138 For **Similarity threshold for the redundancy filter** (-sim), the default setting
139 is recommended.
140
141 For **Number of expectation maximization (EM) cycles to perform** (-em), the default
142 is recommended. The option is included to help "clean up" the resulting motif matrices.
143 In this version the number of EM steps can be increased, which can be useful for
144 motifs with highly redundant stretches of sequence.
145
146 -------------
147
148 .. class:: infomark
149
150 **A note on the results**
151
152 The output matrices are produced by scanning (by default both strands) for
153 oligos of length 6, 8 and 10, allowing 1, 2 and 3 substitutions respectively. The
154 matrices within the matrix.w2 file can be input into other tools. The recommended
155 next step is to use **STAMP** (http://www.benoslab.pitt.edu/stamp/), which displays
156 the motifs as logos and identifies matches with libraries of known DNA binding
157 motifs, such as TRANSFAC or JASPAR.
158
159 -------------
160
161 .. class:: infomark
162
163 **Credits**
164
165 This Galaxy tool has been developed by Peter Briggs and Ian Donaldson within the
166 Bioinformatics Core Facility at the University of Manchester, and runs the Weeder2
167 motif discovery package:
168
169 * Zambelli, F., Pesole, G. and Pavesi, G. 2014. Using Weeder, Pscan, and PscanChIP
170 for the Discovery of Enriched Transcription Factor Binding Site Motifs in
171 Nucleotide Sequences. Current Protocols in Bioinformatics. 47:2.11:2.11.1–2.11.31.
172 * http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0211s47/full
173
174 This tool is compatible with Weeder 2.0:
175
176 * http://159.149.160.51/modtools/downloads/weeder2.html
177
178 Please acknowledge this Galaxy tool, the Weeder package and the utility
179 scripts if you use them in your work.
180 </help>
181 </tool>
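
As a rough illustration of how the `<command>` template in this file expands, the Python sketch below assembles the argument list that `weeder2_wrapper.sh` would receive. It is not part of the changeset. The function name `build_weeder2_args` and its keyword parameters are hypothetical stand-ins for the Galaxy parameters. The flag names, defaults and ranges are copied from the tool definition above, and the file names in the usage lines are placeholders.

```python
def build_weeder2_args(sequence_file, species_code, motifs_out, matrix_out,
                       both_strands=True, use_chipseq=True, top=100,
                       advanced=None):
    """Mirror the <command> template: positional arguments, then optional flags.

    `advanced` is None (selector "off") or a dict overriding any of
    n_motifs_report, n_motifs_build, sim_threshold, em_cycles (selector "on").
    This helper is a hypothetical illustration, not code shipped with the tool.
    """
    args = ["weeder2_wrapper.sh", sequence_file, species_code,
            motifs_out, matrix_out]

    # The "strands" boolean expands to "" when checked and "-ss" otherwise.
    if not both_strands:
        args.append("-ss")

    # Corresponds to the "#if $chipseq.use_chipseq" block.
    if use_chipseq:
        args += ["-chipseq", "-top", str(top)]

    # Corresponds to the advanced_options_selector == "on" block.
    if advanced is not None:
        # Defaults taken from the <param> elements in the tool definition.
        opts = {"n_motifs_report": 25, "n_motifs_build": 50,
                "sim_threshold": 0.95, "em_cycles": 1}
        opts.update(advanced)
        # Ranges taken from the min/max attributes in the tool definition.
        if not 0.0 <= opts["sim_threshold"] <= 1.0:
            raise ValueError("sim_threshold must be between 0.0 and 1.0")
        if not 0 <= opts["em_cycles"] <= 100:
            raise ValueError("em_cycles must be between 0 and 100")
        args += ["-maxm", str(opts["n_motifs_report"]),
                 "-b", str(opts["n_motifs_build"]),
                 "-sim", str(opts["sim_threshold"]),
                 "-em", str(opts["em_cycles"])]
    return args


# Placeholder file names; with the tool defaults only the ChIP-seq flags are added.
print(" ".join(build_weeder2_args("in.fa", "MM", "motifs.txt", "matrix.txt")))
# prints: weeder2_wrapper.sh in.fa MM motifs.txt matrix.txt -chipseq -top 100
```

Passing `both_strands=False` adds `-ss`, and passing a dict for `advanced` appends the `-maxm`, `-b`, `-sim` and `-em` flags in the same order as the template.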