<tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True">
  <description>: order sequence matches in clusters</description>
  <command>
    <!-- update this path to the installed location -->
    $tool.cmd
    #if $tool.cmd=="gaps":
      $tool.in_reference
      #if $tool.gaps_r=="yes":
        -r
      #end if
    #end if
    #if $tool.cmd=="mgaps":
      #if $tool.cmd_C=="yes":
        -C
      #end if
      -d $tool.cmd_d
      #if $tool.cmd_e=="yes":
        -e
      #end if
      -f $tool.cmd_f
      -l $tool.cmd_l
      -s $tool.cmd_s
    #end if
    &lt; $tool.in_match_list
    &gt; $out_tool

  </command>
  <inputs>
    <conditional name="tool">
      <param name="cmd" type="select" label="MUMmer maximal matching" help="Algorithms are run with default parameters (none). For specific args see help below" >
        <option value="gaps" selected="true">gaps</option>
        <option value="mgaps">mgaps</option>
      </param>
      <when value="gaps">
        <param name="in_reference" type="data" format="fasta" label="Reference FastA file" />
        <param name="gaps_r" type="select" label="Use reversed [-r]" >
          <option value="no" selected="true">No</option>
          <option value="yes">Yes</option>
        </param>
        <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
      </when>
      <when value="mgaps">
        <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
        <param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" >
          <option value="no" selected="true">No</option>
          <option value="yes">Yes</option>
        </param>
        <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" />
        <param name="cmd_e" type="select" label="Use extent of cluster [-e]" >
          <option value="no" selected="true">No</option>
          <option value="yes">Yes</option>
        </param>
        <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" />
        <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" />
        <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" />
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data name="out_tool" format="text" label="Clustering output" />
  </outputs>
  <requirements>
    <!-- <requirement type="set_environment" version="3.23">MUMMER_PATH</requirement> -->
    <requirement type="package" version="4.6.4">gnuplot</requirement>
    <requirement type="package" version="3.23">mummer</requirement>
  </requirements>
  <tests>
    <test>
    </test>
  </tests>
  <help>
|


**Reference**
=============

- **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands.

- **MUMmer suite v3.22:** http://mummer.sourceforge.net

- **MUMmer tutorials:** http://mummer.sourceforge.net/examples/

If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve
or modify the wrappers, please add yourself to the acknowledgement section rather than replacing the existing names :)


**MUMmer Clustering**
=====================

MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters
in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment
regions from a match list. Without graphic aids, however, it is very difficult to draw any reasonable
conclusions from the simple flat file list of matches. Clustering the matches
together into larger groups of neighboring matches makes this process much easier by ordering the
data and removing spurious matches.


Gaps
----

*gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering"
step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm
to extract the longest consistent set of matches between two sequences, and generates a single
cluster that represents the best "straight-line" arrangement of matches between the sequences. By
straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches
between the two sequences. This limits the usability of this program to the alignment of genomes
that are very similar and have no large-scale mutations. *gaps* is best suited for the comparison of
near-identical sequences with the goal of finding minor mutations like SNPs and small indels.
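
The LIS idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the gaps source code: the function name ``longest_consistent_chain`` and the (reference start, query start, length) tuples are hypothetical, every match counts equally regardless of its length, and the wrap-around and overlap handling that gaps reports is left out.

```python
from bisect import bisect_left


def longest_consistent_chain(matches):
    """Return the longest chain of matches whose query starts increase
    together with their reference starts (no rearrangements, no inversions).

    `matches` is an iterable of (ref_start, qry_start, length) tuples.
    Hypothetical helper for illustration only.
    """
    ordered = sorted(matches)        # order by reference start
    tails = []                       # tails[k]: smallest query start ending a chain of k + 1 matches
    tails_idx = []                   # index into `ordered` of the match stored in tails[k]
    prev = [None] * len(ordered)     # back pointers used to rebuild the chain

    for i, (_ref, qry, _length) in enumerate(ordered):
        k = bisect_left(tails, qry)  # position of this match in the patience piles
        if k == len(tails):
            tails.append(qry)
            tails_idx.append(i)
        else:
            tails[k] = qry
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else None

    # Walk the back pointers from the end of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# Made-up matches: the third one jumps far ahead in the query and breaks the order.
example = [(100, 10, 20), (200, 50, 30), (300, 500, 25), (400, 90, 40), (500, 130, 22)]
print(longest_consistent_chain(example))
```

Matches left out of the returned chain roughly correspond to what gaps lists under "Other matches".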

Input can be filtered mummer output. The unusual calling syntax is a result of a legacy issue described in
the Known problems section of the MUMmer manual, and it requires that the header be stripped from the mummer output. In
addition, gaps is only designed to handle a single reference and a single query sequence, so the
preceding mummer run must also follow this constraint. The -r option is optional and marks the incoming
matches as reverse complement matches. These must reference the reverse complement of the sequence,
so mummer has to be run without the -c option.
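
Stripping the header could look like the following Python sketch. It assumes that header lines are the ones starting with '>' (as in the output examples on this page) and that the input holds matches for one query on one strand; ``strip_mummer_headers`` is a hypothetical helper name, not part of MUMmer or of this wrapper.

```python
def strip_mummer_headers(text):
    """Drop '>' header lines and blank lines, keeping only the match records.

    Hypothetical helper: assumes every header line starts with '>'.
    """
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(">")
    ]
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up three-column match list with a single header line.
raw = "> query1\n     183       17     22\n     238       72    108\n"
print(strip_mummer_headers(raw), end="")
```

The result can then be supplied as the match list for *gaps*.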

Reference: http://mummer.sourceforge.net/manual/#gaps

**Output:**
::

    > /home/aphillip/data/GHP.1con Consistent matches
         183       17     22   none      -      -
         238       72    108   none     33     33
         347      181     92   none      1      1
         458      292     50   none     19     19
         705      539     44   none      1      1
         750      584     38   none      1      1
         807      641     23    -16      0      4
    (output continues ...)
    > Wrap around
      334398   329917     47   none      -    225
      334446   329965     62   none      1      1
      334539   330058     20   none     31     31
      334560   330079     92   none      1      1
      334653   330172     77   none      1      1
      334740   330259     41   none     10     10
    (output continues ...)
    > /home/aphillip/data/GHP.1con Other matches
     1317231     4891     21   none      -      -
     1317275     4927     21   none      -      -
     1317804     5399     25   none    508    451
      947580     5436     36   none      -      -
       23406     5518     34   none      -      -
      333079     6592     32   none      -      -
    (output continues ...)

The first line gives the location of the reference file, and the first three columns are the same
as the three-column match format described in the mummer section. The final three columns are, in order: the
overlap between this match and the previous match; the gap between the start of this match and the
end of the previous match in the reference; and the gap between the start of this match and the end
of the previous match in the query.
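
For downstream processing, the six columns can be read with a few lines of Python. The sketch below is only an illustration: ``parse_gaps_output`` and the dictionary keys are hypothetical names, 'none' and '-' are mapped to ``None``, and the sample text is a shortened copy of the example above.

```python
def _int_or_none(field, missing):
    """Convert a column to int, mapping its 'missing' marker to None."""
    return None if field == missing else int(field)


def parse_gaps_output(text):
    """Split gaps output into sections ('>' headers) of six-column records."""
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            sections.append({"header": line[1:].strip(), "matches": []})
            continue
        if not sections:  # tolerate records that appear before any header
            sections.append({"header": "", "matches": []})
        ref, qry, length, overlap, ref_gap, qry_gap = line.split()
        sections[-1]["matches"].append({
            "ref_start": int(ref),
            "qry_start": int(qry),
            "length": int(length),
            "overlap": _int_or_none(overlap, "none"),
            "ref_gap": _int_or_none(ref_gap, "-"),
            "qry_gap": _int_or_none(qry_gap, "-"),
        })
    return sections


sample = """\
> /home/aphillip/data/GHP.1con Consistent matches
     183       17     22   none      -      -
     238       72    108   none     33     33
     807      641     23    -16      0      4
> Wrap around
  334398   329917     47   none      -    225
"""
for section in parse_gaps_output(sample):
    print(section["header"], len(section["matches"]))
```

Lines such as "(output continues ...)" are part of the documentation example only and are not handled by this sketch.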


mgaps
-----

*mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale
rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable
of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of
command-line parameters that adjust the minimum cluster size, the maximum gap between matches, and so on. Only
matches that were included in clusters will appear in the output, so by adjusting the command-line
parameters it is possible to filter out many of the spurious matches, leaving only the larger
areas of conservation between the input sequences. The major advantage of mgaps is its ability to
identify these "islands" of conservation. This frees the user from the single-LIS constraint of the
gaps program and allows for the identification of large-scale rearrangements, duplications, gene
families and so on.

Gaps can fail to identify clusters because they are not consistent with the LIS. With
mgaps, all regions of conservation can be identified. The only drawback is the increased
complexity of the output: where you once had only one cluster for the whole comparison, you now
usually get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from the
"correct" clusters, *making mgaps more suited for global alignments than for localized error detection*.

Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or
more query sequences, so the preceding mummer run must also follow this constraint. Please refer
to the run-mummer3 script (see online manual) for an example of how to use this program in an
alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement
matches must reference the reverse complement strand of the query sequence, which means mummer
has to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences
and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but
may eventually appear.

The -d option can be interpreted as the number of insertions allowed between two matches in the same
cluster, while the -f option is a fraction equal to (diagonal difference / match separation), where
a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained
matches unless the -e option is used. The best way to get a feel for what each parameter controls
is to cluster the same data set numerous times with different values and observe the resulting
differences. It can also be helpful to set these parameters to the size of the element you wish to
capture. For example, set the minimum cluster size to the smallest exon you expect and the max gap
to the smallest intron you expect in order to obtain clusters that could represent single exons (depending,
of course, on the similarity of the two sequences).
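
One way to picture how -s, -d and -f interact is the Python sketch below. It is an interpretation of the description above, not a transcription of the mgaps source: it assumes that two matches may share a cluster when their separation does not exceed -s and their diagonal difference does not exceed the larger of -d and (-f times the separation), with separation measured along the query. The function name ``can_join`` and the (reference start, query start, length) tuples are hypothetical; the default values are the defaults of this wrapper's form.

```python
def can_join(prev_match, next_match, max_sep=1000, fixed_diag=5, diag_frac=0.05):
    """Decide whether two matches could belong to the same cluster.

    max_sep    ~ -s, fixed_diag ~ -d, diag_frac ~ -f (assumed reading).
    Matches are (ref_start, qry_start, length) tuples.
    """
    prev_ref, prev_qry, prev_len = prev_match
    next_ref, next_qry, _next_len = next_match

    # Distance from the end of the previous match to the start of the next one.
    separation = next_qry - (prev_qry + prev_len)
    if separation > max_sep:
        return False

    # A match lies on the diagonal (query start - reference start);
    # indels between two matches shift them onto different diagonals.
    diag_diff = abs((next_qry - next_ref) - (prev_qry - prev_ref))
    allowed = max(fixed_diag, int(diag_frac * separation))
    return allowed >= diag_diff


# Two neighbouring matches from the mgaps example output below (diagonal shift of 1).
print(can_join((5184433, 7035, 24), (5184468, 7069, 23)))      # True
# A made-up pair that is 200 diagonals apart: rejected with the defaults ...
print(can_join((100, 100, 50), (400, 200, 50)))                 # False
# ... but accepted once the fraction is raised far enough.
print(can_join((100, 100, 50), (400, 200, 50), diag_frac=5.0))  # True
```

Raising ``diag_frac`` widens the tolerance in proportion to the distance between matches, which is why a higher -f value tolerates larger indels.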

Reference: http://mummer.sourceforge.net/manual/#mgaps

**Output format**

Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different
header formatting than gaps to allow for multiple query sequences and multiple clusters. The output
of mgaps run on both forward and reverse complement matches is as follows:
::

    > ID41
    > ID41 Reverse
     5177399        1    232   none      -      -
     5177632      234   6794   none      1      1
     5184433     7035     24   none      7      7
     5184468     7069     23   none     11     10
    > ID42
       10181       43   1521   none      -      -
    > ID42 Reverse
     4654536       17     36   none      -      -
     4654578       57    298   none      6      4
     4654877      356    226   none      1      1
    #
     4655139      845     28   none      -      -
     4655178      884    694   none     11     11
     4655873     1579     20   none      1      1
    #
     4850044       17   1492   none      -      -
     4851537     1510    711   none      1      1
     4852249     2222     42   none      1      1
    (output continues ...)


Headers containing the ID for each query sequence are listed after the '>' characters, and a
following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters
for each sequence are separated by a '#' character, and the six columns are exactly the same as the
gaps output (see the gaps section for more details).
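
The header and separator rules can be turned into a small reader, sketched here in Python. The names ``parse_mgaps_output``, ``query_id``, ``reverse`` and ``clusters`` are hypothetical, the input is assumed to start with a '>' header, and the last three columns are kept as text because they may contain 'none' or '-'.

```python
def parse_mgaps_output(text):
    """Group mgaps output into per-query records, each with a list of clusters."""
    records = []
    current = None  # cluster currently being filled, or None if a new one must be started
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            reverse = len(fields) > 1 and fields[-1] == "Reverse"
            query_id = " ".join(fields[:-1] if reverse else fields)
            records.append({"query_id": query_id, "reverse": reverse, "clusters": []})
            current = None
        elif line == "#":
            current = None  # the next match line opens a new cluster
        else:
            cols = line.split()
            if current is None:
                current = []
                records[-1]["clusters"].append(current)
            ref, qry, length = (int(c) for c in cols[:3])
            current.append((ref, qry, length) + tuple(cols[3:]))
    return records


sample = """\
> ID41
> ID41 Reverse
 5177399        1    232   none      -      -
 5177632      234   6794   none      1      1
> ID42 Reverse
 4654536       17     36   none      -      -
#
 4655139      845     28   none      -      -
 4655178      884    694   none     11     11
"""
for record in parse_mgaps_output(sample):
    print(record["query_id"], record["reverse"], [len(c) for c in record["clusters"]])
```

A query with no clustered matches, such as the forward strand of ID41 in the example, simply yields an empty cluster list.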


|
|

  </help>
</tool>