comparison MUMmer/mummer_clustering.xml @ 0:61f30d177448 default tip

initial commit on Mummer toolsuite on toolshed
author eric
date Tue, 31 Mar 2015 14:19:49 +0200
1 <tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True">
2 <description>: order sequence matches in clusters</description>
3 <command>
4 <!-- update this path to the installed location -->
5 $tool.cmd
6 #if $tool.cmd=="gaps":
7 $tool.in_reference
8 #if $tool.gaps_r=="yes":
9 -r
10 #end if
11 #end if
12 #if $tool.cmd=="mgaps":
13 #if $tool.cmd_C=="yes":
14 -C
15 #end if
16 -d $tool.cmd_d
17 #if $tool.cmd_e=="yes":
18 -e
19 #end if
20 -f $tool.cmd_f
21 -l $tool.cmd_l
22 -s $tool.cmd_s
23 #end if
24 &lt; $tool.in_match_list
25 &gt; $out_tool
26
27 </command>
28 <inputs>
29 <conditional name="tool">
30 <param name="cmd" type="select" label="MUMmer clustering algorithm" help="Algorithms are run with default parameters (none). For specific args see help below" >
31 <option value="gaps" selected="true">gaps</option>
32 <option value="mgaps">mgaps</option>
33 </param>
34 <when value="gaps">
35 <param name="in_reference" type="data" format="fasta" label="Reference FastA file" />
36 <param name="gaps_r" type="select" label="Use reversed [-r]" >
37 <option value="no" selected="true">No</option>
38 <option value="yes">Yes</option>
39 </param>
40 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
41 </when>
42 <when value="mgaps">
43 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
44 <param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" >
45 <option value="no" selected="true">No</option>
46 <option value="yes">Yes</option>
47 </param>
48 <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" />
49 <param name="cmd_e" type="select" label="Use extent of cluster [-e]" >
50 <option value="no" selected="true">No</option>
51 <option value="yes">Yes</option>
52 </param>
53 <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" />
54 <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" />
55 <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" />
56 </when>
57 </conditional>
58 </inputs>
59 <outputs>
60 <data name="out_tool" format="text" label="Clustering output" />
61 </outputs>
62 <requirements>
63 <!-- <requirement type="set_environment" version="3.23">MUMMER_PATH</requirement> -->
64 <requirement type="package" version="4.6.4">gnuplot</requirement>
65 <requirement type="package" version="3.23">mummer</requirement>
66 </requirements>
67 <tests>
68 <test>
69 </test>
70 </tests>
71 <help>
72 |
73
74
75 **Reference**
76 =============
77
78 - **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands.
79
80 - **MUMmer suite v3.22:** http://mummer.sourceforge.net
81
82 - **MUMmer tutorials:** http://mummer.sourceforge.net/examples/
83
84 If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve
85 or modify the wrappers, please add yourself to the acknowledgement section rather than replacing the existing names :)
86
87
88 **MUMmer Clustering**
89 =====================
90
91 MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters
92 in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment
93 regions from a match list; however, when examining the data without graphic aids, it is very difficult
94 to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches
95 together into larger groups of neighboring matches makes this process much easier by ordering the
96 data and removing spurious matches.
97
98
99 Gaps
100 ----
101
102 *gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering"
103 step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm
104 to extract the longest consistent set of matches between two sequences, and generates a single
105 cluster that represents the best "straight-line" arrangement of matches between the sequences. By
106 straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches
107 between the two sequences. This limits the usability of this program to the alignment of genomes
108 that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of
109 near identical sequences with the goal of finding minor mutations like SNPs and small indels.
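
The LIS idea described above can be sketched in a few lines of Python. This is only an illustration, not the gaps implementation: it assumes matches are given as (reference start, query start, length) tuples, it ignores match lengths and overlaps, and the function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_matches(matches):
    """Return the longest chain of matches whose reference and query starts both increase.

    matches: iterable of (ref_start, query_start, length) tuples (assumed layout).
    """
    # Sort by reference start; ties are ordered by descending query start so that
    # two matches sharing a reference start can never end up in the same chain.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []      # tails[k]: smallest query start that ends a chain of k + 1 matches
    tail_idx = []   # index into `ordered` of the match stored in tails[k]
    prev = [None] * len(ordered)  # back-pointers used to rebuild the chain
    for i, match in enumerate(ordered):
        qry = match[1]
        k = bisect_left(tails, qry)
        if k == len(tails):
            tails.append(qry)
            tail_idx.append(i)
        else:
            tails[k] = qry
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

Given a toy list of matches, the function keeps only those that lie on one "straight-line" path and drops matches that would require a rearrangement, which is the behaviour the paragraph above attributes to gaps.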
110
111 Input can be filtered mummer output. The unusual invocation syntax is a result of a legacy issue described in
112 the Known problems section of the manual, and requires that the header be stripped from the mummer output. In
113 addition, gaps is only designed to handle a single reference and a single query sequence, so the
114 preceding mummer run must also follow this constraint. The -r option designates the incoming
115 matches as reverse complement matches, which must reference the reverse complement of the sequence,
116 therefore forcing mummer to be run without the -c option.
117
118 Reference: http://mummer.sourceforge.net/manual/#gaps
119
120 **Output:**
121 ::
122
123 > /home/aphillip/data/GHP.1con Consistent matches
124 183 17 22 none - -
125 238 72 108 none 33 33
126 347 181 92 none 1 1
127 458 292 50 none 19 19
128 705 539 44 none 1 1
129 750 584 38 none 1 1
130 807 641 23 -16 0 4
131 (output continues ...)
132 > Wrap around
133 334398 329917 47 none - 225
134 334446 329965 62 none 1 1
135 334539 330058 20 none 31 31
136 334560 330079 92 none 1 1
137 334653 330172 77 none 1 1
138 334740 330259 41 none 10 10
139 (output continues ...)
140 > /home/aphillip/data/GHP.1con Other matches
141 1317231 4891 21 none - -
142 1317275 4927 21 none - -
143 1317804 5399 25 none 508 451
144 947580 5436 36 none - -
145 23406 5518 34 none - -
146 333079 6592 32 none - -
147 (output continues ...)
148
149 Where the first line is the location of the reference file, and the first three columns are the same
150 as the three column match format described in the mummer section. The final three columns are the
151 overlap between this match and the previous match, the gap between the start of this match and the
152 end of the previous match in the reference, and the gap between the start of this match and the end
153 of the previous match in the query respectively.
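
As a rough illustration of the six-column layout described above, the following Python sketch turns one data row into a dictionary. The function name and the dictionary keys are made up for this example; the 'none' and '-' placeholders are mapped to None.

```python
def parse_match_row(line):
    """Parse one six-column row; 'none' and '-' placeholders become None."""
    ref_start, qry_start, length, overlap, ref_gap, qry_gap = line.split()

    def optional_int(field):
        # The last three columns may hold a placeholder instead of a number.
        return None if field in ("none", "-") else int(field)

    return {
        "ref_start": int(ref_start),
        "qry_start": int(qry_start),
        "length": int(length),
        "overlap": optional_int(overlap),
        "ref_gap": optional_int(ref_gap),
        "qry_gap": optional_int(qry_gap),
    }
```

Header lines (those starting with '>') are not handled here and would need to be skipped before calling the function.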
154
155
156 mgaps
157 -----
158
159 *mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale
160 rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable
161 of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of
162 command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only
163 matches that were included in clusters will appear in the output, so by adjusting the command-line
164 parameters it is possible to filter out many of the spurious matches, thus leaving only the larger
165 areas of conservation between the input sequences. The major advantage of mgaps is its ability to
166 identify these "islands" of conservation. This frees the user from the single LIS constraint of the
167 gaps program and allows for the identification of large-scale rearrangements, duplications, gene
168 families and so on.
169
170 Gaps can fail to identify clusters because they are not consistent with the LIS. By using
171 mgaps, all regions of conservation can be identified. The only drawback is the increased
172 complexity of the output: where you once had only one cluster for the whole comparison, you usually
173 now get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from
174 "correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*.
175
176 Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or
177 more query sequences, thus the preceding mummer run must also follow this constraint. Please refer
178 to the run-mummer3 script (see online manual) for an example of how to use this program in an
179 alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement
180 matches must reference the reverse complement strand of the query sequence, therefore forcing mummer
181 to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences
182 and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but
183 may eventually appear.
184
185 The -d option can be interpreted as the number of insertions allowed between two matches in the same
186 cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where
187 a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained
188 matches unless the -e option is used. The best way to get a feel for what each parameter controls
189 is to cluster the same data set numerous times with different values and observe the resulting
190 differences. It can also be helpful to set these parameters to the size of the element you wish to
191 capture, i.e. set the minimum cluster size to say the smallest exon you expect and set the max gap
192 to the smallest intron you expect to obtain clusters that could represent single exons (depending
193 of course on the similarity of the two sequences).
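
To make the interplay of -d, -f and -s more concrete, here is a small Python sketch of one plausible reading of the rule: two neighbouring matches may share a cluster when their separation does not exceed -s and their diagonal difference does not exceed the larger of -d and (-f times the separation). This is an assumption made for illustration and is not taken from the mgaps source; in particular, using the larger of the reference and query gaps as the separation is a choice of this sketch, and the function name is hypothetical.

```python
def can_join(prev, nxt, max_diag=5, max_frac=0.05, max_sep=1000):
    """Decide whether two matches could share a cluster under the assumed rule.

    prev, nxt: (ref_start, query_start, length) tuples, with prev preceding nxt.
    max_diag, max_frac and max_sep mirror the wrapper defaults for -d, -f and -s.
    """
    ref_gap = nxt[0] - (prev[0] + prev[2])
    qry_gap = nxt[1] - (prev[1] + prev[2])
    # Assumption: separation is the larger of the two gaps, never negative.
    separation = max(ref_gap, qry_gap, 0)
    if separation > max_sep:
        return False
    # The diagonal of a match is its reference start minus its query start.
    diag_diff = abs((nxt[0] - nxt[1]) - (prev[0] - prev[1]))
    allowed = max(max_diag, max_frac * separation)
    return allowed >= diag_diff
```

With the default values, a pair of close matches whose diagonals differ by one base is accepted, while a pair whose diagonals differ by a few hundred bases is rejected unless -d or -f is raised.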
194
195 Reference: http://mummer.sourceforge.net/manual/#mgaps
196
197 **Output format**
198
199 Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different
200 header formatting than gaps to allow for multiple query sequences and multiple clusters. The output
201 of mgaps run on both forward and reverse complement matches is as follows:
202 ::
203
204 > ID41
205 > ID41 Reverse
206 5177399 1 232 none - -
207 5177632 234 6794 none 1 1
208 5184433 7035 24 none 7 7
209 5184468 7069 23 none 11 10
210 > ID42
211 10181 43 1521 none - -
212 > ID42 Reverse
213 4654536 17 36 none - -
214 4654578 57 298 none 6 4
215 4654877 356 226 none 1 1
216 #
217 4655139 845 28 none - -
218 4655178 884 694 none 11 11
219 4655873 1579 20 none 1 1
220 #
221 4850044 17 1492 none - -
222 4851537 1510 711 none 1 1
223 4852249 2222 42 none 1 1
224 (output continues ...)
225
226
227 Headers containing the ID for each query sequence are listed after the '>' characters, and a
228 following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters
229 for each sequence are separated by a '#' character, and the six columns are exactly the same as the
230 gaps output (see the gaps section for more details).
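
The header and separator conventions described above can be read back with a short Python sketch such as the following. It assumes complete output (no truncated lines), that every data row is preceded by a header, and that only the first three columns are needed; the function name and the returned structure are choices made for this example.

```python
def parse_mgaps(lines):
    """Group mgaps rows into clusters keyed by (query_id, is_reverse)."""
    clusters = {}
    key = None
    start_new = True  # True when the next data row opens a new cluster
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: query ID, optionally followed by the Reverse keyword.
            fields = line[1:].split()
            is_reverse = len(fields) > 1 and fields[-1] == "Reverse"
            key = (fields[0], is_reverse)
            clusters.setdefault(key, [])
            start_new = True
        elif line == "#":
            # Separator between clusters of the same query sequence.
            start_new = True
        else:
            if start_new:
                clusters[key].append([])
                start_new = False
            ref_start, qry_start, length = (int(x) for x in line.split()[:3])
            clusters[key][-1].append((ref_start, qry_start, length))
    return clusters
```

Each dictionary value is a list of clusters, and each cluster is a list of (reference start, query start, length) tuples; a header with no rows below it yields an empty list.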
231
232
233 |
234 |
235
236 </help>
237 </tool>
238