comparison MUMmer/mummer_clustering.xml @ 0:61f30d177448 default tip
initial commit on Mummer toolsuite on toolshed
author | eric |
---|---|
date | Tue, 31 Mar 2015 14:19:49 +0200 |
parents | |
children |
-1:000000000000 | 0:61f30d177448 |
---|---|
1 <tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True"> | |
2 <description>: order sequence matches in clusters</description> | |
3 <command> | |
4 <!-- update this path to the installed location --> | |
5 $tool.cmd | |
6 #if $tool.cmd=="gaps": | |
7 $tool.in_reference | |
8 #if $tool.gaps_r=="yes": | |
9 -r | |
10 #end if | |
11 #end if | |
12 #if $tool.cmd=="mgaps": | |
13 #if $tool.cmd_C=="yes": | |
14 -C | |
15 #end if | |
16 -d $tool.cmd_d | |
17 #if $tool.cmd_e=="yes": | |
18 -e | |
19 #end if | |
20 -f $tool.cmd_f | |
21 -l $tool.cmd_l | |
22 -s $tool.cmd_s | |
23 #end if | |
24 < $tool.in_match_list | |
25 > $out_tool | |
26 | |
27 </command> | |
28 <inputs> | |
29 <conditional name="tool"> | |
30 <param name="cmd" type="select" label="MUMmer clustering algorithm" help="Algorithms are run with default parameters (none). For specific args see help below" > | |
31 <option value="gaps" selected="true">gaps</option> | |
32 <option value="mgaps">mgaps</option> | |
33 </param> | |
34 <when value="gaps"> | |
35 <param name="in_reference" type="data" format="fasta" label="Reference FastA file" /> | |
36 <param name="gaps_r" type="select" label="Use reversed [-r]" > | |
37 <option value="no" selected="true">No</option> | |
38 <option value="yes">Yes</option> | |
39 </param> | |
40 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> | |
41 </when> | |
42 <when value="mgaps"> | |
43 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> | |
44 <param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" > | |
45 <option value="no" selected="true">No</option> | |
46 <option value="yes">Yes</option> | |
47 </param> | |
48 <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" /> | |
49 <param name="cmd_e" type="select" label="Use extent of cluster [-e]" > | |
50 <option value="no" selected="true">No</option> | |
51 <option value="yes">Yes</option> | |
52 </param> | |
53 <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" /> | |
54 <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" /> | |
55 <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" /> | |
56 </when> | |
57 </conditional> | |
58 </inputs> | |
59 <outputs> | |
60 <data name="out_tool" format="text" label="Clustering output" /> | |
61 </outputs> | |
62 <requirements> | |
63 <!-- <requirement type="set_environment" version="3.23">MUMMER_PATH</requirement> --> | |
64 <requirement type="package" version="4.6.4">gnuplot</requirement> | |
65 <requirement type="package" version="3.23">mummer</requirement> | |
66 </requirements> | |
67 <tests> | |
68 <test> | |
69 </test> | |
70 </tests> | |
71 <help> | |
72 | | |
73 | |
74 | |
75 **Reference** | |
76 ============= | |
77 | |
78 - **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands. | |
79 | |
80 - **MUMmer suite v3.22:** http://mummer.sourceforge.net | |
81 | |
82 - **MUMmer tutorials:** http://mummer.sourceforge.net/examples/ | |
83 | |
84 If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve | |
85 or modify the wrappers, please add yourself to the acknowledgement section rather than replacing existing names :) | |
86 | |
87 | |
88 **MUMmer Clustering** | |
89 ===================== | |
90 | |
91 MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters | |
92 in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment | |
93 regions from a match list; however, when examining the data without graphic aids, it is very difficult | |
94 to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches | |
95 together into larger groups of neighboring matches makes this process much easier by ordering the | |
96 data and removing spurious matches. | |
97 | |
98 | |
99 Gaps | |
100 ---- | |
101 | |
102 *gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering" | |
103 step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm | |
104 to extract the longest consistent set of matches between two sequences, and generates a single | |
105 cluster that represents the best "straight-line" arrangement of matches between the sequences. By | |
106 straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches | |
107 between the two sequences. This limits the usability of this program to the alignment of genomes | |
108 that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of | |
109 near identical sequences with the goal of finding minor mutations like SNPs and small indels. | |
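
The "straight-line" selection described above can be illustrated with a plain longest increasing subsequence over match start positions. This is a minimal sketch and not the gaps implementation: the function name, the tuple layout (reference start, query start, length) and the sample data are assumptions made for illustration, and the simple quadratic search is chosen for readability rather than speed.

```python
def longest_consistent_chain(matches):
    """Return the longest chain whose reference and query starts both increase.

    matches: iterable of (ref_start, qry_start, length) tuples (assumed layout).
    """
    ordered = sorted(matches)  # sort by reference start
    if not ordered:
        return []
    n = len(ordered)
    best = [1] * n    # best[i]: number of matches in the longest chain ending at i
    prev = [-1] * n   # prev[i]: index of the previous match in that chain
    for i in range(n):
        for j in range(i):
            increasing = (ordered[i][0] > ordered[j][0]
                          and ordered[i][1] > ordered[j][1])
            if increasing and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    # Walk back from the end of the longest chain.
    end = max(range(n), key=lambda k: best[k])
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    chain.reverse()
    return chain


# Hypothetical match list: the second and last matches break the ordering.
matches = [(100, 10, 20), (200, 500, 25), (300, 60, 30), (400, 120, 22), (500, 90, 21)]
print(longest_consistent_chain(matches))
```

Matches left out of the chain correspond to what the gaps output lists separately from the consistent matches.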
110 | |
111 Input can be filtered mummer output. The strange syntax (reference file as an argument, match list on | |
112 standard input) is a result of a legacy issue described in the Known problems section of the manual, and requires that the header be stripped from the mummer output. In | |
113 addition, gaps is only designed to handle a single reference and a single query sequence, thus the | |
114 preceding mummer run must also follow this constraint. The -r is optional and designates the incoming | |
115 matches as reverse complement matches which must reference the reverse complement of the sequence, | |
116 therefore forcing mummer to be run without the -c option. | |
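
Stripping the header before passing a match list to gaps could look like the sketch below. It assumes that header lines are the ones beginning with '>' (as in the output examples in this help) and that the list holds a single query sequence; the function name and sample lines are hypothetical.

```python
def strip_headers(lines):
    """Drop header lines (those starting with '>') and blank lines."""
    kept = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith(">"):
            kept.append(text)
    return kept


# Hypothetical mummer output for one query sequence.
raw = ["> query1", "  183  17  22", "  238  72  108"]
print("\n".join(strip_headers(raw)))
```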
117 | |
118 Reference: http://mummer.sourceforge.net/manual/#gaps | |
119 | |
120 **Output:** | |
121 :: | |
122 | |
123 > /home/aphillip/data/GHP.1con Consistent matches | |
124 183 17 22 none - - | |
125 238 72 108 none 33 33 | |
126 347 181 92 none 1 1 | |
127 458 292 50 none 19 19 | |
128 705 539 44 none 1 1 | |
129 750 584 38 none 1 1 | |
130 807 641 23 -16 0 4 | |
131 (output continues ...) | |
132 > Wrap around | |
133 334398 329917 47 none - 225 | |
134 334446 329965 62 none 1 1 | |
135 334539 330058 20 none 31 31 | |
136 334560 330079 92 none 1 1 | |
137 334653 330172 77 none 1 1 | |
138 334740 330259 41 none 10 10 | |
139 (output continues ...) | |
140 > /home/aphillip/data/GHP.1con Other matches | |
141 1317231 4891 21 none - - | |
142 1317275 4927 21 none - - | |
143 1317804 5399 25 none 508 451 | |
144 947580 5436 36 none - - | |
145 23406 5518 34 none - - | |
146 333079 6592 32 none - - | |
147 (output continues ...) | |
148 | |
149 Where the first line is the location of the reference file, and the first three columns are the same | |
150 as the three column match format described in the mummer section. The final three columns are the | |
151 overlap between this match and the previous match, the gap between the start of this match and the | |
152 end of the previous match in the reference, and the gap between the start of this match and the end | |
153 of the previous match in the query respectively. | |
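
As a reading aid, one data line of this six-column format could be parsed as sketched below. The field names are assumptions chosen for this example (the three-column match format is taken to be reference start, query start and match length), and the placeholders "none" and "-" are mapped to None.

```python
from typing import NamedTuple, Optional


class ClusterMatch(NamedTuple):
    ref_start: int          # match start in the reference
    qry_start: int          # match start in the query
    length: int             # match length
    overlap: Optional[int]  # overlap with the previous match
    ref_gap: Optional[int]  # gap to the previous match in the reference
    qry_gap: Optional[int]  # gap to the previous match in the query


def _int_or_none(field):
    """Convert a column to int, treating 'none' and '-' as missing."""
    return None if field in ("none", "-") else int(field)


def parse_match_line(line):
    """Parse one six-column match line into a ClusterMatch."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected six columns: %r" % line)
    start_cols = [int(value) for value in fields[:3]]
    gap_cols = [_int_or_none(value) for value in fields[3:]]
    return ClusterMatch(*start_cols, *gap_cols)


print(parse_match_line("807 641 23 -16 0 4"))
print(parse_match_line("183 17 22 none - -"))
```

Header lines (starting with '>') are not data lines and need to be handled separately.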
154 | |
155 | |
156 mgaps | |
157 ----- | |
158 | |
159 *mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale | |
160 rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable | |
161 of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of | |
162 command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only | |
163 matches that were included in clusters will appear in the output, so by adjusting the command-line | |
164 parameters it is possible to filter out many of the spurious matches, thus leaving only the larger | |
165 areas of conservation between the input sequences. The major advantage of mgaps is its ability to | |
166 identify these "islands" of conservation. This frees the user from the single LIS restraints of the | |
167 gaps program and allows for the identification of large-scale rearrangements, duplications, gene | |
168 families and so on. | |
169 | |
170 Gaps can fail to identify clusters because they were not consistent with the LIS. However, by using | |
171 mgaps, all regions of conservation can now be identified. The only drawback is the increased | |
172 complexity of the output: where you once had only one cluster for the whole comparison, you usually | |
173 now get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from | |
174 "correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*. | |
175 | |
176 Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or | |
177 more query sequences, thus the preceding mummer run must also follow this constraint. Please refer | |
178 to the run-mummer3 script (see online manual) for an example of how to use this program in an | |
179 alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement | |
180 matches must reference the reverse complement strand of the query sequence, therefore forcing mummer | |
181 to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences | |
182 and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but | |
183 may eventually appear. | |
184 | |
185 The -d option can be interpreted as the number of insertions allowed between two matches in the same | |
186 cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where | |
187 a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained | |
188 matches unless the -e option is used. The best way to get a feel for what each parameter controls | |
189 is to cluster the same data set numerous times with different values and observe the resulting | |
190 differences. It can also be helpful to set these parameters to the size of the element you wish to | |
191 capture, e.g. set the minimum cluster size to, say, the smallest exon you expect and set the max gap | |
192 to the smallest intron you expect, to obtain clusters that could represent single exons (depending, | |
193 of course, on the similarity of the two sequences). | |
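
To get a feel for how -d, -f and -s interact, the sketch below tests whether two matches could belong to the same cluster. It is an illustration under stated assumptions, not the mgaps source: separation is taken as the reference distance from the end of the first match to the start of the second, the diagonal of a match is its reference start minus its query start, and the allowed diagonal difference is assumed to be the larger of -d and -f times the separation. The function name is hypothetical; the defaults mirror this wrapper's form values.

```python
def can_join(a, b, max_diag_diff=5, max_fraction=0.05, max_separation=1000):
    """Decide whether match b may follow match a in one cluster (assumed rule).

    a, b: (ref_start, qry_start, length) tuples, with a before b in the reference.
    """
    a_ref, a_qry, a_len = a
    b_ref, b_qry, _ = b
    separation = b_ref - (a_ref + a_len)              # compared against -s
    diag_diff = abs((b_ref - b_qry) - (a_ref - a_qry))
    if separation > max_separation:
        return False
    # Assumption: the tolerance is the larger of the fixed (-d) and relative (-f) limits.
    allowed = max(max_diag_diff, max_fraction * separation)
    return allowed >= diag_diff


# Two neighbouring matches on nearly the same diagonal.
print(can_join((4654536, 17, 36), (4654578, 57, 298)))
# Two matches whose diagonals differ by more than either limit allows.
print(can_join((4654877, 356, 226), (4655139, 845, 28)))
```

Raising max_fraction in this sketch widens the tolerance for distant matches only, while max_diag_diff sets a floor that also applies to adjacent matches.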
194 | |
195 Reference: http://mummer.sourceforge.net/manual/#mgaps | |
196 | |
197 **Output format** | |
198 | |
199 Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different | |
200 header formatting than gaps to allow for multiple query sequences and multiple clusters. The output | |
201 of mgaps run on both forward and reverse complement matches is as follows: | |
202 :: | |
203 | |
204 > ID41 | |
205 > ID41 Reverse | |
206 5177399 1 232 none - - | |
207 5177632 234 6794 none 1 1 | |
208 5184433 7035 24 none 7 7 | |
209 5184468 7069 23 none 11 10 | |
210 > ID42 | |
211 10181 43 1521 none - - | |
212 > ID42 Reverse | |
213 4654536 17 36 none - - | |
214 4654578 57 298 none 6 4 | |
215 4654877 356 226 none 1 1 | |
216 # | |
217 4655139 845 28 none - - | |
218 4655178 884 694 none 11 11 | |
219 4655873 1579 20 none 1 1 | |
220 # | |
221 4850044 17 1492 none - - | |
222 4851537 1510 711 none 1 1 | |
223 4852249 2222 42 none 1 1 | |
224 (output continues ...) | |
225 | |
226 | |
227 Headers containing the ID for each query sequence are listed after the '>' characters, and a | |
228 following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters | |
229 for each sequence are separated by a '#' character, and the six columns are exactly the same as the | |
230 gaps output (see the gaps section for more details). | |
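
The layout described above (a '>' header per query and strand, '#' between clusters) could be read back with a small parser like the one sketched here. The function name and the returned structure are assumptions for illustration, only the first three columns of each match line are kept, and the input is expected to contain header, separator and match lines only.

```python
def parse_mgaps(lines):
    """Group mgaps output into records of (query_id, is_reverse, clusters).

    Each cluster is a list of (ref_start, qry_start, length) tuples.
    """
    records = []
    clusters = None      # cluster list belonging to the current header
    new_cluster = True   # True when the next match line opens a cluster
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            is_reverse = len(parts) > 1 and parts[-1] == "Reverse"
            clusters = []
            records.append((parts[0], is_reverse, clusters))
            new_cluster = True
        elif line == "#":
            new_cluster = True
        else:
            if clusters is None:
                raise ValueError("match line found before any header")
            ref_start, qry_start, length = (int(x) for x in line.split()[:3])
            if new_cluster:
                clusters.append([])
                new_cluster = False
            clusters[-1].append((ref_start, qry_start, length))
    return records


sample = """> ID42
10181 43 1521 none - -
> ID42 Reverse
4654536 17 36 none - -
4654578 57 298 none 6 4
#
4655139 845 28 none - -
""".splitlines()

for query_id, is_reverse, clusters in parse_mgaps(sample):
    print(query_id, "reverse" if is_reverse else "forward", len(clusters), "cluster(s)")
```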
231 | |
232 | |
233 | | |
234 | | |
235 | |
236 </help> | |
237 </tool> | |
238 |