comparison MUMmer/mummer_clustering.xml @ 0:6753195df9e0 default tip
Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
| author | abossers |
|---|---|
| date | Tue, 07 Jun 2011 17:49:58 -0400 |
| parents | |
| children |
| -1:000000000000 | 0:6753195df9e0 |
|---|---|
| 1 <tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True"> | |
| 2 <description>: order sequence matches in clusters</description> | |
| 3 <command> | |
| 4 <!-- update this path to the installed location --> | |
| 5 /opt/MUMmer/MUMmer/$tool.cmd | |
| 6 #if $tool.cmd=="gaps": | |
| 7 $tool.in_reference | |
| 8 #if $tool.gaps_r=="yes": | |
| 9 -r | |
| 10 #end if | |
| 11 #end if | |
| 12 #if $tool.cmd=="mgaps": | |
| 13 #if $tool.cmd_C=="yes": | |
| 14 -C | |
| 15 #end if | |
| 16 -d $tool.cmd_d | |
| 17 #if $tool.cmd_e=="yes": | |
| 18 -e | |
| 19 #end if | |
| 20 -f $tool.cmd_f | |
| 21 -l $tool.cmd_l | |
| 22 -s $tool.cmd_s | |
| 23 #end if | |
| 24 < $tool.in_match_list | |
| 25 > $out_tool | |
| 26 | |
| 27 </command> | |
| 28 <inputs> | |
| 29 <conditional name="tool"> | |
| 30 <param name="cmd" type="select" label="MUMmer maximal matching" help="Algorithms are run with default parameters (none). For specific args see help below" > | |
| 31 <option value="gaps" selected="true">gaps</option> | |
| 32 <option value="mgaps">mgaps</option> | |
| 33 </param> | |
| 34 <when value="gaps"> | |
| 35 <param name="in_reference" type="data" format="fasta" label="Reference FastA file" /> | |
| 36 <param name="gaps_r" type="select" label="Use reversed [-r]" > | |
| 37 <option value="no" selected="true">No</option> | |
| 38 <option value="yes">Yes</option> | |
| 39 </param> | |
| 40 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> | |
| 41 </when> | |
| 42 <when value="mgaps"> | |
| 43 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> | |
| 44 <param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" > | |
| 45 <option value="no" selected="true">No</option> | |
| 46 <option value="yes">Yes</option> | |
| 47 </param> | |
| 48 <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" /> | |
| 49 <param name="cmd_e" type="select" label="Use extent of cluster [-e]" > | |
| 50 <option value="no" selected="true">No</option> | |
| 51 <option value="yes">Yes</option> | |
| 52 </param> | |
| 53 <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" /> | |
| 54 <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" /> | |
| 55 <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" /> | |
| 56 </when> | |
| 57 </conditional> | |
| 58 </inputs> | |
| 59 <outputs> | |
| 60 <data name="out_tool" format="text" label="Clustering output" /> | |
| 61 </outputs> | |
| 62 <requirements> | |
| 63 <requirement type="binary">gaps</requirement> | |
| 64 <requirement type="binary">mgaps</requirement> | |
| 65 </requirements> | |
| 66 <tests> | |
| 67 <test> | |
| 68 </test> | |
| 69 </tests> | |
| 70 <help> | |
| 71 | | |
| 72 | |
| 73 | |
| 74 **Reference** | |
| 75 ============= | |
| 76 | |
| 77 - **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands. | |
| 78 | |
| 79 - **MUMmer suite v3.22:** http://mummer.sourceforge.net | |
| 80 | |
| 81 - **MUMmer tutorials:** http://mummer.sourceforge.net/examples/ | |
| 82 | |
| 83 If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve | |
| 84 or modify the wrappers, please add yourself to the acknowledgement section rather than replacing the existing names :) | |
| 85 | |
| 86 | |
| 87 **MUMmer Clustering** | |
| 88 ===================== | |
| 89 | |
| 90 MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters | |
| 91 in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment | |
| 92 regions from a match list; however, when examining the data without graphic aids, it is very difficult | |
| 93 to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches | |
| 94 together into larger groups of neighboring matches makes this process much easier by ordering the | |
| 95 data and removing spurious matches. | |
| 96 | |
| 97 | |
| 98 Gaps | |
| 99 ---- | |
| 100 | |
| 101 *gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering" | |
| 102 step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm | |
| 103 to extract the longest consistent set of matches between two sequences, and generates a single | |
| 104 cluster that represents the best "straight-line" arrangement of matches between the sequences. By | |
| 105 straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches | |
| 106 between the two sequences. This limits the usability of this program to the alignment of genomes | |
| 107 that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of | |
| 108 near identical sequences with the goal of finding minor mutations like SNPs and small indels. | |
| 109 | |
| 110 Input can be filtered mummer output. The strange syntax (reference file as an argument, match list on | |
| 111 standard input) is a result of a legacy issue described in the Known problems section of the manual, and requires that the header be stripped from the mummer output. In | |
| 112 addition, gaps is only designed to handle a single reference and a single query sequence, thus the | |
| 113 preceding mummer run must also follow this constraint. The -r is optional and designates the incoming | |
| 114 matches as reverse complement matches which must reference the reverse complement of the sequence, | |
| 115 therefore forcing mummer to be run without the -c option. | |
| 116 | |
| 117 Reference: http://mummer.sourceforge.net/manual/#gaps | |
| 118 | |
| 119 **Output:** | |
| 120 :: | |
| 121 | |
| 122 > /home/aphillip/data/GHP.1con Consistent matches | |
| 123 183 17 22 none - - | |
| 124 238 72 108 none 33 33 | |
| 125 347 181 92 none 1 1 | |
| 126 458 292 50 none 19 19 | |
| 127 705 539 44 none 1 1 | |
| 128 750 584 38 none 1 1 | |
| 129 807 641 23 -16 0 4 | |
| 130 (output continues ...) | |
| 131 > Wrap around | |
| 132 334398 329917 47 none - 225 | |
| 133 334446 329965 62 none 1 1 | |
| 134 334539 330058 20 none 31 31 | |
| 135 334560 330079 92 none 1 1 | |
| 136 334653 330172 77 none 1 1 | |
| 137 334740 330259 41 none 10 10 | |
| 138 (output continues ...) | |
| 139 > /home/aphillip/data/GHP.1con Other matches | |
| 140 1317231 4891 21 none - - | |
| 141 1317275 4927 21 none - - | |
| 142 1317804 5399 25 none 508 451 | |
| 143 947580 5436 36 none - - | |
| 144 23406 5518 34 none - - | |
| 145 333079 6592 32 none - - | |
| 146 (output continues ...) | |
| 147 | |
| 148 The first line is the location of the reference file, and the first three columns are the same | |
| 149 as mummer's three-column match format (start in reference, start in query, match length). The final three columns are the | |
| 150 overlap between this match and the previous match, the gap between the start of this match and the | |
| 151 end of the previous match in the reference, and the gap between the start of this match and the end | |
| 152 of the previous match in the query respectively. | |
| 153 | |
| 154 | |
| 155 mgaps | |
| 156 ----- | |
| 157 | |
| 158 *mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale | |
| 159 rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable | |
| 160 of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of | |
| 161 command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only | |
| 162 matches that were included in clusters will appear in the output, so by adjusting the command-line | |
| 163 parameters it is possible to filter out many of the spurious matches, thus leaving only the larger | |
| 164 areas of conservation between the input sequences. The major advantage of mgaps is its ability to | |
| 165 identify these "islands" of conservation. This frees the user from the single LIS restraints of the | |
| 166 gaps program and allows for the identification of large-scale rearrangements, duplications, gene | |
| 167 families and so on. | |
| 168 | |
| 169 *gaps* can miss conserved regions because they are not consistent with the LIS. With mgaps, all regions | |
| 170 of conservation can be identified. The only drawback is the increased complexity of the output: where | |
| 171 you once had a single cluster for the whole comparison, you now usually get several. Because of this, | |
| 172 it can sometimes be difficult to separate the repetitive clusters from the | |
| 173 "correct" clusters, *making mgaps better suited to global alignments than to localized error detection*. | |
| 174 | |
| 175 Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or | |
| 176 more query sequences, thus the preceding mummer run must also follow this constraint. Please refer | |
| 177 to the run-mummer3 script (see online manual) for an example of how to use this program in an | |
| 178 alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement | |
| 179 matches must reference the reverse complement strand of the query sequence, therefore forcing mummer | |
| 180 to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences | |
| 181 and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but | |
| 182 may eventually appear. | |
| 183 | |
| 184 The -d option can be interpreted as the number of insertions allowed between two matches in the same | |
| 185 cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where | |
| 186 a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained | |
| 187 matches unless the -e option is used. The best way to get a feel for what each parameter controls | |
| 188 is to cluster the same data set numerous times with different values and observe the resulting | |
| 189 differences. It can also be helpful to set these parameters to the size of the element you wish to | |
| 190 capture, e.g. set the minimum cluster size to the smallest exon you expect and set the max gap | |
| 191 to the smallest intron you expect, to obtain clusters that could represent single exons (depending, | |
| 192 of course, on the similarity of the two sequences). | |
| 193 | |
| 194 Reference: http://mummer.sourceforge.net/manual/#mgaps | |
| 195 | |
| 196 **Output format** | |
| 197 | |
| 198 Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different | |
| 199 header formatting than gaps to allow for multiple query sequences and multiple clusters. The output | |
| 200 of mgaps run on both forward and reverse complement matches is as follows: | |
| 201 :: | |
| 202 | |
| 203 > ID41 | |
| 204 > ID41 Reverse | |
| 205 5177399 1 232 none - - | |
| 206 5177632 234 6794 none 1 1 | |
| 207 5184433 7035 24 none 7 7 | |
| 208 5184468 7069 23 none 11 10 | |
| 209 > ID42 | |
| 210 10181 43 1521 none - - | |
| 211 > ID42 Reverse | |
| 212 4654536 17 36 none - - | |
| 213 4654578 57 298 none 6 4 | |
| 214 4654877 356 226 none 1 1 | |
| 215 # | |
| 216 4655139 845 28 none - - | |
| 217 4655178 884 694 none 11 11 | |
| 218 4655873 1579 20 none 1 1 | |
| 219 # | |
| 220 4850044 17 1492 none - - | |
| 221 4851537 1510 711 none 1 1 | |
| 222 4852249 2222 42 none 1 1 | |
| 223 (output continues ...) | |
| 224 | |
| 225 | |
| 226 Headers containing the ID for each query sequence are listed after the '>' characters, and a | |
| 227 following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters | |
| 228 for each sequence are separated by a '#' character, and the six columns are exactly the same as the | |
| 229 gaps output (see the gaps section for more details). | |
| 230 | |
| 231 | |
| 232 | | |
| 233 | | |
| 234 | |
| 235 </help> | |
| 236 </tool> | |
| 237 |
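
The gaps section of the help text says that gaps extracts the longest consistent "straight-line" set of matches using an LIS algorithm. The sketch below shows the textbook longest increasing subsequence technique applied to match tuples. It is an illustration only, not MUMmer's own implementation: the function name `longest_increasing_matches` is hypothetical, the chain is scored by number of matches (gaps itself may weight matches differently), and the last match in the example data is made up to show an out-of-order match being dropped. The first four matches are taken from the sample gaps output in the help text.

```python
from bisect import bisect_left


def longest_increasing_matches(matches):
    """Return the longest chain of matches whose reference and query starts both increase.

    matches: iterable of (ref_start, query_start, length) tuples.
    Illustrative sketch only; this is not the code used by gaps.
    """
    # Sort by reference start. Ties are ordered by descending query start so
    # that two matches sharing a reference start can never both be chained.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))

    tails = []      # tails[k]: index of the match ending the best chain of length k + 1
    tail_keys = []  # query starts of those matches, kept sorted for bisect
    prev = [None] * len(ordered)  # back-pointers used to rebuild the chain

    for i, (_, query_start, _) in enumerate(ordered):
        k = bisect_left(tail_keys, query_start)  # strict increase in query start
        prev[i] = tails[k - 1] if k > 0 else None
        if k == len(tails):
            tails.append(i)
            tail_keys.append(query_start)
        else:
            tails[k] = i
            tail_keys[k] = query_start

    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tails[-1] if tails else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


matches = [
    (183, 17, 22),
    (238, 72, 108),
    (1317231, 60, 21),  # hypothetical match that breaks the straight line
    (347, 181, 92),
    (458, 292, 50),
]
consistent = longest_increasing_matches(matches)
```

Matches left out of the returned chain correspond loosely to what the gaps output lists under "Other matches".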

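The mgaps output format described in the help text ('>' headers with an optional Reverse keyword, clusters separated by '#', six columns per match) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: `parse_mgaps` is a hypothetical helper, not part of MUMmer or Galaxy, it keeps only the first three columns (reference start, query start, match length), and it silently skips any line that does not begin with three integers.

```python
def parse_mgaps(text):
    """Parse mgaps-style output into (query_id, is_reverse, clusters) records.

    Each cluster is a list of (ref_start, query_start, length) tuples.
    Hypothetical helper for illustration; column meanings follow the help text.
    """
    records = []
    clusters = None  # cluster list of the record currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: query ID, optionally followed by the Reverse keyword.
            header = line[1:].split()
            query_id = header[0] if header else ""
            is_reverse = len(header) > 1 and header[-1] == "Reverse"
            clusters = [[]]
            records.append((query_id, is_reverse, clusters))
        elif line == "#":
            # Cluster separator within the current query sequence.
            if clusters is not None:
                clusters.append([])
        else:
            fields = line.split()
            if (clusters is not None and len(fields) >= 3
                    and all(f.isdigit() for f in fields[:3])):
                clusters[-1].append(tuple(int(f) for f in fields[:3]))
    # Drop empty clusters, e.g. headers that are not followed by any match.
    return [(qid, rev, [c for c in cl if c]) for qid, rev, cl in records]


# Excerpt of the sample output shown in the help text.
sample = """\
> ID41
> ID41 Reverse
5177399 1 232 none - -
5177632 234 6794 none 1 1
> ID42
10181 43 1521 none - -
> ID42 Reverse
4654536 17 36 none - -
#
4655139 845 28 none - -
"""
records = parse_mgaps(sample)
```

Keeping forward and reverse records separate mirrors the output layout, where the reverse matches for a query follow their own header.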