Mercurial > repos > kuyt002 > mummer_toolsuite
view MUMmer/mummer_clustering.xml @ 0:61f30d177448 default tip
initial commit on Mummer toolsuite on toolshed
author | eric |
---|---|
date | Tue, 31 Mar 2015 14:19:49 +0200 |
parents | |
children |
line wrap: on
line source
<tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True"> <description>: order sequence matches in clusters</description> <command> <!-- update this path to the installed location --> $tool.cmd #if $tool.cmd=="gaps": $in_reference #if $tool.gaps_r=="yes": -r #end if #end if #if $tool.cmd=="mgaps": #if $tool.cmd_C=="yes": -C #end if -d $tool.cmd_d #if $tool.cmd_e=="yes": -e #end if -f $tool.cmd_f -l $tool.cmd_l -s $tool.cmd_s #end if < $tool.in_match_list > $out_tool </command> <inputs> <conditional name="tool"> <param name="cmd" type="select" label="MUMmer maximal matching" help="Algorithms are run with default parameters (none). For specific args see help below" > <option value="gaps" selected="true">gaps</option> <option value="mgaps">mgaps</option> </param> <when value="gaps"> <param name="in_reference" type="data" format="fasta" label="Reference FastA file" /> <param name="gaps_r" type="select" label="Use reversed [-r]" > <option value="no" selected="true">No</option> <option value="yes">Yes</option> </param> <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> </when> <when value="mgaps"> <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> <param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" > <option value="no" selected="true">No</option> <option value="yes">Yes</option> </param> <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" /> <param name="cmd_e" type="select" label="Use extent of cluster [-e]" > <option value="no" selected="true">No</option> <option value="yes">Yes</option> </param> <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" /> <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" /> <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation adjecent matches in cluster [-s]" /> </when> </conditional> </inputs> <outputs> <data name="out_tool" format="text" label="Clustering output" /> </outputs> <requirements> <!-- <requirement type="set_environment" version="3.23">MUMMER_PATH</requirement> --> <requirement type="package" version="4.6.4">gnuplot</requirement> <requirement type="package" version="3.23">mummer</requirement> </requirements> <tests> <test> </test> </tests> <help> | **Reference** ============= - **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands. - **MUMmer suite v3.22:** http://mummer.sourceforge.net - **MUMmer tutorials:** http://mummer.sourceforge.net/examples/ If you found these tools/wrappers usefull in your research, please acknowledge our work. If you improve or modify the wrappers please add instead of substitute yourself into the acknowlegement section :) **MUMmer Clustering** ===================== MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment regions from a match list, however when examining the data without graphic aids, it is very difficult to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches together into larger groups of neighboring matches makes this process much easier by ordering the data and removing spurious matches. Gaps ---- *gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering" step, gaps is more of a sorting routine. It implements the LIS (longest increasing subset) algorithm to extract the longest consistent set of matches between two sequences, and generates a single cluster that represents the best "straight-line" arrangement of matches between the sequences. By straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches between the two sequences. This limits the usability of this program to the alignment of genomes that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of near identical sequences with the goal of finding minor mutations like SNPs and small indels. Input can be filtered mummer output. The strange syntax is a result of a legacy issue described in the Known problems (manual) section, and requires the header be stripped from the mummer output. In addition, gaps is only designed to handle a single reference and a single query sequence, thus the preceding mummer run must also follow this constraint. The -r is optional and designates the incoming matches as reverse complement matches which must reference the reverse complement of the sequence, therefore forcing mummer to be run without the -c option. Reference: http://mummer.sourceforge.net/manual/#gaps **Output:** :: > /home/aphillip/data/GHP.1con Consistent matches 183 17 22 none - - 238 72 108 none 33 33 347 181 92 none 1 1 458 292 50 none 19 19 705 539 44 none 1 1 750 584 38 none 1 1 807 641 23 -16 0 4 (output continues ...) > Wrap around 334398 329917 47 none - 225 334446 329965 62 none 1 1 334539 330058 20 none 31 31 334560 330079 92 none 1 1 334653 330172 77 none 1 1 334740 330259 41 none 10 10 (output continues ...) > /home/aphillip/data/GHP.1con Other matches 1317231 4891 21 none - - 1317275 4927 21 none - - 1317804 5399 25 none 508 451 947580 5436 36 none - - 23406 5518 34 none - - 333079 6592 32 none - - (output continues ...) Where the first line is the location of the reference file, and the first three columns are the same as the three column match format described in the mummer section. The final three columns are the overlap between this match and the previous match, the gap between the start of this match and the end of the previous match in the reference, and the gap between the start of this match and the end of the previous match in the query respectively. mgaps ----- *mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only matches that were included in clusters will appear in the output, so by adjusting the command-line parameters it is possible to filter out many of the spurious matches, thus leaving only the larger areas of conservation between the input sequences. The major advantage of mgaps is its ability to identify these "islands" of conservation. This frees the user from the single LIS restraints of the gaps program and allows for the identification of large-scale rearrangements, duplications, gene families and so on. Gaps can fail to identify clusters because they were not consistent with the LIS. However, by using mgaps, all regions of conservation can now been identified. The only fallback being the increased complexity of the output, where you once had only one cluster for the whole comparison, you usually now get more. Because of this, it can sometimes be difficult separating the repetitive clusters from "correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*. Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or more query sequences, thus the preceding mummer run must also follow this constraint. Please refer to the run-mummer3 script (see online manual) for an example of how to use this program in an alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement matches must reference the reverse complement strand of the query sequence, therefore forcing mummer to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but may eventually appear. The -d option can be interpreted as the number of insertions allowed between two matches in the same cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained matches unless the -e option is used. The best way to get a feel for what each parameter controls is to cluster the same data set numerous times with different values and observe the resulting differences. It can also be helpful to set these parameters to the size of the element you wish to capture, i.e. set the minimum cluster size to say the smallest exon you expect and set the max gap to the smallest intron you expect to obtain clusters that could represent single exons (depending of course of the similarity of the two sequences). Reference: http://mummer.sourceforge.net/manual/#mgaps **Output format** Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different header formatting than gaps to allow for multiple query sequences and multiple clusters. The output of mgaps run on both forward and reverse complement matches is as follows: :: > ID41 > ID41 Reverse 5177399 1 232 none - - 5177632 234 6794 none 1 1 5184433 7035 24 none 7 7 5184468 7069 23 none 11 10 > ID42 10181 43 1521 none - - > ID42 Reverse 4654536 17 36 none - - 4654578 57 298 none 6 4 4654877 356 226 none 1 1 # 4655139 845 28 none - - 4655178 884 694 none 11 11 4655873 1579 20 none 1 1 # 4850044 17 1492 none - - 4851537 1510 711 none 1 1 4852249 2222 42 none 1 1 (output continues ...) Headers containing the ID for each query sequence are listed after the '>' characters, and a following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters for each sequence are separated by a '#' character, and the six columns are exactly the same as the gaps output (see the gaps section for more details). | | </help> </tool>