
<tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True">
  <description>: order sequence matches in clusters</description>
  <command>
	<!-- update this path to the installed location -->
		$tool.cmd
		#if $tool.cmd=="gaps":
			$in_reference
			#if $tool.gaps_r=="yes":
				-r
			#end if
		#end if
		#if $tool.cmd=="mgaps":
			#if $tool.cmd_C=="yes":
				-C
			#end if
			-d $tool.cmd_d
			#if $tool.cmd_e=="yes":
				-e
			#end if
			-f $tool.cmd_f
			-l $tool.cmd_l
			-s $tool.cmd_s
		#end if
		&lt; $tool.in_match_list
		&gt; $out_tool

  </command>
	<inputs>
	  <conditional name="tool">
		<param name="cmd" type="select" label="MUMmer clustering algorithm" help="Select the clustering program to run. gaps takes no options other than -r; for the mgaps options see the help below" >
			<option value="gaps" selected="true">gaps</option>
			<option value="mgaps">mgaps</option>
		</param>
		<when value="gaps">
			<param name="in_reference" type="data" format="fasta" label="Reference FastA file" />
			<param name="gaps_r" type="select" label="Use reversed [-r]" >
				<option value="no" selected="true">No</option>
				<option value="yes">Yes</option>
			</param>
			<param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
		</when>
		<when value="mgaps">
			<param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
			<param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" >
				<option value="no" selected="true">No</option>
				<option value="yes">Yes</option>
			</param>
			<param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" />
			<param name="cmd_e" type="select" label="Use extent of cluster [-e]" >
				<option value="no" selected="true">No</option>
				<option value="yes">Yes</option>
			</param>
			<param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" />
			<param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" />
			<param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" />
		</when>
	  </conditional>
	</inputs>
	<outputs>
		<data name="out_tool" format="text" label="Clustering output" />
	</outputs>
	<requirements>
<!--         <requirement type="set_environment" version="3.23">MUMMER_PATH</requirement> -->
        <requirement type="package" version="4.6.4">gnuplot</requirement>
        <requirement type="package" version="3.23">mummer</requirement>
	</requirements>
	<tests>
		<test>
		</test>
	</tests>
	<help>
|


**Reference**
=============

- **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands.

- **MUMmer suite v3.22:** http://mummer.sourceforge.net

- **MUMmer tutorials:** http://mummer.sourceforge.net/examples/

If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve
or modify the wrappers, please add yourself to the acknowledgement section rather than replacing existing names :)


**MUMmer Clustering**
=====================

MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters
in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment
regions from a match list; however, when examining the data without graphic aids, it is very difficult
to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches
together into larger groups of neighboring matches makes this process much easier by ordering the
data and removing spurious matches.


Gaps
----

*gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering"
step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm
to extract the longest consistent set of matches between two sequences, and generates a single
cluster that represents the best "straight-line" arrangement of matches between the sequences. By
straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches
between the two sequences. This limits the usability of this program to the alignment of genomes
that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of
near identical sequences with the goal of finding minor mutations like SNPs and small indels.
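
The "straight-line" idea above can be illustrated with a small chaining routine. This is a minimal sketch, not the gaps source: the function name and the (ref_start, query_start, length) tuple layout are assumptions, the score is simply the summed match length, and overlaps between neighbouring matches are ignored.

```python
def longest_consistent_chain(matches):
    """Return the heaviest chain of matches that increases in both sequences.

    Each match is a hypothetical (ref_start, query_start, length) tuple.
    Simple O(n^2) dynamic programme; overlap handling is left out.
    """
    ordered = sorted(matches)            # sort by reference position
    n = len(ordered)
    if n == 0:
        return []
    score = [m[2] for m in ordered]      # best chain score ending at match i
    prev = [-1] * n                      # back-pointers for the traceback
    for i in range(n):
        for j in range(i):
            # j may precede i only if it starts earlier in both sequences
            if ordered[i][0] > ordered[j][0] and ordered[i][1] > ordered[j][1]:
                if score[j] + ordered[i][2] > score[i]:
                    score[i] = score[j] + ordered[i][2]
                    prev[i] = j
    i = max(range(n), key=score.__getitem__)
    chain = []
    while i != -1:                       # walk the back-pointers to the start
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

Any match whose query position breaks the increasing order (a rearrangement or a spurious hit) is left out of the returned chain, which is why this approach only suits sequences without large-scale rearrangements.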

Input can be filtered mummer output. The unusual invocation (reference file as an argument, match list
on standard input) is a result of a legacy issue described in the Known problems section of the manual,
and requires that the header be stripped from the mummer output. In
addition, gaps is only designed to handle a single reference and a single query sequence, thus the
preceding mummer run must also follow this constraint. The -r is optional and designates the incoming
matches as reverse complement matches which must reference the reverse complement of the sequence,
therefore forcing mummer to be run without the -c option.

Reference: http://mummer.sourceforge.net/manual/#gaps

**Output:**
::

 > /home/aphillip/data/GHP.1con  Consistent matches
      183       17     22    none      -      -
      238       72    108    none     33     33
      347      181     92    none      1      1
      458      292     50    none     19     19
      705      539     44    none      1      1
      750      584     38    none      1      1
      807      641     23     -16      0      4
 (output continues ...)
 > Wrap around
   334398   329917     47    none      -    225
   334446   329965     62    none      1      1
   334539   330058     20    none     31     31
   334560   330079     92    none      1      1
   334653   330172     77    none      1      1
   334740   330259     41    none     10     10
 (output continues ...)
 > /home/aphillip/data/GHP.1con  Other matches
  1317231     4891     21    none      -      -
  1317275     4927     21    none      -      -
  1317804     5399     25    none    508    451
   947580     5436     36    none      -      -
    23406     5518     34    none      -      -
   333079     6592     32    none      -      -
 (output continues ...)

Where the first line is the location of the reference file, and the first three columns are the same
as the three column match format described in the mummer section. The final three columns are the
overlap between this match and the previous match, the gap between the start of this match and the
end of the previous match in the reference, and the gap between the start of this match and the end
of the previous match in the query, respectively.
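
For downstream scripting, a data row in this six-column layout could be read as follows. This is a sketch only: `MatchRow`, its field names, and `parse_match_row` are hypothetical names chosen here, and the placeholders `none` and `-` are mapped to `None`.

```python
from typing import NamedTuple, Optional


class MatchRow(NamedTuple):
    ref_start: int                # column 1: start position in the reference
    query_start: int              # column 2: start position in the query
    length: int                   # column 3: match length
    overlap: Optional[int]        # column 4: overlap with the previous match
    ref_gap: Optional[int]        # column 5: gap to the previous match (reference)
    query_gap: Optional[int]      # column 6: gap to the previous match (query)


def parse_match_row(line):
    """Parse one six-column data row; 'none' and '-' become None."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")

    def optional_int(value):
        return None if value in ("none", "-") else int(value)

    return MatchRow(
        int(fields[0]), int(fields[1]), int(fields[2]),
        optional_int(fields[3]), optional_int(fields[4]), optional_int(fields[5]),
    )
```

Header lines (those starting with '>') are not data rows and would need to be skipped before calling this function.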


mgaps
-----

*mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale
rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable
of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of
command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only
matches that were included in clusters will appear in the output, so by adjusting the command-line
parameters it is possible to filter out many of the spurious matches, thus leaving only the larger
areas of conservation between the input sequences. The major advantage of mgaps is its ability to
identify these "islands" of conservation. This frees the user from the single-LIS constraint of the
gaps program and allows for the identification of large-scale rearrangements, duplications, gene
families and so on.

Gaps can fail to identify clusters because they were not consistent with the LIS. However, by using
mgaps, all regions of conservation can now be identified. The only drawback is the increased
complexity of the output: where you once had only one cluster for the whole comparison, you usually
now get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from
"correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*.

Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or
more query sequences, thus the preceding mummer run must also follow this constraint. Please refer
to the run-mummer3 script (see online manual) for an example of how to use this program in an
alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement
matches must reference the reverse complement strand of the query sequence, therefore forcing mummer
to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences
and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but
may eventually appear.

The -d option can be interpreted as the number of insertions allowed between two matches in the same
cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where
a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained
matches unless the -e option is used. The best way to get a feel for what each parameter controls
is to cluster the same data set numerous times with different values and observe the resulting
differences. It can also be helpful to set these parameters to the size of the element you wish to
capture, e.g. set the minimum cluster size to the smallest exon you expect and set the max gap
to the smallest intron you expect, to obtain clusters that could represent single exons (depending,
of course, on the similarity of the two sequences).
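
One way to read the -d, -f and -s description above is as a pairwise test on two neighbouring matches. The sketch below follows that reading and is not taken from the mgaps source: the function name, the (ref_start, query_start, length) tuple layout, and the choice to measure separation in the query from the end of one match to the start of the next are all assumptions. The defaults mirror this wrapper's form defaults.

```python
def may_share_cluster(a, b, max_diag_diff=5, max_frac=0.05, max_sep=1000):
    """Decide whether match b could join the cluster containing match a.

    a and b are hypothetical (ref_start, query_start, length) tuples,
    with a preceding b in the query.
    """
    separation = b[1] - (a[1] + a[2])        # query gap between a and b
    if separation > max_sep:                 # -s: matches are too far apart
        return False
    # A match's diagonal is its offset between query and reference;
    # an indel between two matches shifts the diagonal by its size.
    diag_diff = abs((b[1] - b[0]) - (a[1] - a[0]))
    # -d is a fixed allowance, -f scales the allowance with the separation
    return max(max_diag_diff, max_frac * separation) >= diag_diff
```

Under this reading, raising -f lets widely separated matches tolerate larger indels, while -d sets the floor that applies even to matches that sit close together.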

Reference: http://mummer.sourceforge.net/manual/#mgaps

**Output format**

Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different
header formatting than gaps to allow for multiple query sequences and multiple clusters. The output
of mgaps run on both forward and reverse complement matches is as follows:
::

 > ID41
 > ID41 Reverse
  5177399        1    232    none      -      -
  5177632      234   6794    none      1      1
  5184433     7035     24    none      7      7
  5184468     7069     23    none     11     10
 > ID42
    10181       43   1521    none      -      -
 > ID42 Reverse
  4654536       17     36    none      -      -
  4654578       57    298    none      6      4
  4654877      356    226    none      1      1
 #
  4655139      845     28    none      -      -
  4655178      884    694    none     11     11
  4655873     1579     20    none      1      1
 #
  4850044       17   1492    none      -      -
  4851537     1510    711    none      1      1
  4852249     2222     42    none      1      1
 (output continues ...)


Headers containing the ID for each query sequence are listed after the '>' characters, and a
following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters
for each sequence are separated by a '#' character, and the six columns are exactly the same as the
gaps output (see the gaps section for more details).
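
The header and separator rules above can be turned into a small reader. This is a sketch under the layout just described; `parse_mgaps` and the shape of its return value are hypothetical, and only the first three columns of each row are kept.

```python
def parse_mgaps(lines):
    """Group mgaps output by query and strand.

    Returns {(query_id, is_reverse): [cluster, ...]}, where each cluster
    is a list of (ref_start, query_start, length) tuples.
    """
    result = {}
    clusters = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: query ID, optionally followed by the Reverse keyword
            parts = line[1:].split()
            is_reverse = len(parts) > 1 and parts[-1] == "Reverse"
            clusters = result.setdefault((parts[0], is_reverse), [[]])
        elif line == "#":
            clusters.append([])              # separator: start a new cluster
        else:
            if clusters is None:
                raise ValueError("data row found before any header line")
            fields = line.split()
            clusters[-1].append(tuple(int(x) for x in fields[:3]))
    # Drop clusters that never received a row (e.g. a header with no matches)
    return {key: [c for c in value if c] for key, value in result.items()}
```

A query with no matches on one strand, such as the forward strand of ID41 in the example, ends up with an empty cluster list.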


|
|

	</help>
</tool>
author	eric
date	Tue, 31 Mar 2015 14:19:49 +0200