mummer_toolsuite: MUMmer/mummer_clustering.xml comparison

comparison MUMmer/mummer_clustering.xml @ 0:61f30d177448 default tip

initial commit on Mummer toolsuite on toolshed

author	eric
date	Tue, 31 Mar 2015 14:19:49 +0200
parents
children


--1:000000000000
+:61f30d177448
+<tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True">
+<description>: order sequence matches in clusters</description>
+<command>
+	<!-- update this path to the installed location -->
+		$tool.cmd
+		#if $tool.cmd=="gaps":
+			$tool.in_reference
+			#if $tool.gaps_r=="yes":
+				-r
+			#end if
+		#end if
+		#if $tool.cmd=="mgaps":
+			#if $tool.cmd_C=="yes":
+				-C
+			#end if
+			-d $tool.cmd_d
+			#if $tool.cmd_e=="yes":
+				-e
+			#end if
+			-f $tool.cmd_f
+			-l $tool.cmd_l
+			-s $tool.cmd_s
+		#end if
+		&lt; $tool.in_match_list
+		&gt; $out_tool
+</command>
+	<inputs>
+	  <conditional name="tool">
+		<param name="cmd" type="select" label="MUMmer clustering algorithm" help="Each algorithm starts from the default parameter values shown. For specific arguments see the help below." >
+			<option value="gaps" selected="true">gaps</option>
+			<option value="mgaps">mgaps</option>
+		</param>
+		<when value="gaps">
+			<param name="in_reference" type="data" format="fasta" label="Reference FastA file" />
+			<param name="gaps_r" type="select" label="Use reversed [-r]" >
+				<option value="no" selected="true">No</option>
+				<option value="yes">Yes</option>
+			</param>
+			<param name="in_match_list" type="data" format="txt" label="MUMmer match list" help="See help for more details" />
+		</when>
+		<when value="mgaps">
+			<param name="in_match_list" type="data" format="txt" label="MUMmer match list" help="See help for more details" />
+			<param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" >
+				<option value="no" selected="true">No</option>
+				<option value="yes">Yes</option>
+			</param>
+			<param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" />
+			<param name="cmd_e" type="select" label="Use extent of cluster [-e]" >
+				<option value="no" selected="true">No</option>
+				<option value="yes">Yes</option>
+			</param>
+			<param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" />
+			<param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" />
+			<param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" />
+		</when>
+	  </conditional>
+	</inputs>
+	<outputs>
+		<data name="out_tool" format="txt" label="Clustering output" />
+	</outputs>
+	<requirements>
+		<!-- <requirement type="set_environment" version="3.23">MUMMER_PATH</requirement> -->
+		<requirement type="package" version="4.6.4">gnuplot</requirement>
+		<requirement type="package" version="3.23">mummer</requirement>
+	</requirements>
+	<tests>
+		<test>
+		</test>
+	</tests>
+	<help>
+|
+
+**Reference**
+=============
+
+- **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands.
+- **MUMmer suite v3.22:** http://mummer.sourceforge.net
+- **MUMmer tutorials:** http://mummer.sourceforge.net/examples/
+
+If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve
+or modify the wrappers, please add yourself to the acknowledgement section rather than replacing existing names :)
+
+**MUMmer Clustering**
+=====================
+MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters
+in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment
+regions from a match list; however, when examining the data without graphic aids, it is very difficult
+to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches
+together into larger groups of neighboring matches makes this process much easier by ordering the
+data and removing spurious matches.
+
+Gaps
+----
+*gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering"
+step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm
+to extract the longest consistent set of matches between two sequences, and generates a single
+cluster that represents the best "straight-line" arrangement of matches between the sequences. By
+straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches
+between the two sequences. This limits the usability of this program to the alignment of genomes
+that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of
+near identical sequences with the goal of finding minor mutations like SNPs and small indels.
+
+Input can be filtered mummer output. The strange syntax (the match list is read from standard input)
+is the result of a legacy issue described in the Known problems section of the manual, and requires
+that the header be stripped from the mummer output. In addition, gaps is only designed to handle a
+single reference and a single query sequence, thus the preceding mummer run must also follow this
+constraint. The -r option is optional and designates the incoming matches as reverse complement
+matches, which must reference the reverse complement of the sequence, therefore forcing mummer to be
+run without the -c option.
+
+Reference: http://mummer.sourceforge.net/manual/#gaps
+
+**Output:**
+
+::
+
+    > /home/aphillip/data/GHP.1con  Consistent matches
+    183       17     22    none      -      -
+    238       72    108    none     33     33
+    347      181     92    none      1      1
+    458      292     50    none     19     19
+    705      539     44    none      1      1
+    750      584     38    none      1      1
+    807      641     23     -16      0      4
+    (output continues ...)
+    > Wrap around
+    334398   329917     47    none      -    225
+    334446   329965     62    none      1      1
+    334539   330058     20    none     31     31
+    334560   330079     92    none      1      1
+    334653   330172     77    none      1      1
+    334740   330259     41    none     10     10
+    (output continues ...)
+    > /home/aphillip/data/GHP.1con  Other matches
+    1317231     4891     21    none      -      -
+    1317275     4927     21    none      -      -
+    1317804     5399     25    none    508    451
+    947580     5436     36    none      -      -
+    23406     5518     34    none      -      -
+    333079     6592     32    none      -      -
+    (output continues ...)
+
+Where the first line is the location of the reference file, and the first three columns are the same
+as the three column match format described in the mummer section. The final three columns are the
+overlap between this match and the previous match, the gap between the start of this match and the
+end of the previous match in the reference, and the gap between the start of this match and the end
+of the previous match in the query respectively.
+
+mgaps
+-----
+*mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale
+rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable
+of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of
+command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only
+matches that were included in clusters will appear in the output, so by adjusting the command-line
+parameters it is possible to filter out many of the spurious matches, thus leaving only the larger
+areas of conservation between the input sequences. The major advantage of mgaps is its ability to
+identify these "islands" of conservation. This frees the user from the single-LIS constraint of the
+gaps program and allows for the identification of large-scale rearrangements, duplications, gene
+families and so on.
+Gaps can fail to identify clusters because they are not consistent with the LIS. By using
+mgaps, however, all regions of conservation can be identified. The only drawback is the increased
+complexity of the output: where you once had only one cluster for the whole comparison, you usually
+now get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from
+"correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*.
+Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or
+more query sequences, thus the preceding mummer run must also follow this constraint. Please refer
+to the run-mummer3 script (see online manual) for an example of how to use this program in an
+alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement
+matches must reference the reverse complement strand of the query sequence, therefore forcing mummer
+to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences
+and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but
+may eventually appear.
+The -d option can be interpreted as the number of insertions allowed between two matches in the same
+cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where
+a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained
+matches unless the -e option is used. The best way to get a feel for what each parameter controls
+is to cluster the same data set numerous times with different values and observe the resulting
+differences. It can also be helpful to set these parameters to the size of the element you wish to
+capture, e.g. set the minimum cluster size to the smallest exon you expect and set the max gap
+to the smallest intron you expect to obtain clusters that could represent single exons (depending,
+of course, on the similarity of the two sequences).
+Reference: http://mummer.sourceforge.net/manual/#mgaps
+
+**Output format**
+
+Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different
+header formatting than gaps to allow for multiple query sequences and multiple clusters. The output
+of mgaps run on both forward and reverse complement matches is as follows:
+
+::
+
+    > ID41
+    > ID41 Reverse
+    5177399        1    232    none      -      -
+    5177632      234   6794    none      1      1
+    5184433     7035     24    none      7      7
+    5184468     7069     23    none     11     10
+    > ID42
+    10181       43   1521    none      -      -
+    > ID42 Reverse
+    4654536       17     36    none      -      -
+    4654578       57    298    none      6      4
+    4654877      356    226    none      1      1
+    #
+    4655139      845     28    none      -      -
+    4655178      884    694    none     11     11
+    4655873     1579     20    none      1      1
+    #
+    4850044       17   1492    none      -      -
+    4851537     1510    711    none      1      1
+    4852249     2222     42    none      1      1
+    (output continues ...)
+
+Headers containing the ID for each query sequence are listed after the '>' characters, and a
+following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters
+for each sequence are separated by a '#' character, and the six columns are exactly the same as the
+gaps output (see the gaps section for more details).
+|
+|
+	</help>
+</tool>
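
Note on the mgaps output format: the help text in this file describes it as one '>' header per query sequence (with an optional Reverse keyword), a '#' line between clusters, and six columns per match. Below is a minimal Python sketch of reading that format. It is not part of this wrapper or of MUMmer. The names `Match`, `QueryBlock` and `parse_mgaps` are hypothetical, and the sketch assumes that `none` and `-` both mean "no value".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Match:
    ref_start: int             # column 1: match start in the reference
    query_start: int           # column 2: match start in the query
    length: int                # column 3: match length
    overlap: Optional[int]     # column 4: overlap with the previous match
    ref_gap: Optional[int]     # column 5: gap to the previous match (reference)
    query_gap: Optional[int]   # column 6: gap to the previous match (query)


@dataclass
class QueryBlock:
    query_id: str
    reverse: bool
    clusters: List[List[Match]] = field(default_factory=list)


def _to_int(token: str) -> Optional[int]:
    # Assumption: "none" and "-" both mean "no value for this column".
    return None if token in ("none", "-") else int(token)


def parse_mgaps(text: str) -> List[QueryBlock]:
    """Group mgaps output lines into per-query blocks of clusters."""
    blocks: List[QueryBlock] = []
    start_new_cluster = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: "> ID" or "> ID Reverse"
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line without a sequence ID")
            reverse = len(parts) > 1 and parts[-1] == "Reverse"
            blocks.append(QueryBlock(query_id=parts[0], reverse=reverse))
            start_new_cluster = True
        elif line == "#":
            # Cluster separator within the current query block
            start_new_cluster = True
        else:
            cols = line.split()
            if not blocks or len(cols) != 6:
                raise ValueError(f"unexpected line: {raw!r}")
            if start_new_cluster:
                blocks[-1].clusters.append([])
                start_new_cluster = False
            ref_start, query_start, length = (int(c) for c in cols[:3])
            overlap, ref_gap, query_gap = (_to_int(c) for c in cols[3:])
            blocks[-1].clusters[-1].append(
                Match(ref_start, query_start, length, overlap, ref_gap, query_gap)
            )
    return blocks


# A shortened excerpt of the example output shown in the help text.
SAMPLE = """\
> ID41
> ID41 Reverse
5177399        1    232    none      -      -
5177632      234   6794    none      1      1
> ID42
10181       43   1521    none      -      -
> ID42 Reverse
4654536       17     36    none      -      -
4654578       57    298    none      6      4
#
4655139      845     28    none      -      -
"""

for block in parse_mgaps(SAMPLE):
    strand = "reverse" if block.reverse else "forward"
    print(block.query_id, strand, [len(c) for c in block.clusters])
```

Each query block keeps its clusters as separate lists, so cluster sizes or total matched length can be computed per strand. A trailing "(output continues ...)" line, as in the help text, is not valid input and would raise `ValueError`.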
