view rnabob.xml @ 0:cd00b4fe6552 draft

Imported from capsule None
author rnateam
date Mon, 22 Dec 2014 09:08:31 -0500
parents
children 5a4b00c84f50

<tool id="rbc_rnabob" name="RNABOB" version="2.2.1.0">
    <description>Fast Pattern searching for RNA secondary structures</description>
    <requirements>
        <requirement type="package" version="2.2.1">rnabob</requirement>
    </requirements>
    <version_command>echo "2.2.1"</version_command>
    <command>
<![CDATA[
    rnabob 
    	-q
    	$fancy
    	$compStrands
    	$skipOverlapping
    	$descriptorFile
    	$sequenceFile > $stdout
]]>
    </command>
    <stdio>
        <exit_code range="1:" level="fatal" description="Error occurred. Please check Tool Standard Error" />
        <exit_code range=":-1" level="fatal" description="Error occurred. Please check Tool Standard Error" />
    </stdio>
    <inputs>
        <param name="descriptorFile" type="data" format="txt" multiple="false" label="Motif Descriptor File" help="This file contains the description of the motif for which to search"/>
        <param name="sequenceFile" type="data" format="fasta" multiple="false" label="Sequence File" help="This file specifies the sequence in which the motif will be searched"/>
        <param name="compStrands" type="boolean" truevalue="-c" falsevalue="" checked="false" label="Also search on complementary strands" help="-c : Search both strands of the supplied sequence"/>
        <param name="skipOverlapping" type="boolean" truevalue="-s" falsevalue="" checked="false" label="Skip overlapping matches" help="-s : Workaround for a problem in the DNABANK; overlapping matches will be ignored"/>
        <param name="fancy" type="boolean" checked="false" truevalue="-F" falsevalue="" label="Show Alignments" help="-F : Display full alignments to the pattern"/>
    </inputs>
    <outputs>
        <data format="txt" name="stdout" label="${tool.name} on ${on_string}" />
    </outputs>
    <tests>
        <test>
            <param name="descriptorFile" value="r17.des" />
            <param name="sequenceFile" value="F22B7.fa" />
            <param name="compStrands" value="True" />
            <param name="skipOverlapping" value="False" />
            <param name="fancy" value="False" />
            <output name="stdout" file="r17.bob" />
        </test>
        <test>
            <param name="descriptorFile" value="trna.des" />
            <param name="sequenceFile" value="F22B7.fa" />
            <param name="compStrands" value="True" />
            <param name="skipOverlapping" value="False" />
            <param name="fancy" value="False" />
            <output name="stdout" file="trna.bob" />
        </test>
    </tests>
    <help>
**What RNABOB does**

RNABOB allows searching a sequence database for RNA structural motifs.
The probe motif is specified in a *descriptor* file,
which describes its primary sequence, secondary structure, and tertiary constraints.
The source in its original packaging can be found at http://selab.janelia.org/software/#rnabob.

-----

**Sequence database format**

This wrapper of RNABOB currently reads sequence files in FASTA format only.
The command-line version of RNABOB can also read sequence files in GCG, EMBL, GenBank and other formats.

-----

**Descriptor file syntax**

The descriptor file syntax is fairly powerful, and allows a great deal of freedom for specifying 
RNA motifs. The syntax is therefore a bit complicated.

The descriptor file has two parts: a **topology** description and an **explicit** description.

The first non-blank, non-comment line of the file is the topology description. It defines the 
order of occurrence of a series of single-stranded, double-stranded and related elements. Each 
element must be given a unique name (typically a number) and must be prefixed with '**s**', 
'**h**', or '**r**', indicating a single-stranded, helical, or relational element. Helical and 
relational elements are paired with partner elements, which carry the same name suffixed by a prime, **\'**.

For example::

	\
			h1 s1 h1'

describes a hairpin loop structure with a simple helix and single-stranded loop. If the helix 
always contained a non-canonical base pair at one position, the topology could be described as::

	\
			h1 r1 h2 s1 h2' r1' h1'

where r1 and r1' indicate a correlation in which the sequence of r1 constrains the sequence of r1'. 
(Helices are a special case of this.)
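
The pairing rule above can be checked mechanically. The following Python sketch is for illustration 
only and is not part of RNABOB; the function name ``check_topology`` is hypothetical, and RNABOB's own 
parser may apply additional or different rules. It checks that element names are unique and that 
every **h** or **r** element is closed by exactly one primed partner later on the line.

```python
def check_topology(line):
    """Return the topology tokens if the line is consistent, else raise ValueError.

    Hypothetical helper for illustration; not RNABOB's actual parser.
    """
    tokens = line.split()
    seen = set()
    open_pairs = set()  # h/r elements still waiting for their primed partner
    for tok in tokens:
        if tok[0] not in "shr":
            raise ValueError("element %r must start with s, h or r" % tok)
        if tok.endswith("'"):
            name = tok[:-1]
            if name not in open_pairs:
                raise ValueError("%s has no unpaired partner before it" % tok)
            open_pairs.remove(name)
            continue
        if tok in seen:
            raise ValueError("element name %s is used twice" % tok)
        seen.add(tok)
        if tok[0] in "hr":
            open_pairs.add(tok)
    if open_pairs:
        raise ValueError("unpaired elements: %s" % ", ".join(sorted(open_pairs)))
    return tokens


print(check_topology("h1 r1 h2 s1 h2' r1' h1'"))
```

The sketch deliberately does not require pairs to be nested, so crossing pairs such as 
``h1 h2 h1' h2'`` (a pseudoknot) pass the check.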

The remaining non-comment, non-blank lines are explicit descriptions of each element in turn. Each 
line contains 3 or 4 fields, separated by tabs or blank space. The first field is the name of the 
element, from the topology description. The second field is the number of mismatches allowed in 
this element. The third field is the primary sequence constraint to apply to this element.

Helices and relational element pairs are specified on a single line rather than two. Mismatches 
and primary sequence constraints are given as pairs, separated by a colon '**:**'. The left side 
is the constraint applied to the upstream element, and the right side is applied to the downstream 
element.
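
As a rough illustration of the field layout just described, the Python sketch below splits one 
explicit-description line into its parts. It is not RNABOB code: the function name 
``parse_element_line`` and the dictionary keys are hypothetical, and only the rules stated in this 
help text are applied (3 or 4 whitespace-separated fields, colon-separated pairs for **h** and **r** 
elements, ``#`` comment lines).

```python
def parse_element_line(line):
    """Split one explicit-description line into a dictionary of fields.

    Hypothetical helper for illustration only. Returns None for blank
    lines and for comment lines starting with '#'.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = stripped.split()
    if len(fields) not in (3, 4):
        raise ValueError("expected 3 or 4 fields, got %d" % len(fields))
    name, mismatches, constraint = fields[:3]
    if name[0] in "hr":
        # Paired elements carry upstream:downstream values on a single line.
        left_mm, right_mm = mismatches.split(":")
        left_seq, right_seq = constraint.split(":")
        record = {
            "name": name,
            "mismatches": (int(left_mm), int(right_mm)),
            "constraint": (left_seq, right_seq),
        }
    else:
        record = {
            "name": name,
            "mismatches": int(mismatches),
            "constraint": constraint,
        }
    # The optional fourth field is the matrix given for relational elements.
    record["matrix"] = fields[3] if len(fields) == 4 else None
    return record


print(parse_element_line("h1 0:2 NNNNNNN:NNNNNNN"))
print(parse_element_line("s1 0 UUCG"))
```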

The primary sequence constraint is given as a sequence of nucleotides. Any IUPAC single-letter 
code is recognized, including N if the position can have any base identity. Allowed length 
variations are specified with asterisks ``'*'``, where each ``*`` will allow either 0 or 1 N at 
that position.

For example::

	\
			GGAGG******NNNAUG

specifies a GGAGG Shine/Dalgarno site and an AUG initiation codon, separated by a spacer of 3 to 9 
nucleotides of any sequence.

An alternative syntax can be used for very long gaps::

	\
			GGAGG[10]NNNAUG is the same as GGAGG**********NNNAUG
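
The two gap notations above can be related with a few lines of Python. This is only a sketch of the 
arithmetic, not RNABOB's implementation; the helper names ``expand_gaps`` and ``length_range`` are 
hypothetical. It rewrites ``[n]`` as ``n`` asterisks and then counts how short and how long a 
matching sequence can be, given that each ``*`` stands for 0 or 1 nucleotide.

```python
import re


def expand_gaps(constraint):
    """Rewrite every [n] gap as n asterisks, e.g. 'A[3]C' becomes 'A***C'."""
    return re.sub(r"\[(\d+)\]", lambda m: "*" * int(m.group(1)), constraint)


def length_range(constraint):
    """Return (shortest, longest) sequence length the constraint can match."""
    expanded = expand_gaps(constraint)
    optional = expanded.count("*")        # each * is 0 or 1 nucleotide
    required = len(expanded) - optional   # every other symbol is one nucleotide
    return required, required + optional


print(length_range("GGAGG******NNNAUG"))
print(expand_gaps("GGAGG[10]NNNAUG"))
```

For ``GGAGG******NNNAUG`` the 11 fixed positions plus up to 6 optional ones give a total length of 
11 to 17, which corresponds to the spacer of 3 to 9 nucleotides between ``GGAGG`` and ``AUG``.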

Be careful when defining variable-length helices and relational elements; if the number and type (gap 
or identity) of positions do not match on the left and right sides, the program will refuse to accept 
the descriptor.

Relational elements have an additional field which specifies a "transformation matrix" of four 
nucleotides, specifying the rule for making the ``r'`` pattern from the ``r`` sequence in order 
``A-C-G-T``. For example, the transformation matrix for a simple helix is ``TGCA``; if you allow 
``G-U`` pairs, it is ``TGYR``. RNABOB allows ``G-U`` pairing by default and uses the ``TGYR`` 
matrix for helical elements.
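
To make the matrix rule concrete, here is a small Python sketch. It is not taken from RNABOB: the 
names ``partner_pattern`` and ``count_mispairs`` are hypothetical, the input is treated as DNA-style 
``A``, ``C``, ``G``, ``T`` (with ``U`` read as ``T``), and it assumes the partner pattern is read in 
reverse order because the two strands pair antiparallel. That reversal is an assumption of this 
sketch and is not stated in the text above.

```python
# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def partner_pattern(seq, matrix="TGYR"):
    """Build the IUPAC pattern the downstream partner is expected to match.

    The matrix lists the partner code for A, C, G, T in that order.
    Assumption: the result is reversed (antiparallel pairing).
    """
    rule = dict(zip("ACGT", matrix.upper()))
    mapped = [rule[base] for base in seq.upper().replace("U", "T")]
    return "".join(reversed(mapped))


def count_mispairs(seq, partner, matrix="TGYR"):
    """Count partner positions that violate the pattern derived from seq."""
    pattern = partner_pattern(seq, matrix)
    partner = partner.upper().replace("U", "T")
    return sum(1 for code, base in zip(pattern, partner)
               if base not in IUPAC[code])


print(partner_pattern("GGC", "TGCA"))
print(partner_pattern("GGC", "TGYR"))
print(count_mispairs("GGC", "GUU"))
```

With ``TGCA`` the pattern for ``GGC`` is its plain reverse complement ``GCC``; with ``TGYR`` it 
becomes ``GYY``, so a partner of ``GUU`` (two ``G-U`` pairs) still counts as fully paired.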

For example, the explicit description of our hairpin might be:

::

	\
	 		h1 0:0 NNN:NNN
	 		r1 0:0 R:N GNAN
	 		h2 0:0 **NC:GN**
		 	s1 0 UUCG

This describes a stem of 6 to 8 base pairs, in which the 4th pair from the bottom of the stem must 
be a non-canonical GA pair. Note that, in general, the left side of the primary constraint for 
helices and relational elements is redundant, and should be given as all N. In some cases it is 
convenient to constrain the right side to require a particular base pair (GU, for instance) at one 
position.

A note on mismatches: The split format for helices and relational elements works like this. The 
number on the left constrains the primary sequence match of the left side of the primary 
constraint. The number on the right constrains the match of the right side of the primary 
constraint, *after* that side has been constructed according to the sequence on the left. In other 
words, the number on the left constrains the mismatches in primary sequence only, while the number 
on the right will constrain the number of mispaired positions in the helix.

Finally: any line that begins with a pound sign '#' is a comment line, and will not be interpreted 
by the pattern compiler.

**Options**

The behavior of RNABOB can be modified by use of the following options:

*Complement*: Selecting this option will cause RNABOB to also search for the pattern on the 
complementary strand.

*Skip*: This is a workaround to avoid a problem in the DNABANK. There are some sequences in the 
database which have long stretches of ambiguous sequence (N's). Descriptors with no primary 
sequence constraints will match these garbage sequences at many, many positions, and generate huge 
outputs. This option toggles a search strategy that skips forward a pattern-length rather than a 
single base when a match is found, thus printing out only a single match when overlapping matches 
are found. 

**Examples**

The following example descriptors are included in the source distribution 
(http://selab.janelia.org/software/rnabob/rnabob.tar.gz):

	- trna.des - a general descriptor of a tRNA structure
	- r17.des - descriptor of the consensus binding site for the r17 phage coat protein
	- pseudoknot.des - description of a simple pseudoknotted structure

An example cosmid ``F22B7.fa`` from the *C. elegans* genome sequencing project is also provided 
as a search target for these descriptors.

::

	\
		# trna.des
		#
		# Generalized descriptor of a tRNA cloverleaf. Doesn't
		# find them all though. 
		#

		h1 s1 h2 s2 h2' s3 h3 s4 h3' s5 h4 s6 h4' h1' s8

		h1 0:2 NNNNNNN:NNNNNNN
		h2 0:1 *NNN:NNN*
		h3 0:1 NNNNN:NNNNN
		h4 0:1 NNNNN:NNNNN
		s1 0 TN
		s2 0 NNNN**********
		s3 0 N
		s4 0 NNNNNN*
		s5 0 NN********************
		s6 0 TTC****
		s8 0 NCCA

Running RNABOB with ``trna.des`` against ``F22B7.fa`` searches the cosmid for the above motif. 
With the complementary strand option enabled, ``trna.des`` hits twice, once on each strand. 
(F22B7 has several other tRNA genes that the pattern fails to detect; this is *not* a pattern to use for tRNA gene finding!)
    </help> 
    <citations>
	<citation type="doi">10.1093/bioinformatics/6.4.325</citation>
	<citation type="bibtex">@UNPUBLISHED{rnabob,
author = {Eddy, S. R.},
title = {RNABOB: a program to search for RNA secondary structure motifs in sequence databases},
note = {}}</citation>
    </citations>
</tool>