view sm_tophat2_toolshed.xml @ 2:f50a064ebd1c draft

Uploaded
author sarahinraauzeville
date Thu, 11 Feb 2016 08:45:37 -0500
parents 038c61725cfb

<!--# Copyright (C) 2013 INRA
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see http://www.gnu.org/licenses/.
#-->
<tool id="sm_tophat2" name="Tophat 2 for Illumina">
    <description>Find splice junctions using RNA-seq data</description>
          <command interpreter="perl">sm_tophat2.pl $lib $input_read1 $input_read2 $reference_source.reference_source_selector
          #if $reference_source.reference_source_selector == "cached":
              $reference_source.ref_file_cached.fields.path
          #end if
          #if $reference_source.reference_source_selector == "history":
              $reference_source.ref_file
          #end if
          $p $r $max_intron $output_bam $output_bed $output_unmapped_bam $zip $gtf_cond.gtf
          #if $gtf_cond.gtf == "T":
              $gtf_cond.input_gtf
          #end if
          </command>
          <version_command>echo tophat2 version : ; tophat2 --version</version_command>
                  <inputs>
                        <param format="fastq, fastqsanger, fastqillumina" name="input_read1" type="data" label="Your RNA-Seq FASTQ file (read 1)"/>
                        <param format="fastq, fastqsanger, fastqillumina" name="input_read2" type="data" label="Your RNA-Seq FASTQ file (read 2)"/>
                        
                        <conditional name="reference_source">
						  <param name="reference_source_selector" type="select" label="Load reference genome from">
							<option value="cached">Local cache</option>
							<option value="history">History</option>
						  </param>
						  <when value="cached">
							   <param name="ref_file_cached" type="select" label="Using reference genome" help="Select genome from the list">
						          <options from_data_table="tophat_ind">
						            <filter type="sort_by" column="2" />
						            <validator type="no_options" message="No indexes are available" />
						          </options>
						          <validator type="no_options" message="A built-in reference genome is not available for the build associated with the selected input file"/>
						        </param>
						        							  						  
						  </when>
						  <when value="history"> 
							<param name="ref_file" type="data" format="fasta" label="Use the following dataset as the reference sequence" help="You can upload a FASTA sequence to the history and use it as reference" />
						  </when>
						</conditional>                                               
						
                        <param name="p" size="20" type="text" value="16" label="Number of threads used to align reads"/>
                        <param name="max_intron" size="20" type="text" value="5000" label="Maximum intron length"/>
                        <param name="r" size="20" type="text" value="200" label="Expected (mean) inner distance between mate pairs"/>
			            <param name="zip" type="select" display="checkboxes" multiple="True" label="Your RNA-Seq FASTQ files are zipped" help="Please check this option if your files are zipped.">
			                    <option value="YES">Yes</option>
			            </param>
			
			            <conditional name="gtf_cond">
                                   <param name="gtf" type="select" help="Do you have a GTF file available?" label="GTF file available">
                                       <option value="T">Yes</option>
                                       <option value="F" selected="true">No</option>
                                   </param> 
                             <when value="F" />
                             <when value="T">
                                <param format="gtf, gff" name="input_gtf" type="data" label="Your GTF file"/>   
                             </when>
                        </conditional>
                        
                         <param name="lib" type="select" label="Library type">
                                       <option value="fr-unstranded">fr-unstranded</option>
                                       <option value="fr-firststrand">fr-firststrand</option>
                                       <option value="fr-secondstrand">fr-secondstrand</option>
                         </param> 
                                   
                  </inputs>
                  <outputs>
                     <data format="bam" name="output_bam" label="${input_read1.name}-Tophat_mapped.bam"/>
                     <data format="bed" name="output_bed" label="${input_read1.name}-Tophat.bed"/>
                     <data format="bam" name="output_unmapped_bam" label="${input_read1.name}-Tophat_unmapped.bam"/>
                  </outputs>
  <help>
.. class:: infomark

What it does : TopHat 2 is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie 2. TopHat runs on Linux and OS X.


*What types of reads can I use TopHat 2 with?*

TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. In TopHat 1.1.0, we began supporting Applied Biosystems' Colorspace format. The software is optimized for reads 75bp or longer.

Mixing paired- and single-end reads together is not supported.



*How does TopHat 2 find junctions?*

TopHat can find splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping information, TopHat builds a database of possible splice junctions and then maps the reads against these junctions to confirm them.

Short-read sequencing machines can currently produce reads of 100bp or longer, but many exons are shorter than this and would be missed in the initial mapping. TopHat solves this problem mainly by splitting all input reads into smaller segments, which are then mapped independently. The segment alignments are put back together in a final step of the program to produce the end-to-end read alignments.

TopHat generates its database of possible splice junctions from two sources of evidence. The first and strongest source of evidence for a splice junction is when two segments from the same read (for reads of at least 45bp) are mapped at a certain distance on the same genomic sequence, or when an internal segment fails to map, again suggesting that such reads span multiple exons. With this approach, "GT-AG", "GC-AG" and "AT-AC" introns will be found ab initio. The second source is pairings of "coverage islands", which are distinct regions of piled-up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. We only suggest using this second option (--coverage-search) for short reads (shorter than 45bp) and with a small number of reads (10 million or fewer). This latter option will only report alignments across "GT-AG" introns.


Command line : Please see "information", then "stdout".


Parameters :

-o/--output-dir string 

Sets the name of the directory in which TopHat will write all of its output. The default is "./tophat_out".


-r/--mate-inner-dist int

This is the expected (mean) inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. The default is 50bp.
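
As a worked version of the arithmetic above, the inner distance is the fragment size minus the length of both reads. The variable names below are illustrative only, and the values are the ones from the example in the text, not tool defaults.

```shell
# Illustrative values taken from the example above (not tool defaults).
fragment_size=300   # selected fragment size, in bp
read_length=50      # length of each mate, in bp

# Inner distance = fragment size minus the two read lengths.
inner_dist=$((fragment_size - 2 * read_length))
echo "$inner_dist"   # prints 200
```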


-I/--max-intron-length int

The maximum intron length. When searching for junctions ab initio, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. The default is 500000. 


-p/--num-threads int

Use this many threads to align reads. The default is 1. 


--library-type string

The library type. Supported values: fr-unstranded (the default), fr-firststrand, fr-secondstrand.
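
For orientation, the parameters above map onto a TopHat 2 command line roughly as sketched here. This is only a sketch under assumptions: the index base name and the read file names are hypothetical placeholders, and the exact command run by the wrapper script may differ (see "Command line" above).

```shell
# Hypothetical inputs: replace with your own Bowtie 2 index base name and FASTQ files.
index_base="genome_index"
read1="reads_1.fastq"
read2="reads_2.fastq"

# Option values mirror the defaults of this form; they are not TopHat's own defaults.
threads=16
inner_dist=200
max_intron=5000
library_type="fr-unstranded"

# Build the command line and print it instead of running it,
# so the sketch also works where tophat2 is not installed.
cmd="tophat2 -o ./tophat_out -p $threads -r $inner_dist -I $max_intron --library-type $library_type $index_base $read1 $read2"
echo "$cmd"
```

When a GTF annotation is supplied, TopHat's -G/--GTF option would be added to such a command line.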



----
    
Version Galaxy Tool : V2.0

Versions of bioinformatics tools used : Tophat 2

----

Contacts (names and emails) : sigenae-support@listes.inra.fr

E-learning available : Yes.

Please cite :

    Depending on the help provided you can cite us in acknowledgements, references or both.
    
    Examples :
    Acknowledgements
    We wish to thank the SIGENAE group for ....
    
    References
    X. SIGENAE [http://www.sigenae.org/]
  </help>

</tool>