view readme.rst @ 6:9d5515db5920 draft default tip
Uploaded
author | bgruening |
---|---|
date | Fri, 23 Aug 2013 02:54:15 -0400 |
parents | ad01b12e0a0c |
children |
This package is a Galaxy workflow for gene prediction using Glimmer3.

It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of genes
to generate gene predictions on a new genome, and then calls EMBOSS
(Rice et al. 2000) to translate the predictions into a FASTA file of predicted
protein sequences.

The workflow requires two input files:

* Nucleotide FASTA file of known gene sequences (training set)
* Nucleotide FASTA file of genome sequence or assembled contigs

First an interpolated context model (ICM) is built from the set of known genes,
preferably from the closest relative organism(s) available. Next this ICM is
used to predict genes on the genomic FASTA file. This produces a FASTA file of
the predicted gene nucleotide sequences, which is translated into protein
sequences using the EMBOSS tool transeq.

Glimmer is intended for finding genes in microbial DNA, especially bacteria,
archaea, and viruses.

See http://www.galaxyproject.org for information about the Galaxy Project.


Sample Data
===========

As an example, we will use the first public assembly of the 2011 Shiga-toxin
producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the
open-source crowd-sourcing analysis described in Rohde et al. (2011) and here:
https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki

You can upload this assembly directly into Galaxy using the "Upload File" tool
with either of these URLs. Galaxy should recognise this as a FASTA file with
3,057 sequences:

* http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
* https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt

This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled by
Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the MIRA
3.2 assembler.
It was initially released via his blog,
http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/

We will also need a training set of known *E. coli* genes, for example from the
model strain *Escherichia coli* str. K-12 substr. MG1655, which is well
annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the gene
nucleotide sequences directly into Galaxy via this URL, which Galaxy should
recognise as a FASTA file with 4,321 sequences:

* ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn

Then run the workflow, which should produce 2,333 predicted genes for the
TY2482 assembly (two FASTA files, nucleotide and protein sequences).


Citation
========

If you use this workflow directly, or a derivative of it, or the associated
Glimmer wrappers for Galaxy, in work leading to a scientific publication,
please cite:

Cock, P.J.A., Grüning, B., Paszkiewicz, K. and Pritchard, L. (2013)
Galaxy tools and workflows for sequence analysis with applications
in molecular plant pathology. (Submitted).

For Glimmer3 please cite:

Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007)
Identifying bacterial genes and endosymbiont DNA with Glimmer.
Bioinformatics 23(6), 673-679.
http://dx.doi.org/10.1093/bioinformatics/btm009

For EMBOSS please cite:

Rice, P., Longden, I. and Bleasby, A. (2000)
EMBOSS: The European Molecular Biology Open Software Suite.
Trends in Genetics 16(6), 276-277.
http://dx.doi.org/10.1016/S0168-9525(00)02024-2


Additional References
=====================

Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011)
Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4.
New England Journal of Medicine 365, 718-724.
http://dx.doi.org/10.1056/NEJMoa1107643


Availability
============

This workflow is available on the main Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow

Development is being done on GitHub:

https://github.com/bgruening/galaxytools/workflows/glimmer3/


Dependencies
============

These dependencies should be resolved automatically via the Galaxy Tool Shed:

* http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3
* http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5
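

Checking Sequence Counts
========================

The Sample Data section quotes a sequence count for each FASTA file (3,057
contigs, 4,321 training genes, 2,333 predicted genes). If you have downloaded
a file locally and want to check its count outside Galaxy, a minimal sketch
like the following may help. It is not part of the workflow; the function name
``count_fasta_records`` and the example sequences are made up for
illustration, and it assumes a plain-text FASTA file in which every record
starts with a ``>`` header line.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory example with two made-up records (not real data).
example = io.StringIO(">gene1\nATGAAATAA\n>gene2\nATGCCC\nGGGTAA\n")
print(count_fasta_records(example))  # prints 2
```

For a real file, open it in text mode and pass the handle, for example
``with open("TY2482.fasta.txt") as handle: print(count_fasta_records(handle))``.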