diff readme.rst @ 6:9d5515db5920 draft default tip
Uploaded
author | bgruening |
---|---|
date | Fri, 23 Aug 2013 02:54:15 -0400 |
parents | ad01b12e0a0c |
children |
--- a/readme.rst Mon Aug 12 12:56:19 2013 -0400
+++ b/readme.rst Fri Aug 23 02:54:15 2013 -0400
@@ -1,28 +1,108 @@
-==============================
-Glimmer3 gene calling workflow
-==============================
+This package is a Galaxy workflow for gene prediction using Glimmer3.
+
+It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of
+genes to generate gene predictions on a new genome, and then calls EMBOSS
+(Rice et al. 2000) to translate the predictions into a FASTA file of
+predicted protein sequences. The workflow requires two input files:
+
+* Nucleotide FASTA file of know gene sequences (training set)
+* Nucleotide FASTA file of genome sequence or assembled contigs
+
+First an interpolated context model (ICM) is built from the set of known
+genes, preferably from the closest relative organism(s) available. Next this
+ICM model is used to predict genes on the genomic FASTA file. This produces
+a FASTA file of the predicted gene nucleotide sequences, which is translated
+into protein sequences using the EMBOSS tool transeq.
+
+Glimmer is intended for finding genes in microbial DNA, especially bacteria,
+archaea, and viruses.
+
+See http://www.galaxyproject.org for information about the Galaxy Project.
 
-This Tool Shed Repository contains a workflow for the gene prediction of from a given nucleotide FASTA file.
+
+Sample Data
+===========
+
+As an example, we will use the first public assembly of the 2011 Shiga-toxin
+producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the
+open-source crowd-sourcing analysis described in Rohde et al. (2011) and here:
+https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki
 
-At first an interpolated context model (ICM) is build from a know set of genes, preferable from the closest relative available organism(s). In a following step this ICM model is used to predict genes on the second input. The output is a FASTA file with nucleotide sequences that is further converted to proteins sequences.
+You can upload this assembly directly into Galaxy using the "Upload File" tool
+with either of these URLs - Galaxy should recognise this is a FASTA file with
+3,057 sequences:
 
-To run that worflow glimmer_ und the EMBOSS_ suite is required. Both can be installed from the Tool Shed.
+* http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
+* https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt
+
+This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled
+by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the
+MIRA 3.2 assembler. It was initially released via his blog,
+http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/
 
-.. _glimmer: http://www.cbcb.umd.edu/software/glimmer/
-.. _EMBOSS: http://emboss.sourceforge.net/
+We will also need a training set of known *E. coli* genes, for example the
+model strain *Escherichia coli* str. K-12 substr. MG1655 which is well
+annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the
+gene nucleotide sequences directly into Galaxy via this URL, which Galaxy
+should recognise as a FASTA file with 4,321 sequences:
+
+* ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn
+
+Then run the workflow, which should produce 2,333 predicted genes for the
+TY2482 assembly (two FASTA files, nucleotide and protein sequences).
+
 
-| A. L. Delcher, K.A. Bratke, E.C. Powers, and S.L. Salzberg. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (Advance online version) (2007).
+Citation
+========
+
+If you use this workflow directly, or a derivative of it, or the associated
+Glimmer wrappers for Galaxy, in work leading to a scientific publication,
+please cite:
+
+Cock, P.J.A., Grüning, B., Paszkiewicz, K. and Pritchard, L. (2013)
+Galaxy tools and workflows for sequence analysis with applications in
+molecular plant pathology. (Submitted).
+
+For Glimmer3 please cite:
 
-EMBOSS: The European Molecular Biology Open Software Suite (2000)
-Rice,P. Longden,I. and Bleasby,A.
-Trends in Genetics 16, (6) pp276--277
+Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007)
+Identifying bacterial genes and endosymbiont DNA with Glimmer.
+Bioinformatics 23(6), 673-679.
+http://dx.doi.org/10.1093/bioinformatics/btm009
+
+For EMBOSS please cite:
+
+Rice, P., Longden, I. and Bleasby, A. (2000)
+EMBOSS: The European Molecular Biology Open Software Suite
+Trends in Genetics 16(6), 276-277.
+http://dx.doi.org/10.1016/S0168-9525(00)02024-2
 
-************
+
+Additional References
+=====================
+
+Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011)
+Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4.
+New England Journal of Medicine 365, 718-724.
+http://dx.doi.org/10.1056/NEJMoa1107643
+
+
 Availability
-************
+============
 
 This workflow is available on the main Galaxy Tool Shed:
+
 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow
 
 Development is being done on github:
+
 https://github.com/bgruening/galaxytools/workflows/glimmer3/
+
+
+Dependencies
+============
+
+These dependencies should be resolved automatically via the Galaxy Tool Shed:
+
+* http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3
+* http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5
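
Note on the sequence counts quoted in the new readme: the figures (3,057 contigs in the assembly, 4,321 training genes, 2,333 predicted genes) can be checked outside Galaxy by counting FASTA records in local copies of the files. The following is a minimal sketch only, not part of the workflow or of Galaxy. The function names `read_fasta` and `count_fasta_records` are hypothetical, and it assumes plain-text FASTA in which every record starts with a `>` header line.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper for illustration; assumes each record begins
    with a '>' header line and sequence lines follow it.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                # a new header closes the previous record
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # emit the final record
        yield header, "".join(chunks)


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count the records in FASTA-formatted lines."""
    return sum(1 for _ in read_fasta(lines))


if __name__ == "__main__":
    # Tiny in-memory example with two made-up records.
    demo = io.StringIO(">gene1\nATGAAA\nTAA\n>gene2\nATGTGA\n")
    print(count_fasta_records(demo))
```

To check a downloaded file, pass an open file handle instead of the in-memory example, for instance `count_fasta_records(open("TY2482.fasta.txt"))`. If the local copy matches the file described in the readme, the result should agree with the count stated there.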