This package is a Galaxy workflow for gene prediction using Glimmer3.

It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of
genes to generate gene predictions on a new genome, and then calls EMBOSS
(Rice et al. 2000) to translate the predictions into a FASTA file of
predicted protein sequences. The workflow requires two input files:

* Nucleotide FASTA file of known gene sequences (training set)
* Nucleotide FASTA file of genome sequence or assembled contigs

First, an interpolated context model (ICM) is built from the set of known
genes, preferably from the closest related organism(s) available. Next, this
ICM is used to predict genes in the genomic FASTA file. This produces
a FASTA file of the predicted gene nucleotide sequences, which is translated
into protein sequences using the EMBOSS tool transeq.
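
For readers who want to see roughly what the workflow steps correspond to
outside Galaxy, the following is a minimal sketch that only assembles the
command lines. It assumes the Glimmer3 programs ``build-icm``, ``glimmer3``
and ``multi-extract`` and the EMBOSS program ``transeq`` are on the ``PATH``.
The function names, the ``tag`` value and the output file names are
hypothetical, and the Galaxy wrappers may use different programs or options.

```python
import subprocess


def glimmer_pipeline_commands(training_fasta, genome_fasta, tag="run1"):
    """Describe each step as an (argv, stdin_path, stdout_path) tuple.

    Nothing is executed here; the tuples only describe what would be run.
    All output file names are illustrative assumptions.
    """
    icm = f"{tag}.icm"
    predict = f"{tag}.predict"  # glimmer3 writes its predictions to <tag>.predict
    genes = f"{tag}.genes.fasta"
    proteins = f"{tag}.proteins.fasta"
    return [
        # 1. Build the ICM; build-icm reads the training FASTA from stdin.
        (["build-icm", icm], training_fasta, None),
        # 2. Predict genes on the genome or contigs using that ICM.
        (["glimmer3", genome_fasta, icm, tag], None, None),
        # 3. Extract the predicted gene sequences (FASTA on stdout).
        (["multi-extract", genome_fasta, predict], None, genes),
        # 4. Translate the predicted genes into protein sequences.
        (["transeq", "-sequence", genes, "-outseq", proteins], None, None),
    ]


def run_steps(steps):
    """Run each described step in order, stopping at the first failure."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle is not None:
                    handle.close()
```

Calling ``run_steps(glimmer_pipeline_commands("training.ffn", "genome.fasta"))``
would then run the four steps in order; both file names here are placeholders.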

Glimmer is intended for finding genes in microbial DNA, especially bacteria,
archaea, and viruses.

See http://www.galaxyproject.org for information about the Galaxy Project.


Sample Data
===========

As an example, we will use the first public assembly of the 2011 Shiga-toxin
producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the
open-source crowd-sourcing analysis described in Rohde et al. (2011) and here:
https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki

You can upload this assembly directly into Galaxy using the "Upload File" tool
with either of these URLs. Galaxy should recognise it as a FASTA file with
3,057 sequences:

* http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
* https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt

This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled
by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the
MIRA 3.2 assembler. It was initially released via his blog,
http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/

We will also need a training set of known *E. coli* genes, for example from
the model strain *Escherichia coli* str. K-12 substr. MG1655, which is well
annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the
gene nucleotide sequences directly into Galaxy via this URL, which Galaxy
should recognise as a FASTA file with 4,321 sequences:

* ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn

Then run the workflow, which should produce 2,333 predicted genes for the
TY2482 assembly (two FASTA files, nucleotide and protein sequences).
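
If you have local copies of these files, the sequence counts quoted above can
be checked independently of Galaxy. The following is a small sketch that
counts FASTA records by their ``>`` header lines. The function name is
hypothetical, and the file names in the usage note assume the files were
saved under those names.

```python
def count_fasta_records(lines):
    """Count FASTA records in an iterable of lines (for example an open file).

    Each record starts with a header line beginning with '>'.
    """
    return sum(1 for line in lines if line.startswith(">"))
```

For example, ``with open("TY2482.fasta.txt") as handle:
print(count_fasta_records(handle))`` reports the number of contigs in the
assembly, and the same call on ``NC_000913.ffn`` or on the workflow output
reports the number of training or predicted genes.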


Citation
========

If you use this workflow directly, or a derivative of it, or the associated
Glimmer wrappers for Galaxy, in work leading to a scientific publication,
please cite:

Cock, P.J.A., Grüning, B., Paszkiewicz, K. and Pritchard, L. (2013)
Galaxy tools and workflows for sequence analysis with applications in
molecular plant pathology. (Submitted).

For Glimmer3 please cite:

Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007)
Identifying bacterial genes and endosymbiont DNA with Glimmer.
Bioinformatics 23(6), 673-679.
http://dx.doi.org/10.1093/bioinformatics/btm009

For EMBOSS please cite:

Rice, P., Longden, I. and Bleasby, A. (2000)
EMBOSS: The European Molecular Biology Open Software Suite
Trends in Genetics 16(6), 276-277.
http://dx.doi.org/10.1016/S0168-9525(00)02024-2


Additional References
=====================

Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011)
Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4.
New England Journal of Medicine 365, 718-724.
http://dx.doi.org/10.1056/NEJMoa1107643


Availability
============

This workflow is available on the main Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow

Development is being done on GitHub:

https://github.com/bgruening/galaxytools/workflows/glimmer3/


Dependencies
============

These dependencies should be resolved automatically via the Galaxy Tool Shed:

* http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3
* http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5