comparison readme.rst @ 6:9d5515db5920 draft default tip
Uploaded
author | bgruening |
---|---|
date | Fri, 23 Aug 2013 02:54:15 -0400 |
parents | ad01b12e0a0c |
children |
5:2405efd751a0 | 6:9d5515db5920 |
---|---|
1 ============================== | 1 This package is a Galaxy workflow for gene prediction using Glimmer3. |
2 Glimmer3 gene calling workflow | |
3 ============================== | |
4 | 2 |
5 This Tool Shed Repository contains a workflow for the gene prediction of from a given nucleotide FASTA file. | 3 It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of |
4 genes to generate gene predictions on a new genome, and then calls EMBOSS | |
5 (Rice et al. 2000) to translate the predictions into a FASTA file of | |
6 predicted protein sequences. The workflow requires two input files: | |
6 | 7 |
7 At first an interpolated context model (ICM) is build from a know set of genes, preferable from the closest relative available organism(s). In a following step this ICM model is used to predict genes on the second input. The output is a FASTA file with nucleotide sequences that is further converted to proteins sequences. | 8 * Nucleotide FASTA file of known gene sequences (training set) |
9 * Nucleotide FASTA file of genome sequence or assembled contigs | |
8 | 10 |
9 To run that worflow glimmer_ und the EMBOSS_ suite is required. Both can be installed from the Tool Shed. | 11 First an interpolated context model (ICM) is built from the set of known |
12 genes, preferably from the closest relative organism(s) available. Next this | |
13 ICM model is used to predict genes on the genomic FASTA file. This produces | |
14 a FASTA file of the predicted gene nucleotide sequences, which is translated | |
15 into protein sequences using the EMBOSS tool transeq. | |
10 | 16 |
11 .. _glimmer: http://www.cbcb.umd.edu/software/glimmer/ | 17 Glimmer is intended for finding genes in microbial DNA, especially bacteria, |
12 .. _EMBOSS: http://emboss.sourceforge.net/ | 18 archaea, and viruses. |
13 | 19 |
14 | A. L. Delcher, K.A. Bratke, E.C. Powers, and S.L. Salzberg. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics (Advance online version) (2007). | 20 See http://www.galaxyproject.org for information about the Galaxy Project. |
15 | 21 |
16 EMBOSS: The European Molecular Biology Open Software Suite (2000) | |
17 Rice,P. Longden,I. and Bleasby,A. | |
18 Trends in Genetics 16, (6) pp276--277 | |
19 | 22 |
20 ************ | 23 Sample Data |
24 =========== | |
25 | |
26 As an example, we will use the first public assembly of the 2011 Shiga-toxin | |
27 producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the | |
28 open-source crowd-sourcing analysis described in Rohde et al. (2011) and here: | |
29 https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki | |
30 | |
31 You can upload this assembly directly into Galaxy using the "Upload File" tool | |
32 with either of these URLs - Galaxy should recognise this is a FASTA file with | |
33 3,057 sequences: | |
34 | |
35 * http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt | |
36 * https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt | |
37 | |
38 This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled | |
39 by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the | |
40 MIRA 3.2 assembler. It was initially released via his blog, | |
41 http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/ | |
42 | |
43 We will also need a training set of known *E. coli* genes, for example the | |
44 model strain *Escherichia coli* str. K-12 substr. MG1655 which is well | |
45 annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the | |
46 gene nucleotide sequences directly into Galaxy via this URL, which Galaxy | |
47 should recognise as a FASTA file with 4,321 sequences: | |
48 | |
49 * ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn | |
50 | |
51 Then run the workflow, which should produce 2,333 predicted genes for the | |
52 TY2482 assembly (two FASTA files, nucleotide and protein sequences). | |
53 | |
54 | |
55 Citation | |
56 ======== | |
57 | |
58 If you use this workflow directly, or a derivative of it, or the associated | |
59 Glimmer wrappers for Galaxy, in work leading to a scientific publication, | |
60 please cite: | |
61 | |
62 Cock, P.J.A., Grüning, B., Paszkiewicz, K. and Pritchard, L. (2013) | |
63 Galaxy tools and workflows for sequence analysis with applications in | |
64 molecular plant pathology. (Submitted). | |
65 | |
66 For Glimmer3 please cite: | |
67 | |
68 Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007) | |
69 Identifying bacterial genes and endosymbiont DNA with Glimmer. | |
70 Bioinformatics 23(6), 673-679. | |
71 http://dx.doi.org/10.1093/bioinformatics/btm009 | |
72 | |
73 For EMBOSS please cite: | |
74 | |
75 Rice, P., Longden, I. and Bleasby, A. (2000) | |
76 EMBOSS: The European Molecular Biology Open Software Suite | |
77 Trends in Genetics 16(6), 276-277. | |
78 http://dx.doi.org/10.1016/S0168-9525(00)02024-2 | |
79 | |
80 | |
81 Additional References | |
82 ===================== | |
83 | |
84 Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011) | |
85 Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4. | |
86 New England Journal of Medicine 365, 718-724. | |
87 http://dx.doi.org/10.1056/NEJMoa1107643 | |
88 | |
89 | |
21 Availability | 90 Availability |
22 ************ | 91 ============ |
23 | 92 |
24 This workflow is available on the main Galaxy Tool Shed: | 93 This workflow is available on the main Galaxy Tool Shed: |
94 | |
25 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow | 95 http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow |
26 | 96 |
27 Development is being done on github: | 97 Development is being done on github: |
98 | |
28 https://github.com/bgruening/galaxytools/workflows/glimmer3/ | 99 https://github.com/bgruening/galaxytools/workflows/glimmer3/ |
100 | |
101 | |
102 Dependencies | |
103 ============ | |
104 | |
105 These dependencies should be resolved automatically via the Galaxy Tool Shed: | |
106 | |
107 * http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3 | |
108 * http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5 |
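
The Sample Data section of the new README states how many sequences Galaxy should report for each uploaded FASTA file (3,057 for ``TY2482.fasta.txt`` and 4,321 for ``NC_000913.ffn``). As a rough sanity check outside Galaxy, the same count can be approximated by counting header lines. The sketch below is not part of the workflow; the function name ``count_fasta_records`` is hypothetical, and it assumes a plain-text FASTA file in which every record starts with a ``>`` line.

```python
import io


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'.

    `handle` is any iterable of text lines, e.g. an open file.
    This helper is illustrative only and is not part of the workflow.
    """
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory example with two records.
example = io.StringIO(">contig1\nACGT\nACGT\n>contig2\nGGCC\n")
print(count_fasta_records(example))  # prints 2
```

For a downloaded file, something like ``with open("TY2482.fasta.txt") as handle: print(count_fasta_records(handle))`` should give a number comparable to the count Galaxy shows after upload, assuming the file was saved locally under that name.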