comparison readme.md @ 3:fd620c523d63 draft

planemo upload for repository https://github.com/TGAC/earlham-galaxytools/tree/master/workflows/GeneSeqToFamily commit 588282bf5cbb9909ad9cd8c316ec33158c858727
author earlhaminst
date Thu, 18 May 2017 14:09:32 -0400
parents b92f5ab6d127
children 06470b2e491f
2:58c9294400ed 3:fd620c523d63
3 3
4 ## Introduction 4 ## Introduction
5 5
6 GeneSeqToFamily is an open-source Galaxy workflow based on the [Ensembl GeneTrees](http://www.ensembl.org/info/genome/compara/homology_method.html) pipeline. The Ensembl GeneTrees pipeline [1] infers the evolutionary history of gene families, represented as gene trees. It is a computational pipeline that comprises clustering, multiple sequence alignment, and tree generation (using [TreeBeST](http://treesoft.sourceforge.net/treebest.shtml)), to discover familial relationships. 6 GeneSeqToFamily is an open-source Galaxy workflow based on the [Ensembl GeneTrees](http://www.ensembl.org/info/genome/compara/homology_method.html) pipeline. The Ensembl GeneTrees pipeline [1] infers the evolutionary history of gene families, represented as gene trees. It is a computational pipeline that comprises clustering, multiple sequence alignment, and tree generation (using [TreeBeST](http://treesoft.sourceforge.net/treebest.shtml)), to discover familial relationships.
7 7
8 ## Installation
9
10 To use this workflow, please [install](https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/) the required tools (listed below) into Galaxy from the Galaxy ToolShed. Also [install and import](https://galaxyproject.org/toolshed/workflow-sharing/#finding-workflows-in-toolshed-repositories) the workflow from the Galaxy ToolShed.
11
12 ### List of required tools
13 The 3 workflows in this repository require Galaxy tools from the following ToolShed repositories:
14
15 * [emboss_5](https://toolshed.g2.bx.psu.edu/view/devteam/emboss_5/)
16 * [ncbi_blast_plus](https://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus/)
17 * [blast_parser](https://toolshed.g2.bx.psu.edu/view/earlhaminst/blast_parser/)
18 * [hcluster_sg](https://toolshed.g2.bx.psu.edu/view/earlhaminst/hcluster_sg/)
19 * [hcluster_sg_parser](https://toolshed.g2.bx.psu.edu/view/earlhaminst/hcluster_sg_parser/)
20 * [t_coffee](https://toolshed.g2.bx.psu.edu/view/earlhaminst/t_coffee/)
21 * [filter_by_fasta_ids](https://toolshed.g2.bx.psu.edu/view/galaxyp/filter_by_fasta_ids/)
22 * [treebest_best](https://toolshed.g2.bx.psu.edu/view/earlhaminst/treebest_best)
23 * [gafa](https://toolshed.g2.bx.psu.edu/view/earlhaminst/gafa/)
24 * [fasta_to_tabular](https://toolshed.g2.bx.psu.edu/view/devteam/fasta_to_tabular/)
25 * [text_processing](https://toolshed.g2.bx.psu.edu/view/bgruening/text_processing/)
26 * [uniprot_rest_interface](https://toolshed.g2.bx.psu.edu/view/bgruening/uniprot_rest_interface/)
27 * [suite_ensembl_rest](https://toolshed.g2.bx.psu.edu/view/earlhaminst/suite_ensembl_rest/)
28
29 Helper tools for data preparation:
30
31 * [ensembl_longest_cds_per_gene](https://toolshed.g2.bx.psu.edu/view/earlhaminst/ensembl_longest_cds_per_gene/)
32 * [ete](https://toolshed.g2.bx.psu.edu/view/earlhaminst/ete/)
33 * [gstf_preparation](https://toolshed.g2.bx.psu.edu/view/earlhaminst/gstf_preparation/) - to convert gene feature files from GFF3 and/or JSON format to SQLite and format CDS sequence headers
34
35
8 ## Workflow inputs and steps 36 ## Workflow inputs and steps
9 37
10 ### Inputs 38 ### Inputs
11 GeneSeqToFamily requires the following inputs: 39 GeneSeqToFamily requires the following inputs:
12 40
13 * The Coding Sequence (CDS) 41 * the coding sequences (CDS) in FASTA format (these can be produced with the GeneSeqToFamily preparation tool)
14 * a species tree 42 * gene feature information in SQLite format (this can be produced with the GeneSeqToFamily preparation tool)
15 * gene feature information in JSON format 43 * a species tree in Newick format (this can be generated with the ete tool in Galaxy)
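
The revised README expects the three inputs in FASTA, SQLite and Newick format. As a rough illustration, the sketch below shows one way to sanity-check files locally before uploading them to Galaxy. It is not part of GeneSeqToFamily: the function names are hypothetical, and the checks are shallow heuristics (first-line marker, file header, trailing semicolon), not full format validation.

```python
from pathlib import Path

# The first 16 bytes of every SQLite 3 database file.
SQLITE_MAGIC = b"SQLite format 3\x00"


def looks_like_fasta(path):
    """Return True if the first non-empty line starts with '>'."""
    with open(path, "r") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # empty file


def looks_like_sqlite(path):
    """Return True if the file begins with the SQLite 3 header string."""
    with open(path, "rb") as handle:
        return handle.read(16) == SQLITE_MAGIC


def looks_like_newick(path):
    """Heuristic: text ends with ';' and has balanced parentheses.

    Quoted labels containing parentheses would confuse this check.
    """
    text = Path(path).read_text().strip()
    return text.endswith(";") and text.count("(") == text.count(")")
```

For example, `looks_like_newick("species_tree.nwk")` (a hypothetical file name) would accept a file containing `((human,mouse),zebrafish);`. Whether a file then works in the workflow still depends on its content, such as the sequence header format and the database schema produced by the preparation tool.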
16 44
17 ### Steps 45 ### Steps
18 46
19 The pipeline is made up of 7 main steps: 47 The pipeline is made up of 7 main steps:
20 48
27 7. Visualise the GAFA dataset using Aequatus 55 7. Visualise the GAFA dataset using Aequatus
28 56
29 57
30 ### Helper tools: 58 ### Helper tools:
31 59
32 We have developed various tools to help with data preparation for the workflow. This includes tools for retrieving sequences, and features from Ensembl using its REST API, and tools to parse Ensembl results into the required formats for the workflow. We also developed a tool to merge gene feature files and convert them from GFF3 (Gene Feature File) to JSON format, which is then used to generate the Aequatus dataset. 60 We have developed various tools to help with data preparation for the workflow. This includes tools for retrieving sequences, and features from Ensembl using its REST API, and tools to parse Ensembl results into the required formats for the workflow. We also developed a tool to merge gene feature files and convert them from GFF3 (Gene Feature File) and/or JSON format to SQLite, which is then used to generate the Aequatus dataset.
33 61
34 62
35 ## Results 63 ## Results
36 64
37 The resulting gene families can be visualised using the [Aequatus.js](https://github.com/TGAC/aequatus.js) interactive tool, which is developed as part of the [Aequatus software](https://github.com/TGAC/aequatus) [2]. 65 The resulting gene families can be visualised using the [Aequatus.js](https://github.com/TGAC/aequatus.js) interactive tool, which is developed as part of the [Aequatus software](https://github.com/TGAC/aequatus) [2].
38 66
39 The Aequatus.js plugin provides an interactive visual representation of the phylogenetic and structural relationships among the homologous genes, using a shared colour scheme for coding regions to represent homology in internal gene structure alongside their corresponding gene trees. It is also able to indicate insertions and deletions in homologous genes with respect to shared ancestors. 67 The Aequatus.js plugin provides an interactive visual representation of the phylogenetic and structural relationships among the homologous genes, using a shared colour scheme for coding regions to represent homology in internal gene structure alongside their corresponding gene trees. It is also able to indicate insertions and deletions in homologous genes with respect to shared ancestors.
40 68
41 ## List of tools
42 GeneSeqToFamily requires the following tools to run the workflow successfully:
43 69
44 * Transeq
45 * Filter by FASTA IDs
46 * BLAST
47 * BLAST parser
48 * hcluster_sg
49 * hcluster_sg parser
50 * T-Coffee
51 * Tranalign
52 * TreeBeST
53 * Gene Alignment and Family Aggregator (GAFA)
54
55 Some tools for data conversion during workflow:
56
57 * cut
58
59 Helper tools for data preparation:
60
61 * Ensembl REST API - tools for retrieving sequences, and features from Ensembl using its [REST API](http://rest.ensembl.org/)
62 * gff3-to-json - to merge gene feature files and convert them from GFF3 (Gene Feature File) to JSON format
63 70
64 71
65 ## References 72 ## References
66 73
67 1. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E (2009) [EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates.](http://genome.cshlp.org/content/19/2/327) *Genome Res.* 19(2):327–335, doi: 10.1101/gr.073585.107 74 1. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E (2009) [EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates.](http://genome.cshlp.org/content/19/2/327) *Genome Res.* 19(2):327–335, doi: 10.1101/gr.073585.107
68 2. Thanki AS, Ayling S, Herrero J, Davey RP (2016) [Aequatus: An open-source homology browser.](http://biorxiv.org/content/early/2016/06/01/055632) *bioRxiv*, doi: 10.1101/055632 75 2. Thanki AS, Ayling S, Herrero J, Davey RP (2016) [Aequatus: An open-source homology browser.](http://biorxiv.org/content/early/2016/06/01/055632) *bioRxiv*, doi: 10.1101/055632
69 76
77 ## Pre-print
78
79 The pre-print for this work is available on [bioRxiv](http://biorxiv.org/content/early/2017/04/19/096529).
80
70 ## Project contacts: 81 ## Project contacts:
71 82
72 * Anil Thanki <Anil.Thanki@earlham.ac.uk> 83 * Anil Thanki <Anil.Thanki@earlham.ac.uk>
73 * Nicola Soranzo <Nicola.Soranzo@earlham.ac.uk> 84 * Nicola Soranzo <Nicola.Soranzo@earlham.ac.uk>
74 * Robert Davey <Robert.Davey@earlham.ac.uk> 85 * Robert Davey <Robert.Davey@earlham.ac.uk>
75 86
76 Copyright &copy; 2016 Earlham Institute, Norwich, UK 87 Copyright &copy; 2016-2017 Earlham Institute, Norwich, UK