comparison readme.md @ 0:b92f5ab6d127 draft

planemo upload for repository https://github.com/TGAC/earlham-galaxytools/tree/master/workflows/GeneSeqToFamily commit 9c148d5ba49d9de5f7b1c1f4b32cd8b5e56cb975-dirty
author earlhaminst
date Mon, 09 Jan 2017 15:06:04 -0500
parents
children fd620c523d63
comparison
-1:000000000000 0:b92f5ab6d127
1 # GeneSeqToFamily: the Ensembl GeneTrees pipeline as a Galaxy workflow
2
3
4 ## Introduction
5
6 GeneSeqToFamily is an open-source Galaxy workflow based on the [Ensembl GeneTrees](http://www.ensembl.org/info/genome/compara/homology_method.html) pipeline. The Ensembl GeneTrees pipeline [1] infers the evolutionary history of gene families, represented as gene trees. It is a computational pipeline comprising clustering, multiple sequence alignment, and tree generation (using [TreeBeST](http://treesoft.sourceforge.net/treebest.shtml)) to discover familial relationships.
7
8 ## Workflow inputs and steps
9
10 ### Inputs
11 GeneSeqToFamily requires the following inputs:
12
13 * Coding sequences (CDS)
14 * A species tree
15 * Gene feature information in JSON format
16
17 ### Steps
18
19 The pipeline is made up of 7 main steps:
20
21 1. Translation of CDS to protein sequences
22 2. All-vs-all BLASTP of protein sequences
23 3. Clustering of protein sequences using [hcluster_sg](https://github.com/douglasgscofield/hcluster) and BLASTP scores
24 4. Multiple sequence alignment (MSA) for each cluster using [T-Coffee](http://www.tcoffee.org/Projects/tcoffee/)
25 5. Generation of gene trees from MSAs using [TreeBeST](http://treesoft.sourceforge.net/treebest.shtml)
26 6. Creation of an SQLite database from the MSAs, gene trees and gene feature information using Gene Alignment and Family Aggregator (GAFA)
27 7. Visualisation of the GAFA dataset using Aequatus
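
As a rough illustration of step 1, the sketch below translates a CDS into a protein sequence using the standard genetic code. The workflow itself uses Transeq for this step; the function name `translate_cds` and the example sequence are hypothetical, and the sketch simply maps any codon it does not recognise to `X`.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence in frame 1; unknown codons become 'X'."""
    seq = cds.upper().replace("U", "T")
    # Ignore trailing bases that do not form a complete codon.
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    # Hypothetical 9-base CDS: start codon, one alanine codon, stop codon.
    print(translate_cds("ATGGCCTAA"))
```

Stop codons are emitted as `*`, so a complete CDS normally yields a protein string ending in `*`.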
28
29
30 ### Helper tools
31
32 We have developed various tools to help with data preparation for the workflow. These include tools for retrieving sequences and features from Ensembl using its REST API, and tools to parse Ensembl results into the formats required by the workflow. We also developed a tool to merge gene feature files and convert them from GFF3 (General Feature Format version 3) to JSON format, which is then used to generate the Aequatus dataset.
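
To give a feel for what retrieving a sequence from the Ensembl REST API involves, here is a minimal sketch using only the Python standard library. It is not the code of the Galaxy helper tools; the function names `sequence_url` and `fetch_fasta` are hypothetical, and the sketch assumes the public `sequence/id` endpoint with a `type` query parameter and a FASTA content type.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public Ensembl REST server (assumed here; a mirror could be substituted).
ENSEMBL_REST = "https://rest.ensembl.org"


def sequence_url(stable_id: str, seq_type: str = "cds") -> str:
    """Build the URL for the sequence/id endpoint for one stable ID."""
    query = urlencode({"type": seq_type})
    return f"{ENSEMBL_REST}/sequence/id/{stable_id}?{query}"


def fetch_fasta(stable_id: str, seq_type: str = "cds", timeout: float = 30.0) -> str:
    """Download one sequence as FASTA text (requires network access)."""
    request = Request(
        sequence_url(stable_id, seq_type),
        headers={"Content-Type": "text/x-fasta"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_fasta("ENST00000288602")` would request the CDS of that example transcript; any transcript stable ID can be used in its place.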
33
34
35 ## Results
36
37 The resulting gene families can be visualised using the [Aequatus.js](https://github.com/TGAC/aequatus.js) interactive tool, which is developed as part of the [Aequatus software](https://github.com/TGAC/aequatus) [2].
38
39 The Aequatus.js plugin provides an interactive visual representation of the phylogenetic and structural relationships among the homologous genes, using a shared colour scheme for coding regions to represent homology in internal gene structure alongside their corresponding gene trees. It is also able to indicate insertions and deletions in homologous genes with respect to shared ancestors.
40
41 ## List of tools
42 GeneSeqToFamily requires the following tools to run the workflow successfully:
43
44 * Transeq
45 * Filter by FASTA IDs
46 * BLAST
47 * BLAST parser
48 * hcluster_sg
49 * hcluster_sg parser
50 * T-Coffee
51 * Tranalign
52 * TreeBeST
53 * Gene Alignment and Family Aggregator (GAFA)
54
55 Tools for data conversion within the workflow:
56
57 * cut
58
59 Helper tools for data preparation:
60
61 * Ensembl REST API - tools for retrieving sequences and features from Ensembl using its [REST API](http://rest.ensembl.org/)
62 * gff3-to-json - to merge gene feature files and convert them from GFF3 (General Feature Format version 3) to JSON format
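
The sketch below shows the general idea of converting GFF3 feature lines to JSON. It is only an illustration: the JSON layout produced by the actual gff3-to-json tool is not described in this README, so the record structure and the function names `parse_gff3_line` and `gff3_to_json` are assumptions.

```python
import json
from urllib.parse import unquote

# The first eight tab-separated GFF3 columns; the ninth holds the attributes.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_gff3_line(line: str) -> dict:
    """Parse one GFF3 feature line into a dictionary (illustrative layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are semicolon-separated key=value pairs, URL-escaped.
    attributes = {}
    for pair in fields[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


def gff3_to_json(text: str) -> str:
    """Convert GFF3 text to a JSON array, skipping comments and blank lines."""
    records = [
        parse_gff3_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    return json.dumps(records, indent=2)
```

Merging several feature files, as the tool does, could then be a matter of concatenating the record lists before serialising them.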
63
64
65 ## References
66
67 1. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E (2009) [EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates.](http://genome.cshlp.org/content/19/2/327) *Genome Res.* 19(2):327–335, doi: 10.1101/gr.073585.107
68 2. Thanki AS, Ayling S, Herrero J, Davey RP (2016) [Aequatus: An open-source homology browser.](http://biorxiv.org/content/early/2016/06/01/055632) *bioRxiv*, doi: 10.1101/055632
69
70 ## Project contacts
71
72 * Anil Thanki <Anil.Thanki@earlham.ac.uk>
73 * Nicola Soranzo <Nicola.Soranzo@earlham.ac.uk>
74 * Robert Davey <Robert.Davey@earlham.ac.uk>
75
76 Copyright &copy; 2016 Earlham Institute, Norwich, UK