<?xml version='1.0' encoding='utf-8'?>
<tool id="gecco" name="GECCO" version="0.8.4" python_template_version="3.5">
    <description>GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
    <requirements>
        <requirement type="package" version="0.8.4">gecco</requirement>
    </requirements>
    <version_command>gecco --version</version_command>
    <command detect_errors="aggressive"><![CDATA[

        ## Map Galaxy's 'genbank' datatype extension to the conventional 'gbk' suffix.
        #if str($input.ext) == 'genbank':
            #set $file_extension = 'gbk'
        #else:
            #set $file_extension = $input.ext
        #end if
        ## Link the dataset under a fixed name so that the outputs get the 'input_tempfile' prefix.
        ln -s '$input' input_tempfile.$file_extension &&

        gecco -vv run -g input_tempfile.$file_extension &&
        mv input_tempfile.features.tsv '$features' &&
        mv input_tempfile.clusters.tsv '$clusters'

    ]]></command>
    <inputs>
        <param name="input" type="data" format="genbank,fasta" label="Sequence file in GenBank or FASTA format"/>
    </inputs>
    <outputs>
        <collection name="records" type="list" label="${tool.name} detected Biosynthetic Gene Clusters on ${on_string} (GenBank)">
            <discover_datasets pattern="(?P&lt;designation&gt;.*)\.gbk" ext="genbank" visible="false" />
        </collection>
        <data name="features" format="tabular" label="${tool.name} summary of detected features on ${on_string} (TSV)"/>
        <data name="clusters" format="tabular" label="${tool.name} summary of detected BGCs on ${on_string} (TSV)"/>
    </outputs>
    <tests>
        <test>
            <param name="input" value="BGC0001866.fna"/>
            <output name="features" file="features.tsv"/>
            <output name="clusters" file="clusters.tsv"/>
            <output_collection name="records" type="list">
                <element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" lines_diff="2"/>
            </output_collection>
        </test>
    </tests>
    <help>
<![CDATA[

**Overview**

GECCO is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).
It is developed in the Zeller group and is part of the suite of computational microbiome analysis tools hosted at EMBL.

**Input**

GECCO works with DNA sequences and loads them using Biopython, which lets it support a large variety of formats, including the common FASTA and GenBank formats. This tool accepts FASTA and GenBank datasets.

**Output**

Once done, GECCO creates the following files, named with the same prefix as the input file:

- features.tsv: The features file, containing the proteins and domains identified in the input sequences.
- clusters.tsv: If any clusters were found, a clusters file containing the coordinates of the predicted clusters along with their putative biosynthetic type.
- {sequence}_cluster_{N}.gbk: If any BGCs were found, one GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.

**Contact**

If you have any questions about GECCO, run into an issue, or would like to make a feature request, please create an issue in the GitHub repository.
You can also contact Martin Larralde directly via email. If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to open a pull request on the GitHub repository.

]]>
    </help>
    <citations>
        <citation type="bibtex">
@article{Carroll2021.05.03.442509,
    author = {Carroll, Laura M. and Larralde, Martin and Fleck, Jonas Simon and Ponnudurai, Ruby and Milanese, Alessio and Cappio, Elisa and Zeller, Georg},
    title = {Accurate de novo identification of biosynthetic gene clusters with GECCO},
    elocation-id = {2021.05.03.442509},
    year = {2021},
    doi = {10.1101/2021.05.03.442509},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {Biosynthetic gene clusters (BGCs) are enticing targets for (meta)genomic mining efforts, as they may encode novel, specialized metabolites with potential uses in medicine and biotechnology. Here, we describe GECCO (GEne Cluster prediction with COnditional random fields; https://gecco.embl.de), a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs). Based on an extensive evaluation of de novo BGC prediction, we found GECCO to be more accurate and over 3x faster than a state-of-the-art deep learning approach. When applied to over 12,000 genomes, GECCO identified nearly twice as many BGCs compared to a rule-based approach, while achieving higher accuracy than other machine learning approaches. Introspection of the GECCO CRF revealed that its predictions rely on protein domains with both known and novel associations to secondary metabolism. The method developed here represents a scalable, interpretable machine learning approach, which can identify BGCs de novo with high precision. Competing Interest Statement: The authors have declared no competing interest.},
    URL = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509},
    eprint = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509.full.pdf},
    journal = {bioRxiv}
}
        </citation>
    </citations>
</tool>