changeset 3:359232b58f6a draft

"Update Galaxy tool wrapper to follow the IUC best practices"
author althonos
date Sun, 21 Nov 2021 19:47:22 +0000
parents e618ab1c78d9
children 88dc16b4f583
files CHANGELOG.md README.rst gecco.xml test-data/sideload.json
diffstat 4 files changed, 401 insertions(+), 38 deletions(-)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/CHANGELOG.md	Sun Nov 21 19:47:22 2021 +0000
@@ -0,0 +1,299 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
+and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.5...master
+
+## [v0.8.5] - 2021-11-21
+[v0.8.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.4...v0.8.5
+### Added
+- Minimal compatibility support for running GECCO inside of Galaxy workflows.
+
+## [v0.8.4] - 2021-09-26
+[v0.8.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3-post1...v0.8.4
+### Fixed
+- `gecco convert gbk --format bigslice` failing to run because of outdated code ([#5](https://github.com/zellerlab/GECCO/issues/5)).
+- `gecco convert gbk --format bigslice` not creating files with names conforming to BiG-SLiCE expected input.
+### Changed
+- Bump minimum `pyrodigal` version to `v0.6.2` to use platform-accelerated code if supported.
+
+## [v0.8.3-post1] - 2021-08-23
+[v0.8.3-post1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3...v0.8.3-post1
+### Fixed
+- Wrong default value for `--threshold` being shown in `gecco run` help message.
+
+## [v0.8.3] - 2021-08-23
+[v0.8.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.8.3
+### Changed
+- Default probability threshold for segmentation to 0.3 (from 0.4).
+
+## [v0.9.0] - 2021-08-10 - **YANKED**
+[v0.9.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.9.0
+### Changed
+- Retrain internal model using `--select=0.35` instead of `--select=0.25` like before.
+- Change default *p-value* filter from 1e-9 to 1e-5 to detect more features.
+
+## [v0.8.2] - 2021-07-31
+[v0.8.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.1...v0.8.2
+### Fixed
+- `gecco run` crashing on Python 3.6 because of missing `contextlib.nullcontext` class.
+### Changed
+- `gecco run` and `gecco annotate` will not try to count the number of profiles when given an external HMM file with the `--hmm` flag.
+- `PyHMMER.run` now reports the *p-value* of each domain in addition to the *e-value* as a `/note` qualifier.
+
+## [v0.8.1] - 2021-07-29
+[v0.8.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.0...v0.8.1
+### Changed
+- `gecco run` now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom `--model`.
+### Fixed
+- `gecco` reporting about using Pfam `v33.1` while actually using `v34.0` because of an outdated field in `gecco/hmmer/Pfam.ini`.
+### Added
+- Missing documentation for the `strand` attribute of `gecco.model.Gene`.
+
+## [v0.8.0] - 2021-07-03
+[v0.8.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...v0.8.0
+### Changed
+- Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
+- Bump minimum `pyhmmer` version to `v0.4.0` to improve exception handling.
+- Bump minimum `pyrodigal` version to `v0.5.0` to fix sequence decoding on some platforms.
+- Use p-values instead of e-values to filter domains obtained with HMMER.
+- `gecco cv` and `gecco train` now seed the RNG with a user-defined seed before shuffling rows of training data.
+### Fixed
+- Extraction of BGC compositions for the type predictor while training.
+- `ClusterCRF.trained` failing to open an external model.
+### Added
+- `Domain.pvalue` attribute to access the p-value of a domain annotation.
+- Mandatory `pvalue` column to `FeatureTable` objects.
+- Support for loading several feature tables in `gecco train` and `gecco cv`.
+- Warnings to `ClusterCRF.fit` when selecting uninformative features.
+- `--correction` flag to `gecco train` and `gecco cv`, allowing a multiple testing correction method to be given when computing p-values with Fisher's Exact Test.
+### Removed
+- Outdated `gecco embed` command.
+- Unused `--truncate` flag from the `gecco train` CLI.
+- Tigrfam domains, which were not improving performance on the new training data.
+
+## [v0.7.0] - 2021-05-31
+[v0.7.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...v0.7.0
+### Added
+- Support for writing an AntiSMASH sideload JSON file after a `gecco run` workflow.
+- Code for converting GenBank files into BiG-SLiCE compatible format with the `gecco convert` subcommand.
+- Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.
+### Changed
+- Minimum Biopython version to `v1.73` for compatibility with older bioinformatics tooling.
+- Internal domain composition shipped in the `gecco.types` with newer composition array obtained directly from MIBiG files.
+### Removed
+- Outdated notice about `-vvv` verbosity level in the help message of the main `gecco` command.
+
+## [v0.6.3] - 2021-05-10
+[v0.6.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.2...v0.6.3
+### Fixed
+- HMMER annotation not properly handling inputs with multiple contigs.
+- Some progress bar totals displaying as floats in the CLI.
+### Changed
+- `PyHMMER` now sets the `Z` and `domZ` values from the number of proteins given to the search pipeline.
+- `gecco.cli` delegates imports to make CLI more responsive.
+- `pkg_resources` has been replaced with `importlib.resources` and `importlib.metadata` where applicable.
+- `multiprocessing.cpu_count` has been replaced with `os.cpu_count` where applicable.
+
+## [v0.6.2] - 2021-05-04
+[v0.6.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.1...v0.6.2
+### Fixed
+- `gecco cv loto` crashing because of outdated code.
+### Changed
+- Logging-style prompt will only display if GECCO is running with `-vv` flag.
+### Added
+- GECCO bioRxiv paper reference to `Cluster.to_seq_record` output record.
+
+## [v0.6.1] - 2021-03-15
+[v0.6.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.0...v0.6.1
+### Fixed
+- Progress bar not being disabled by `-q` flag in CLI.
+- Fallback to using HMM name if accession is not available in `PyHMMER`.
+- Group genes by source contig and process them separately in `PyHMMER` to avoid bogus E-values.
+### Added
+- `psutil` dependency to get the number of physical CPU cores on the host machine.
+- Support for using an arbitrary mapping of positives to negatives in `gecco embed`.
+### Removed
+- Unused and outdated `HMMER` and `DomainRow` classes from `gecco.hmmer`.
+
+## [v0.6.0] - 2021-02-28
+[v0.6.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.5...v0.6.0
+### Changed
+- Updated internal model with a cleaned-up version of the MIBiG-2.0
+  Pfam-33.1/Tigrfam-15.0 embedding.
+- Updated internal InterPro catalog.
+### Fixed
+- Features not being grouped together in `gecco cv` and `gecco train`
+  when provided with a feature table where rows were not sorted by
+  protein IDs.
+
+## [v0.5.5] - 2021-02-28
+[v0.5.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.4...v0.5.5
+### Fixed
+- `gecco cv` bug causing only the last fold to be written.
+
+## [v0.5.4] - 2021-02-28
+[v0.5.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.3...v0.5.4
+### Changed
+- Replaced `verboselogs`, `coloredlogs` and `better-exceptions` with `rich`.
+### Removed
+- `tqdm` training dependency.
+### Added
+- `gecco annotate` command to produce a feature table from a genomic file.
+- `gecco embed` to embed BGCs into non-BGC regions using feature tables.
+
+## [v0.5.3] - 2021-02-21
+[v0.5.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.2...v0.5.3
+### Fixed
+- Coordinates of genes in output GenBank files.
+- Potential issue with the number of CPUs in `PyHMMER.run`.
+### Changed
+- Bump required `pyrodigal` version to `v0.4.2` to fix buffer overflow.
+
+## [v0.5.2] - 2021-01-29
+[v0.5.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...v0.5.2
+### Added
+- Support for downloading HMM files directly from GitHub releases assets.
+- Validation of filtered HMMs with MD5 checksum.
+### Fixed
+- Invalid coordinates of protein domains in GenBank output files.
+- `gecco.interpro` module not being added to wheel distribution.
+### Changed
+- Bump required `pyhmmer` version to `v0.2.1`.
+
+## [v0.5.1] - 2021-01-15
+[v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1
+### Fixed
+- `--hmm` flag being ignored in `gecco run` command.
+- `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs.
+
+## [v0.5.0] - 2021-01-11
+[v0.5.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0
+### Added
+- Explicit support for Python 3.9.
+### Changed
+- [`pyhmmer`](https://pypi.org/project/pyhmmer) is used to annotate protein sequences instead of HMMER3 binary `hmmsearch`.
+- HMM files are stored in binary format to speed up parsing and reduce storage size.
+- `tqdm` is now a *training*-only dependency.
+- `gecco cv` now requires *training* dependencies.
+
+## [v0.4.5] - 2020-11-23
+[v0.4.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.4...v0.4.5
+### Added
+- Additional `fold` column to cross-validation table output.
+### Changed
+- Use sequence ID instead of protein ID to extract type from cluster in `gecco cv`.
+- Install HMM data in pre-pressed format to make `hmmsearch` runs faster on short sequences.
+- `gecco.orf` was rewritten to extract genes from input sequences in parallel.
+
+## [v0.4.4] - 2020-09-30
+[v0.4.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.3...v0.4.4
+### Added
+- `gecco cv loto` command to run LOTO cross-validation using BGC types
+  for stratification.
+- `header` keyword argument to `FeatureTable.dump` and `ClusterTable.dump`
+  to write the table without the column header, so that it can be appended
+  to an existing table.
+- `__getitem__` implementation for `FeatureTable` and `ClusterTable`
+  that returns a single row or a sub-table from a table.
+### Fixed
+- `gecco cv` command now writes results iteratively instead of holding
+  the tables for every fold in memory.
+### Changed
+- Bumped `pandas` training dependency to `v1.0`.
+
+## [v0.4.3] - 2020-09-07
+[v0.4.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.2...v0.4.3
+### Fixed
+- GenBank files being written with invalid `/cds` feature type.
+### Changed
+- Blocked installation of Biopython `v1.78` or newer as it removes `Bio.Alphabet`
+  and breaks the current code.
+
+## [v0.4.2] - 2020-08-07
+[v0.4.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.1...v0.4.2
+### Fixed
+- `TypeClassifier.predict_types` using inverse type probabilities when
+  given several clusters to process.
+
+## [v0.4.1] - 2020-08-07
+[v0.4.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.0...v0.4.1
+### Fixed
+- `gecco run` command crashing on input sequences not containing any genes.
+
+## [v0.4.0] - 2020-08-06
+[v0.4.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.3.0...v0.4.0
+### Added
+- `gecco.model.ProductType` enum to model the biosynthetic class of a BGC.
+### Removed
+- `pandas` interaction from internal data model.
+- `ClusterCRF` code specific to cross-validation.
+### Changed
+- `pandas`, `fisher` and `statsmodels` dependencies are now optional.
+- `gecco train` command expects a cluster table in addition to the feature
+   table to know the types of the input BGCs.
+
+## [v0.3.0] - 2020-08-03
+[v0.3.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.2...v0.3.0
+### Changed
+- Replaced Nearest-Neighbours classifier with Random Forest to perform type
+  prediction for candidate BGCs.
+- `gecco.knn` module was renamed to implementation-agnostic name `gecco.types`.
+### Fixed
+- Extraction of domain composition taking a long time in `gecco train` command.
+### Removed
+- `--metric` argument to the `gecco run` CLI command.
+
+## [v0.2.2] - 2020-07-31
+[v0.2.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.1...v0.2.2
+### Changed
+- `Domain` and `Gene` can now carry qualifiers that are used when they
+  are translated to a sequence feature.
+### Added
+- InterPro names, accessions, and HMMER e-value for each annotated domain
+  in GenBank output files.
+
+## [v0.2.1] - 2020-07-23
+[v0.2.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.0...v0.2.1
+### Fixed
+- Various potential crashes in `ClusterRefiner` code.
+### Removed
+- Unneeded feature dictionary filtering in `ClusterCRF` for models with
+  Fisher Exact Test feature selection.
+
+## [v0.2.0] - 2020-07-23
+[v0.2.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.1...v0.2.0
+### Fixed
+- `pandas` warning about unsorted columns in `gecco run`.
+### Removed
+- `Gene.probability` property, replaced by `Gene.maximum_probability` and
+  `Gene.average_probability` properties to be explicit.
+### Changed
+- Internal model now uses `Pfam` and `Tigrfam` with the top 35% features
+  selected with Fisher's Exact Test.
+- `ClusterRefiner` now removes genes on `Cluster` edges if they do not
+  contain any domain annotation.
+
+## [v0.1.1] - 2020-07-22
+[v0.1.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.0...v0.1.1
+### Added
+- `ClusterCRF.predict_probabilities` to annotate a list of `Gene`.
+### Changed
+- BGC probability is now stored at the `Domain` level instead of at the `Gene`
+  level, independently of the feature extraction level used by the CRF.
+- `ClusterKNN` will use the model path provided to `gecco run` if any.
+### Docs
+- Added this changelog file to document changes in the code.
+- Added documentation to `gecco` submodules that lacked it.
+- Included the `CHANGELOG.md` file in the generated docs.
+
+## [v0.1.0] - 2020-07-17
+[v0.1.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.0.1...v0.1.0
+Initial release.
+
+## [v0.0.1] - 2018-08-13
+[v0.0.1]: https://git.embl.de/grp-zeller/GECCO/compare/37afb97...v0.0.1
+Proof-of-concept.
--- a/README.rst	Sun Nov 21 17:40:58 2021 +0000
+++ b/README.rst	Sun Nov 21 19:47:22 2021 +0000
@@ -14,7 +14,7 @@
 Fields (CRFs).
 
 |GitLabCI| |License| |Coverage| |Docs| |Source| |Mirror| |Changelog|
-|Issues| |Preprint| |PyPI| |Bioconda| |Versions| |Wheel|
+|Issues| |Preprint| |PyPI| |Bioconda| |Galaxy| |Versions| |Wheel|
 
 🔧 Installing GECCO
 -------------------
@@ -132,3 +132,5 @@
    :target: https://pypi.org/project/gecco-tool/#files
 .. |Wheel| image:: https://img.shields.io/pypi/wheel/gecco-tool?style=flat-square&maxAge=3600
    :target: https://pypi.org/project/gecco-tool/#files
+.. |Galaxy| image:: https://img.shields.io/badge/Galaxy-GECCO-darkblue?style=flat-square&maxAge=3600
+   :target: https://toolshed.g2.bx.psu.edu/repository?repository_id=c29bc911b3fc5f8c
--- a/gecco.xml	Sun Nov 21 17:40:58 2021 +0000
+++ b/gecco.xml	Sun Nov 21 19:47:22 2021 +0000
@@ -1,8 +1,8 @@
 <?xml version='1.0' encoding='utf-8'?>
-<tool id="gecco" name="GECCO" version="0.8.4" python_template_version="3.5">
-    <description>GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
+<tool id="gecco" name="GECCO" version="0.8.5" python_template_version="3.5">
+    <description>is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).</description>
     <requirements>
-        <requirement type="package" version="0.8.4">gecco</requirement>
+        <requirement type="package" version="0.8.5">gecco</requirement>
     </requirements>
     <version_command>gecco --version</version_command>
     <command detect_errors="aggressive"><![CDATA[
@@ -14,13 +14,37 @@
         #end if
         ln -s '$input' input_tempfile.$file_extension &&
 
-        gecco -vv run -g input_tempfile.$file_extension &&
-        mv input_tempfile.features.tsv $features &&
-        mv input_tempfile.clusters.tsv $clusters
+        gecco -vv run
+        --format $input.ext
+        --genome input_tempfile.$file_extension
+        --postproc $postproc
+        --force-clusters-tsv
+        #if $cds:
+            --cds $cds
+        #end if
+        #if $threshold:
+            --threshold $threshold
+        #end if
+        #if $antismash_sideload:
+            --antismash-sideload
+        #end if
+
+        && mv input_tempfile.features.tsv '$features'
+        && mv input_tempfile.clusters.tsv '$clusters'
+        #if $antismash_sideload
+        && mv input_tempfile.sideload.json '$sideload'
+        #end if
 
     ]]></command>
     <inputs>
-        <param name="input" type="data" format="genbank,fasta" label="Sequence file in GenBank or FASTA format"/>
+        <param name="input" type="data" format="genbank,fasta,embl" label="Sequence file in GenBank, EMBL or FASTA format"/>
+        <param argument="--cds" type="integer" min="0" value="" optional="true" label="Minimum number of genes required for a cluster"/>
+        <param argument="--threshold" type="float" min="0" max="1" value="" optional="true" label="Probability threshold for cluster detection"/>
+        <param argument="--postproc" type="select" label="Post-processing method for gene cluster validation">
+            <option value="antismash">antiSMASH</option>
+            <option value="gecco" selected="true">GECCO</option>
+        </param>
+        <param argument="--antismash-sideload" type="boolean" checked="false" label="Generate an antiSMASH v6 sideload JSON file"/>
     </inputs>
     <outputs>
         <collection name="records" type="list" label="${tool.name} detected Biosynthetic Gene Clusters on ${on_string} (GenBank)">
@@ -28,6 +52,9 @@
         </collection>
         <data name="features" format="tabular" label="${tool.name} summary of detected features on ${on_string} (TSV)"/>
         <data name="clusters" format="tabular" label="${tool.name} summary of detected BGCs on ${on_string} (TSV)"/>
+        <data name="sideload" format="json" label="antiSMASH v6 sideload file with ${tool.name} detected BGCs on ${on_string} (JSON)">
+            <filter>antismash_sideload</filter>
+        </data>
     </outputs>
     <tests>
         <test>
@@ -38,49 +65,48 @@
                 <element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" lines_diff="2"/>
             </output_collection>
         </test>
+        <test>
+            <param name="input" value="BGC0001866.fna"/>
+            <param name="antismash_sideload" value="True"/>
+            <output name="features" file="features.tsv"/>
+            <output name="clusters" file="clusters.tsv"/>
+            <output name="sideload" file="sideload.json"/>
+            <output_collection name="records" type="list">
+                <element name="BGC0001866.1_cluster_1" file="BGC0001866.1_cluster_1.gbk" ftype="genbank" lines_diff="2"/>
+            </output_collection>
+        </test>
     </tests>
-    <help>
-<![CDATA[
+    <help><![CDATA[
 
-**Overview**
+Overview
+--------
 
-GECCO is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).
+GECCO (Gene Cluster prediction with Conditional Random Fields) is a fast and scalable method for identifying putative novel Biosynthetic Gene Clusters (BGCs) in genomic and metagenomic data using Conditional Random Fields (CRFs).
 It is developed in the Zeller group and is part of the suite of computational microbiome analysis tools hosted at EMBL.
 
-**Input**
+Input
+-----
 
 GECCO works with DNA sequences, and loads them using Biopython, allowing it to support a large variety of formats, including the common FASTA and GenBank files.
 
-**Output**
+Output
+------
 
 GECCO will create the following files once done (using the same prefix as the input file):
 
-- features.tsv: The features file, containing the identified proteins and domains in the input sequences.
-- clusters.tsv: If any were found, a clusters file, containing the coordinates of the predicted clusters, along their putative biosynthetic type.
-- {sequence}_cluster_{N}.gbk: If any BGCs were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
-
-**Contact**
+- ``features.tsv``: The features file, containing the identified proteins and domains in the input sequences.
+- ``clusters.tsv``: If any were found, a clusters file, containing the coordinates of the predicted clusters, along with their putative biosynthetic type.
+- ``{sequence}_cluster_{N}.gbk``: If any BGCs were found, a GenBank file per cluster, containing the cluster sequence annotated with its member proteins and domains.
 
-If you have any question about GECCO, if you run into any issue, or if you would like to make a feature request, please create an issue in the GitHub repository. 
-You can also directly contact Martin Larralde via email. If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to 
-open a pull request on the GitHub repository.
+Contact
+-------
 
-]]>
-    </help>
+If you have any questions about GECCO, if you run into any issue, or if you would like to make a feature request, please create an issue in the
+`GitHub repository <https://github.com/zellerlab/gecco>`_. You can also directly contact `Martin Larralde via email <mailto:martin.larralde@embl.de>`_.
+If you want to contribute to GECCO, please have a look at the contribution guide first, and feel free to open a pull request on the GitHub repository.
+
+    ]]></help>
     <citations>
-        <citation type="bibtex">
-@article {Carroll2021.05.03.442509,
-	author = {Carroll, Laura M. and Larralde, Martin and Fleck, Jonas Simon and Ponnudurai, Ruby and Milanese, Alessio and Cappio, Elisa and Zeller, Georg},
-	title = {Accurate de novo identification of biosynthetic gene clusters with GECCO},
-	elocation-id = {2021.05.03.442509},
-	year = {2021},
-	doi = {10.1101/2021.05.03.442509},
-	publisher = {Cold Spring Harbor Laboratory},
-	abstract = {Biosynthetic gene clusters (BGCs) are enticing targets for (meta)genomic mining efforts, as they may encode novel, specialized metabolites with potential uses in medicine and biotechnology. Here, we describe GECCO (GEne Cluster prediction with COnditional random fields; https://gecco.embl.de), a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs). Based on an extensive evaluation of de novo BGC prediction, we found GECCO to be more accurate and over 3x faster than a state-of-the-art deep learning approach. When applied to over 12,000 genomes, GECCO identified nearly twice as many BGCs compared to a rule-based approach, while achieving higher accuracy than other machine learning approaches. Introspection of the GECCO CRF revealed that its predictions rely on protein domains with both known and novel associations to secondary metabolism. The method developed here represents a scalable, interpretable machine learning approach, which can identify BGCs de novo with high precision.Competing Interest StatementThe authors have declared no competing interest.},
-	URL = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509},
-	eprint = {https://www.biorxiv.org/content/early/2021/05/04/2021.05.03.442509.full.pdf},
-	journal = {bioRxiv}
-}
-        </citation>
+        <citation type="doi">10.1101/2021.05.03.442509</citation>
     </citations>
 </tool>
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/sideload.json	Sun Nov 21 19:47:22 2021 +0000
@@ -0,0 +1,36 @@
+{
+    "records": [
+        {
+            "name": "BGC0001866.1",
+            "subregions": [
+                {
+                    "details": {
+                        "alkaloid_probability": "0.000",
+                        "average_p": "0.997",
+                        "max_p": "1.000",
+                        "nrp_probability": "0.140",
+                        "other_probability": "0.000",
+                        "polyketide_probability": "0.980",
+                        "ripp_probability": "0.000",
+                        "saccharide_probability": "0.000",
+                        "terpene_probability": "0.000"
+                    },
+                    "end": 32979,
+                    "label": "Polyketide",
+                    "start": 347
+                }
+            ]
+        }
+    ],
+    "tool": {
+        "configuration": {
+            "cds": "3",
+            "e-filter": "None",
+            "postproc": "'gecco'",
+            "threshold": "0.3"
+        },
+        "description": "Biosynthetic Gene Cluster prediction with Conditional Random Fields.",
+        "name": "GECCO",
+        "version": "0.8.4"
+    }
+}
\ No newline at end of file
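
The new `test-data/sideload.json` fixture shows the shape of the sideload output: a list of `records`, each with `subregions` carrying `start`, `end`, `label` and a `details` mapping of string-encoded probabilities. As a rough sketch of how such a file could be inspected, the snippet below parses a trimmed copy of that fixture. The helper name `summarize_sideload` is hypothetical, and only keys visible in the fixture are assumed; this is not a description of the full antiSMASH sideload schema.

```python
import json

# Trimmed copy of the test-data/sideload.json fixture added in this changeset.
SIDELOAD = """
{
    "records": [
        {
            "name": "BGC0001866.1",
            "subregions": [
                {
                    "details": {
                        "average_p": "0.997",
                        "max_p": "1.000",
                        "polyketide_probability": "0.980"
                    },
                    "end": 32979,
                    "label": "Polyketide",
                    "start": 347
                }
            ]
        }
    ],
    "tool": {"name": "GECCO", "version": "0.8.4"}
}
"""


def summarize_sideload(text):
    """Return one (record, start, end, label, average_p) tuple per subregion.

    Hypothetical helper: it relies only on the keys seen in the fixture.
    """
    document = json.loads(text)
    rows = []
    for record in document["records"]:
        for subregion in record["subregions"]:
            # Probabilities are serialized as strings in the fixture, so convert.
            average_p = float(subregion["details"]["average_p"])
            rows.append(
                (
                    record["name"],
                    subregion["start"],
                    subregion["end"],
                    subregion["label"],
                    average_p,
                )
            )
    return rows


if __name__ == "__main__":
    for row in summarize_sideload(SIDELOAD):
        print(*row, sep="\t")
```

A flat tuple per subregion keeps the result easy to compare against the `clusters.tsv` output, which reports the same coordinates and predicted type in tabular form.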