diff CHANGELOG.md @ 3:359232b58f6a draft

"Update Galaxy tool wrapper to follow the IUC best practices"
author althonos
date Sun, 21 Nov 2021 19:47:22 +0000
parents
children 169849dfb098
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/CHANGELOG.md	Sun Nov 21 19:47:22 2021 +0000
@@ -0,0 +1,299 @@
+# Changelog
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
+and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.5...master
+
+## [v0.8.5] - 2021-11-21
+[v0.8.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.4...v0.8.5
+### Added
+- Minimal compatibility support for running GECCO inside of Galaxy workflows.
+
+## [v0.8.4] - 2021-09-26
+[v0.8.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3-post1...v0.8.4
+### Fixed
+- `gecco convert gbk --format bigslice` failing to run because of outdated code ([#5](https://github.com/zellerlab/GECCO/issues/5)).
+- `gecco convert gbk --format bigslice` not creating files with names conforming to BiG-SLiCE expected input.
+### Changed
+- Bump minimum `pyrodigal` version to `v0.6.2` to use platform-accelerated code if supported.
+
+## [v0.8.3-post1] - 2021-08-23
+[v0.8.3-post1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3...v0.8.3-post1
+### Fixed
+- Wrong default value for `--threshold` being shown in `gecco run` help message.
+
+## [v0.8.3] - 2021-08-23
+[v0.8.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.8.3
+### Changed
+- Default probability threshold for segmentation to 0.3 (from 0.4).
+
+## [v0.9.0] - 2021-08-10 - **YANKED**
+[v0.9.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.9.0
+### Changed
+- Retrain internal model using `--select=0.35` instead of the previous `--select=0.25`.
+- Change default *p-value* filter from 1e-9 to 1e-5 to detect more features.
+
+## [v0.8.2] - 2021-07-31
+[v0.8.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.1...v0.8.2
+### Fixed
+- `gecco run` crashing on Python 3.6 because of missing `contextlib.nullcontext` class.
+### Changed
+- `gecco run` and `gecco annotate` will not try to count the number of profiles when given an external HMM file with the `--hmm` flag.
+- `PyHMMER.run` now reports the *p-value* of each domain in addition to the *e-value* as a `/note` qualifier.
+
+## [v0.8.1] - 2021-07-29
+[v0.8.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.0...v0.8.1
+### Changed
+- `gecco run` now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom `--model`.
+### Fixed
+- `gecco` reporting the use of Pfam `v33.1` while actually using `v34.0` because of an outdated field in `gecco/hmmer/Pfam.ini`.
+### Added
+- Missing documentation for the `strand` attribute of `gecco.model.Gene`.
+
+## [v0.8.0] - 2021-07-03
+[v0.8.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...v0.8.0
+### Changed
+- Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
+- Bump minimum `pyhmmer` version to `v0.4.0` to improve exception handling.
+- Bump minimum `pyrodigal` version to `v0.5.0` to fix sequence decoding on some platforms.
+- Use p-values instead of e-values to filter domains obtained with HMMER.
+- `gecco cv` and `gecco train` now seed the RNG with a user-defined seed before shuffling rows of training data.
+### Fixed
+- Extraction of BGC compositions for the type predictor while training.
+- `ClusterCRF.trained` failing to open an external model.
+### Added
+- `Domain.pvalue` attribute to access the p-value of a domain annotation.
+- Mandatory `pvalue` column to `FeatureTable` objects.
+- Support for loading several feature tables in `gecco train` and `gecco cv`.
+- Warnings to `ClusterCRF.fit` when selecting uninformative features.
+- `--correction` flag to `gecco train` and `gecco cv`, to select a multiple testing correction method when computing p-values with Fisher's Exact Test.
+### Removed
+- Outdated `gecco embed` command.
+- Unused `--truncate` flag from the `gecco train` CLI.
+- Tigrfam domains, which do not improve performance on the new training data.
+
+## [v0.7.0] - 2021-05-31
+[v0.7.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...v0.7.0
+### Added
+- Support for writing an AntiSMASH sideload JSON file after a `gecco run` workflow.
+- Code for converting GenBank files to a BiG-SLiCE-compatible format with the `gecco convert` subcommand.
+- Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.
+### Changed
+- Minimum Biopython version to `v1.73` for compatibility with older bioinformatics tooling.
+- Replaced the internal domain composition shipped in `gecco.types` with a newer composition array obtained directly from MIBiG files.
+### Removed
+- Outdated notice about `-vvv` verbosity level in the help message of the main `gecco` command.
+
+## [v0.6.3] - 2021-05-10
+[v0.6.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.2...v0.6.3
+### Fixed
+- HMMER annotation not properly handling inputs with multiple contigs.
+- Some progress bar totals displaying as floats in the CLI.
+### Changed
+- `PyHMMER` now sets the `Z` and `domZ` values from the number of proteins given to the search pipeline.
+- `gecco.cli` delegates imports to make CLI more responsive.
+- `pkg_resources` has been replaced with `importlib.resources` and `importlib.metadata` where applicable.
+- `multiprocessing.cpu_count` has been replaced with `os.cpu_count` where applicable.
+
+## [v0.6.2] - 2021-05-04
+[v0.6.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.1...v0.6.2
+### Fixed
+- `gecco cv loto` crashing because of outdated code.
+### Changed
+- Logging-style prompt will only be displayed if GECCO is running with the `-vv` flag.
+### Added
+- GECCO bioRxiv paper reference to `Cluster.to_seq_record` output record.
+
+## [v0.6.1] - 2021-03-15
+[v0.6.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.0...v0.6.1
+### Fixed
+- Progress bar not being disabled by `-q` flag in CLI.
+- Fallback to using HMM name if accession is not available in `PyHMMER`.
+- Group genes by source contig and process them separately in `PyHMMER` to avoid bogus E-values.
+### Added
+- `psutil` dependency to get the number of physical CPU cores on the host machine.
+- Support for using an arbitrary mapping of positives to negatives in `gecco embed`.
+### Removed
+- Unused and outdated `HMMER` and `DomainRow` classes from `gecco.hmmer`.
+
+## [v0.6.0] - 2021-02-28
+[v0.6.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.5...v0.6.0
+### Changed
+- Updated internal model with a cleaned-up version of the MIBiG-2.0
+  Pfam-33.1/Tigrfam-15.0 embedding.
+- Updated internal InterPro catalog.
+### Fixed
+- Features not being grouped together in `gecco cv` and `gecco train`
+  when provided with a feature table where rows were not sorted by
+  protein IDs.
+
+## [v0.5.5] - 2021-02-28
+[v0.5.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.4...v0.5.5
+### Fixed
+- `gecco cv` bug causing only the last fold to be written.
+
+## [v0.5.4] - 2021-02-28
+[v0.5.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.3...v0.5.4
+### Changed
+- Replaced `verboselogs`, `coloredlogs` and `better-exceptions` with `rich`.
+### Removed
+- `tqdm` training dependency.
+### Added
+- `gecco annotate` command to produce a feature table from a genomic file.
+- `gecco embed` to embed BGCs into non-BGC regions using feature tables.
+
+## [v0.5.3] - 2021-02-21
+[v0.5.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.2...v0.5.3
+### Fixed
+- Coordinates of genes in output GenBank files.
+- Potential issue with the number of CPUs in `PyHMMER.run`.
+### Changed
+- Bump required `pyrodigal` version to `v0.4.2` to fix buffer overflow.
+
+## [v0.5.2] - 2021-01-29
+[v0.5.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...v0.5.2
+### Added
+- Support for downloading HMM files directly from GitHub releases assets.
+- Validation of filtered HMMs with MD5 checksum.
+### Fixed
+- Invalid coordinates of protein domains in GenBank output files.
+- `gecco.interpro` module not being added to wheel distribution.
+### Changed
+- Bump required `pyhmmer` version to `v0.2.1`.
+
+## [v0.5.1] - 2021-01-15
+[v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1
+### Fixed
+- `--hmm` flag being ignored in the `gecco run` command.
+- `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs.
+
+## [v0.5.0] - 2021-01-11
+[v0.5.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0
+### Added
+- Explicit support for Python 3.9.
+### Changed
+- [`pyhmmer`](https://pypi.org/project/pyhmmer) is used to annotate protein sequences instead of HMMER3 binary `hmmsearch`.
+- HMM files are stored in binary format to speed up parsing and reduce storage size.
+- `tqdm` is now a *training*-only dependency.
+- `gecco cv` now requires *training* dependencies.
+
+## [v0.4.5] - 2020-11-23
+[v0.4.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.4...v0.4.5
+### Added
+- Additional `fold` column to cross-validation table output.
+### Changed
+- Use sequence ID instead of protein ID to extract type from cluster in `gecco cv`.
+- Install HMM data in pre-pressed format to make `hmmsearch` runs faster on short sequences.
+- `gecco.orf` was rewritten to extract genes from input sequences in parallel.
+
+## [v0.4.4] - 2020-09-30
+[v0.4.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.3...v0.4.4
+### Added
+- `gecco cv loto` command to run LOTO cross-validation using BGC types
+  for stratification.
+- `header` keyword argument to `FeatureTable.dump` and `ClusterTable.dump`
+  to write the table without the column header, making it possible to
+  append to an existing table.
+- `__getitem__` implementation for `FeatureTable` and `ClusterTable`
+  that returns a single row or a sub-table from a table.
+### Fixed
+- `gecco cv` command now writes results iteratively instead of holding
+  the tables for every fold in memory.
+### Changed
+- Bumped `pandas` training dependency to `v1.0`.
+
+## [v0.4.3] - 2020-09-07
+[v0.4.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.2...v0.4.3
+### Fixed
+- GenBank files being written with invalid `/cds` feature type.
+### Changed
+- Blocked installation of Biopython `v1.78` or newer as it removes `Bio.Alphabet`
+  and breaks the current code.
+
+## [v0.4.2] - 2020-08-07
+[v0.4.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.1...v0.4.2
+### Fixed
+- `TypeClassifier.predict_types` using inverse type probabilities when
+  given several clusters to process.
+
+## [v0.4.1] - 2020-08-07
+[v0.4.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.0...v0.4.1
+### Fixed
+- `gecco run` command crashing on input sequences not containing any genes.
+
+## [v0.4.0] - 2020-08-06
+[v0.4.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.3.0...v0.4.0
+### Added
+- `gecco.model.ProductType` enum to model the biosynthetic class of a BGC.
+### Removed
+- `pandas` interaction from internal data model.
+- `ClusterCRF` code specific to cross-validation.
+### Changed
+- `pandas`, `fisher` and `statsmodels` dependencies are now optional.
+- `gecco train` command expects a cluster table in addition to the feature
+   table to know the types of the input BGCs.
+
+## [v0.3.0] - 2020-08-03
+[v0.3.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.2...v0.3.0
+### Changed
+- Replaced Nearest-Neighbours classifier with Random Forest to perform type
+  prediction for candidate BGCs.
+- `gecco.knn` module was renamed to implementation-agnostic name `gecco.types`.
+### Fixed
+- Extraction of domain composition taking a long time in `gecco train` command.
+### Removed
+- `--metric` argument to the `gecco run` CLI command.
+
+## [v0.2.2] - 2020-07-31
+[v0.2.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.1...v0.2.2
+### Changed
+- `Domain` and `Gene` can now carry qualifiers that are used when they
+  are translated to a sequence feature.
+### Added
+- InterPro names, accessions, and HMMER e-value for each annotated domain
+  in GenBank output files.
+
+## [v0.2.1] - 2020-07-23
+[v0.2.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.0...v0.2.1
+### Fixed
+- Various potential crashes in `ClusterRefiner` code.
+### Removed
+- Unneeded feature dictionary filtering in `ClusterCRF` for models with
+  Fisher Exact Test feature selection.
+
+## [v0.2.0] - 2020-07-23
+[v0.2.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.1...v0.2.0
+### Fixed
+- `pandas` warning about unsorted columns in `gecco run`.
+### Removed
+- `Gene.probability` property, replaced by `Gene.maximum_probability` and
+  `Gene.average_probability` properties to be explicit.
+### Changed
+- Internal model now uses `Pfam` and `Tigrfam` with the top 35% features
+  selected with Fisher's Exact Test.
+- `ClusterRefiner` now removes genes on `Cluster` edges if they do not
+  contain any domain annotation.
+
+## [v0.1.1] - 2020-07-22
+[v0.1.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.0...v0.1.1
+### Added
+- `ClusterCRF.predict_probabilities` to annotate a list of `Gene` objects.
+### Changed
+- BGC probability is now stored at the `Domain` level instead of at the `Gene`
+  level, independently of the feature extraction level used by the CRF.
+- `ClusterKNN` will use the model path provided to `gecco run` if any.
+### Docs
+- Added this changelog file to document changes in the code.
+- Added documentation to `gecco` submodules that were missing it.
+- Included the `CHANGELOG.md` file in the generated docs.
+
+## [v0.1.0] - 2020-07-17
+[v0.1.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.0.1...v0.1.0
+Initial release.
+
+## [v0.0.1] - 2018-08-13
+[v0.0.1]: https://git.embl.de/grp-zeller/GECCO/compare/37afb97...v0.0.1
+Proof-of-concept.