view CHANGELOG.md @ 16:042a23379d2d draft
Release v0.9.3
author | althonos |
---|---|
date | Fri, 13 May 2022 14:36:19 +0000 |
parents | 64528877558f |
children | e28ffe93dc69 |
# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [Unreleased]
[Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.3...master

## [v0.9.3] - 2022-05-13
[v0.9.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.2...v0.9.3

### Changed
- `--format` flag of `gecco annotate` and `gecco run` CLI commands is now lowercased before being passed to `Bio.SeqIO`.

### Fixed
- Genes with duplicate IDs being silently ignored in `HMMER.run`.

## [v0.9.2] - 2022-04-11
[v0.9.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1...v0.9.2

### Added
- Padding of short sequences with empty genes when predicting probabilities in `ClusterCRF`.

## [v0.9.1] - 2022-04-05
[v0.9.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha4...v0.9.1

### Changed
- Make the `genes.tsv` and `features.tsv` tables contain all genes even when they come from a contig too short to be processed by the CRF sliding window.
- Replaced the `--force-clusters-tsv` flag with a `--force-tsv` flag to force writing TSV tables even when no genes or clusters were found in `gecco run` or `gecco annotate`.

## [v0.9.1-alpha4] - 2022-03-31
[v0.9.1-alpha4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha3...v0.9.1-alpha4

Retrain internal model with:
```
$ python -m gecco -vv train --c1 0.4 --c2 0 --select 0.25 --window-size 20 \
    -f mibig-2.0.proG2.Pfam-v35.0.features.tsv \
    -c mibig-2.0.proG2.clusters.tsv \
    -g GECCO-data/data/embeddings/mibig-2.0.proG2.genes.tsv \
    -o models/v0.9.1-alpha4
```

## [v0.9.1-alpha3] - 2022-03-23
[v0.9.1-alpha3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha2...v0.9.1-alpha3

### Added
- `gecco.model.GeneTable` class to store gene coordinates independently of protein domains.
### Changed
- Refactored implementation of `load` and `dump` methods for `Table` classes into a dedicated base class.
- `gecco run` and `gecco annotate` now output a gene table in addition to the feature and cluster tables.
- `gecco train` expects a gene table instead of a GFF file for the gene coordinates.

## [v0.9.1-alpha2] - 2022-03-23
[v0.9.1-alpha2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.9.1-alpha1...v0.9.1-alpha2

### Fixed
- `TypeClassifier.trained` not being able to read unknown types from type tables.

## [v0.9.1-alpha1] - 2022-03-20
[v0.9.1-alpha1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.10...v0.9.1-alpha1

Candidate release with support for a sliding window in the CRF prediction algorithm.

## [v0.8.10] - 2022-02-23
[v0.8.10]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.9...v0.8.10

### Fixed
- `--antismash-sideload` flag of `gecco run` causing command to crash.

## [v0.8.9] - 2022-02-22
[v0.8.9]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.8...v0.8.9

### Removed
- Prediction and support for the *Other* biosynthetic type of MIBiG clusters.

## [v0.8.8] - 2022-02-21
[v0.8.8]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.7...v0.8.8

### Fixed
- `ClusterRefiner` filtering method for edge genes not working as intended.
- `gecco run` and `gecco annotate` commands crashing on missing input files instead of nicely rendering the error.

## [v0.8.7] - 2022-02-18
[v0.8.7]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.6...v0.8.7

### Fixed
- `interpro.json` metadata file not being included in distribution files.
- Missing docstring for `Protein.with_domains` method.

### Changed
- Bump minimum `scikit-learn` version to `v1.0` for Python3.7+.

## [v0.8.6] - 2022-02-17 - YANKED
[v0.8.6]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.5...v0.8.6

### Added
- CLI flag for enabling region masking for contigs processed by Prodigal.
- CLI flag for controlling region distance used for edge distance filtering.
### Changed
- `gecco.model.Gene` and `gecco.model.Protein` are now immutable data classes.
- Bump minimum `pyrodigal` version to `v0.6.4` to use region masking.
- Implement filtering for extracted clusters based on distance to the contig edge.
- Store InterPro metadata file uncompressed for version-control integration.

### Fixed
- Mark `BGC0000930` as `Terpene` in the type classifier data.
- Progress bar messages are now in consistent format.

## [v0.8.5] - 2021-11-21
[v0.8.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.4...v0.8.5

### Added
- Minimal compatibility support for running GECCO inside of Galaxy workflows.

## [v0.8.4] - 2021-09-26
[v0.8.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3-post1...v0.8.4

### Fixed
- `gecco convert gbk --format bigslice` failing to run because of outdated code ([#5](https://github.com/zellerlab/GECCO/issues/5)).
- `gecco convert gbk --format bigslice` not creating files with names conforming to BiG-SLiCE expected input.

### Changed
- Bump minimum `pyrodigal` version to `v0.6.2` to use platform-accelerated code if supported.

## [v0.8.3-post1] - 2021-08-23
[v0.8.3-post1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3...v0.8.3-post1

### Fixed
- Wrong default value for `--threshold` being shown in `gecco run` help message.

## [v0.8.3] - 2021-08-23
[v0.8.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.8.3

### Changed
- Default probability threshold for segmentation to 0.3 (from 0.4).

## [v0.8.2] - 2021-07-31
[v0.8.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.1...v0.8.2

### Fixed
- `gecco run` crashing on Python 3.6 because of missing `contextlib.nullcontext` class.

### Changed
- `gecco run` and `gecco annotate` will not try to count the number of profiles when given an external HMM file with the `--hmm` flag.
- `PyHMMER.run` now reports the *p-value* of each domain in addition to the *e-value* as a `/note` qualifier.
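
For context on the v0.8.2 Python 3.6 fix: `contextlib.nullcontext` only exists in Python 3.7 and later, so code importing it unconditionally fails on 3.6. The sketch below shows one common way to fall back to an equivalent context manager. It is an illustration only, not necessarily how GECCO resolved the crash; the `read_all` helper and its `lock` parameter are hypothetical names introduced for the example.

```python
import contextlib
import io
import threading

try:
    # Available in the standard library from Python 3.7 onwards.
    from contextlib import nullcontext
except ImportError:
    # Minimal stand-in for Python 3.6: a context manager that does nothing
    # and yields the value it was given.
    @contextlib.contextmanager
    def nullcontext(enter_result=None):
        yield enter_result


def read_all(handle, lock=None):
    """Read a handle, holding `lock` if one is given (hypothetical helper)."""
    # Use the real lock when provided, otherwise a no-op context manager,
    # so the body of the `with` block is written only once.
    with (lock if lock is not None else nullcontext()):
        return handle.read()


if __name__ == "__main__":
    print(read_all(io.StringIO("unlocked read")))
    print(read_all(io.StringIO("locked read"), lock=threading.Lock()))
```

The fallback keeps a single code path for the optional-resource case, which is the usual reason to reach for `nullcontext` in the first place.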
## [v0.8.1] - 2021-07-29
[v0.8.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.0...v0.8.1

### Changed
- `gecco run` now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom `--model`.

### Fixed
- `gecco` reporting about using Pfam `v33.1` while actually using `v34.0` because of an outdated field in `gecco/hmmer/Pfam.ini`.

### Added
- Missing documentation for the `strand` attribute of `gecco.model.Gene`.

## [v0.8.0] - 2021-07-03
[v0.8.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...v0.8.0

### Changed
- Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0.
- Bump minimum `pyhmmer` version to `v0.4.0` to improve exception handling.
- Bump minimum `pyrodigal` version to `v0.5.0` to fix sequence decoding on some platforms.
- Use p-values instead of e-values to filter domains obtained with HMMER.
- `gecco cv` and `gecco train` now seed the RNG with a user-defined seed before shuffling rows of training data.

### Fixed
- Extraction of BGC compositions for the type predictor while training.
- `ClusterCRF.trained` failing to open an external model.

### Added
- `Domain.pvalue` attribute to access the p-value of a domain annotation.
- Mandatory `pvalue` column to `FeatureTable` objects.
- Support for loading several feature tables in `gecco train` and `gecco cv`.
- Warnings to `ClusterCRF.fit` when selecting uninformative features.
- `--correction` flag to `gecco train` and `gecco cv`, allowing a multiple testing correction method to be given when computing p-values with Fisher's Exact Test.

### Removed
- Outdated `gecco embed` command.
- Unused `--truncate` flag from the `gecco train` CLI.
- Tigrfam domains, which are not improving performance on the new training data.

## [v0.7.0] - 2021-05-31
[v0.7.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...v0.7.0

### Added
- Support for writing an AntiSMASH sideload JSON file after a `gecco run` workflow.
- Code for converting GenBank files in BiG-SLiCE compatible format with the `gecco convert` subcommand.
- Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE.

### Changed
- Minimum Biopython version to `v1.73` for compatibility with older bioinformatics tooling.
- Internal domain composition shipped in the `gecco.types` with newer composition array obtained directly from MIBiG files.

### Removed
- Outdated notice about `-vvv` verbosity level in the help message of the main `gecco` command.

## [v0.6.3] - 2021-05-10
[v0.6.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.2...v0.6.3

### Fixed
- HMMER annotation not properly handling inputs with multiple contigs.
- Some progress bar totals displaying as floats in the CLI.

### Changed
- `PyHMMER` now sets the `Z` and `domZ` values from the number of proteins given to the search pipeline.
- `gecco.cli` delegates imports to make CLI more responsive.
- `pkg_resources` has been replaced with `importlib.resources` and `importlib.metadata` where applicable.
- `multiprocessing.cpu_count` has been replaced with `os.cpu_count` where applicable.

## [v0.6.2] - 2021-05-04
[v0.6.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.1...v0.6.2

### Fixed
- `gecco cv loto` crashing because of outdated code.

### Changed
- Logging-style prompt will only display if GECCO is running with `-vv` flag.

### Added
- GECCO bioRxiv paper reference to `Cluster.to_seq_record` output record.

## [v0.6.1] - 2021-03-15
[v0.6.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.0...v0.6.1

### Fixed
- Progress bar not being disabled by `-q` flag in CLI.
- Fallback to using HMM name if accession is not available in `PyHMMER`.
- Group genes by source contig and process them separately in `PyHMMER` to avoid bogus E-values.

### Added
- `psutil` dependency to get the number of physical CPU cores on the host machine.
- Support for using an arbitrary mapping of positives to negatives in `gecco embed`.
### Removed
- Unused and outdated `HMMER` and `DomainRow` classes from `gecco.hmmer`.

## [v0.6.0] - 2021-02-28
[v0.6.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.5...v0.6.0

### Changed
- Updated internal model with a cleaned-up version of the MIBiG-2.0 Pfam-33.1/Tigrfam-15.0 embedding.
- Updated internal InterPro catalog.

### Fixed
- Features not being grouped together in `gecco cv` and `gecco train` when provided with a feature table where rows were not sorted by protein IDs.

## [v0.5.5] - 2021-02-28
[v0.5.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.4...v0.5.5

### Fixed
- `gecco cv` bug causing only the last fold to be written.

## [v0.5.4] - 2021-02-28
[v0.5.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.3...v0.5.4

### Changed
- Replaced `verboselogs`, `coloredlogs` and `better-exceptions` with `rich`.

### Removed
- `tqdm` training dependency.

### Added
- `gecco annotate` command to produce a feature table from a genomic file.
- `gecco embed` to embed BGCs into non-BGC regions using feature tables.

## [v0.5.3] - 2021-02-21
[v0.5.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.2...v0.5.3

### Fixed
- Coordinates of genes in output GenBank files.
- Potential issue with the number of CPUs in `PyHMMER.run`.

### Changed
- Bump required `pyrodigal` version to `v0.4.2` to fix buffer overflow.

## [v0.5.2] - 2021-01-29
[v0.5.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...v0.5.2

### Added
- Support for downloading HMM files directly from GitHub releases assets.
- Validation of filtered HMMs with MD5 checksum.

### Fixed
- Invalid coordinates of protein domains in GenBank output files.
- `gecco.interpro` module not being added to wheel distribution.

### Changed
- Bump required `pyhmmer` version to `v0.2.1`.

## [v0.5.1] - 2021-01-15
[v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1

### Fixed
- `--hmm` flag being ignored in `gecco run` command.
- `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs.

## [v0.5.0] - 2021-01-11
[v0.5.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0

### Added
- Explicit support for Python 3.9.

### Changed
- [`pyhmmer`](https://pypi.org/project/pyhmmer) is used to annotate protein sequences instead of HMMER3 binary `hmmsearch`.
- HMM files are stored in binary format to speed up parsing and reduce storage size.
- `tqdm` is now a *training*-only dependency.
- `gecco cv` now requires *training* dependencies.

## [v0.4.5] - 2020-11-23
[v0.4.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.4...v0.4.5

### Added
- Additional `fold` column to cross-validation table output.

### Changed
- Use sequence ID instead of protein ID to extract type from cluster in `gecco cv`.
- Install HMM data in pre-pressed format to make `hmmsearch` runs faster on short sequences.
- `gecco.orf` was rewritten to extract genes from input sequences in parallel.

## [v0.4.4] - 2020-09-30
[v0.4.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.3...v0.4.4

### Added
- `gecco cv loto` command to run LOTO cross-validation using BGC types for stratification.
- `header` keyword argument to `FeatureTable.dump` and `ClusterTable.dump` to write the table without the column header, which allows appending to an existing table.
- `__getitem__` implementation for `FeatureTable` and `ClusterTable` that returns a single row or a sub-table from a table.

### Fixed
- `gecco cv` command now writes results iteratively instead of holding the tables for every fold in memory.

### Changed
- Bumped `pandas` training dependency to `v1.0`.

## [v0.4.3] - 2020-09-07
[v0.4.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.2...v0.4.3

### Fixed
- GenBank files being written with invalid `/cds` feature type.

### Changed
- Blocked installation of Biopython `v1.78` or newer as it removes `Bio.Alphabet` and breaks the current code.
## [v0.4.2] - 2020-08-07
[v0.4.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.1...v0.4.2

### Fixed
- `TypeClassifier.predict_types` using inverse type probabilities when given several clusters to process.

## [v0.4.1] - 2020-08-07
[v0.4.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.0...v0.4.1

### Fixed
- `gecco run` command crashing on input sequences not containing any genes.

## [v0.4.0] - 2020-08-06
[v0.4.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.3.0...v0.4.0

### Added
- `gecco.model.ProductType` enum to model the biosynthetic class of a BGC.

### Removed
- `pandas` interaction from internal data model.
- `ClusterCRF` code specific to cross-validation.

### Changed
- `pandas`, `fisher` and `statsmodels` dependencies are now optional.
- `gecco train` command expects a cluster table in addition to the feature table to know the types of the input BGCs.

## [v0.3.0] - 2020-08-03
[v0.3.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.2...v0.3.0

### Changed
- Replaced Nearest-Neighbours classifier with Random Forest to perform type prediction for candidate BGCs.
- `gecco.knn` module was renamed to implementation-agnostic name `gecco.types`.

### Fixed
- Extraction of domain composition taking a long time in `gecco train` command.

### Removed
- `--metric` argument to the `gecco run` CLI command.

## [v0.2.2] - 2020-07-31
[v0.2.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.1...v0.2.2

### Changed
- `Domain` and `Gene` can now carry qualifiers that are used when they are translated to a sequence feature.

### Added
- InterPro names, accessions, and HMMER e-value for each annotated domain in GenBank output files.

## [v0.2.1] - 2020-07-23
[v0.2.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.0...v0.2.1

### Fixed
- Various potential crashes in `ClusterRefiner` code.

### Removed
- Unneeded feature dictionary filtering in `ClusterCRF` for models with Fisher Exact Test feature selection.
## [v0.2.0] - 2020-07-23
[v0.2.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.1...v0.2.0

### Fixed
- `pandas` warning about unsorted columns in `gecco run`.

### Removed
- `Gene.probability` property, replaced by `Gene.maximum_probability` and `Gene.average_probability` properties to be explicit.

### Changed
- Internal model now uses `Pfam` and `Tigrfam` with the top 35% features selected with Fisher's Exact Test.
- `ClusterRefiner` now removes genes on `Cluster` edges if they do not contain any domain annotation.

## [v0.1.1] - 2020-07-22
[v0.1.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.0...v0.1.1

### Added
- `ClusterCRF.predict_probabilities` to annotate a list of `Gene`.

### Changed
- BGC probability is now stored at the `Domain` level instead of at the `Gene` level, independently of the feature extraction level used by the CRF.
- `ClusterKNN` will use the model path provided to `gecco run` if any.

### Docs
- Added this changelog file to document changes in the code.
- Added documentation to `gecco` submodules missing some.
- Included the `CHANGELOG.md` file to the generated docs.

## [v0.1.0] - 2020-07-17
[v0.1.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.0.1...v0.1.0

Initial release.

## [v0.0.1] - 2018-08-13
[v0.0.1]: https://git.embl.de/grp-zeller/GECCO/compare/37afb97...v0.0.1

Proof-of-concept.