comparison CHANGELOG.md @ 3:359232b58f6a draft
"Update Galaxy tool wrapper to follow the IUC best practices"
author | althonos |
---|---|
date | Sun, 21 Nov 2021 19:47:22 +0000 |
parents | |
children | 169849dfb098 |
2:e618ab1c78d9 | 3:359232b58f6a |
---|---|
1 # Changelog | |
2 All notable changes to this project will be documented in this file. | |
3 | |
4 The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) | |
5 and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html). | |
6 | |
7 ## [Unreleased] | |
8 [Unreleased]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.5...master | |
9 | |
10 ## [v0.8.5] - 2021-11-21 | |
11 [v0.8.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.4...v0.8.5 | |
12 ### Added | |
13 - Minimal compatibility support for running GECCO inside of Galaxy workflows. | |
14 | |
15 ## [v0.8.4] - 2021-09-26 | |
16 [v0.8.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3-post1...v0.8.4 | |
17 ### Fixed | |
18 - `gecco convert gbk --format bigslice` failing to run because of outdated code ([#5](https://github.com/zellerlab/GECCO/issues/5)). | |
19 - `gecco convert gbk --format bigslice` not creating files with names conforming to BiG-SLiCE expected input. | |
20 ### Changed | |
21 - Bump minimum `pyrodigal` version to `v0.6.2` to use platform-accelerated code if supported. | |
22 | |
23 ## [v0.8.3-post1] - 2021-08-23 | |
24 [v0.8.3-post1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.3...v0.8.3-post1 | |
25 ### Fixed | |
26 - Wrong default value for `--threshold` being shown in `gecco run` help message. | |
27 | |
28 ## [v0.8.3] - 2021-08-23 | |
29 [v0.8.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.8.3 | |
30 ### Changed | |
31 - Default probability threshold for segmentation to 0.3 (from 0.4). | |
32 | |
33 ## [v0.9.0] - 2021-08-10 - **YANKED** | |
34 [v0.9.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.2...v0.9.0 | |
35 ### Changed | |
36 - Retrain internal model using `--select=0.35` instead of `--select=0.25` like before. | |
37 - Change default *p-value* filter from 1e-9 to 1e-5 to detect more features. | |
38 | |
39 ## [v0.8.2] - 2021-07-31 | |
40 [v0.8.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.1...v0.8.2 | |
41 ### Fixed | |
42 - `gecco run` crashing on Python 3.6 because of missing `contextlib.nullcontext` class. | |
43 ### Changed | |
44 - `gecco run` and `gecco annotate` will not try to count the number of profiles when given an external HMM file with the `--hmm` flag. | |
45 - `PyHMMER.run` now reports the *p-value* of each domain in addition to the *e-value* as a `/note` qualifier. | |
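The Python 3.6 crash noted above comes from `contextlib.nullcontext` only being available from Python 3.7 onwards. A minimal sketch of one common workaround follows; the fallback definition is an illustration and is not taken from the GECCO source.

```python
import contextlib

try:
    # Present in the standard library from Python 3.7 onwards.
    from contextlib import nullcontext
except ImportError:
    # Hypothetical fallback for Python 3.6 (an assumption, not GECCO's code).
    @contextlib.contextmanager
    def nullcontext(enter_result=None):
        # Do nothing on enter or exit; just hand back the given value.
        yield enter_result

# The context manager can stand in wherever a real one is optional.
with nullcontext("value") as entered:
    result = entered
```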
46 | |
47 ## [v0.8.1] - 2021-07-29 | |
48 [v0.8.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.8.0...v0.8.1 | |
49 ### Changed | |
50 - `gecco run` now filters out unneeded features before annotating, making it easier to analyze the results of a run with a custom `--model`. | |
51 ### Fixed | |
52 - `gecco` reporting about using Pfam `v33.1` while actually using `v34.0` because of an outdated field in `gecco/hmmer/Pfam.ini`. | |
53 ### Added | |
54 - Missing documentation for the `strand` attribute of `gecco.model.Gene`. | |
55 | |
56 ## [v0.8.0] - 2021-07-03 | |
57 [v0.8.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.7.0...v0.8.0 | |
58 ### Changed | |
59 - Retrain internal model using new sequence embeddings and remove broken/duplicate BGCs from MIBiG 2.0. | |
60 - Bump minimum `pyhmmer` version to `v0.4.0` to improve exception handling. | |
61 - Bump minimum `pyrodigal` version to `v0.5.0` to fix sequence decoding on some platforms. | |
62 - Use p-values instead of e-values to filter domains obtained with HMMER. | |
63 - `gecco cv` and `gecco train` now seed the RNG with a user-defined seed before shuffling rows of training data. | |
64 ### Fixed | |
65 - Extraction of BGC compositions for the type predictor while training. | |
66 - `ClusterCRF.trained` failing to open an external model. | |
67 ### Added | |
68 - `Domain.pvalue` attribute to access the p-value of a domain annotation. | |
69 - Mandatory `pvalue` column to `FeatureTable` objects. | |
70 - Support for loading several feature tables in `gecco train` and `gecco cv`. | |
71 - Warnings to `ClusterCRF.fit` when selecting uninformative features. | |
72 - `--correction` flag to `gecco train` and `gecco cv`, to select a multiple testing correction method when computing p-values with Fisher's Exact Test. | |
73 ### Removed | |
74 - Outdated `gecco embed` command. | |
75 - Unused `--truncate` flag from the `gecco train` CLI. | |
76 - Tigrfam domains, which were not improving performance on the new training data. | |
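Seeding the RNG before shuffling, as described for `gecco cv` and `gecco train` above, makes the row order reproducible between runs. A minimal sketch of the idea, assuming the standard `random` module; the `shuffled_rows` helper is hypothetical and not part of GECCO.

```python
import random

def shuffled_rows(rows, seed):
    """Return a shuffled copy of `rows`, reproducible for a given seed."""
    rng = random.Random(seed)   # dedicated generator, global state untouched
    shuffled = list(rows)       # copy so the caller's data is not modified
    rng.shuffle(shuffled)
    return shuffled

# The same seed gives the same order on every call.
first = shuffled_rows(range(10), seed=42)
second = shuffled_rows(range(10), seed=42)
```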
77 | |
78 ## [v0.7.0] - 2021-05-31 | |
79 [v0.7.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.3...v0.7.0 | |
80 ### Added | |
81 - Support for writing an AntiSMASH sideload JSON file after a `gecco run` workflow. | |
82 - Code for converting GenBank files to a BiG-SLiCE-compatible format with the `gecco convert` subcommand. | |
83 - Documentation about using GECCO in combination with AntiSMASH or BiG-SLiCE. | |
84 ### Changed | |
85 - Minimum Biopython version to `v1.73` for compatibility with older bioinformatics tooling. | |
86 - Internal domain composition shipped in `gecco.types`, replaced with a newer composition array obtained directly from MIBiG files. | |
87 ### Removed | |
88 - Outdated notice about `-vvv` verbosity level in the help message of the main `gecco` command. | |
89 | |
90 ## [v0.6.3] - 2021-05-10 | |
91 [v0.6.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.2...v0.6.3 | |
92 ### Fixed | |
93 - HMMER annotation not properly handling inputs with multiple contigs. | |
94 - Some progress bar totals displaying as floats in the CLI. | |
95 ### Changed | |
96 - `PyHMMER` now sets the `Z` and `domZ` values from the number of proteins given to the search pipeline. | |
97 - `gecco.cli` delegates imports to make CLI more responsive. | |
98 - `pkg_resources` has been replaced with `importlib.resources` and `importlib.metadata` where applicable. | |
99 - `multiprocessing.cpu_count` has been replaced with `os.cpu_count` where applicable. | |
100 | |
101 ## [v0.6.2] - 2021-05-04 | |
102 [v0.6.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.1...v0.6.2 | |
103 ### Fixed | |
104 - `gecco cv loto` crashing because of outdated code. | |
105 ### Changed | |
106 - Logging-style prompt will only display if GECCO is running with `-vv` flag. | |
107 ### Added | |
108 - GECCO bioRxiv paper reference to `Cluster.to_seq_record` output record. | |
109 | |
110 ## [v0.6.1] - 2021-03-15 | |
111 [v0.6.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.6.0...v0.6.1 | |
112 ### Fixed | |
113 - Progress bar not being disabled by `-q` flag in CLI. | |
114 - Fallback to using HMM name if accession is not available in `PyHMMER`. | |
115 - Group genes by source contig and process them separately in `PyHMMER` to avoid bogus E-values. | |
116 ### Added | |
117 - `psutil` dependency to get the number of physical CPU cores on the host machine. | |
118 - Support for using an arbitrary mapping of positives to negatives in `gecco embed`. | |
119 ### Removed | |
120 - Unused and outdated `HMMER` and `DomainRow` classes from `gecco.hmmer`. | |
121 | |
122 ## [v0.6.0] - 2021-02-28 | |
123 [v0.6.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.5...v0.6.0 | |
124 ### Changed | |
125 - Updated internal model with a cleaned-up version of the MIBiG-2.0 | |
126 Pfam-33.1/Tigrfam-15.0 embedding. | |
127 - Updated internal InterPro catalog. | |
128 ### Fixed | |
129 - Features not being grouped together in `gecco cv` and `gecco train` | |
130 when provided with a feature table where rows were not sorted by | |
131 protein IDs. | |
132 | |
133 ## [v0.5.5] - 2021-02-28 | |
134 [v0.5.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.4...v0.5.5 | |
135 ### Fixed | |
136 - `gecco cv` bug causing only the last fold to be written. | |
137 | |
138 ## [v0.5.4] - 2021-02-28 | |
139 [v0.5.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.3...v0.5.4 | |
140 ### Changed | |
141 - Replaced `verboselogs`, `coloredlogs` and `better-exceptions` with `rich`. | |
142 ### Removed | |
143 - `tqdm` training dependency. | |
144 ### Added | |
145 - `gecco annotate` command to produce a feature table from a genomic file. | |
146 - `gecco embed` to embed BGCs into non-BGC regions using feature tables. | |
147 | |
148 ## [v0.5.3] - 2021-02-21 | |
149 [v0.5.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.2...v0.5.3 | |
150 ### Fixed | |
151 - Coordinates of genes in output GenBank files. | |
152 - Potential issue with the number of CPUs in `PyHMMER.run`. | |
153 ### Changed | |
154 - Bump required `pyrodigal` version to `v0.4.2` to fix buffer overflow. | |
155 | |
156 ## [v0.5.2] - 2021-01-29 | |
157 [v0.5.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.1...v0.5.2 | |
158 ### Added | |
159 - Support for downloading HMM files directly from GitHub releases assets. | |
160 - Validation of filtered HMMs with MD5 checksum. | |
161 ### Fixed | |
162 - Invalid coordinates of protein domains in GenBank output files. | |
163 - `gecco.interpro` module not being added to wheel distribution. | |
164 ### Changed | |
165 - Bump required `pyhmmer` version to `v0.2.1`. | |
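Validating a downloaded file against an MD5 checksum, as mentioned for the filtered HMMs above, can be sketched as follows. The `md5_matches` helper and its parameters are assumptions for illustration only, not the GECCO implementation.

```python
import hashlib

def md5_matches(path, expected_hex, chunk_size=65536):
    """Check whether the MD5 digest of the file at `path` equals `expected_hex`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large HMM files are never held fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

A mismatch would typically mean the download was truncated or corrupted and should be retried.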
166 | |
167 ## [v0.5.1] - 2021-01-15 | |
168 [v0.5.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.5.0...v0.5.1 | |
169 ### Fixed | |
170 - `--hmm` flag being ignored in the `gecco run` command. | |
171 - `PyHMMER` using HMM names instead of accessions, causing issues with Pfam HMMs. | |
172 | |
173 ## [v0.5.0] - 2021-01-11 | |
174 [v0.5.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.5...v0.5.0 | |
175 ### Added | |
176 - Explicit support for Python 3.9. | |
177 ### Changed | |
178 - [`pyhmmer`](https://pypi.org/project/pyhmmer) is used to annotate protein sequences instead of HMMER3 binary `hmmsearch`. | |
179 - HMM files are stored in binary format to speed up parsing and reduce storage size. | |
180 - `tqdm` is now a *training*-only dependency. | |
181 - `gecco cv` now requires *training* dependencies. | |
182 | |
183 ## [v0.4.5] - 2020-11-23 | |
184 [v0.4.5]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.4...v0.4.5 | |
185 ### Added | |
186 - Additional `fold` column to cross-validation table output. | |
187 ### Changed | |
188 - Use sequence ID instead of protein ID to extract type from cluster in `gecco cv`. | |
189 - Install HMM data in pre-pressed format to make `hmmsearch` runs faster on short sequences. | |
190 - `gecco.orf` was rewritten to extract genes from input sequences in parallel. | |
191 | |
192 ## [v0.4.4] - 2020-09-30 | |
193 [v0.4.4]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.3...v0.4.4 | |
194 ### Added | |
195 - `gecco cv loto` command to run LOTO cross-validation using BGC types | |
196 for stratification. | |
197 - `header` keyword argument to `FeatureTable.dump` and `ClusterTable.dump` | |
198 to write the table without the column header, which allows appending to an | |
199 existing table. | |
200 - `__getitem__` implementation for `FeatureTable` and `ClusterTable` | |
201 that returns a single row or a sub-table from a table. | |
202 ### Fixed | |
203 - `gecco cv` command now writes results iteratively instead of holding | |
204 the tables for every fold in memory. | |
205 ### Changed | |
206 - Bumped `pandas` training dependency to `v1.0`. | |
207 | |
208 ## [v0.4.3] - 2020-09-07 | |
209 [v0.4.3]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.2...v0.4.3 | |
210 ### Fixed | |
211 - GenBank files being written with invalid `/cds` feature type. | |
212 ### Changed | |
213 - Blocked installation of Biopython `v1.78` or newer as it removes `Bio.Alphabet` | |
214 and breaks the current code. | |
215 | |
216 ## [v0.4.2] - 2020-08-07 | |
217 [v0.4.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.1...v0.4.2 | |
218 ### Fixed | |
219 - `TypeClassifier.predict_types` using inverse type probabilities when | |
220 given several clusters to process. | |
221 | |
222 ## [v0.4.1] - 2020-08-07 | |
223 [v0.4.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.4.0...v0.4.1 | |
224 ### Fixed | |
225 - `gecco run` command crashing on input sequences not containing any genes. | |
226 | |
227 ## [v0.4.0] - 2020-08-06 | |
228 [v0.4.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.3.0...v0.4.0 | |
229 ### Added | |
230 - `gecco.model.ProductType` enum to model the biosynthetic class of a BGC. | |
231 ### Removed | |
232 - `pandas` interaction from internal data model. | |
233 - `ClusterCRF` code specific to cross-validation. | |
234 ### Changed | |
235 - `pandas`, `fisher` and `statsmodels` dependencies are now optional. | |
236 - `gecco train` command expects a cluster table in addition to the feature | |
237 table to know the types of the input BGCs. | |
238 | |
239 ## [v0.3.0] - 2020-08-03 | |
240 [v0.3.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.2...v0.3.0 | |
241 ### Changed | |
242 - Replaced Nearest-Neighbours classifier with Random Forest to perform type | |
243 prediction for candidate BGCs. | |
244 - `gecco.knn` module was renamed to implementation-agnostic name `gecco.types`. | |
245 ### Fixed | |
246 - Extraction of domain composition taking a long time in `gecco train` command. | |
247 ### Removed | |
248 - `--metric` argument to the `gecco run` CLI command. | |
249 | |
250 ## [v0.2.2] - 2020-07-31 | |
251 [v0.2.2]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.1...v0.2.2 | |
252 ### Changed | |
253 - `Domain` and `Gene` can now carry qualifiers that are used when they | |
254 are translated to a sequence feature. | |
255 ### Added | |
256 - InterPro names, accessions, and HMMER e-value for each annotated domain | |
257 in GenBank output files. | |
258 | |
259 ## [v0.2.1] - 2020-07-23 | |
260 [v0.2.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.2.0...v0.2.1 | |
261 ### Fixed | |
262 - Various potential crashes in `ClusterRefiner` code. | |
263 ### Removed | |
264 - Unneeded feature dictionary filtering in `ClusterCRF` for models with | |
265 Fisher Exact Test feature selection. | |
266 | |
267 ## [v0.2.0] - 2020-07-23 | |
268 [v0.2.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.1...v0.2.0 | |
269 ### Fixed | |
270 - `pandas` warning about unsorted columns in `gecco run`. | |
271 ### Removed | |
272 - `Gene.probability` property, replaced by `Gene.maximum_probability` and | |
273 `Gene.average_probability` properties to be explicit. | |
274 ### Changed | |
275 - Internal model now uses `Pfam` and `Tigrfam` with the top 35% features | |
276 selected with Fisher's Exact Test. | |
277 - `ClusterRefiner` now removes genes on `Cluster` edges if they do not | |
278 contain any domain annotation. | |
279 | |
280 ## [v0.1.1] - 2020-07-22 | |
281 [v0.1.1]: https://git.embl.de/grp-zeller/GECCO/compare/v0.1.0...v0.1.1 | |
282 ### Added | |
283 - `ClusterCRF.predict_probabilities` to annotate a list of `Gene`. | |
284 ### Changed | |
285 - BGC probability is now stored at the `Domain` level instead of at the `Gene` | |
286 level, independently of the feature extraction level used by the CRF. | |
287 - `ClusterKNN` will use the model path provided to `gecco run` if any. | |
288 ### Docs | |
289 - Added this changelog file to document changes in the code. | |
290 - Added documentation to `gecco` submodules that were missing it. | |
291 - Included the `CHANGELOG.md` file in the generated docs. | |
292 | |
293 ## [v0.1.0] - 2020-07-17 | |
294 [v0.1.0]: https://git.embl.de/grp-zeller/GECCO/compare/v0.0.1...v0.1.0 | |
295 Initial release. | |
296 | |
297 ## [v0.0.1] - 2018-08-13 | |
298 [v0.0.1]: https://git.embl.de/grp-zeller/GECCO/compare/37afb97...v0.0.1 | |
299 Proof-of-concept. |