Mercurial > repos > dereeper > pangenome_explorer
diff COG/bac-genomics-scripts/genomes_feature_table/README.md @ 3:e42d30da7a74 draft
Uploaded
author | dereeper |
---|---|
date | Thu, 30 May 2024 11:52:25 +0000 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COG/bac-genomics-scripts/genomes_feature_table/README.md Thu May 30 11:52:25 2024 +0000 @@ -0,0 +1,143 @@ +genomes_feature_table +===================== + +`genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format. + +* [Synopsis](#synopsis) +* [Description](#description) +* [Usage](#usage) +* [Options](#options) +* [Output](#output) +* [Run environment](#run-environment) +* [Dependencies](#dependencies) +* [Author - contact](#author---contact) +* [Citation, installation, and license](#citation-installation-and-license) +* [Changelog](#changelog) + +## Synopsis + + perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv + +## Description + +A genome feature table lists basic stats/info (e.g. genome size, GC +content, coding percentage, accession number(s)) and the numbers of +annotated primary features (e.g. CDS, genes, RNAs) of genomes. It +can be used to have an overview of these features in different +genomes, e.g. in comparative genomics publications. + +`genomes_feature_table.pl` is designed to extract (or calculate) +these basic stats and **all** annotated primary features from RichSeq +files (**EMBL** or **GENBANK** format) in a specified directory (with the +correct file extension, see option **-e**). The **default** directory +is the current working directory. The primary features are +counted and the results for each genome printed in tab-separated +format. It is a requirement that each file contains **only one** +genome (complete or draft, with or without plasmids). + +The most important features will be listed first, like genome +description, genome size, GC content, coding percentage (calculated +based on non-pseudo CDS annotation), CDS and gene numbers, accession +number(s) (first..last in the sequence file), RNAs (rRNA, tRNA, +tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are +annotated in a sequence file, the number of plasmids are +counted and listed as well (needs a */plasmid="plasmid_name"* tag in the +*source* primary tag, see e.g. Genbank accession number +[CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p** +to list plasmids as separate entries (lines) in the feature table. + +For draft genomes the number of contigs/scaffolds are counted. All +contigs/scaffolds of draft genomes should be marked with the *WGS* +keyword (see e.g. draft NCBI Genbank entry +[JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is +not the case for your file(s) you can add those keywords to each +sequence entry with the following Perl one-liners (will +edit files in place). For files in **GENBANK** format if 'KEYWORDS .' is present + + perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file + +or if 'KEYWORDS' isn't present at all + + perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS WGS.\n";} else{ print;}' file + +For files in **EMBL** format if 'KW .' is present + + perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file + +or if 'KW' isn't present at all + + perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW WGS.\n";} else{ print;}' file + +## Usage + + perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv + + perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv + +## Options + +- -h, -help + + Help (perldoc POD) + +- -e, -extensions + + File extensions to include in the analysis (EMBL or GENBANK format), + either comma-separated list or multiple occurences of the option + [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank] + +- -p, -plasmids + + Optionally list plasmids as extra entries in the feature table, if + they are annotated with a */plasmid="plasmid_name"* tag in the + *source* primary tag + +- -v, -version + + Print version number to *STDERR* + +## Output + +- *STDOUT* + + The resulting feature table is printed to *STDOUT*. Redirect or + pipe into another tool as needed (e.g. `cut`, `grep`, or `head`). + +## Run environment + +The Perl script runs under Windows and UNIX flavors. + +## Dependencies + +- [BioPerl](http://www.bioperl.org) (tested version 1.006923) + +## Author - contact + +Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster) + +## Citation, installation, and license + +For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md). + +## Changelog + +- v0.5 (14.09.2015) + - changed script name to `genomes_feature_table.pl` + - included a POD + - options with Getopt::Long + - included `pod2usage` with Pod::Usage + - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes + - changed input options to get folder path from STDIN + - as a consequence new option **-e|-extensions** + - accession numbers not essential anymore, changed hash key to filename; but requires now only one genome per file + - draft genomes should include 'WGS' keyword (warning if not) + - option **-p|-plasmids** works now correctly with complete and draft genomes + - count plasmids without option **-p** +- v0.4 (11.08.2013) + - included 'use autodie;' pragma + - included version switch +- v0.3 (05.11.2012) + - new option **p** to report plasmid features in multi-sequence draft files separately +- v0.2 (19.09.2012) +- v0.1 (25.11.2011) + - **original** script name: `get_genome_features.pl`