diff COG/bac-genomics-scripts/genomes_feature_table/README.md @ 3:e42d30da7a74 draft

Uploaded
author dereeper
date Thu, 30 May 2024 11:52:25 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/COG/bac-genomics-scripts/genomes_feature_table/README.md	Thu May 30 11:52:25 2024 +0000
@@ -0,0 +1,143 @@
+genomes_feature_table
+=====================
+
+`genomes_feature_table.pl` is a script to create a feature table for genomes in EMBL and GENBANK format.
+
+* [Synopsis](#synopsis)
+* [Description](#description)
+* [Usage](#usage)
+* [Options](#options)
+* [Output](#output)
+* [Run environment](#run-environment)
+* [Dependencies](#dependencies)
+* [Author - contact](#author---contact)
+* [Citation, installation, and license](#citation-installation-and-license)
+* [Changelog](#changelog)
+
+## Synopsis
+
+    perl genomes_feature_table.pl path/to/genome_dir > feature_table.tsv
+
+## Description
+
+A genome feature table lists basic stats/info (e.g. genome size, GC
+content, coding percentage, accession number(s)) and the numbers of
+annotated primary features (e.g. CDS, genes, RNAs) of genomes. It
+can be used to get an overview of these features across different
+genomes, e.g. in comparative genomics publications.
+
+`genomes_feature_table.pl` is designed to extract (or calculate)
+these basic stats and **all** annotated primary features from RichSeq
+files (**EMBL** or **GENBANK** format) in a specified directory (with the
+correct file extension, see option **-e**). The **default** directory
+is the current working directory. The primary features are
+counted and the results for each genome printed in tab-separated
+format. It is a requirement that each file contains **only one**
+genome (complete or draft, with or without plasmids).
+
+The most important features are listed first, such as genome
+description, genome size, GC content, coding percentage (calculated
+based on non-pseudo CDS annotation), CDS and gene numbers, accession
+number(s) (first..last in the sequence file), RNAs (rRNA, tRNA,
+tmRNA, ncRNA), and unresolved bases (IUPAC code 'N'). If plasmids are
+annotated in a sequence file, the number of plasmids is
+counted and listed as well (this requires a */plasmid="plasmid_name"* tag in the
+*source* primary tag, see e.g. GenBank accession number
+[CP009167](http://www.ncbi.nlm.nih.gov/nuccore/CP009167)). Use option **-p**
+to list plasmids as separate entries (lines) in the feature table.
+
+For draft genomes the number of contigs/scaffolds is counted. All
+contigs/scaffolds of draft genomes should be marked with the *WGS*
+keyword (see e.g. draft NCBI GenBank entry
+[JSAY00000000](http://www.ncbi.nlm.nih.gov/nuccore/JSAY00000000)). If this is
+not the case for your file(s), you can add the keyword to each
+sequence entry with the following Perl one-liners (they
+edit the files in place). For files in **GENBANK** format, if 'KEYWORDS    .' is present
+
+    perl -i -pe 's/^KEYWORDS(\s+)\./KEYWORDS$1WGS\./' file
+
+or if 'KEYWORDS' isn't present at all
+
+    perl -i -ne 'if(/^ACCESSION/){ print; print "KEYWORDS    WGS.\n";} else{ print;}' file
+
+For files in **EMBL** format, if 'KW   .' is present
+
+    perl -i -pe 's/^KW(\s+)\./KW$1WGS\./' file
+
+or if 'KW' isn't present at all
+
+    perl -i -ne 'if(/^DE/){ $dw=1; print;} elsif(/^XX/ && $dw){ print; $dw=0; print "KW   WGS.\n";} else{ print;}' file
+
+## Usage
+
+    perl genomes_feature_table.pl -p -e gb,gbk > feature_table_plasmids.tsv
+
+    perl genomes_feature_table.pl path/to/genome_dir/ -e gbf -e embl > feature_table.tsv
+
+## Options
+
+- -h, -help
+
+    Help (perldoc POD)
+
+- -e, -extensions
+
+    File extensions to include in the analysis (EMBL or GENBANK format),
+    either as a comma-separated list or as multiple occurrences of the option
+    [default = ebl,emb,embl,gb,gbf,gbff,gbank,gbk,genbank]
+
+- -p, -plasmids
+
+    Optionally list plasmids as extra entries in the feature table, if
+    they are annotated with a */plasmid="plasmid_name"* tag in the
+    *source* primary tag
+
+- -v, -version
+
+    Print version number to *STDERR*
+
+## Output
+
+- *STDOUT*
+
+    The resulting feature table is printed to *STDOUT*. Redirect or
+    pipe into another tool as needed (e.g. `cut`, `grep`, or `head`).
+
+## Run environment
+
+The Perl script runs under Windows and UNIX flavors.
+
+## Dependencies
+
+- [BioPerl](http://www.bioperl.org) (tested version 1.006923)
+
+## Author - contact
+
+Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
+
+## Citation, installation, and license
+
+For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
+
+## Changelog
+
+- v0.5 (14.09.2015)
+    - changed script name to `genomes_feature_table.pl`
+    - included a POD
+    - options with Getopt::Long
+    - included `pod2usage` with Pod::Usage
+    - major code overhaul with restructuring (removing code redundancy, print out without temp file etc.) and Perl syntax changes
+    - changed input options to get folder path from STDIN
+    - as a consequence new option **-e|-extensions**
+    - accession numbers not essential anymore, changed hash key to filename; but now requires only one genome per file
+    - draft genomes should include 'WGS' keyword (warning if not)
+    - option **-p|-plasmids** now works correctly with complete and draft genomes
+    - count plasmids without option **-p**
+- v0.4 (11.08.2013)
+    - included 'use autodie;' pragma
+    - included version switch
+- v0.3 (05.11.2012)
+    - new option **p** to report plasmid features in multi-sequence draft files separately
+- v0.2 (19.09.2012)
+- v0.1 (25.11.2011)
+    - **original** script name: `get_genome_features.pl`