diff COG/bac-genomics-scripts/order_fastx/README.md @ 3:e42d30da7a74 draft
Uploaded
| author | dereeper |
|---|---|
| date | Thu, 30 May 2024 11:52:25 +0000 |
| parents | |
| children | |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/COG/bac-genomics-scripts/order_fastx/README.md	Thu May 30 11:52:25 2024 +0000
@@ -0,0 +1,113 @@
+order_fastx
+===========
+
+`order_fastx.pl` is a script to order sequences in FASTA or FASTQ files.
+
+* [Synopsis](#synopsis)
+* [Description](#description)
+* [Usage](#usage)
+* [Options](#options)
+    * [Mandatory options](#mandatory-options)
+    * [Optional options](#optional-options)
+* [Output](#output)
+* [Run environment](#run-environment)
+* [Author - contact](#author---contact)
+* [Citation, installation, and license](#citation-installation-and-license)
+* [Changelog](#changelog)
+
+
+## Synopsis
+
+    perl order_fastx.pl -i infile.fasta -l order_id_list.txt > ordered.fasta
+
+## Description
+
+Order sequence entries in FASTA or FASTQ sequence files according to
+an ID list with a given order. Beware, the IDs in the order list
+have to be **identical** to the entire IDs in the sequence file.
+
+However, the ">" or "@" ID identifiers of FASTA or FASTQ files,
+respectively, can be omitted in the ID list.
+
+The file type is detected automatically. But, you can set the file
+type manually with option **-f**. FASTQ format assumes **four** lines
+per read, if this is not the case run the FASTQ file through
+[`fastx_fix.pl`](/fastx_fix) or use Heng Li's [`seqtk
+seq`](https://github.com/lh3/seqtk):
+
+    seqtk seq -l 0 infile.fq > outfile.fq
+
+The script can also be used to pull a subset of sequences in the ID
+list from the sequence file. Probably best to set option flag **-s**
+in this case, see [Optional options](#optional-options) below. But, rather use
+[`filter_fastx.pl`](/filter_fastx).
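The ordering step described in the Description can be sketched in a few lines. The Python snippet below is an illustration only and is not taken from the Perl code in `order_fastx.pl`: the function names `read_fasta` and `order_records` are hypothetical, it handles FASTA input only, and it assumes that an ID is the entire header line after the `>`.

```python
import io


def read_fasta(handle):
    """Return {full header line without '>': sequence} (hypothetical helper)."""
    records = {}
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None and line:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def order_records(records, id_list):
    """Return (ordered, missing) for the IDs in id_list.

    A leading '>' or '@' in the ID list is tolerated (assumption: real IDs
    do not themselves start with one of these characters). IDs without an
    exact match in the sequence file are collected in 'missing'.
    """
    ordered, missing = [], []
    for raw in id_list:
        seq_id = raw.strip()
        if not seq_id:
            continue
        if seq_id[0] in ">@":
            seq_id = seq_id[1:]
        if seq_id in records:
            ordered.append((seq_id, records[seq_id]))
        else:
            missing.append(seq_id)
    return ordered, missing


# Toy input: two FASTA records, requested in reverse order plus one unknown ID.
fasta = io.StringIO(">seq1 first\nACGT\nAC\n>seq2 second\nGGCC\n")
records = read_fasta(fasta)
ordered, missing = order_records(records, ["seq2 second", ">seq1 first", "seq3"])
for header, seq in ordered:
    print(f">{header}\n{seq}")
```

Because the lookup is an exact dictionary match on the whole header, an ID list entry such as `seq1` would not match the record `seq1 first`, which mirrors the requirement that IDs be identical to the entire IDs in the sequence file.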
+
+## Usage
+
+    perl order_fastx.pl -i infile.fq -l order_id_list.txt -s -f fastq > ordered.fq
+
+    perl order_fastx.pl -i infile.fasta -l order_id_list.txt -e > ordered.fasta
+
+## Options
+
+### Mandatory options
+
+- -i, -input
+
+    Input FASTA or FASTQ file
+
+- -l, -list
+
+    List with sequence IDs in specified order
+
+### Optional options
+
+- -h, -help
+
+    Help (perldoc POD)
+
+- -f, -file_type
+
+    Set the file type manually [fasta|fastq]
+
+- -e, -error_files
+
+    Write missing IDs in the seq file or the order ID list without an equivalent in the other to error files instead of *STDERR* (see [Output](#output) below)
+
+- -s, -skip_errors
+
+    Skip missing ID error statements, excludes option **-e**
+
+- -v, -version
+
+    Print version number to *STDERR*
+
+## Output
+
+- *STDOUT*
+
+    The newly ordered sequences are printed to *STDOUT*. Redirect or pipe into another tool as needed.
+
+- (order_ids_missing.txt)
+
+    If IDs in the order list are missing in the sequence file with option **-e**
+
+- (seq_ids_missing.txt)
+
+    If IDs in the sequence file are missing in the order ID list with option **-e**
+
+## Run environment
+
+The Perl script runs under Windows and UNIX flavors.
+
+## Author - contact
+
+Andreas Leimbach (aleimba[at]gmx[dot]de; Microbial Genome Plasticity, Institute of Hygiene, University of Muenster)
+
+## Citation, installation, and license
+
+For [citation](https://github.com/aleimba/bac-genomics-scripts#citation), [installation](https://github.com/aleimba/bac-genomics-scripts#installation-recommendations), and [license](https://github.com/aleimba/bac-genomics-scripts#license) information please see the repository main [*README.md*](https://github.com/aleimba/bac-genomics-scripts/blob/master/README.md).
+
+## Changelog
+
+- v0.1 (20.11.2014)
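The two optional error files listed under Output correspond to the two directions of an ID comparison between the order list and the sequence file. A minimal Python sketch of that comparison follows. It is an assumption about the logic, not code from the Perl script, and `missing_ids` is a hypothetical name; only the two file names in the comments come from the README.

```python
def missing_ids(seq_ids, order_ids):
    """Return (order_only, seq_only), each in its original input order.

    order_only: IDs in the order list with no match in the sequence file
                (what the README writes to order_ids_missing.txt with -e).
    seq_only:   IDs in the sequence file with no match in the order list
                (what the README writes to seq_ids_missing.txt with -e).
    """
    seq_set = set(seq_ids)
    order_set = set(order_ids)
    order_only = [i for i in order_ids if i not in seq_set]
    seq_only = [i for i in seq_ids if i not in order_set]
    return order_only, seq_only


# Toy example: the sequence file holds a, b, c; the order list asks for c, x, a.
order_only, seq_only = missing_ids(["a", "b", "c"], ["c", "x", "a"])
print("order IDs missing in sequence file:", order_only)
print("sequence IDs missing in order list:", seq_only)
```

When the script is used to pull a subset, every sequence left out of the ID list would land in the second group, which is presumably why the README suggests **-s** for that use.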