Mercurial > repos > peterjc > coverage_stats
changeset 3:57b3ea22aff3 draft
Uploaded v0.1.0 which was already on the Test Tool Shed. Included Python 3 support.
author | peterjc |
---|---|
date | Tue, 11 Aug 2020 18:23:05 -0400 |
parents | 7254ece0c0ff |
children | ed501717f6cd |
files | coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular coverage_stats-1b5523d3d2c2/test-data/ex1.bam coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml test-data/coverage_test.bam test-data/coverage_test.coverage_stats.tabular test-data/ex1.bam test-data/ex1.coverage_stats.md50.tabular test-data/ex1.coverage_stats.tabular tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.py tools/coverage_stats/coverage_stats.xml |
diffstat | 16 files changed, 581 insertions(+), 503 deletions(-) [+] |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular Tue Aug 11 18:23:05 2020 -0400 @@ -0,0 +1,7 @@ +#identifer length mapped placed min_cov max_cov mean_cov +ref 100 5 0 0 2 0.50 +none 200 0 0 0 0 0.00 +full 10 3 0 2 2 2.00 +head 10 1 0 0 1 0.50 +tail 20 1 0 0 1 0.25 +* 0 0 1 0 0 0.00
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular Tue Aug 11 18:23:05 2020 -0400 @@ -0,0 +1,4 @@ +#identifer length mapped placed min_cov max_cov mean_cov +chr1 1575 1446 18 0 50 32.06 +chr2 1584 1789 17 0 50 38.70 +* 0 0 0 0 0 0.00
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular Tue Aug 11 18:23:05 2020 -0400 @@ -0,0 +1,4 @@ +#identifer length mapped placed min_cov max_cov mean_cov +chr1 1575 1446 18 0 63 32.32 +chr2 1584 1789 17 0 64 39.78 +* 0 0 0 0 0 0.00
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst Tue Aug 11 18:23:05 2020 -0400 @@ -0,0 +1,124 @@ +BAM coverage statistics using samtools idxstats and depth +========================================================= + +This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute +(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. +See the licence text below. + +Internally this tool uses the command-line samtools suite. + +This tool is available from the Galaxy Tool Shed at: +http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + + +Automated Installation +====================== + +This should be straightforward, provided your Galaxy is setup to use Conda and +BioConda for dependencies - Galaxy should then automatically download and +install the expected version of samtools. + + +Manual Installation +=================== + +This expects samtools to be on the ``$PATH``, and was tested using v1.3.1. + +To install the wrapper copy or move the following files under the Galaxy tools +folder, e.g. in a ``tools/coverage_stats`` folder: + +* ``coverage_stats.xml`` (the Galaxy tool definition) +* ``coverage_stats.py`` (the Python wrapper script) +* ``README.rst`` (this file) + +You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer +the tool. Just add the line, perhaps under the NGS tools section:: + + <tool file="coverage_stats/coverage_stats.xml" /> + +If you wish to run the unit tests, also move/copy the ``test-data/`` files +under Galaxy's ``test-data/`` folder. Then:: + + $ ./run_tests.sh -id coverage_stats + +That's it. + + +History +======= + +======= ====================================================================== +Version Changes +------- ---------------------------------------------------------------------- +v0.0.1 - Initial public release +v0.0.2 - Cope with samtools' default depth limit using modified samtools, + see https://github.com/samtools/samtools/pull/322 +v0.0.3 - Cope with no coverage in final contigs. +v0.0.4 - Reorder XML elements (internal change only). + - Planemo for Tool Shed upload (``.shed.yml``, internal change only). +v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3 + - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a + legacy Tool Shed package. + - Apply the new maximum depth parameter within the script to ensure + excess coverage is clear by getting the max coverage equal to the + max depth setting (the raw output from samtools is more fuzzy). + - Report total length and overall mean coverage in stdout. + - Use ``<command detect_errors="aggressive">`` (internal change only). + - Single quote command line arguments (internal change only). +v0.0.6 - Python 3 compatibility fix. +v0.1.0 - Provide proper command line API for the Python script. +======= ====================================================================== + + +Developers +========== + +Development is on this GitHub repository: +https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats + +For pushing a release to the test or main "Galaxy Tool Shed", use the following +Planemo commands (which requires you have set your Tool Shed access details in +``~/.planemo.yml`` and that you have access rights on the Tool Shed):: + + $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/ + ... + +or:: + + $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/ + ... + +To just build and check the tar ball, use:: + + $ planemo shed_upload --tar_only tools/coverage_stats/ + ... + $ tar -tzf shed_upload.tar.gz + tools/coverage_stats/README.rst + tools/coverage_stats/coverage_stats.xml + tools/coverage_stats/coverage_stats.py + ... + + +Licence (MIT) +============= + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. + +NOTE: This is the licence for the Galaxy Wrapper only. +samtools is available and licenced separately.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py Tue Aug 11 18:23:05 2020 -0400 @@ -0,0 +1,323 @@ +#!/usr/bin/env python +"""BAM coverage statistics using samtools idxstats and depth. + +This script takes exactly three command line arguments: + * Input BAM filename + * Input BAI filename (via Galaxy metadata) + * Output tabular filename + +Optional fourth argument: + * Max coverage depth (integer) + +This messes about with the filenames to make samtools happy, then +runs "samtools idxstats" and "samtools depth", captures and combines +the output to the desired output tabular file. + +Because "samtools depth" treats the max depth a little fuzzily, this +tool tries to account for this and applies a clear max-depth cut off. +""" + +import os +import subprocess +import sys +import tempfile + +from optparse import OptionParser + +usage = """Example usage: + +$ python coverage_stats.py -b mapped.bam -i mapped.bai -o summary.tsv -d 10000 +""" + +parser = OptionParser(usage=usage) +parser.add_option( + "-b", "--bam", dest="input_bam", default=None, help="Input BAM file", metavar="FILE" +) +parser.add_option( + "-i", + "--bai", + dest="input_bai", + default="-", + help="Input BAI file (BAM index, optional, default or - infer from BAM filename)", + metavar="FILE", +) +parser.add_option( + "-o", + dest="output_tabular", + default="-", + help="Output tabular file (optional, default stdout)", + metavar="FILE", +) +parser.add_option( + "-d", + "--maxdepth", + dest="max_depth", + default=8000, + help="Max coverage depth (integer, default 8000)", + type="int", +) +parser.add_option( + "-v", + "--version", + dest="version", + default=False, + action="store_true", + help="Show version and quit", +) + +options, args = parser.parse_args() + +if options.version: + # Galaxy seems to invert the order of the two lines + print("BAM coverage statistics v0.1.0") + cmd = "samtools 2>&1 | grep -i ^Version" + sys.exit(os.system(cmd)) + +bam_filename = options.input_bam +bai_filename = options.input_bai +tabular_filename = options.output_tabular +max_depth = options.max_depth + +if args: + sys.exit("Sorry, the legacy API has been dropped.") + +if not bam_filename: + sys.exit("Input BAM filename is required.") +if not os.path.isfile(bam_filename): + sys.exit("Input BAM file not found: %s" % bam_filename) +if bai_filename == "-": + if os.path.isfile(bam_filename + ".bai"): + bai_filename = bam_filename + ".bai" +if not os.path.isfile(bai_filename): + if bai_filename == "None": + sys.exit("Error: Galaxy did not index your BAM file") + sys.exit("Input BAI file not found: %s" % bai_filename) + +try: + max_depth = int(max_depth) +except ValueError: + sys.exit("Bad argument for max depth: %r" % max_depth) +if max_depth < 0: + sys.exit("Bad argument for max depth: %r" % max_depth) + +# fuzz factor to ensure can reach max_depth, e.g. region with +# many reads having a deletion present could underestimate the +# coverage by capping the number of reads considered +depth_margin = 100 + +# Assign sensible names with real extensions, and setup symlinks: +tmp_dir = tempfile.mkdtemp() +bam_file = os.path.join(tmp_dir, "temp.bam") +bai_file = os.path.join(tmp_dir, "temp.bam.bai") +idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv") +depth_filename = os.path.join(tmp_dir, "depth.tsv") +os.symlink(os.path.abspath(bam_filename), bam_file) +os.symlink(os.path.abspath(bai_filename), bai_file) +assert os.path.isfile(bam_file), bam_file +assert os.path.isfile(bai_file), bai_file +assert os.path.isfile(bam_file + ".bai"), bam_file + + +def clean_up(): + """Remove our temporary files and directory.""" + os.remove(idxstats_filename) + os.remove(depth_filename) + os.remove(bam_file) + os.remove(bai_file) + os.rmdir(tmp_dir) + + +def samtools_depth_opt_available(): + """Determine if samtools depth supports maximum coverage argument.""" + child = subprocess.Popen( + ["samtools", "depth"], + universal_newlines=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + ) + # Combined stdout/stderr in case samtools is ever inconsistent + output, tmp = child.communicate() + assert tmp is None + # Expect to find this line in the help text, exact wording could change: + # -d/-m <int> maximum coverage depth [8000] + return " -d/-m " in output + + +depth_hack = False +if not samtools_depth_opt_available(): + if max_depth + depth_margin <= 8000: + sys.stderr.write( + "WARNING: The version of samtools depth installed does not " + "support the -d option, however, the requested max-depth " + "is safely under the default of 8000.\n" + ) + depth_hack = True + else: + sys.exit( + "The version of samtools depth installed does not support the -d option." + ) + +# Run samtools idxstats: +cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename) +return_code = os.system(cmd) +if return_code: + clean_up() + sys.exit("Return code %i from command:\n%s" % (return_code, cmd)) + +# Run samtools depth: +# TODO - Parse stdout instead? +if depth_hack: + # Using an old samtools without the -d option, but hard coded default + # of 8000 should be fine even allowing a margin for fuzzy output + cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename) +else: + cmd = 'samtools depth -d %i "%s" > "%s"' % ( + max_depth + depth_margin, + bam_file, + depth_filename, + ) +return_code = os.system(cmd) +if return_code: + clean_up() + sys.exit("Return code %i from command:\n%s" % (return_code, cmd)) + + +def load_total_coverage(depth_handle, identifier, length): + """Parse some of the 'samtools depth' output for coverages. + + Returns min_cov (int), max_cov (int) and mean cov (float). + + Uses global variables to cache the first line of output from the + next reference sequence. + """ + global depth_ref, depth_pos, depth_reads + + # print("====") + # print("%s coverage calculation, length %i, ..." % (identifier, length)) + + if depth_ref is None: + # Right at start of file / new contig + line = depth_handle.readline() + # Are we at the end of the file? + if not line: + # Must be at the end of the file. + # This can happen if the file contig(s) had no reads mapped + return 0, 0, 0.0, 0 + depth_ref, depth_pos, depth_reads = line.rstrip("\n").split() + depth_pos = int(depth_pos) + depth_reads = min(max_depth, int(depth_reads)) + # Can now treat as later references where first line cached + elif identifier != depth_ref: + # Infer that identifier had coverage zero, + # and so was not in the 'samtools depth' + # output. + # print("%s appears to have no coverage at all" % identifier) + return 0, 0, 0.0, 0 + + # Good, at start of expected reference + bases = depth_reads + if depth_pos == 1: + min_cov = depth_reads + else: + # print("%s has no coverage at start" % identifier) + min_cov = 0 + max_cov = depth_reads + + last_pos = depth_pos + depth_ref = None + depth_pos = 0 + depth_reads = 0 + for line in depth_handle: + ref, pos, depth = line.rstrip("\n").split() + pos = int(pos) + depth = min(max_depth, int(depth)) + if ref != identifier: + # Reached the end of this identifier's coverage + # so cache this ready for next identifier + depth_ref, depth_pos, depth_reads = ref, pos, depth + break + bases += depth + if last_pos + 1 < pos: + # print("%s has no coverage between %i and %i" + # % (identifier, last_pos, pos)) + min_cov = 0 + else: + min_cov = min(min_cov, depth) + max_cov = max(max_cov, depth) + last_pos = pos + + # Reach the end of this identifier's coverage or end of file + if last_pos < length: + # print("%s has no coverage at end" % identifier) + min_cov = 0 + mean_cov = bases / float(length) + return min_cov, max_cov, mean_cov, bases + + +# Parse and combine the output +if tabular_filename == "-": + out_handle = sys.stdout +else: + out_handle = open(tabular_filename, "w") +out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n") + +idxstats_handle = open(idxstats_filename) +depth_handle = open(depth_filename) + +depth_ref = None +depth_pos = 0 +depth_reads = 0 + +global_bases = 0 +global_length = 0 +for line in idxstats_handle: + identifier, length, mapped, placed = line.rstrip("\n").split() + length = int(length) + mapped = int(mapped) + placed = int(placed) + if identifier == "*": + min_cov = 0 + max_cov = 0 + mean_cov = 0.0 + bases = 0 + else: + min_cov, max_cov, mean_cov, bases = load_total_coverage( + depth_handle, identifier, length + ) + if max_cov > max_depth: + sys.exit( + "Using max depth %i yet saw max coverage %i for %s" + % (max_depth, max_cov, identifier) + ) + out_handle.write( + "%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n" + % (identifier, length, mapped, placed, min_cov, max_cov, mean_cov) + ) + if not (min_cov <= mean_cov <= max_cov): + out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n") + idxstats_handle.close() + depth_handle.close() + out_handle.close() + clean_up() + sys.exit( + "Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov" + % identifier + ) + global_length += length + global_bases += bases + +idxstats_handle.close() +depth_handle.close() +if tabular_filename != "-": + out_handle.close() + +print( + "Total reference length %i with overall mean coverage %0.2f" + % (global_length, float(global_bases) / global_length) +) + +# Remove the temp symlinks and files: +clean_up() + +if depth_ref is not None: + sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml Tue Aug 11 18:23:05 2020 -0400 @@ -0,0 +1,119 @@ +<tool id="coverage_stats" name="BAM coverage statistics" version="0.1.0"> + <description>using samtools idxstats and depth</description> + <requirements> + <requirement type="package" version="1.4.1">samtools</requirement> + </requirements> + <version_command> +python $__tool_directory__/coverage_stats.py --version + </version_command> + <command detect_errors="aggressive"> +python $__tool_directory__/coverage_stats.py +-b '$input_bam' +-i '${input_bam.metadata.bam_index}' +-o '$out_tabular' +-d '$max_depth' + </command> + <inputs> + <param name="input_bam" type="data" format="bam" label="Input BAM file" /> + <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" /> + </inputs> + <outputs> + <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" /> + </outputs> + <tests> + <test> + <param name="input_bam" value="ex1.bam" ftype="bam" /> + <param name="max_depth" value="123" /> + <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" /> + </test> + <test> + <param name="input_bam" value="ex1.bam" ftype="bam" /> + <param name="max_depth" value="50" /> + <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" /> + </test> + <test> + <param name="input_bam" value="coverage_test.bam" ftype="bam" /> + <param name="max_depth" value="123" /> + <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" /> + </test> + </tests> + <help> +**What it does** + +This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the +SAMtools toolkit, and parses their output to produce a consise summary of the +coverage information for each reference sequence. + +Input is a sorted and indexed BAM file, the output is tabular. The first four +columns match the output from ``samtools idxstats``, the additional columns are +calculated from the ``samtools depth`` output. The final row with a star as the +reference identifier represents unmapped reads, and will have zeros in every +column except columns one and four. + +====== ================================================================================= +Column Description +------ --------------------------------------------------------------------------------- + 1 Reference sequence identifier + 2 Reference sequence length + 3 Number of mapped reads + 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads) + 5 Minimum coverage (per base of reference) + 6 Maximum coverage (per base of reference) + 7 Mean coverage (given to 2 dp) +====== ================================================================================= + +Example output from a *de novo* assembly: + +========== ====== ====== ====== ======= ======= ======== +identiifer length mapped placed min_cov max_cov mean_cov +---------- ------ ------ ------ ------- ------- -------- +contig_1 833604 436112 0 1 157 71.95 +contig_2 14820 9954 0 1 152 91.27 +contig_3 272099 142958 0 1 150 72.31 +contig_4 135519 73288 0 1 149 75.23 +contig_5 91245 46759 0 1 157 70.92 +contig_6 175604 95744 0 1 146 75.99 +contig_7 90586 48158 0 1 151 72.93 +contig_9 234347 126458 0 1 159 75.40 +contig_10 121515 60211 0 1 152 68.12 +... ... ... ... ... ... ... +contig_604 712 85 0 1 49 21.97 +\* 0 0 950320 0 0 0.00 +========== ====== ====== ====== ======= ======= ======== + +In this example there were 604 contigs, each with one line in the output table, +plus the final row (labelled with an asterisk) representing 950320 unmapped reads. +In this BAM file, the fourth column was otherwise zero. + +.. class:: warningmark + +**Note**. If using this on a mapping BAM file, beware that the coverage counting is +done per base of the reference. This means if your reference has any extra bases +compared to the reads being mapped, those bases will be skipped by CIGAR D operators +and these "extra" bases can have an extremely low coverage, giving a potentially +misleading ``min_cov`` values. A sliding window coverage may be more appropriate. + +**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the +pileup routine, meaning the reported coverage from ``samtools depth`` would show +maximum coverage depths *around* 8000. This is now a run time option. + + +**Citation** + +If you use this Galaxy tool in work leading to a scientific publication please +cite: + +Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools. +Bioinformatics 25(16), 2078-9. +https://doi.org/10.1093/bioinformatics/btp352 + +Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth. +http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + </help> + <citations> + <citation type="doi">10.1093/bioinformatics/btp352</citation> + </citations> +</tool>
--- a/test-data/coverage_test.coverage_stats.tabular Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,7 +0,0 @@ -#identifer length mapped placed min_cov max_cov mean_cov -ref 100 5 0 0 2 0.50 -none 200 0 0 0 0 0.00 -full 10 3 0 2 2 2.00 -head 10 1 0 0 1 0.50 -tail 20 1 0 0 1 0.25 -* 0 0 1 0 0 0.00
--- a/test-data/ex1.coverage_stats.md50.tabular Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ -#identifer length mapped placed min_cov max_cov mean_cov -chr1 1575 1446 18 0 50 32.06 -chr2 1584 1789 17 0 50 38.70 -* 0 0 0 0 0 0.00
--- a/test-data/ex1.coverage_stats.tabular Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ -#identifer length mapped placed min_cov max_cov mean_cov -chr1 1575 1446 18 0 63 32.32 -chr2 1584 1789 17 0 64 39.78 -* 0 0 0 0 0 0.00
--- a/tools/coverage_stats/README.rst Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,122 +0,0 @@ -BAM coverage statistics using samtools idxstats and depth -========================================================= - -This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute -(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. -See the licence text below. - -Internally this tool uses the command-line samtools suite. - -This tool is available from the Galaxy Tool Shed at: -http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats - - -Automated Installation -====================== - -This should be straightforward, provided your Galaxy is setup to use Conda and -BioConda for dependencies - Galaxy should then automatically download and -install the expected version of samtools. - - -Manual Installation -=================== - -This expects samtools to be on the ``$PATH``, and was tested using v1.3.1. - -To install the wrapper copy or move the following files under the Galaxy tools -folder, e.g. in a ``tools/coverage_stats`` folder: - -* ``coverage_stats.xml`` (the Galaxy tool definition) -* ``coverage_stats.py`` (the Python wrapper script) -* ``README.rst`` (this file) - -You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer -the tool. Just add the line, perhaps under the NGS tools section:: - - <tool file="coverage_stats/coverage_stats.xml" /> - -If you wish to run the unit tests, also move/copy the ``test-data/`` files -under Galaxy's ``test-data/`` folder. Then:: - - $ ./run_tests.sh -id coverage_stats - -That's it. - - -History -======= - -======= ====================================================================== -Version Changes -------- ---------------------------------------------------------------------- -v0.0.1 - Initial public release -v0.0.2 - Cope with samtools' default depth limit using modified samtools, - see https://github.com/samtools/samtools/pull/322 -v0.0.3 - Cope with no coverage in final contigs. -v0.0.4 - Reorder XML elements (internal change only). - - Planemo for Tool Shed upload (``.shed.yml``, internal change only). -v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3 - - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a - legacy Tool Shed package. - - Apply the new maximum depth parameter within the script to ensure - excess coverage is clear by getting the max coverage equal to the - max depth setting (the raw output from samtools is more fuzzy). - - Report total length and overall mean coverage in stdout. - - Use ``<command detect_errors="aggressive">`` (internal change only). - - Single quote command line arguments (internal change only). -======= ====================================================================== - - -Developers -========== - -Development is on this GitHub repository: -https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats - -For pushing a release to the test or main "Galaxy Tool Shed", use the following -Planemo commands (which requires you have set your Tool Shed access details in -``~/.planemo.yml`` and that you have access rights on the Tool Shed):: - - $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/ - ... - -or:: - - $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/ - ... - -To just build and check the tar ball, use:: - - $ planemo shed_upload --tar_only tools/coverage_stats/ - ... - $ tar -tzf shed_upload.tar.gz - tools/coverage_stats/README.rst - tools/coverage_stats/coverage_stats.xml - tools/coverage_stats/coverage_stats.py - ... - - -Licence (MIT) -============= - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in -all copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -THE SOFTWARE. - -NOTE: This is the licence for the Galaxy Wrapper only. -samtools is available and licenced separately.
--- a/tools/coverage_stats/coverage_stats.py Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,251 +0,0 @@ -#!/usr/bin/env python -"""BAM coverage statistics using samtools idxstats and depth. - -This script takes exactly three command line arguments: - * Input BAM filename - * Input BAI filename (via Galaxy metadata) - * Output tabular filename - -Optional fourth argument: - * Max coverage depth (integer) - -This messes about with the filenames to make samtools happy, then -runs "samtools idxstats" and "samtools depth", captures and combines -the output to the desired output tabular file. - -Because "samtools depth" treats the max depth a little fuzzily, this -tool tries to account for this and applies a clear max-depth cut off. -""" - -import os -import subprocess -import sys -import tempfile - -if "-v" in sys.argv or "--version" in sys.argv: - # Galaxy seems to invert the order of the two lines - print("BAM coverage statistics v0.0.5") - cmd = "samtools 2>&1 | grep -i ^Version" - sys.exit(os.system(cmd)) - -# TODO - Proper command line API -if len(sys.argv) == 4: - bam_filename, bai_filename, tabular_filename = sys.argv[1:] - max_depth = "8000" -elif len(sys.argv) == 5: - bam_filename, bai_filename, tabular_filename, max_depth = sys.argv[1:] -else: - sys.exit("Require 3 or 4 arguments: BAM, BAI, tabular filename, [max depth]") - -if not os.path.isfile(bam_filename): - sys.exit("Input BAM file not found: %s" % bam_filename) -if bai_filename == "-": - # Make this optional for ease of use at the command line by hand: - if os.path.isfile(bam_filename + ".bai"): - bai_filename = bam_filename + ".bai" -if not os.path.isfile(bai_filename): - if bai_filename == "None": - sys.exit("Error: Galaxy did not index your BAM file") - sys.exit("Input BAI file not found: %s" % bai_filename) - -try: - max_depth = int(max_depth) -except ValueError: - sys.exit("Bad argument for max depth: %r" % max_depth) -if max_depth < 0: - sys.exit("Bad argument for max depth: %r" % max_depth) - -# fuzz factor to ensure can reach max_depth, e.g. region with -# many reads having a deletion present could underestimate the -# coverage by capping the number of reads considered -depth_margin = 100 - -# Assign sensible names with real extensions, and setup symlinks: -tmp_dir = tempfile.mkdtemp() -bam_file = os.path.join(tmp_dir, "temp.bam") -bai_file = os.path.join(tmp_dir, "temp.bam.bai") -idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv") -depth_filename = os.path.join(tmp_dir, "depth.tsv") -os.symlink(os.path.abspath(bam_filename), bam_file) -os.symlink(os.path.abspath(bai_filename), bai_file) -assert os.path.isfile(bam_file), bam_file -assert os.path.isfile(bai_file), bai_file -assert os.path.isfile(bam_file + ".bai"), bam_file - - -def clean_up(): - """Remove our temporary files and directory.""" - os.remove(idxstats_filename) - os.remove(depth_filename) - os.remove(bam_file) - os.remove(bai_file) - os.rmdir(tmp_dir) - - -def samtools_depth_opt_available(): - """Determine if samtools depth supports maximum coverage argument.""" - child = subprocess.Popen(["samtools", "depth"], - stdout=subprocess.PIPE, stderr=subprocess.STDOUT) - # Combined stdout/stderr in case samtools is ever inconsistent - output, tmp = child.communicate() - assert tmp is None - # Expect to find this line in the help text, exact wording could change: - # -d/-m <int> maximum coverage depth [8000] - return " -d/-m " in output - - -depth_hack = False -if not samtools_depth_opt_available(): - if max_depth + depth_margin <= 8000: - sys.stderr.write("WARNING: The version of samtools depth installed does not " - "support the -d option, however, the requested max-depth " - "is safely under the default of 8000.\n") - depth_hack = True - else: - sys.exit("The version of samtools depth installed does not support the -d option.") - -# Run samtools idxstats: -cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename) -return_code = os.system(cmd) -if return_code: - clean_up() - sys.exit("Return code %i from command:\n%s" % (return_code, cmd)) - -# Run samtools depth: -# TODO - Parse stdout instead? -if depth_hack: - # Using an old samtools without the -d option, but hard coded default - # of 8000 should be fine even allowing a margin for fuzzy output - cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename) -else: - cmd = 'samtools depth -d %i "%s" > "%s"' % (max_depth + depth_margin, bam_file, depth_filename) -return_code = os.system(cmd) -if return_code: - clean_up() - sys.exit("Return code %i from command:\n%s" % (return_code, cmd)) - - -def load_total_coverage(depth_handle, identifier, length): - """Parse some of the 'samtools depth' output for coverages. - - Returns min_cov (int), max_cov (int) and mean cov (float). - - Uses global variables to cache the first line of output from the - next reference sequence. - """ - global depth_ref, depth_pos, depth_reads - - # print("====") - # print("%s coverage calculation, length %i, ..." % (identifier, length)) - - if depth_ref is None: - # Right at start of file / new contig - line = depth_handle.readline() - # Are we at the end of the file? - if not line: - # Must be at the end of the file. - # This can happen if the file contig(s) had no reads mapped - return 0, 0, 0.0, 0 - depth_ref, depth_pos, depth_reads = line.rstrip("\n").split() - depth_pos = int(depth_pos) - depth_reads = min(max_depth, int(depth_reads)) - # Can now treat as later references where first line cached - elif identifier != depth_ref: - # Infer that identifier had coverage zero, - # and so was not in the 'samtools depth' - # output. - # print("%s appears to have no coverage at all" % identifier) - return 0, 0, 0.0, 0 - - # Good, at start of expected reference - bases = depth_reads - if depth_pos == 1: - min_cov = depth_reads - else: - # print("%s has no coverage at start" % identifier) - min_cov = 0 - max_cov = depth_reads - - last_pos = depth_pos - depth_ref = None - depth_pos = 0 - depth_reads = 0 - for line in depth_handle: - ref, pos, depth = line.rstrip("\n").split() - pos = int(pos) - depth = min(max_depth, int(depth)) - if ref != identifier: - # Reached the end of this identifier's coverage - # so cache this ready for next identifier - depth_ref, depth_pos, depth_reads = ref, pos, depth - break - bases += depth - if last_pos + 1 < pos: - # print("%s has no coverage between %i and %i" % (identifier, last_pos, pos)) - min_cov = 0 - else: - min_cov = min(min_cov, depth) - max_cov = max(max_cov, depth) - last_pos = pos - - # Reach the end of this identifier's coverage or end of file - if last_pos < length: - # print("%s has no coverage at end" % identifier) - min_cov = 0 - mean_cov = bases / float(length) - return min_cov, max_cov, mean_cov, bases - - -# Parse and combine the output -out_handle = open(tabular_filename, "w") -out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n") - -idxstats_handle = open(idxstats_filename) -depth_handle = open(depth_filename) - -depth_ref = None -depth_pos = 0 -depth_reads = 0 - -global_bases = 0 -global_length = 0 -for line in idxstats_handle: - identifier, length, mapped, placed = line.rstrip("\n").split() - length = int(length) - mapped = int(mapped) - placed = int(placed) - if identifier == "*": - min_cov = 0 - max_cov = 0 - mean_cov = 0.0 - bases = 0 - else: - min_cov, max_cov, mean_cov, bases = load_total_coverage(depth_handle, identifier, length) - if max_cov > max_depth: - sys.exit("Using max depth %i yet saw max coverage %i for %s" - % (max_depth, max_cov, identifier)) - out_handle.write("%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n" - % (identifier, length, mapped, placed, - min_cov, max_cov, mean_cov)) - if not (min_cov <= mean_cov <= max_cov): - out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n") - idxstats_handle.close() - depth_handle.close() - out_handle.close() - clean_up() - sys.exit("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov" - % identifier) - global_length += length - global_bases += bases - -idxstats_handle.close() -depth_handle.close() -out_handle.close() - -print("Total reference length %i with overall mean coverage %0.2f" % (global_length, float(global_bases) / global_length)) - -# Remove the temp symlinks and files: -clean_up() - -if depth_ref is not None: - sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
--- a/tools/coverage_stats/coverage_stats.xml Thu May 11 12:16:10 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,115 +0,0 @@ -<tool id="coverage_stats" name="BAM coverage statistics" version="0.0.5"> - <description>using samtools idxstats and depth</description> - <requirements> - <requirement type="package" version="1.4.1">samtools</requirement> - </requirements> - <version_command> -python $__tool_directory__/coverage_stats.py --version - </version_command> - <command detect_errors="aggressive"> -python $__tool_directory__/coverage_stats.py '$input_bam' '${input_bam.metadata.bam_index}' '$out_tabular' '$max_depth' - </command> - <inputs> - <param name="input_bam" type="data" format="bam" label="Input BAM file" /> - <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" /> - </inputs> - <outputs> - <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" /> - </outputs> - <tests> - <test> - <param name="input_bam" value="ex1.bam" ftype="bam" /> - <param name="max_depth" value="123" /> - <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" /> - </test> - <test> - <param name="input_bam" value="ex1.bam" ftype="bam" /> - <param name="max_depth" value="50" /> - <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" /> - </test> - <test> - <param name="input_bam" value="coverage_test.bam" ftype="bam" /> - <param name="max_depth" value="123" /> - <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" /> - </test> - </tests> - <help> -**What it does** - -This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the -SAMtools toolkit, and parses their output to produce a consise summary of the -coverage information for each reference sequence. - -Input is a sorted and indexed BAM file, the output is tabular. The first four -columns match the output from ``samtools idxstats``, the additional columns are -calculated from the ``samtools depth`` output. The final row with a star as the -reference identifier represents unmapped reads, and will have zeros in every -column except columns one and four. - -====== ================================================================================= -Column Description ------- --------------------------------------------------------------------------------- - 1 Reference sequence identifier - 2 Reference sequence length - 3 Number of mapped reads - 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads) - 5 Minimum coverage (per base of reference) - 6 Maximum coverage (per base of reference) - 7 Mean coverage (given to 2 dp) -====== ================================================================================= - -Example output from a *de novo* assembly: - -========== ====== ====== ====== ======= ======= ======== -identiifer length mapped placed min_cov max_cov mean_cov ----------- ------ ------ ------ ------- ------- -------- -contig_1 833604 436112 0 1 157 71.95 -contig_2 14820 9954 0 1 152 91.27 -contig_3 272099 142958 0 1 150 72.31 -contig_4 135519 73288 0 1 149 75.23 -contig_5 91245 46759 0 1 157 70.92 -contig_6 175604 95744 0 1 146 75.99 -contig_7 90586 48158 0 1 151 72.93 -contig_9 234347 126458 0 1 159 75.40 -contig_10 121515 60211 0 1 152 68.12 -... ... ... ... ... ... ... -contig_604 712 85 0 1 49 21.97 -\* 0 0 950320 0 0 0.00 -========== ====== ====== ====== ======= ======= ======== - -In this example there were 604 contigs, each with one line in the output table, -plus the final row (labelled with an asterisk) representing 950320 unmapped reads. -In this BAM file, the fourth column was otherwise zero. - -.. class:: warningmark - -**Note**. If using this on a mapping BAM file, beware that the coverage counting is -done per base of the reference. This means if your reference has any extra bases -compared to the reads being mapped, those bases will be skipped by CIGAR D operators -and these "extra" bases can have an extremely low coverage, giving a potentially -misleading ``min_cov`` values. A sliding window coverage may be more appropriate. - -**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the -pileup routine, meaning the reported coverage from ``samtools depth`` would show -maximum coverage depths *around* 8000. This is now a run time option. - - -**Citation** - -If you use this Galaxy tool in work leading to a scientific publication please -cite: - -Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools. -Bioinformatics 25(16), 2078-9. -http://dx.doi.org/10.1093/bioinformatics/btp352 - -Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth. -http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats - </help> - <citations> - <citation type="doi">10.1093/bioinformatics/btp352</citation> - </citations> -</tool>