Mercurial > repos > peterjc > coverage_stats
changeset 2:7254ece0c0ff draft
v0.0.5 - Supports max coverage depth in recent samtools. Expects samtools 1.4.1 via Conda, not via Tool Shed.
author | peterjc |
---|---|
date | Thu, 11 May 2017 12:16:10 -0400 |
parents | d1fdfaae5dbe |
children | 57b3ea22aff3 |
files | test-data/ex1.coverage_stats.md50.tabular tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.py tools/coverage_stats/coverage_stats.xml tools/coverage_stats/tool_dependencies.xml |
diffstat | 5 files changed, 161 insertions(+), 62 deletions(-) [+] |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/ex1.coverage_stats.md50.tabular Thu May 11 12:16:10 2017 -0400 @@ -0,0 +1,4 @@ +#identifer length mapped placed min_cov max_cov mean_cov +chr1 1575 1446 18 0 50 32.06 +chr2 1584 1789 17 0 50 38.70 +* 0 0 0 0 0 0.00
--- a/tools/coverage_stats/README.rst Fri Nov 21 09:43:58 2014 -0500 +++ b/tools/coverage_stats/README.rst Thu May 11 12:16:10 2017 -0400 @@ -1,7 +1,7 @@ BAM coverage statistics using samtools idxstats and depth ========================================================= -This tool is copyright 2014 by Peter Cock, The James Hutton Institute +This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute (formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. See the licence text below. @@ -14,14 +14,15 @@ Automated Installation ====================== -This should be straightforward, Galaxy should automatically download and install -samtools 0.1.19 if required. +This should be straightforward, provided your Galaxy is setup to use Conda and +BioConda for dependencies - Galaxy should then automatically download and +install the expected version of samtools. Manual Installation =================== -This expects samtools to be on the ``$PATH``, and was tested using v0.1.19. +This expects samtools to be on the ``$PATH``, and was tested using v1.3.1. To install the wrapper copy or move the following files under the Galaxy tools folder, e.g. in a ``tools/coverage_stats`` folder: @@ -50,6 +51,20 @@ Version Changes ------- ---------------------------------------------------------------------- v0.0.1 - Initial public release +v0.0.2 - Cope with samtools' default depth limit using modified samtools, + see https://github.com/samtools/samtools/pull/322 +v0.0.3 - Cope with no coverage in final contigs. +v0.0.4 - Reorder XML elements (internal change only). + - Planemo for Tool Shed upload (``.shed.yml``, internal change only). +v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3 + - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a + legacy Tool Shed package. + - Apply the new maximum depth parameter within the script to ensure + excess coverage is clear by getting the max coverage equal to the + max depth setting (the raw output from samtools is more fuzzy). + - Report total length and overall mean coverage in stdout. + - Use ``<command detect_errors="aggressive">`` (internal change only). + - Single quote command line arguments (internal change only). ======= ====================================================================== @@ -59,22 +74,27 @@ Development is on this GitHub repository: https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats -For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use -the following command from the Galaxy root folder:: +For pushing a release to the test or main "Galaxy Tool Shed", use the following +Planemo commands (which requires you have set your Tool Shed access details in +``~/.planemo.yml`` and that you have access rights on the Tool Shed):: + + $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/ + ... + +or:: - $ tar -czf coverage_stats.tar.gz tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.xml tools/coverage_stats/coverage_stats.py tools/coverage_stats/tool_dependencies.xml test-data/ex1.bam test-data/ex1.coverage_stats.tabular test-data/coverage_test.bam test-data/coverage_test.coverage_stats.tabular + $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/ + ... -Check this worked:: +To just build and check the tar ball, use:: - $ tar -tzf coverage_stats.tar.gz + $ planemo shed_upload --tar_only tools/coverage_stats/ + ... + $ tar -tzf shed_upload.tar.gz tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.xml tools/coverage_stats/coverage_stats.py - tools/coverage_stats/tool_dependencies.xml - test-data/ex1.bam - test-data/ex1.coverage_stats.tabular - test-data/coverage_test.bam - test-data/coverage_test.coverage_stats.tabular + ... Licence (MIT)
--- a/tools/coverage_stats/coverage_stats.py Fri Nov 21 09:43:58 2014 -0500 +++ b/tools/coverage_stats/coverage_stats.py Thu May 11 12:16:10 2017 -0400 @@ -6,39 +6,61 @@ * Input BAI filename (via Galaxy metadata) * Output tabular filename +Optional fourth argument: + * Max coverage depth (integer) + This messes about with the filenames to make samtools happy, then runs "samtools idxstats" and "samtools depth", captures and combines the output to the desired output tabular file. + +Because "samtools depth" treats the max depth a little fuzzily, this +tool tries to account for this and applies a clear max-depth cut off. """ -import sys + import os import subprocess +import sys import tempfile if "-v" in sys.argv or "--version" in sys.argv: - #Galaxy seems to invert the order of the two lines - print("BAM coverage statistics v0.0.1") + # Galaxy seems to invert the order of the two lines + print("BAM coverage statistics v0.0.5") cmd = "samtools 2>&1 | grep -i ^Version" sys.exit(os.system(cmd)) -def stop_err(msg, error_level=1): - """Print error message to stdout and quit with given error level.""" - sys.stderr.write("%s\n" % msg) - sys.exit(error_level) - -if len(sys.argv) != 4: - stop_err("Require three arguments: BAM, BAI, tabular filenames") - -bam_filename, bai_filename, tabular_filename = sys.argv[1:] +# TODO - Proper command line API +if len(sys.argv) == 4: + bam_filename, bai_filename, tabular_filename = sys.argv[1:] + max_depth = "8000" +elif len(sys.argv) == 5: + bam_filename, bai_filename, tabular_filename, max_depth = sys.argv[1:] +else: + sys.exit("Require 3 or 4 arguments: BAM, BAI, tabular filename, [max depth]") if not os.path.isfile(bam_filename): - stop_err("Input BAM file not found: %s" % bam_filename) + sys.exit("Input BAM file not found: %s" % bam_filename) +if bai_filename == "-": + # Make this optional for ease of use at the command line by hand: + if os.path.isfile(bam_filename + ".bai"): + bai_filename = bam_filename + ".bai" if not os.path.isfile(bai_filename): if bai_filename == "None": - stop_err("Error: Galaxy did not index your BAM file") - stop_err("Input BAI file not found: %s" % bai_filename) + sys.exit("Error: Galaxy did not index your BAM file") + sys.exit("Input BAI file not found: %s" % bai_filename) -#Assign sensible names with real extensions, and setup symlinks: +try: + max_depth = int(max_depth) +except ValueError: + sys.exit("Bad argument for max depth: %r" % max_depth) +if max_depth < 0: + sys.exit("Bad argument for max depth: %r" % max_depth) + +# fuzz factor to ensure can reach max_depth, e.g. region with +# many reads having a deletion present could underestimate the +# coverage by capping the number of reads considered +depth_margin = 100 + +# Assign sensible names with real extensions, and setup symlinks: tmp_dir = tempfile.mkdtemp() bam_file = os.path.join(tmp_dir, "temp.bam") bai_file = os.path.join(tmp_dir, "temp.bam.bai") @@ -50,27 +72,58 @@ assert os.path.isfile(bai_file), bai_file assert os.path.isfile(bam_file + ".bai"), bam_file + def clean_up(): + """Remove our temporary files and directory.""" os.remove(idxstats_filename) os.remove(depth_filename) os.remove(bam_file) os.remove(bai_file) os.rmdir(tmp_dir) + +def samtools_depth_opt_available(): + """Determine if samtools depth supports maximum coverage argument.""" + child = subprocess.Popen(["samtools", "depth"], + stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + # Combined stdout/stderr in case samtools is ever inconsistent + output, tmp = child.communicate() + assert tmp is None + # Expect to find this line in the help text, exact wording could change: + # -d/-m <int> maximum coverage depth [8000] + return " -d/-m " in output + + +depth_hack = False +if not samtools_depth_opt_available(): + if max_depth + depth_margin <= 8000: + sys.stderr.write("WARNING: The version of samtools depth installed does not " + "support the -d option, however, the requested max-depth " + "is safely under the default of 8000.\n") + depth_hack = True + else: + sys.exit("The version of samtools depth installed does not support the -d option.") + # Run samtools idxstats: cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename) return_code = os.system(cmd) if return_code: clean_up() - stop_err("Return code %i from command:\n%s" % (return_code, cmd)) + sys.exit("Return code %i from command:\n%s" % (return_code, cmd)) # Run samtools depth: # TODO - Parse stdout instead? -cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename) +if depth_hack: + # Using an old samtools without the -d option, but hard coded default + # of 8000 should be fine even allowing a margin for fuzzy output + cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename) +else: + cmd = 'samtools depth -d %i "%s" > "%s"' % (max_depth + depth_margin, bam_file, depth_filename) return_code = os.system(cmd) if return_code: clean_up() - stop_err("Return code %i from command:\n%s" % (return_code, cmd)) + sys.exit("Return code %i from command:\n%s" % (return_code, cmd)) + def load_total_coverage(depth_handle, identifier, length): """Parse some of the 'samtools depth' output for coverages. @@ -86,18 +139,23 @@ # print("%s coverage calculation, length %i, ..." % (identifier, length)) if depth_ref is None: - # Right at start of file! + # Right at start of file / new contig line = depth_handle.readline() + # Are we at the end of the file? + if not line: + # Must be at the end of the file. + # This can happen if the file contig(s) had no reads mapped + return 0, 0, 0.0, 0 depth_ref, depth_pos, depth_reads = line.rstrip("\n").split() depth_pos = int(depth_pos) - depth_reads = int(depth_reads) + depth_reads = min(max_depth, int(depth_reads)) # Can now treat as later references where first line cached elif identifier != depth_ref: # Infer that identifier had coverage zero, # and so was not in the 'samtools depth' # output. # print("%s appears to have no coverage at all" % identifier) - return 0, 0, 0.0 + return 0, 0, 0.0, 0 # Good, at start of expected reference bases = depth_reads @@ -115,7 +173,7 @@ for line in depth_handle: ref, pos, depth = line.rstrip("\n").split() pos = int(pos) - depth = int(depth) + depth = min(max_depth, int(depth)) if ref != identifier: # Reached the end of this identifier's coverage # so cache this ready for next identifier @@ -135,7 +193,8 @@ # print("%s has no coverage at end" % identifier) min_cov = 0 mean_cov = bases / float(length) - return min_cov, max_cov, mean_cov + return min_cov, max_cov, mean_cov, bases + # Parse and combine the output out_handle = open(tabular_filename, "w") @@ -148,6 +207,8 @@ depth_pos = 0 depth_reads = 0 +global_bases = 0 +global_length = 0 for line in idxstats_handle: identifier, length, mapped, placed = line.rstrip("\n").split() length = int(length) @@ -157,8 +218,12 @@ min_cov = 0 max_cov = 0 mean_cov = 0.0 + bases = 0 else: - min_cov, max_cov, mean_cov = load_total_coverage(depth_handle, identifier, length) + min_cov, max_cov, mean_cov, bases = load_total_coverage(depth_handle, identifier, length) + if max_cov > max_depth: + sys.exit("Using max depth %i yet saw max coverage %i for %s" + % (max_depth, max_cov, identifier)) out_handle.write("%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n" % (identifier, length, mapped, placed, min_cov, max_cov, mean_cov)) @@ -168,15 +233,19 @@ depth_handle.close() out_handle.close() clean_up() - stop_err("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov" + sys.exit("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov" % identifier) + global_length += length + global_bases += bases idxstats_handle.close() depth_handle.close() out_handle.close() +print("Total reference length %i with overall mean coverage %0.2f" % (global_length, float(global_bases) / global_length)) + # Remove the temp symlinks and files: clean_up() if depth_ref is not None: - stop_err("Left over output from 'samtools depth'? %r" % depth_ref) + sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
--- a/tools/coverage_stats/coverage_stats.xml Fri Nov 21 09:43:58 2014 -0500 +++ b/tools/coverage_stats/coverage_stats.xml Thu May 11 12:16:10 2017 -0400 @@ -1,29 +1,35 @@ -<tool id="coverage_stats" name="BAM coverage statistics" version="0.0.1"> +<tool id="coverage_stats" name="BAM coverage statistics" version="0.0.5"> <description>using samtools idxstats and depth</description> <requirements> - <requirement type="binary">samtools</requirement> - <requirement type="package" version="0.1.19">samtools</requirement> + <requirement type="package" version="1.4.1">samtools</requirement> </requirements> - <version_command interpreter="python">coverage_stats.py --version</version_command> - <command interpreter="python">coverage_stats.py "$input_bam" "${input_bam.metadata.bam_index}" "$out_tabular"</command> + <version_command> +python $__tool_directory__/coverage_stats.py --version + </version_command> + <command detect_errors="aggressive"> +python $__tool_directory__/coverage_stats.py '$input_bam' '${input_bam.metadata.bam_index}' '$out_tabular' '$max_depth' + </command> <inputs> <param name="input_bam" type="data" format="bam" label="Input BAM file" /> + <param name="max_depth" type="integer" min="0" max="10000000" label="Max depth" value="8000" /> </inputs> <outputs> <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" /> </outputs> - <stdio> - <!-- Assume anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - </stdio> <tests> <test> <param name="input_bam" value="ex1.bam" ftype="bam" /> + <param name="max_depth" value="123" /> <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" /> </test> <test> + <param name="input_bam" value="ex1.bam" ftype="bam" /> + <param name="max_depth" value="50" /> + <output name="out_tabular" file="ex1.coverage_stats.md50.tabular" ftype="tabular" /> + </test> + <test> <param name="input_bam" value="coverage_test.bam" ftype="bam" /> + <param name="max_depth" value="123" /> <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" /> </test> </tests> @@ -47,8 +53,8 @@ 2 Reference sequence length 3 Number of mapped reads 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads) - 5 Minimum coverage - 6 Maximum coverage + 5 Minimum coverage (per base of reference) + 6 Maximum coverage (per base of reference) 7 Mean coverage (given to 2 dp) ====== ================================================================================= @@ -77,9 +83,15 @@ .. class:: warningmark -**Note**. There is an internal hard limit of 8000 for the pileup routine in -samtools, meaning the reported coverage from ``samtools depth`` will show -maximum coverage depths *around* 8000. +**Note**. If using this on a mapping BAM file, beware that the coverage counting is +done per base of the reference. This means if your reference has any extra bases +compared to the reads being mapped, those bases will be skipped by CIGAR D operators +and these "extra" bases can have an extremely low coverage, giving a potentially +misleading ``min_cov`` values. A sliding window coverage may be more appropriate. + +**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the +pileup routine, meaning the reported coverage from ``samtools depth`` would show +maximum coverage depths *around* 8000. This is now a run time option. **Citation**
--- a/tools/coverage_stats/tool_dependencies.xml Fri Nov 21 09:43:58 2014 -0500 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,6 +0,0 @@ -<?xml version="1.0"?> -<tool_dependency> - <package name="samtools" version="0.1.19"> - <repository changeset_revision="923adc89c666" name="package_samtools_0_1_19" owner="iuc" toolshed="https://toolshed.g2.bx.psu.edu" /> - </package> -</tool_dependency>