# HG changeset patch
# User peterjc
# Date 1597184585 14400
# Node ID 57b3ea22aff38db96d1d06aa99d008d040ca5a9c
# Parent 7254ece0c0ff1bbb1328df0f3635602fbce36eed
Uploaded v0.1.0 which was already on the Test Tool Shed. Included Python 3 support.
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam
Binary file coverage_stats-1b5523d3d2c2/test-data/coverage_test.bam has changed
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/test-data/coverage_test.coverage_stats.tabular Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,7 @@
+#identifer length mapped placed min_cov max_cov mean_cov
+ref 100 5 0 0 2 0.50
+none 200 0 0 0 0 0.00
+full 10 3 0 2 2 2.00
+head 10 1 0 0 1 0.50
+tail 20 1 0 0 1 0.25
+* 0 0 1 0 0 0.00
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/ex1.bam
Binary file coverage_stats-1b5523d3d2c2/test-data/ex1.bam has changed
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.md50.tabular Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,4 @@
+#identifer length mapped placed min_cov max_cov mean_cov
+chr1 1575 1446 18 0 50 32.06
+chr2 1584 1789 17 0 50 38.70
+* 0 0 0 0 0 0.00
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/test-data/ex1.coverage_stats.tabular Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,4 @@
+#identifer length mapped placed min_cov max_cov mean_cov
+chr1 1575 1446 18 0 63 32.32
+chr2 1584 1789 17 0 64 39.78
+* 0 0 0 0 0 0.00
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/README.rst Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,124 @@
+BAM coverage statistics using samtools idxstats and depth
+=========================================================
+
+This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below.
+
+Internally this tool uses the command-line samtools suite.
+
+This tool is available from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
+
+
+Automated Installation
+======================
+
+This should be straightforward, provided your Galaxy is setup to use Conda and
+BioConda for dependencies - Galaxy should then automatically download and
+install the expected version of samtools.
+
+
+Manual Installation
+===================
+
+This expects samtools to be on the ``$PATH``, and was tested using v1.3.1.
+
+To install the wrapper copy or move the following files under the Galaxy tools
+folder, e.g. in a ``tools/coverage_stats`` folder:
+
+* ``coverage_stats.xml`` (the Galaxy tool definition)
+* ``coverage_stats.py`` (the Python wrapper script)
+* ``README.rst`` (this file)
+
+You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
+the tool. Just add the line, perhaps under the NGS tools section::
+
+
+
+If you wish to run the unit tests, also move/copy the ``test-data/`` files
+under Galaxy's ``test-data/`` folder. Then::
+
+ $ ./run_tests.sh -id coverage_stats
+
+That's it.
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1 - Initial public release
+v0.0.2 - Cope with samtools' default depth limit using modified samtools,
+ see https://github.com/samtools/samtools/pull/322
+v0.0.3 - Cope with no coverage in final contigs.
+v0.0.4 - Reorder XML elements (internal change only).
+ - Planemo for Tool Shed upload (``.shed.yml``, internal change only).
+v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3
+ - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a
+ legacy Tool Shed package.
+ - Apply the new maximum depth parameter within the script to ensure
+ excess coverage is clear by getting the max coverage equal to the
+ max depth setting (the raw output from samtools is more fuzzy).
+ - Report total length and overall mean coverage in stdout.
+ - Use ```` (internal change only).
+ - Single quote command line arguments (internal change only).
+v0.0.6 - Python 3 compatibility fix.
+v0.1.0 - Provide proper command line API for the Python script.
+======= ======================================================================
+
+
+Developers
+==========
+
+Development is on this GitHub repository:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which requires you have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+ $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/
+ ...
+
+or::
+
+ $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/
+ ...
+
+To just build and check the tar ball, use::
+
+ $ planemo shed_upload --tar_only tools/coverage_stats/
+ ...
+ $ tar -tzf shed_upload.tar.gz
+ tools/coverage_stats/README.rst
+ tools/coverage_stats/coverage_stats.xml
+ tools/coverage_stats/coverage_stats.py
+ ...
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+
+NOTE: This is the licence for the Galaxy Wrapper only.
+samtools is available and licenced separately.
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.py Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,323 @@
+#!/usr/bin/env python
+"""BAM coverage statistics using samtools idxstats and depth.
+
+This script takes exactly three command line arguments:
+ * Input BAM filename
+ * Input BAI filename (via Galaxy metadata)
+ * Output tabular filename
+
+Optional fourth argument:
+ * Max coverage depth (integer)
+
+This messes about with the filenames to make samtools happy, then
+runs "samtools idxstats" and "samtools depth", captures and combines
+the output to the desired output tabular file.
+
+Because "samtools depth" treats the max depth a little fuzzily, this
+tool tries to account for this and applies a clear max-depth cut off.
+"""
+
+import os
+import subprocess
+import sys
+import tempfile
+
+from optparse import OptionParser
+
+usage = """Example usage:
+
+$ python coverage_stats.py -b mapped.bam -i mapped.bai -o summary.tsv -d 10000
+"""
+
+parser = OptionParser(usage=usage)
+parser.add_option(
+ "-b", "--bam", dest="input_bam", default=None, help="Input BAM file", metavar="FILE"
+)
+parser.add_option(
+ "-i",
+ "--bai",
+ dest="input_bai",
+ default="-",
+ help="Input BAI file (BAM index, optional, default or - infer from BAM filename)",
+ metavar="FILE",
+)
+parser.add_option(
+ "-o",
+ dest="output_tabular",
+ default="-",
+ help="Output tabular file (optional, default stdout)",
+ metavar="FILE",
+)
+parser.add_option(
+ "-d",
+ "--maxdepth",
+ dest="max_depth",
+ default=8000,
+ help="Max coverage depth (integer, default 8000)",
+ type="int",
+)
+parser.add_option(
+ "-v",
+ "--version",
+ dest="version",
+ default=False,
+ action="store_true",
+ help="Show version and quit",
+)
+
+options, args = parser.parse_args()
+
+if options.version:
+ # Galaxy seems to invert the order of the two lines
+ print("BAM coverage statistics v0.1.0")
+ cmd = "samtools 2>&1 | grep -i ^Version"
+ sys.exit(os.system(cmd))
+
+bam_filename = options.input_bam
+bai_filename = options.input_bai
+tabular_filename = options.output_tabular
+max_depth = options.max_depth
+
+if args:
+ sys.exit("Sorry, the legacy API has been dropped.")
+
+if not bam_filename:
+ sys.exit("Input BAM filename is required.")
+if not os.path.isfile(bam_filename):
+ sys.exit("Input BAM file not found: %s" % bam_filename)
+if bai_filename == "-":
+ if os.path.isfile(bam_filename + ".bai"):
+ bai_filename = bam_filename + ".bai"
+if not os.path.isfile(bai_filename):
+ if bai_filename == "None":
+ sys.exit("Error: Galaxy did not index your BAM file")
+ sys.exit("Input BAI file not found: %s" % bai_filename)
+
+try:
+ max_depth = int(max_depth)
+except ValueError:
+ sys.exit("Bad argument for max depth: %r" % max_depth)
+if max_depth < 0:
+ sys.exit("Bad argument for max depth: %r" % max_depth)
+
+# fuzz factor to ensure can reach max_depth, e.g. region with
+# many reads having a deletion present could underestimate the
+# coverage by capping the number of reads considered
+depth_margin = 100
+
+# Assign sensible names with real extensions, and setup symlinks:
+tmp_dir = tempfile.mkdtemp()
+bam_file = os.path.join(tmp_dir, "temp.bam")
+bai_file = os.path.join(tmp_dir, "temp.bam.bai")
+idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv")
+depth_filename = os.path.join(tmp_dir, "depth.tsv")
+os.symlink(os.path.abspath(bam_filename), bam_file)
+os.symlink(os.path.abspath(bai_filename), bai_file)
+assert os.path.isfile(bam_file), bam_file
+assert os.path.isfile(bai_file), bai_file
+assert os.path.isfile(bam_file + ".bai"), bam_file
+
+
+def clean_up():
+ """Remove our temporary files and directory."""
+ os.remove(idxstats_filename)
+ os.remove(depth_filename)
+ os.remove(bam_file)
+ os.remove(bai_file)
+ os.rmdir(tmp_dir)
+
+
+def samtools_depth_opt_available():
+ """Determine if samtools depth supports maximum coverage argument."""
+ child = subprocess.Popen(
+ ["samtools", "depth"],
+ universal_newlines=True,
+ stdout=subprocess.PIPE,
+ stderr=subprocess.STDOUT,
+ )
+ # Combined stdout/stderr in case samtools is ever inconsistent
+ output, tmp = child.communicate()
+ assert tmp is None
+ # Expect to find this line in the help text, exact wording could change:
+ # -d/-m maximum coverage depth [8000]
+ return " -d/-m " in output
+
+
+depth_hack = False
+if not samtools_depth_opt_available():
+ if max_depth + depth_margin <= 8000:
+ sys.stderr.write(
+ "WARNING: The version of samtools depth installed does not "
+ "support the -d option, however, the requested max-depth "
+ "is safely under the default of 8000.\n"
+ )
+ depth_hack = True
+ else:
+ sys.exit(
+ "The version of samtools depth installed does not support the -d option."
+ )
+
+# Run samtools idxstats:
+cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename)
+return_code = os.system(cmd)
+if return_code:
+ clean_up()
+ sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
+
+# Run samtools depth:
+# TODO - Parse stdout instead?
+if depth_hack:
+ # Using an old samtools without the -d option, but hard coded default
+ # of 8000 should be fine even allowing a margin for fuzzy output
+ cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename)
+else:
+ cmd = 'samtools depth -d %i "%s" > "%s"' % (
+ max_depth + depth_margin,
+ bam_file,
+ depth_filename,
+ )
+return_code = os.system(cmd)
+if return_code:
+ clean_up()
+ sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
+
+
+def load_total_coverage(depth_handle, identifier, length):
+ """Parse some of the 'samtools depth' output for coverages.
+
+ Returns min_cov (int), max_cov (int) and mean cov (float).
+
+ Uses global variables to cache the first line of output from the
+ next reference sequence.
+ """
+ global depth_ref, depth_pos, depth_reads
+
+ # print("====")
+ # print("%s coverage calculation, length %i, ..." % (identifier, length))
+
+ if depth_ref is None:
+ # Right at start of file / new contig
+ line = depth_handle.readline()
+ # Are we at the end of the file?
+ if not line:
+ # Must be at the end of the file.
+ # This can happen if the file contig(s) had no reads mapped
+ return 0, 0, 0.0, 0
+ depth_ref, depth_pos, depth_reads = line.rstrip("\n").split()
+ depth_pos = int(depth_pos)
+ depth_reads = min(max_depth, int(depth_reads))
+ # Can now treat as later references where first line cached
+ elif identifier != depth_ref:
+ # Infer that identifier had coverage zero,
+ # and so was not in the 'samtools depth'
+ # output.
+ # print("%s appears to have no coverage at all" % identifier)
+ return 0, 0, 0.0, 0
+
+ # Good, at start of expected reference
+ bases = depth_reads
+ if depth_pos == 1:
+ min_cov = depth_reads
+ else:
+ # print("%s has no coverage at start" % identifier)
+ min_cov = 0
+ max_cov = depth_reads
+
+ last_pos = depth_pos
+ depth_ref = None
+ depth_pos = 0
+ depth_reads = 0
+ for line in depth_handle:
+ ref, pos, depth = line.rstrip("\n").split()
+ pos = int(pos)
+ depth = min(max_depth, int(depth))
+ if ref != identifier:
+ # Reached the end of this identifier's coverage
+ # so cache this ready for next identifier
+ depth_ref, depth_pos, depth_reads = ref, pos, depth
+ break
+ bases += depth
+ if last_pos + 1 < pos:
+ # print("%s has no coverage between %i and %i"
+ # % (identifier, last_pos, pos))
+ min_cov = 0
+ else:
+ min_cov = min(min_cov, depth)
+ max_cov = max(max_cov, depth)
+ last_pos = pos
+
+ # Reach the end of this identifier's coverage or end of file
+ if last_pos < length:
+ # print("%s has no coverage at end" % identifier)
+ min_cov = 0
+ mean_cov = bases / float(length)
+ return min_cov, max_cov, mean_cov, bases
+
+
+# Parse and combine the output
+if tabular_filename == "-":
+ out_handle = sys.stdout
+else:
+ out_handle = open(tabular_filename, "w")
+out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n")
+
+idxstats_handle = open(idxstats_filename)
+depth_handle = open(depth_filename)
+
+depth_ref = None
+depth_pos = 0
+depth_reads = 0
+
+global_bases = 0
+global_length = 0
+for line in idxstats_handle:
+ identifier, length, mapped, placed = line.rstrip("\n").split()
+ length = int(length)
+ mapped = int(mapped)
+ placed = int(placed)
+ if identifier == "*":
+ min_cov = 0
+ max_cov = 0
+ mean_cov = 0.0
+ bases = 0
+ else:
+ min_cov, max_cov, mean_cov, bases = load_total_coverage(
+ depth_handle, identifier, length
+ )
+ if max_cov > max_depth:
+ sys.exit(
+ "Using max depth %i yet saw max coverage %i for %s"
+ % (max_depth, max_cov, identifier)
+ )
+ out_handle.write(
+ "%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n"
+ % (identifier, length, mapped, placed, min_cov, max_cov, mean_cov)
+ )
+ if not (min_cov <= mean_cov <= max_cov):
+ out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n")
+ idxstats_handle.close()
+ depth_handle.close()
+ out_handle.close()
+ clean_up()
+ sys.exit(
+ "Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov"
+ % identifier
+ )
+ global_length += length
+ global_bases += bases
+
+idxstats_handle.close()
+depth_handle.close()
+if tabular_filename != "-":
+ out_handle.close()
+
+print(
+ "Total reference length %i with overall mean coverage %0.2f"
+ % (global_length, float(global_bases) / global_length)
+)
+
+# Remove the temp symlinks and files:
+clean_up()
+
+if depth_ref is not None:
+ sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
diff -r 7254ece0c0ff -r 57b3ea22aff3 coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/coverage_stats-1b5523d3d2c2/tools/coverage_stats/coverage_stats.xml Tue Aug 11 18:23:05 2020 -0400
@@ -0,0 +1,119 @@
+
+ using samtools idxstats and depth
+
+ samtools
+
+
+python $__tool_directory__/coverage_stats.py --version
+
+
+python $__tool_directory__/coverage_stats.py
+-b '$input_bam'
+-i '${input_bam.metadata.bam_index}'
+-o '$out_tabular'
+-d '$max_depth'
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the
+SAMtools toolkit, and parses their output to produce a consise summary of the
+coverage information for each reference sequence.
+
+Input is a sorted and indexed BAM file, the output is tabular. The first four
+columns match the output from ``samtools idxstats``, the additional columns are
+calculated from the ``samtools depth`` output. The final row with a star as the
+reference identifier represents unmapped reads, and will have zeros in every
+column except columns one and four.
+
+====== =================================================================================
+Column Description
+------ ---------------------------------------------------------------------------------
+ 1 Reference sequence identifier
+ 2 Reference sequence length
+ 3 Number of mapped reads
+ 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads)
+ 5 Minimum coverage (per base of reference)
+ 6 Maximum coverage (per base of reference)
+ 7 Mean coverage (given to 2 dp)
+====== =================================================================================
+
+Example output from a *de novo* assembly:
+
+========== ====== ====== ====== ======= ======= ========
+identiifer length mapped placed min_cov max_cov mean_cov
+---------- ------ ------ ------ ------- ------- --------
+contig_1 833604 436112 0 1 157 71.95
+contig_2 14820 9954 0 1 152 91.27
+contig_3 272099 142958 0 1 150 72.31
+contig_4 135519 73288 0 1 149 75.23
+contig_5 91245 46759 0 1 157 70.92
+contig_6 175604 95744 0 1 146 75.99
+contig_7 90586 48158 0 1 151 72.93
+contig_9 234347 126458 0 1 159 75.40
+contig_10 121515 60211 0 1 152 68.12
+... ... ... ... ... ... ...
+contig_604 712 85 0 1 49 21.97
+\* 0 0 950320 0 0 0.00
+========== ====== ====== ====== ======= ======= ========
+
+In this example there were 604 contigs, each with one line in the output table,
+plus the final row (labelled with an asterisk) representing 950320 unmapped reads.
+In this BAM file, the fourth column was otherwise zero.
+
+.. class:: warningmark
+
+**Note**. If using this on a mapping BAM file, beware that the coverage counting is
+done per base of the reference. This means if your reference has any extra bases
+compared to the reads being mapped, those bases will be skipped by CIGAR D operators
+and these "extra" bases can have an extremely low coverage, giving a potentially
+misleading ``min_cov`` values. A sliding window coverage may be more appropriate.
+
+**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the
+pileup routine, meaning the reported coverage from ``samtools depth`` would show
+maximum coverage depths *around* 8000. This is now a run time option.
+
+
+**Citation**
+
+If you use this Galaxy tool in work leading to a scientific publication please
+cite:
+
+Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools.
+Bioinformatics 25(16), 2078-9.
+https://doi.org/10.1093/bioinformatics/btp352
+
+Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth.
+http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
+
+
+ 10.1093/bioinformatics/btp352
+
+
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/coverage_test.bam
Binary file test-data/coverage_test.bam has changed
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/coverage_test.coverage_stats.tabular
--- a/test-data/coverage_test.coverage_stats.tabular Thu May 11 12:16:10 2017 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,7 +0,0 @@
-#identifer length mapped placed min_cov max_cov mean_cov
-ref 100 5 0 0 2 0.50
-none 200 0 0 0 0 0.00
-full 10 3 0 2 2 2.00
-head 10 1 0 0 1 0.50
-tail 20 1 0 0 1 0.25
-* 0 0 1 0 0 0.00
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/ex1.bam
Binary file test-data/ex1.bam has changed
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/ex1.coverage_stats.md50.tabular
--- a/test-data/ex1.coverage_stats.md50.tabular Thu May 11 12:16:10 2017 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#identifer length mapped placed min_cov max_cov mean_cov
-chr1 1575 1446 18 0 50 32.06
-chr2 1584 1789 17 0 50 38.70
-* 0 0 0 0 0 0.00
diff -r 7254ece0c0ff -r 57b3ea22aff3 test-data/ex1.coverage_stats.tabular
--- a/test-data/ex1.coverage_stats.tabular Thu May 11 12:16:10 2017 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,4 +0,0 @@
-#identifer length mapped placed min_cov max_cov mean_cov
-chr1 1575 1446 18 0 63 32.32
-chr2 1584 1789 17 0 64 39.78
-* 0 0 0 0 0 0.00
diff -r 7254ece0c0ff -r 57b3ea22aff3 tools/coverage_stats/README.rst
--- a/tools/coverage_stats/README.rst Thu May 11 12:16:10 2017 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,122 +0,0 @@
-BAM coverage statistics using samtools idxstats and depth
-=========================================================
-
-This tool is copyright 2014-2017 by Peter Cock, The James Hutton Institute
-(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
-See the licence text below.
-
-Internally this tool uses the command-line samtools suite.
-
-This tool is available from the Galaxy Tool Shed at:
-http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
-
-
-Automated Installation
-======================
-
-This should be straightforward, provided your Galaxy is setup to use Conda and
-BioConda for dependencies - Galaxy should then automatically download and
-install the expected version of samtools.
-
-
-Manual Installation
-===================
-
-This expects samtools to be on the ``$PATH``, and was tested using v1.3.1.
-
-To install the wrapper copy or move the following files under the Galaxy tools
-folder, e.g. in a ``tools/coverage_stats`` folder:
-
-* ``coverage_stats.xml`` (the Galaxy tool definition)
-* ``coverage_stats.py`` (the Python wrapper script)
-* ``README.rst`` (this file)
-
-You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
-the tool. Just add the line, perhaps under the NGS tools section::
-
-
-
-If you wish to run the unit tests, also move/copy the ``test-data/`` files
-under Galaxy's ``test-data/`` folder. Then::
-
- $ ./run_tests.sh -id coverage_stats
-
-That's it.
-
-
-History
-=======
-
-======= ======================================================================
-Version Changes
-------- ----------------------------------------------------------------------
-v0.0.1 - Initial public release
-v0.0.2 - Cope with samtools' default depth limit using modified samtools,
- see https://github.com/samtools/samtools/pull/322
-v0.0.3 - Cope with no coverage in final contigs.
-v0.0.4 - Reorder XML elements (internal change only).
- - Planemo for Tool Shed upload (``.shed.yml``, internal change only).
-v0.0.5 - Expose new ``samtools depth -d ...`` argument added in samtools v1.3
- - Depends on samtools v1.4.1 via Conda, which IS NOT AVAILABLE as a
- legacy Tool Shed package.
- - Apply the new maximum depth parameter within the script to ensure
- excess coverage is clear by getting the max coverage equal to the
- max depth setting (the raw output from samtools is more fuzzy).
- - Report total length and overall mean coverage in stdout.
- - Use ```` (internal change only).
- - Single quote command line arguments (internal change only).
-======= ======================================================================
-
-
-Developers
-==========
-
-Development is on this GitHub repository:
-https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats
-
-For pushing a release to the test or main "Galaxy Tool Shed", use the following
-Planemo commands (which requires you have set your Tool Shed access details in
-``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
-
- $ planemo shed_update -t testtoolshed --check_diff tools/coverage_stats/
- ...
-
-or::
-
- $ planemo shed_update -t toolshed --check_diff tools/coverage_stats/
- ...
-
-To just build and check the tar ball, use::
-
- $ planemo shed_upload --tar_only tools/coverage_stats/
- ...
- $ tar -tzf shed_upload.tar.gz
- tools/coverage_stats/README.rst
- tools/coverage_stats/coverage_stats.xml
- tools/coverage_stats/coverage_stats.py
- ...
-
-
-Licence (MIT)
-=============
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
-
-NOTE: This is the licence for the Galaxy Wrapper only.
-samtools is available and licenced separately.
diff -r 7254ece0c0ff -r 57b3ea22aff3 tools/coverage_stats/coverage_stats.py
--- a/tools/coverage_stats/coverage_stats.py Thu May 11 12:16:10 2017 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,251 +0,0 @@
-#!/usr/bin/env python
-"""BAM coverage statistics using samtools idxstats and depth.
-
-This script takes exactly three command line arguments:
- * Input BAM filename
- * Input BAI filename (via Galaxy metadata)
- * Output tabular filename
-
-Optional fourth argument:
- * Max coverage depth (integer)
-
-This messes about with the filenames to make samtools happy, then
-runs "samtools idxstats" and "samtools depth", captures and combines
-the output to the desired output tabular file.
-
-Because "samtools depth" treats the max depth a little fuzzily, this
-tool tries to account for this and applies a clear max-depth cut off.
-"""
-
-import os
-import subprocess
-import sys
-import tempfile
-
-if "-v" in sys.argv or "--version" in sys.argv:
- # Galaxy seems to invert the order of the two lines
- print("BAM coverage statistics v0.0.5")
- cmd = "samtools 2>&1 | grep -i ^Version"
- sys.exit(os.system(cmd))
-
-# TODO - Proper command line API
-if len(sys.argv) == 4:
- bam_filename, bai_filename, tabular_filename = sys.argv[1:]
- max_depth = "8000"
-elif len(sys.argv) == 5:
- bam_filename, bai_filename, tabular_filename, max_depth = sys.argv[1:]
-else:
- sys.exit("Require 3 or 4 arguments: BAM, BAI, tabular filename, [max depth]")
-
-if not os.path.isfile(bam_filename):
- sys.exit("Input BAM file not found: %s" % bam_filename)
-if bai_filename == "-":
- # Make this optional for ease of use at the command line by hand:
- if os.path.isfile(bam_filename + ".bai"):
- bai_filename = bam_filename + ".bai"
-if not os.path.isfile(bai_filename):
- if bai_filename == "None":
- sys.exit("Error: Galaxy did not index your BAM file")
- sys.exit("Input BAI file not found: %s" % bai_filename)
-
-try:
- max_depth = int(max_depth)
-except ValueError:
- sys.exit("Bad argument for max depth: %r" % max_depth)
-if max_depth < 0:
- sys.exit("Bad argument for max depth: %r" % max_depth)
-
-# fuzz factor to ensure can reach max_depth, e.g. region with
-# many reads having a deletion present could underestimate the
-# coverage by capping the number of reads considered
-depth_margin = 100
-
-# Assign sensible names with real extensions, and setup symlinks:
-tmp_dir = tempfile.mkdtemp()
-bam_file = os.path.join(tmp_dir, "temp.bam")
-bai_file = os.path.join(tmp_dir, "temp.bam.bai")
-idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv")
-depth_filename = os.path.join(tmp_dir, "depth.tsv")
-os.symlink(os.path.abspath(bam_filename), bam_file)
-os.symlink(os.path.abspath(bai_filename), bai_file)
-assert os.path.isfile(bam_file), bam_file
-assert os.path.isfile(bai_file), bai_file
-assert os.path.isfile(bam_file + ".bai"), bam_file
-
-
-def clean_up():
- """Remove our temporary files and directory."""
- os.remove(idxstats_filename)
- os.remove(depth_filename)
- os.remove(bam_file)
- os.remove(bai_file)
- os.rmdir(tmp_dir)
-
-
-def samtools_depth_opt_available():
- """Determine if samtools depth supports maximum coverage argument."""
- child = subprocess.Popen(["samtools", "depth"],
- stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
- # Combined stdout/stderr in case samtools is ever inconsistent
- output, tmp = child.communicate()
- assert tmp is None
- # Expect to find this line in the help text, exact wording could change:
- # -d/-m maximum coverage depth [8000]
- return " -d/-m " in output
-
-
-depth_hack = False
-if not samtools_depth_opt_available():
- if max_depth + depth_margin <= 8000:
- sys.stderr.write("WARNING: The version of samtools depth installed does not "
- "support the -d option, however, the requested max-depth "
- "is safely under the default of 8000.\n")
- depth_hack = True
- else:
- sys.exit("The version of samtools depth installed does not support the -d option.")
-
-# Run samtools idxstats:
-cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename)
-return_code = os.system(cmd)
-if return_code:
- clean_up()
- sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
-
-# Run samtools depth:
-# TODO - Parse stdout instead?
-if depth_hack:
- # Using an old samtools without the -d option, but hard coded default
- # of 8000 should be fine even allowing a margin for fuzzy output
- cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename)
-else:
- cmd = 'samtools depth -d %i "%s" > "%s"' % (max_depth + depth_margin, bam_file, depth_filename)
-return_code = os.system(cmd)
-if return_code:
- clean_up()
- sys.exit("Return code %i from command:\n%s" % (return_code, cmd))
-
-
-def load_total_coverage(depth_handle, identifier, length):
- """Parse some of the 'samtools depth' output for coverages.
-
- Returns min_cov (int), max_cov (int) and mean cov (float).
-
- Uses global variables to cache the first line of output from the
- next reference sequence.
- """
- global depth_ref, depth_pos, depth_reads
-
- # print("====")
- # print("%s coverage calculation, length %i, ..." % (identifier, length))
-
- if depth_ref is None:
- # Right at start of file / new contig
- line = depth_handle.readline()
- # Are we at the end of the file?
- if not line:
- # Must be at the end of the file.
- # This can happen if the file contig(s) had no reads mapped
- return 0, 0, 0.0, 0
- depth_ref, depth_pos, depth_reads = line.rstrip("\n").split()
- depth_pos = int(depth_pos)
- depth_reads = min(max_depth, int(depth_reads))
- # Can now treat as later references where first line cached
- elif identifier != depth_ref:
- # Infer that identifier had coverage zero,
- # and so was not in the 'samtools depth'
- # output.
- # print("%s appears to have no coverage at all" % identifier)
- return 0, 0, 0.0, 0
-
- # Good, at start of expected reference
- bases = depth_reads
- if depth_pos == 1:
- min_cov = depth_reads
- else:
- # print("%s has no coverage at start" % identifier)
- min_cov = 0
- max_cov = depth_reads
-
- last_pos = depth_pos
- depth_ref = None
- depth_pos = 0
- depth_reads = 0
- for line in depth_handle:
- ref, pos, depth = line.rstrip("\n").split()
- pos = int(pos)
- depth = min(max_depth, int(depth))
- if ref != identifier:
- # Reached the end of this identifier's coverage
- # so cache this ready for next identifier
- depth_ref, depth_pos, depth_reads = ref, pos, depth
- break
- bases += depth
- if last_pos + 1 < pos:
- # print("%s has no coverage between %i and %i" % (identifier, last_pos, pos))
- min_cov = 0
- else:
- min_cov = min(min_cov, depth)
- max_cov = max(max_cov, depth)
- last_pos = pos
-
- # Reach the end of this identifier's coverage or end of file
- if last_pos < length:
- # print("%s has no coverage at end" % identifier)
- min_cov = 0
- mean_cov = bases / float(length)
- return min_cov, max_cov, mean_cov, bases
-
-
-# Parse and combine the output
-out_handle = open(tabular_filename, "w")
-out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n")
-
-idxstats_handle = open(idxstats_filename)
-depth_handle = open(depth_filename)
-
-depth_ref = None
-depth_pos = 0
-depth_reads = 0
-
-global_bases = 0
-global_length = 0
-for line in idxstats_handle:
- identifier, length, mapped, placed = line.rstrip("\n").split()
- length = int(length)
- mapped = int(mapped)
- placed = int(placed)
- if identifier == "*":
- min_cov = 0
- max_cov = 0
- mean_cov = 0.0
- bases = 0
- else:
- min_cov, max_cov, mean_cov, bases = load_total_coverage(depth_handle, identifier, length)
- if max_cov > max_depth:
- sys.exit("Using max depth %i yet saw max coverage %i for %s"
- % (max_depth, max_cov, identifier))
- out_handle.write("%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n"
- % (identifier, length, mapped, placed,
- min_cov, max_cov, mean_cov))
- if not (min_cov <= mean_cov <= max_cov):
- out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n")
- idxstats_handle.close()
- depth_handle.close()
- out_handle.close()
- clean_up()
- sys.exit("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov"
- % identifier)
- global_length += length
- global_bases += bases
-
-idxstats_handle.close()
-depth_handle.close()
-out_handle.close()
-
-print("Total reference length %i with overall mean coverage %0.2f" % (global_length, float(global_bases) / global_length))
-
-# Remove the temp symlinks and files:
-clean_up()
-
-if depth_ref is not None:
- sys.exit("Left over output from 'samtools depth'? %r" % depth_ref)
diff -r 7254ece0c0ff -r 57b3ea22aff3 tools/coverage_stats/coverage_stats.xml
--- a/tools/coverage_stats/coverage_stats.xml Thu May 11 12:16:10 2017 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,115 +0,0 @@
-
- using samtools idxstats and depth
-
- samtools
-
-
-python $__tool_directory__/coverage_stats.py --version
-
-
-python $__tool_directory__/coverage_stats.py '$input_bam' '${input_bam.metadata.bam_index}' '$out_tabular' '$max_depth'
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-**What it does**
-
-This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the
-SAMtools toolkit, and parses their output to produce a consise summary of the
-coverage information for each reference sequence.
-
-Input is a sorted and indexed BAM file, the output is tabular. The first four
-columns match the output from ``samtools idxstats``, the additional columns are
-calculated from the ``samtools depth`` output. The final row with a star as the
-reference identifier represents unmapped reads, and will have zeros in every
-column except columns one and four.
-
-====== =================================================================================
-Column Description
------- ---------------------------------------------------------------------------------
- 1 Reference sequence identifier
- 2 Reference sequence length
- 3 Number of mapped reads
- 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads)
- 5 Minimum coverage (per base of reference)
- 6 Maximum coverage (per base of reference)
- 7 Mean coverage (given to 2 dp)
-====== =================================================================================
-
-Example output from a *de novo* assembly:
-
-========== ====== ====== ====== ======= ======= ========
-identiifer length mapped placed min_cov max_cov mean_cov
----------- ------ ------ ------ ------- ------- --------
-contig_1 833604 436112 0 1 157 71.95
-contig_2 14820 9954 0 1 152 91.27
-contig_3 272099 142958 0 1 150 72.31
-contig_4 135519 73288 0 1 149 75.23
-contig_5 91245 46759 0 1 157 70.92
-contig_6 175604 95744 0 1 146 75.99
-contig_7 90586 48158 0 1 151 72.93
-contig_9 234347 126458 0 1 159 75.40
-contig_10 121515 60211 0 1 152 68.12
-... ... ... ... ... ... ...
-contig_604 712 85 0 1 49 21.97
-\* 0 0 950320 0 0 0.00
-========== ====== ====== ====== ======= ======= ========
-
-In this example there were 604 contigs, each with one line in the output table,
-plus the final row (labelled with an asterisk) representing 950320 unmapped reads.
-In this BAM file, the fourth column was otherwise zero.
-
-.. class:: warningmark
-
-**Note**. If using this on a mapping BAM file, beware that the coverage counting is
-done per base of the reference. This means if your reference has any extra bases
-compared to the reads being mapped, those bases will be skipped by CIGAR D operators
-and these "extra" bases can have an extremely low coverage, giving a potentially
-misleading ``min_cov`` values. A sliding window coverage may be more appropriate.
-
-**Note**. Up until samtools 1.2, there was an internal hard limit of 8000 for the
-pileup routine, meaning the reported coverage from ``samtools depth`` would show
-maximum coverage depths *around* 8000. This is now a run time option.
-
-
-**Citation**
-
-If you use this Galaxy tool in work leading to a scientific publication please
-cite:
-
-Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools.
-Bioinformatics 25(16), 2078-9.
-http://dx.doi.org/10.1093/bioinformatics/btp352
-
-Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth.
-http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
-
-This wrapper is available to install into other Galaxy Instances via the Galaxy
-Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats
-
-
- 10.1093/bioinformatics/btp352
-
-