Mercurial > repos > peterjc > coverage_stats
changeset 0:ca8f63f2f7d4 draft
Uploaded v0.0.1
author | peterjc |
---|---|
date | Fri, 21 Nov 2014 09:28:29 -0500 |
parents | |
children | d1fdfaae5dbe |
files | test-data/coverage_test.bam test-data/coverage_test.coverage_stats.tabular test-data/ex1.bam test-data/ex1.coverage_stats.tabular tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.py tools/coverage_stats/coverage_stats.xml tools/coverage_stats/tool_dependencies.xml |
diffstat | 8 files changed, 398 insertions(+), 0 deletions(-) [+] |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/coverage_test.coverage_stats.tabular Fri Nov 21 09:28:29 2014 -0500 @@ -0,0 +1,7 @@ +#identifer length mapped placed min_cov max_cov mean_cov +ref 100 5 0 0 2 0.50 +none 200 0 0 0 0 0.00 +full 10 3 0 2 2 2.00 +head 10 1 0 0 1 0.50 +tail 20 1 0 0 1 0.25 +* 0 0 1 0 0 0.00
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/ex1.coverage_stats.tabular Fri Nov 21 09:28:29 2014 -0500 @@ -0,0 +1,4 @@ +#identifer length mapped placed min_cov max_cov mean_cov +chr1 1575 1446 18 0 63 32.32 +chr2 1584 1789 17 0 64 39.78 +* 0 0 0 0 0 0.00
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/coverage_stats/README.rst Fri Nov 21 09:28:29 2014 -0500 @@ -0,0 +1,102 @@ +BAM coverage statistics using samtools idxstats and depth +========================================================= + +This tool is copyright 2014 by Peter Cock, The James Hutton Institute +(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. +See the licence text below. + +Internally this tool uses the command-line samtools suite. + +This tool is available from the Galaxy Tool Shed at: +http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + + +Automated Installation +====================== + +This should be straightforward, Galaxy should automatically download and install +samtools 0.1.19 if required. + + +Manual Installation +=================== + +This expects samtools to be on the ``$PATH``, and was tested using v0.1.19. + +To install the wrapper copy or move the following files under the Galaxy tools +folder, e.g. in a ``tools/coverage_stats`` folder: + +* ``coverage_stats.xml`` (the Galaxy tool definition) +* ``coverage_stats.py`` (the Python wrapper script) +* ``README.rst`` (this file) + +You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer +the tool. Just add the line, perhaps under the NGS tools section:: + + <tool file="coverage_stats/coverage_stats.xml" /> + +If you wish to run the unit tests, also move/copy the ``test-data/`` files +under Galaxy's ``test-data/`` folder. Then:: + + $ ./run_tests.sh -id coverage_stats + +That's it. + + +History +======= + +======= ====================================================================== +Version Changes +------- ---------------------------------------------------------------------- +v0.0.1 - Initial public release +======= ====================================================================== + + +Developers +========== + +Development is on this GitHub repository: +https://github.com/peterjc/pico_galaxy/tree/master/tools/coverage_stats + +For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use +the following command from the Galaxy root folder:: + + $ tar -czf coverage_stats.tar.gz tools/coverage_stats/README.rst tools/coverage_stats/coverage_stats.xml tools/coverage_stats/coverage_stats.py tools/coverage_stats/tool_dependencies.xml test-data/ex1.bam test-data/ex1.coverage_stats.tabular test-data/coverage_test.bam test-data/coverage_test.coverage_stats.tabular + +Check this worked:: + + $ tar -tzf coverage_stats.tar.gz + tools/coverage_stats/README.rst + tools/coverage_stats/coverage_stats.xml + tools/coverage_stats/coverage_stats.py + tools/coverage_stats/tool_dependencies.xml + test-data/ex1.bam + test-data/ex1.coverage_stats.tabular + test-data/coverage_test.bam + test-data/coverage_test.coverage_stats.tabular + + +Licence (MIT) +============= + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. + +NOTE: This is the licence for the Galaxy Wrapper only. +samtools is available and licenced separately.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/coverage_stats/coverage_stats.py Fri Nov 21 09:28:29 2014 -0500 @@ -0,0 +1,182 @@ +#!/usr/bin/env python +"""BAM coverage statistics using samtools idxstats and depth. + +This script takes exactly three command line arguments: + * Input BAM filename + * Input BAI filename (via Galaxy metadata) + * Output tabular filename + +This messes about with the filenames to make samtools happy, then +runs "samtools idxstats" and "samtools depth", captures and combines +the output to the desired output tabular file. +""" +import sys +import os +import subprocess +import tempfile + +if "-v" in sys.argv or "--version" in sys.argv: + #Galaxy seems to invert the order of the two lines + print("BAM coverage statistics v0.0.1") + cmd = "samtools 2>&1 | grep -i ^Version" + sys.exit(os.system(cmd)) + +def stop_err(msg, error_level=1): + """Print error message to stdout and quit with given error level.""" + sys.stderr.write("%s\n" % msg) + sys.exit(error_level) + +if len(sys.argv) != 4: + stop_err("Require three arguments: BAM, BAI, tabular filenames") + +bam_filename, bai_filename, tabular_filename = sys.argv[1:] + +if not os.path.isfile(bam_filename): + stop_err("Input BAM file not found: %s" % bam_filename) +if not os.path.isfile(bai_filename): + if bai_filename == "None": + stop_err("Error: Galaxy did not index your BAM file") + stop_err("Input BAI file not found: %s" % bai_filename) + +#Assign sensible names with real extensions, and setup symlinks: +tmp_dir = tempfile.mkdtemp() +bam_file = os.path.join(tmp_dir, "temp.bam") +bai_file = os.path.join(tmp_dir, "temp.bam.bai") +idxstats_filename = os.path.join(tmp_dir, "idxstats.tsv") +depth_filename = os.path.join(tmp_dir, "depth.tsv") +os.symlink(os.path.abspath(bam_filename), bam_file) +os.symlink(os.path.abspath(bai_filename), bai_file) +assert os.path.isfile(bam_file), bam_file +assert os.path.isfile(bai_file), bai_file +assert os.path.isfile(bam_file + ".bai"), bam_file + +def clean_up(): + os.remove(idxstats_filename) + os.remove(depth_filename) + os.remove(bam_file) + os.remove(bai_file) + os.rmdir(tmp_dir) + +# Run samtools idxstats: +cmd = 'samtools idxstats "%s" > "%s"' % (bam_file, idxstats_filename) +return_code = os.system(cmd) +if return_code: + clean_up() + stop_err("Return code %i from command:\n%s" % (return_code, cmd)) + +# Run samtools depth: +# TODO - Parse stdout instead? +cmd = 'samtools depth "%s" > "%s"' % (bam_file, depth_filename) +return_code = os.system(cmd) +if return_code: + clean_up() + stop_err("Return code %i from command:\n%s" % (return_code, cmd)) + +def load_total_coverage(depth_handle, identifier, length): + """Parse some of the 'samtools depth' output for coverages. + + Returns min_cov (int), max_cov (int) and mean cov (float). + + Uses global variables to cache the first line of output from the + next reference sequence. + """ + global depth_ref, depth_pos, depth_reads + + # print("====") + # print("%s coverage calculation, length %i, ..." % (identifier, length)) + + if depth_ref is None: + # Right at start of file! + line = depth_handle.readline() + depth_ref, depth_pos, depth_reads = line.rstrip("\n").split() + depth_pos = int(depth_pos) + depth_reads = int(depth_reads) + # Can now treat as later references where first line cached + elif identifier != depth_ref: + # Infer that identifier had coverage zero, + # and so was not in the 'samtools depth' + # output. + # print("%s appears to have no coverage at all" % identifier) + return 0, 0, 0.0 + + # Good, at start of expected reference + bases = depth_reads + if depth_pos == 1: + min_cov = depth_reads + else: + # print("%s has no coverage at start" % identifier) + min_cov = 0 + max_cov = depth_reads + + last_pos = depth_pos + depth_ref = None + depth_pos = 0 + depth_reads = 0 + for line in depth_handle: + ref, pos, depth = line.rstrip("\n").split() + pos = int(pos) + depth = int(depth) + if ref != identifier: + # Reached the end of this identifier's coverage + # so cache this ready for next identifier + depth_ref, depth_pos, depth_reads = ref, pos, depth + break + bases += depth + if last_pos + 1 < pos: + # print("%s has no coverage between %i and %i" % (identifier, last_pos, pos)) + min_cov = 0 + else: + min_cov = min(min_cov, depth) + max_cov = max(max_cov, depth) + last_pos = pos + + # Reach the end of this identifier's coverage or end of file + if last_pos < length: + # print("%s has no coverage at end" % identifier) + min_cov = 0 + mean_cov = bases / float(length) + return min_cov, max_cov, mean_cov + +# Parse and combine the output +out_handle = open(tabular_filename, "w") +out_handle.write("#identifer\tlength\tmapped\tplaced\tmin_cov\tmax_cov\tmean_cov\n") + +idxstats_handle = open(idxstats_filename) +depth_handle = open(depth_filename) + +depth_ref = None +depth_pos = 0 +depth_reads = 0 + +for line in idxstats_handle: + identifier, length, mapped, placed = line.rstrip("\n").split() + length = int(length) + mapped = int(mapped) + placed = int(placed) + if identifier == "*": + min_cov = 0 + max_cov = 0 + mean_cov = 0.0 + else: + min_cov, max_cov, mean_cov = load_total_coverage(depth_handle, identifier, length) + out_handle.write("%s\t%i\t%i\t%i\t%i\t%i\t%0.2f\n" + % (identifier, length, mapped, placed, + min_cov, max_cov, mean_cov)) + if not (min_cov <= mean_cov <= max_cov): + out_handle.write("ERROR, expect min_cov <= mean_cov <= max_cov\n") + idxstats_handle.close() + depth_handle.close() + out_handle.close() + clean_up() + stop_err("Problem with coverage for %s, expect min_cov <= mean_cov <= max_cov" + % identifier) + +idxstats_handle.close() +depth_handle.close() +out_handle.close() + +# Remove the temp symlinks and files: +clean_up() + +if depth_ref is not None: + stop_err("Left over output from 'samtools depth'? %r" % depth_ref)
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/coverage_stats/coverage_stats.xml Fri Nov 21 09:28:29 2014 -0500 @@ -0,0 +1,97 @@ +<tool id="coverage_stats" name="BAM coverage statistics" version="0.0.1"> + <description>using samtools idxstats and depth</description> + <requirements> + <requirement type="binary">samtools</requirement> + <requirement type="package" version="0.1.19">samtools</requirement> + </requirements> + <version_command interpreter="python">coverage_stats.py --version</version_command> + <command interpreter="python">coverage_stats.py "$input_bam" "${input_bam.metadata.bam_index}" "$out_tabular"</command> + <inputs> + <param name="input_bam" type="data" format="bam" label="Input BAM file" /> + </inputs> + <outputs> + <data name="out_tabular" format="tabular" label="$input_bam.name (coverage stats)" /> + </outputs> + <stdio> + <!-- Assume anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + </stdio> + <tests> + <test> + <param name="input_bam" value="ex1.bam" ftype="bam" /> + <output name="out_tabular" file="ex1.coverage_stats.tabular" ftype="tabular" /> + </test> + <test> + <param name="input_bam" value="coverage_test.bam" ftype="bam" /> + <output name="out_tabular" file="coverage_test.coverage_stats.tabular" ftype="tabular" /> + </test> + </tests> + <help> +**What it does** + +This tool runs the commands ``samtools idxstats`` and ``samtools depth`` from the +SAMtools toolkit, and parses their output to produce a consise summary of the +coverage information for each reference sequence. + +Input is a sorted and indexed BAM file, the output is tabular. The first four +columns match the output from ``samtools idxstats``, the additional columns are +calculated from the ``samtools depth`` output. The final row with a star as the +reference identifier represents unmapped reads, and will have zeros in every +column except columns one and four. + +====== ================================================================================= +Column Description +------ --------------------------------------------------------------------------------- + 1 Reference sequence identifier + 2 Reference sequence length + 3 Number of mapped reads + 4 Number of placed but unmapped reads (typically unmapped partners of mapped reads) + 5 Minimum coverage + 6 Maximum coverage + 7 Mean coverage (given to 2 dp) +====== ================================================================================= + +Example output from a *de novo* assembly: + +========== ====== ====== ====== ======= ======= ======== +identiifer length mapped placed min_cov max_cov mean_cov +---------- ------ ------ ------ ------- ------- -------- +contig_1 833604 436112 0 1 157 71.95 +contig_2 14820 9954 0 1 152 91.27 +contig_3 272099 142958 0 1 150 72.31 +contig_4 135519 73288 0 1 149 75.23 +contig_5 91245 46759 0 1 157 70.92 +contig_6 175604 95744 0 1 146 75.99 +contig_7 90586 48158 0 1 151 72.93 +contig_9 234347 126458 0 1 159 75.40 +contig_10 121515 60211 0 1 152 68.12 +... ... ... ... ... ... ... +contig_604 712 85 0 1 49 21.97 +\* 0 0 950320 0 0 0.00 +========== ====== ====== ====== ======= ======= ======== + +In this example there were 604 contigs, each with one line in the output table, +plus the final row (labelled with an asterisk) representing 950320 unmapped reads. +In this BAM file, the fourth column was otherwise zero. + + +**Citation** + +If you use this Galaxy tool in work leading to a scientific publication please +cite: + +Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools. +Bioinformatics 25(16), 2078-9. +http://dx.doi.org/10.1093/bioinformatics/btp352 + +Peter J.A. Cock (2013), BAM coverage statistics using samtools idxstats and depth. +http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/coverage_stats + </help> + <citations> + <citation type="doi">10.1093/bioinformatics/btp352</citation> + </citations> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/coverage_stats/tool_dependencies.xml Fri Nov 21 09:28:29 2014 -0500 @@ -0,0 +1,6 @@ +<?xml version="1.0"?> +<tool_dependency> + <package name="samtools" version="0.1.19"> + <repository changeset_revision="923adc89c666" name="package_samtools_0_1_19" owner="iuc" toolshed="https://toolshed.g2.bx.psu.edu" /> + </package> +</tool_dependency>