# HG changeset patch
# User peterjc
# Date 1485951026 18000
# Node ID 95efbdb729615513c1db0e6fef9da8276e72f35c
v0.0.4 - Previously only on Test Tool Shed
diff -r 000000000000 -r 95efbdb72961 test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam
Binary file test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam has changed
diff -r 000000000000 -r 95efbdb72961 test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam.bai
Binary file test-data/SRR639755_mito_pairs_vs_NC_010642_clc.bam.bai has changed
diff -r 000000000000 -r 95efbdb72961 test-data/SRR639755_mito_pairs_vs_NC_010642_clc.count-1695-1725.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/SRR639755_mito_pairs_vs_NC_010642_clc.count-1695-1725.tabular Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,4 @@
+Variant Count Percentage
+AGCCCATGAGATGGGAAGCAATGGGCTACA 14 87.50
+AGCCCATGAGATGGGAAGCAATGGGCTACG 1 6.25
+AGCGCATGAGATGGGAAGCAATGGGCTACG 1 6.25
diff -r 000000000000 -r 95efbdb72961 test-data/coverage_test.bam
Binary file test-data/coverage_test.bam has changed
diff -r 000000000000 -r 95efbdb72961 test-data/coverage_test.count_roi_variants.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/coverage_test.count_roi_variants.tabular Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,2 @@
+Variant Count Percentage
+TAAAGG 1 100.00
diff -r 000000000000 -r 95efbdb72961 test-data/ex1.bam
Binary file test-data/ex1.bam has changed
diff -r 000000000000 -r 95efbdb72961 test-data/ex1.count_roi_variants.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/ex1.count_roi_variants.tabular Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,3 @@
+Variant Count Percentage
+CTATCTA 39 97.50
+ATATATA 1 2.50
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/README.rst
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/README.rst Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,138 @@
+Count sequence variants in region of interest in BAM file
+=========================================================
+
+This tool is copyright 2016 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below.
+
+This tool runs the command ``samtools view`` (taking advantage of an
+indexed BAM file) to access only those reads mapped to the region of
+interest (ROI), and then counts the different sequence variants found.
+
+Internally this tool uses the command-line samtools suite.
+
+This tool is available from the Galaxy Tool Shed at:
+http://toolshed.g2.bx.psu.edu/view/peterjc/count_roi_variants
+
+
+Use outside of Galaxy
+=====================
+
+You just need the ``count_roi_variants.py`` script and to have samtools
+on the ``$PATH``. If you move/copy the script somewhere on your ``$PATH``
+and then you can run it like this::
+
+ $ count_roi_variants.py --help
+
+Or, call the script at an explicit path::
+
+ $ /path/to/my/stuff/count_roi_variants.py --help
+
+Run like this it will use the current default Python. This was written and
+tested under Python 2.7, but should also work under Python 2.6. e.g.::
+
+ $ python2.6 /path/to/my/stuff/count_roi_variants.py --help
+
+This does not yet support Python 3.
+
+The sample data and tests are designed to be run via Galaxy.
+
+
+Automated Galaxy Installation
+=============================
+
+This should be straightforward, Galaxy should automatically download and install
+samtools if required.
+
+
+Manual Galaxy Installation
+==========================
+
+This expects samtools to be on the ``$PATH``, and was tested using v0.1.3
+
+To install the wrapper copy or move the following files under the Galaxy tools
+folder, e.g. in a ``tools/count_roi_variants`` folder:
+
+* ``count_roi_variants.xml`` (the Galaxy tool definition)
+* ``count_roi_variants.py`` (the Python wrapper script)
+* ``README.rst`` (this file)
+
+You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer
+the tool. Just add the line, perhaps under the NGS tools section::
+
+
+
+If you wish to run the unit tests, also move/copy the ``test-data/`` files
+under Galaxy's ``test-data/`` folder. Then::
+
+ $ ./run_tests.sh -id count_roi_variants
+
+That's it.
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1 - Initial public release
+v0.0.2 - Cope with pipes in reference name (e.g. NCBI style FASTA naming)
+v0.0.3 - Include a functional test for using an unrecognised reference.
+v0.0.4 - Improved usage text and README for use outside of Galaxy.
+======= ======================================================================
+
+
+Developers
+==========
+
+Development is on this GitHub repository:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/count_roi_variants
+
+For pushing a release to the test or main "Galaxy Tool Shed", use the following
+Planemo commands (which requires you have set your Tool Shed access details in
+``~/.planemo.yml`` and that you have access rights on the Tool Shed)::
+
+ $ planemo shed_update -t testtoolshed --check_diff ~/repositories/pico_galaxy/tools/count_roi_variants/
+ ...
+
+or::
+
+ $ planemo shed_update -t toolshed --check_diff ~/repositories/pico_galaxy/tools/count_roi_variants/
+ ...
+
+To just build and check the tar ball, use::
+
+ $ planemo shed_upload --tar_only ~/repositories/pico_galaxy/tools/count_roi_variants/
+ ...
+ $ tar -tzf shed_upload.tar.gz
+ tools/count_roi_variants/README.rst
+ tools/count_roi_variants/count_roi_variants.xml
+ tools/count_roi_variants/count_roi_variants.py
+ tools/count_roi_variants/tool_dependencies.xml
+ ...
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
+
+NOTE: This is the licence for the Galaxy Wrapper only.
+samtools is available and licenced separately.
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/count_roi_variants.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/count_roi_variants.py Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,240 @@
+#!/usr/bin/env python
+"""Count sequence variants in region of interest in a BAM file.
+
+This script takes exactly four command line arguments:
+ * Input BAM filename
+ * Input BAI filename (via Galaxy metadata)
+ * Output tabular filename
+ * Region of interest (reference:start-end as used in samtools)
+
+This messes about with the filenames to make samtools happy, then
+runs "samtools view" and parses the reads mapped to the ROI, counts
+the observed variants spanning the ROI, and outputs this as a
+tabular file.
+"""
+import sys
+import os
+import subprocess
+import tempfile
+
+if "-v" in sys.argv or "--version" in sys.argv:
+ # Galaxy seems to invert the order of the two lines
+ print("BAM coverage statistics v0.0.4 (using samtools)")
+ cmd = "samtools 2>&1 | grep -i ^Version"
+ sys.exit(os.system(cmd))
+
+# TODO - Proper command line API
+usage = """Requires 4 arguments: BAM, BAI, tabular filename, samtools-style region
+
+For ease of use, you can use a minus sign as the BAI filename which will use the BAM
+filename with the suffix .bai added to it. Example using one of the test-data files:
+
+$ count_roi_variants.py "SRR639755_mito_pairs_vs_NC_010642_clc.bam" "-" "counts.txt" "gi|187250362|ref|NC_010642.1|:1695-1725"
+Counted 3 variants from 16 reads spanning gi|187250362|ref|NC_010642.1|:1695-1725
+$ cat counts.txt
+Variant Count Percentage
+AGCCCATGAGATGGGAAGCAATGGGCTACA 14 87.50
+AGCCCATGAGATGGGAAGCAATGGGCTACG 1 6.25
+AGCGCATGAGATGGGAAGCAATGGGCTACG 1 6.25
+"""
+if len(sys.argv) == 5:
+ bam_filename, bai_filename, tabular_filename, region = sys.argv[1:]
+else:
+ sys.exit(usage)
+
+if not os.path.isfile(bam_filename):
+ sys.exit("Input BAM file not found: %s" % bam_filename)
+if bai_filename == "-":
+ # Make this optional for ease of use at the command line by hand:
+ if os.path.isfile(bam_filename + ".bai"):
+ bai_filename = bam_filename + ".bai"
+if not os.path.isfile(bai_filename):
+ if bai_filename == "None":
+ sys.exit("Error: Galaxy did not index your BAM file")
+ sys.exit("Input BAI file not found: %s" % bai_filename)
+
+try:
+ # Sanity check we have "ref:start-end" to give clear error message
+ # Note can have semi-colon in the reference name
+ # Note can have thousand separator commas in the start/end
+ ref, start_end = region.rsplit(":", 1)
+ start, end = start_end.split("-")
+ start = int(start.replace(",", ""))
+ end = int(end.replace(",", ""))
+except ValueError:
+ sys.exit("Bad argument for region: %r" % region)
+
+
+# Assign sensible names with real extensions, and setup symlinks:
+tmp_dir = tempfile.mkdtemp()
+bam_file = os.path.join(tmp_dir, "temp.bam")
+bai_file = os.path.join(tmp_dir, "temp.bam.bai")
+os.symlink(os.path.abspath(bam_filename), bam_file)
+os.symlink(os.path.abspath(bai_filename), bai_file)
+assert os.path.isfile(bam_file), bam_file
+assert os.path.isfile(bai_file), bai_file
+assert os.path.isfile(bam_file + ".bai"), bam_file
+
+
+def clean_up():
+ os.remove(bam_file)
+ os.remove(bai_file)
+ os.rmdir(tmp_dir)
+
+
+def decode_cigar(cigar):
+ """Returns a list of 2-tuples, integer count and operator char."""
+ count = ""
+ answer = []
+ for letter in cigar:
+ if letter.isdigit():
+ count += letter # string addition
+ elif letter in "MIDNSHP=X":
+ answer.append((int(count), letter))
+ count = ""
+ else:
+ raise ValueError("Invalid character %s in CIGAR %s" % (letter, cigar))
+ return answer
+
+
+assert decode_cigar("14S15M1P1D3P54M1D34M5S") == [(14, 'S'), (15, 'M'), (1, 'P'), (1, 'D'), (3, 'P'), (54, 'M'), (1, 'D'), (34, 'M'), (5, 'S')]
+
+
+def align_len(cigar_ops):
+ """Sums the CIGAR M/=/X/D/N operators."""
+ return sum(count for count, op in cigar_ops if op in "M=XDN")
+
+
+def expand_cigar(seq, cigar_ops):
+ """Yields (ref_offset, seq_base) pairs."""
+ ref_offset = 0
+ seq_offset = 0
+ for count, op in cigar_ops:
+ if op in "MX=":
+ for (i, base) in enumerate(seq[seq_offset:seq_offset + count]):
+ yield ref_offset + i, base
+ ref_offset += count
+ seq_offset += count
+ elif op == "I":
+ # Give them all an in-between reference position
+ # (Python lets us mix integers and floats, wouldn't work in C)
+ for (i, base) in enumerate(seq[seq_offset:seq_offset + count]):
+ yield ref_offset - 0.5, base
+ # Does not change ref_offset
+ seq_offset += count
+ elif op in "DN":
+ # Deletion/skip,
+ # TODO: Option to return gap characters
+ # for i in range(count):
+ # yield ref_offset + i, "-"
+ ref_offset += count
+ elif op == "S":
+ # Soft clipping, silently discard the bases (OK?)
+ seq_offset += count
+ elif op in "HP":
+ # Hard trimming or pad, can ignore
+ # TODO: Yield "-" if later offer to report deletions
+ pass
+ else:
+ raise NotImplementedError("Unexpected CIGAR operator %s" % op)
+
+assert list(expand_cigar("ACGT", decode_cigar("4M"))) == [(0, "A"), (1, "C"), (2, "G"), (3, "T")]
+assert list(expand_cigar("ACGT", decode_cigar("2=1X1="))) == [(0, "A"), (1, "C"), (2, "G"), (3, "T")]
+assert list(expand_cigar("ACGT", decode_cigar("2M1D2M"))) == [(0, "A"), (1, "C"), (3, "G"), (4, "T")]
+assert list(expand_cigar("ACtGT", decode_cigar("2M1I2M"))) == [(0, "A"), (1, "C"), (1.5, "t"), (2, "G"), (3, "T")]
+assert list(expand_cigar("tACGT", decode_cigar("1I4M"))) == [(-0.5, 't'), (0, 'A'), (1, 'C'), (2, 'G'), (3, 'T')]
+assert list(expand_cigar("ACGTt", decode_cigar("4M1I"))) == [(0, 'A'), (1, 'C'), (2, 'G'), (3, 'T'), (3.5, 't')]
+assert list(expand_cigar("AAAAGGGGTTTT", decode_cigar("12M"))) == [(0, 'A'), (1, 'A'), (2, 'A'), (3, 'A'), (4, 'G'), (5, 'G'), (6, 'G'), (7, 'G'), (8, 'T'), (9, 'T'), (10, 'T'), (11, 'T')]
+assert list(expand_cigar("AAAAcGGGGTTTT", decode_cigar("4M1I8M"))) == [(0, 'A'), (1, 'A'), (2, 'A'), (3, 'A'), (3.5, 'c'), (4, 'G'), (5, 'G'), (6, 'G'), (7, 'G'), (8, 'T'), (9, 'T'), (10, 'T'), (11, 'T')]
+assert list(expand_cigar("AAAAGGGGcTTTT", decode_cigar("8M1I4M"))) == [(0, 'A'), (1, 'A'), (2, 'A'), (3, 'A'), (4, 'G'), (5, 'G'), (6, 'G'), (7, 'G'), (7.5, "c"), (8, 'T'), (9, 'T'), (10, 'T'), (11, 'T')]
+assert list(expand_cigar("AAAAcGGGGcTTTT", decode_cigar("4M1I4M1I4M"))) == [(0, 'A'), (1, 'A'), (2, 'A'), (3, 'A'), (3.5, 'c'), (4, 'G'), (5, 'G'), (6, 'G'), (7, 'G'), (7.5, 'c'), (8, 'T'), (9, 'T'), (10, 'T'), (11, 'T')]
+
+
+def get_roi(seq, cigar_ops, start, end):
+ """Extract region of seq mapping to the ROI.
+
+ Expect start and end to be zero based Python style end points.
+
+ i.e. The ROI relative to the mapping start recorded in the POS field.
+ Will return part of the SAM/BAM value SEQ based on interpretting the
+ passed CIGAR operators.
+ """
+ if len(cigar_ops) == 1 and cigar_ops[0][1] in "M=X":
+ # Easy case, note start/end/pos all one-based
+ assert cigar_ops[0][0] == len(seq)
+ return seq[start:end]
+ # Would use "start <= i < end" if they were all integers, but
+ # want to exclude e.g. 3.5 and 7.5 when given start 4 and end 8.
+ return "".join(base for i, base in expand_cigar(seq, cigar_ops) if start <= i <= end - 1)
+
+assert "GGGG" == get_roi("AAAAGGGGTTTT", decode_cigar("12M"), 4, 8)
+assert "GGGG" == get_roi("AAAAcGGGGTTTT", decode_cigar("4M1I8M"), 4, 8)
+assert "GGGG" == get_roi("AAAAGGGGcTTTT", decode_cigar("8M1I4M"), 4, 8)
+assert "GGGG" == get_roi("AAAAcGGGGcTTTT", decode_cigar("4M1I4M1I4M"), 4, 8)
+assert "GGaGG" == get_roi("AAAAGGaGGTTTT", decode_cigar("6M1I6M"), 4, 8)
+assert "GGGgA" == get_roi("AAAAGGGgATTTT", decode_cigar("7M1I5M"), 4, 8)
+
+
+def count_region():
+ # Could recreate the region string (with no commas in start/end)?
+ # region = "%s:%i-%i" % (ref, start, end)
+
+ tally = dict()
+
+ # Call samtools view, don't need header so no -h added.
+ # Only want mapped reads, thus flag filter -F 4.
+ child = subprocess.Popen(["samtools", "view", "-F", "4", bam_file, region],
+ stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ for line in child.stdout:
+ assert line[0] != "@", "Got unexpected SAM header line: %s" % line
+ qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, rest = line.split("\t", 10)
+ pos = int(pos) # one-based
+ if start < pos:
+ # Does not span the ROI
+ continue
+ cigar_ops = decode_cigar(cigar)
+ if pos + align_len(cigar_ops) - 1 < end:
+ # Does not span the ROI
+ continue
+ # All of start/end/pos are currently one-based, making offsets Python style....
+ roi_seq = get_roi(seq, cigar_ops, start - pos, end - pos + 1)
+ assert roi_seq, "Error, empty ROI sequence for: %s" % line
+ try:
+ tally[roi_seq] += 1
+ except KeyError:
+ tally[roi_seq] = 1
+
+ stderr = child.stderr.read()
+ child.stdout.close()
+ child.stderr.close()
+ return_code = child.wait()
+ if return_code:
+ sys.exit("Got return code %i from samtools view" % return_code)
+ elif "specifies an unknown reference name. Continue anyway." in stderr:
+ sys.exit(stderr.strip() + "\n\nERROR: samtools did not recognise the region requested, can't count any variants.")
+
+ return tally
+
+
+def record_counts():
+
+ tally = count_region()
+ total = sum(tally.values())
+
+ # Using negative count to get sort with highest count first,
+ # while tie-breaking by the ROI sequence alphabetically.
+ table = sorted((-count, roi_seq) for (roi_seq, count) in tally.items())
+ del tally
+
+ with open(tabular_filename, "w") as handle:
+ handle.write("Variant\tCount\tPercentage\n")
+ for count, roi_seq in table:
+ handle.write("%s\t%i\t%0.2f\n" % (roi_seq, -count, -count * 100.0 / total))
+
+ print("Counted %i variants from %i reads spanning %s" % (len(table), total, region))
+
+
+# Run it!
+record_counts()
+# Remove the temp symlinks and files:
+clean_up()
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/count_roi_variants.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/count_roi_variants.xml Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,112 @@
+
+ using samtools view
+
+ samtools
+ samtools
+
+
+
+
+
+
+ count_roi_variants.py --version
+ count_roi_variants.py "$input_bam" "${input_bam.metadata.bam_index}" "$out_tabular" "$region"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+This tool runs the command ``samtools view`` from the SAMtools toolkit, getting
+all the reads in your BAM file mapped to the given region of interest (ROI).
+It then counts all the different sequence variants in reads spanning that ROI,
+which are returned as a tab-separated table.
+
+Reads mapped to the ROI but not spanning it completely are ignored.
+
+Input is a sorted and indexed BAM file, the output is tabular. The first column
+is the observed sequence variants within the ROI, the second column is the number
+of reads with that sequence, and the third column gives this as a percentage of
+the reads spanning the ROI.
+
+====== =================================================================================
+Column Description
+------ ---------------------------------------------------------------------------------
+ 1 Sequence variant from ROI
+ 2 Number of reads with that sequence variant
+ 3 Percentage of reads with that sequence variant (2 dp)
+====== =================================================================================
+
+
+**Citation**
+
+If you use this Galaxy tool in work leading to a scientific publication please
+cite:
+
+Heng Li et al (2009). The Sequence Alignment/Map format and SAMtools.
+Bioinformatics 25(16), 2078-9.
+http://dx.doi.org/10.1093/bioinformatics/btp352
+
+Peter J.A. Cock (2016), Count sequence variants in region of interest in BAM file.
+http://toolshed.g2.bx.psu.edu/view/peterjc/count_roi_variants
+
+This wrapper is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/count_roi_variants
+
+
+ 10.1093/bioinformatics/btp352
+
+
diff -r 000000000000 -r 95efbdb72961 tools/count_roi_variants/tool_dependencies.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/count_roi_variants/tool_dependencies.xml Wed Feb 01 07:10:26 2017 -0500
@@ -0,0 +1,6 @@
+
+
+
+
+
+