Mercurial > repos > iuc > duplex_family_size_distribution

--- a/fsd.py	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd.py	Wed Dec 04 16:32:40 2019 -0500
@@ -25,7 +25,7 @@

 def readFileReferenceFree(file):
     with open(file, 'r') as dest_f:
-        data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype='string')
+        data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype=str)
         return(data_array)


@@ -274,7 +274,7 @@
         fig.suptitle('Family Size Distribution (FSD) based on families', fontsize=14)
         ax = fig.add_subplot(1, 1, 1)
         ticks = numpy.arange(1, 22, 1)
-        ticks1 = map(str, ticks)
+        ticks1 = [str(_) for _ in ticks]
         ticks1[len(ticks1) - 1] = ">20"
         ax.set_xticks([], [])
         if rel_freq:
@@ -299,7 +299,7 @@
         fig2.suptitle('Family Size Distribution (FSD) based on PE reads', fontsize=14)
         ax2 = fig2.add_subplot(1, 1, 1)
         ticks = numpy.arange(1, 22)
-        ticks1 = map(str, ticks)
+        ticks1 = [str(_) for _ in ticks]
         ticks1[len(ticks1) - 1] = ">20"
         reads = []
         reads_rel = []
@@ -468,7 +468,7 @@

             # tick labels of x axis
             ticks = numpy.arange(1, 22, 1)
-            ticks1 = map(str, ticks)
+            ticks1 = [str(_) for _ in ticks]
             ticks1[len(ticks1) - 1] = ">20"
             plt.xticks(numpy.array(ticks), ticks1)
             singl = len(data_o[data_o == 1])
@@ -549,7 +549,7 @@
             fig3.suptitle("{}: FSD based on PE reads".format(name_file), fontsize=14)
             ax2 = fig3.add_subplot(1, 1, 1)
             ticks = numpy.arange(1, 22)
-            ticks1 = map(str, ticks)
+            ticks1 = [str(_) for _ in ticks]
             ticks1[len(ticks1) - 1] = ">20"
             reads = []
             reads_rel = []
--- a/fsd.xml	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd.xml	Wed Dec 04 16:32:40 2019 -0500
@@ -1,21 +1,19 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<tool id="fsd" name="FSD:" version="1.0.0" profile="19.01">
+<tool id="fsd" name="FSD:" version="1.0.2" profile="19.01">
     <description>Family Size Distribution of duplex sequencing tags</description>
     <macros>
         <import>fsd_macros.xml</import>
     </macros>
-    <requirements>
-        <requirement type="package" version="2.7">python</requirement>
-        <requirement type="package" version="1.4.0">matplotlib</requirement>
-    </requirements>
+    <expand macro="requirements" />
     <command>
 python '$__tool_directory__/fsd.py'
 #for $i, $s in enumerate($series)
-    --inputFile${i + 1} '${s.file}'
-    --inputName${i + 1} '${s.file.element_identifier}'
+    #set $file=$s.file
+    --inputFile${i + 1} '${file}'
+    --inputName${i + 1} @ESCAPE_IDENTIFIER@
 #end for
-$log_axis
-$rel_freq
+$log_axis
+$rel_freq
 --output_pdf '$output_pdf'
 --output_tabular '$output_tabular'
     </command>
@@ -63,18 +61,18 @@
     </tests>
     <help><![CDATA[
 **What it does**
-
+
 This tool provides a computationally very fast insight into the distribution of the family sizes of ALL tags from a Duplex Sequencing experiment (DS) and gives a first assessment of the distribution of PE-reads in families with 1 member up to >20 members. This information is very useful in early decision steps of the analysis parameters, such as the minimum number of PE-reads to build the single stranded consensus sequence (SSCS). Moreover, this tool can compare several datasets or different steps in the analysis pipeline to monitor data loss or gain (e.g families re-united with barcode correction tool from the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_). In an extension of this tool, each family is stratified into SSCS (ab/ba) and DSC and visualizes the allocation of DCSs respective to SSCS-ab and SSCS-ba. This is quite handy to better understand the relationship of SSCS to DCS per family and identify sources of bias (e.g. more SSCS to DCS in a particular family size, or more forward ab than reverse ba reads).
-
+
 **Input**
-
-This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands. At least one input file must be provided and up to 4 different datasets can be compared with this tool. All input files are produced in the same way (see section **How to generate the input**)::
+
+This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands. At least one input file must be provided and up to 4 different datasets can be compared with this tool. All input files are produced in the same way (see section **How to generate the input**)::

  1 2                        3
  -----------------------------
  1 AAAAAAAAAAAAAAAAATGGTATG ba
  3 AAAAAAAAAAAAAATGGTATGGAC ab
-
+
 .. class:: infomark

 **How to generate the input**
@@ -90,7 +88,7 @@

 We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAATAGCTCGAT ab
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -99,9 +97,9 @@

 Next, the tags are sorted in ascending or descending order using the **Sort** tool::

- 1                        2
+ 1                        2
  ---------------------------
- AAAAAAAAAAAAAAAAATGGTATG ba
+ AAAAAAAAAAAAAAAAATGGTATG ba
  AAAAAAAAAAAAAAATAGCTCGAT ab
  AAAAAAAAAAAAAAATAGCTCGAT ab
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -114,15 +112,15 @@
  3 AAAAAAAAAAAAAATGGTATGGAC ab

 These data can now be used in this tool.
-
+
 **Output**
-
+
 The output is a PDF file with a plot and a summary of the data in the plot. The tool analyzes the family size distribution, that is the number of PE-reads per tag or family. This information is presented graphically as a histogram. The output can interpreted the following way:

 The **first part** of the output from this tool is “FSD after tags” and “FSD after PE reads” depending whether the data is compared to the total number of unique tags/families or the total number of PE-reads. The data is presented in a histogram showing the frequency of family sizes (FS) ranging from FS=1 to FS>20 members in a step-wise manner. Multiple datasets or steps of an analysis can be merged in one histogram (e.g. tags before and after barcode correction). The family size distribution gives the user an insight into the allocation of DCS in the libary; information that can be used to decide on the optimal family size parameter for building SSCSs.

 The **second part** of the output shows the family size distribution (FSD) for each of the datasets in more detail. The tags are categorized based on whether they form DCS (duplex = complementary tags in the ab and ba direction) or form only SSCS-ab or SSCS-ba with no matching complement.
-]]>
+]]>
 </help>
  <expand macro="citation" />
 </tool>
--- a/fsd_beforevsafter.py	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd_beforevsafter.py	Wed Dec 04 16:32:40 2019 -0500
@@ -27,7 +27,7 @@

 def readFileReferenceFree(file, delim):
     with open(file, 'r') as dest_f:
-        data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype='string')
+        data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype=str)
         return data_array


@@ -202,11 +202,11 @@

         counts = plt.hist(list1, bins=range(-1, maximumX + 1), stacked=False, label=labels, color=colors, align="left", alpha=1, edgecolor="black", linewidth=1)
         ticks = numpy.arange(0, maximumX, 1)
-        ticks1 = map(str, ticks)
+        ticks1 = [str(_) for _ in ticks]
         ticks1[len(ticks1) - 1] = ">20"
         plt.xticks(numpy.array(ticks), ticks1)
         if ref_genome is not None:
-            count = numpy.array([v for k, v in sorted(Counter(quant_ab_ref).iteritems())])  # count all family sizes from all ab strands
+            count = numpy.array([v for k, v in sorted(Counter(quant_ab_ref).items())])  # count all family sizes from all ab strands
             legend = "max. family size:\nabsolute frequency:\nrelative frequency:\n\ntotal nr. of reads:\n(before SSCS building)"
             plt.text(0.1, 0.085, legend, size=11, transform=plt.gcf().transFigure)

@@ -216,7 +216,7 @@
             plt.text(0.35, 0.105, legend, size=11, transform=plt.gcf().transFigure)

             count2 = numpy.array(
-                [v for k, v in sorted(Counter(quant_ba_ref).iteritems())])  # count all family sizes from all ba strands
+                [v for k, v in sorted(Counter(quant_ba_ref).items())])  # count all family sizes from all ba strands
             legend = "BA\n{}\n{}\n{:.5f}" \
                 .format(max(quant_ba_ref), count2[len(count2) - 1], float(count2[len(count2) - 1]) / sum(count2))
             plt.text(0.45, 0.1475, legend, size=11, transform=plt.gcf().transFigure)
--- a/fsd_beforevsafter.xml	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd_beforevsafter.xml	Wed Dec 04 16:32:40 2019 -0500
@@ -1,19 +1,17 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<tool id="fsd_beforevsafter" name="FSD Before/After:" version="1.0.0" profile="19.01">
+<tool id="fsd_beforevsafter" name="FSD Before/After:" version="1.0.2" profile="19.01">
     <description>Family Size Distribution of duplex sequencing tags during Du Novo analysis</description>
     <macros>
         <import>fsd_macros.xml</import>
     </macros>
-    <requirements>
-        <requirement type="package" version="2.7">python</requirement>
-        <requirement type="package" version="1.4">matplotlib</requirement>
+    <expand macro="requirements">
         <requirement type="package" version="1.71">biopython</requirement>
         <requirement type="package" version="0.15">pysam</requirement>
-    </requirements>
+    </expand>
     <command>
-python '$__tool_directory__/fsd_beforevsafter.py'
---inputFile_SSCS '$file1'
---inputName1 '$file1.element_identifier'
+python '$__tool_directory__/fsd_beforevsafter.py'
+--inputFile_SSCS '$file'
+--inputName1 @ESCAPE_IDENTIFIER@
 --makeDCS '$makeDCS'
 #if $afterTrimming:
     --afterTrimming '$afterTrimming'
@@ -21,11 +19,11 @@
 #if $bamFile:
 --bamFile '$bamFile'
 #end if
---output_pdf '$output_pdf'
+--output_pdf '$output_pdf'
 --output_tabular '$output_tabular'
     </command>
     <inputs>
-        <param name="file1" type="data" format="tabular" label="Input tags of SSCSs" optional="false" help="This dataset is generated by post-processing of the output from 'Make Families' or 'Correct Barcodes' tool by extracting the first two columns, sorting the tags (column 1) and adding the counts of unique occurencies of each tag. See Help section below for a detailed explanation."/>
+        <param name="file" type="data" format="tabular" label="Input tags of SSCSs" optional="false" help="This dataset is generated by post-processing of the output from 'Make Families' or 'Correct Barcodes' tool by extracting the first two columns, sorting the tags (column 1) and adding the counts of unique occurencies of each tag. See Help section below for a detailed explanation."/>
         <param name="makeDCS" type="data" format="fasta" label="Input tags after making DCSs" help="Input in fasta format with the tags of the reads, which were aligned to DCSs. This file is produced by the 'Make consensus reads' tool."/>
         <param name="afterTrimming" type="data" format="fasta" optional="true" label="Input tags after trimming" help="Input in fasta format with the tags of the reads, which were not filtered out after trimming. This file is produced by the 'Sequence Content Trimmer'."/>
         <param name="bamFile" type="data" format="bam" optional="true" label="Input tags aligned to the reference genome" help="Input in BAM format with the reads that were aligned to the reference genome."/>
@@ -36,7 +34,7 @@
     </outputs>
     <tests>
         <test>
-            <param name="file1" value="fsd_ba_data.tab"/>
+            <param name="file" value="fsd_ba_data.tab"/>
             <param name="makeDCS" value="fsd_ba_DCS.fna"/>
             <param name="afterTrimming" value="fsd_ba_trimmed.fna"/>
             <param name="bamFile" value="fsd_ba.bam"/>
@@ -47,18 +45,18 @@
     <help><![CDATA[

 **What it does**
-
+
 This tool will create a distribution of family sizes from tags of various steps of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_.

 **Input**
-
-**Dataset 1:** This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.::
+
+**Dataset 1:** This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands.::

  1 2                        3
  -----------------------------
  1 AAAAAAAAAAAAAAAAATGGTATG ba
  3 AAAAAAAAAAAAAATGGTATGGAC ab
-
+
 .. class:: infomark

 **How to generate the input**
@@ -74,7 +72,7 @@

 We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAATAGCTCGAT ab
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -83,7 +81,7 @@

 Next, the tags are sorted in ascending or descending order using the **Sort** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAAAATGGTATG ba
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -98,25 +96,25 @@
  3 AAAAAAAAAAAAAATGGTATGGAC ab

 These data can now be used in this tool.
-
+
 **Dataset 2:** A fasta file is required with all tags and the associated family size of both strands (forward and reverse) in the header and the read itself in the next line. This input file can be obtained by the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_ with the tool **Make consensus reads**.

 **Dataset 3 (optional):** In addition, the fasta file with all tags, which were not filtered out after trimming, can be included. This file can also be obtained by the `Du Novo Analysis Pipeline` with the tool **Sequence Content Trimmer**.

 For both input files (dataset 2 and 3), only the DCSs of one mate are necessary these tools give information on both mates in the output file), since both files have the same tags and family sizes, but different DCSs, which are not required in this tool::
-
+
  >AAAAAAAAATAGATCATAGACTCT 7-10
  CTAGACTCACTGGCGTTACTGACTGCGAGACCCTCCACGTG
  >AAAAAAAAGGCAGAAGATATACGC 11-3
  CNCNGGCCCCCCGCTCCGTGCACAGACGNNGCNACTGACAA
-
+
 **Dataset 4 (optional):** BAM file of the aligned reads. This file can be obtained by the tool `Map with BWA-MEM <https://arxiv.org/abs/1303.3997>`_.

 **Output**
-
+
 The output is a PDF file with a plot and a summary of the data of the plots. The tool compares various datasets from the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_ and helps in decision making of various parameters (e.g family size, minimum read length, etc). For example: Re-running trimming with different parameters allows to recover reads that would be lost due to stringent filtering by read length. This tool also allows to assess reads on target. The tool extracts the tags of reads and their family sizes before SSCS building, after DCS building, after trimming and finally after the alignment to the reference genome. In the plot, the family sizes for both SSCS-ab and SSCS-ba are shown; whereas the summary represents only counts either of SSCS-ab or of SSCS-ba.

-]]>
+]]>
     </help>
     <expand macro="citation" />
 </tool>
--- a/fsd_macros.xml	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd_macros.xml	Wed Dec 04 16:32:40 2019 -0500
@@ -1,4 +1,5 @@
 <macros>
+    <token name="@ESCAPE_IDENTIFIER@"><![CDATA[#silent import re#'#echo re.sub('[^\s\w\-\.]', '_', str($file.element_identifier))#']]></token>
     <xml name="citation">
     <citations>
         <citation type="bibtex">
@@ -9,5 +10,12 @@
          }
         </citation>
     </citations>
-</xml>
-</macros>
\ No newline at end of file
+    </xml>
+
+    <xml name="requirements">
+    <requirements>
+        <requirement type="package" version="3.1.2">matplotlib</requirement>
+        <yield/>
+    </requirements>
+    </xml>
+</macros>
--- a/fsd_regions.py	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd_regions.py	Wed Dec 04 16:32:40 2019 -0500
@@ -1,270 +1,270 @@
-#!/usr/bin/env python
-
-# Family size distribution of tags which were aligned to the reference genome
-#
-# Author: Monika Heinzl & Gundula Povysil, Johannes-Kepler University Linz (Austria)
-# Contact: monika.heinzl@edumail.at
-#
-# Takes at least one TABULAR file with tags before the alignment to the SSCS,
-# a BAM file with tags of reads that overlap the regions of the reference genome and
-# an optional BED file with chromosome, start and stop position of the regions as input.
-# The program produces a plot which shows the distribution of family sizes of the tags from the input files and
-# a tabular file with the data of the plot.
-
-# USAGE: python FSD_regions.py --inputFile filenameSSCS --inputName1 filenameSSCS
-# --bamFile DCSbamFile --rangesFile BEDfile --output_tabular outptufile_name_tabular
-# --output_pdf outputfile_name_pdf
-
-import argparse
-import collections
-import re
-import sys
-
-import matplotlib.pyplot as plt
-import numpy as np
-import pysam
-from matplotlib.backends.backend_pdf import PdfPages
-
-plt.switch_backend('agg')
-
-
-def readFileReferenceFree(file, delim):
-    with open(file, 'r') as dest_f:
-        data_array = np.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype='string')
-        return data_array
-
-
-def make_argparser():
-    parser = argparse.ArgumentParser(description='Family Size Distribution of tags which were aligned to regions of the reference genome')
-    parser.add_argument('--inputFile', help='Tabular File with three columns: ab or ba, tag and family size.')
-    parser.add_argument('--inputName1')
-    parser.add_argument('--bamFile', help='BAM file with aligned reads.')
-    parser.add_argument('--rangesFile', default=None, help='BED file with chromosome, start and stop positions.')
-    parser.add_argument('--output_pdf', default="data.pdf", type=str, help='Name of the pdf and tabular file.')
-    parser.add_argument('--output_tabular', default="data.tabular", type=str, help='Name of the pdf and tabular file.')
-    return parser
-
-
-def compare_read_families_refGenome(argv):
-    parser = make_argparser()
-    args = parser.parse_args(argv[1:])
-
-    firstFile = args.inputFile
-    name1 = args.inputName1
-    name1 = name1.split(".tabular")[0]
-    bamFile = args.bamFile
-
-    rangesFile = args.rangesFile
-    title_file = args.output_pdf
-    title_file2 = args.output_tabular
-    sep = "\t"
-
-    with open(title_file2, "w") as output_file, PdfPages(title_file) as pdf:
-        data_array = readFileReferenceFree(firstFile, "\t")
-        pysam.index(bamFile)
-
-        bam = pysam.AlignmentFile(bamFile, "rb")
-        qname_dict = collections.OrderedDict()
-
-        if rangesFile is not None:
-            with open(rangesFile, 'r') as regs:
-                range_array = np.genfromtxt(regs, skip_header=0, delimiter='\t', comments='#', dtype='string')
-
-            if range_array.ndim == 0:
-                print("Error: file has 0 lines")
-                exit(2)
-
-            if range_array.ndim == 1:
-                chrList = range_array[0]
-                start_posList = range_array[1].astype(int)
-                stop_posList = range_array[2].astype(int)
-                chrList = [chrList.tolist()]
-                start_posList = [start_posList.tolist()]
-                stop_posList = [stop_posList.tolist()]
-            else:
-                chrList = range_array[:, 0]
-                start_posList = range_array[:, 1].astype(int)
-                stop_posList = range_array[:, 2].astype(int)
-
-            if len(start_posList) != len(stop_posList):
-                print("start_positions and end_positions do not have the same length")
-                exit(3)
-
-            chrList = np.array(chrList)
-            start_posList = np.array(start_posList).astype(int)
-            stop_posList = np.array(stop_posList).astype(int)
-            for chr, start_pos, stop_pos in zip(chrList, start_posList, stop_posList):
-                chr_start_stop = "{}_{}_{}".format(chr, start_pos, stop_pos)
-                qname_dict[chr_start_stop] = []
-                for read in bam.fetch(chr.tobytes(), start_pos, stop_pos):
-                    if not read.is_unmapped:
-                        if re.search('_', read.query_name):
-                            tags = re.split('_', read.query_name)[0]
-                        else:
-                            tags = read.query_name
-                        qname_dict[chr_start_stop].append(tags)
-
-        else:
-            for read in bam.fetch():
-                if not read.is_unmapped:
-                    if re.search(r'_', read.query_name):
-                        tags = re.split('_', read.query_name)[0]
-                    else:
-                        tags = read.query_name
-
-                    if read.reference_name not in qname_dict.keys():
-                        qname_dict[read.reference_name] = [tags]
-                    else:
-                        qname_dict[read.reference_name].append(tags)
-
-        seq = np.array(data_array[:, 1])
-        tags = np.array(data_array[:, 2])
-        quant = np.array(data_array[:, 0]).astype(int)
-        group = np.array(qname_dict.keys())
-
-        all_ab = seq[np.where(tags == "ab")[0]]
-        all_ba = seq[np.where(tags == "ba")[0]]
-        quant_ab = quant[np.where(tags == "ab")[0]]
-        quant_ba = quant[np.where(tags == "ba")[0]]
-
-        seqDic_ab = dict(zip(all_ab, quant_ab))
-        seqDic_ba = dict(zip(all_ba, quant_ba))
-
-        lst_ab = []
-        lst_ba = []
-        quantAfterRegion = []
-        length_regions = 0
-        for i in group:
-            lst_ab_r = []
-            lst_ba_r = []
-            seq_mut = qname_dict[i]
-            if rangesFile is None:
-                seq_mut, seqMut_index = np.unique(np.array(seq_mut), return_index=True)
-            length_regions = length_regions + len(seq_mut) * 2
-            for r in seq_mut:
-                count_ab = seqDic_ab.get(r)
-                count_ba = seqDic_ba.get(r)
-                lst_ab_r.append(count_ab)
-                lst_ab.append(count_ab)
-                lst_ba_r.append(count_ba)
-                lst_ba.append(count_ba)
-
-            dataAB = np.array(lst_ab_r)
-            dataBA = np.array(lst_ba_r)
-            bigFamilies = np.where(dataAB > 20)[0]
-            dataAB[bigFamilies] = 22
-            bigFamilies = np.where(dataBA > 20)[0]
-            dataBA[bigFamilies] = 22
-
-            quantAll = np.concatenate((dataAB, dataBA))
-            quantAfterRegion.append(quantAll)
-
-        quant_ab = np.array(lst_ab)
-        quant_ba = np.array(lst_ba)
-
-        maximumX = np.amax(np.concatenate(quantAfterRegion))
-        minimumX = np.amin(np.concatenate(quantAfterRegion))
-
-        # PLOT
-        plt.rc('figure', figsize=(11.69, 8.27))  # A4 format
-        plt.rcParams['axes.facecolor'] = "E0E0E0"  # grey background color
-        plt.rcParams['xtick.labelsize'] = 14
-        plt.rcParams['ytick.labelsize'] = 14
-        plt.rcParams['patch.edgecolor'] = "black"
-        fig = plt.figure()
-        plt.subplots_adjust(bottom=0.3)
-
-        colors = ["#6E6E6E", "#0431B4", "#5FB404", "#B40431", "#F4FA58", "#DF7401", "#81DAF5"]
-
-        col = []
-        for i in range(0, len(group)):
-            col.append(colors[i])
-
-        counts = plt.hist(quantAfterRegion, bins=range(minimumX, maximumX + 1), stacked=False, label=group,
-                          align="left", alpha=1, color=col, edgecolor="black", linewidth=1)
-        ticks = np.arange(minimumX - 1, maximumX, 1)
-
-        ticks1 = map(str, ticks)
-        ticks1[len(ticks1) - 1] = ">20"
-        plt.xticks(np.array(ticks), ticks1)
-        count = np.bincount(map(int, quant_ab))  # original counts
-
-        legend = "max. family size:\nabsolute frequency:\nrelative frequency:\n\ntotal nr. of reads:\n(before SSCS building)"
-        plt.text(0.15, 0.085, legend, size=11, transform=plt.gcf().transFigure)
-
-        legend = "AB\n{}\n{}\n{:.5f}\n\n{:,}".format(max(map(int, quant_ab)), count[len(count) - 1], float(count[len(count) - 1]) / sum(count), sum(np.array(data_array[:, 0]).astype(int)))
-        plt.text(0.35, 0.105, legend, size=11, transform=plt.gcf().transFigure)
-
-        count2 = np.bincount(map(int, quant_ba))  # original counts
-
-        legend = "BA\n{}\n{}\n{:.5f}" \
-            .format(max(map(int, quant_ba)), count2[len(count2) - 1], float(count2[len(count2) - 1]) / sum(count2))
-        plt.text(0.45, 0.1475, legend, size=11, transform=plt.gcf().transFigure)
-
-        plt.text(0.55, 0.2125, "total nr. of tags:", size=11, transform=plt.gcf().transFigure)
-        plt.text(0.8, 0.2125, "{:,} ({:,})".format(length_regions, length_regions / 2), size=11,
-                 transform=plt.gcf().transFigure)
-
-        legend4 = "* In the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n"
-        plt.text(0.1, 0.01, legend4, size=11, transform=plt.gcf().transFigure)
-
-        space = 0
-        for i, count in zip(group, quantAfterRegion):
-            plt.text(0.55, 0.15 - space, "{}:\n".format(i), size=11, transform=plt.gcf().transFigure)
-            plt.text(0.8, 0.15 - space, "{:,}\n".format(len(count) / 2), size=11, transform=plt.gcf().transFigure)
-            space = space + 0.02
-
-        plt.legend(loc='upper right', fontsize=14, bbox_to_anchor=(0.9, 1), frameon=True)
-        plt.xlabel("Family size", fontsize=14)
-        plt.ylabel("Absolute Frequency", fontsize=14)
-        plt.grid(b=True, which="major", color="#424242", linestyle=":")
-        plt.margins(0.01, None)
-
-        pdf.savefig(fig, bbox_inch="tight")
-        plt.close()
-
-        output_file.write("Dataset:{}{}\n".format(sep, name1))
-        output_file.write("{}AB{}BA\n".format(sep, sep))
-        output_file.write("max. family size:{}{}{}{}\n".format(sep, max(map(int, quant_ab)), sep, max(map(int, quant_ba))))
-        output_file.write("absolute frequency:{}{}{}{}\n".format(sep, count[len(count) - 1], sep, count2[len(count2) - 1]))
-        output_file.write("relative frequency:{}{:.3f}{}{:.3f}\n\n".format(sep, float(count[len(count) - 1]) / sum(count), sep, float(count2[len(count2) - 1]) / sum(count2)))
-        output_file.write("total nr. of reads{}{}\n".format(sep, sum(np.array(data_array[:, 0]).astype(int))))
-        output_file.write("total nr. of tags{}{} ({})\n".format(sep, length_regions, length_regions / 2))
-        output_file.write("\n\nValues from family size distribution\n")
-        output_file.write("{}".format(sep))
-        for i in group:
-            output_file.write("{}{}".format(i, sep))
-        output_file.write("\n")
-
-        j = 0
-        for fs in counts[1][0:len(counts[1]) - 1]:
-            if fs == 21:
-                fs = ">20"
-            else:
-                fs = "={}".format(fs)
-            output_file.write("FS{}{}".format(fs, sep))
-            if len(group) == 1:
-                output_file.write("{}{}".format(int(counts[0][j]), sep))
-            else:
-                for n in range(len(group)):
-                    output_file.write("{}{}".format(int(counts[0][n][j]), sep))
-            output_file.write("\n")
-            j += 1
-        output_file.write("sum{}".format(sep))
-
-        if len(group) == 1:
-            output_file.write("{}{}".format(int(sum(counts[0])), sep))
-        else:
-            for i in counts[0]:
-                output_file.write("{}{}".format(int(sum(i)), sep))
-        output_file.write("\n")
-        output_file.write("\n\nIn the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n")
-        output_file.write("Region{}total nr. of tags per region\n".format(sep))
-        for i, count in zip(group, quantAfterRegion):
-            output_file.write("{}{}{}\n".format(i, sep, len(count) / 2))
-
-    print("Files successfully created!")
-
-
-if __name__ == '__main__':
-    sys.exit(compare_read_families_refGenome(sys.argv))
+#!/usr/bin/env python
+
+# Family size distribution of tags which were aligned to the reference genome
+#
+# Author: Monika Heinzl & Gundula Povysil, Johannes-Kepler University Linz (Austria)
+# Contact: monika.heinzl@edumail.at
+#
+# Takes at least one TABULAR file with tags before the alignment to the SSCS,
+# a BAM file with tags of reads that overlap the regions of the reference genome and
+# an optional BED file with chromosome, start and stop position of the regions as input.
+# The program produces a plot which shows the distribution of family sizes of the tags from the input files and
+# a tabular file with the data of the plot.
+
+# USAGE: python FSD_regions.py --inputFile filenameSSCS --inputName1 filenameSSCS
+# --bamFile DCSbamFile --rangesFile BEDfile --output_tabular outptufile_name_tabular
+# --output_pdf outputfile_name_pdf
+
+import argparse
+import collections
+import re
+import sys
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pysam
+from matplotlib.backends.backend_pdf import PdfPages
+
+plt.switch_backend('agg')
+
+
+def readFileReferenceFree(file, delim):
+    with open(file, 'r') as dest_f:
+        data_array = np.genfromtxt(dest_f, skip_header=0, delimiter=delim, comments='#', dtype=str)
+        return data_array
+
+
+def make_argparser():
+    parser = argparse.ArgumentParser(description='Family Size Distribution of tags which were aligned to regions of the reference genome')
+    parser.add_argument('--inputFile', help='Tabular File with three columns: ab or ba, tag and family size.')
+    parser.add_argument('--inputName1')
+    parser.add_argument('--bamFile', help='BAM file with aligned reads.')
+    parser.add_argument('--rangesFile', default=None, help='BED file with chromosome, start and stop positions.')
+    parser.add_argument('--output_pdf', default="data.pdf", type=str, help='Name of the pdf and tabular file.')
+    parser.add_argument('--output_tabular', default="data.tabular", type=str, help='Name of the pdf and tabular file.')
+    return parser
+
+
+def compare_read_families_refGenome(argv):
+    parser = make_argparser()
+    args = parser.parse_args(argv[1:])
+
+    firstFile = args.inputFile
+    name1 = args.inputName1
+    name1 = name1.split(".tabular")[0]
+    bamFile = args.bamFile
+
+    rangesFile = args.rangesFile
+    title_file = args.output_pdf
+    title_file2 = args.output_tabular
+    sep = "\t"
+
+    with open(title_file2, "w") as output_file, PdfPages(title_file) as pdf:
+        data_array = readFileReferenceFree(firstFile, "\t")
+        pysam.index(bamFile)
+
+        bam = pysam.AlignmentFile(bamFile, "rb")
+        qname_dict = collections.OrderedDict()
+
+        if rangesFile is not None:
+            with open(rangesFile, 'r') as regs:
+                range_array = np.genfromtxt(regs, skip_header=0, delimiter='\t', comments='#', dtype=str)
+
+            if range_array.ndim == 0:
+                print("Error: file has 0 lines")
+                exit(2)
+
+            if range_array.ndim == 1:
+                chrList = range_array[0]
+                start_posList = range_array[1].astype(int)
+                stop_posList = range_array[2].astype(int)
+                chrList = [chrList.tolist()]
+                start_posList = [start_posList.tolist()]
+                stop_posList = [stop_posList.tolist()]
+            else:
+                chrList = range_array[:, 0]
+                start_posList = range_array[:, 1].astype(int)
+                stop_posList = range_array[:, 2].astype(int)
+
+            if len(start_posList) != len(stop_posList):
+                print("start_positions and end_positions do not have the same length")
+                exit(3)
+
+            chrList = np.array(chrList)
+            start_posList = np.array(start_posList).astype(int)
+            stop_posList = np.array(stop_posList).astype(int)
+            for chr, start_pos, stop_pos in zip(chrList, start_posList, stop_posList):
+                chr_start_stop = "{}_{}_{}".format(chr, start_pos, stop_pos)
+                qname_dict[chr_start_stop] = []
+                for read in bam.fetch(chr, start_pos, stop_pos):
+                    if not read.is_unmapped:
+                        if re.search('_', read.query_name):
+                            tags = re.split('_', read.query_name)[0]
+                        else:
+                            tags = read.query_name
+                        qname_dict[chr_start_stop].append(tags)
+
+        else:
+            for read in bam.fetch():
+                if not read.is_unmapped:
+                    if re.search(r'_', read.query_name):
+                        tags = re.split('_', read.query_name)[0]
+                    else:
+                        tags = read.query_name
+
+                    if read.reference_name not in qname_dict:
+                        qname_dict[read.reference_name] = [tags]
+                    else:
+                        qname_dict[read.reference_name].append(tags)
+
+        seq = np.array(data_array[:, 1])
+        tags = np.array(data_array[:, 2])
+        quant = np.array(data_array[:, 0]).astype(int)
+        group = np.array(list(qname_dict.keys()))
+
+        all_ab = seq[np.where(tags == "ab")[0]]
+        all_ba = seq[np.where(tags == "ba")[0]]
+        quant_ab = quant[np.where(tags == "ab")[0]]
+        quant_ba = quant[np.where(tags == "ba")[0]]
+
+        seqDic_ab = dict(zip(all_ab, quant_ab))
+        seqDic_ba = dict(zip(all_ba, quant_ba))
+
+        lst_ab = []
+        lst_ba = []
+        quantAfterRegion = []
+        length_regions = 0
+        for i in group:
+            lst_ab_r = []
+            lst_ba_r = []
+            seq_mut = qname_dict[i]
+            if rangesFile is None:
+                seq_mut, seqMut_index = np.unique(np.array(seq_mut), return_index=True)
+            length_regions = length_regions + len(seq_mut) * 2
+            for r in seq_mut:
+                count_ab = seqDic_ab.get(r)
+                count_ba = seqDic_ba.get(r)
+                lst_ab_r.append(count_ab)
+                lst_ab.append(count_ab)
+                lst_ba_r.append(count_ba)
+                lst_ba.append(count_ba)
+
+            dataAB = np.array(lst_ab_r)
+            dataBA = np.array(lst_ba_r)
+            bigFamilies = np.where(dataAB > 20)[0]
+            dataAB[bigFamilies] = 22
+            bigFamilies = np.where(dataBA > 20)[0]
+            dataBA[bigFamilies] = 22
+
+            quantAll = np.concatenate((dataAB, dataBA))
+            quantAfterRegion.append(quantAll)
+
+        quant_ab = np.array(lst_ab)
+        quant_ba = np.array(lst_ba)
+
+        maximumX = np.amax(np.concatenate(quantAfterRegion))
+        minimumX = np.amin(np.concatenate(quantAfterRegion))
+
+        # PLOT
+        plt.rc('figure', figsize=(11.69, 8.27))  # A4 format
+        plt.rcParams['axes.facecolor'] = "E0E0E0"  # grey background color
+        plt.rcParams['xtick.labelsize'] = 14
+        plt.rcParams['ytick.labelsize'] = 14
+        plt.rcParams['patch.edgecolor'] = "black"
+        fig = plt.figure()
+        plt.subplots_adjust(bottom=0.3)
+
+        colors = ["#6E6E6E", "#0431B4", "#5FB404", "#B40431", "#F4FA58", "#DF7401", "#81DAF5"]
+
+        col = []
+        for i in range(0, len(group)):
+            col.append(colors[i])
+
+        counts = plt.hist(quantAfterRegion, bins=range(minimumX, maximumX + 1), stacked=False, label=group,
+                          align="left", alpha=1, color=col, edgecolor="black", linewidth=1)
+        ticks = np.arange(minimumX - 1, maximumX, 1)
+
+        ticks1 = [str(_) for _ in ticks]
+        ticks1[len(ticks1) - 1] = ">20"
+        plt.xticks(np.array(ticks), ticks1)
+        count = np.bincount([int(_) for _ in quant_ab])  # original counts
+
+        legend = "max. family size:\nabsolute frequency:\nrelative frequency:\n\ntotal nr. of reads:\n(before SSCS building)"
+        plt.text(0.15, 0.085, legend, size=11, transform=plt.gcf().transFigure)
+
+        legend = "AB\n{}\n{}\n{:.5f}\n\n{:,}".format(max(map(int, quant_ab)), count[len(count) - 1], float(count[len(count) - 1]) / sum(count), sum(np.array(data_array[:, 0]).astype(int)))
+        plt.text(0.35, 0.105, legend, size=11, transform=plt.gcf().transFigure)
+
+        count2 = np.bincount([int(_) for _ in quant_ba])  # original counts
+
+        legend = "BA\n{}\n{}\n{:.5f}" \
+            .format(max(map(int, quant_ba)), count2[len(count2) - 1], float(count2[len(count2) - 1]) / sum(count2))
+        plt.text(0.45, 0.1475, legend, size=11, transform=plt.gcf().transFigure)
+
+        plt.text(0.55, 0.2125, "total nr. of tags:", size=11, transform=plt.gcf().transFigure)
+        plt.text(0.8, 0.2125, "{:,} ({:,})".format(length_regions, length_regions / 2), size=11,
+                 transform=plt.gcf().transFigure)
+
+        legend4 = "* In the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n"
+        plt.text(0.1, 0.01, legend4, size=11, transform=plt.gcf().transFigure)
+
+        space = 0
+        for i, count in zip(group, quantAfterRegion):
+            plt.text(0.55, 0.15 - space, "{}:\n".format(i), size=11, transform=plt.gcf().transFigure)
+            plt.text(0.8, 0.15 - space, "{:,}\n".format(len(count) / 2), size=11, transform=plt.gcf().transFigure)
+            space = space + 0.02
+
+        plt.legend(loc='upper right', fontsize=14, bbox_to_anchor=(0.9, 1), frameon=True)
+        plt.xlabel("Family size", fontsize=14)
+        plt.ylabel("Absolute Frequency", fontsize=14)
+        plt.grid(b=True, which="major", color="#424242", linestyle=":")
+        plt.margins(0.01, None)
+
+        pdf.savefig(fig, bbox_inch="tight")
+        plt.close()
+
+        output_file.write("Dataset:{}{}\n".format(sep, name1))
+        output_file.write("{}AB{}BA\n".format(sep, sep))
+        output_file.write("max. family size:{}{}{}{}\n".format(sep, max(map(int, quant_ab)), sep, max(map(int, quant_ba))))
+        output_file.write("absolute frequency:{}{}{}{}\n".format(sep, count[len(count) - 1], sep, count2[len(count2) - 1]))
+        output_file.write("relative frequency:{}{:.3f}{}{:.3f}\n\n".format(sep, float(count[len(count) - 1]) / sum(count), sep, float(count2[len(count2) - 1]) / sum(count2)))
+        output_file.write("total nr. of reads{}{}\n".format(sep, sum(np.array(data_array[:, 0]).astype(int))))
+        output_file.write("total nr. of tags{}{} ({})\n".format(sep, length_regions, length_regions / 2))
+        output_file.write("\n\nValues from family size distribution\n")
+        output_file.write("{}".format(sep))
+        for i in group:
+            output_file.write("{}{}".format(i, sep))
+        output_file.write("\n")
+
+        j = 0
+        for fs in counts[1][0:len(counts[1]) - 1]:
+            if fs == 21:
+                fs = ">20"
+            else:
+                fs = "={}".format(fs)
+            output_file.write("FS{}{}".format(fs, sep))
+            if len(group) == 1:
+                output_file.write("{}{}".format(int(counts[0][j]), sep))
+            else:
+                for n in range(len(group)):
+                    output_file.write("{}{}".format(int(counts[0][n][j]), sep))
+            output_file.write("\n")
+            j += 1
+        output_file.write("sum{}".format(sep))
+
+        if len(group) == 1:
+            output_file.write("{}{}".format(int(sum(counts[0])), sep))
+        else:
+            for i in counts[0]:
+                output_file.write("{}{}".format(int(sum(i)), sep))
+        output_file.write("\n")
+        output_file.write("\n\nIn the plot, both family sizes of the ab and ba strands were used.\nWhereas the total numbers indicate only the single count of the tags per region.\n")
+        output_file.write("Region{}total nr. of tags per region\n".format(sep))
+        for i, count in zip(group, quantAfterRegion):
+            output_file.write("{}{}{}\n".format(i, sep, len(count) / 2))
+
+    print("Files successfully created!")
+
+
+if __name__ == '__main__':
+    sys.exit(compare_read_families_refGenome(sys.argv))
--- a/fsd_regions.xml	Thu Oct 24 09:36:09 2019 -0400
+++ b/fsd_regions.xml	Wed Dec 04 16:32:40 2019 -0500
@@ -1,27 +1,25 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<tool id="fsd_regions" name="FSD regions:" version="1.0.0" profile="19.01">
+<tool id="fsd_regions" name="FSD regions:" version="1.0.1" profile="19.01">
     <description>Family size distribution of user-specified regions in the reference genome</description>
     <macros>
         <import>fsd_macros.xml</import>
     </macros>
-	<requirements>
-        <requirement type="package" version="2.7">python</requirement>
-        <requirement type="package" version="1.4.0">matplotlib</requirement>
+    <expand macro="requirements" >
         <requirement type="package" version="0.15">pysam</requirement>
-    </requirements>
+    </expand>
     <command>
-python '$__tool_directory__/fsd_regions.py'
---inputFile '$file1'
---inputName1 '$file1.element_identifier'
+python '$__tool_directory__/fsd_regions.py'
+--inputFile '$file'
+--inputName1 @ESCAPE_IDENTIFIER@
 --bamFile '$file2'
 #if $file3:
     --rangesFile '$file3'
 #end if
---output_pdf '$output_pdf'
+--output_pdf '$output_pdf'
 --output_tabular '$output_tabular'
     </command>
     <inputs>
-        <param name="file1" type="data" format="tabular" label="Input tags of whole dataset" optional="false" help="This dataset is generated by post-processing of the output from 'Make Families' or 'Correct Barcodes' tool by extracting the first two columns, sorting the tags (column 1) and adding the counts of unique occurencies of each tag. See Help section below for a detailed explanation."/>
+        <param name="file" type="data" format="tabular" label="Input tags of whole dataset" optional="false" help="This dataset is generated by post-processing of the output from 'Make Families' or 'Correct Barcodes' tool by extracting the first two columns, sorting the tags (column 1) and adding the counts of unique occurencies of each tag. See Help section below for a detailed explanation."/>
         <param name="file2" type="data" format="bam" label="BAM file of aligned reads." help="Input in BAM format with the reads that were aligned to the reference genome."/>
         <param name="file3" type="data" format="bed" label="BED file with chromsome, start and stop positions of the targetted regions." optional="true" help="BED file with start and stop positions of regions in the reference genome."/>
     </inputs>
@@ -31,11 +29,11 @@
     </outputs>
     <tests>
         <test>
-            <param name="file1" value="fsd_reg.tab"/>
+            <param name="file" value="fsd_reg.tab"/>
             <param name="file2" value="fsd_reg.bam"/>
             <param name="file3" value="fsd_reg_ranges.bed"/>
             <output name="output_pdf" file="fsd_reg_output.pdf" lines_diff="136"/>
-            <output name="output_tabular" file="fsd_reg_output.tab" lines_diff="2"/>
+            <output name="output_tabular" file="fsd_reg_output.tab" lines_diff="6"/>
         </test>
     </tests>
     <help> <![CDATA[
@@ -43,7 +41,7 @@

 This tool provides a computationally very fast insight into the distribution of the family sizes of ALL tags from a Duplex Sequencing (DS) experiment that were aligned to different regions targeted in the reference genome.

-**Input**
+**Input**

 **Dataset 1:** This tools expects a tabular file with the tags of all families, their sizes and information about forward (ab) and reverse (ba) strands::

@@ -51,7 +49,7 @@
  -----------------------------
  1 AAAAAAAAAAAAAAAAATGGTATG ba
  3 AAAAAAAAAAAAAATGGTATGGAC ab
-
+
 .. class:: infomark

 **How to generate the input**
@@ -67,7 +65,7 @@

 We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAATAGCTCGAT ab
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -76,7 +74,7 @@

 Next, the tags are sorted in ascending or descending order using the **Sort** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAAAATGGTATG ba
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -106,7 +104,7 @@

 The output is a PDF file with the plot and the summarized data of the plot. The tool creates a distribution of family sizes of tags that were aligned to the reference genome. Note that tags that overlap different regions of the reference genome are counted for each region. This tool is useful to examine differences in coverage among targeted regions. The plot includes both complementary SSCS-ab and SSCS-ba that form a DCS; whereas the table shows only single counts of the tags per region.

-    ]]>
+    ]]>
     </help>
     <expand macro="citation" />
 </tool>
--- a/td.py	Thu Oct 24 09:36:09 2019 -0400
+++ b/td.py	Wed Dec 04 16:32:40 2019 -0500
@@ -18,7 +18,6 @@
 #        --nr_above_bars True/False --output_tabular outptufile_name_tabular

 import argparse
-import itertools
 import operator
 import sys
 from collections import Counter, defaultdict
@@ -70,7 +69,7 @@
     plt.suptitle(subtitle, y=1, x=0.5, fontsize=14)
     plt.xlabel("Family size", fontsize=14)
     ticks = numpy.arange(0, maximumXFS + 1, 1)
-    ticks1 = map(str, ticks)
+    ticks1 = [str(_) for _ in ticks]
     if maximumXFS >= 20:
         ticks1[len(ticks1) - 1] = ">=20"
     plt.xticks(numpy.array(ticks), ticks1)
@@ -102,7 +101,7 @@

     fig = plt.figure(figsize=(6, 8))
     plt.subplots_adjust(bottom=0.1)
-    p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).iteritems())])
+    p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).items())])
     maximumY = numpy.amax(p1)
     if relative is True:  # relative difference
         bin1 = numpy.arange(-1, maximumX + 0.2, 0.1)
@@ -129,7 +128,7 @@
         plt.ylim((0, maximumY * 1.2))
         plt.ylabel("Absolute Frequency", fontsize=14)
         bins = counts[1]  # width of bins
-        counts = numpy.array(map(int, counts[0][5]))
+        counts = numpy.array([int(_) for _ in counts[0][5]])

     plt.legend(loc='upper right', fontsize=14, frameon=True, bbox_to_anchor=(1.45, 1))
     plt.suptitle(subtitle, y=1, x=0.5, fontsize=14)
@@ -177,7 +176,7 @@
     step = 1
     fig = plt.figure(figsize=(6, 8))
     plt.subplots_adjust(bottom=0.1)
-    p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).iteritems())])
+    p1 = numpy.array([v for k, v in sorted(Counter(numpy.concatenate(list1)).items())])
     maximumY = numpy.amax(p1)
     bin1 = maximumX + 1
     if rel_freq:
@@ -188,7 +187,7 @@
         plt.ylim((0, 1.07))
         plt.ylabel("Relative Frequency", fontsize=14)
         bins = counts[1]  # width of bins
-        counts = numpy.array(map(float, counts[0][2]))
+        counts = numpy.array([float(_) for _ in counts[0][2]])

     else:
         counts = plt.hist(list1, bins=bin1, edgecolor='black', linewidth=1,
@@ -197,7 +196,7 @@
         plt.ylim((0, maximumY * 1.2))
         plt.ylabel("Absolute Frequency", fontsize=14)
         bins = counts[1]  # width of bins
-        counts = numpy.array(map(int, counts[0][2]))
+        counts = numpy.array([int(_) for _ in counts[0][2]])

     plt.legend(loc='upper right', fontsize=14, frameon=True, bbox_to_anchor=(1.45, 1))
     plt.suptitle(subtitle, y=1, x=0.5, fontsize=14)
@@ -581,7 +580,7 @@
     i = 0
     array2 = numpy.unique(array2)  # remove duplicate sequences to decrease running time
     for a in array1:
-        dist = numpy.array([sum(itertools.imap(operator.ne, a, b)) for b in array2])  # fastest
+        dist = numpy.array([sum(map(operator.ne, a, b)) for b in array2])  # fastest
         res[i] = numpy.amin(dist[dist > 0])  # pick min distance greater than zero
         i += 1
     return res
@@ -589,10 +588,10 @@

 def hamming_difference(array1, array2, mate_b):
     array2 = numpy.unique(array2)  # remove duplicate sequences to decrease running time
-    array1_half = numpy.array([i[0:(len(i)) / 2] for i in array1])  # mate1 part1
-    array1_half2 = numpy.array([i[len(i) / 2:len(i)] for i in array1])  # mate1 part 2
-    array2_half = numpy.array([i[0:(len(i)) / 2] for i in array2])  # mate2 part1
-    array2_half2 = numpy.array([i[len(i) / 2:len(i)] for i in array2])  # mate2 part2
+    array1_half = numpy.array([i[0:int(len(i) / 2)] for i in array1])  # mate1 part1
+    array1_half2 = numpy.array([i[int(len(i) / 2):len(i)] for i in array1])  # mate1 part 2
+    array2_half = numpy.array([i[0:int(len(i) / 2)] for i in array2])  # mate2 part1
+    array2_half2 = numpy.array([i[int(len(i) / 2):len(i)] for i in array2])  # mate2 part2

     diff11 = []
     relativeDiffList = []
@@ -628,7 +627,7 @@
         array2_half2_withoutSame = half2_mate2[index_withoutSame]
         array2_withoutSame = array2[index_withoutSame]  # whole tag (=not splitted into 2 halfs)
         # calculate HD of "a" in the tag to all "a's" or "b" in the tag to all "b's"
-        dist = numpy.array([sum(itertools.imap(operator.ne, a, c)) for c in
+        dist = numpy.array([sum(map(operator.ne, a, c)) for c in
                             array2_half_withoutSame])
         min_index = numpy.where(dist == dist.min())[0]  # get index of min HD
         min_value = dist.min()
@@ -636,7 +635,7 @@
         min_tag_half2 = array2_half2_withoutSame[min_index]
         min_tag_array2 = array2_withoutSame[min_index]  # get whole tag with min HD

-        dist_second_half = numpy.array([sum(itertools.imap(operator.ne, b, e)) for e in
+        dist_second_half = numpy.array([sum(map(operator.ne, b, e)) for e in
                                         min_tag_half2])  # calculate HD of "b" to all "b's" or "a" to all "a's"
         max_value = dist_second_half.max()
         max_index = numpy.where(dist_second_half == dist_second_half.max())[0]  # get index of max HD
@@ -674,7 +673,7 @@

 def readFileReferenceFree(file):
     with open(file, 'r') as dest_f:
-        data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype='string')
+        data_array = numpy.genfromtxt(dest_f, skip_header=0, delimiter='\t', comments='#', dtype=str)
         integers = numpy.array(data_array[:, 0]).astype(int)
         return (integers, data_array)

@@ -948,8 +947,8 @@

         # HD analysis for a subset of the tag
         if subset > 0:
-            tag1 = numpy.array([i[0:(len(i)) / 2] for i in data_array[:, 1]])
-            tag2 = numpy.array([i[len(i) / 2:len(i)] for i in data_array[:, 1]])
+            tag1 = numpy.array([i[0:int(len(i) / 2)] for i in data_array[:, 1]])
+            tag2 = numpy.array([i[int(len(i) / 2):len(i)] for i in data_array[:, 1]])

             flanking_region_float = float((len(tag1[0]) - subset)) / 2
             flanking_region = int(flanking_region_float)
@@ -1068,8 +1067,8 @@
                 if tag1 in checked_tags:  # skip tag if already written to file
                     continue

-                sample_half_a = tag1[0:(len(tag1)) / 2]
-                sample_half_b = tag1[len(tag1) / 2:len(tag1)]
+                sample_half_a = tag1[0:int(len(tag1) / 2)]
+                sample_half_b = tag1[int(len(tag1) / 2):len(tag1)]

                 max_tags = data_chimeraAnalysis[i, 1]
                 if len(max_tags) > 1 and len(max_tags) != len(data_array[0, 1]) and type(max_tags) is not numpy.ndarray:
@@ -1079,8 +1078,8 @@

                 info_maxTags = [data_array[data_array[:, 1] == t, :] for t in max_tags]

-                chimera_half_a = numpy.array([t[0:(len(t)) / 2] for t in max_tags])  # mate1 part1
-                chimera_half_b = numpy.array([t[len(t) / 2:len(t)] for t in max_tags])  # mate1 part 2
+                chimera_half_a = numpy.array([t[0:int(len(t) / 2)] for t in max_tags])  # mate1 part1
+                chimera_half_b = numpy.array([t[int(len(t) / 2):len(t)] for t in max_tags])  # mate1 part 2

                 new_format = []
                 for j in range(len(max_tags)):
@@ -1268,7 +1267,7 @@
             "For simplicity, we used the maximum value of the relative differences and the respective delta HD.\n"
             "Note that when only tags that can form a DCS are included in the analysis, the family sizes for both directions (ab and ba) of the strand will be included in the plots.\n")

-        output_file.write("\nlength of one half of the tag{}{}\n\n".format(sep, len(data_array[0, 1]) / 2))
+        output_file.write("\nlength of one half of the tag{}{}\n\n".format(sep, int(len(data_array[0, 1]) / 2)))

         createFileHDwithinTag(summary9, sumCol9, overallSum9, output_file,
                               "Tag distance of each half in the tag", sep)
--- a/td.xml	Thu Oct 24 09:36:09 2019 -0400
+++ b/td.xml	Wed Dec 04 16:32:40 2019 -0500
@@ -1,32 +1,29 @@
 <?xml version="1.0" encoding="UTF-8"?>
-<tool id="td" name="TD:" version="1.0.0" profile="19.01">
+<tool id="td" name="TD:" version="1.0.2" profile="19.01">
     <description>Tag distance analysis of duplex tags</description>
     <macros>
         <import>fsd_macros.xml</import>
     </macros>
-    <requirements>
-        <requirement type="package" version="2.7">python</requirement>
-        <requirement type="package" version="1.4.0">matplotlib</requirement>
-    </requirements>
+    <expand macro="requirements" />
     <command><![CDATA[
-        python '$__tool_directory__/td.py'
-        --inputFile '${inputFile}'
-        --inputName1 '${inputFile.element_identifier}'
+        python '$__tool_directory__/td.py'
+        --inputFile '${file}'
+        --inputName1 @ESCAPE_IDENTIFIER@
         --sample_size '${sampleSize}'
         --subset_tag '${subsetTag}'
-        --nproc "\${GALAXY_SLOTS:-1}"
-        $onlyDCS
+        --nproc "\${GALAXY_SLOTS:-1}"
+        $onlyDCS
         $rel_freq
-        --minFS '${minFS}'
+        --minFS '${minFS}'
         --maxFS '${maxFS}'
-        $nr_above_bars
+        $nr_above_bars
         --output_pdf '${output_pdf}'
         --output_tabular '${output_tabular}'
-        --output_chimeras_tabular '${output_chimeras_tabular}'
+        --output_chimeras_tabular '${output_chimeras_tabular}'
     ]]>
     </command>
     <inputs>
-        <param name="inputFile" type="data" format="tabular" label="Input tags" optional="false" help="This dataset is generated by post-processing of the output from 'Make Families' or 'Correct Barcodes' tool by extracting the first two columns, sorting the tags (column 1) and adding the counts of unique occurencies of each tag. See Help section below for a detailed explanation."/>
+        <param name="file" type="data" format="tabular" label="Input tags" optional="false" help="This dataset is generated by post-processing of the output from 'Make Families' or 'Correct Barcodes' tool by extracting the first two columns, sorting the tags (column 1) and adding the counts of unique occurencies of each tag. See Help section below for a detailed explanation."/>
         <param name="sampleSize" type="integer" label="Number of tags to sample" value="1000" min="0" help="A typical duplex experiment contains very large number of tags. To reduce the runtime of this tool it can sample a subset of tags from the dataset. This parameter specifies how many tags to sample. 1000 is a good starting default. Set to '0' to sample all tags."/>
         <param name="minFS" type="integer" label="Minimum family size" min="1" value="1" help="Tags forming families of size smaller than this value will be filtered out. Default = 1"/>
         <param name="maxFS" type="integer" label="Maximum family size" min="0" value="0" help="Tags forming families of size larger than this value will be filtered out. Set to '0' to turn off this restriction. Default = 0"/>
@@ -43,11 +40,11 @@
     </outputs>
     <tests>
         <test>
-            <param name="inputFile" value="td_data.tab"/>
+            <param name="file" value="td_data.tab"/>
             <param name="sampleSize" value="0"/>
             <output name="output_pdf" file="td_output.pdf" lines_diff="136" />
-            <output name="output_tabular" file="td_output.tab"/>
-            <output name="output_chimeras_tabular" file="td_chimeras_output.tab"/>
+            <output name="output_tabular" file="td_output.tab" />
+            <output name="output_chimeras_tabular" file="td_chimeras_output.tab" lines_diff="2"/>
         </test>
     </tests>
     <help> <![CDATA[
@@ -57,9 +54,9 @@
 ≫ 1,000,000). Because of such excess it is highly unlikely to tag distinct input DNA molecules with highly similar barcodes.

 This tool calculates the number of nucleotide differences among tags (tag distance), also known as `Hamming distance <https://en.wikipedia.org/wiki/Hamming_distance>`_. In this context the Hamming distance is simply the number of differences between two tags. The tool compares in a randomly selected subset of tags (default n=1000), the difference between each tag of the subset with the tags of the complete dataset. Each tag will differ by a certain number of nucleotides with the other tags; yet the tool uses the smallest difference observed with any other tag.
-
+
 **Input**
-
+
 This tools expects a tabular file with the tags of all families, the family sizes and information about forward (ab) and reverse (ba) strands::

  1 2                        3
@@ -82,7 +79,7 @@

 We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAATAGCTCGAT ab
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -91,7 +88,7 @@

 Next, the tags are sorted in ascending or descending order using the **Sort** tool::

- 1                        2
+ 1                        2
  ---------------------------
  AAAAAAAAAAAAAAAAATGGTATG ba
  AAAAAAAAAAAAAAATAGCTCGAT ab
@@ -106,17 +103,17 @@
  3 AAAAAAAAAAAAAATGGTATGGAC ab

 These data can now be used in this tool.
-
+
 **Output**
-
+
 The output is one PDF file with various plots of the tag distance, a tabular file with the summarized data of the plots and a tabular file with the chimeras. The PDF file contains several pages:

  1. This first page contains a graph representing the minimum tag distance (smallest number of differences) categorized after the family sizes.
-
+
  2. The second page contains the same information as the first page, but plots the family size categorized by the minimum tag distance.
-
+
  3. The third page contains the **first step** of the **chimera analysis**, which examines the differences between the tags at both ends of a read (a/b). Chimeras can be distinguished by carrying the same tag at one end combined with multiple different tags at the other end of a read. Here, we describe the calculation of the TDs for only one tag in detail, but the process is repeated for each tag in the sample (default n=1000). First, the tool splits the tag into its upstream and downstream part (named a and b) and compares it with all other a parts of the families in the dataset. Next, the tool estimates the sequence differences (TD) among the a parts and extracts those tags with the smallest difference (TD a.min) and calculates the TD of the b part. The tags with the largest differences are extracted to estimate the maximum TD (TD b.max). The process is repeated starting with the b part instead and estimates TD a.max and TD b.min. Next, we calculate the sum of TD a.min and TD b.max.
-
+
  4. The fourth page contains the **second step** of the **chimera analysis**: the absolute difference (=delta TD) between the partial TDs (TD a.min & TD b.max and TD b.min & TD a.max). The partial TDs of chimeric tags are normally very different which means that multiple combinations of the same a part with different b parts are likely. But it is possible that small delta TDs occur due to a half of a tag that is identical to other halves in the data. For this purpose, the relative difference between the partial TDs is estimated in the next step.

  5. The fifth page contains the **third step** of the **chimera analysis**: the relative differences of the partial TDs (=relative delta TD). These are calculated as the absolute difference between TD a.min and TD b.max equal to TD delta. Since it is not known whether the absolute difference originates due to a low and a very large TD within a tag or an identical half (TD=0), the tool estimates the relative TD delta as the ratio of the difference to the sum of the partial TDs. In a chimera, it is expected that only one end of the tag contributes to the TD of the whole tag. In other words, if the same a part is observed in combination with several different b parts, then one end will have a TD = 0. Thus, the TD difference between the parts (TD a.min - TD b.max, TD a.max - TD b.min) is the same as the sum of the parts (TD a.min + TD b.max, TD a.max + TD b.min) or the ratio of the difference to the sum (relative delta TD = TD a.min - TD b.max / (TD a.min + TD b.max or TD a.max - TD b.min / (TD a.max + TD b.min)) will equal 1 in chimeric families. Here, the maximum value of the relative delta TD and the respective delta TD (in step 4) are selected and the plot can be interpreted as the following:
@@ -130,7 +127,7 @@

  .. class:: infomark

-**Note:**
+**Note:**
 Chimeras can be identical in the first or second part of the tag and can have an identical TD with mutliple tags. Therefore, the second column of the output file can have multiple tag entries. The file also contains the family sizes and the direction of the read (ab, ba). The asterisks mark the identical part of the tag::

  1                                      2
@@ -138,7 +135,7 @@
  GAAAGGGAGG GCGCTTCACG	1 ba            GCAATCGACG *GCGCTTCACG* 1 ba
  CCCTCCCTGA GGTTCGTTAT	1 ba            CGTCCTTTTC *GGTTCGTTAT* 1 ba, GCACCTCCTT *GGTTCGTTAT* 1  ba
  ATGCTGATCT CGAATGCATA	55 ba, 59 ab    AGGTGCCGCC *CGAATGCATA* 27 ba, *ATGCTGATCT* GAATGTTTAC 1 ba
-   ]]>
+   ]]>
     </help>
     <expand macro="citation" />
 </tool>