<?xml version="1.0" encoding="UTF-8"?>
<tool id="fsd_regions" name="FSD regions:" version="1.0.0" profile="19.01">
    <description>Family size distribution of user-specified regions in the reference genome</description>
    <macros>
        <import>fsd_macros.xml</import>
    </macros>
    <requirements>
        <requirement type="package" version="2.7">python</requirement>
        <requirement type="package" version="1.4.0">matplotlib</requirement>
        <requirement type="package" version="0.15">pysam</requirement>
    </requirements>
    <command>
python '$__tool_directory__/fsd_regions.py'
--inputFile '$file1'
--inputName1 '$file1.element_identifier'
--bamFile '$file2'
#if $file3:
    --rangesFile '$file3'
#end if
--output_pdf '$output_pdf'
--output_tabular '$output_tabular'
    </command>
    <inputs>
        <param name="file1" type="data" format="tabular" label="Input tags of whole dataset" optional="false" help="This dataset is generated by post-processing the output of the 'Make Families' or 'Correct Barcodes' tool: extract the first two columns, sort the tags (column 1), and add the count of occurrences of each unique tag. See the Help section below for a detailed explanation."/>
        <param name="file2" type="data" format="bam" label="BAM file of aligned reads." help="Input in BAM format with the reads that were aligned to the reference genome."/>
        <param name="file3" type="data" format="bed" label="BED file with chromosome, start and stop positions of the targeted regions." optional="true" help="BED file with start and stop positions of regions in the reference genome."/>
    </inputs>
    <outputs>
        <data name="output_pdf" format="pdf" label="${tool.name} on ${on_string}: PDF"/>
        <data name="output_tabular" format="tabular" label="${tool.name} on ${on_string}: Summary"/>
    </outputs>
    <tests>
        <test>
            <param name="file1" value="fsd_reg.tab"/>
            <param name="file2" value="fsd_reg.bam"/>
            <param name="file3" value="fsd_reg_ranges.bed"/>
            <output name="output_pdf" file="fsd_reg_output.pdf" lines_diff="136"/>
            <output name="output_tabular" file="fsd_reg_output.tab" lines_diff="2"/>
        </test>
    </tests>
    <help> <![CDATA[
**What it does**

This tool provides a fast overview of the distribution of the family sizes of ALL tags from a Duplex Sequencing (DS) experiment that were aligned to the different targeted regions of the reference genome.

**Input**

**Dataset 1:** This tool expects a tabular file with the tags of all families, their sizes, and the strand of each family, forward (ab) or reverse (ba)::

 1 2                        3
 -----------------------------
 1 AAAAAAAAAAAAAAAAATGGTATG ba
 3 AAAAAAAAAAAAAATGGTATGGAC ab
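
As an illustration of this layout, the following Python 3 sketch reads such a file into a dictionary. It assumes the columns are tab-separated (the example above uses spaces only for display), and the function name ``read_family_sizes`` is hypothetical and not part of this tool.

```python
import csv
import io


def read_family_sizes(handle):
    """Map (tag, strand) -> family size from a tab-separated file handle."""
    sizes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or incomplete lines
        count, tag, strand = row[0], row[1], row[2]
        sizes[(tag, strand)] = int(count)
    return sizes


# Two example rows in the layout shown above (assumed tab-separated).
example = io.StringIO(
    "1\tAAAAAAAAAAAAAAAAATGGTATG\tba\n"
    "3\tAAAAAAAAAAAAAATGGTATGGAC\tab\n"
)
family_sizes = read_family_sizes(example)
```

Keying the dictionary by tag and strand keeps the ab and ba families of the same tag apart.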

.. class:: infomark

**How to generate the input**

The first step of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_ is the **Make Families** tool or the **Correct Barcodes** tool, both of which produce output in this form::

 1                        2  3     4
 ------------------------------------------------------
 AAAAAAAAAAAAAAATAGCTCGAT ab read1 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAATAGCTCGAT ab read2 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAATAGCTCGAT ab read3 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAAAATGGTATG ba read3 CGCTACGTGACTAAAACATG

We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

 1                        2
 ---------------------------
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAAAATGGTATG ba

Next, the tags are sorted in ascending or descending order using the **Sort** tool::

 1                        2
 ---------------------------
 AAAAAAAAAAAAAAAAATGGTATG ba
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab

Finally, the occurrences of each unique tag are counted. This is done using the **Unique lines** tool, which adds a column with the counts (column 1); each count is the family size of that tag::

 1 2                        3
 -----------------------------
 1 AAAAAAAAAAAAAAAAATGGTATG ba
 3 AAAAAAAAAAAAAAATAGCTCGAT ab

These data can now be used in this tool.
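
For readers who prefer to prepare the input outside of Galaxy, the Cut, Sort and Unique lines steps above can be approximated by the following Python 3 sketch. It is not how the Galaxy tools are implemented; it assumes tab-separated input, and ``count_families`` is a hypothetical helper name.

```python
from collections import Counter


def count_families(lines):
    """Count reads per (tag, strand) from Make Families style lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip empty or incomplete lines
        # "Cut": keep only column 1 (tag) and column 2 (strand).
        counts[(fields[0], fields[1])] += 1
    # "Sort" by tag, then emit (count, tag, strand) rows as "Unique lines" would.
    return [(n, tag, strand) for (tag, strand), n in sorted(counts.items())]


reads = [
    "AAAAAAAAAAAAAAATAGCTCGAT\tab\tread1\tCGCTACGTGACTGGGTCATG",
    "AAAAAAAAAAAAAAATAGCTCGAT\tab\tread2\tCGCTACGTGACTGGGTCATG",
    "AAAAAAAAAAAAAAATAGCTCGAT\tab\tread3\tCGCTACGTGACTGGGTCATG",
    "AAAAAAAAAAAAAAAAATGGTATG\tba\tread3\tCGCTACGTGACTAAAACATG",
]
rows = count_families(reads)
for count, tag, strand in rows:
    print("{}\t{}\t{}".format(count, tag, strand))
```

Counting with a dictionary makes the explicit sort unnecessary for correctness; it is kept here only so that the rows come out in the same order as in the example.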

**Dataset 2:** BAM file of the aligned reads. This file can be produced with the tool `Map with BWA-MEM <https://arxiv.org/abs/1303.3997>`_.

**Dataset 3 (optional):** BED file with start and stop positions of the regions. If it is not provided, then all aligned reads of the BAM file are used in the distribution of family sizes::

 1        2      3
 ---------------------
 ACH_TDII   90    633
 ACH_TDII  659   1140
 ACH_TDII 1144   1561
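
To make the role of the regions concrete, the following Python 3 sketch parses such a file and reports which regions an aligned read overlaps. It treats regions and reads as half-open intervals, as in the BED convention; whether this tool applies exactly the same boundary rule is an assumption, and both function names are hypothetical.

```python
def parse_bed(lines):
    """Return a list of (chrom, start, end) tuples from BED-like lines."""
    regions = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or incomplete lines
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def overlapping_regions(chrom, start, end, regions):
    """Return the regions that overlap the half-open interval [start, end)."""
    return [
        (c, s, e)
        for (c, s, e) in regions
        if c == chrom and s < end and start < e
    ]


regions = parse_bed([
    "ACH_TDII\t90\t633",
    "ACH_TDII\t659\t1140",
    "ACH_TDII\t1144\t1561",
])
# A read spanning 600-700 touches the first two regions.
hits = overlapping_regions("ACH_TDII", 600, 700, regions)
# A read that lies in the gap between the second and third region touches none.
misses = overlapping_regions("ACH_TDII", 1141, 1143, regions)
```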

**Output**

The tool produces two outputs: a PDF file with the plot and a tabular file with the summarized data of the plot. The plot shows the distribution of family sizes of the tags that were aligned to the reference genome. Note that a tag that overlaps several regions of the reference genome is counted once for each of them. This makes the tool useful for examining differences in coverage among targeted regions. The plot includes both complementary SSCS-ab and SSCS-ba that form a DCS, whereas the table shows only single counts of the tags per region.
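
The counting rule for tags that overlap several regions can be sketched in Python 3 as follows. This is a simplified illustration and not the code of this tool: the input dictionaries, the tag and region labels, and the function name are all hypothetical.

```python
from collections import Counter, defaultdict


def family_size_distribution(family_sizes, tag_regions):
    """Return {region: Counter({family size: number of tags})}."""
    distribution = defaultdict(Counter)
    for tag, regions in tag_regions.items():
        size = family_sizes.get(tag)
        if size is None:
            continue  # tag aligned but missing from the tag file
        # A tag is counted once in every region it overlaps.
        for region in set(regions):
            distribution[region][size] += 1
    return distribution


# Hypothetical inputs: family size per tag, and the regions each tag aligned to.
sizes = {"tagA": 1, "tagB": 3, "tagC": 3}
aligned = {"tagA": ["r1"], "tagB": ["r1", "r2"], "tagC": ["r2"]}
dist = family_size_distribution(sizes, aligned)
```

Here ``tagB`` contributes to both ``r1`` and ``r2``, so the per-region counts can sum to more than the number of distinct tags.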

    ]]>
    </help>
    <expand macro="citation" />
</tool>
author	iuc
date	Thu, 24 Oct 2019 09:36:09 -0400
parents
children	ec5a92514113