<?xml version="1.0" encoding="UTF-8"?>
<tool id="fsd_beforevsafter" name="FSD Before/After:" version="1.0.2" profile="19.01">
    <description>Family Size Distribution of duplex sequencing tags during Du Novo analysis</description>
    <macros>
        <import>fsd_macros.xml</import>
    </macros>
    <expand macro="requirements">
        <requirement type="package" version="1.71">biopython</requirement>
        <requirement type="package" version="0.15">pysam</requirement>
    </expand>
    <command><![CDATA[
#if $bamFile:
    ln -s '$bamFile' 'reads.bam' &&
    ln -s '$bamFile.metadata.bam_index' 'reads.bam.bai' &&
#end if
python '$__tool_directory__/fsd_beforevsafter.py'
--inputFile_SSCS '$file'
--inputName1 @ESCAPE_IDENTIFIER@
--makeDCS '$makeDCS'
#if $afterTrimming:
    --afterTrimming '$afterTrimming'
#end if
#if $bamFile:
    --bamFile 'reads.bam'
#end if
--output_pdf '$output_pdf'
--output_tabular '$output_tabular'
    ]]></command>
    <inputs>
        <param name="file" type="data" format="tabular" label="Input tags of SSCSs" optional="false" help="This dataset is generated by post-processing the output of the 'Make Families' or 'Correct Barcodes' tool: extract the first two columns, sort the tags (column 1) and add the counts of unique occurrences of each tag. See the Help section below for a detailed explanation."/>
        <param name="makeDCS" type="data" format="fasta" label="Input tags after making DCSs" help="FASTA file with the tags of the reads that were aligned to DCSs. This file is produced by the 'Make consensus reads' tool."/>
        <param name="afterTrimming" type="data" format="fasta" optional="true" label="Input tags after trimming" help="FASTA file with the tags of the reads that were not filtered out by trimming. This file is produced by the 'Sequence Content Trimmer' tool."/>
        <param name="bamFile" type="data" format="bam" optional="true" label="Input tags aligned to the reference genome" help="BAM file with the reads that were aligned to the reference genome."/>
    </inputs>
    <outputs>
        <data name="output_pdf" format="pdf" label="${tool.name} on ${on_string}: PDF"/>
        <data name="output_tabular" format="tabular" label="${tool.name} on ${on_string}: Summary"/>
    </outputs>
    <tests>
        <test>
            <param name="file" value="fsd_ba_data.tab"/>
            <param name="makeDCS" value="fsd_ba_DCS.fna"/>
            <param name="afterTrimming" value="fsd_ba_trimmed.fna"/>
            <param name="bamFile" value="fsd_ba.bam"/>
            <output name="output_pdf" file="fsd_ba_output.pdf" lines_diff="183"/>
            <output name="output_tabular" file="fsd_ba_output.tab"/>
        </test>
    </tests>
    <help><![CDATA[

**What it does**

This tool creates a distribution of family sizes from the tags produced at various steps of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_.

**Input**

**Dataset 1:** This tool expects a tabular file with the tags of all families, their sizes (column 1) and the strand of each family, forward (ab) or reverse (ba)::

 1 2                        3
 -----------------------------
 1 AAAAAAAAAAAAAAAAATGGTATG ba
 3 AAAAAAAAAAAAAATGGTATGGAC ab

.. class:: infomark

**How to generate the input**

The first step of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_ is the **Make Families** tool or the **Correct Barcodes** tool, either of which produces output in this form::

 1                        2  3     4
 ------------------------------------------------------
 AAAAAAAAAAAAAAATAGCTCGAT ab read1 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAATAGCTCGAT ab read2 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAATAGCTCGAT ab read3 CGCTACGTGACTGGGTCATG
 AAAAAAAAAAAAAAAAATGGTATG ba read3 CGCTACGTGACTAAAACATG

We only need columns 1 and 2. These two columns can be extracted from this dataset using the **Cut** tool::

 1                        2
 ---------------------------
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAAAATGGTATG ba

Next, the tags are sorted in ascending or descending order using the **Sort** tool::

 1                        2
 ---------------------------
 AAAAAAAAAAAAAAAAATGGTATG ba
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab
 AAAAAAAAAAAAAAATAGCTCGAT ab

Finally, the unique occurrences of each tag are counted. This is done with the **Unique lines** tool, which adds a column (column 1) holding the counts; each count is the family size::

 1 2                        3
 -----------------------------
 1 AAAAAAAAAAAAAAAAATGGTATG ba
 3 AAAAAAAAAAAAAAATAGCTCGAT ab

These data can now be used in this tool.
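
The Cut, Sort and Unique lines steps above can also be sketched in a few lines of Python. This is only an illustration and not part of the tool: the function name ``count_families`` is hypothetical, and the sketch assumes tab-separated rows with the tag in column 1 and the strand in column 2.

```python
from collections import Counter


def count_families(lines):
    """Reduce 'Make Families' style rows to (size, tag, strand) rows.

    Hypothetical helper: assumes tab-separated rows with the tag in
    column 1 and the strand (ab/ba) in column 2.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        # Keep only columns 1 and 2 (the Cut step) and count duplicates.
        counts[(fields[0], fields[1])] += 1
    # Sort by tag, then strand, mirroring the Sort step.
    return [(size, tag, strand) for (tag, strand), size in sorted(counts.items())]


rows = [
    "AAAAAAAAAAAAAAATAGCTCGAT\tab\tread1\tCGCTACGTGACTGGGTCATG",
    "AAAAAAAAAAAAAAATAGCTCGAT\tab\tread2\tCGCTACGTGACTGGGTCATG",
    "AAAAAAAAAAAAAAATAGCTCGAT\tab\tread3\tCGCTACGTGACTGGGTCATG",
    "AAAAAAAAAAAAAAAAATGGTATG\tba\tread3\tCGCTACGTGACTAAAACATG",
]
for size, tag, strand in count_families(rows):
    print(size, tag, strand, sep="\t")
```

Each printed row has the family size in column 1, the tag in column 2 and the strand in column 3, which is the layout expected for Dataset 1.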

**Dataset 2:** A FASTA file is required whose header lines hold each tag and the associated family sizes of both strands (forward and reverse), with the read itself on the next line. This file is produced by the **Make consensus reads** tool of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_.

**Dataset 3 (optional):** In addition, a FASTA file with all tags that were not filtered out by trimming can be included. This file is produced by the **Sequence Content Trimmer** tool of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_.

For both input files (datasets 2 and 3), only the DCSs of one mate are necessary (these tools give information on both mates in the output file), since both files have the same tags and family sizes but different DCSs, which are not required by this tool::

 >AAAAAAAAATAGATCATAGACTCT 7-10
 CTAGACTCACTGGCGTTACTGACTGCGAGACCCTCCACGTG
 >AAAAAAAAGGCAGAAGATATACGC 11-3
 CNCNGGCCCCCCGCTCCGTGCACAGACGNNGCNACTGACAA
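
As a rough illustration of how the header lines shown above can be read, the following sketch extracts the tag and the two family sizes and ignores the sequence lines. The function name ``parse_tag_headers`` is hypothetical, and the sketch assumes that every header has the form ``>TAG N-M``; which of the two numbers belongs to which strand is not stated here, so they are returned in the order they appear.

```python
def parse_tag_headers(lines):
    """Return (tag, first_size, second_size) for each FASTA header line.

    Hypothetical helper: assumes headers of the form '>TAG N-M'.
    """
    records = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence lines are not needed
        tag, sizes = line[1:].split()[:2]
        first, second = (int(n) for n in sizes.split("-"))
        records.append((tag, first, second))
    return records


fasta = [
    ">AAAAAAAAATAGATCATAGACTCT 7-10",
    "CTAGACTCACTGGCGTTACTGACTGCGAGACCCTCCACGTG",
    ">AAAAAAAAGGCAGAAGATATACGC 11-3",
    "CNCNGGCCCCCCGCTCCGTGCACAGACGNNGCNACTGACAA",
]
for record in parse_tag_headers(fasta):
    print(record)
```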

**Dataset 4 (optional):** BAM file of the aligned reads. This file can be produced with the `Map with BWA-MEM <https://arxiv.org/abs/1303.3997>`_ tool.

**Output**

The output is a PDF file with a plot and a tabular file with a summary of the plotted data. The tool compares datasets from several steps of the `Du Novo Analysis Pipeline <https://doi.org/10.1186/s13059-016-1039-4>`_ and helps in choosing parameters (e.g. family size, minimum read length). For example, re-running trimming with different parameters makes it possible to recover reads that would otherwise be lost to stringent filtering by read length. The tool also makes it possible to assess the reads on target. It extracts the tags of reads and their family sizes before SSCS building, after DCS building, after trimming and finally after alignment to the reference genome. The plot shows the family sizes for both SSCS-ab and SSCS-ba, whereas the summary reports counts for either SSCS-ab or SSCS-ba only.
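
The idea behind the comparison can be sketched as follows: count how many families have each size, once for all tags and once for the tags that survived a later step. This is a minimal sketch under stated assumptions, not the code in ``fsd_beforevsafter.py``, which may work differently; the function name ``size_distribution`` and the ``kept_tags`` parameter are hypothetical.

```python
from collections import Counter


def size_distribution(families, kept_tags=None):
    """Count how many families have each family size.

    families: iterable of (size, tag, strand) rows as in Dataset 1.
    kept_tags: optional set of tags that survived a later step; when
    given, only families with those tags are counted (hypothetical).
    """
    dist = Counter()
    for size, tag, _strand in families:
        if kept_tags is None or tag in kept_tags:
            dist[size] += 1
    # Return a plain dict ordered by family size.
    return dict(sorted(dist.items()))


families = [
    (1, "AAAAAAAAAAAAAAAAATGGTATG", "ba"),
    (3, "AAAAAAAAAAAAAATGGTATGGAC", "ab"),
]
before = size_distribution(families)
after = size_distribution(families, kept_tags={"AAAAAAAAAAAAAATGGTATGGAC"})
print("before:", before)
print("after:", after)
```

Comparing the two dictionaries shows at which family sizes tags are lost between two steps.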

]]>
    </help>
    <expand macro="citation" />
</tool>
author	iuc
date	Fri, 10 Sep 2021 06:59:00 +0000
parents	ec5a92514113
children