view make_families.xml @ 3:aedbdf8ce1af draft

planemo upload commit 670b3282d2c120882b956ad617e61369467fb0fe
author nick
date Tue, 08 Dec 2015 08:59:27 -0500
parents ba2a53b970ca
children 7f513b9b1b1e

<?xml version="1.0"?>
<tool id="make_families" name="Du Novo: Make families" version="0.2">
  <description>of duplex sequencing reads</description>
  <requirements>
    <requirement type="package" version="0.2">duplex</requirement>
    <requirement type="set_environment">DUPLEX_DIR</requirement>
  </requirements>
  <!-- TODO: Add dependency on coreutils to get paste? -->
  <command>paste $fastq1 $fastq2
    | paste - - - -
    | awk -f \$DUPLEX_DIR/make-barcodes.awk -v TAG_LEN=$taglen -v INVARIANT=$invariant
    | sort
    &gt; $output
  </command>
  <inputs>
    <param name="fastq1" type="data" format="fastq" label="Sequencing reads, mate 1"/>
    <param name="fastq2" type="data" format="fastq" label="Sequencing reads, mate 2"/>
    <param name="taglen" type="integer" value="12" min="0" label="Tag length" help="length of each random barcode on the ends of the fragments"/>
    <param name="invariant" type="integer" value="5" min="0" label="Invariant sequence length" help="length of the sequence between the tag and the actual sample sequence (normally the restriction site)"/>
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <tests>
    <test>
      <param name="fastq1" value="smoke_1.fq"/>
      <param name="fastq2" value="smoke_2.fq"/>
      <param name="taglen" value="5"/>
      <param name="invariant" value="1"/>
      <output name="output" file="smoke.families.tsv"/>
    </test>
    <test>
      <param name="fastq1" value="smoke_1.fq"/>
      <param name="fastq2" value="smoke_2.fq"/>
      <param name="taglen" value="5"/>
      <param name="invariant" value="0"/>
      <output name="output" file="smoke.families.i0.tsv"/>
    </test>
  </tests>
  <help>

**What it does**

This tool processes raw duplex sequencing data: it removes the barcodes from the reads and uses them to group the reads into families that come from the same fragment.

-----

**Output**

The output is a tabular file in which each line corresponds to one pair of input reads.

The columns are::

  1: barcode (both tags joined and ordered)
  2: tag order in barcode ("ab" or "ba")
  3: read1 name
  4: read1 sequence (minus the tag and invariant sequences)
  5: read1 quality scores (minus the same tag and invariant)
  6: read2 name
  7: read2 sequence (minus the tag and invariant sequences)
  8: read2 quality scores (minus the same tag and invariant)
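
As a rough illustration of this layout, the Python sketch below splits one output line into named fields. It is not part of the tool: the FamilyRecord name, its field names, and the sample line are hypothetical and were chosen only for this example.

```python
from collections import namedtuple

# Hypothetical field names for the eight columns described above.
FamilyRecord = namedtuple(
    "FamilyRecord",
    ["barcode", "order", "name1", "seq1", "qual1", "name2", "seq2", "qual2"],
)


def parse_family_line(line):
    """Split one tab-delimited output line into a FamilyRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 columns, got %d" % len(fields))
    return FamilyRecord(*fields)


# Made-up example line, not real tool output.
example = "ATGCCT\tab\tread_a/1\tGATTACA\tIIIIIII\tread_a/2\tTGTAATC\tIIIIIII"
record = parse_family_line(example)
print(record.barcode, record.order, record.seq1)
```

Because the file is sorted, consecutive records that share a barcode belong to the same family.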

-----

**Barcode creation**

For each pair, the tool removes the tag from the beginning of each read and creates a barcode by concatenating the two tags. The order of the two tags within the barcode is determined by a string comparison, so a given pair of tags always yields the same barcode, whichever read each tag came from. The original tag order is recorded in the second column.

Pairs from opposite strands carry the same two tags, but in the reverse order, so this produces the same barcode for all reads from the same fragment, regardless of strand. A simple sort then groups all reads from the same fragment together, with the two strands distinguished by their different "order" values.

Examples::

  +---------------+-----------------+
  |  input tags   |     output      |
  +-------+-------+-------+---------+
  | read1 | read2 | order | barcode |
  +-------+-------+-------+---------+
  |  ATG  |  CCT  |  ab   | ATGCCT  |
  +-------+-------+-------+---------+
  |  CCT  |  ATG  |  ba   | ATGCCT  |
  +-------+-------+-------+---------+
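
The ordering rule can be sketched in Python as follows. This only illustrates the description above and is not the code in make-barcodes.awk; the make_barcode function name and the handling of two identical tags are assumptions.

```python
def make_barcode(tag1, tag2):
    """Return (order, barcode) for the tags taken from read1 and read2."""
    # Python string comparison stands in for the tool's comparison here.
    # Assumption: two identical tags are reported as "ab".
    if min(tag1, tag2) == tag1:
        return "ab", tag1 + tag2
    return "ba", tag2 + tag1


# The two rows of the example table above.
print(make_barcode("ATG", "CCT"))  # ('ab', 'ATGCCT')
print(make_barcode("CCT", "ATG"))  # ('ba', 'ATGCCT')
```

Both calls return the same barcode and differ only in the order value, which is what lets a plain sort place the two strands of one fragment next to each other.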

  </help>
</tool>