view make_families.xml @ 0:d2e46adc199e draft

planemo upload commit 35b743e6492923c0e2b1e5e434eaf4e56d268108
author nick
date Mon, 23 Nov 2015 22:06:21 -0500
parents
children b63d6673f883

<?xml version="1.0"?>
<tool id="make_families" name="Make families" version="0.1">
  <description>from duplex sequencing data</description>
  <requirements>
    <requirement type="package" version="0.1">duplex</requirement>
    <requirement type="set_environment">DUPLEX_DIR</requirement>
  </requirements>
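  <command><![CDATA[
    ## Join the mates line by line, fold each 4-line FASTQ record onto one line,
    ## then extract the barcodes and sort so that families end up adjacent.
    paste '$fastq1' '$fastq2'
    | paste - - - -
    | awk -f \$DUPLEX_DIR/make-barcodes.awk -v TAG_LEN=$taglen -v INVARIANT=$invariant
    | sort
    > '$output'
  ]]></command>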
  <inputs>
    <param name="fastq1" type="data" format="fastq" label="Sequencing reads, mate 1"/>
    <param name="fastq2" type="data" format="fastq" label="Sequencing reads, mate 2"/>
    <param name="taglen" type="integer" value="12" min="0" label="Tag length" help="length of each random barcode on the ends of the fragments"/>
    <param name="invariant" type="integer" value="5" min="0" label="Invariant sequence length" help="length of the sequence between the tag and actual sample sequence (the restriction site, normally)"/>
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <tests>
    <test>
      <param name="fastq1" value="smoke_1.fq"/>
      <param name="fastq2" value="smoke_2.fq"/>
      <param name="taglen" value="5"/>
      <param name="invariant" value="1"/>
      <output name="output" file="smoke.families.tsv"/>
    </test>
    <test>
      <param name="fastq1" value="smoke_1.fq"/>
      <param name="fastq2" value="smoke_2.fq"/>
      <param name="taglen" value="5"/>
      <param name="invariant" value="0"/>
      <output name="output" file="smoke.families.i0.tsv"/>
    </test>
  </tests>
  <help>

**What it does**

This tool processes raw duplex sequencing data. It removes the barcodes from the reads and uses them to group the reads into families that originate from the same fragment.

-----

**Output**

The output will be a tabular file where each line corresponds to a pair of input reads.

The columns are::

  1: barcode (both tags joined and ordered)
  2: tag order in barcode ("ab" or "ba")
  3: read1 name
  4: read1 sequence (minus the tag and invariant sequences)
  5: read1 quality scores (minus the same tag and invariant)
  6: read2 name
  7: read2 sequence (minus the tag and invariant sequences)
  8: read2 quality scores (minus the same tag and invariant)
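
As a rough illustration of how the table above might be consumed downstream, the following Python sketch groups consecutive lines that share a barcode and tag order. It assumes the file is sorted (as this tool's output is) and tab-separated. The function name ``read_families`` and the sample rows are hypothetical and are not part of this tool.

```python
from itertools import groupby


def read_families(lines):
    """Yield (barcode, order, rows) for each run of lines sharing columns 1 and 2.

    Assumes the input is already sorted, so each family is contiguous.
    """
    # Split on tabs by hand: quality strings may contain quote characters.
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for (barcode, order), group in groupby(rows, key=lambda row: (row[0], row[1])):
        yield barcode, order, list(group)


# Invented rows in the 8-column layout described above.
example = [
    "ATGCCT\tab\tpair1\tTTTT\tIIII\tpair1\tGGGG\tIIII\n",
    "ATGCCT\tab\tpair2\tTTTA\tIIII\tpair2\tGGGG\tIIII\n",
    "ATGCCT\tba\tpair3\tGGGG\tIIII\tpair3\tTTTT\tIIII\n",
]
for barcode, order, group in read_families(example):
    print(barcode, order, len(group))
# ATGCCT ab 2
# ATGCCT ba 1
```

Both groups share one barcode, so they come from the same fragment. The differing order values mark the two strands.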

-----

**Barcode creation**

For each pair, the tool removes the tag at the beginning of each read and creates a barcode by concatenating the two tags. The tags are put in order by string comparison, so a pair yields the same barcode no matter which read carries which tag. The original tag order is recorded in the second column.

Pairs from opposite strands carry the same tags, but in reverse order, so this produces the same barcode for all reads from the same fragment, regardless of strand. A simple sort then groups all reads from the same fragment together, subdivided by strand according to the "order" value.

Examples::

  +---------------+-----------------+
  |  input tags   |     output      |
  +-------+-------+-------+---------+
  | read1 | read2 | order | barcode |
  +-------+-------+-------+---------+
  |  ATG  |  CCT  |  ab   | ATGCCT  |
  +-------+-------+-------+---------+
  |  CCT  |  ATG  |  ba   | ATGCCT  |
  +-------+-------+-------+---------+
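
The logic in the table can be sketched in a few lines of Python. This is only an illustration of the description above, not the code in ``make-barcodes.awk``. The function name ``make_barcode`` is hypothetical, and the handling of two identical tags (reported here as "ab") is an assumption.

```python
def make_barcode(read1, read2, tag_len=12, invariant=5):
    """Return (barcode, order, trimmed_read1, trimmed_read2) for one read pair."""
    tag1, tag2 = read1[:tag_len], read2[:tag_len]
    # Sort the two tags so that both strands produce the same barcode.
    first, second = sorted((tag1, tag2))
    # "ab" means read1's tag comes first in the barcode (assumed for ties too).
    order = "ab" if first == tag1 else "ba"
    # Drop the tag and the invariant sequence from the start of each read.
    skip = tag_len + invariant
    return first + second, order, read1[skip:], read2[skip:]


# The two rows of the table, with 3-base tags and no invariant sequence.
print(make_barcode("ATGTTTT", "CCTGGGG", tag_len=3, invariant=0))
# ('ATGCCT', 'ab', 'TTTT', 'GGGG')
print(make_barcode("CCTGGGG", "ATGTTTT", tag_len=3, invariant=0))
# ('ATGCCT', 'ba', 'GGGG', 'TTTT')
```

The quality strings would be trimmed by the same number of characters as the sequences.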

    </help>
</tool>