view allele-counts.xml @ 11:cf2af5c3118c draft default tip

"planemo upload for repository https://github.com/galaxyproject/dunovo commit ac9577f0047ad984b307fe2c4b40e2eb45a0e6e2"
author nick
date Fri, 03 Apr 2020 16:05:04 -0400
parents 7f19e8c03358
children
line wrap: on
line source

<tool id="allele_counts_1" version="1.3.2" name="Variant Annotator">
  <description> process variant counts</description>
  <stdio>
    <exit_code range="1:" level="fatal" />
    <exit_code range=":-1" level="fatal" />
  </stdio>
  <command>python3 '$__tool_directory__/allele-counts.py' -i '$input' -o '$output' -f $freq -c $covg $header $stranded $nofilt
  #if $seed:
    -r '$seed'
  #end if
  </command>
  <inputs>
    <param name="input" type="data" format="vcf" label="Input variants from Naive Variants Detector"/>
    <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold" help="in percent"/>
    <param name="covg" type="integer" value="10" min="0" label="Coverage threshold" help="in reads (per strand)"/>
    <param name="nofilt" type="boolean" truevalue="-n" falsevalue="" checked="False" label="Do not filter sites or alleles" />
    <param name="stranded" type="boolean" truevalue="-s" falsevalue="" checked="False" label="Output stranded base counts" />
    <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" />
    <param name="seed" type="text" value="" label="PRNG seed" />
  </inputs>
  <outputs>
    <data name="output" format="tabular" />
  </outputs>

  <tests>
    <test>
      <param name="input" value="tests/artificial.vcf.in" />
      <param name="freq" value="10" />
      <param name="covg" value="10" />
      <param name="seed" value="1" />
      <output name="output" file="tests/artificial.csv.out" />
    </test>
  </tests>

  <help>

.. class:: infomark

**What it does**

This tool parses variant counts from a special VCF file. It counts simple variants, calculates numbers of alleles, and calculates minor allele frequency. It can apply filters based on coverage, strand bias, and minor allele frequency cutoffs.

-----

.. class:: infomark

**Input Format**

.. class:: warningmark

**Note:** variants that are not A/C/G/T SNVs will be ignored!

The input VCF should be like the output of the **Naive Variant Detector** tool (using the stranded option). The sample column(s) must give the read count for each variant **on each strand**. Below is an example of a valid sample column entry (the important part is after the last colon)::

    0/0:1:0.02:+T=27,+G=1,-T=22,

-----

.. class:: infomark

**Output**

Each row represents one site in one sample. For **unstranded** output, 13 fields give information about that site::

    1.  SAMPLE  - Sample name (from VCF sample column labels)
    2.  CHR     - Chromosome of the site
    3.  POS     - Chromosomal coordinate of the site
    4.  A       - Number of reads supporting an 'A'
    5.  C       - 'C' reads
    6.  G       - 'G' reads
    7.  T       - 'T' reads
    8.  CVRG    - Total (number of reads supporting one of the four bases above)
    9.  ALLELES - Number of qualifying alleles
    10. MAJOR   - Major allele
    11. MINOR   - Minor allele (2nd most prevalent variant)
    12. MAF     - Frequency of minor allele
    13. BIAS    - Strand bias measure

For stranded output, instead of using 4 columns to report read counts per base, 8 are used to report the stranded counts per base::

    1       2   3   4  5  6  7  8  9 10 11  12    13     14    15   16   17
    SAMPLE CHR POS +A +C +G +T -A -C -G -T CVRG ALLELES MAJOR MINOR MAF BIAS

**Example**

Below is a header line, followed by some example data lines. Since the input contained three samples, the data for each site is reported on three consecutive lines. However, if a sample fell below the coverage threshold at that site, the line will be omitted::

    #SAMPLE  CHR    POS  A   C    G    T  CVRG  ALLELES  MAJOR  MINOR  MAF      BIAS
    BLOOD_1  chr20  99   0   101  1    2  104   1        C      T      0.01923  0.33657
    BLOOD_2  chr20  99   82  44   0    1  127   2        A      C      0.34646  0.07823
    BLOOD_3  chr20  99   0   110  1    0  111   1        C      G      0.009    1.00909
    BLOOD_1  chr20  100  3   5    100  0  108   1        G      C      0.0463   0.15986
    BLOOD_3  chr20  100  1   118  11   0  130   0        C      G      0.08462  0.04154

-----

.. class:: warningmark

**Site printing and allele tallying requirements**

Coverage threshold:

If a coverage threshold is used, the number of reads **on each strand** must be at or above the threshold. If either strand is below the threshold, the line will be omitted. **N.B.** this means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option. Also, since only simple variants are counted, a site with 100 reads, all supporting a deletion variant, would not be printed.

Frequency threshold:

If a frequency threshold is used, alleles are only counted (in the ALLELES column) if they meet or exceed this minor allele frequency threshold.

Strand bias:

The alleles passing the threshold on each strand must match (though not in order), or the allele count will be 0. So a site with A, C, G on the plus strand and A, G on the minus strand will get an allele count of zero, though the (strand-independent) major allele, minor allele, and minor allele frequency will still be reported. If there is a tie for the minor allele, one will be randomly chosen.

Additionally, a measure of strand bias is given in the last column. This is calculated using the method of Guo et al., 2012. A value of "." is given when there is no valid result of the calculation due to a zero denominator. This occurs when there are no reads on one of the strands, or when there is no minor allele.

  </help>

  <citations>
    <citation type="bibtex">
      @article{Blankenberg2014,
        author = {Blankenberg, Daniel and {Von Kuster}, Gregory and Bouvier, Emil and Baker, Dannon and Afgan, Enis and Stoler, Nicholas and Taylor, James and Nekrutenko, Anton},
        doi = {10.1186/gb4161},
        issn = {1465-6906},
        journal = {Genome Biology},
        keywords = {galaxy},
        number = {2},
        pages = {403},
        title = {{Dissemination of scientific software with Galaxy ToolShed}},
        url = {http://genomebiology.biomedcentral.com/articles/10.1186/gb4161},
        volume = {15},
        year = {2014}
      }
    </citation>
    <citation type="bibtex">
      @article{Dickins2014,
        archivePrefix = {arXiv},
        arxivId = {15334406},
        author = {Dickins, Benjamin and Rebolledo-Jaramillo, Boris and Su, Marcia Shu Wei and Paul, Ian M and Blankenberg, Daniel and Stoler, Nicholas and Makova, Kateryna D and Nekrutenko, Anton},
        doi = {10.2144/000114146},
        eprint = {15334406},
        isbn = {5049880467},
        issn = {19409818},
        journal = {BioTechniques},
        number = {3},
        pages = {134--141},
        pmid = {24641477},
        title = {{Controlling for contamination in re-sequencing studies with a reproducible web-based phylogenetic approach}},
        volume = {56},
        year = {2014}
      }
    </citation>
  </citations>

</tool>