<tool id="allele_counts_1" version="1.1" name="Variant Annotator">
  <description> process variant counts</description>
  <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header $stranded $nofilt</command>
  <inputs>
    <param name="input" type="data" format="vcf" label="Input variants from Naive Variant Detector"/>
    <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/>
    <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (in reads per strand)"/>
    <param name="nofilt" type="boolean" truevalue="-n" falsevalue="" checked="False" label="Do not filter sites or alleles" />
    <param name="stranded" type="boolean" truevalue="-s" falsevalue="" checked="False" label="Output stranded base counts" />
    <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" />
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <stdio>
    <exit_code range="1:" err_level="fatal"/>
    <exit_code range=":-1" err_level="fatal"/>
  </stdio>

  <help>

.. class:: infomark

**What it does**

This tool parses variant counts from a VCF file whose sample columns carry per-strand read counts (see **Input Format**). It counts simple variants, tallies the number of alleles at each site, and calculates the minor allele frequency. It can filter sites and alleles by coverage, strand bias, and minor allele frequency.

-----

.. class:: infomark

**Input Format**

.. class:: warningmark

**Note:** variants that are not A/C/G/T SNVs will be ignored!

The input VCF should resemble the output of the **Naive Variant Detector** tool run with the stranded option. Each sample column must give the read count for each variant **on each strand**. Below is an example of a valid sample column entry (the important part comes after the last colon)::

  0/0:1:0.02:+T=27,+G=1,-T=22,

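As a rough illustration of that format, the Python sketch below pulls the per-strand counts out of one sample column entry. The function name ``parse_sample_field`` and its return layout are hypothetical, chosen for this example; this is not the code in ``allele-counts.py``.

```python
def parse_sample_field(field):
    """Return {'+': {base: count}, '-': {base: count}} for one sample column entry."""
    counts = {"+": {}, "-": {}}
    # The stranded counts are everything after the last colon.
    variants = field.split(":")[-1]
    for item in variants.split(","):
        if not item:
            continue  # a trailing comma leaves an empty item
        allele, _, count = item.partition("=")
        strand, base = allele[:1], allele[1:]
        # Keep only simple A/C/G/T variants with a numeric count; ignore the rest.
        if strand in counts and base in ("A", "C", "G", "T") and count.isdigit():
            counts[strand][base] = int(count)
    return counts


print(parse_sample_field("0/0:1:0.02:+T=27,+G=1,-T=22,"))
```

For the entry shown above, this yields 27 ``T`` reads and 1 ``G`` read on the plus strand and 22 ``T`` reads on the minus strand.
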
-----

.. class:: infomark

**Output**

Each row represents one site in one sample. For unstranded output, 12 fields describe the site::

   1. SAMPLE           - Sample name (from VCF sample column labels)
   2. CHR              - Chromosome of the site
   3. POS              - Chromosomal coordinate of the site
   4. A                - Number of reads supporting an 'A'
   5. C                - Number of reads supporting a 'C'
   6. G                - Number of reads supporting a 'G'
   7. T                - Number of reads supporting a 'T'
   8. CVRG             - Total (number of reads supporting one of the four bases above)
   9. ALLELES          - Number of qualifying alleles
  10. MAJOR            - Major allele
  11. MINOR            - Minor allele (2nd most prevalent variant)
  12. MINOR.FREQ.PERC. - Frequency of minor allele

For stranded output, the 4 columns of read counts per base are replaced by 8 columns of stranded counts per base, giving 16 fields::

  1      2   3   4  5  6  7  8  9  10 11 12   13      14    15    16
  SAMPLE CHR POS +A +C +G +T -A -C -G -T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC.

**Example**

Below is a header line followed by some example data lines. Because the input contained three samples, the data for each site is reported on up to three consecutive lines, one per sample. If a sample falls below the coverage threshold at a site, its line is omitted::

  #SAMPLE  CHR    POS  A   C    G    T  CVRG  ALLELES  MAJOR  MINOR  MINOR.FREQ.PERC.
  BLOOD_1  chr20  99   0   101  1    2  104   1        C      T      0.01923
  BLOOD_2  chr20  99   82  44   0    1  127   2        A      C      0.34646
  BLOOD_3  chr20  99   0   110  1    0  111   1        C      G      0.009
  BLOOD_1  chr20  100  3   5    100  0  108   1        G      C      0.0463
  BLOOD_3  chr20  100  1   118  11   0  130   0        C      G      0.08462

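The summary columns can be reproduced from the four base counts. The Python sketch below is a minimal illustration that assumes MINOR.FREQ.PERC. is the minor allele's read count divided by CVRG, which is consistent with the example rows above. The name ``summarize`` is hypothetical and is not part of the tool.

```python
def summarize(counts):
    """counts maps each base to its read count, e.g. {'A': 82, 'C': 44, 'G': 0, 'T': 1}."""
    coverage = sum(counts.values())
    # Rank bases from most to least supported. Tied bases keep their input
    # order here, whereas the tool picks a tied minor allele at random.
    ranked = sorted(counts, key=counts.get, reverse=True)
    major, minor = ranked[0], ranked[1]
    # Assumption: frequency is minor-allele reads divided by total coverage.
    minor_freq = counts[minor] / coverage if coverage else 0.0
    return coverage, major, minor, round(minor_freq, 5)


print(summarize({"A": 82, "C": 44, "G": 0, "T": 1}))
```

With the BLOOD_2 counts at position 99, this gives a coverage of 127, major allele A, minor allele C, and a frequency of 0.34646, matching that row.
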
-----

.. class:: warningmark

**Site printing and allele tallying requirements**

Coverage threshold:

If a coverage threshold is used, the number of reads **on each strand** must be at or above the threshold. If either strand is below the threshold, the line is omitted. **N.B.** This means the total coverage of each printed site will be at least twice the value given in the "coverage threshold" option. Also, since only simple variants are counted, a site with 100 reads that all support a deletion variant would not be printed.

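The per-strand rule can be sketched in Python as follows. This is an illustration only: ``passes_coverage`` is a hypothetical name, and it assumes each strand's coverage is the sum of its A/C/G/T read counts.

```python
def passes_coverage(plus_counts, minus_counts, threshold):
    """True if the A/C/G/T read total on each strand is at or above threshold."""
    plus_total = sum(plus_counts.values())
    minus_total = sum(minus_counts.values())
    # Both strands must pass on their own; a high total is not enough.
    return plus_total >= threshold and minus_total >= threshold


# 28 plus-strand reads and 22 minus-strand reads, threshold of 10 per strand.
print(passes_coverage({"T": 27, "G": 1}, {"T": 22}, 10))
# Same plus strand, but only 8 minus-strand reads.
print(passes_coverage({"T": 27, "G": 1}, {"T": 8}, 10))
```

The second call fails even though the site has 36 reads in total, because the minus strand alone is below the threshold.
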
Frequency threshold:

If a frequency threshold is used, alleles are counted in the ALLELES column only if they meet or exceed this minor allele frequency threshold.

Strand bias:

The alleles passing the threshold on each strand must match (though not necessarily in the same order), or the allele count will be 0. For example, a site with A, C, G on the plus strand and A, G on the minus strand will get an allele count of zero, though the (strand-independent) major allele, minor allele, and minor allele frequency will still be reported. If there is a tie for the minor allele, one will be chosen at random.

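The two rules above can be combined in a short Python sketch. It rests on assumptions that the text does not confirm: an allele is taken to pass on a strand when its share of that strand's reads, in percent, meets the threshold, and ``count_alleles`` is a hypothetical name rather than a function of the tool.

```python
def count_alleles(plus_counts, minus_counts, freq_percent):
    """Return an ALLELES value for one site under the stated assumptions."""

    def passing(counts):
        total = sum(counts.values())
        if total == 0:
            return set()
        # Keep bases whose share of this strand's reads meets the threshold.
        return {base for base, n in counts.items()
                if n > 0 and 100.0 * n / total >= freq_percent}

    plus_alleles = passing(plus_counts)
    minus_alleles = passing(minus_counts)
    # Sets compare without regard to order; any mismatch counts as strand bias.
    if plus_alleles != minus_alleles:
        return 0
    return len(plus_alleles)


# A, C, G pass on the plus strand but only A, G on the minus strand.
print(count_alleles({"A": 50, "C": 30, "G": 20}, {"A": 60, "G": 40}, 1.0))
# A and C pass on both strands.
print(count_alleles({"A": 50, "C": 50}, {"C": 40, "A": 60}, 1.0))
```

The first call returns 0 because the passing alleles differ between strands; the second returns 2.
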
  </help>

</tool>