<tool id="allele_counts_1" version="1.0" name="Count alleles">
  <description>and minor allele frequencies</description>
  <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header</command>
  <inputs>
    <param name="input" type="data" format="vcf" label="Input variants from Naive Variant Detector"/>
    <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/>
    <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (per strand)"/>
    <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" />
  </inputs>
  <outputs>
    <data name="output" format="tabular"/>
  </outputs>
  <stdio>
    <exit_code range="1:" err_level="fatal"/>
    <exit_code range=":-1" err_level="fatal"/>
  </stdio>

  <help>

.. class:: infomark

**What it does**

This tool parses variant counts from a special VCF file (normally the output of the **Naive Variant Detector** tool). It counts simple (ACGT) variants, calculates numbers of alleles, and calculates minor allele frequency. It applies filters based on coverage, strand bias, and minor allele frequency cutoffs.

-----

.. class:: warningmark

**Note**

The VCF must include a specific genotype field in its sample columns that gives the read count for each type of variant. The variant data **must also be stranded**. The **Naive Variant Detector** tool produces this type of VCF.

-----

.. class:: infomark

**Output columns**

Each row represents one site in one sample. 12 fields give information about that site::

    1.  SAMPLE           - Sample names (from VCF sample column labels)
    2.  CHR              - Chromosome of the site
    3.  POS              - Chromosomal coordinate of the site
    4.  A                - Number of reads supporting an 'A'
    5.  C                - ditto, for 'C'
    6.  G                - ditto, for 'G'
    7.  T                - ditto, for 'T'
    8.  CVRG             - Total (number of reads supporting one of the four bases above)
    9.  ALLELES          - Number of qualifying alleles
    10. MAJOR            - Major allele base
    11. MINOR            - Minor allele base (2nd most prevalent variant)
    12. MINOR.FREQ.PERC. - Frequency of minor allele

**Example**

This is the header line, followed by some example data lines. Note that some samples and/or sites will not be included in the output, if they fall below the coverage threshold::

    #SAMPLE  CHR    POS  A   C    G    T  CVRG  ALLELES  MAJOR  MINOR  MINOR.FREQ.PERC.
    BLOOD_1  chr20  99   0   101  1    2  104   1        C      T      0.01923
    BLOOD_2  chr20  99   82  44   0    1  127   2        A      C      0.34646
    BLOOD_3  chr20  99   0   110  1    0  111   1        C      G      0.009
    BLOOD_1  chr20  100  3   5    100  0  108   1        G      C      0.0463
    BLOOD_3  chr20  100  1   118  11   0  130   0        C      G      0.08462

-----

.. class:: warningmark

**Site printing and allele tallying requirements**

A line is printed only if the site is covered by at least the threshold number of reads **on each strand**. If coverage on either strand is below the threshold, the line (sample + site combination) is omitted.

**N.B.**: This means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option.

Also, reads supporting a variant outside the canonical 4 nucleotides do not count towards the coverage requirement. For instance, a site/sample line with 100x coverage, all of it supporting a deletion variant, will not be printed.
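
As a rough illustration of the coverage rule above, the following Python sketch checks one site/sample combination. It is not the code in allele-counts.py: the function names, the dictionary layout for per-strand read counts, and the ``d1`` label used for a deletion are all assumptions made for this example.

```python
# Illustrative sketch only; names and data layout are hypothetical.
CANONICAL = ("A", "C", "G", "T")


def strand_coverage(counts):
    """Sum the reads supporting A, C, G or T on one strand.

    Any other variant type (e.g. a deletion) is ignored.
    """
    return sum(counts.get(base, 0) for base in CANONICAL)


def passes_coverage(plus_counts, minus_counts, threshold):
    """Return True only if each strand reaches the threshold on its own."""
    return (strand_coverage(plus_counts) >= threshold
            and strand_coverage(minus_counts) >= threshold)


# Both strands covered by at least 10 canonical reads.
print(passes_coverage({"A": 12, "C": 1}, {"A": 11}, 10))
# The minus strand has only 9 reads, so the line would be omitted.
print(passes_coverage({"A": 50}, {"A": 9}, 10))
# 100x coverage made up entirely of deletion reads does not count.
print(passes_coverage({"d1": 50}, {"d1": 50}, 10))
```

Under these assumptions the three calls print True, False and False.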

Alleles are only counted (in column 9) if they meet or exceed the minor allele frequency threshold. For example, a site/sample line with three types of variants (96% A, 3.3% C, and 0.7% G) counts as 2 alleles at a 1% threshold.

Strand bias: the set of alleles passing the threshold must be the same on both strands (order does not matter). Otherwise, the allele count will be 0. For example, a site/sample line whose + strand shows 70% A, 27% C, and 3% G, and whose - strand shows 70% A and 30% C, will have an allele count of 0. The minor allele and minor allele frequency, though, will always be reported\*.

In this version, however, the strands are not required to show similar allele frequencies, as long as both pass the threshold.

\*One specific case does affect the reported minor allele identity and frequency. If there is a tie for the minor allele (between the 2nd and 3rd most common alleles), the minor allele will be reported as 'N', and the frequency as 0.0.
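
The tallying rules above can be sketched in Python as follows. This is a hedged illustration, not the implementation in allele-counts.py: the function names and the per-strand dictionaries are hypothetical, frequencies are assumed to be computed over the ACGT reads of each strand, and a tie at zero reads is assumed to be treated like any other tie.

```python
# Illustrative sketch only; names and data layout are hypothetical.
CANONICAL = ("A", "C", "G", "T")


def passing_alleles(counts, freq_percent):
    """Bases whose share of this strand's ACGT reads meets the threshold."""
    total = sum(counts.get(b, 0) for b in CANONICAL)
    if total == 0:
        return set()
    return {b for b in CANONICAL
            if counts.get(b, 0) > 0
            and 100.0 * counts.get(b, 0) / total >= freq_percent}


def count_alleles(plus, minus, freq_percent):
    """Allele count if both strands agree on the passing set, else 0."""
    plus_set = passing_alleles(plus, freq_percent)
    minus_set = passing_alleles(minus, freq_percent)
    return len(plus_set) if plus_set == minus_set else 0


def minor_allele(plus, minus):
    """Second most common base over both strands and its frequency.

    A tie between the 2nd and 3rd most common bases gives ('N', 0.0).
    """
    totals = {b: plus.get(b, 0) + minus.get(b, 0) for b in CANONICAL}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    coverage = sum(totals.values())
    second, third = ranked[1], ranked[2]
    if coverage == 0 or second[1] == third[1]:
        return ("N", 0.0)
    return (second[0], second[1] / coverage)


# Strand-bias example from the text: G passes only on the + strand.
print(count_alleles({"A": 70, "C": 27, "G": 3}, {"A": 70, "C": 30}, 1.0))
# 96% A, 3.3% C, 0.7% G on both strands: two alleles at a 1% threshold.
strand = {"A": 960, "C": 33, "G": 7}
print(count_alleles(strand, strand, 1.0))
# C and G tie for second place, so the minor allele is reported as 'N'.
print(minor_allele({"A": 10, "C": 5, "G": 5}, {"A": 10, "C": 5, "G": 5}))
```

With these inputs the sketch prints 0, 2 and ('N', 0.0), matching the cases described above.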

  </help>

</tool>