comparison allele-counts.xml @ 5:31361191d2d2

Uploaded tarball. Version 1.1: Stranded output, slightly different handling of minor allele ties and 0 coverage sites, revised help text, added test datasets.
author nick
date Thu, 12 Sep 2013 11:34:23 -0400
parents 898eb3daab43
children df3b28364cd2
4:898eb3daab43 5:31361191d2d2
1 <tool id="allele_counts_1" version="1.0" name="Count alleles"> 1 <tool id="allele_counts_1" version="1.1" name="Variant Annotator">
2 <description>and minor allele frequencies</description> 2 <description> process variant counts</description>
3 <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header</command> 3 <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header $stranded $nofilt</command>
4 <inputs> 4 <inputs>
5 <param name="input" type="data" format="vcf" label="Input variants from Naive Variants Detector"/> 5 <param name="input" type="data" format="vcf" label="Input variants from Naive Variants Detector"/>
6 <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/> 6 <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/>
7 <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (per strand)"/> 7 <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (in reads per strand)"/>
8 <param name="nofilt" type="boolean" truevalue="-n" falsevalue="" checked="False" label="Do not filter sites or alleles" />
9 <param name="stranded" type="boolean" truevalue="-s" falsevalue="" checked="False" label="Output stranded base counts" />
8 <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" /> 10 <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" />
9 </inputs> 11 </inputs>
10 <outputs> 12 <outputs>
11 <data name="output" format="tabular"/> 13 <data name="output" format="tabular"/>
12 </outputs> 14 </outputs>
19 21
20 .. class:: infomark 22 .. class:: infomark
21 23
22 **What it does** 24 **What it does**
23 25
24 This tool parses variant counts from a special VCF file (normally the output of the **Naive Variant Detector** tool). It counts simple (ACGT) variants, calculates numbers of alleles, and calculates minor allele frequency. It applies filters based on coverage, strand bias, and minor allele frequency cutoffs. 26 This tool parses variant counts from a special VCF file. It counts simple variants, calculates numbers of alleles, and calculates minor allele frequency. It can apply filters based on coverage, strand bias, and minor allele frequency cutoffs.
25
26 -----
27
28 .. class:: warningmark
29
30 **Note**
31
32 The VCF must have a certain genotype field in the sample columns, giving the read count of each type of variant. Also, the variant data **must be stranded**. The **Naive Variant Detector** tool produces this type of VCF.
33 27
34 ----- 28 -----
35 29
36 .. class:: infomark 30 .. class:: infomark
37 31
38 **Output columns** 32 **Input Format**
39 33
40 Each row represents one site in one sample. 12 fields give information about that site:: 34 .. class:: warningmark
41 35
42 1. SAMPLE - Sample names (from VCF sample column labels) 36 **Note:** variants that are not A/C/G/T SNVs will be ignored!
37
38 The input VCF should be like the output of the **Naive Variant Detector** tool (using the stranded option). The sample column(s) must give the read count for each variant **on each strand**. Below is an example of a valid sample column entry (the important part is after the last colon)::
39
40 0/0:1:0.02:+T=27,+G=1,-T=22,
41
42 -----
43
44 .. class:: infomark
45
46 **Output**
47
48 Each row represents one site in one sample. For unstranded output, 12 fields give information about that site::
49
50 1. SAMPLE - Sample name (from VCF sample column labels)
43 2. CHR - Chromosome of the site 51 2. CHR - Chromosome of the site
44 3. POS - Chromosomal coordinate of the site 52 3. POS - Chromosomal coordinate of the site
45 4. A - Number of reads supporting an 'A' 53 4. A - Number of reads supporting an 'A'
46 5. C - ditto, for 'C' 54 5. C - 'C' reads
47 6. G - ditto, for 'G' 55 6. G - 'G' reads
48 7. T - ditto, for 'T' 56 7. T - 'T' reads
49 8. CVRG - Total (number of reads supporting one of the four bases above) 57 8. CVRG - Total (number of reads supporting one of the four bases above)
50 9. ALLELES - Number of qualifying alleles 58 9. ALLELES - Number of qualifying alleles
51 10. MAJOR - Major allele base 59 10. MAJOR - Major allele
52 11. MINOR - Minor allele base (2nd most prevalent variant) 60 11. MINOR - Minor allele (2nd most prevalent variant)
53 12. MINOR.FREQ.PERC. - Frequency of minor allele 61 12. MINOR.FREQ.PERC. - Frequency of minor allele
62
63 For stranded output, instead of using 4 columns to report read counts per base, 8 are used to report the stranded counts per base::
64
65 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
66 SAMPLE CHR POS +A +C +G +T -A -C -G -T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC.
54 67
55 **Example** 68 **Example**
56 69
57 This is the header line, followed by some example data lines. Note that some samples and/or sites will not be included in the output, if they fall below the coverage threshold:: 70 Below is a header line, followed by some example data lines. Since the input contained three samples, the data for each site is reported on three consecutive lines. However, if a sample fell below the coverage threshold at that site, the line will be omitted::
58 71
59 #SAMPLE CHR POS A C G T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC. 72 #SAMPLE CHR POS A C G T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC.
60 BLOOD_1 chr20 99 0 101 1 2 104 1 C T 0.01923 73 BLOOD_1 chr20 99 0 101 1 2 104 1 C T 0.01923
61 BLOOD_2 chr20 99 82 44 0 1 127 2 A C 0.34646 74 BLOOD_2 chr20 99 82 44 0 1 127 2 A C 0.34646
62 BLOOD_3 chr20 99 0 110 1 0 111 1 C G 0.009 75 BLOOD_3 chr20 99 0 110 1 0 111 1 C G 0.009
67 80
68 .. class:: warningmark 81 .. class:: warningmark
69 82
70 **Site printing and allele tallying requirements** 83 **Site printing and allele tallying requirements**
71 84
72 Each line is printed only when the site is covered by the threshold number of reads **on each strand**. If coverage of either strand is below the threshold, the line (sample + site combination) is omitted. 85 Coverage threshold:
73 86
74 **N.B.**: This means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option. 87 If a coverage threshold is used, the number of reads **on each strand** must be at or above the threshold. If either strand is below the threshold, the line will be omitted. **N.B.** this means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option. Also, since only simple variants are counted, a site with 100 reads, all supporting a deletion variant, would not be printed.
75 88
76 Also, reads supporting a variant outside the canonical 4 nucleotides will not count towards the coverage requirement. For instance, a site/sample line with 100x coverage, all of which support a deletion variant, will not be printed. 89 Frequency threshold:
77 90
78 Alleles are only counted (in column 9) if they meet or exceed the minor allele frequency threshold. So a site/sample line with types of variants, 96% A, 3.3% C, and 0.7% G, will count as 2 alleles (at 1% threshold). 91 If a frequency threshold is used, alleles are only counted (in the ALLELES column) if they meet or exceed this minor allele frequency threshold.
79 92
80 Strand bias: the alleles passing the threshold on each strand have to match (though not in order). Otherwise, the allele count will be 0. So a site/sample line whose + strand shows 70% A, 27% C, and 3% G, and - strand shows 70% A and 30% C will have an allele count of 0. The minor allele and minor allele frequency, though, will always be reported\*. 93 Strand bias:
81 94
82 But in this version, there is no requirement that the strands show similar allele frequencies, as long as they both pass the threshold. 95 The alleles passing the threshold on each strand must match (though not in order), or the allele count will be 0. So a site with A, C, G on the plus strand and A, G on the minus strand will get an allele count of zero, though the (strand-independent) major allele, minor allele, and minor allele frequency will still be reported. If there is a tie for the minor allele, one will be randomly chosen.
83
84 \*One specific case will actually affect the reported minor allele identity and frequency. If there is a tie for the minor allele (between the 2nd and 3rd most common alleles), the minor allele will be reporated as 'N', and the frequency as 0.0.
85 96
86 </help> 97 </help>
87 98
88 </tool> 99 </tool>
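
The filtering rules described in the revised help text (per-strand coverage threshold, minor allele frequency threshold, strand-bias check) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in allele-counts.py. The function names (`parse_stranded_counts`, `passing_alleles`, `summarize_site`) are hypothetical. The frequency threshold is assumed to be applied per strand, against that strand's own coverage. Zero-coverage sites are simply skipped, and minor allele ties are resolved by count order rather than randomly.

```python
from collections import Counter

BASES = ("A", "C", "G", "T")


def parse_stranded_counts(sample_field):
    """Parse the part after the last colon, e.g. '+T=27,+G=1,-T=22,'.

    Returns {'+': Counter, '-': Counter}; anything that is not a
    stranded A/C/G/T entry is ignored, as the help text describes.
    """
    counts = {"+": Counter(), "-": Counter()}
    variant_data = sample_field.split(":")[-1]
    for item in variant_data.split(","):
        if not item:
            continue  # the example entry ends with a trailing comma
        variant, _, count = item.partition("=")
        if len(variant) != 2:
            continue  # not a strand sign plus a single base
        strand, base = variant[0], variant[1]
        if strand in counts and base in BASES:
            counts[strand][base] += int(count)
    return counts


def passing_alleles(strand_counts, freq_percent):
    """Bases at or above the frequency threshold on one strand.

    Assumption: frequency is measured against this strand's coverage.
    """
    total = sum(strand_counts.values())
    if total == 0:
        return set()
    return {b for b, n in strand_counts.items()
            if 100.0 * n / total >= freq_percent}


def summarize_site(counts, covg=10, freq_percent=1.0):
    """Return a summary dict, or None if the line would be omitted."""
    plus, minus = counts["+"], counts["-"]
    # Coverage threshold applies to each strand separately.
    if sum(plus.values()) < covg or sum(minus.values()) < covg:
        return None
    combined = plus + minus
    total = sum(combined.values())
    if total == 0:
        return None  # assumption: skip sites with no A/C/G/T reads
    # Strand bias: the passing alleles must match on both strands.
    plus_pass = passing_alleles(plus, freq_percent)
    minus_pass = passing_alleles(minus, freq_percent)
    alleles = len(plus_pass) if plus_pass == minus_pass else 0
    # Major/minor alleles are strand-independent.
    ranked = combined.most_common()
    major = ranked[0][0]
    minor, minor_count = ranked[1] if len(ranked) > 1 else (None, 0)
    return {
        "cvrg": total,
        "alleles": alleles,
        "major": major,
        "minor": minor,
        # Fraction of reads, matching the example rows (2/104 = 0.01923).
        "minor_freq": minor_count / total,
    }
```

With the sample entry from the help text, `summarize_site(parse_stranded_counts("0/0:1:0.02:+T=27,+G=1,-T=22,"))` passes the default coverage threshold of 10 on both strands but gets an allele count of 0 at the 1% threshold, because G passes on the plus strand only. T and G are still reported as the major and minor alleles.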