comparison allele-counts.xml @ 5:31361191d2d2
Uploaded tarball.
Version 1.1: Stranded output, slightly different handling of minor allele ties and 0 coverage sites, revised help text, added test datasets.
author | nick |
---|---|
date | Thu, 12 Sep 2013 11:34:23 -0400 |
parents | 898eb3daab43 |
children | df3b28364cd2 |
4:898eb3daab43 | 5:31361191d2d2 |
---|---|
1 <tool id="allele_counts_1" version="1.0" name="Count alleles"> | 1 <tool id="allele_counts_1" version="1.1" name="Variant Annotator"> |
2 <description>and minor allele frequencies</description> | 2 <description> process variant counts</description> |
3 <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header</command> | 3 <command interpreter="python">allele-counts.py -i $input -o $output -f $freq -c $covg $header $stranded $nofilt</command> |
4 <inputs> | 4 <inputs> |
5 <param name="input" type="data" format="vcf" label="Input variants from Naive Variants Detector"/> | 5 <param name="input" type="data" format="vcf" label="Input variants from Naive Variants Detector"/> |
6 <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/> | 6 <param name="freq" type="float" value="1.0" min="0" max="100" label="Minor allele frequency threshold (in percent)"/> |
7 <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (per strand)"/> | 7 <param name="covg" type="integer" value="10" min="0" label="Coverage threshold (in reads per strand)"/> |
8 <param name="nofilt" type="boolean" truevalue="-n" falsevalue="" checked="False" label="Do not filter sites or alleles" /> | |
9 <param name="stranded" type="boolean" truevalue="-s" falsevalue="" checked="False" label="Output stranded base counts" /> | |
8 <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" /> | 10 <param name="header" type="boolean" truevalue="-H" falsevalue="" checked="True" label="Write header line" /> |
9 </inputs> | 11 </inputs> |
10 <outputs> | 12 <outputs> |
11 <data name="output" format="tabular"/> | 13 <data name="output" format="tabular"/> |
12 </outputs> | 14 </outputs> |
19 | 21 |
20 .. class:: infomark | 22 .. class:: infomark |
21 | 23 |
22 **What it does** | 24 **What it does** |
23 | 25 |
24 This tool parses variant counts from a special VCF file (normally the output of the **Naive Variant Detector** tool). It counts simple (ACGT) variants, calculates numbers of alleles, and calculates minor allele frequency. It applies filters based on coverage, strand bias, and minor allele frequency cutoffs. | 26 This tool parses variant counts from a special VCF file. It counts simple variants, calculates numbers of alleles, and calculates minor allele frequency. It can apply filters based on coverage, strand bias, and minor allele frequency cutoffs. |
25 | |
26 ----- | |
27 | |
28 .. class:: warningmark | |
29 | |
30 **Note** | |
31 | |
32 The VCF must have a certain genotype field in the sample columns, giving the read count of each type of variant. Also, the variant data **must be stranded**. The **Naive Variant Detector** tool produces this type of VCF. | |
33 | 27 |
34 ----- | 28 ----- |
35 | 29 |
36 .. class:: infomark | 30 .. class:: infomark |
37 | 31 |
38 **Output columns** | 32 **Input Format** |
39 | 33 |
40 Each row represents one site in one sample. 12 fields give information about that site:: | 34 .. class:: warningmark |
41 | 35 |
42 1. SAMPLE - Sample names (from VCF sample column labels) | 36 **Note:** variants that are not A/C/G/T SNVs will be ignored! |
37 | |
38 The input VCF should be like the output of the **Naive Variant Detector** tool (using the stranded option). The sample column(s) must give the read count for each variant **on each strand**. Below is an example of a valid sample column entry (the important part is after the last colon):: | |
39 | |
40 0/0:1:0.02:+T=27,+G=1,-T=22, | |
41 | |
42 ----- | |
43 | |
44 .. class:: infomark | |
45 | |
46 **Output** | |
47 | |
48 Each row represents one site in one sample. For unstranded output, 12 fields give information about that site:: | |
49 | |
50 1. SAMPLE - Sample name (from VCF sample column labels) | |
43 2. CHR - Chromosome of the site | 51 2. CHR - Chromosome of the site |
44 3. POS - Chromosomal coordinate of the site | 52 3. POS - Chromosomal coordinate of the site |
45 4. A - Number of reads supporting an 'A' | 53 4. A - Number of reads supporting an 'A' |
46 5. C - ditto, for 'C' | 54 5. C - 'C' reads |
47 6. G - ditto, for 'G' | 55 6. G - 'G' reads |
48 7. T - ditto, for 'T' | 56 7. T - 'T' reads |
49 8. CVRG - Total (number of reads supporting one of the four bases above) | 57 8. CVRG - Total (number of reads supporting one of the four bases above) |
50 9. ALLELES - Number of qualifying alleles | 58 9. ALLELES - Number of qualifying alleles |
51 10. MAJOR - Major allele base | 59 10. MAJOR - Major allele |
52 11. MINOR - Minor allele base (2nd most prevalent variant) | 60 11. MINOR - Minor allele (2nd most prevalent variant) |
53 12. MINOR.FREQ.PERC. - Frequency of minor allele | 61 12. MINOR.FREQ.PERC. - Frequency of minor allele |
62 | |
63 For stranded output, instead of using 4 columns to report read counts per base, 8 are used to report the stranded counts per base:: | |
64 | |
65 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | |
66 SAMPLE CHR POS +A +C +G +T -A -C -G -T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC. | |
54 | 67 |
55 **Example** | 68 **Example** |
56 | 69 |
57 This is the header line, followed by some example data lines. Note that some samples and/or sites will not be included in the output, if they fall below the coverage threshold:: | 70 Below is a header line, followed by some example data lines. Since the input contained three samples, the data for each site is reported on three consecutive lines. However, if a sample fell below the coverage threshold at that site, the line will be omitted:: |
58 | 71 |
59 #SAMPLE CHR POS A C G T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC. | 72 #SAMPLE CHR POS A C G T CVRG ALLELES MAJOR MINOR MINOR.FREQ.PERC. |
60 BLOOD_1 chr20 99 0 101 1 2 104 1 C T 0.01923 | 73 BLOOD_1 chr20 99 0 101 1 2 104 1 C T 0.01923 |
61 BLOOD_2 chr20 99 82 44 0 1 127 2 A C 0.34646 | 74 BLOOD_2 chr20 99 82 44 0 1 127 2 A C 0.34646 |
62 BLOOD_3 chr20 99 0 110 1 0 111 1 C G 0.009 | 75 BLOOD_3 chr20 99 0 110 1 0 111 1 C G 0.009 |
67 | 80 |
68 .. class:: warningmark | 81 .. class:: warningmark |
69 | 82 |
70 **Site printing and allele tallying requirements** | 83 **Site printing and allele tallying requirements** |
71 | 84 |
72 Each line is printed only when the site is covered by the threshold number of reads **on each strand**. If coverage of either strand is below the threshold, the line (sample + site combination) is omitted. | 85 Coverage threshold: |
73 | 86 |
74 **N.B.**: This means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option. | 87 If a coverage threshold is used, the number of reads **on each strand** must be at or above the threshold. If either strand is below the threshold, the line will be omitted. **N.B.** this means the total coverage for each printed site will be at least twice the number you give in the "coverage threshold" option. Also, since only simple variants are counted, a site with 100 reads, all supporting a deletion variant, would not be printed. |
75 | 88 |
76 Also, reads supporting a variant outside the canonical 4 nucleotides will not count towards the coverage requirement. For instance, a site/sample line with 100x coverage, all of which support a deletion variant, will not be printed. | 89 Frequency threshold: |
77 | 90 |
78 Alleles are only counted (in column 9) if they meet or exceed the minor allele frequency threshold. So a site/sample line with types of variants, 96% A, 3.3% C, and 0.7% G, will count as 2 alleles (at 1% threshold). | 91 If a frequency threshold is used, alleles are only counted (in the ALLELES column) if they meet or exceed this minor allele frequency threshold. |
79 | 92 |
80 Strand bias: the alleles passing the threshold on each strand have to match (though not in order). Otherwise, the allele count will be 0. So a site/sample line whose + strand shows 70% A, 27% C, and 3% G, and - strand shows 70% A and 30% C will have an allele count of 0. The minor allele and minor allele frequency, though, will always be reported\*. | 93 Strand bias: |
81 | 94 |
82 But in this version, there is no requirement that the strands show similar allele frequencies, as long as they both pass the threshold. | 95 The alleles passing the threshold on each strand must match (though not in order), or the allele count will be 0. So a site with A, C, G on the plus strand and A, G on the minus strand will get an allele count of zero, though the (strand-independent) major allele, minor allele, and minor allele frequency will still be reported. If there is a tie for the minor allele, one will be randomly chosen. |
83 | |
84 \*One specific case will actually affect the reported minor allele identity and frequency. If there is a tie for the minor allele (between the 2nd and 3rd most common alleles), the minor allele will be reporated as 'N', and the frequency as 0.0. | |
85 | 96 |
86 </help> | 97 </help> |
87 | 98 |
88 </tool> | 99 </tool> |
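
The filtering rules described in the revised help text can be sketched in a few lines of Python. This is an illustrative reading of the help text only, not the code in allele-counts.py. The function names `parse_stranded_counts` and `summarize_site` are hypothetical. Applying the frequency threshold per strand, breaking ties by first occurrence (the tool is described as choosing randomly), and skipping sites with zero coverage are all assumptions. The minor allele frequency is returned as a fraction, which is what the example output rows appear to show (2/104 = 0.01923).

```python
from collections import Counter

BASES = ("A", "C", "G", "T")


def parse_stranded_counts(sample_field):
    """Parse a sample column entry such as '0/0:1:0.02:+T=27,+G=1,-T=22,'.

    Only the part after the last colon is used. Anything that is not a
    stranded A/C/G/T SNV (e.g. an indel) is ignored.
    """
    counts = {"+": Counter(), "-": Counter()}
    variants = sample_field.split(":")[-1]
    for item in variants.split(","):
        if not item:
            continue  # the trailing comma leaves an empty item
        name, _, num = item.partition("=")
        if len(name) != 2:
            continue  # not a strand sign followed by a single base
        strand, base = name[0], name[1]
        if strand in counts and base in BASES:
            counts[strand][base] += int(num)
    return counts


def summarize_site(counts, freq_pct=1.0, covg=10):
    """Apply the coverage, frequency, and strand-bias rules to one site.

    Returns None when the line would be omitted, else a dict of fields.
    """
    plus, minus = counts["+"], counts["-"]
    # Coverage threshold: each strand must meet it on its own.
    if sum(plus.values()) < covg or sum(minus.values()) < covg:
        return None
    total = plus + minus
    coverage = sum(total.values())
    if coverage == 0:
        return None  # assumption: nothing to report at an empty site

    def passing(strand_counts):
        # Assumption: the frequency threshold is applied per strand.
        n = sum(strand_counts.values())
        return {b for b, c in strand_counts.items()
                if n and 100.0 * c / n >= freq_pct}

    plus_pass, minus_pass = passing(plus), passing(minus)
    # Strand bias: the passing alleles must match on both strands.
    alleles = len(plus_pass) if plus_pass == minus_pass else 0

    # Major and minor alleles are strand-independent.
    ranked = total.most_common()
    major = ranked[0][0]
    minor, minor_count = ranked[1] if len(ranked) > 1 else (None, 0)
    return {
        "coverage": coverage,
        "alleles": alleles,
        "major": major,
        "minor": minor,
        "minor_freq": minor_count / coverage,
    }
```

With the sample entry from the help text and the default thresholds, the plus strand passes T and G while the minus strand passes only T, so under these assumptions the allele count is 0 while the major allele (T), the minor allele (G), and the minor allele frequency are still reported.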