Mercurial > repos > miller-lab > genome_diversity
comparison pca.xml @ 14:8ae67e9fb6ff
Uploaded Miller Lab Devshed version a51c894f5bed again [possible toolshed.g2 bug]
author | miller-lab |
---|---|
date | Fri, 28 Sep 2012 11:35:56 -0400 |
parents | |
children | 95a05c1ef5d5 |
13:fdb4240fb565 | 14:8ae67e9fb6ff |
---|---|
1 <tool id="gd_pca" name="PCA" version="1.0.0"> | |
2 <description>: Principal Component Analysis of genotype data</description> | |
3 | |
4 <command interpreter="python"> | |
5 pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path" | |
6 </command> | |
7 | |
8 <inputs> | |
9 <param name="input" type="data" format="gd_ped" label="Dataset" /> | |
10 </inputs> | |
11 | |
12 <outputs> | |
13 <data name="output" format="html" /> | |
14 </outputs> | |
15 | |
16 <!-- | |
17 <tests> | |
18 <test> | |
19 <param name="input" value="fake" ftype="gd_ped" > | |
20 <metadata name="base_name" value="admix" /> | |
21 <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" /> | |
22 <composite_data value="test_out/prepare_population_structure/admix.ped" /> | |
23 <composite_data value="test_out/prepare_population_structure/admix.map" /> | |
24 <edit_attributes type="name" value="fake" /> | |
25 </param> | |
26 | |
27 <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2"> | |
28 <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" /> | |
29 <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" /> | |
30 <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" /> | |
31 <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" /> | |
32 <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" /> | |
33 <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" /> | |
34 <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta = "1000" /> | |
35 </output> | |
36 | |
37 </test> | |
38 </tests> | |
39 --> | |
40 | |
41 <help> | |
42 | |
43 **Dataset formats** | |
44 | |
45 The input dataset is in gd_ped_ format. | |
46 The output dataset is html_ with links to a PDF of the plot and to | |
47 text files. (`Dataset missing?`_) | |
48 | |
49 .. _gd_ped: ./static/formatHelp.html#gd_ped | |
50 .. _html: ./static/formatHelp.html#html | |
51 .. _Dataset missing?: ./static/formatHelp.html | |
52 | |
53 ----- | |
54 | |
55 **What it does** | |
56 | |
57 The user selects a gd_ped dataset generated by the Prepare Input tool. | |
58 The PCA tool runs a | |
59 Principal Component Analysis on the input genotype data and constructs | |
60 a plot of the top two principal components. It also reports the | |
61 following estimates of the statistical significance of the analysis. | |
62 | |
63 1. Average divergence between each pair of populations. Specifically, | |
64 from the covariance matrix X whose eigenvectors were computed, we can | |
65 compute a "distance", d, for each pair of individuals (i,j): d(i,j) = | |
66 X(i,i) + X(j,j) - 2X(i,j). For each pair of populations (a,b) now | |
67 define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b) | |
68 / (\|pop a\| * \|pop b\|). We then normalize D so that the diagonal | |
69 has mean 1 and report it. | |
70 | |
71 2. Anova statistics for population differences along each | |
72 eigenvector. For each eigenvector, a P-value for statistical | |
73 significance of differences between each pair of populations along | |
74 that eigenvector is printed. +++ is used to highlight P-values less | |
75 than 1e-06. \*\*\* is used to highlight P-values between 1e-06 and | |
76 1e-03. If there are more than 2 populations, then an overall P-value | |
77 is also printed for that eigenvector, as are the populations with | |
78 minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is | |
79 only 1 population, no Anova statistics are printed.] | |
80 | |
81 3. Statistical significance of differences between populations. For | |
82 each pair of populations, the above Anova statistics are summed across | |
83 eigenvectors. The result is approximately chisq with d.o.f. equal to | |
84 the number of eigenvectors. The chisq statistic and its p-value are | |
85 printed. [If there is only 1 population, no statistics are printed.] | |
86 | |
87 We post-process the output of the PCA tool to estimate "admixture | |
88 fractions". For this, we take three populations at a time and | |
89 determine each one's average point in the PCA plot (by separately | |
90 averaging first and second coordinates). For each combination of two | |
91 center points, modeling two ancestral populations, we try to model the | |
92 third central point as having a certain fraction, r, of its SNP | |
93 genotypes from the second ancestral population and the remainder from | |
94 the first ancestral population, where we estimate r. The output file | |
95 "coordinates.txt" then contains pairs of lines like | |
96 | |
97 projection along chord Population1 -> Population2 | |
98 Population3: 0.12345 | |
99 | |
100 where the number (in this case 0.12345) is the estimate of r. | |
101 Computations with simulated data suggest that the true r is | |
102 systematically underestimated: the reported value is perhaps roughly 0.6 times r. | |
103 | |
104 ----- | |
105 | |
106 **Acknowledgments** | |
107 | |
108 We use the programs "smartpca" and "ploteig" downloaded from | |
109 | |
110 http://genepath.med.harvard.edu/~reich/Software.htm | |
111 | |
112 and described in the paper "Population structure and eigenanalysis" | |
113 by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics, 2 (2006), e190. | |
114 | |
115 </help> | |
116 </tool> |
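
The help text in the tool describes two computations that may be easier to follow in code: the normalized average divergence D(a,b) between populations, and the admixture fraction r reported in "coordinates.txt". The block below is a minimal Python sketch of both, not the smartpca code or the tool's own post-processing script. All function names are hypothetical. Two points are assumptions rather than facts from the source: the within-population average includes the i = j pairs (as the formula is literally written), and "projection along chord" is taken to mean the scalar projection of the third center onto the segment from the Population1 center to the Population2 center.

```python
def pair_distance(X, i, j):
    """d(i,j) = X(i,i) + X(j,j) - 2X(i,j) for covariance matrix X."""
    return X[i][i] + X[j][j] - 2 * X[i][j]


def population_divergence(X, labels):
    """Return {(a, b): D(a,b)}, normalized so the diagonal has mean 1.

    X is a square list-of-lists covariance matrix; labels[k] is the
    population of individual k. Assumption: every (i, j) pair is counted,
    including i == j within a population. If every population has a
    single individual the diagonal mean is 0 and normalization fails.
    """
    pops = sorted(set(labels))
    members = {p: [k for k, lab in enumerate(labels) if lab == p] for p in pops}
    D = {}
    for a in pops:
        for b in pops:
            total = sum(pair_distance(X, i, j)
                        for i in members[a] for j in members[b])
            D[(a, b)] = total / (len(members[a]) * len(members[b]))
    diag_mean = sum(D[(p, p)] for p in pops) / len(pops)
    return {key: value / diag_mean for key, value in D.items()}


def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def chord_fraction(c1, c2, c3):
    """Estimate r by projecting c3 onto the chord c1 -> c2.

    Assumed interpretation: r is 0 at the Population1 center and 1 at
    the Population2 center.
    """
    vx, vy = c2[0] - c1[0], c2[1] - c1[1]
    wx, wy = c3[0] - c1[0], c3[1] - c1[1]
    length_sq = vx * vx + vy * vy
    if length_sq == 0:
        raise ValueError("Population1 and Population2 centers coincide")
    return (wx * vx + wy * vy) / length_sq
```

For example, with centers (0, 0), (4, 0) and (1, 3), `chord_fraction` returns 0.25, meaning the third center lies a quarter of the way along the chord from the first center to the second. Per the help text, an estimate obtained this way would tend to understate the true r.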