annotate pca.xml @ 12:4b6590dd7250

Uploaded
author miller-lab
date Wed, 12 Sep 2012 17:10:26 -0400
rev   line source
12
4b6590dd7250 Uploaded
miller-lab
parents:
diff changeset
<tool id="gd_pca" name="PCA" version="1.0.0">

  <command interpreter="python">
    pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
  </command>

  <inputs>
    <param name="input" type="data" format="gd_ped" label="Dataset" />
  </inputs>

  <outputs>
    <data name="output" format="html" />
  </outputs>

  <!--
  <tests>
    <test>
      <param name="input" value="fake" ftype="gd_ped">
        <metadata name="base_name" value="admix" />
        <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
        <composite_data value="test_out/prepare_population_structure/admix.ped" />
        <composite_data value="test_out/prepare_population_structure/admix.map" />
        <edit_attributes type="name" value="fake" />
      </param>

      <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
        <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
        <extra_files type="file" name="admix.gd_indivs" value="test_out/pca/admix.gd_indivs" />
        <extra_files type="file" name="admix.gd_snp" value="test_out/pca/admix.gd_snp" />
        <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
        <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
        <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
        <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta="1000" />
      </output>

    </test>
  </tests>
  -->

  <help>
**What it does**

The user selects a dataset generated by the Galaxy tool to "prepare to look for population structure". The PCA tool runs a Principal Component Analysis on the input genotype data and constructs a plot of the top two principal components. It also reports the following estimates of the statistical significance of the analysis.

1. Average divergence between each pair of populations. Specifically, from the covariance matrix X whose eigenvectors were computed, we can compute a "distance", d, for each pair of individuals (i,j): d(i,j) = X(i,i) + X(j,j) - 2X(i,j). For each pair of populations (a,b), we then define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b) / (\|pop a\| * \|pop b\|). We then normalize D so that the diagonal has mean 1 and report it.

2. ANOVA statistics for population differences along each eigenvector. For each eigenvector, a P-value for the statistical significance of differences between each pair of populations along that eigenvector is printed. +++ is used to highlight P-values less than 1e-06. \*\*\* is used to highlight P-values between 1e-06 and 1e-03. If there are more than 2 populations, then an overall P-value is also printed for that eigenvector, as are the populations with the minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is only 1 population, no ANOVA statistics are printed.]

3. Statistical significance of differences between populations. For each pair of populations, the above ANOVA statistics are summed across eigenvectors. The result is approximately chi-square distributed, with degrees of freedom equal to the number of eigenvectors. The chi-square statistic and its P-value are printed. [If there is only 1 population, no statistics are printed.]
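
The divergence computation in item 1 can be sketched in a few lines of plain Python. This is only an illustration of the formulas as written above, not the code that smartpca runs: the function names, the nested-list covariance matrix, and the choice to include the i = j pairs in the within-population average are all assumptions made for the example.

```python
def pairwise_distances(cov):
    """d(i,j) = X(i,i) + X(j,j) - 2*X(i,j) for a square covariance matrix X."""
    n = len(cov)
    return [[cov[i][i] + cov[j][j] - 2.0 * cov[i][j] for j in range(n)]
            for i in range(n)]


def population_divergence(cov, labels):
    """Average d(i,j) over i in pop a and j in pop b, then rescale so that
    the diagonal entries D(a,a) have mean 1.

    Assumption: pairs with i == j are kept in the within-population average,
    which is the literal reading of the formula in the help text.
    """
    d = pairwise_distances(cov)
    pops = sorted(set(labels))
    members = {p: [k for k, lab in enumerate(labels) if lab == p] for p in pops}
    avg = {}
    for a in pops:
        for b in pops:
            total = sum(d[i][j] for i in members[a] for j in members[b])
            avg[(a, b)] = total / (len(members[a]) * len(members[b]))
    diag_mean = sum(avg[(p, p)] for p in pops) / len(pops)
    return {pair: value / diag_mean for pair, value in avg.items()}


# Hypothetical toy input: four individuals, two per population.
toy_cov = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
toy_labels = ["A", "A", "B", "B"]
divergence = population_divergence(toy_cov, toy_labels)
```

In this toy input the two populations are uncorrelated with each other, so the between-population entry comes out larger than the within-population entries, which are both 1 after normalization.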

We post-process the output of the PCA tool to estimate "admixture fractions". For this, we take three populations at a time and determine each one's average point in the PCA plot (by separately averaging the first and second coordinates). For each combination of two center points, modeling two ancestral populations, we try to model the third center point as having a certain fraction, r, of its SNP genotypes from the second ancestral population and the remainder from the first ancestral population, where r is the quantity we estimate. The output file "coordinates.txt" then contains pairs of lines like

    projection along chord Population1 -> Population2
    Population3: 0.12345

where the number (in this case 0.12345) is the estimate of r. Computations with simulated data suggest that the true r is systematically underestimated: the reported value is perhaps roughly 0.6 times the true r.
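
One way to read "projection along chord" is as an orthogonal projection of the third center point onto the line through the first two. The sketch below follows that reading. It is an assumption about the geometry rather than a copy of the post-processing code, and the function names and sample coordinates are made up for the example.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def admixture_fraction(pop1, pop2, pop3):
    """Project the centroid of pop3 onto the chord pop1 -> pop2.

    Returns r, where 0 places pop3 at the centroid of pop1 and 1 places it
    at the centroid of pop2. Assumption: r is the scalar projection divided
    by the chord length, with no correction for the underestimation
    mentioned in the help text.
    """
    c1, c2, c3 = centroid(pop1), centroid(pop2), centroid(pop3)
    chord = (c2[0] - c1[0], c2[1] - c1[1])
    offset = (c3[0] - c1[0], c3[1] - c1[1])
    length_sq = chord[0] ** 2 + chord[1] ** 2
    if length_sq == 0:
        raise ValueError("the two ancestral centroids coincide")
    return (offset[0] * chord[0] + offset[1] * chord[1]) / length_sq


# Hypothetical PCA coordinates (first and second component) per individual.
population1 = [(0, 0), (0, 2)]
population2 = [(4, 0), (4, 2)]
population3 = [(1, 3), (1, 5)]
r = admixture_fraction(population1, population2, population3)
```

Here the centroids are (0, 1), (4, 1) and (1, 4), so the third population projects a quarter of the way along the chord and r is 0.25, regardless of how far it sits from the chord itself.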

**Acknowledgments**

We use the programs "smartpca" and "ploteig" downloaded from

http://genepath.med.harvard.edu/~reich/Software.htm

and described in the paper "Population structure and eigenanalysis" by Nick Patterson, Alkes L. Price, and David Reich, PLoS Genetics 2 (2006), e190.
  </help>
</tool>