genome_diversity: pca.xml annotate

annotate pca.xml @ 7:e29f4d801bb0

change wsf -> snp; wpf -> sap

author	Richard Burhans <burhans@bx.psu.edu>
date	Wed, 18 Apr 2012 11:12:21 -0400
parents	7a94f11fe71f
children	9b92372de9f6

rev	line source
0 2c498d40ecde Uploaded miller-lab parents: diff changeset	1 <tool id="gd_pca" name="PCA" version="1.0.0">
2c498d40ecde Uploaded miller-lab parents: diff changeset	2
2c498d40ecde Uploaded miller-lab parents: diff changeset	3 <command interpreter="python">
4 7a94f11fe71f change output.extra_files_path to output.files_path Richard Burhans <burhans@bx.psu.edu> parents: 0 diff changeset	4 pca.py "$input" "$input.extra_files_path" "$output" "$output.files_path"
0 2c498d40ecde Uploaded miller-lab parents: diff changeset	5 </command>
2c498d40ecde Uploaded miller-lab parents: diff changeset	6
2c498d40ecde Uploaded miller-lab parents: diff changeset	7 <inputs>
2c498d40ecde Uploaded miller-lab parents: diff changeset	8 <param name="input" type="data" format="wped" label="Dataset" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	9 </inputs>
2c498d40ecde Uploaded miller-lab parents: diff changeset	10
2c498d40ecde Uploaded miller-lab parents: diff changeset	11 <outputs>
2c498d40ecde Uploaded miller-lab parents: diff changeset	12 <data name="output" format="html" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	13 </outputs>
2c498d40ecde Uploaded miller-lab parents: diff changeset	14
2c498d40ecde Uploaded miller-lab parents: diff changeset	15 <!--
2c498d40ecde Uploaded miller-lab parents: diff changeset	16 <tests>
2c498d40ecde Uploaded miller-lab parents: diff changeset	17 <test>
2c498d40ecde Uploaded miller-lab parents: diff changeset	18 <param name="input" value="fake" ftype="wped" >
2c498d40ecde Uploaded miller-lab parents: diff changeset	19 <metadata name="base_name" value="admix" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	20 <composite_data value="test_out/prepare_population_structure/prepare_population_structure.html" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	21 <composite_data value="test_out/prepare_population_structure/admix.ped" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	22 <composite_data value="test_out/prepare_population_structure/admix.map" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	23 <edit_attributes type="name" value="fake" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	24 </param>
2c498d40ecde Uploaded miller-lab parents: diff changeset	25
2c498d40ecde Uploaded miller-lab parents: diff changeset	26 <output name="output" file="test_out/pca/pca.html" ftype="html" compare="diff" lines_diff="2">
2c498d40ecde Uploaded miller-lab parents: diff changeset	27 <extra_files type="file" name="admix.geno" value="test_out/pca/admix.geno" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	28 <extra_files type="file" name="admix.ind" value="test_out/pca/admix.ind" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	29 <extra_files type="file" name="admix.snp" value="test_out/pca/admix.snp" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	30 <extra_files type="file" name="coordinates.txt" value="test_out/pca/coordinates.txt" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	31 <extra_files type="file" name="explained.txt" value="test_out/pca/explained.txt" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	32 <extra_files type="file" name="par.admix" value="test_out/pca/par.admix" compare="diff" lines_diff="10" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	33 <extra_files type="file" name="PCA.pdf" value="test_out/pca/PCA.pdf" compare="sim_size" delta="1000" />
2c498d40ecde Uploaded miller-lab parents: diff changeset	34 </output>
2c498d40ecde Uploaded miller-lab parents: diff changeset	35
2c498d40ecde Uploaded miller-lab parents: diff changeset	36 </test>
2c498d40ecde Uploaded miller-lab parents: diff changeset	37 </tests>
2c498d40ecde Uploaded miller-lab parents: diff changeset	38 -->
2c498d40ecde Uploaded miller-lab parents: diff changeset	39
2c498d40ecde Uploaded miller-lab parents: diff changeset	40 <help>
2c498d40ecde Uploaded miller-lab parents: diff changeset	41 What it does
2c498d40ecde Uploaded miller-lab parents: diff changeset	42
2c498d40ecde Uploaded miller-lab parents: diff changeset	43 The user selects a dataset generated by the Galaxy tool to "prepare to look for population structure". The PCA tool runs a Principal Component Analysis on the input genotype data and plots the top two principal components. It also reports the following estimates of the statistical significance of the analysis.
2c498d40ecde Uploaded miller-lab parents: diff changeset	44
2c498d40ecde Uploaded miller-lab parents: diff changeset	45 1. Average divergence between each pair of populations. Specifically, from the covariance matrix X whose eigenvectors were computed, we can compute a "distance", d, for each pair of individuals (i,j): d(i,j) = X(i,i) + X(j,j) - 2X(i,j). For each pair of populations (a,b) we then define an average distance: D(a,b) = \sum d(i,j) (i in pop a, j in pop b) / (\\|pop a\\| * \\|pop b\\|). We then normalize D so that the diagonal has mean 1 and report it.
2c498d40ecde Uploaded miller-lab parents: diff changeset	46
2c498d40ecde Uploaded miller-lab parents: diff changeset	47 2. Anova statistics for population differences along each eigenvector. For each eigenvector, a P-value for statistical significance of differences between each pair of populations along that eigenvector is printed. +++ is used to highlight P-values less than 1e-06. \\\* is used to highlight P-values between 1e-06 and 1e-03. If there are more than 2 populations, then an overall P-value is also printed for that eigenvector, as are the populations with minimum (minv) and maximum (maxv) eigenvector coordinate. [If there is only 1 population, no Anova statistics are printed.]
2c498d40ecde Uploaded miller-lab parents: diff changeset	48
2c498d40ecde Uploaded miller-lab parents: diff changeset	49 3. Statistical significance of differences between populations. For each pair of populations, the above Anova statistics are summed across eigenvectors. The result is approximately chisq with d.o.f. equal to the number of eigenvectors. The chisq statistic and its p-value are printed. [If there is only 1 population, no statistics are printed.]
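
The distance computation in item 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that smartpca runs: the function name average_population_distances is hypothetical, X is assumed to be a square list of lists, labels is assumed to give one population name per individual, and the within-population averages are assumed to include the zero self-distances d(i,i).

```python
def average_population_distances(X, labels):
    """Return a dict mapping (pop_a, pop_b) to the normalized average distance.

    X      : square matrix (list of lists), the covariance matrix between individuals
    labels : population name of each individual, in the same order as X
    """
    n = len(X)
    pops = sorted(set(labels))
    members = {p: [i for i in range(n) if labels[i] == p] for p in pops}

    def d(i, j):
        # d(i,j) = X(i,i) + X(j,j) - 2X(i,j)
        return X[i][i] + X[j][j] - 2.0 * X[i][j]

    D = {}
    for a in pops:
        for b in pops:
            total = sum(d(i, j) for i in members[a] for j in members[b])
            # Average over all |pop a| * |pop b| pairs.
            # Assumption: self-pairs (i == j) are counted when a == b.
            D[(a, b)] = total / (len(members[a]) * len(members[b]))

    # Normalize so that the diagonal entries D(a,a) have mean 1.
    # This fails with a division by zero if every population has one member.
    diag_mean = sum(D[(p, p)] for p in pops) / len(pops)
    return {pair: value / diag_mean for pair, value in D.items()}
```

When X is built as the matrix of inner products of coordinate vectors, d(i,j) is the squared Euclidean distance between individuals i and j, which is why it is reasonable to read d as a "distance".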
2c498d40ecde Uploaded miller-lab parents: diff changeset	50
2c498d40ecde Uploaded miller-lab parents: diff changeset	51 We post-process the output of the PCA tool to estimate "admixture fractions". For this, we take three populations at a time and determine each one's average point in the PCA plot (by separately averaging first and second coordinates). For each combination of two center points, modeling two ancestral populations, we try to model the third central point as having a certain fraction, r, of its SNP genotypes from the second ancestral population and the remainder from the first ancestral population, where we estimate r. The output file "coordinates.txt" then contains pairs of lines like
2c498d40ecde Uploaded miller-lab parents: diff changeset	52
2c498d40ecde Uploaded miller-lab parents: diff changeset	53 projection along chord Population1 -> Population2
2c498d40ecde Uploaded miller-lab parents: diff changeset	54 Population3: 0.12345
2c498d40ecde Uploaded miller-lab parents: diff changeset	55
2c498d40ecde Uploaded miller-lab parents: diff changeset	56 where the number (in this case 0.12345) is the estimate of r. Computations with simulated data suggest that the true r is systematically underestimated, with the reported value being perhaps roughly 0.6 times r.
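
One plausible reading of "projection along chord" is an orthogonal projection of the third population's center onto the line through the other two centers. The Python sketch below follows that reading. It is an assumption rather than the tool's actual post-processing code, and the function names centroid and chord_projection are hypothetical.

```python
def centroid(points):
    """Average the first and second PCA coordinates separately."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def chord_projection(c1, c2, c3):
    """Fraction r of the way from c1 to c2 at which c3 projects onto the chord.

    c1, c2 : centers of the two populations modeled as ancestral
    c3     : center of the population modeled as admixed
    r = 0 places c3 at c1, and r = 1 places it at c2.
    """
    vx, vy = c2[0] - c1[0], c2[1] - c1[1]   # chord from Population1 to Population2
    wx, wy = c3[0] - c1[0], c3[1] - c1[1]   # offset of Population3 from Population1
    return (wx * vx + wy * vy) / (vx * vx + vy * vy)


# Made-up coordinates, for illustration only.
pop1 = [(0.0, 0.0), (0.0, 0.0)]
pop2 = [(3.0, 0.0), (5.0, 0.0)]
pop3 = [(1.0, 2.0), (1.0, 4.0)]
r = chord_projection(centroid(pop1), centroid(pop2), centroid(pop3))
print("Population3: %.5f" % r)
```

With these made-up coordinates the centers are (0, 0), (4, 0) and (1, 3), so the sketch prints "Population3: 0.25000". Note that r computed this way ignores the distance of the third center from the chord.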
2c498d40ecde Uploaded miller-lab parents: diff changeset	57
2c498d40ecde Uploaded miller-lab parents: diff changeset	58 Acknowledgments
2c498d40ecde Uploaded miller-lab parents: diff changeset	59
2c498d40ecde Uploaded miller-lab parents: diff changeset	60 We use the programs "smartpca" and "ploteig" downloaded from
2c498d40ecde Uploaded miller-lab parents: diff changeset	61
2c498d40ecde Uploaded miller-lab parents: diff changeset	62 http://genepath.med.harvard.edu/~reich/Software.htm
2c498d40ecde Uploaded miller-lab parents: diff changeset	63
2c498d40ecde Uploaded miller-lab parents: diff changeset	64 and described in the paper "Population structure and eigenanalysis" by Nick Patterson, Alkes L. Price and David Reich, PLoS Genetics, 2 (2006), e190.
2c498d40ecde Uploaded miller-lab parents: diff changeset	65 </help>
2c498d40ecde Uploaded miller-lab parents: diff changeset	66 </tool>
