Mercurial > repos > xuebing > sharplabtool
comparison tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3
Uploaded
author | xuebing |
---|---|
date | Fri, 09 Mar 2012 19:37:19 -0500 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:9071e359b9a3 |
---|---|
1 """ | |
2 run smartpca | |
3 | |
4 This uses galaxy code developed by Dan to deal with | |
5 arbitrary output files using an html dataset with it's own | |
6 subdirectory containing the arbitrary files | |
7 We create that html file and add all the links we need | |
8 | |
9 Note that we execute the smartpca.perl program in the output subdirectory | |
10 to avoid having to clear out the job directory after running | |
11 | |
12 Code to convert linkage format ped files into eigenstratgeno format is left here | |
13 in case we decide to autoconvert | |
14 | |
15 Added a plot in R with better labels than the default eigensoft plot december 26 2007 | |
16 | |
17 DOCUMENTATION OF smartpca program: | |
18 | |
19 smartpca runs Principal Components Analysis on input genotype data and | |
20 outputs principal components (eigenvectors) and eigenvalues. | |
21 The method assumes that samples are unrelated. (However, a small number | |
22 of cryptically related individuals is usually not a problem in practice | |
23 as they will typically be discarded as outliers.) | |
24 | |
25 5 different input formats are supported. See ../CONVERTF/README | |
26 for documentation on using the convertf program to convert between formats. | |
27 | |
28 The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate | |
29 how parfile works via a toy example (see example.perl in this directory). | |
30 This example takes input in EIGENSTRAT format. The syntax of how to take input | |
31 in other formats is analogous to the convertf program, see ../CONVERTF/README. | |
32 | |
33 The smartpca program prints various statistics to standard output. | |
34 To redirect this information to a file, change the above syntax to | |
35 "../bin/smartpca -p parfile >logfile". For a description of these | |
36 statistics, see the documentation file smartpca.info in this directory. | |
37 | |
38 Estimated running time of the smartpca program is | |
39 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers. | |
40 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations. | |
41 Thus, under the default of up to 5 outlier removal iterations, running time is | |
42 up to 1.5e-11 * nSNP * NSAMPLES^2 hours. | |
43 | |
44 ------------------------------------------------------------------------ | |
45 | |
46 DESCRIPTION OF EACH PARAMETER in parfile for smartpca: | |
47 | |
48 genotypename: input genotype file (in any format: see ../CONVERTF/README) | |
49 snpname: input snp file (in any format: see ../CONVERTF/README) | |
50 indivname: input indiv file (in any format: see ../CONVERTF/README) | |
51 evecoutname: output file of eigenvectors. See numoutevec parameter below. | |
52 evaloutname: output file of all eigenvalues | |
53 | |
54 OPTIONAL PARAMETERS: | |
55 | |
56 numoutevec: number of eigenvectors to output. Default is 10. | |
57 numoutlieriter: maximum number of outlier removal iterations. | |
58 Default is 5. To turn off outlier removal, set this parameter to 0. | |
59 numoutlierevec: number of principal components along which to | |
60 remove outliers during each outlier removal iteration. Default is 10. | |
61 outliersigmathresh: number of standard deviations which an individual must | |
62 exceed, along one of the top (numoutlierevec) principal components, in | |
63 order for that individual to be removed as an outlier. Default is 6.0. | |
64 outlieroutname: output logfile of outlier individuals removed. If not specified, | |
65 smartpca will print this information to stdout, which is the default. | |
66 usenorm: Whether to normalize each SNP by a quantity related to allele freq. | |
67 Default is YES. (When analyzing microsatellite data, should be set to NO. | |
68 See Patterson et al. 2006.) | |
69 altnormstyle: Affects very subtle details in normalization formula. | |
70 Default is YES (normalization formulas of Patterson et al. 2006) | |
71 To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO. | |
72 missingmode: If set to YES, then instead of doing PCA on # reference alleles, | |
73 do PCA on whether each data point is missing or nonmissing. Default is NO. | |
74 May be useful for detecting informative missingness (Clayton et al. 2005). | |
75 nsnpldregress: If set to a positive integer, then LD correction is turned on, | |
76 and input to PCA will be the residual of a regression involving that many | |
77 previous SNPs, according to physical location. See Patterson et al. 2006. | |
78 Default is 0 (no LD correction). If desiring LD correction, we recommend 2. | |
79 maxdistldregress: If doing LD correction, this is the maximum genetic distance | |
80 (in Morgans) for previous SNPs used in LD correction. Default is no maximum. | |
81 poplistname: If wishing to infer eigenvectors using only individuals from a | |
82 subset of populations, and then project individuals from all populations | |
83 onto those eigenvectors, this input file contains a list of population names, | |
84 one population name per line, which will be used to infer eigenvectors. | |
85 It is assumed that the population of each individual is specified in the | |
86 indiv file. Default is to use individuals from all populations. | |
87 phylipoutname: output file containing an fst matrix which can be used as input | |
88 to programs in the PHYLIP package, such as the "fitch" program for | |
89 constructing phylogenetic trees. | |
90 noxdata: if set to YES, all SNPs on X chr are excluded from the data set. | |
91 The smartpca default for this parameter is YES, since different variances | |
92 for males vs. females on X chr may confound PCA analysis. | |
93 nomalexhet: if set to YES, any het genotypes on X chr for males are changed | |
94 to missing data. The smartpca default for this parameter is YES. | |
95 badsnpname: specifies a list of SNPs which should be excluded from the data set. | |
96 Same format as example.snp. Cannot be used if input is in | |
97 PACKEDPED or PACKEDANCESTRYMAP format. | |
98 popsizelimit: If set to a positive integer, the result is that only the first | |
99 popsizelimit individuals from each population will be included in the | |
100 analysis. It is assumed that the population of each individual is specified | |
101 in the indiv file. Default is to use all individuals in the analysis. | |
102 | |
103 The next 5 optional parameters allow the user to output genotype, snp and | |
104 indiv files which will be identical to the input files except that: | |
105 Any individuals set to Ignore in the input indiv file will be | |
106 removed from the data set (see ../CONVERTF/README) | |
107 Any data excluded or set to missing based on noxdata, nomalexhet and | |
108 badsnpname parameters (see above) will be removed from the data set. | |
109 The user may decide to output these files in any format. | |
110 outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP | |
111 genotypeoutname: output genotype file | |
112 snpoutname: output snp file | |
113 indivoutname: output indiv file | |
114 outputgroup: see documentation in ../CONVERTF/README | |
115 """ | |
116 import sys,os,time,subprocess,string,glob | |
117 from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke | |
118 verbose = False | |
119 | |
120 def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''): | |
121 """ | |
122 the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly | |
123 the subject class | |
124 Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers | |
125 somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings | |
126 At least you have the data and the analysis in one single place. Highly reproducible little | |
127 piece of research. | |
128 """ | |
129 debug=False | |
130 f = file(eigpca,'r') | |
131 R = [] | |
132 if debug: | |
133 R.append('sessionInfo()') | |
134 R.append("print('dir()=:')") | |
135 R.append('dir()') | |
136 R.append("print('pdfname=%s')" % pdfname) | |
137 gvec = [] | |
138 pca1 = [] | |
139 pca2 = [] | |
140 groups = {} | |
141 glist = [] # list for legend | |
142 ngroup = 1 # increment for each new group encountered for pch vector | |
143 for n,row in enumerate(f): | |
144 if n > 1: | |
145 rowlist = row.strip().split() | |
146 group = rowlist[-1] | |
147 v1 = rowlist[1] | |
148 v2 = rowlist[2] | |
149 try: | |
150 v1 = float(v1) | |
151 except: | |
152 v1 = 0.0 | |
153 try: | |
154 v2 = float(v2) | |
155 except: | |
156 v2 = 0.0 | |
157 if not groups.get(group,None): | |
158 groups[group] = ngroup | |
159 glist.append(group) | |
160 ngroup += 1 # for next group | |
161 gvec.append(groups[group]) # lookup group number | |
162 pca1.append('%f' % v1) | |
163 pca2.append('%f' % v2) | |
164 # now have vectors of group,pca1 and pca2 | |
165 llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh | |
166 llist = ['"%s"' % x for x in llist] # need to quote for R | |
167 R.append('llist=c(%s)' % ','.join(llist)) | |
168 | |
169 plist = range(2,len(llist)+2) # pch - avoid black circles | |
170 R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist])) | |
171 pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point | |
172 R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5 | |
173 R.append("par(mai=c(1,1,1,0.5))") | |
174 maint = title | |
175 R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w)) | |
176 R.append("par(lab=c(10,10,10))") | |
177 R.append('pca1 = c(%s)' % ','.join(pca1)) | |
178 R.append('pca2 = c(%s)' % ','.join(pca2)) | |
179 R.append('pgvec = c(%s)' % ','.join(pgvec)) | |
180 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint | |
181 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" | |
182 R.append(s) | |
183 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') | |
184 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') | |
185 R.append('dev.off()') | |
186 R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w)) | |
187 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint | |
188 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" | |
189 R.append(s) | |
190 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') | |
191 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') | |
192 R.append('dev.off()') | |
193 rlog,flist = RRun(rcmd=R,title=title,outdir=nfp) | |
194 print >> sys.stdout, '\n'.join(R) | |
195 print >> sys.stdout, rlog | |
196 | |
197 | |
198 def getfSize(fpath,outpath): | |
199 """ | |
200 format a nice file size string | |
201 """ | |
202 size = '' | |
203 fp = os.path.join(outpath,fpath) | |
204 if os.path.isfile(fp): | |
205 n = float(os.path.getsize(fp)) | |
206 if n > 2**20: | |
207 size = ' (%1.1f MB)' % (n/2**20) | |
208 elif n > 2**10: | |
209 size = ' (%1.1f KB)' % (n/2**10) | |
210 elif n > 0: | |
211 size = ' (%d B)' % (int(n)) | |
212 return size | |
213 | |
214 | |
215 def runEigen(): | |
216 """ run the smartpca prog - documentation follows | |
217 | |
218 smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l | |
219 fakeped_100.eigenlog -o fakeped_100.eigenout | |
220 | |
221 DOCUMENTATION OF smartpca.perl program: | |
222 | |
223 This program calls the smartpca program (see ../POPGEN/README). | |
224 For this to work, the bin directory containing smartpca MUST be in your path. | |
225 See ./example.perl for a toy example. | |
226 | |
227 ../bin/smartpca.perl | |
228 -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README) | |
229 -a example.snp : snp file (see ../CONVERTF/README) | |
230 -b example.ind : indiv file (see ../CONVERTF/README) | |
231 -k k : (Default is 10) number of principal components to output | |
232 -o example.pca : output file of principal components. Individuals removed | |
233 as outliers will have all values set to 0.0 in this file. | |
234 -p example.plot : prefix of output plot files of top 2 principal components. | |
235 (labeling individuals according to labels in indiv file) | |
236 -e example.eval : output file of all eigenvalues | |
237 -l example.log : output logfile | |
238 -m maxiter : (Default is 5) maximum number of outlier removal iterations. | |
239 To turn off outlier removal, set -m 0. | |
240 -t topk : (Default is 10) number of principal components along which | |
241 to remove outliers during each outlier removal iteration. | |
242 -s sigma : (Default is 6.0) number of standard deviations which an | |
243 individual must exceed, along one of topk top principal | |
244 components, in order to be removed as an outlier. | |
245 | |
246 now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832 | |
247 | |
248 All files can be viewed however, by making links in the primary (HTML) history item like: | |
249 <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/> | |
250 <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a> | |
251 | |
252 <command interpreter="python"> | |
253 rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1" | |
254 "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca" | |
255 </command> | |
256 | |
257 """ | |
258 if len(sys.argv) < 9: | |
259 print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,' | |
260 print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft' | |
261 sys.exit(1) | |
262 else: | |
263 print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv)) | |
264 skillme = ' %s' % string.punctuation | |
265 trantab = string.maketrans(skillme,'_'*len(skillme)) | |
266 ofname = sys.argv[5] | |
267 progname = os.path.basename(sys.argv[0]) | |
268 infile = sys.argv[1] | |
269 infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset | |
270 title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title | |
271 outfile1 = sys.argv[3] | |
272 newfilepath = sys.argv[4] | |
273 try: | |
274 os.mkdirs(newfilepath) | |
275 except: | |
276 pass | |
277 op = os.path.split(outfile1)[0] | |
278 try: # for test - needs this done | |
279 os.makedirs(op) | |
280 except: | |
281 pass | |
282 eigen_k = sys.argv[5] | |
283 eigen_m = sys.argv[6] | |
284 eigen_t = sys.argv[7] | |
285 eigen_s = sys.argv[8] | |
286 eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment | |
287 eigentitle = os.path.join(newfilepath,title) | |
288 explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues', | |
289 'Smartpca log (contents shown below)'] | |
290 rplotname = 'PCAPlot.pdf' | |
291 eigenexts = [rplotname, "pca.xls", "eval.xls"] | |
292 newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat | |
293 rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots | |
294 eigenouts = [x for x in newfiles] | |
295 eigenlogf = '%s_log.txt' % title | |
296 newfiles.append(eigenlogf) # so it will also appear in the links | |
297 lfname = outfile1 | |
298 lf = file(lfname,'w') | |
299 lf.write(galhtmlprefix % progname) | |
300 try: | |
301 os.makedirs(newfilepath) | |
302 except: | |
303 pass | |
304 smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \ | |
305 (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \ | |
306 eigen_k, eigen_m, eigen_t, eigen_s) | |
307 env = os.environ | |
308 p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath) | |
309 retval = p.wait() | |
310 # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory | |
311 elog = file(os.path.join(newfilepath,eigenlogf),'r').read() | |
312 eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting | |
313 try: | |
314 eigpcaRes = file(eeigen,'r').read() | |
315 except: | |
316 eigpcaRes = '' | |
317 file(eigpca,'w').write(eigpcaRes) | |
318 makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe) | |
319 s = 'Output from %s run at %s<br/>\n' % (progname,timenow()) | |
320 lf.write('<h4>%s</h4>\n' % s) | |
321 lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe)) | |
322 lf.write('(click on the image below to see a much higher quality PDF version)') | |
323 thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares? | |
324 if os.path.exists(os.path.join(newfilepath,thumbnail)): | |
325 lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n') | |
326 lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \ | |
327 % (newfiles[0],thumbnail,explanations[0])) | |
328 allfiles = os.listdir(newfilepath) | |
329 allfiles.sort() | |
330 sizes = [getfSize(x,newfilepath) for x in allfiles] | |
331 lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list | |
332 lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles)) | |
333 lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf) | |
334 lf.write('<pre>%s</pre></div>' % elog) # the eigenlog | |
335 s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL) | |
336 lf.write(s) | |
337 lf.write(galhtmlpostfix) # end galhtmlprefix div | |
338 lf.close() | |
339 | |
340 | |
341 if __name__ == "__main__": | |
342 runEigen() |