comparison tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3

Uploaded
author xuebing
date Fri, 09 Mar 2012 19:37:19 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:9071e359b9a3
1 """
2 run smartpca
3
4 This uses galaxy code developed by Dan to deal with
5 arbitrary output files using an html dataset with it's own
6 subdirectory containing the arbitrary files
7 We create that html file and add all the links we need
8
9 Note that we execute the smartpca.perl program in the output subdirectory
10 to avoid having to clear out the job directory after running
11
12 Code to convert linkage format ped files into eigenstratgeno format is left here
13 in case we decide to autoconvert
14
15 Added a plot in R with better labels than the default eigensoft plot december 26 2007
16
17 DOCUMENTATION OF smartpca program:
18
19 smartpca runs Principal Components Analysis on input genotype data and
20 outputs principal components (eigenvectors) and eigenvalues.
21 The method assumes that samples are unrelated. (However, a small number
22 of cryptically related individuals is usually not a problem in practice
23 as they will typically be discarded as outliers.)
24
25 5 different input formats are supported. See ../CONVERTF/README
26 for documentation on using the convertf program to convert between formats.
27
28 The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate
29 how parfile works via a toy example (see example.perl in this directory).
30 This example takes input in EIGENSTRAT format. The syntax of how to take input
31 in other formats is analogous to the convertf program, see ../CONVERTF/README.
32
33 The smartpca program prints various statistics to standard output.
34 To redirect this information to a file, change the above syntax to
35 "../bin/smartpca -p parfile >logfile". For a description of these
36 statistics, see the documentation file smartpca.info in this directory.
37
38 Estimated running time of the smartpca program is
39 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
40 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.
41 Thus, under the default of up to 5 outlier removal iterations, running time is
42 up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
43
44 ------------------------------------------------------------------------
45
46 DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
47
48 genotypename: input genotype file (in any format: see ../CONVERTF/README)
49 snpname: input snp file (in any format: see ../CONVERTF/README)
50 indivname: input indiv file (in any format: see ../CONVERTF/README)
51 evecoutname: output file of eigenvectors. See numoutevec parameter below.
52 evaloutname: output file of all eigenvalues
53
54 OPTIONAL PARAMETERS:
55
56 numoutevec: number of eigenvectors to output. Default is 10.
57 numoutlieriter: maximum number of outlier removal iterations.
58 Default is 5. To turn off outlier removal, set this parameter to 0.
59 numoutlierevec: number of principal components along which to
60 remove outliers during each outlier removal iteration. Default is 10.
61 outliersigmathresh: number of standard deviations which an individual must
62 exceed, along one of the top (numoutlierevec) principal components, in
63 order for that individual to be removed as an outlier. Default is 6.0.
64 outlieroutname: output logfile of outlier individuals removed. If not specified,
65 smartpca will print this information to stdout, which is the default.
66 usenorm: Whether to normalize each SNP by a quantity related to allele freq.
67 Default is YES. (When analyzing microsatellite data, should be set to NO.
68 See Patterson et al. 2006.)
69 altnormstyle: Affects very subtle details in normalization formula.
70 Default is YES (normalization formulas of Patterson et al. 2006)
71 To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
72 missingmode: If set to YES, then instead of doing PCA on # reference alleles,
73 do PCA on whether each data point is missing or nonmissing. Default is NO.
74 May be useful for detecting informative missingness (Clayton et al. 2005).
75 nsnpldregress: If set to a positive integer, then LD correction is turned on,
76 and input to PCA will be the residual of a regression involving that many
77 previous SNPs, according to physical location. See Patterson et al. 2006.
78 Default is 0 (no LD correction). If desiring LD correction, we recommend 2.
79 maxdistldregress: If doing LD correction, this is the maximum genetic distance
80 (in Morgans) for previous SNPs used in LD correction. Default is no maximum.
81 poplistname: If wishing to infer eigenvectors using only individuals from a
82 subset of populations, and then project individuals from all populations
83 onto those eigenvectors, this input file contains a list of population names,
84 one population name per line, which will be used to infer eigenvectors.
85 It is assumed that the population of each individual is specified in the
86 indiv file. Default is to use individuals from all populations.
87 phylipoutname: output file containing an fst matrix which can be used as input
88 to programs in the PHYLIP package, such as the "fitch" program for
89 constructing phylogenetic trees.
90 noxdata: if set to YES, all SNPs on X chr are excluded from the data set.
91 The smartpca default for this parameter is YES, since different variances
92 for males vs. females on X chr may confound PCA analysis.
93 nomalexhet: if set to YES, any het genotypes on X chr for males are changed
94 to missing data. The smartpca default for this parameter is YES.
95 badsnpname: specifies a list of SNPs which should be excluded from the data set.
96 Same format as example.snp. Cannot be used if input is in
97 PACKEDPED or PACKEDANCESTRYMAP format.
98 popsizelimit: If set to a positive integer, the result is that only the first
99 popsizelimit individuals from each population will be included in the
100 analysis. It is assumed that the population of each individual is specified
101 in the indiv file. Default is to use all individuals in the analysis.
102
103 The next 5 optional parameters allow the user to output genotype, snp and
104 indiv files which will be identical to the input files except that:
105 Any individuals set to Ignore in the input indiv file will be
106 removed from the data set (see ../CONVERTF/README)
107 Any data excluded or set to missing based on noxdata, nomalexhet and
108 badsnpname parameters (see above) will be removed from the data set.
109 The user may decide to output these files in any format.
110 outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
111 genotypeoutname: output genotype file
112 snpoutname: output snp file
113 indivoutname: output indiv file
114 outputgroup: see documentation in ../CONVERTF/README
115 """
116 import sys,os,time,subprocess,string,glob
117 from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
118 verbose = False
119
120 def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
121 """
122 the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
123 the subject class
124 Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
125 somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
126 At least you have the data and the analysis in one single place. Highly reproducible little
127 piece of research.
128 """
129 debug=False
130 f = file(eigpca,'r')
131 R = []
132 if debug:
133 R.append('sessionInfo()')
134 R.append("print('dir()=:')")
135 R.append('dir()')
136 R.append("print('pdfname=%s')" % pdfname)
137 gvec = []
138 pca1 = []
139 pca2 = []
140 groups = {}
141 glist = [] # list for legend
142 ngroup = 1 # increment for each new group encountered for pch vector
143 for n,row in enumerate(f):
144 if n > 1:
145 rowlist = row.strip().split()
146 group = rowlist[-1]
147 v1 = rowlist[1]
148 v2 = rowlist[2]
149 try:
150 v1 = float(v1)
151 except:
152 v1 = 0.0
153 try:
154 v2 = float(v2)
155 except:
156 v2 = 0.0
157 if not groups.get(group,None):
158 groups[group] = ngroup
159 glist.append(group)
160 ngroup += 1 # for next group
161 gvec.append(groups[group]) # lookup group number
162 pca1.append('%f' % v1)
163 pca2.append('%f' % v2)
164 # now have vectors of group,pca1 and pca2
165 llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
166 llist = ['"%s"' % x for x in llist] # need to quote for R
167 R.append('llist=c(%s)' % ','.join(llist))
168
169 plist = range(2,len(llist)+2) # pch - avoid black circles
170 R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
171 pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
172 R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
173 R.append("par(mai=c(1,1,1,0.5))")
174 maint = title
175 R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
176 R.append("par(lab=c(10,10,10))")
177 R.append('pca1 = c(%s)' % ','.join(pca1))
178 R.append('pca2 = c(%s)' % ','.join(pca2))
179 R.append('pgvec = c(%s)' % ','.join(pgvec))
180 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
181 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
182 R.append(s)
183 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
184 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
185 R.append('dev.off()')
186 R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
187 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
188 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
189 R.append(s)
190 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
191 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
192 R.append('dev.off()')
193 rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
194 print >> sys.stdout, '\n'.join(R)
195 print >> sys.stdout, rlog
196
197
198 def getfSize(fpath,outpath):
199 """
200 format a nice file size string
201 """
202 size = ''
203 fp = os.path.join(outpath,fpath)
204 if os.path.isfile(fp):
205 n = float(os.path.getsize(fp))
206 if n > 2**20:
207 size = ' (%1.1f MB)' % (n/2**20)
208 elif n > 2**10:
209 size = ' (%1.1f KB)' % (n/2**10)
210 elif n > 0:
211 size = ' (%d B)' % (int(n))
212 return size
213
214
215 def runEigen():
216 """ run the smartpca prog - documentation follows
217
218 smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
219 fakeped_100.eigenlog -o fakeped_100.eigenout
220
221 DOCUMENTATION OF smartpca.perl program:
222
223 This program calls the smartpca program (see ../POPGEN/README).
224 For this to work, the bin directory containing smartpca MUST be in your path.
225 See ./example.perl for a toy example.
226
227 ../bin/smartpca.perl
228 -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
229 -a example.snp : snp file (see ../CONVERTF/README)
230 -b example.ind : indiv file (see ../CONVERTF/README)
231 -k k : (Default is 10) number of principal components to output
232 -o example.pca : output file of principal components. Individuals removed
233 as outliers will have all values set to 0.0 in this file.
234 -p example.plot : prefix of output plot files of top 2 principal components.
235 (labeling individuals according to labels in indiv file)
236 -e example.eval : output file of all eigenvalues
237 -l example.log : output logfile
238 -m maxiter : (Default is 5) maximum number of outlier removal iterations.
239 To turn off outlier removal, set -m 0.
240 -t topk : (Default is 10) number of principal components along which
241 to remove outliers during each outlier removal iteration.
242 -s sigma : (Default is 6.0) number of standard deviations which an
243 individual must exceed, along one of topk top principal
244 components, in order to be removed as an outlier.
245
246 now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
247
248 All files can be viewed however, by making links in the primary (HTML) history item like:
249 <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
250 <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
251
252 <command interpreter="python">
253 rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
254 "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
255 </command>
256
257 """
258 if len(sys.argv) < 9:
259 print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
260 print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
261 sys.exit(1)
262 else:
263 print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
264 skillme = ' %s' % string.punctuation
265 trantab = string.maketrans(skillme,'_'*len(skillme))
266 ofname = sys.argv[5]
267 progname = os.path.basename(sys.argv[0])
268 infile = sys.argv[1]
269 infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
270 title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
271 outfile1 = sys.argv[3]
272 newfilepath = sys.argv[4]
273 try:
274 os.mkdirs(newfilepath)
275 except:
276 pass
277 op = os.path.split(outfile1)[0]
278 try: # for test - needs this done
279 os.makedirs(op)
280 except:
281 pass
282 eigen_k = sys.argv[5]
283 eigen_m = sys.argv[6]
284 eigen_t = sys.argv[7]
285 eigen_s = sys.argv[8]
286 eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
287 eigentitle = os.path.join(newfilepath,title)
288 explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
289 'Smartpca log (contents shown below)']
290 rplotname = 'PCAPlot.pdf'
291 eigenexts = [rplotname, "pca.xls", "eval.xls"]
292 newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
293 rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
294 eigenouts = [x for x in newfiles]
295 eigenlogf = '%s_log.txt' % title
296 newfiles.append(eigenlogf) # so it will also appear in the links
297 lfname = outfile1
298 lf = file(lfname,'w')
299 lf.write(galhtmlprefix % progname)
300 try:
301 os.makedirs(newfilepath)
302 except:
303 pass
304 smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
305 (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
306 eigen_k, eigen_m, eigen_t, eigen_s)
307 env = os.environ
308 p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
309 retval = p.wait()
310 # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
311 elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
312 eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
313 try:
314 eigpcaRes = file(eeigen,'r').read()
315 except:
316 eigpcaRes = ''
317 file(eigpca,'w').write(eigpcaRes)
318 makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
319 s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
320 lf.write('<h4>%s</h4>\n' % s)
321 lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
322 lf.write('(click on the image below to see a much higher quality PDF version)')
323 thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
324 if os.path.exists(os.path.join(newfilepath,thumbnail)):
325 lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
326 lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
327 % (newfiles[0],thumbnail,explanations[0]))
328 allfiles = os.listdir(newfilepath)
329 allfiles.sort()
330 sizes = [getfSize(x,newfilepath) for x in allfiles]
331 lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
332 lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
333 lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
334 lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
335 s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
336 lf.write(s)
337 lf.write(galhtmlpostfix) # end galhtmlprefix div
338 lf.close()
339
340
341 if __name__ == "__main__":
342 runEigen()