0
|
1 """
|
|
2 run smartpca
|
|
3
|
|
4 This uses galaxy code developed by Dan to deal with
|
|
5 arbitrary output files using an html dataset with it's own
|
|
6 subdirectory containing the arbitrary files
|
|
7 We create that html file and add all the links we need
|
|
8
|
|
9 Note that we execute the smartpca.perl program in the output subdirectory
|
|
10 to avoid having to clear out the job directory after running
|
|
11
|
|
12 Code to convert linkage format ped files into eigenstratgeno format is left here
|
|
13 in case we decide to autoconvert
|
|
14
|
|
15 Added a plot in R with better labels than the default eigensoft plot december 26 2007
|
|
16
|
|
17 DOCUMENTATION OF smartpca program:
|
|
18
|
|
19 smartpca runs Principal Components Analysis on input genotype data and
|
|
20 outputs principal components (eigenvectors) and eigenvalues.
|
|
21 The method assumes that samples are unrelated. (However, a small number
|
|
22 of cryptically related individuals is usually not a problem in practice
|
|
23 as they will typically be discarded as outliers.)
|
|
24
|
|
25 5 different input formats are supported. See ../CONVERTF/README
|
|
26 for documentation on using the convertf program to convert between formats.
|
|
27
|
|
28 The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate
|
|
29 how parfile works via a toy example (see example.perl in this directory).
|
|
30 This example takes input in EIGENSTRAT format. The syntax of how to take input
|
|
31 in other formats is analogous to the convertf program, see ../CONVERTF/README.
|
|
32
|
|
33 The smartpca program prints various statistics to standard output.
|
|
34 To redirect this information to a file, change the above syntax to
|
|
35 "../bin/smartpca -p parfile >logfile". For a description of these
|
|
36 statistics, see the documentation file smartpca.info in this directory.
|
|
37
|
|
38 Estimated running time of the smartpca program is
|
|
39 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
|
|
40 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.
|
|
41 Thus, under the default of up to 5 outlier removal iterations, running time is
|
|
42 up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
|
|
43
|
|
44 ------------------------------------------------------------------------
|
|
45
|
|
46 DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
|
|
47
|
|
48 genotypename: input genotype file (in any format: see ../CONVERTF/README)
|
|
49 snpname: input snp file (in any format: see ../CONVERTF/README)
|
|
50 indivname: input indiv file (in any format: see ../CONVERTF/README)
|
|
51 evecoutname: output file of eigenvectors. See numoutevec parameter below.
|
|
52 evaloutname: output file of all eigenvalues
|
|
53
|
|
54 OPTIONAL PARAMETERS:
|
|
55
|
|
56 numoutevec: number of eigenvectors to output. Default is 10.
|
|
57 numoutlieriter: maximum number of outlier removal iterations.
|
|
58 Default is 5. To turn off outlier removal, set this parameter to 0.
|
|
59 numoutlierevec: number of principal components along which to
|
|
60 remove outliers during each outlier removal iteration. Default is 10.
|
|
61 outliersigmathresh: number of standard deviations which an individual must
|
|
62 exceed, along one of the top (numoutlierevec) principal components, in
|
|
63 order for that individual to be removed as an outlier. Default is 6.0.
|
|
64 outlieroutname: output logfile of outlier individuals removed. If not specified,
|
|
65 smartpca will print this information to stdout, which is the default.
|
|
66 usenorm: Whether to normalize each SNP by a quantity related to allele freq.
|
|
67 Default is YES. (When analyzing microsatellite data, should be set to NO.
|
|
68 See Patterson et al. 2006.)
|
|
69 altnormstyle: Affects very subtle details in normalization formula.
|
|
70 Default is YES (normalization formulas of Patterson et al. 2006)
|
|
71 To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
|
|
72 missingmode: If set to YES, then instead of doing PCA on # reference alleles,
|
|
73 do PCA on whether each data point is missing or nonmissing. Default is NO.
|
|
74 May be useful for detecting informative missingness (Clayton et al. 2005).
|
|
75 nsnpldregress: If set to a positive integer, then LD correction is turned on,
|
|
76 and input to PCA will be the residual of a regression involving that many
|
|
77 previous SNPs, according to physical location. See Patterson et al. 2006.
|
|
78 Default is 0 (no LD correction). If desiring LD correction, we recommend 2.
|
|
79 maxdistldregress: If doing LD correction, this is the maximum genetic distance
|
|
80 (in Morgans) for previous SNPs used in LD correction. Default is no maximum.
|
|
81 poplistname: If wishing to infer eigenvectors using only individuals from a
|
|
82 subset of populations, and then project individuals from all populations
|
|
83 onto those eigenvectors, this input file contains a list of population names,
|
|
84 one population name per line, which will be used to infer eigenvectors.
|
|
85 It is assumed that the population of each individual is specified in the
|
|
86 indiv file. Default is to use individuals from all populations.
|
|
87 phylipoutname: output file containing an fst matrix which can be used as input
|
|
88 to programs in the PHYLIP package, such as the "fitch" program for
|
|
89 constructing phylogenetic trees.
|
|
90 noxdata: if set to YES, all SNPs on X chr are excluded from the data set.
|
|
91 The smartpca default for this parameter is YES, since different variances
|
|
92 for males vs. females on X chr may confound PCA analysis.
|
|
93 nomalexhet: if set to YES, any het genotypes on X chr for males are changed
|
|
94 to missing data. The smartpca default for this parameter is YES.
|
|
95 badsnpname: specifies a list of SNPs which should be excluded from the data set.
|
|
96 Same format as example.snp. Cannot be used if input is in
|
|
97 PACKEDPED or PACKEDANCESTRYMAP format.
|
|
98 popsizelimit: If set to a positive integer, the result is that only the first
|
|
99 popsizelimit individuals from each population will be included in the
|
|
100 analysis. It is assumed that the population of each individual is specified
|
|
101 in the indiv file. Default is to use all individuals in the analysis.
|
|
102
|
|
103 The next 5 optional parameters allow the user to output genotype, snp and
|
|
104 indiv files which will be identical to the input files except that:
|
|
105 Any individuals set to Ignore in the input indiv file will be
|
|
106 removed from the data set (see ../CONVERTF/README)
|
|
107 Any data excluded or set to missing based on noxdata, nomalexhet and
|
|
108 badsnpname parameters (see above) will be removed from the data set.
|
|
109 The user may decide to output these files in any format.
|
|
110 outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
|
|
111 genotypeoutname: output genotype file
|
|
112 snpoutname: output snp file
|
|
113 indivoutname: output indiv file
|
|
114 outputgroup: see documentation in ../CONVERTF/README
|
|
115 """
|
|
116 import sys,os,time,subprocess,string,glob
|
|
117 from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
|
|
118 verbose = False
|
|
119
|
|
120 def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
|
|
121 """
|
|
122 the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
|
|
123 the subject class
|
|
124 Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
|
|
125 somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
|
|
126 At least you have the data and the analysis in one single place. Highly reproducible little
|
|
127 piece of research.
|
|
128 """
|
|
129 debug=False
|
|
130 f = file(eigpca,'r')
|
|
131 R = []
|
|
132 if debug:
|
|
133 R.append('sessionInfo()')
|
|
134 R.append("print('dir()=:')")
|
|
135 R.append('dir()')
|
|
136 R.append("print('pdfname=%s')" % pdfname)
|
|
137 gvec = []
|
|
138 pca1 = []
|
|
139 pca2 = []
|
|
140 groups = {}
|
|
141 glist = [] # list for legend
|
|
142 ngroup = 1 # increment for each new group encountered for pch vector
|
|
143 for n,row in enumerate(f):
|
|
144 if n > 1:
|
|
145 rowlist = row.strip().split()
|
|
146 group = rowlist[-1]
|
|
147 v1 = rowlist[1]
|
|
148 v2 = rowlist[2]
|
|
149 try:
|
|
150 v1 = float(v1)
|
|
151 except:
|
|
152 v1 = 0.0
|
|
153 try:
|
|
154 v2 = float(v2)
|
|
155 except:
|
|
156 v2 = 0.0
|
|
157 if not groups.get(group,None):
|
|
158 groups[group] = ngroup
|
|
159 glist.append(group)
|
|
160 ngroup += 1 # for next group
|
|
161 gvec.append(groups[group]) # lookup group number
|
|
162 pca1.append('%f' % v1)
|
|
163 pca2.append('%f' % v2)
|
|
164 # now have vectors of group,pca1 and pca2
|
|
165 llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
|
|
166 llist = ['"%s"' % x for x in llist] # need to quote for R
|
|
167 R.append('llist=c(%s)' % ','.join(llist))
|
|
168
|
|
169 plist = range(2,len(llist)+2) # pch - avoid black circles
|
|
170 R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
|
|
171 pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
|
|
172 R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
|
|
173 R.append("par(mai=c(1,1,1,0.5))")
|
|
174 maint = title
|
|
175 R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
|
|
176 R.append("par(lab=c(10,10,10))")
|
|
177 R.append('pca1 = c(%s)' % ','.join(pca1))
|
|
178 R.append('pca2 = c(%s)' % ','.join(pca2))
|
|
179 R.append('pgvec = c(%s)' % ','.join(pgvec))
|
|
180 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
|
|
181 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
|
|
182 R.append(s)
|
|
183 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
|
|
184 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
|
|
185 R.append('dev.off()')
|
|
186 R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
|
|
187 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
|
|
188 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
|
|
189 R.append(s)
|
|
190 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
|
|
191 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
|
|
192 R.append('dev.off()')
|
|
193 rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
|
|
194 print >> sys.stdout, '\n'.join(R)
|
|
195 print >> sys.stdout, rlog
|
|
196
|
|
197
|
|
198 def getfSize(fpath,outpath):
|
|
199 """
|
|
200 format a nice file size string
|
|
201 """
|
|
202 size = ''
|
|
203 fp = os.path.join(outpath,fpath)
|
|
204 if os.path.isfile(fp):
|
|
205 n = float(os.path.getsize(fp))
|
|
206 if n > 2**20:
|
|
207 size = ' (%1.1f MB)' % (n/2**20)
|
|
208 elif n > 2**10:
|
|
209 size = ' (%1.1f KB)' % (n/2**10)
|
|
210 elif n > 0:
|
|
211 size = ' (%d B)' % (int(n))
|
|
212 return size
|
|
213
|
|
214
|
|
215 def runEigen():
|
|
216 """ run the smartpca prog - documentation follows
|
|
217
|
|
218 smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
|
|
219 fakeped_100.eigenlog -o fakeped_100.eigenout
|
|
220
|
|
221 DOCUMENTATION OF smartpca.perl program:
|
|
222
|
|
223 This program calls the smartpca program (see ../POPGEN/README).
|
|
224 For this to work, the bin directory containing smartpca MUST be in your path.
|
|
225 See ./example.perl for a toy example.
|
|
226
|
|
227 ../bin/smartpca.perl
|
|
228 -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
|
|
229 -a example.snp : snp file (see ../CONVERTF/README)
|
|
230 -b example.ind : indiv file (see ../CONVERTF/README)
|
|
231 -k k : (Default is 10) number of principal components to output
|
|
232 -o example.pca : output file of principal components. Individuals removed
|
|
233 as outliers will have all values set to 0.0 in this file.
|
|
234 -p example.plot : prefix of output plot files of top 2 principal components.
|
|
235 (labeling individuals according to labels in indiv file)
|
|
236 -e example.eval : output file of all eigenvalues
|
|
237 -l example.log : output logfile
|
|
238 -m maxiter : (Default is 5) maximum number of outlier removal iterations.
|
|
239 To turn off outlier removal, set -m 0.
|
|
240 -t topk : (Default is 10) number of principal components along which
|
|
241 to remove outliers during each outlier removal iteration.
|
|
242 -s sigma : (Default is 6.0) number of standard deviations which an
|
|
243 individual must exceed, along one of topk top principal
|
|
244 components, in order to be removed as an outlier.
|
|
245
|
|
246 now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
|
|
247
|
|
248 All files can be viewed however, by making links in the primary (HTML) history item like:
|
|
249 <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
|
|
250 <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
|
|
251
|
|
252 <command interpreter="python">
|
|
253 rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
|
|
254 "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
|
|
255 </command>
|
|
256
|
|
257 """
|
|
258 if len(sys.argv) < 9:
|
|
259 print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
|
|
260 print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
|
|
261 sys.exit(1)
|
|
262 else:
|
|
263 print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
|
|
264 skillme = ' %s' % string.punctuation
|
|
265 trantab = string.maketrans(skillme,'_'*len(skillme))
|
|
266 ofname = sys.argv[5]
|
|
267 progname = os.path.basename(sys.argv[0])
|
|
268 infile = sys.argv[1]
|
|
269 infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
|
|
270 title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
|
|
271 outfile1 = sys.argv[3]
|
|
272 newfilepath = sys.argv[4]
|
|
273 try:
|
|
274 os.mkdirs(newfilepath)
|
|
275 except:
|
|
276 pass
|
|
277 op = os.path.split(outfile1)[0]
|
|
278 try: # for test - needs this done
|
|
279 os.makedirs(op)
|
|
280 except:
|
|
281 pass
|
|
282 eigen_k = sys.argv[5]
|
|
283 eigen_m = sys.argv[6]
|
|
284 eigen_t = sys.argv[7]
|
|
285 eigen_s = sys.argv[8]
|
|
286 eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
|
|
287 eigentitle = os.path.join(newfilepath,title)
|
|
288 explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
|
|
289 'Smartpca log (contents shown below)']
|
|
290 rplotname = 'PCAPlot.pdf'
|
|
291 eigenexts = [rplotname, "pca.xls", "eval.xls"]
|
|
292 newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
|
|
293 rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
|
|
294 eigenouts = [x for x in newfiles]
|
|
295 eigenlogf = '%s_log.txt' % title
|
|
296 newfiles.append(eigenlogf) # so it will also appear in the links
|
|
297 lfname = outfile1
|
|
298 lf = file(lfname,'w')
|
|
299 lf.write(galhtmlprefix % progname)
|
|
300 try:
|
|
301 os.makedirs(newfilepath)
|
|
302 except:
|
|
303 pass
|
|
304 smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
|
|
305 (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
|
|
306 eigen_k, eigen_m, eigen_t, eigen_s)
|
|
307 env = os.environ
|
|
308 p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
|
|
309 retval = p.wait()
|
|
310 # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
|
|
311 elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
|
|
312 eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
|
|
313 try:
|
|
314 eigpcaRes = file(eeigen,'r').read()
|
|
315 except:
|
|
316 eigpcaRes = ''
|
|
317 file(eigpca,'w').write(eigpcaRes)
|
|
318 makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
|
|
319 s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
|
|
320 lf.write('<h4>%s</h4>\n' % s)
|
|
321 lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
|
|
322 lf.write('(click on the image below to see a much higher quality PDF version)')
|
|
323 thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
|
|
324 if os.path.exists(os.path.join(newfilepath,thumbnail)):
|
|
325 lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
|
|
326 lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
|
|
327 % (newfiles[0],thumbnail,explanations[0]))
|
|
328 allfiles = os.listdir(newfilepath)
|
|
329 allfiles.sort()
|
|
330 sizes = [getfSize(x,newfilepath) for x in allfiles]
|
|
331 lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
|
|
332 lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
|
|
333 lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
|
|
334 lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
|
|
335 s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
|
|
336 lf.write(s)
|
|
337 lf.write(galhtmlpostfix) # end galhtmlprefix div
|
|
338 lf.close()
|
|
339
|
|
340
|
|
341 if __name__ == "__main__":
|
|
342 runEigen()
|