Mercurial > repos > xuebing > sharplabtool
comparison tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3
Uploaded
| author | xuebing |
|---|---|
| date | Fri, 09 Mar 2012 19:37:19 -0500 |
| parents | |
| children |
comparison
equal
deleted
inserted
replaced
| -1:000000000000 | 0:9071e359b9a3 |
|---|---|
| 1 """ | |
| 2 run smartpca | |
| 3 | |
| 4 This uses galaxy code developed by Dan to deal with | |
| 5 arbitrary output files using an html dataset with it's own | |
| 6 subdirectory containing the arbitrary files | |
| 7 We create that html file and add all the links we need | |
| 8 | |
| 9 Note that we execute the smartpca.perl program in the output subdirectory | |
| 10 to avoid having to clear out the job directory after running | |
| 11 | |
| 12 Code to convert linkage format ped files into eigenstratgeno format is left here | |
| 13 in case we decide to autoconvert | |
| 14 | |
| 15 Added a plot in R with better labels than the default eigensoft plot december 26 2007 | |
| 16 | |
| 17 DOCUMENTATION OF smartpca program: | |
| 18 | |
| 19 smartpca runs Principal Components Analysis on input genotype data and | |
| 20 outputs principal components (eigenvectors) and eigenvalues. | |
| 21 The method assumes that samples are unrelated. (However, a small number | |
| 22 of cryptically related individuals is usually not a problem in practice | |
| 23 as they will typically be discarded as outliers.) | |
| 24 | |
| 25 5 different input formats are supported. See ../CONVERTF/README | |
| 26 for documentation on using the convertf program to convert between formats. | |
| 27 | |
| 28 The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate | |
| 29 how parfile works via a toy example (see example.perl in this directory). | |
| 30 This example takes input in EIGENSTRAT format. The syntax of how to take input | |
| 31 in other formats is analogous to the convertf program, see ../CONVERTF/README. | |
| 32 | |
| 33 The smartpca program prints various statistics to standard output. | |
| 34 To redirect this information to a file, change the above syntax to | |
| 35 "../bin/smartpca -p parfile >logfile". For a description of these | |
| 36 statistics, see the documentation file smartpca.info in this directory. | |
| 37 | |
| 38 Estimated running time of the smartpca program is | |
| 39 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers. | |
| 40 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations. | |
| 41 Thus, under the default of up to 5 outlier removal iterations, running time is | |
| 42 up to 1.5e-11 * nSNP * NSAMPLES^2 hours. | |
| 43 | |
| 44 ------------------------------------------------------------------------ | |
| 45 | |
| 46 DESCRIPTION OF EACH PARAMETER in parfile for smartpca: | |
| 47 | |
| 48 genotypename: input genotype file (in any format: see ../CONVERTF/README) | |
| 49 snpname: input snp file (in any format: see ../CONVERTF/README) | |
| 50 indivname: input indiv file (in any format: see ../CONVERTF/README) | |
| 51 evecoutname: output file of eigenvectors. See numoutevec parameter below. | |
| 52 evaloutname: output file of all eigenvalues | |
| 53 | |
| 54 OPTIONAL PARAMETERS: | |
| 55 | |
| 56 numoutevec: number of eigenvectors to output. Default is 10. | |
| 57 numoutlieriter: maximum number of outlier removal iterations. | |
| 58 Default is 5. To turn off outlier removal, set this parameter to 0. | |
| 59 numoutlierevec: number of principal components along which to | |
| 60 remove outliers during each outlier removal iteration. Default is 10. | |
| 61 outliersigmathresh: number of standard deviations which an individual must | |
| 62 exceed, along one of the top (numoutlierevec) principal components, in | |
| 63 order for that individual to be removed as an outlier. Default is 6.0. | |
| 64 outlieroutname: output logfile of outlier individuals removed. If not specified, | |
| 65 smartpca will print this information to stdout, which is the default. | |
| 66 usenorm: Whether to normalize each SNP by a quantity related to allele freq. | |
| 67 Default is YES. (When analyzing microsatellite data, should be set to NO. | |
| 68 See Patterson et al. 2006.) | |
| 69 altnormstyle: Affects very subtle details in normalization formula. | |
| 70 Default is YES (normalization formulas of Patterson et al. 2006) | |
| 71 To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO. | |
| 72 missingmode: If set to YES, then instead of doing PCA on # reference alleles, | |
| 73 do PCA on whether each data point is missing or nonmissing. Default is NO. | |
| 74 May be useful for detecting informative missingness (Clayton et al. 2005). | |
| 75 nsnpldregress: If set to a positive integer, then LD correction is turned on, | |
| 76 and input to PCA will be the residual of a regression involving that many | |
| 77 previous SNPs, according to physical location. See Patterson et al. 2006. | |
| 78 Default is 0 (no LD correction). If desiring LD correction, we recommend 2. | |
| 79 maxdistldregress: If doing LD correction, this is the maximum genetic distance | |
| 80 (in Morgans) for previous SNPs used in LD correction. Default is no maximum. | |
| 81 poplistname: If wishing to infer eigenvectors using only individuals from a | |
| 82 subset of populations, and then project individuals from all populations | |
| 83 onto those eigenvectors, this input file contains a list of population names, | |
| 84 one population name per line, which will be used to infer eigenvectors. | |
| 85 It is assumed that the population of each individual is specified in the | |
| 86 indiv file. Default is to use individuals from all populations. | |
| 87 phylipoutname: output file containing an fst matrix which can be used as input | |
| 88 to programs in the PHYLIP package, such as the "fitch" program for | |
| 89 constructing phylogenetic trees. | |
| 90 noxdata: if set to YES, all SNPs on X chr are excluded from the data set. | |
| 91 The smartpca default for this parameter is YES, since different variances | |
| 92 for males vs. females on X chr may confound PCA analysis. | |
| 93 nomalexhet: if set to YES, any het genotypes on X chr for males are changed | |
| 94 to missing data. The smartpca default for this parameter is YES. | |
| 95 badsnpname: specifies a list of SNPs which should be excluded from the data set. | |
| 96 Same format as example.snp. Cannot be used if input is in | |
| 97 PACKEDPED or PACKEDANCESTRYMAP format. | |
| 98 popsizelimit: If set to a positive integer, the result is that only the first | |
| 99 popsizelimit individuals from each population will be included in the | |
| 100 analysis. It is assumed that the population of each individual is specified | |
| 101 in the indiv file. Default is to use all individuals in the analysis. | |
| 102 | |
| 103 The next 5 optional parameters allow the user to output genotype, snp and | |
| 104 indiv files which will be identical to the input files except that: | |
| 105 Any individuals set to Ignore in the input indiv file will be | |
| 106 removed from the data set (see ../CONVERTF/README) | |
| 107 Any data excluded or set to missing based on noxdata, nomalexhet and | |
| 108 badsnpname parameters (see above) will be removed from the data set. | |
| 109 The user may decide to output these files in any format. | |
| 110 outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP | |
| 111 genotypeoutname: output genotype file | |
| 112 snpoutname: output snp file | |
| 113 indivoutname: output indiv file | |
| 114 outputgroup: see documentation in ../CONVERTF/README | |
| 115 """ | |
| 116 import sys,os,time,subprocess,string,glob | |
| 117 from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke | |
| 118 verbose = False | |
| 119 | |
| 120 def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''): | |
| 121 """ | |
| 122 the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly | |
| 123 the subject class | |
| 124 Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers | |
| 125 somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings | |
| 126 At least you have the data and the analysis in one single place. Highly reproducible little | |
| 127 piece of research. | |
| 128 """ | |
| 129 debug=False | |
| 130 f = file(eigpca,'r') | |
| 131 R = [] | |
| 132 if debug: | |
| 133 R.append('sessionInfo()') | |
| 134 R.append("print('dir()=:')") | |
| 135 R.append('dir()') | |
| 136 R.append("print('pdfname=%s')" % pdfname) | |
| 137 gvec = [] | |
| 138 pca1 = [] | |
| 139 pca2 = [] | |
| 140 groups = {} | |
| 141 glist = [] # list for legend | |
| 142 ngroup = 1 # increment for each new group encountered for pch vector | |
| 143 for n,row in enumerate(f): | |
| 144 if n > 1: | |
| 145 rowlist = row.strip().split() | |
| 146 group = rowlist[-1] | |
| 147 v1 = rowlist[1] | |
| 148 v2 = rowlist[2] | |
| 149 try: | |
| 150 v1 = float(v1) | |
| 151 except: | |
| 152 v1 = 0.0 | |
| 153 try: | |
| 154 v2 = float(v2) | |
| 155 except: | |
| 156 v2 = 0.0 | |
| 157 if not groups.get(group,None): | |
| 158 groups[group] = ngroup | |
| 159 glist.append(group) | |
| 160 ngroup += 1 # for next group | |
| 161 gvec.append(groups[group]) # lookup group number | |
| 162 pca1.append('%f' % v1) | |
| 163 pca2.append('%f' % v2) | |
| 164 # now have vectors of group,pca1 and pca2 | |
| 165 llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh | |
| 166 llist = ['"%s"' % x for x in llist] # need to quote for R | |
| 167 R.append('llist=c(%s)' % ','.join(llist)) | |
| 168 | |
| 169 plist = range(2,len(llist)+2) # pch - avoid black circles | |
| 170 R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist])) | |
| 171 pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point | |
| 172 R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5 | |
| 173 R.append("par(mai=c(1,1,1,0.5))") | |
| 174 maint = title | |
| 175 R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w)) | |
| 176 R.append("par(lab=c(10,10,10))") | |
| 177 R.append('pca1 = c(%s)' % ','.join(pca1)) | |
| 178 R.append('pca2 = c(%s)' % ','.join(pca2)) | |
| 179 R.append('pgvec = c(%s)' % ','.join(pgvec)) | |
| 180 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint | |
| 181 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" | |
| 182 R.append(s) | |
| 183 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') | |
| 184 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') | |
| 185 R.append('dev.off()') | |
| 186 R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w)) | |
| 187 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint | |
| 188 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" | |
| 189 R.append(s) | |
| 190 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') | |
| 191 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') | |
| 192 R.append('dev.off()') | |
| 193 rlog,flist = RRun(rcmd=R,title=title,outdir=nfp) | |
| 194 print >> sys.stdout, '\n'.join(R) | |
| 195 print >> sys.stdout, rlog | |
| 196 | |
| 197 | |
| 198 def getfSize(fpath,outpath): | |
| 199 """ | |
| 200 format a nice file size string | |
| 201 """ | |
| 202 size = '' | |
| 203 fp = os.path.join(outpath,fpath) | |
| 204 if os.path.isfile(fp): | |
| 205 n = float(os.path.getsize(fp)) | |
| 206 if n > 2**20: | |
| 207 size = ' (%1.1f MB)' % (n/2**20) | |
| 208 elif n > 2**10: | |
| 209 size = ' (%1.1f KB)' % (n/2**10) | |
| 210 elif n > 0: | |
| 211 size = ' (%d B)' % (int(n)) | |
| 212 return size | |
| 213 | |
| 214 | |
| 215 def runEigen(): | |
| 216 """ run the smartpca prog - documentation follows | |
| 217 | |
| 218 smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l | |
| 219 fakeped_100.eigenlog -o fakeped_100.eigenout | |
| 220 | |
| 221 DOCUMENTATION OF smartpca.perl program: | |
| 222 | |
| 223 This program calls the smartpca program (see ../POPGEN/README). | |
| 224 For this to work, the bin directory containing smartpca MUST be in your path. | |
| 225 See ./example.perl for a toy example. | |
| 226 | |
| 227 ../bin/smartpca.perl | |
| 228 -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README) | |
| 229 -a example.snp : snp file (see ../CONVERTF/README) | |
| 230 -b example.ind : indiv file (see ../CONVERTF/README) | |
| 231 -k k : (Default is 10) number of principal components to output | |
| 232 -o example.pca : output file of principal components. Individuals removed | |
| 233 as outliers will have all values set to 0.0 in this file. | |
| 234 -p example.plot : prefix of output plot files of top 2 principal components. | |
| 235 (labeling individuals according to labels in indiv file) | |
| 236 -e example.eval : output file of all eigenvalues | |
| 237 -l example.log : output logfile | |
| 238 -m maxiter : (Default is 5) maximum number of outlier removal iterations. | |
| 239 To turn off outlier removal, set -m 0. | |
| 240 -t topk : (Default is 10) number of principal components along which | |
| 241 to remove outliers during each outlier removal iteration. | |
| 242 -s sigma : (Default is 6.0) number of standard deviations which an | |
| 243 individual must exceed, along one of topk top principal | |
| 244 components, in order to be removed as an outlier. | |
| 245 | |
| 246 now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832 | |
| 247 | |
| 248 All files can be viewed however, by making links in the primary (HTML) history item like: | |
| 249 <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/> | |
| 250 <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a> | |
| 251 | |
| 252 <command interpreter="python"> | |
| 253 rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1" | |
| 254 "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca" | |
| 255 </command> | |
| 256 | |
| 257 """ | |
| 258 if len(sys.argv) < 9: | |
| 259 print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,' | |
| 260 print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft' | |
| 261 sys.exit(1) | |
| 262 else: | |
| 263 print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv)) | |
| 264 skillme = ' %s' % string.punctuation | |
| 265 trantab = string.maketrans(skillme,'_'*len(skillme)) | |
| 266 ofname = sys.argv[5] | |
| 267 progname = os.path.basename(sys.argv[0]) | |
| 268 infile = sys.argv[1] | |
| 269 infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset | |
| 270 title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title | |
| 271 outfile1 = sys.argv[3] | |
| 272 newfilepath = sys.argv[4] | |
| 273 try: | |
| 274 os.mkdirs(newfilepath) | |
| 275 except: | |
| 276 pass | |
| 277 op = os.path.split(outfile1)[0] | |
| 278 try: # for test - needs this done | |
| 279 os.makedirs(op) | |
| 280 except: | |
| 281 pass | |
| 282 eigen_k = sys.argv[5] | |
| 283 eigen_m = sys.argv[6] | |
| 284 eigen_t = sys.argv[7] | |
| 285 eigen_s = sys.argv[8] | |
| 286 eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment | |
| 287 eigentitle = os.path.join(newfilepath,title) | |
| 288 explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues', | |
| 289 'Smartpca log (contents shown below)'] | |
| 290 rplotname = 'PCAPlot.pdf' | |
| 291 eigenexts = [rplotname, "pca.xls", "eval.xls"] | |
| 292 newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat | |
| 293 rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots | |
| 294 eigenouts = [x for x in newfiles] | |
| 295 eigenlogf = '%s_log.txt' % title | |
| 296 newfiles.append(eigenlogf) # so it will also appear in the links | |
| 297 lfname = outfile1 | |
| 298 lf = file(lfname,'w') | |
| 299 lf.write(galhtmlprefix % progname) | |
| 300 try: | |
| 301 os.makedirs(newfilepath) | |
| 302 except: | |
| 303 pass | |
| 304 smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \ | |
| 305 (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \ | |
| 306 eigen_k, eigen_m, eigen_t, eigen_s) | |
| 307 env = os.environ | |
| 308 p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath) | |
| 309 retval = p.wait() | |
| 310 # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory | |
| 311 elog = file(os.path.join(newfilepath,eigenlogf),'r').read() | |
| 312 eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting | |
| 313 try: | |
| 314 eigpcaRes = file(eeigen,'r').read() | |
| 315 except: | |
| 316 eigpcaRes = '' | |
| 317 file(eigpca,'w').write(eigpcaRes) | |
| 318 makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe) | |
| 319 s = 'Output from %s run at %s<br/>\n' % (progname,timenow()) | |
| 320 lf.write('<h4>%s</h4>\n' % s) | |
| 321 lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe)) | |
| 322 lf.write('(click on the image below to see a much higher quality PDF version)') | |
| 323 thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares? | |
| 324 if os.path.exists(os.path.join(newfilepath,thumbnail)): | |
| 325 lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n') | |
| 326 lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \ | |
| 327 % (newfiles[0],thumbnail,explanations[0])) | |
| 328 allfiles = os.listdir(newfilepath) | |
| 329 allfiles.sort() | |
| 330 sizes = [getfSize(x,newfilepath) for x in allfiles] | |
| 331 lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list | |
| 332 lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles)) | |
| 333 lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf) | |
| 334 lf.write('<pre>%s</pre></div>' % elog) # the eigenlog | |
| 335 s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL) | |
| 336 lf.write(s) | |
| 337 lf.write(galhtmlpostfix) # end galhtmlprefix div | |
| 338 lf.close() | |
| 339 | |
| 340 | |
| 341 if __name__ == "__main__": | |
| 342 runEigen() |
