Mercurial > repos > xuebing > sharplabtool
diff tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3
Uploaded
author | xuebing |
---|---|
date | Fri, 09 Mar 2012 19:37:19 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/rgenetics/rgEigPCA.py Fri Mar 09 19:37:19 2012 -0500 @@ -0,0 +1,342 @@ +""" +run smartpca + +This uses galaxy code developed by Dan to deal with +arbitrary output files using an html dataset with it's own +subdirectory containing the arbitrary files +We create that html file and add all the links we need + +Note that we execute the smartpca.perl program in the output subdirectory +to avoid having to clear out the job directory after running + +Code to convert linkage format ped files into eigenstratgeno format is left here +in case we decide to autoconvert + +Added a plot in R with better labels than the default eigensoft plot december 26 2007 + +DOCUMENTATION OF smartpca program: + +smartpca runs Principal Components Analysis on input genotype data and + outputs principal components (eigenvectors) and eigenvalues. + The method assumes that samples are unrelated. (However, a small number + of cryptically related individuals is usually not a problem in practice + as they will typically be discarded as outliers.) + +5 different input formats are supported. See ../CONVERTF/README +for documentation on using the convertf program to convert between formats. + +The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate +how parfile works via a toy example (see example.perl in this directory). +This example takes input in EIGENSTRAT format. The syntax of how to take input +in other formats is analogous to the convertf program, see ../CONVERTF/README. + +The smartpca program prints various statistics to standard output. +To redirect this information to a file, change the above syntax to +"../bin/smartpca -p parfile >logfile". For a description of these +statistics, see the documentation file smartpca.info in this directory. + +Estimated running time of the smartpca program is + 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers. + 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations. +Thus, under the default of up to 5 outlier removal iterations, running time is + up to 1.5e-11 * nSNP * NSAMPLES^2 hours. + +------------------------------------------------------------------------ + +DESCRIPTION OF EACH PARAMETER in parfile for smartpca: + +genotypename: input genotype file (in any format: see ../CONVERTF/README) +snpname: input snp file (in any format: see ../CONVERTF/README) +indivname: input indiv file (in any format: see ../CONVERTF/README) +evecoutname: output file of eigenvectors. See numoutevec parameter below. +evaloutname: output file of all eigenvalues + +OPTIONAL PARAMETERS: + +numoutevec: number of eigenvectors to output. Default is 10. +numoutlieriter: maximum number of outlier removal iterations. + Default is 5. To turn off outlier removal, set this parameter to 0. +numoutlierevec: number of principal components along which to + remove outliers during each outlier removal iteration. Default is 10. +outliersigmathresh: number of standard deviations which an individual must + exceed, along one of the top (numoutlierevec) principal components, in + order for that individual to be removed as an outlier. Default is 6.0. +outlieroutname: output logfile of outlier individuals removed. If not specified, + smartpca will print this information to stdout, which is the default. +usenorm: Whether to normalize each SNP by a quantity related to allele freq. + Default is YES. (When analyzing microsatellite data, should be set to NO. + See Patterson et al. 2006.) +altnormstyle: Affects very subtle details in normalization formula. + Default is YES (normalization formulas of Patterson et al. 2006) + To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO. +missingmode: If set to YES, then instead of doing PCA on # reference alleles, + do PCA on whether each data point is missing or nonmissing. Default is NO. + May be useful for detecting informative missingness (Clayton et al. 2005). +nsnpldregress: If set to a positive integer, then LD correction is turned on, + and input to PCA will be the residual of a regression involving that many + previous SNPs, according to physical location. See Patterson et al. 2006. + Default is 0 (no LD correction). If desiring LD correction, we recommend 2. +maxdistldregress: If doing LD correction, this is the maximum genetic distance + (in Morgans) for previous SNPs used in LD correction. Default is no maximum. +poplistname: If wishing to infer eigenvectors using only individuals from a + subset of populations, and then project individuals from all populations + onto those eigenvectors, this input file contains a list of population names, + one population name per line, which will be used to infer eigenvectors. + It is assumed that the population of each individual is specified in the + indiv file. Default is to use individuals from all populations. +phylipoutname: output file containing an fst matrix which can be used as input + to programs in the PHYLIP package, such as the "fitch" program for + constructing phylogenetic trees. +noxdata: if set to YES, all SNPs on X chr are excluded from the data set. + The smartpca default for this parameter is YES, since different variances + for males vs. females on X chr may confound PCA analysis. +nomalexhet: if set to YES, any het genotypes on X chr for males are changed + to missing data. The smartpca default for this parameter is YES. +badsnpname: specifies a list of SNPs which should be excluded from the data set. + Same format as example.snp. Cannot be used if input is in + PACKEDPED or PACKEDANCESTRYMAP format. +popsizelimit: If set to a positive integer, the result is that only the first + popsizelimit individuals from each population will be included in the + analysis. It is assumed that the population of each individual is specified + in the indiv file. Default is to use all individuals in the analysis. + +The next 5 optional parameters allow the user to output genotype, snp and + indiv files which will be identical to the input files except that: + Any individuals set to Ignore in the input indiv file will be + removed from the data set (see ../CONVERTF/README) + Any data excluded or set to missing based on noxdata, nomalexhet and + badsnpname parameters (see above) will be removed from the data set. + The user may decide to output these files in any format. +outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP +genotypeoutname: output genotype file +snpoutname: output snp file +indivoutname: output indiv file +outputgroup: see documentation in ../CONVERTF/README +""" +import sys,os,time,subprocess,string,glob +from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke +verbose = False + +def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''): + """ + the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly + the subject class + Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers + somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings + At least you have the data and the analysis in one single place. Highly reproducible little + piece of research. + """ + debug=False + f = file(eigpca,'r') + R = [] + if debug: + R.append('sessionInfo()') + R.append("print('dir()=:')") + R.append('dir()') + R.append("print('pdfname=%s')" % pdfname) + gvec = [] + pca1 = [] + pca2 = [] + groups = {} + glist = [] # list for legend + ngroup = 1 # increment for each new group encountered for pch vector + for n,row in enumerate(f): + if n > 1: + rowlist = row.strip().split() + group = rowlist[-1] + v1 = rowlist[1] + v2 = rowlist[2] + try: + v1 = float(v1) + except: + v1 = 0.0 + try: + v2 = float(v2) + except: + v2 = 0.0 + if not groups.get(group,None): + groups[group] = ngroup + glist.append(group) + ngroup += 1 # for next group + gvec.append(groups[group]) # lookup group number + pca1.append('%f' % v1) + pca2.append('%f' % v2) + # now have vectors of group,pca1 and pca2 + llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh + llist = ['"%s"' % x for x in llist] # need to quote for R + R.append('llist=c(%s)' % ','.join(llist)) + + plist = range(2,len(llist)+2) # pch - avoid black circles + R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist])) + pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point + R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5 + R.append("par(mai=c(1,1,1,0.5))") + maint = title + R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w)) + R.append("par(lab=c(10,10,10))") + R.append('pca1 = c(%s)' % ','.join(pca1)) + R.append('pca2 = c(%s)' % ','.join(pca2)) + R.append('pgvec = c(%s)' % ','.join(pgvec)) + s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint + s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" + R.append(s) + R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') + R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') + R.append('dev.off()') + R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w)) + s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint + s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)" + R.append(s) + R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")') + R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")') + R.append('dev.off()') + rlog,flist = RRun(rcmd=R,title=title,outdir=nfp) + print >> sys.stdout, '\n'.join(R) + print >> sys.stdout, rlog + + +def getfSize(fpath,outpath): + """ + format a nice file size string + """ + size = '' + fp = os.path.join(outpath,fpath) + if os.path.isfile(fp): + n = float(os.path.getsize(fp)) + if n > 2**20: + size = ' (%1.1f MB)' % (n/2**20) + elif n > 2**10: + size = ' (%1.1f KB)' % (n/2**10) + elif n > 0: + size = ' (%d B)' % (int(n)) + return size + + +def runEigen(): + """ run the smartpca prog - documentation follows + + smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l + fakeped_100.eigenlog -o fakeped_100.eigenout + +DOCUMENTATION OF smartpca.perl program: + +This program calls the smartpca program (see ../POPGEN/README). +For this to work, the bin directory containing smartpca MUST be in your path. +See ./example.perl for a toy example. + +../bin/smartpca.perl +-i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README) +-a example.snp : snp file (see ../CONVERTF/README) +-b example.ind : indiv file (see ../CONVERTF/README) +-k k : (Default is 10) number of principal components to output +-o example.pca : output file of principal components. Individuals removed + as outliers will have all values set to 0.0 in this file. +-p example.plot : prefix of output plot files of top 2 principal components. + (labeling individuals according to labels in indiv file) +-e example.eval : output file of all eigenvalues +-l example.log : output logfile +-m maxiter : (Default is 5) maximum number of outlier removal iterations. + To turn off outlier removal, set -m 0. +-t topk : (Default is 10) number of principal components along which + to remove outliers during each outlier removal iteration. +-s sigma : (Default is 6.0) number of standard deviations which an + individual must exceed, along one of topk top principal + components, in order to be removed as an outlier. + + now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832 + +All files can be viewed however, by making links in the primary (HTML) history item like: +<img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/> +<a href="display_child?parent_id=2&designation=SomeText?">Some Text</a> + + <command interpreter="python"> + rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1" + "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca" + </command> + + """ + if len(sys.argv) < 9: + print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,' + print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft' + sys.exit(1) + else: + print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv)) + skillme = ' %s' % string.punctuation + trantab = string.maketrans(skillme,'_'*len(skillme)) + ofname = sys.argv[5] + progname = os.path.basename(sys.argv[0]) + infile = sys.argv[1] + infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset + title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title + outfile1 = sys.argv[3] + newfilepath = sys.argv[4] + try: + os.mkdirs(newfilepath) + except: + pass + op = os.path.split(outfile1)[0] + try: # for test - needs this done + os.makedirs(op) + except: + pass + eigen_k = sys.argv[5] + eigen_m = sys.argv[6] + eigen_t = sys.argv[7] + eigen_s = sys.argv[8] + eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment + eigentitle = os.path.join(newfilepath,title) + explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues', + 'Smartpca log (contents shown below)'] + rplotname = 'PCAPlot.pdf' + eigenexts = [rplotname, "pca.xls", "eval.xls"] + newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat + rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots + eigenouts = [x for x in newfiles] + eigenlogf = '%s_log.txt' % title + newfiles.append(eigenlogf) # so it will also appear in the links + lfname = outfile1 + lf = file(lfname,'w') + lf.write(galhtmlprefix % progname) + try: + os.makedirs(newfilepath) + except: + pass + smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \ + (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \ + eigen_k, eigen_m, eigen_t, eigen_s) + env = os.environ + p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath) + retval = p.wait() + # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory + elog = file(os.path.join(newfilepath,eigenlogf),'r').read() + eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting + try: + eigpcaRes = file(eeigen,'r').read() + except: + eigpcaRes = '' + file(eigpca,'w').write(eigpcaRes) + makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe) + s = 'Output from %s run at %s<br/>\n' % (progname,timenow()) + lf.write('<h4>%s</h4>\n' % s) + lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe)) + lf.write('(click on the image below to see a much higher quality PDF version)') + thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares? + if os.path.exists(os.path.join(newfilepath,thumbnail)): + lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n') + lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \ + % (newfiles[0],thumbnail,explanations[0])) + allfiles = os.listdir(newfilepath) + allfiles.sort() + sizes = [getfSize(x,newfilepath) for x in allfiles] + lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list + lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles)) + lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf) + lf.write('<pre>%s</pre></div>' % elog) # the eigenlog + s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL) + lf.write(s) + lf.write(galhtmlpostfix) # end galhtmlprefix div + lf.close() + + +if __name__ == "__main__": + runEigen()