sharplabtool: tools/rgenetics/rgEigPCA.py comparison

comparison tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3

Uploaded

author	xuebing
date	Fri, 09 Mar 2012 19:37:19 -0500
parents
children

comparison

equal deleted inserted replaced

--1:000000000000
+:9071e359b9a3
+"""
+run smartpca
+This uses galaxy code developed by Dan to deal with
+arbitrary output files using an html dataset with it's own
+subdirectory containing the arbitrary files
+We create that html file and add all the links we need
+Note that we execute the smartpca.perl program in the output subdirectory
+to avoid having to clear out the job directory after running
+Code to convert linkage format ped files into eigenstratgeno format is left here
+in case we decide to autoconvert
+Added a plot in R with better labels than the default eigensoft plot december 26 2007
+DOCUMENTATION OF smartpca program:
+smartpca runs Principal Components Analysis on input genotype data and
+outputs principal components (eigenvectors) and eigenvalues.
+The method assumes that samples are unrelated.  (However, a small number
+of cryptically related individuals is usually not a problem in practice
+as they will typically be discarded as outliers.)
+5 different input formats are supported.  See ../CONVERTF/README
+for documentation on using the convertf program to convert between formats.
+The syntax of smartpca is "../bin/smartpca -p parfile".  We illustrate
+how parfile works via a toy example (see example.perl in this directory).
+This example takes input in EIGENSTRAT format.  The syntax of how to take input
+in other formats is analogous to the convertf program, see ../CONVERTF/README.
+The smartpca program prints various statistics to standard output.
+To redirect this information to a file, change the above syntax to
+"../bin/smartpca -p parfile >logfile".  For a description of these
+statistics, see the documentation file smartpca.info in this directory.
+Estimated running time of the smartpca program is
+2.5e-12 * nSNP * NSAMPLES^2 hours            if not removing outliers.
+2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m)    if m outlier removal iterations.
+Thus, under the default of up to 5 outlier removal iterations, running time is
+up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
+------------------------------------------------------------------------
+DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
+genotypename: input genotype file (in any format: see ../CONVERTF/README)
+snpname:      input snp file      (in any format: see ../CONVERTF/README)
+indivname:    input indiv file    (in any format: see ../CONVERTF/README)
+evecoutname:  output file of eigenvectors.  See numoutevec parameter below.
+evaloutname:  output file of all eigenvalues
+OPTIONAL PARAMETERS:
+numoutevec:     number of eigenvectors to output.  Default is 10.
+numoutlieriter: maximum number of outlier removal iterations.
+Default is 5.  To turn off outlier removal, set this parameter to 0.
+numoutlierevec: number of principal components along which to
+remove outliers during each outlier removal iteration.  Default is 10.
+outliersigmathresh: number of standard deviations which an individual must
+exceed, along one of the top (numoutlierevec) principal components, in
+order for that individual to be removed as an outlier.  Default is 6.0.
+outlieroutname: output logfile of outlier individuals removed. If not specified,
+smartpca will print this information to stdout, which is the default.
+usenorm: Whether to normalize each SNP by a quantity related to allele freq.
+Default is YES.  (When analyzing microsatellite data, should be set to NO.
+See Patterson et al. 2006.)
+altnormstyle: Affects very subtle details in normalization formula.
+Default is YES (normalization formulas of Patterson et al. 2006)
+To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
+missingmode: If set to YES, then instead of doing PCA on # reference alleles,
+do PCA on whether each data point is missing or nonmissing.  Default is NO.
+May be useful for detecting informative missingness (Clayton et al. 2005).
+nsnpldregress: If set to a positive integer, then LD correction is turned on,
+and input to PCA will be the residual of a regression involving that many
+previous SNPs, according to physical location.  See Patterson et al. 2006.
+Default is 0 (no LD correction).  If desiring LD correction, we recommend 2.
+maxdistldregress: If doing LD correction, this is the maximum genetic distance
+(in Morgans) for previous SNPs used in LD correction.  Default is no maximum.
+poplistname:   If wishing to infer eigenvectors using only individuals from a
+subset of populations, and then project individuals from all populations
+onto those eigenvectors, this input file contains a list of population names,
+one population name per line, which will be used to infer eigenvectors.
+It is assumed that the population of each individual is specified in the
+indiv file.  Default is to use individuals from all populations.
+phylipoutname: output file containing an fst matrix which can be used as input
+to programs in the PHYLIP package, such as the "fitch" program for
+constructing phylogenetic trees.
+noxdata:    if set to YES, all SNPs on X chr are excluded from the data set.
+The smartpca default for this parameter is YES, since different variances
+for males vs. females on X chr may confound PCA analysis.
+nomalexhet: if set to YES, any het genotypes on X chr for males are changed
+to missing data.  The smartpca default for this parameter is YES.
+badsnpname: specifies a list of SNPs which should be excluded from the data set.
+Same format as example.snp.  Cannot be used if input is in
+PACKEDPED or PACKEDANCESTRYMAP format.
+popsizelimit: If set to a positive integer, the result is that only the first
+popsizelimit individuals from each population will be included in the
+analysis. It is assumed that the population of each individual is specified
+in the indiv file.  Default is to use all individuals in the analysis.
+The next 5 optional parameters allow the user to output genotype, snp and
+indiv files which will be identical to the input files except that:
+Any individuals set to Ignore in the input indiv file will be
+removed from the data set (see ../CONVERTF/README)
+Any data excluded or set to missing based on noxdata, nomalexhet and
+badsnpname parameters (see above) will be removed from the data set.
+The user may decide to output these files in any format.
+outputformat:    ANCESTRYMAP,  EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
+genotypeoutname: output genotype file
+snpoutname:      output snp file
+indivoutname:    output indiv file
+outputgroup: see documentation in ../CONVERTF/README
+"""
+import sys,os,time,subprocess,string,glob
+from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
+verbose = False
+def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
+"""
+the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
+the subject class
+Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
+somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
+At least you have the data and the analysis in one single place. Highly reproducible little
+piece of research.
+"""
+debug=False
+f = file(eigpca,'r')
+R = []
+if debug:
+R.append('sessionInfo()')
+R.append("print('dir()=:')")
+R.append('dir()')
+R.append("print('pdfname=%s')" % pdfname)
+gvec = []
+pca1 = []
+pca2 = []
+groups = {}
+glist = [] # list for legend
+ngroup = 1 # increment for each new group encountered for pch vector
+for n,row in enumerate(f):
+if n > 1:
+rowlist = row.strip().split()
+group = rowlist[-1]
+v1 = rowlist[1]
+v2 = rowlist[2]
+try:
+v1 = float(v1)
+except:
+v1 = 0.0
+try:
+v2 = float(v2)
+except:
+v2 = 0.0
+if not groups.get(group,None):
+groups[group] = ngroup
+glist.append(group)
+ngroup += 1 # for next group
+gvec.append(groups[group]) # lookup group number
+pca1.append('%f' % v1)
+pca2.append('%f' % v2)
+# now have vectors of group,pca1 and pca2
+llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
+llist = ['"%s"' % x for x in llist] # need to quote for R
+R.append('llist=c(%s)' % ','.join(llist))
+plist = range(2,len(llist)+2) # pch - avoid black circles
+R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
+pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
+R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
+R.append("par(mai=c(1,1,1,0.5))")
+maint = title
+R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
+R.append("par(lab=c(10,10,10))")
+R.append('pca1 = c(%s)' % ','.join(pca1))
+R.append('pca2 = c(%s)' % ','.join(pca2))
+R.append('pgvec = c(%s)' % ','.join(pgvec))
+s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
+s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
+R.append(s)
+R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
+R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
+R.append('dev.off()')
+R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
+s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
+s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
+R.append(s)
+R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
+R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
+R.append('dev.off()')
+rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
+print >> sys.stdout, '\n'.join(R)
+print >> sys.stdout, rlog
+def getfSize(fpath,outpath):
+"""
+format a nice file size string
+"""
+size = ''
+fp = os.path.join(outpath,fpath)
+if os.path.isfile(fp):
+n = float(os.path.getsize(fp))
+if n > 2**20:
+size = ' (%1.1f MB)' % (n/2**20)
+elif n > 2**10:
+size = ' (%1.1f KB)' % (n/2**10)
+elif n > 0:
+size = ' (%d B)' % (int(n))
+return size
+def runEigen():
+""" run the smartpca prog - documentation follows
+smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
+fakeped_100.eigenlog -o fakeped_100.eigenout
+DOCUMENTATION OF smartpca.perl program:
+This program calls the smartpca program (see ../POPGEN/README).
+For this to work, the bin directory containing smartpca MUST be in your path.
+See ./example.perl for a toy example.
+../bin/smartpca.perl
+-i example.geno  : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
+-a example.snp   : snp file   (see ../CONVERTF/README)
+-b example.ind   : indiv file (see ../CONVERTF/README)
+-k k             : (Default is 10) number of principal components to output
+-o example.pca   : output file of principal components.  Individuals removed
+as outliers will have all values set to 0.0 in this file.
+-p example.plot  : prefix of output plot files of top 2 principal components.
+(labeling individuals according to labels in indiv file)
+-e example.eval  : output file of all eigenvalues
+-l example.log   : output logfile
+-m maxiter       : (Default is 5) maximum number of outlier removal iterations.
+To turn off outlier removal, set -m 0.
+-t topk          : (Default is 10) number of principal components along which
+to remove outliers during each outlier removal iteration.
+-s sigma         : (Default is 6.0) number of standard deviations which an
+individual must exceed, along one of topk top principal
+components, in order to be removed as an outlier.
+now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
+All files can be viewed however, by making links in the primary (HTML) history item like:
+<img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
+<a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
+<command interpreter="python">
+rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
+"$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
+</command>
+"""
+if len(sys.argv) < 9:
+print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
+print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
+sys.exit(1)
+else:
+print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
+skillme = ' %s' % string.punctuation
+trantab = string.maketrans(skillme,'_'*len(skillme))
+ofname = sys.argv[5]
+progname = os.path.basename(sys.argv[0])
+infile = sys.argv[1]
+infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
+title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
+outfile1 = sys.argv[3]
+newfilepath = sys.argv[4]
+try:
+os.mkdirs(newfilepath)
+except:
+pass
+op = os.path.split(outfile1)[0]
+try: # for test - needs this done
+os.makedirs(op)
+except:
+pass
+eigen_k = sys.argv[5]
+eigen_m = sys.argv[6]
+eigen_t = sys.argv[7]
+eigen_s = sys.argv[8]
+eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
+eigentitle = os.path.join(newfilepath,title)
+explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
+'Smartpca log (contents shown below)']
+rplotname = 'PCAPlot.pdf'
+eigenexts = [rplotname, "pca.xls", "eval.xls"]
+newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
+rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
+eigenouts = [x for x in newfiles]
+eigenlogf = '%s_log.txt' % title
+newfiles.append(eigenlogf) # so it will also appear in the links
+lfname = outfile1
+lf = file(lfname,'w')
+lf.write(galhtmlprefix % progname)
+try:
+os.makedirs(newfilepath)
+except:
+pass
+smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
+(smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
+eigen_k, eigen_m, eigen_t, eigen_s)
+env = os.environ
+p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
+retval = p.wait()
+# copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
+elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
+eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
+try:
+eigpcaRes = file(eeigen,'r').read()
+except:
+eigpcaRes = ''
+file(eigpca,'w').write(eigpcaRes)
+makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
+s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
+lf.write('<h4>%s</h4>\n' % s)
+lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
+lf.write('(click on the image below to see a much higher quality PDF version)')
+thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
+if os.path.exists(os.path.join(newfilepath,thumbnail)):
+lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
+lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
+% (newfiles[0],thumbnail,explanations[0]))
+allfiles = os.listdir(newfilepath)
+allfiles.sort()
+sizes = [getfSize(x,newfilepath) for x in allfiles]
+lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
+lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
+lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
+lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
+s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
+lf.write(s)
+lf.write(galhtmlpostfix) # end galhtmlprefix div
+lf.close()
+if __name__ == "__main__":
+runEigen()

Mercurial > repos > xuebing > sharplabtool

comparison tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3