sharplabtool: tools/rgenetics/rgEigPCA.py annotate

annotate tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3

Uploaded

author	xuebing
date	Fri, 09 Mar 2012 19:37:19 -0500
parents
children

rev	line source
0 9071e359b9a3 Uploaded xuebing parents: diff changeset	1 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	2 run smartpca
9071e359b9a3 Uploaded xuebing parents: diff changeset	3
9071e359b9a3 Uploaded xuebing parents: diff changeset	4 This uses galaxy code developed by Dan to deal with
9071e359b9a3 Uploaded xuebing parents: diff changeset	5 arbitrary output files using an html dataset with it's own
9071e359b9a3 Uploaded xuebing parents: diff changeset	6 subdirectory containing the arbitrary files
9071e359b9a3 Uploaded xuebing parents: diff changeset	7 We create that html file and add all the links we need
9071e359b9a3 Uploaded xuebing parents: diff changeset	8
9071e359b9a3 Uploaded xuebing parents: diff changeset	9 Note that we execute the smartpca.perl program in the output subdirectory
9071e359b9a3 Uploaded xuebing parents: diff changeset	10 to avoid having to clear out the job directory after running
9071e359b9a3 Uploaded xuebing parents: diff changeset	11
9071e359b9a3 Uploaded xuebing parents: diff changeset	12 Code to convert linkage format ped files into eigenstratgeno format is left here
9071e359b9a3 Uploaded xuebing parents: diff changeset	13 in case we decide to autoconvert
9071e359b9a3 Uploaded xuebing parents: diff changeset	14
9071e359b9a3 Uploaded xuebing parents: diff changeset	15 Added a plot in R with better labels than the default eigensoft plot december 26 2007
9071e359b9a3 Uploaded xuebing parents: diff changeset	16
9071e359b9a3 Uploaded xuebing parents: diff changeset	17 DOCUMENTATION OF smartpca program:
9071e359b9a3 Uploaded xuebing parents: diff changeset	18
9071e359b9a3 Uploaded xuebing parents: diff changeset	19 smartpca runs Principal Components Analysis on input genotype data and
9071e359b9a3 Uploaded xuebing parents: diff changeset	20 outputs principal components (eigenvectors) and eigenvalues.
9071e359b9a3 Uploaded xuebing parents: diff changeset	21 The method assumes that samples are unrelated. (However, a small number
9071e359b9a3 Uploaded xuebing parents: diff changeset	22 of cryptically related individuals is usually not a problem in practice
9071e359b9a3 Uploaded xuebing parents: diff changeset	23 as they will typically be discarded as outliers.)
9071e359b9a3 Uploaded xuebing parents: diff changeset	24
9071e359b9a3 Uploaded xuebing parents: diff changeset	25 5 different input formats are supported. See ../CONVERTF/README
9071e359b9a3 Uploaded xuebing parents: diff changeset	26 for documentation on using the convertf program to convert between formats.
9071e359b9a3 Uploaded xuebing parents: diff changeset	27
9071e359b9a3 Uploaded xuebing parents: diff changeset	28 The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate
9071e359b9a3 Uploaded xuebing parents: diff changeset	29 how parfile works via a toy example (see example.perl in this directory).
9071e359b9a3 Uploaded xuebing parents: diff changeset	30 This example takes input in EIGENSTRAT format. The syntax of how to take input
9071e359b9a3 Uploaded xuebing parents: diff changeset	31 in other formats is analogous to the convertf program, see ../CONVERTF/README.
9071e359b9a3 Uploaded xuebing parents: diff changeset	32
9071e359b9a3 Uploaded xuebing parents: diff changeset	33 The smartpca program prints various statistics to standard output.
9071e359b9a3 Uploaded xuebing parents: diff changeset	34 To redirect this information to a file, change the above syntax to
9071e359b9a3 Uploaded xuebing parents: diff changeset	35 "../bin/smartpca -p parfile >logfile". For a description of these
9071e359b9a3 Uploaded xuebing parents: diff changeset	36 statistics, see the documentation file smartpca.info in this directory.
9071e359b9a3 Uploaded xuebing parents: diff changeset	37
9071e359b9a3 Uploaded xuebing parents: diff changeset	38 Estimated running time of the smartpca program is
9071e359b9a3 Uploaded xuebing parents: diff changeset	39 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
9071e359b9a3 Uploaded xuebing parents: diff changeset	40 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.
9071e359b9a3 Uploaded xuebing parents: diff changeset	41 Thus, under the default of up to 5 outlier removal iterations, running time is
9071e359b9a3 Uploaded xuebing parents: diff changeset	42 up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
9071e359b9a3 Uploaded xuebing parents: diff changeset	43
9071e359b9a3 Uploaded xuebing parents: diff changeset	44 ------------------------------------------------------------------------
9071e359b9a3 Uploaded xuebing parents: diff changeset	45
9071e359b9a3 Uploaded xuebing parents: diff changeset	46 DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
9071e359b9a3 Uploaded xuebing parents: diff changeset	47
9071e359b9a3 Uploaded xuebing parents: diff changeset	48 genotypename: input genotype file (in any format: see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	49 snpname: input snp file (in any format: see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	50 indivname: input indiv file (in any format: see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	51 evecoutname: output file of eigenvectors. See numoutevec parameter below.
9071e359b9a3 Uploaded xuebing parents: diff changeset	52 evaloutname: output file of all eigenvalues
9071e359b9a3 Uploaded xuebing parents: diff changeset	53
9071e359b9a3 Uploaded xuebing parents: diff changeset	54 OPTIONAL PARAMETERS:
9071e359b9a3 Uploaded xuebing parents: diff changeset	55
9071e359b9a3 Uploaded xuebing parents: diff changeset	56 numoutevec: number of eigenvectors to output. Default is 10.
9071e359b9a3 Uploaded xuebing parents: diff changeset	57 numoutlieriter: maximum number of outlier removal iterations.
9071e359b9a3 Uploaded xuebing parents: diff changeset	58 Default is 5. To turn off outlier removal, set this parameter to 0.
9071e359b9a3 Uploaded xuebing parents: diff changeset	59 numoutlierevec: number of principal components along which to
9071e359b9a3 Uploaded xuebing parents: diff changeset	60 remove outliers during each outlier removal iteration. Default is 10.
9071e359b9a3 Uploaded xuebing parents: diff changeset	61 outliersigmathresh: number of standard deviations which an individual must
9071e359b9a3 Uploaded xuebing parents: diff changeset	62 exceed, along one of the top (numoutlierevec) principal components, in
9071e359b9a3 Uploaded xuebing parents: diff changeset	63 order for that individual to be removed as an outlier. Default is 6.0.
9071e359b9a3 Uploaded xuebing parents: diff changeset	64 outlieroutname: output logfile of outlier individuals removed. If not specified,
9071e359b9a3 Uploaded xuebing parents: diff changeset	65 smartpca will print this information to stdout, which is the default.
9071e359b9a3 Uploaded xuebing parents: diff changeset	66 usenorm: Whether to normalize each SNP by a quantity related to allele freq.
9071e359b9a3 Uploaded xuebing parents: diff changeset	67 Default is YES. (When analyzing microsatellite data, should be set to NO.
9071e359b9a3 Uploaded xuebing parents: diff changeset	68 See Patterson et al. 2006.)
9071e359b9a3 Uploaded xuebing parents: diff changeset	69 altnormstyle: Affects very subtle details in normalization formula.
9071e359b9a3 Uploaded xuebing parents: diff changeset	70 Default is YES (normalization formulas of Patterson et al. 2006)
9071e359b9a3 Uploaded xuebing parents: diff changeset	71 To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
9071e359b9a3 Uploaded xuebing parents: diff changeset	72 missingmode: If set to YES, then instead of doing PCA on # reference alleles,
9071e359b9a3 Uploaded xuebing parents: diff changeset	73 do PCA on whether each data point is missing or nonmissing. Default is NO.
9071e359b9a3 Uploaded xuebing parents: diff changeset	74 May be useful for detecting informative missingness (Clayton et al. 2005).
9071e359b9a3 Uploaded xuebing parents: diff changeset	75 nsnpldregress: If set to a positive integer, then LD correction is turned on,
9071e359b9a3 Uploaded xuebing parents: diff changeset	76 and input to PCA will be the residual of a regression involving that many
9071e359b9a3 Uploaded xuebing parents: diff changeset	77 previous SNPs, according to physical location. See Patterson et al. 2006.
9071e359b9a3 Uploaded xuebing parents: diff changeset	78 Default is 0 (no LD correction). If desiring LD correction, we recommend 2.
9071e359b9a3 Uploaded xuebing parents: diff changeset	79 maxdistldregress: If doing LD correction, this is the maximum genetic distance
9071e359b9a3 Uploaded xuebing parents: diff changeset	80 (in Morgans) for previous SNPs used in LD correction. Default is no maximum.
9071e359b9a3 Uploaded xuebing parents: diff changeset	81 poplistname: If wishing to infer eigenvectors using only individuals from a
9071e359b9a3 Uploaded xuebing parents: diff changeset	82 subset of populations, and then project individuals from all populations
9071e359b9a3 Uploaded xuebing parents: diff changeset	83 onto those eigenvectors, this input file contains a list of population names,
9071e359b9a3 Uploaded xuebing parents: diff changeset	84 one population name per line, which will be used to infer eigenvectors.
9071e359b9a3 Uploaded xuebing parents: diff changeset	85 It is assumed that the population of each individual is specified in the
9071e359b9a3 Uploaded xuebing parents: diff changeset	86 indiv file. Default is to use individuals from all populations.
9071e359b9a3 Uploaded xuebing parents: diff changeset	87 phylipoutname: output file containing an fst matrix which can be used as input
9071e359b9a3 Uploaded xuebing parents: diff changeset	88 to programs in the PHYLIP package, such as the "fitch" program for
9071e359b9a3 Uploaded xuebing parents: diff changeset	89 constructing phylogenetic trees.
9071e359b9a3 Uploaded xuebing parents: diff changeset	90 noxdata: if set to YES, all SNPs on X chr are excluded from the data set.
9071e359b9a3 Uploaded xuebing parents: diff changeset	91 The smartpca default for this parameter is YES, since different variances
9071e359b9a3 Uploaded xuebing parents: diff changeset	92 for males vs. females on X chr may confound PCA analysis.
9071e359b9a3 Uploaded xuebing parents: diff changeset	93 nomalexhet: if set to YES, any het genotypes on X chr for males are changed
9071e359b9a3 Uploaded xuebing parents: diff changeset	94 to missing data. The smartpca default for this parameter is YES.
9071e359b9a3 Uploaded xuebing parents: diff changeset	95 badsnpname: specifies a list of SNPs which should be excluded from the data set.
9071e359b9a3 Uploaded xuebing parents: diff changeset	96 Same format as example.snp. Cannot be used if input is in
9071e359b9a3 Uploaded xuebing parents: diff changeset	97 PACKEDPED or PACKEDANCESTRYMAP format.
9071e359b9a3 Uploaded xuebing parents: diff changeset	98 popsizelimit: If set to a positive integer, the result is that only the first
9071e359b9a3 Uploaded xuebing parents: diff changeset	99 popsizelimit individuals from each population will be included in the
9071e359b9a3 Uploaded xuebing parents: diff changeset	100 analysis. It is assumed that the population of each individual is specified
9071e359b9a3 Uploaded xuebing parents: diff changeset	101 in the indiv file. Default is to use all individuals in the analysis.
9071e359b9a3 Uploaded xuebing parents: diff changeset	102
9071e359b9a3 Uploaded xuebing parents: diff changeset	103 The next 5 optional parameters allow the user to output genotype, snp and
9071e359b9a3 Uploaded xuebing parents: diff changeset	104 indiv files which will be identical to the input files except that:
9071e359b9a3 Uploaded xuebing parents: diff changeset	105 Any individuals set to Ignore in the input indiv file will be
9071e359b9a3 Uploaded xuebing parents: diff changeset	106 removed from the data set (see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	107 Any data excluded or set to missing based on noxdata, nomalexhet and
9071e359b9a3 Uploaded xuebing parents: diff changeset	108 badsnpname parameters (see above) will be removed from the data set.
9071e359b9a3 Uploaded xuebing parents: diff changeset	109 The user may decide to output these files in any format.
9071e359b9a3 Uploaded xuebing parents: diff changeset	110 outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
9071e359b9a3 Uploaded xuebing parents: diff changeset	111 genotypeoutname: output genotype file
9071e359b9a3 Uploaded xuebing parents: diff changeset	112 snpoutname: output snp file
9071e359b9a3 Uploaded xuebing parents: diff changeset	113 indivoutname: output indiv file
9071e359b9a3 Uploaded xuebing parents: diff changeset	114 outputgroup: see documentation in ../CONVERTF/README
9071e359b9a3 Uploaded xuebing parents: diff changeset	115 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	116 import sys,os,time,subprocess,string,glob
9071e359b9a3 Uploaded xuebing parents: diff changeset	117 from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
9071e359b9a3 Uploaded xuebing parents: diff changeset	118 verbose = False
9071e359b9a3 Uploaded xuebing parents: diff changeset	119
9071e359b9a3 Uploaded xuebing parents: diff changeset	120 def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
9071e359b9a3 Uploaded xuebing parents: diff changeset	121 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	122 the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
9071e359b9a3 Uploaded xuebing parents: diff changeset	123 the subject class
9071e359b9a3 Uploaded xuebing parents: diff changeset	124 Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
9071e359b9a3 Uploaded xuebing parents: diff changeset	125 somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
9071e359b9a3 Uploaded xuebing parents: diff changeset	126 At least you have the data and the analysis in one single place. Highly reproducible little
9071e359b9a3 Uploaded xuebing parents: diff changeset	127 piece of research.
9071e359b9a3 Uploaded xuebing parents: diff changeset	128 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	129 debug=False
9071e359b9a3 Uploaded xuebing parents: diff changeset	130 f = file(eigpca,'r')
9071e359b9a3 Uploaded xuebing parents: diff changeset	131 R = []
9071e359b9a3 Uploaded xuebing parents: diff changeset	132 if debug:
9071e359b9a3 Uploaded xuebing parents: diff changeset	133 R.append('sessionInfo()')
9071e359b9a3 Uploaded xuebing parents: diff changeset	134 R.append("print('dir()=:')")
9071e359b9a3 Uploaded xuebing parents: diff changeset	135 R.append('dir()')
9071e359b9a3 Uploaded xuebing parents: diff changeset	136 R.append("print('pdfname=%s')" % pdfname)
9071e359b9a3 Uploaded xuebing parents: diff changeset	137 gvec = []
9071e359b9a3 Uploaded xuebing parents: diff changeset	138 pca1 = []
9071e359b9a3 Uploaded xuebing parents: diff changeset	139 pca2 = []
9071e359b9a3 Uploaded xuebing parents: diff changeset	140 groups = {}
9071e359b9a3 Uploaded xuebing parents: diff changeset	141 glist = [] # list for legend
9071e359b9a3 Uploaded xuebing parents: diff changeset	142 ngroup = 1 # increment for each new group encountered for pch vector
9071e359b9a3 Uploaded xuebing parents: diff changeset	143 for n,row in enumerate(f):
9071e359b9a3 Uploaded xuebing parents: diff changeset	144 if n > 1:
9071e359b9a3 Uploaded xuebing parents: diff changeset	145 rowlist = row.strip().split()
9071e359b9a3 Uploaded xuebing parents: diff changeset	146 group = rowlist[-1]
9071e359b9a3 Uploaded xuebing parents: diff changeset	147 v1 = rowlist[1]
9071e359b9a3 Uploaded xuebing parents: diff changeset	148 v2 = rowlist[2]
9071e359b9a3 Uploaded xuebing parents: diff changeset	149 try:
9071e359b9a3 Uploaded xuebing parents: diff changeset	150 v1 = float(v1)
9071e359b9a3 Uploaded xuebing parents: diff changeset	151 except:
9071e359b9a3 Uploaded xuebing parents: diff changeset	152 v1 = 0.0
9071e359b9a3 Uploaded xuebing parents: diff changeset	153 try:
9071e359b9a3 Uploaded xuebing parents: diff changeset	154 v2 = float(v2)
9071e359b9a3 Uploaded xuebing parents: diff changeset	155 except:
9071e359b9a3 Uploaded xuebing parents: diff changeset	156 v2 = 0.0
9071e359b9a3 Uploaded xuebing parents: diff changeset	157 if not groups.get(group,None):
9071e359b9a3 Uploaded xuebing parents: diff changeset	158 groups[group] = ngroup
9071e359b9a3 Uploaded xuebing parents: diff changeset	159 glist.append(group)
9071e359b9a3 Uploaded xuebing parents: diff changeset	160 ngroup += 1 # for next group
9071e359b9a3 Uploaded xuebing parents: diff changeset	161 gvec.append(groups[group]) # lookup group number
9071e359b9a3 Uploaded xuebing parents: diff changeset	162 pca1.append('%f' % v1)
9071e359b9a3 Uploaded xuebing parents: diff changeset	163 pca2.append('%f' % v2)
9071e359b9a3 Uploaded xuebing parents: diff changeset	164 # now have vectors of group,pca1 and pca2
9071e359b9a3 Uploaded xuebing parents: diff changeset	165 llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
9071e359b9a3 Uploaded xuebing parents: diff changeset	166 llist = ['"%s"' % x for x in llist] # need to quote for R
9071e359b9a3 Uploaded xuebing parents: diff changeset	167 R.append('llist=c(%s)' % ','.join(llist))
9071e359b9a3 Uploaded xuebing parents: diff changeset	168
9071e359b9a3 Uploaded xuebing parents: diff changeset	169 plist = range(2,len(llist)+2) # pch - avoid black circles
9071e359b9a3 Uploaded xuebing parents: diff changeset	170 R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
9071e359b9a3 Uploaded xuebing parents: diff changeset	171 pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
9071e359b9a3 Uploaded xuebing parents: diff changeset	172 R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
9071e359b9a3 Uploaded xuebing parents: diff changeset	173 R.append("par(mai=c(1,1,1,0.5))")
9071e359b9a3 Uploaded xuebing parents: diff changeset	174 maint = title
9071e359b9a3 Uploaded xuebing parents: diff changeset	175 R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
9071e359b9a3 Uploaded xuebing parents: diff changeset	176 R.append("par(lab=c(10,10,10))")
9071e359b9a3 Uploaded xuebing parents: diff changeset	177 R.append('pca1 = c(%s)' % ','.join(pca1))
9071e359b9a3 Uploaded xuebing parents: diff changeset	178 R.append('pca2 = c(%s)' % ','.join(pca2))
9071e359b9a3 Uploaded xuebing parents: diff changeset	179 R.append('pgvec = c(%s)' % ','.join(pgvec))
9071e359b9a3 Uploaded xuebing parents: diff changeset	180 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
9071e359b9a3 Uploaded xuebing parents: diff changeset	181 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
9071e359b9a3 Uploaded xuebing parents: diff changeset	182 R.append(s)
9071e359b9a3 Uploaded xuebing parents: diff changeset	183 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
9071e359b9a3 Uploaded xuebing parents: diff changeset	184 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
9071e359b9a3 Uploaded xuebing parents: diff changeset	185 R.append('dev.off()')
9071e359b9a3 Uploaded xuebing parents: diff changeset	186 R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
9071e359b9a3 Uploaded xuebing parents: diff changeset	187 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
9071e359b9a3 Uploaded xuebing parents: diff changeset	188 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
9071e359b9a3 Uploaded xuebing parents: diff changeset	189 R.append(s)
9071e359b9a3 Uploaded xuebing parents: diff changeset	190 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
9071e359b9a3 Uploaded xuebing parents: diff changeset	191 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
9071e359b9a3 Uploaded xuebing parents: diff changeset	192 R.append('dev.off()')
9071e359b9a3 Uploaded xuebing parents: diff changeset	193 rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
9071e359b9a3 Uploaded xuebing parents: diff changeset	194 print >> sys.stdout, '\n'.join(R)
9071e359b9a3 Uploaded xuebing parents: diff changeset	195 print >> sys.stdout, rlog
9071e359b9a3 Uploaded xuebing parents: diff changeset	196
9071e359b9a3 Uploaded xuebing parents: diff changeset	197
9071e359b9a3 Uploaded xuebing parents: diff changeset	198 def getfSize(fpath,outpath):
9071e359b9a3 Uploaded xuebing parents: diff changeset	199 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	200 format a nice file size string
9071e359b9a3 Uploaded xuebing parents: diff changeset	201 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	202 size = ''
9071e359b9a3 Uploaded xuebing parents: diff changeset	203 fp = os.path.join(outpath,fpath)
9071e359b9a3 Uploaded xuebing parents: diff changeset	204 if os.path.isfile(fp):
9071e359b9a3 Uploaded xuebing parents: diff changeset	205 n = float(os.path.getsize(fp))
9071e359b9a3 Uploaded xuebing parents: diff changeset	206 if n > 2**20:
9071e359b9a3 Uploaded xuebing parents: diff changeset	207 size = ' (%1.1f MB)' % (n/2**20)
9071e359b9a3 Uploaded xuebing parents: diff changeset	208 elif n > 2**10:
9071e359b9a3 Uploaded xuebing parents: diff changeset	209 size = ' (%1.1f KB)' % (n/2**10)
9071e359b9a3 Uploaded xuebing parents: diff changeset	210 elif n > 0:
9071e359b9a3 Uploaded xuebing parents: diff changeset	211 size = ' (%d B)' % (int(n))
9071e359b9a3 Uploaded xuebing parents: diff changeset	212 return size
9071e359b9a3 Uploaded xuebing parents: diff changeset	213
9071e359b9a3 Uploaded xuebing parents: diff changeset	214
9071e359b9a3 Uploaded xuebing parents: diff changeset	215 def runEigen():
9071e359b9a3 Uploaded xuebing parents: diff changeset	216 """ run the smartpca prog - documentation follows
9071e359b9a3 Uploaded xuebing parents: diff changeset	217
9071e359b9a3 Uploaded xuebing parents: diff changeset	218 smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
9071e359b9a3 Uploaded xuebing parents: diff changeset	219 fakeped_100.eigenlog -o fakeped_100.eigenout
9071e359b9a3 Uploaded xuebing parents: diff changeset	220
9071e359b9a3 Uploaded xuebing parents: diff changeset	221 DOCUMENTATION OF smartpca.perl program:
9071e359b9a3 Uploaded xuebing parents: diff changeset	222
9071e359b9a3 Uploaded xuebing parents: diff changeset	223 This program calls the smartpca program (see ../POPGEN/README).
9071e359b9a3 Uploaded xuebing parents: diff changeset	224 For this to work, the bin directory containing smartpca MUST be in your path.
9071e359b9a3 Uploaded xuebing parents: diff changeset	225 See ./example.perl for a toy example.
9071e359b9a3 Uploaded xuebing parents: diff changeset	226
9071e359b9a3 Uploaded xuebing parents: diff changeset	227 ../bin/smartpca.perl
9071e359b9a3 Uploaded xuebing parents: diff changeset	228 -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	229 -a example.snp : snp file (see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	230 -b example.ind : indiv file (see ../CONVERTF/README)
9071e359b9a3 Uploaded xuebing parents: diff changeset	231 -k k : (Default is 10) number of principal components to output
9071e359b9a3 Uploaded xuebing parents: diff changeset	232 -o example.pca : output file of principal components. Individuals removed
9071e359b9a3 Uploaded xuebing parents: diff changeset	233 as outliers will have all values set to 0.0 in this file.
9071e359b9a3 Uploaded xuebing parents: diff changeset	234 -p example.plot : prefix of output plot files of top 2 principal components.
9071e359b9a3 Uploaded xuebing parents: diff changeset	235 (labeling individuals according to labels in indiv file)
9071e359b9a3 Uploaded xuebing parents: diff changeset	236 -e example.eval : output file of all eigenvalues
9071e359b9a3 Uploaded xuebing parents: diff changeset	237 -l example.log : output logfile
9071e359b9a3 Uploaded xuebing parents: diff changeset	238 -m maxiter : (Default is 5) maximum number of outlier removal iterations.
9071e359b9a3 Uploaded xuebing parents: diff changeset	239 To turn off outlier removal, set -m 0.
9071e359b9a3 Uploaded xuebing parents: diff changeset	240 -t topk : (Default is 10) number of principal components along which
9071e359b9a3 Uploaded xuebing parents: diff changeset	241 to remove outliers during each outlier removal iteration.
9071e359b9a3 Uploaded xuebing parents: diff changeset	242 -s sigma : (Default is 6.0) number of standard deviations which an
9071e359b9a3 Uploaded xuebing parents: diff changeset	243 individual must exceed, along one of topk top principal
9071e359b9a3 Uploaded xuebing parents: diff changeset	244 components, in order to be removed as an outlier.
9071e359b9a3 Uploaded xuebing parents: diff changeset	245
9071e359b9a3 Uploaded xuebing parents: diff changeset	246 now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
9071e359b9a3 Uploaded xuebing parents: diff changeset	247
9071e359b9a3 Uploaded xuebing parents: diff changeset	248 All files can be viewed however, by making links in the primary (HTML) history item like:
9071e359b9a3 Uploaded xuebing parents: diff changeset	249 <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
9071e359b9a3 Uploaded xuebing parents: diff changeset	250 <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
9071e359b9a3 Uploaded xuebing parents: diff changeset	251
9071e359b9a3 Uploaded xuebing parents: diff changeset	252 <command interpreter="python">
9071e359b9a3 Uploaded xuebing parents: diff changeset	253 rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
9071e359b9a3 Uploaded xuebing parents: diff changeset	254 "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
9071e359b9a3 Uploaded xuebing parents: diff changeset	255 </command>
9071e359b9a3 Uploaded xuebing parents: diff changeset	256
9071e359b9a3 Uploaded xuebing parents: diff changeset	257 """
9071e359b9a3 Uploaded xuebing parents: diff changeset	258 if len(sys.argv) < 9:
9071e359b9a3 Uploaded xuebing parents: diff changeset	259 print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
9071e359b9a3 Uploaded xuebing parents: diff changeset	260 print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
9071e359b9a3 Uploaded xuebing parents: diff changeset	261 sys.exit(1)
9071e359b9a3 Uploaded xuebing parents: diff changeset	262 else:
9071e359b9a3 Uploaded xuebing parents: diff changeset	263 print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
9071e359b9a3 Uploaded xuebing parents: diff changeset	264 skillme = ' %s' % string.punctuation
9071e359b9a3 Uploaded xuebing parents: diff changeset	265 trantab = string.maketrans(skillme,'_'*len(skillme))
9071e359b9a3 Uploaded xuebing parents: diff changeset	266 ofname = sys.argv[5]
9071e359b9a3 Uploaded xuebing parents: diff changeset	267 progname = os.path.basename(sys.argv[0])
9071e359b9a3 Uploaded xuebing parents: diff changeset	268 infile = sys.argv[1]
9071e359b9a3 Uploaded xuebing parents: diff changeset	269 infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
9071e359b9a3 Uploaded xuebing parents: diff changeset	270 title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
9071e359b9a3 Uploaded xuebing parents: diff changeset	271 outfile1 = sys.argv[3]
9071e359b9a3 Uploaded xuebing parents: diff changeset	272 newfilepath = sys.argv[4]
9071e359b9a3 Uploaded xuebing parents: diff changeset	273 try:
9071e359b9a3 Uploaded xuebing parents: diff changeset	274 os.mkdirs(newfilepath)
9071e359b9a3 Uploaded xuebing parents: diff changeset	275 except:
9071e359b9a3 Uploaded xuebing parents: diff changeset	276 pass
9071e359b9a3 Uploaded xuebing parents: diff changeset	277 op = os.path.split(outfile1)[0]
9071e359b9a3 Uploaded xuebing parents: diff changeset	278 try: # for test - needs this done
9071e359b9a3 Uploaded xuebing parents: diff changeset	279 os.makedirs(op)
9071e359b9a3 Uploaded xuebing parents: diff changeset	280 except:
9071e359b9a3 Uploaded xuebing parents: diff changeset	281 pass
9071e359b9a3 Uploaded xuebing parents: diff changeset	282 eigen_k = sys.argv[5]
9071e359b9a3 Uploaded xuebing parents: diff changeset	283 eigen_m = sys.argv[6]
9071e359b9a3 Uploaded xuebing parents: diff changeset	284 eigen_t = sys.argv[7]
9071e359b9a3 Uploaded xuebing parents: diff changeset	285 eigen_s = sys.argv[8]
9071e359b9a3 Uploaded xuebing parents: diff changeset	286 eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
9071e359b9a3 Uploaded xuebing parents: diff changeset	287 eigentitle = os.path.join(newfilepath,title)
9071e359b9a3 Uploaded xuebing parents: diff changeset	288 explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
9071e359b9a3 Uploaded xuebing parents: diff changeset	289 'Smartpca log (contents shown below)']
9071e359b9a3 Uploaded xuebing parents: diff changeset	290 rplotname = 'PCAPlot.pdf'
9071e359b9a3 Uploaded xuebing parents: diff changeset	291 eigenexts = [rplotname, "pca.xls", "eval.xls"]
9071e359b9a3 Uploaded xuebing parents: diff changeset	292 newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
9071e359b9a3 Uploaded xuebing parents: diff changeset	293 rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
9071e359b9a3 Uploaded xuebing parents: diff changeset	294 eigenouts = [x for x in newfiles]
9071e359b9a3 Uploaded xuebing parents: diff changeset	295 eigenlogf = '%s_log.txt' % title
9071e359b9a3 Uploaded xuebing parents: diff changeset	296 newfiles.append(eigenlogf) # so it will also appear in the links
9071e359b9a3 Uploaded xuebing parents: diff changeset	297 lfname = outfile1
9071e359b9a3 Uploaded xuebing parents: diff changeset	298 lf = file(lfname,'w')
9071e359b9a3 Uploaded xuebing parents: diff changeset	299 lf.write(galhtmlprefix % progname)
9071e359b9a3 Uploaded xuebing parents: diff changeset	300 try:
9071e359b9a3 Uploaded xuebing parents: diff changeset	301 os.makedirs(newfilepath)
9071e359b9a3 Uploaded xuebing parents: diff changeset	302 except:
9071e359b9a3 Uploaded xuebing parents: diff changeset	303 pass
9071e359b9a3 Uploaded xuebing parents: diff changeset	304 smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
9071e359b9a3 Uploaded xuebing parents: diff changeset	305 (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
9071e359b9a3 Uploaded xuebing parents: diff changeset	306 eigen_k, eigen_m, eigen_t, eigen_s)
9071e359b9a3 Uploaded xuebing parents: diff changeset	307 env = os.environ
9071e359b9a3 Uploaded xuebing parents: diff changeset	308 p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
9071e359b9a3 Uploaded xuebing parents: diff changeset	309 retval = p.wait()
9071e359b9a3 Uploaded xuebing parents: diff changeset	310 # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
9071e359b9a3 Uploaded xuebing parents: diff changeset	311 elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
9071e359b9a3 Uploaded xuebing parents: diff changeset	312 eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
9071e359b9a3 Uploaded xuebing parents: diff changeset	313 try:
9071e359b9a3 Uploaded xuebing parents: diff changeset	314 eigpcaRes = file(eeigen,'r').read()
9071e359b9a3 Uploaded xuebing parents: diff changeset	315 except:
9071e359b9a3 Uploaded xuebing parents: diff changeset	316 eigpcaRes = ''
9071e359b9a3 Uploaded xuebing parents: diff changeset	317 file(eigpca,'w').write(eigpcaRes)
9071e359b9a3 Uploaded xuebing parents: diff changeset	318 makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
9071e359b9a3 Uploaded xuebing parents: diff changeset	319 s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
9071e359b9a3 Uploaded xuebing parents: diff changeset	320 lf.write('<h4>%s</h4>\n' % s)
9071e359b9a3 Uploaded xuebing parents: diff changeset	321 lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
9071e359b9a3 Uploaded xuebing parents: diff changeset	322 lf.write('(click on the image below to see a much higher quality PDF version)')
9071e359b9a3 Uploaded xuebing parents: diff changeset	323 thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
9071e359b9a3 Uploaded xuebing parents: diff changeset	324 if os.path.exists(os.path.join(newfilepath,thumbnail)):
9071e359b9a3 Uploaded xuebing parents: diff changeset	325 lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
9071e359b9a3 Uploaded xuebing parents: diff changeset	326 lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
9071e359b9a3 Uploaded xuebing parents: diff changeset	327 % (newfiles[0],thumbnail,explanations[0]))
9071e359b9a3 Uploaded xuebing parents: diff changeset	328 allfiles = os.listdir(newfilepath)
9071e359b9a3 Uploaded xuebing parents: diff changeset	329 allfiles.sort()
9071e359b9a3 Uploaded xuebing parents: diff changeset	330 sizes = [getfSize(x,newfilepath) for x in allfiles]
9071e359b9a3 Uploaded xuebing parents: diff changeset	331 lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
9071e359b9a3 Uploaded xuebing parents: diff changeset	332 lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
9071e359b9a3 Uploaded xuebing parents: diff changeset	333 lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
9071e359b9a3 Uploaded xuebing parents: diff changeset	334 lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
9071e359b9a3 Uploaded xuebing parents: diff changeset	335 s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
9071e359b9a3 Uploaded xuebing parents: diff changeset	336 lf.write(s)
9071e359b9a3 Uploaded xuebing parents: diff changeset	337 lf.write(galhtmlpostfix) # end galhtmlprefix div
9071e359b9a3 Uploaded xuebing parents: diff changeset	338 lf.close()
9071e359b9a3 Uploaded xuebing parents: diff changeset	339
9071e359b9a3 Uploaded xuebing parents: diff changeset	340
9071e359b9a3 Uploaded xuebing parents: diff changeset	341 if __name__ == "__main__":
9071e359b9a3 Uploaded xuebing parents: diff changeset	342 runEigen()

Mercurial > repos > xuebing > sharplabtool

annotate tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3