annotate tools/rgenetics/rgEigPCA.py @ 0:9071e359b9a3

Uploaded
author xuebing
date Fri, 09 Mar 2012 19:37:19 -0500
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
1 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
2 run smartpca
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
3
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
4 This uses galaxy code developed by Dan to deal with
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
5 arbitrary output files using an html dataset with it's own
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
6 subdirectory containing the arbitrary files
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
7 We create that html file and add all the links we need
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
8
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
9 Note that we execute the smartpca.perl program in the output subdirectory
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
10 to avoid having to clear out the job directory after running
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
11
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
12 Code to convert linkage format ped files into eigenstratgeno format is left here
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
13 in case we decide to autoconvert
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
14
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
15 Added a plot in R with better labels than the default eigensoft plot december 26 2007
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
16
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
17 DOCUMENTATION OF smartpca program:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
18
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
19 smartpca runs Principal Components Analysis on input genotype data and
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
20 outputs principal components (eigenvectors) and eigenvalues.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
21 The method assumes that samples are unrelated. (However, a small number
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
22 of cryptically related individuals is usually not a problem in practice
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
23 as they will typically be discarded as outliers.)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
24
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
25 5 different input formats are supported. See ../CONVERTF/README
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
26 for documentation on using the convertf program to convert between formats.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
27
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
28 The syntax of smartpca is "../bin/smartpca -p parfile". We illustrate
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
29 how parfile works via a toy example (see example.perl in this directory).
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
30 This example takes input in EIGENSTRAT format. The syntax of how to take input
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
31 in other formats is analogous to the convertf program, see ../CONVERTF/README.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
32
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
33 The smartpca program prints various statistics to standard output.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
34 To redirect this information to a file, change the above syntax to
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
35 "../bin/smartpca -p parfile >logfile". For a description of these
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
36 statistics, see the documentation file smartpca.info in this directory.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
37
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
38 Estimated running time of the smartpca program is
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
39 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
40 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
41 Thus, under the default of up to 5 outlier removal iterations, running time is
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
42 up to 1.5e-11 * nSNP * NSAMPLES^2 hours.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
43
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
44 ------------------------------------------------------------------------
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
45
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
46 DESCRIPTION OF EACH PARAMETER in parfile for smartpca:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
47
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
48 genotypename: input genotype file (in any format: see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
49 snpname: input snp file (in any format: see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
50 indivname: input indiv file (in any format: see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
51 evecoutname: output file of eigenvectors. See numoutevec parameter below.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
52 evaloutname: output file of all eigenvalues
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
53
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
54 OPTIONAL PARAMETERS:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
55
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
56 numoutevec: number of eigenvectors to output. Default is 10.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
57 numoutlieriter: maximum number of outlier removal iterations.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
58 Default is 5. To turn off outlier removal, set this parameter to 0.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
59 numoutlierevec: number of principal components along which to
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
60 remove outliers during each outlier removal iteration. Default is 10.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
61 outliersigmathresh: number of standard deviations which an individual must
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
62 exceed, along one of the top (numoutlierevec) principal components, in
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
63 order for that individual to be removed as an outlier. Default is 6.0.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
64 outlieroutname: output logfile of outlier individuals removed. If not specified,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
65 smartpca will print this information to stdout, which is the default.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
66 usenorm: Whether to normalize each SNP by a quantity related to allele freq.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
67 Default is YES. (When analyzing microsatellite data, should be set to NO.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
68 See Patterson et al. 2006.)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
69 altnormstyle: Affects very subtle details in normalization formula.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
70 Default is YES (normalization formulas of Patterson et al. 2006)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
71 To match EIGENSTRAT (normalization formulas of Price et al. 2006), set to NO.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
72 missingmode: If set to YES, then instead of doing PCA on # reference alleles,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
73 do PCA on whether each data point is missing or nonmissing. Default is NO.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
74 May be useful for detecting informative missingness (Clayton et al. 2005).
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
75 nsnpldregress: If set to a positive integer, then LD correction is turned on,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
76 and input to PCA will be the residual of a regression involving that many
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
77 previous SNPs, according to physical location. See Patterson et al. 2006.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
78 Default is 0 (no LD correction). If desiring LD correction, we recommend 2.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
79 maxdistldregress: If doing LD correction, this is the maximum genetic distance
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
80 (in Morgans) for previous SNPs used in LD correction. Default is no maximum.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
81 poplistname: If wishing to infer eigenvectors using only individuals from a
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
82 subset of populations, and then project individuals from all populations
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
83 onto those eigenvectors, this input file contains a list of population names,
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
84 one population name per line, which will be used to infer eigenvectors.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
85 It is assumed that the population of each individual is specified in the
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
86 indiv file. Default is to use individuals from all populations.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
87 phylipoutname: output file containing an fst matrix which can be used as input
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
88 to programs in the PHYLIP package, such as the "fitch" program for
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
89 constructing phylogenetic trees.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
90 noxdata: if set to YES, all SNPs on X chr are excluded from the data set.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
91 The smartpca default for this parameter is YES, since different variances
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
92 for males vs. females on X chr may confound PCA analysis.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
93 nomalexhet: if set to YES, any het genotypes on X chr for males are changed
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
94 to missing data. The smartpca default for this parameter is YES.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
95 badsnpname: specifies a list of SNPs which should be excluded from the data set.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
96 Same format as example.snp. Cannot be used if input is in
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
97 PACKEDPED or PACKEDANCESTRYMAP format.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
98 popsizelimit: If set to a positive integer, the result is that only the first
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
99 popsizelimit individuals from each population will be included in the
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
100 analysis. It is assumed that the population of each individual is specified
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
101 in the indiv file. Default is to use all individuals in the analysis.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
102
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
103 The next 5 optional parameters allow the user to output genotype, snp and
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
104 indiv files which will be identical to the input files except that:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
105 Any individuals set to Ignore in the input indiv file will be
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
106 removed from the data set (see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
107 Any data excluded or set to missing based on noxdata, nomalexhet and
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
108 badsnpname parameters (see above) will be removed from the data set.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
109 The user may decide to output these files in any format.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
110 outputformat: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
111 genotypeoutname: output genotype file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
112 snpoutname: output snp file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
113 indivoutname: output indiv file
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
114 outputgroup: see documentation in ../CONVERTF/README
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
115 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
116 import sys,os,time,subprocess,string,glob
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
117 from rgutils import RRun, galhtmlprefix, galhtmlpostfix, timenow, smartpca, rexe, plinke
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
118 verbose = False
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
119
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
120 def makePlot(eigpca='test.pca',title='test',pdfname='test.pdf',h=8,w=10,nfp=None,rexe=''):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
121 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
122 the eigenvec file has a # row with the eigenvectors, then subject ids, eigenvecs and lastly
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
123 the subject class
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
124 Rpy not being used here. Write a real R script and run it. Sadly, this means putting numbers
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
125 somewhere - like in the code as monster R vector constructor c(99.3,2.14) strings
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
126 At least you have the data and the analysis in one single place. Highly reproducible little
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
127 piece of research.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
128 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
129 debug=False
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
130 f = file(eigpca,'r')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
131 R = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
132 if debug:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
133 R.append('sessionInfo()')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
134 R.append("print('dir()=:')")
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
135 R.append('dir()')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
136 R.append("print('pdfname=%s')" % pdfname)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
137 gvec = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
138 pca1 = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
139 pca2 = []
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
140 groups = {}
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
141 glist = [] # list for legend
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
142 ngroup = 1 # increment for each new group encountered for pch vector
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
143 for n,row in enumerate(f):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
144 if n > 1:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
145 rowlist = row.strip().split()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
146 group = rowlist[-1]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
147 v1 = rowlist[1]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
148 v2 = rowlist[2]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
149 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
150 v1 = float(v1)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
151 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
152 v1 = 0.0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
153 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
154 v2 = float(v2)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
155 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
156 v2 = 0.0
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
157 if not groups.get(group,None):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
158 groups[group] = ngroup
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
159 glist.append(group)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
160 ngroup += 1 # for next group
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
161 gvec.append(groups[group]) # lookup group number
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
162 pca1.append('%f' % v1)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
163 pca2.append('%f' % v2)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
164 # now have vectors of group,pca1 and pca2
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
165 llist = [x.encode('ascii') for x in glist] # remove label unicode - eesh
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
166 llist = ['"%s"' % x for x in llist] # need to quote for R
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
167 R.append('llist=c(%s)' % ','.join(llist))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
168
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
169 plist = range(2,len(llist)+2) # pch - avoid black circles
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
170 R.append('glist=c(%s)' % ','.join(['%d' % x for x in plist]))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
171 pgvec = ['%d' % (plist[i-1]) for i in gvec] # plot symbol/colour for each point
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
172 R.append("par(lab=c(10,10,10))") # so our grid is denser than the default 5
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
173 R.append("par(mai=c(1,1,1,0.5))")
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
174 maint = title
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
175 R.append('pdf("%s",h=%d,w=%d)' % (pdfname,h,w))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
176 R.append("par(lab=c(10,10,10))")
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
177 R.append('pca1 = c(%s)' % ','.join(pca1))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
178 R.append('pca2 = c(%s)' % ','.join(pca2))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
179 R.append('pgvec = c(%s)' % ','.join(pgvec))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
180 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
181 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
182 R.append(s)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
183 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
184 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
185 R.append('dev.off()')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
186 R.append('png("%s.png",h=%d,w=%d,units="in",res=72)' % (pdfname,h,w))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
187 s = "plot(pca1,pca2,type='p',main='%s', ylab='Second ancestry eigenvector'," % maint
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
188 s += "xlab='First ancestry eigenvector',col=pgvec,cex=0.8,pch=pgvec)"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
189 R.append(s)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
190 R.append('legend("top",legend=llist,pch=glist,col=glist,title="Sample")')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
191 R.append('grid(nx = 10, ny = 10, col = "lightgray", lty = "dotted")')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
192 R.append('dev.off()')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
193 rlog,flist = RRun(rcmd=R,title=title,outdir=nfp)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
194 print >> sys.stdout, '\n'.join(R)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
195 print >> sys.stdout, rlog
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
196
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
197
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
198 def getfSize(fpath,outpath):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
199 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
200 format a nice file size string
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
201 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
202 size = ''
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
203 fp = os.path.join(outpath,fpath)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
204 if os.path.isfile(fp):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
205 n = float(os.path.getsize(fp))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
206 if n > 2**20:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
207 size = ' (%1.1f MB)' % (n/2**20)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
208 elif n > 2**10:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
209 size = ' (%1.1f KB)' % (n/2**10)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
210 elif n > 0:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
211 size = ' (%d B)' % (int(n))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
212 return size
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
213
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
214
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
215 def runEigen():
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
216 """ run the smartpca prog - documentation follows
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
217
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
218 smartpca.perl -i fakeped_100.eigenstratgeno -a fakeped_100.map -b fakeped_100.ind -p fakeped_100 -e fakeped_100.eigenvals -l
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
219 fakeped_100.eigenlog -o fakeped_100.eigenout
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
220
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
221 DOCUMENTATION OF smartpca.perl program:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
222
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
223 This program calls the smartpca program (see ../POPGEN/README).
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
224 For this to work, the bin directory containing smartpca MUST be in your path.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
225 See ./example.perl for a toy example.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
226
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
227 ../bin/smartpca.perl
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
228 -i example.geno : genotype file in EIGENSTRAT format (see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
229 -a example.snp : snp file (see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
230 -b example.ind : indiv file (see ../CONVERTF/README)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
231 -k k : (Default is 10) number of principal components to output
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
232 -o example.pca : output file of principal components. Individuals removed
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
233 as outliers will have all values set to 0.0 in this file.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
234 -p example.plot : prefix of output plot files of top 2 principal components.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
235 (labeling individuals according to labels in indiv file)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
236 -e example.eval : output file of all eigenvalues
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
237 -l example.log : output logfile
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
238 -m maxiter : (Default is 5) maximum number of outlier removal iterations.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
239 To turn off outlier removal, set -m 0.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
240 -t topk : (Default is 10) number of principal components along which
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
241 to remove outliers during each outlier removal iteration.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
242 -s sigma : (Default is 6.0) number of standard deviations which an
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
243 individual must exceed, along one of topk top principal
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
244 components, in order to be removed as an outlier.
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
245
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
246 now uses https://www.bx.psu.edu/cgi-bin/trac.cgi/galaxy/changeset/1832
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
247
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
248 All files can be viewed however, by making links in the primary (HTML) history item like:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
249 <img src="display_child?parent_id=2&designation=SomeImage?" alt="Some Image"/>
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
250 <a href="display_child?parent_id=2&designation=SomeText?">Some Text</a>
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
251
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
252 <command interpreter="python">
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
253 rgEigPCA.py "$i.extra_files_path/$i.metadata.base_name" "$title" "$out_file1"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
254 "$out_file1.files_path" "$k" "$m" "$t" "$s" "$pca"
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
255 </command>
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
256
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
257 """
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
258 if len(sys.argv) < 9:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
259 print 'Need an input genotype file root, a title, a temp id and the temp file path for outputs,'
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
260 print ' and the 4 integer tuning parameters k,m,t and s in order. Given that, will run smartpca for eigensoft'
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
261 sys.exit(1)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
262 else:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
263 print >> sys.stdout, 'rgEigPCA.py got %s' % (' '.join(sys.argv))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
264 skillme = ' %s' % string.punctuation
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
265 trantab = string.maketrans(skillme,'_'*len(skillme))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
266 ofname = sys.argv[5]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
267 progname = os.path.basename(sys.argv[0])
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
268 infile = sys.argv[1]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
269 infpath,base_name = os.path.split(infile) # now takes precomputed or autoconverted ldreduced dataset
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
270 title = sys.argv[2].translate(trantab) # must replace all of these for urls containing title
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
271 outfile1 = sys.argv[3]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
272 newfilepath = sys.argv[4]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
273 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
274 os.mkdirs(newfilepath)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
275 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
276 pass
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
277 op = os.path.split(outfile1)[0]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
278 try: # for test - needs this done
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
279 os.makedirs(op)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
280 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
281 pass
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
282 eigen_k = sys.argv[5]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
283 eigen_m = sys.argv[6]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
284 eigen_t = sys.argv[7]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
285 eigen_s = sys.argv[8]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
286 eigpca = sys.argv[9] # path to new dataset for pca results - for later adjustment
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
287 eigentitle = os.path.join(newfilepath,title)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
288 explanations=['Samples plotted in first 2 eigenvector space','Principle components','Eigenvalues',
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
289 'Smartpca log (contents shown below)']
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
290 rplotname = 'PCAPlot.pdf'
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
291 eigenexts = [rplotname, "pca.xls", "eval.xls"]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
292 newfiles = ['%s_%s' % (title,x) for x in eigenexts] # produced by eigenstrat
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
293 rplotout = os.path.join(newfilepath,newfiles[0]) # for R plots
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
294 eigenouts = [x for x in newfiles]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
295 eigenlogf = '%s_log.txt' % title
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
296 newfiles.append(eigenlogf) # so it will also appear in the links
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
297 lfname = outfile1
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
298 lf = file(lfname,'w')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
299 lf.write(galhtmlprefix % progname)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
300 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
301 os.makedirs(newfilepath)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
302 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
303 pass
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
304 smartCL = '%s -i %s.bed -a %s.bim -b %s.fam -o %s -p %s -e %s -l %s -k %s -m %s -t %s -s %s' % \
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
305 (smartpca,infile, infile, infile, eigenouts[1],'%s_eigensoftplot.pdf' % title,eigenouts[2],eigenlogf, \
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
306 eigen_k, eigen_m, eigen_t, eigen_s)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
307 env = os.environ
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
308 p=subprocess.Popen(smartCL,shell=True,cwd=newfilepath)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
309 retval = p.wait()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
310 # copy the eigenvector output file needed for adjustment to the user's eigenstrat library directory
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
311 elog = file(os.path.join(newfilepath,eigenlogf),'r').read()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
312 eeigen = os.path.join(newfilepath,'%s.evec' % eigenouts[1]) # need these for adjusting
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
313 try:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
314 eigpcaRes = file(eeigen,'r').read()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
315 except:
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
316 eigpcaRes = ''
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
317 file(eigpca,'w').write(eigpcaRes)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
318 makePlot(eigpca=eigpca,pdfname=newfiles[0],title=title,nfp=newfilepath,rexe=rexe)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
319 s = 'Output from %s run at %s<br/>\n' % (progname,timenow())
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
320 lf.write('<h4>%s</h4>\n' % s)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
321 lf.write('newfilepath=%s, rexe=%s' % (newfilepath,rexe))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
322 lf.write('(click on the image below to see a much higher quality PDF version)')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
323 thumbnail = '%s.png' % newfiles[0] # foo.pdf.png - who cares?
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
324 if os.path.exists(os.path.join(newfilepath,thumbnail)):
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
325 lf.write('<table border="0" cellpadding="10" cellspacing="10"><tr><td>\n')
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
326 lf.write('<a href="%s"><img src="%s" alt="%s" hspace="10" align="left" /></a></td></tr></table><br/>\n' \
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
327 % (newfiles[0],thumbnail,explanations[0]))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
328 allfiles = os.listdir(newfilepath)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
329 allfiles.sort()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
330 sizes = [getfSize(x,newfilepath) for x in allfiles]
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
331 lallfiles = ['<li><a href="%s">%s %s</a></li>\n' % (x,x,sizes[i]) for i,x in enumerate(allfiles)] # html list
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
332 lf.write('<div class="document">All Files:<ol>%s</ol></div>' % ''.join(lallfiles))
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
333 lf.write('<div class="document">Log %s contents follow below<p/>' % eigenlogf)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
334 lf.write('<pre>%s</pre></div>' % elog) # the eigenlog
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
335 s = 'If you need to rerun this analysis, the command line used was\n%s\n<p/>' % (smartCL)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
336 lf.write(s)
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
337 lf.write(galhtmlpostfix) # end galhtmlprefix div
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
338 lf.close()
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
339
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
340
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
341 if __name__ == "__main__":
9071e359b9a3 Uploaded
xuebing
parents:
diff changeset
342 runEigen()