comparison tools/rgenetics/rgQQ.py @ 0:9071e359b9a3

Uploaded
author xuebing
date Fri, 09 Mar 2012 19:37:19 -0500
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:9071e359b9a3
1 """
2 oct 2009 - multiple output files
3 Dear Matthias,
4
5 Yes, you can define number of outputs dynamically in Galaxy. For doing
6 this, you'll have to declare one output dataset in your xml and pass
7 its ID ($out_file.id) to your python script. Also, set
8 force_history_refresh="True" in your tool tag in xml, like this:
9 <tool id="split1" name="Split" force_history_refresh="True">
10 In your script, if your outputs are named in the following format,
11 primary_associatedWithDatasetID_designation_visibility_extension
12 (_DBKEY), all your datasets will show up in the history pane.
13 associatedWithDatasetID is the $out_file.ID passed from xml,
14 designation will be a unique identifier for each output (set in your
15 script),
16 visibility can be set to visible if you want the dataset visible in
17 your history, or notvisible otherwise
18 extension is the required format for your dataset (bed, tabular, fasta
19 etc)
20 DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc)
21
22 One of our tools "MAF to Interval converter" (tools/maf/
23 maf_to_interval.xml) already uses this feature. You can use it as a
24 reference.
25
26 qq.chisq Quantile-quantile plot for chi-squared tests
27 Description
28 This function plots ranked observed chi-squared test statistics against the corresponding expected
29 order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed
30 means of observed and expected values. This is useful for inspecting the results of whole-genome
31 association studies for overdispersion due to population substructure and other sources of bias or
32 confounding.
33 Usage
34 qq.chisq(x, df=1, x.max, main="QQ plot",
35 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
36 xlab="Expected", ylab="Observed",
37 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
38 slope.one=FALSE, slope.lambda=FALSE,
39 thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
40 Arguments
41 x A vector of observed chi-squared test values
42 df The degreees of freedom for the tests
43 x.max If present, truncate the observed value (Y) axis here
44 main The main heading
45 sub The subheading
46 xlab x-axis label (default "Expected")
47 ylab y-axis label (default "Observed")
48 conc Lower and upper probability bounds for concentration band for the plot. Set this
49 to NA to suppress this
50 overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating
51 concentration band
52 trim Quantile point for trimmed mean calculations for estimation of lambda. Default
53 is to trim at the median
54 slope.one Is a line of slope one to be superimpsed?
55 slope.lambda Is a line of slope lambda to be superimposed?
56 thin A pair of numbers indicating how points will be thinned before plotting (see
57 Details). If NA, no thinning will be carried out
58 oor.pch Observed values greater than x.max are plotted at x.max. This argument sets
59 the plotting symbol to be used for out-of-range observations
60 col.shade The colour with which the concentration band will be filled
61 ... Further graphical parameter settings to be passed to points()
62
63 Details
64 To reduce plotting time and the size of plot files, the smallest observed and expected points are
65 thinned so that only a reduced number of (approximately equally spaced) points are plotted. The
66 precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers.
67 The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning
68 is to be applied. The second number should be an integer and sets the maximum number of points
69 to be plotted in this section.
70 The "concentration band" for the plot is shown in grey. This region is defined by upper and lower
71 probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a
72 simultaneous confidence region; the probability that the plot will stray outside the band at some
73 point exceeds 95
74 When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its
75 expected value under the chi-squared assumption.
76 Value
77 The function returns the number of tests, the number of values omitted from the plot (greater than
78 x.max), and the estimated dispersion factor, lambda.
79 Note
80 All tests must have the same number of degrees of freedom. If this is not the case, I suggest
81 transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
82 Author(s)
83 David Clayton hdavid.clayton@cimr.cam.ac.uki
84 References
85 Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
86 """
87
88 import sys, random, math, copy,os, subprocess, tempfile
89 from rgutils import RRun, rexe
90
91 rqq = """
92 # modified by ross lazarus for the rgenetics project may 2000
93 # makes a pdf for galaxy from an x vector of chisquare values
94 # from snpMatrix
95 # http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
96 qq.chisq <-
97 function(x, df=1, x.max,
98 main="QQ plot",
99 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
100 xlab="Expected", ylab="Observed",
101 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
102 slope.one=T, slope.lambda=FALSE,
103 thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
104 h=6,w=6,printpdf=F,...) {
105
106 # Function to shade concentration band
107
108 shade <- function(x1, y1, x2, y2, color=col.shade) {
109 n <- length(x2)
110 polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
111 }
112
113 # Sort values and see how many out of range
114
115 obsvd <- sort(x, na.last=NA)
116 N <- length(obsvd)
117 if (missing(x.max)) {
118 Np <- N
119 }
120 else {
121 Np <- sum(obsvd<=x.max)
122 }
123 if(Np==0)
124 stop("Nothing to plot")
125
126 # Expected values
127
128 if (df==2) {
129 expctd <- 2*cumsum(1/(N:1))
130 }
131 else {
132 expctd <- qchisq(p=(1:N)/(N+1), df=df)
133 }
134
135 # Concentration bands
136
137 if (!is.null(conc)) {
138 if(conc[1]>0) {
139 e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
140 }
141 else {
142 e.low <- rep(0, N)
143 }
144 if (conc[2]<1) {
145 e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
146 }
147 else {
148 e.high <- 1.1*rep(max(x),N)
149 }
150 }
151
152 # Plot outline
153
154 if (Np < N)
155 top <- x.max
156 else
157 top <- obsvd[N]
158 right <- expctd[N]
159 if (printpdf) {pdf(ofname,h,w)}
160 plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
161 main=main, sub=sub)
162
163 # Thinning
164
165 if (is.na(thin[1])) {
166 show <- 1:Np
167 }
168 else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) {
169 warning("invalid thin parameter; no thinning carried out")
170 show <- 1:Np
171 }
172 else {
173 space <- right*thin[1]/floor(thin[2])
174 iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df))
175 if (max(iat)>thin[2])
176 show <- unique(c(iat, (1+max(iat)):Np))
177 else
178 show <- 1:Np
179 }
180 Nu <- floor(trim*N)
181 if (Nu>0)
182 lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
183 if (!is.null(conc)) {
184 if (Np<N)
185 vert <- c(show, (Np+1):N)
186 else
187 vert <- show
188 if (overdisp)
189 shade(expctd[vert], lambda*e.low[vert],
190 expctd[vert], lambda*e.high[vert])
191 else
192 shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
193 }
194 points(expctd[show], obsvd[show], ...)
195 # Overflow
196 if (Np<N) {
197 over <- (Np+1):N
198 points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
199 }
200 # Lines
201 line.types <- c("solid", "dashed", "dotted")
202 key <- NULL
203 txt <- NULL
204 if (slope.one) {
205 key <- c(key, line.types[1])
206 txt <- c(txt, "y = x")
207 abline(a=0, b=1, lty=line.types[1])
208 }
209 if (slope.lambda && Nu>0) {
210 key <- c(key, line.types[2])
211 txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
212 if (!is.null(conc)) {
213 if (Np<N)
214 vert <- c(show, (Np+1):N)
215 else
216 vert <- show
217 }
218 abline(a=0, b=lambda, lty=line.types[2])
219 }
220 if (printpdf) {dev.off()}
221 # Returned value
222
223 if (!is.null(key))
224 legend(0, top, legend=txt, lty=key)
225 c(N=N, omitted=N-Np, lambda=lambda)
226
227 }
228
229 """
230
231
232
233
234 def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title',
235 xvar='Sample',h=8,w=8,logscale=True,outdir=None):
236 """
237 y is data for a qq plot and ends up on the x axis go figure
238 if sampling, oversample low values - all the top 1% ?
239 assume we have 0-1 p values
240 """
241 R = []
242 colour="maroon"
243 nrows = len(dat)
244 dat.sort() # small to large
245 fn = float(nrows)
246 unifx = [x/fn for x in range(1,(nrows+1))]
247 if logscale:
248 unifx = [-math.log10(x) for x in unifx] # uniform distribution
249 if sample < 1.0 and len(dat) > maxveclen:
250 # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
251 # oversample part of the distribution
252 always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5%
253 skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
254 if skip <= 1:
255 skip = 2
256 samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
257 # always oversample first sorted (here lowest) values
258 yvec = [dat[i] for i in samplei] # always get first and last
259 xvec = [unifx[i] for i in samplei] # and sample xvec same way
260 maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows)
261 else:
262 yvec = [x for x in dat]
263 maint='QQ %s (n=%d)' % (title,nrows)
264 xvec = unifx
265 if logscale:
266 maint = 'Log%s' % maint
267 mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line
268 ylab = '-log10(%s) Quantiles' % title
269 xlab = '-log10(Uniform 0-1) Quantiles'
270 yvec = [-math.log10(x) for x in yvec if x > 0.0]
271 else:
272 mx = [0,1]
273 ylab = '%s Quantiles' % title
274 xlab = 'Uniform 0-1 Quantiles'
275
276 xv = ['%f' % x for x in xvec]
277 R.append('xvec = c(%s)' % ','.join(xv))
278 yv = ['%f' % x for x in yvec]
279 R.append('yvec = c(%s)' % ','.join(yv))
280 R.append('mx = c(0,%f)' % (math.log10(fn)))
281 R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w))
282 R.append("par(lab=c(10,10,10))")
283 R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour))
284 R.append('points(mx,mx,type="l")')
285 R.append('grid(col="lightgray",lty="dotted")')
286 R.append('dev.off()')
287 RRun(rcmd=R,title='makeQQplot',outdir=outdir)
288
289
290
291 def main():
292 u = """
293 """
294 u = """<command interpreter="python">
295 rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
296 </command>
297
298 </command>
299 """
300 print >> sys.stdout,'## rgQQ.py. cl=',sys.argv
301 npar = 11
302 if len(sys.argv) < npar:
303 print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
304 print >> sys.stdout, u
305 sys.exit(1)
306 in_fname = sys.argv[1]
307 name = sys.argv[2]
308 sample = float(sys.argv[3])
309 head = None
310 columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
311 allout = sys.argv[5]
312 height = int(sys.argv[6])
313 width = int(sys.argv[7])
314 logscale = (sys.argv[8].lower() == 'true')
315 outid = sys.argv[9] # this is used to allow multiple output files
316 outdir = sys.argv[10]
317 nan_row = False
318 rows = []
319 for i, line in enumerate( file( sys.argv[1] ) ):
320 # Skip comments
321 if line.startswith( '#' ) or ( i == 0 ):
322 if i == 0:
323 head = line.strip().split("\t")
324 continue
325 if len(line.strip()) == 0:
326 continue
327 # Extract values and convert to floats
328 fields = line.strip().split( "\t" )
329 row = []
330 nan_row = False
331 for column in columns:
332 if len( fields ) <= column:
333 return fail( "No column %d on line %d: %s" % ( column, i, fields ) )
334 val = fields[column]
335 if val.lower() == "na":
336 nan_row = True
337 else:
338 try:
339 row.append( float( fields[column] ) )
340 except ValueError:
341 return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) )
342 if not nan_row:
343 rows.append( row )
344 if i > 1:
345 i = i-1 # remove header row from count
346 if head == None:
347 head = ['Col%d' % (x+1) for x in columns]
348 R = []
349 for c,column in enumerate(columns): # we appended each column in turn
350 outname = allout
351 if c > 0: # after first time
352 outname = 'primary_%s_col%s_visible_pdf' % (outid,column)
353 outname = os.path.join(outdir,outname)
354 dat = []
355 nrows = len(rows) # sometimes lots of NA's!!
356 for arow in rows:
357 dat.append(arow[c]) # remember, we appended each col in turn!
358 cname = head[column]
359 makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname),
360 xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir)
361
362
363
364 if __name__ == "__main__":
365 main()