0
|
1 """
|
|
2 oct 2009 - multiple output files
|
|
3 Dear Matthias,
|
|
4
|
|
5 Yes, you can define number of outputs dynamically in Galaxy. For doing
|
|
6 this, you'll have to declare one output dataset in your xml and pass
|
|
7 its ID ($out_file.id) to your python script. Also, set
|
|
8 force_history_refresh="True" in your tool tag in xml, like this:
|
|
9 <tool id="split1" name="Split" force_history_refresh="True">
|
|
10 In your script, if your outputs are named in the following format,
|
|
11 primary_associatedWithDatasetID_designation_visibility_extension
|
|
12 (_DBKEY), all your datasets will show up in the history pane.
|
|
13 associatedWithDatasetID is the $out_file.ID passed from xml,
|
|
14 designation will be a unique identifier for each output (set in your
|
|
15 script),
|
|
16 visibility can be set to visible if you want the dataset visible in
|
|
17 your history, or notvisible otherwise
|
|
18 extension is the required format for your dataset (bed, tabular, fasta
|
|
19 etc)
|
|
20 DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc)
|
|
21
|
|
22 One of our tools "MAF to Interval converter" (tools/maf/
|
|
23 maf_to_interval.xml) already uses this feature. You can use it as a
|
|
24 reference.
|
|
25
|
|
26 qq.chisq Quantile-quantile plot for chi-squared tests
|
|
27 Description
|
|
28 This function plots ranked observed chi-squared test statistics against the corresponding expected
|
|
29 order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed
|
|
30 means of observed and expected values. This is useful for inspecting the results of whole-genome
|
|
31 association studies for overdispersion due to population substructure and other sources of bias or
|
|
32 confounding.
|
|
33 Usage
|
|
34 qq.chisq(x, df=1, x.max, main="QQ plot",
|
|
35 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
|
|
36 xlab="Expected", ylab="Observed",
|
|
37 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
|
|
38 slope.one=FALSE, slope.lambda=FALSE,
|
|
39 thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
|
|
40 Arguments
|
|
41 x A vector of observed chi-squared test values
|
|
42 df The degreees of freedom for the tests
|
|
43 x.max If present, truncate the observed value (Y) axis here
|
|
44 main The main heading
|
|
45 sub The subheading
|
|
46 xlab x-axis label (default "Expected")
|
|
47 ylab y-axis label (default "Observed")
|
|
48 conc Lower and upper probability bounds for concentration band for the plot. Set this
|
|
49 to NA to suppress this
|
|
50 overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating
|
|
51 concentration band
|
|
52 trim Quantile point for trimmed mean calculations for estimation of lambda. Default
|
|
53 is to trim at the median
|
|
54 slope.one Is a line of slope one to be superimpsed?
|
|
55 slope.lambda Is a line of slope lambda to be superimposed?
|
|
56 thin A pair of numbers indicating how points will be thinned before plotting (see
|
|
57 Details). If NA, no thinning will be carried out
|
|
58 oor.pch Observed values greater than x.max are plotted at x.max. This argument sets
|
|
59 the plotting symbol to be used for out-of-range observations
|
|
60 col.shade The colour with which the concentration band will be filled
|
|
61 ... Further graphical parameter settings to be passed to points()
|
|
62
|
|
63 Details
|
|
64 To reduce plotting time and the size of plot files, the smallest observed and expected points are
|
|
65 thinned so that only a reduced number of (approximately equally spaced) points are plotted. The
|
|
66 precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers.
|
|
67 The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning
|
|
68 is to be applied. The second number should be an integer and sets the maximum number of points
|
|
69 to be plotted in this section.
|
|
70 The "concentration band" for the plot is shown in grey. This region is defined by upper and lower
|
|
71 probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a
|
|
72 simultaneous confidence region; the probability that the plot will stray outside the band at some
|
|
73 point exceeds 95
|
|
74 When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its
|
|
75 expected value under the chi-squared assumption.
|
|
76 Value
|
|
77 The function returns the number of tests, the number of values omitted from the plot (greater than
|
|
78 x.max), and the estimated dispersion factor, lambda.
|
|
79 Note
|
|
80 All tests must have the same number of degrees of freedom. If this is not the case, I suggest
|
|
81 transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
|
|
82 Author(s)
|
|
83 David Clayton hdavid.clayton@cimr.cam.ac.uki
|
|
84 References
|
|
85 Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
|
|
86 """
|
|
87
|
|
88 import sys, random, math, copy,os, subprocess, tempfile
|
|
89 from rgutils import RRun, rexe
|
|
90
|
|
91 rqq = """
|
|
92 # modified by ross lazarus for the rgenetics project may 2000
|
|
93 # makes a pdf for galaxy from an x vector of chisquare values
|
|
94 # from snpMatrix
|
|
95 # http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
|
|
96 qq.chisq <-
|
|
97 function(x, df=1, x.max,
|
|
98 main="QQ plot",
|
|
99 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
|
|
100 xlab="Expected", ylab="Observed",
|
|
101 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
|
|
102 slope.one=T, slope.lambda=FALSE,
|
|
103 thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
|
|
104 h=6,w=6,printpdf=F,...) {
|
|
105
|
|
106 # Function to shade concentration band
|
|
107
|
|
108 shade <- function(x1, y1, x2, y2, color=col.shade) {
|
|
109 n <- length(x2)
|
|
110 polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
|
|
111 }
|
|
112
|
|
113 # Sort values and see how many out of range
|
|
114
|
|
115 obsvd <- sort(x, na.last=NA)
|
|
116 N <- length(obsvd)
|
|
117 if (missing(x.max)) {
|
|
118 Np <- N
|
|
119 }
|
|
120 else {
|
|
121 Np <- sum(obsvd<=x.max)
|
|
122 }
|
|
123 if(Np==0)
|
|
124 stop("Nothing to plot")
|
|
125
|
|
126 # Expected values
|
|
127
|
|
128 if (df==2) {
|
|
129 expctd <- 2*cumsum(1/(N:1))
|
|
130 }
|
|
131 else {
|
|
132 expctd <- qchisq(p=(1:N)/(N+1), df=df)
|
|
133 }
|
|
134
|
|
135 # Concentration bands
|
|
136
|
|
137 if (!is.null(conc)) {
|
|
138 if(conc[1]>0) {
|
|
139 e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
|
|
140 }
|
|
141 else {
|
|
142 e.low <- rep(0, N)
|
|
143 }
|
|
144 if (conc[2]<1) {
|
|
145 e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
|
|
146 }
|
|
147 else {
|
|
148 e.high <- 1.1*rep(max(x),N)
|
|
149 }
|
|
150 }
|
|
151
|
|
152 # Plot outline
|
|
153
|
|
154 if (Np < N)
|
|
155 top <- x.max
|
|
156 else
|
|
157 top <- obsvd[N]
|
|
158 right <- expctd[N]
|
|
159 if (printpdf) {pdf(ofname,h,w)}
|
|
160 plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
|
|
161 main=main, sub=sub)
|
|
162
|
|
163 # Thinning
|
|
164
|
|
165 if (is.na(thin[1])) {
|
|
166 show <- 1:Np
|
|
167 }
|
|
168 else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) {
|
|
169 warning("invalid thin parameter; no thinning carried out")
|
|
170 show <- 1:Np
|
|
171 }
|
|
172 else {
|
|
173 space <- right*thin[1]/floor(thin[2])
|
|
174 iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df))
|
|
175 if (max(iat)>thin[2])
|
|
176 show <- unique(c(iat, (1+max(iat)):Np))
|
|
177 else
|
|
178 show <- 1:Np
|
|
179 }
|
|
180 Nu <- floor(trim*N)
|
|
181 if (Nu>0)
|
|
182 lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
|
|
183 if (!is.null(conc)) {
|
|
184 if (Np<N)
|
|
185 vert <- c(show, (Np+1):N)
|
|
186 else
|
|
187 vert <- show
|
|
188 if (overdisp)
|
|
189 shade(expctd[vert], lambda*e.low[vert],
|
|
190 expctd[vert], lambda*e.high[vert])
|
|
191 else
|
|
192 shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
|
|
193 }
|
|
194 points(expctd[show], obsvd[show], ...)
|
|
195 # Overflow
|
|
196 if (Np<N) {
|
|
197 over <- (Np+1):N
|
|
198 points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
|
|
199 }
|
|
200 # Lines
|
|
201 line.types <- c("solid", "dashed", "dotted")
|
|
202 key <- NULL
|
|
203 txt <- NULL
|
|
204 if (slope.one) {
|
|
205 key <- c(key, line.types[1])
|
|
206 txt <- c(txt, "y = x")
|
|
207 abline(a=0, b=1, lty=line.types[1])
|
|
208 }
|
|
209 if (slope.lambda && Nu>0) {
|
|
210 key <- c(key, line.types[2])
|
|
211 txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
|
|
212 if (!is.null(conc)) {
|
|
213 if (Np<N)
|
|
214 vert <- c(show, (Np+1):N)
|
|
215 else
|
|
216 vert <- show
|
|
217 }
|
|
218 abline(a=0, b=lambda, lty=line.types[2])
|
|
219 }
|
|
220 if (printpdf) {dev.off()}
|
|
221 # Returned value
|
|
222
|
|
223 if (!is.null(key))
|
|
224 legend(0, top, legend=txt, lty=key)
|
|
225 c(N=N, omitted=N-Np, lambda=lambda)
|
|
226
|
|
227 }
|
|
228
|
|
229 """
|
|
230
|
|
231
|
|
232
|
|
233
|
|
234 def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title',
|
|
235 xvar='Sample',h=8,w=8,logscale=True,outdir=None):
|
|
236 """
|
|
237 y is data for a qq plot and ends up on the x axis go figure
|
|
238 if sampling, oversample low values - all the top 1% ?
|
|
239 assume we have 0-1 p values
|
|
240 """
|
|
241 R = []
|
|
242 colour="maroon"
|
|
243 nrows = len(dat)
|
|
244 dat.sort() # small to large
|
|
245 fn = float(nrows)
|
|
246 unifx = [x/fn for x in range(1,(nrows+1))]
|
|
247 if logscale:
|
|
248 unifx = [-math.log10(x) for x in unifx] # uniform distribution
|
|
249 if sample < 1.0 and len(dat) > maxveclen:
|
|
250 # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
|
|
251 # oversample part of the distribution
|
|
252 always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5%
|
|
253 skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
|
|
254 if skip <= 1:
|
|
255 skip = 2
|
|
256 samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
|
|
257 # always oversample first sorted (here lowest) values
|
|
258 yvec = [dat[i] for i in samplei] # always get first and last
|
|
259 xvec = [unifx[i] for i in samplei] # and sample xvec same way
|
|
260 maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows)
|
|
261 else:
|
|
262 yvec = [x for x in dat]
|
|
263 maint='QQ %s (n=%d)' % (title,nrows)
|
|
264 xvec = unifx
|
|
265 if logscale:
|
|
266 maint = 'Log%s' % maint
|
|
267 mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line
|
|
268 ylab = '-log10(%s) Quantiles' % title
|
|
269 xlab = '-log10(Uniform 0-1) Quantiles'
|
|
270 yvec = [-math.log10(x) for x in yvec if x > 0.0]
|
|
271 else:
|
|
272 mx = [0,1]
|
|
273 ylab = '%s Quantiles' % title
|
|
274 xlab = 'Uniform 0-1 Quantiles'
|
|
275
|
|
276 xv = ['%f' % x for x in xvec]
|
|
277 R.append('xvec = c(%s)' % ','.join(xv))
|
|
278 yv = ['%f' % x for x in yvec]
|
|
279 R.append('yvec = c(%s)' % ','.join(yv))
|
|
280 R.append('mx = c(0,%f)' % (math.log10(fn)))
|
|
281 R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w))
|
|
282 R.append("par(lab=c(10,10,10))")
|
|
283 R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour))
|
|
284 R.append('points(mx,mx,type="l")')
|
|
285 R.append('grid(col="lightgray",lty="dotted")')
|
|
286 R.append('dev.off()')
|
|
287 RRun(rcmd=R,title='makeQQplot',outdir=outdir)
|
|
288
|
|
289
|
|
290
|
|
291 def main():
|
|
292 u = """
|
|
293 """
|
|
294 u = """<command interpreter="python">
|
|
295 rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
|
|
296 </command>
|
|
297
|
|
298 </command>
|
|
299 """
|
|
300 print >> sys.stdout,'## rgQQ.py. cl=',sys.argv
|
|
301 npar = 11
|
|
302 if len(sys.argv) < npar:
|
|
303 print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
|
|
304 print >> sys.stdout, u
|
|
305 sys.exit(1)
|
|
306 in_fname = sys.argv[1]
|
|
307 name = sys.argv[2]
|
|
308 sample = float(sys.argv[3])
|
|
309 head = None
|
|
310 columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
|
|
311 allout = sys.argv[5]
|
|
312 height = int(sys.argv[6])
|
|
313 width = int(sys.argv[7])
|
|
314 logscale = (sys.argv[8].lower() == 'true')
|
|
315 outid = sys.argv[9] # this is used to allow multiple output files
|
|
316 outdir = sys.argv[10]
|
|
317 nan_row = False
|
|
318 rows = []
|
|
319 for i, line in enumerate( file( sys.argv[1] ) ):
|
|
320 # Skip comments
|
|
321 if line.startswith( '#' ) or ( i == 0 ):
|
|
322 if i == 0:
|
|
323 head = line.strip().split("\t")
|
|
324 continue
|
|
325 if len(line.strip()) == 0:
|
|
326 continue
|
|
327 # Extract values and convert to floats
|
|
328 fields = line.strip().split( "\t" )
|
|
329 row = []
|
|
330 nan_row = False
|
|
331 for column in columns:
|
|
332 if len( fields ) <= column:
|
|
333 return fail( "No column %d on line %d: %s" % ( column, i, fields ) )
|
|
334 val = fields[column]
|
|
335 if val.lower() == "na":
|
|
336 nan_row = True
|
|
337 else:
|
|
338 try:
|
|
339 row.append( float( fields[column] ) )
|
|
340 except ValueError:
|
|
341 return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) )
|
|
342 if not nan_row:
|
|
343 rows.append( row )
|
|
344 if i > 1:
|
|
345 i = i-1 # remove header row from count
|
|
346 if head == None:
|
|
347 head = ['Col%d' % (x+1) for x in columns]
|
|
348 R = []
|
|
349 for c,column in enumerate(columns): # we appended each column in turn
|
|
350 outname = allout
|
|
351 if c > 0: # after first time
|
|
352 outname = 'primary_%s_col%s_visible_pdf' % (outid,column)
|
|
353 outname = os.path.join(outdir,outname)
|
|
354 dat = []
|
|
355 nrows = len(rows) # sometimes lots of NA's!!
|
|
356 for arow in rows:
|
|
357 dat.append(arow[c]) # remember, we appended each col in turn!
|
|
358 cname = head[column]
|
|
359 makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname),
|
|
360 xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir)
|
|
361
|
|
362
|
|
363
|
|
364 if __name__ == "__main__":
|
|
365 main()
|