| 0 | 1 """ | 
|  | 2 oct 2009 - multiple output files | 
|  | 3 Dear Matthias, | 
|  | 4 | 
|  | 5 Yes, you can define number of outputs dynamically in Galaxy. For doing | 
|  | 6 this, you'll have to declare one output dataset in your xml and pass | 
|  | 7 its ID ($out_file.id) to your python script. Also, set | 
|  | 8 force_history_refresh="True" in your tool tag in xml, like this: | 
|  | 9 <tool id="split1" name="Split" force_history_refresh="True"> | 
|  | 10 In your script, if your outputs are named in the following format, | 
|  | 11 primary_associatedWithDatasetID_designation_visibility_extension | 
|  | 12 (_DBKEY), all your datasets will show up in the history pane. | 
|  | 13 associatedWithDatasetID is the $out_file.ID passed from xml, | 
|  | 14 designation will be a unique identifier for each output (set in your | 
|  | 15 script), | 
|  | 16 visibility can be set to visible if you want the dataset visible in | 
|  | 17 your history, or notvisible otherwise | 
|  | 18 extension is the required format for your dataset (bed, tabular, fasta | 
|  | 19 etc) | 
|  | 20 DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc) | 
|  | 21 | 
|  | 22 One of our tools "MAF to Interval converter" (tools/maf/ | 
|  | 23 maf_to_interval.xml) already uses this feature. You can use it as a | 
|  | 24 reference. | 
|  | 25 | 
|  | 26 qq.chisq Quantile-quantile plot for chi-squared tests | 
|  | 27 Description | 
|  | 28 This function plots ranked observed chi-squared test statistics against the corresponding expected | 
|  | 29 order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed | 
|  | 30 means of observed and expected values. This is useful for inspecting the results of whole-genome | 
|  | 31 association studies for overdispersion due to population substructure and other sources of bias or | 
|  | 32 confounding. | 
|  | 33 Usage | 
|  | 34 qq.chisq(x, df=1, x.max, main="QQ plot", | 
|  | 35 sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), | 
|  | 36 xlab="Expected", ylab="Observed", | 
|  | 37 conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, | 
|  | 38 slope.one=FALSE, slope.lambda=FALSE, | 
|  | 39 thin=c(0.25,50), oor.pch=24, col.shade="gray", ...) | 
|  | 40 Arguments | 
|  | 41 x A vector of observed chi-squared test values | 
|  | 42 df The degreees of freedom for the tests | 
|  | 43 x.max If present, truncate the observed value (Y) axis here | 
|  | 44 main The main heading | 
|  | 45 sub The subheading | 
|  | 46 xlab x-axis label (default "Expected") | 
|  | 47 ylab y-axis label (default "Observed") | 
|  | 48 conc Lower and upper probability bounds for concentration band for the plot. Set this | 
|  | 49 to NA to suppress this | 
|  | 50 overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating | 
|  | 51 concentration band | 
|  | 52 trim Quantile point for trimmed mean calculations for estimation of lambda. Default | 
|  | 53 is to trim at the median | 
|  | 54 slope.one Is a line of slope one to be superimpsed? | 
|  | 55 slope.lambda Is a line of slope lambda to be superimposed? | 
|  | 56 thin A pair of numbers indicating how points will be thinned before plotting (see | 
|  | 57 Details). If NA, no thinning will be carried out | 
|  | 58 oor.pch Observed values greater than x.max are plotted at x.max. This argument sets | 
|  | 59 the plotting symbol to be used for out-of-range observations | 
|  | 60 col.shade The colour with which the concentration band will be filled | 
|  | 61 ... Further graphical parameter settings to be passed to points() | 
|  | 62 | 
|  | 63 Details | 
|  | 64 To reduce plotting time and the size of plot files, the smallest observed and expected points are | 
|  | 65 thinned so that only a reduced number of (approximately equally spaced) points are plotted. The | 
|  | 66 precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers. | 
|  | 67 The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning | 
|  | 68 is to be applied. The second number should be an integer and sets the maximum number of points | 
|  | 69 to be plotted in this section. | 
|  | 70 The "concentration band" for the plot is shown in grey. This region is defined by upper and lower | 
|  | 71 probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a | 
|  | 72 simultaneous confidence region; the probability that the plot will stray outside the band at some | 
|  | 73 point exceeds 95 | 
|  | 74 When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its | 
|  | 75 expected value under the chi-squared assumption. | 
|  | 76 Value | 
|  | 77 The function returns the number of tests, the number of values omitted from the plot (greater than | 
|  | 78 x.max), and the estimated dispersion factor, lambda. | 
|  | 79 Note | 
|  | 80 All tests must have the same number of degrees of freedom. If this is not the case, I suggest | 
|  | 81 transforming to p-values and then plotting -2log(p) as chi-squared on 2 df. | 
|  | 82 Author(s) | 
|  | 83 David Clayton hdavid.clayton@cimr.cam.ac.uki | 
|  | 84 References | 
|  | 85 Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004 | 
|  | 86 """ | 
|  | 87 | 
|  | 88 import sys, random, math, copy,os, subprocess, tempfile | 
|  | 89 from rgutils import RRun, rexe | 
|  | 90 | 
|  | 91 rqq = """ | 
|  | 92 # modified by ross lazarus for the rgenetics project may 2000 | 
|  | 93 # makes a pdf for galaxy from an x vector of chisquare values | 
|  | 94 # from snpMatrix | 
|  | 95 # http://www.bioconductor.org/packages/bioc/html/snpMatrix.html | 
|  | 96  qq.chisq <- | 
|  | 97   function(x, df=1, x.max, | 
|  | 98     main="QQ plot", | 
|  | 99     sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), | 
|  | 100     xlab="Expected", ylab="Observed", | 
|  | 101     conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, | 
|  | 102     slope.one=T, slope.lambda=FALSE, | 
|  | 103     thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf", | 
|  | 104     h=6,w=6,printpdf=F,...) { | 
|  | 105 | 
|  | 106     # Function to shade concentration band | 
|  | 107 | 
|  | 108     shade <- function(x1, y1, x2, y2, color=col.shade) { | 
|  | 109       n <- length(x2) | 
|  | 110       polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color) | 
|  | 111     } | 
|  | 112 | 
|  | 113     # Sort values and see how many out of range | 
|  | 114 | 
|  | 115     obsvd <- sort(x, na.last=NA) | 
|  | 116     N <- length(obsvd) | 
|  | 117     if (missing(x.max)) { | 
|  | 118       Np <- N | 
|  | 119     } | 
|  | 120     else { | 
|  | 121       Np <- sum(obsvd<=x.max) | 
|  | 122     } | 
|  | 123     if(Np==0) | 
|  | 124       stop("Nothing to plot") | 
|  | 125 | 
|  | 126     # Expected values | 
|  | 127 | 
|  | 128     if (df==2) { | 
|  | 129       expctd <- 2*cumsum(1/(N:1)) | 
|  | 130     } | 
|  | 131     else { | 
|  | 132       expctd <- qchisq(p=(1:N)/(N+1), df=df) | 
|  | 133     } | 
|  | 134 | 
|  | 135     # Concentration bands | 
|  | 136 | 
|  | 137     if (!is.null(conc)) { | 
|  | 138       if(conc[1]>0) { | 
|  | 139         e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df) | 
|  | 140       } | 
|  | 141       else { | 
|  | 142         e.low <- rep(0, N) | 
|  | 143       } | 
|  | 144       if (conc[2]<1) { | 
|  | 145         e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df) | 
|  | 146       } | 
|  | 147       else { | 
|  | 148         e.high <- 1.1*rep(max(x),N) | 
|  | 149       } | 
|  | 150     } | 
|  | 151 | 
|  | 152     # Plot outline | 
|  | 153 | 
|  | 154     if (Np < N) | 
|  | 155       top <- x.max | 
|  | 156     else | 
|  | 157       top <- obsvd[N] | 
|  | 158     right <- expctd[N] | 
|  | 159     if (printpdf) {pdf(ofname,h,w)} | 
|  | 160     plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab, | 
|  | 161          main=main, sub=sub) | 
|  | 162 | 
|  | 163     # Thinning | 
|  | 164 | 
|  | 165     if (is.na(thin[1])) { | 
|  | 166       show <- 1:Np | 
|  | 167     } | 
|  | 168     else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) { | 
|  | 169       warning("invalid thin parameter; no thinning carried out") | 
|  | 170       show <- 1:Np | 
|  | 171     } | 
|  | 172     else { | 
|  | 173       space <- right*thin[1]/floor(thin[2]) | 
|  | 174       iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df)) | 
|  | 175       if (max(iat)>thin[2]) | 
|  | 176         show <- unique(c(iat, (1+max(iat)):Np)) | 
|  | 177       else | 
|  | 178         show <- 1:Np | 
|  | 179     } | 
|  | 180     Nu <- floor(trim*N) | 
|  | 181     if (Nu>0) | 
|  | 182       lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu]) | 
|  | 183     if (!is.null(conc)) { | 
|  | 184       if (Np<N) | 
|  | 185         vert <- c(show, (Np+1):N) | 
|  | 186       else | 
|  | 187         vert <- show | 
|  | 188       if (overdisp) | 
|  | 189         shade(expctd[vert], lambda*e.low[vert], | 
|  | 190               expctd[vert], lambda*e.high[vert]) | 
|  | 191       else | 
|  | 192         shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert]) | 
|  | 193     } | 
|  | 194     points(expctd[show], obsvd[show], ...) | 
|  | 195     # Overflow | 
|  | 196     if (Np<N) { | 
|  | 197       over <- (Np+1):N | 
|  | 198       points(expctd[over], rep(x.max, N-Np), pch=oor.pch) | 
|  | 199     } | 
|  | 200     # Lines | 
|  | 201     line.types <- c("solid", "dashed", "dotted") | 
|  | 202     key <- NULL | 
|  | 203     txt <- NULL | 
|  | 204     if (slope.one) { | 
|  | 205       key <- c(key, line.types[1]) | 
|  | 206       txt <- c(txt, "y = x") | 
|  | 207       abline(a=0, b=1, lty=line.types[1]) | 
|  | 208     } | 
|  | 209     if (slope.lambda && Nu>0) { | 
|  | 210       key <- c(key, line.types[2]) | 
|  | 211       txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep="")) | 
|  | 212       if (!is.null(conc)) { | 
|  | 213         if (Np<N) | 
|  | 214           vert <- c(show, (Np+1):N) | 
|  | 215         else | 
|  | 216           vert <- show | 
|  | 217       } | 
|  | 218       abline(a=0, b=lambda, lty=line.types[2]) | 
|  | 219     } | 
|  | 220     if (printpdf) {dev.off()} | 
|  | 221     # Returned value | 
|  | 222 | 
|  | 223     if (!is.null(key)) | 
|  | 224        legend(0, top, legend=txt, lty=key) | 
|  | 225     c(N=N, omitted=N-Np, lambda=lambda) | 
|  | 226 | 
|  | 227   } | 
|  | 228 | 
|  | 229 """ | 
|  | 230 | 
|  | 231 | 
|  | 232 | 
|  | 233 | 
|  | 234 def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title', | 
|  | 235            xvar='Sample',h=8,w=8,logscale=True,outdir=None): | 
|  | 236     """ | 
|  | 237     y is data for a qq plot and ends up on the x axis go figure | 
|  | 238     if sampling, oversample low values - all the top 1% ? | 
|  | 239     assume we have 0-1 p values | 
|  | 240     """ | 
|  | 241     R = [] | 
|  | 242     colour="maroon" | 
|  | 243     nrows = len(dat) | 
|  | 244     dat.sort() # small to large | 
|  | 245     fn = float(nrows) | 
|  | 246     unifx = [x/fn for x in range(1,(nrows+1))] | 
|  | 247     if logscale: | 
|  | 248         unifx = [-math.log10(x) for x in unifx] # uniform distribution | 
|  | 249     if sample < 1.0 and len(dat) > maxveclen: | 
|  | 250         # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points | 
|  | 251         # oversample part of the distribution | 
|  | 252         always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5% | 
|  | 253         skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points | 
|  | 254         if skip <= 1: | 
|  | 255             skip = 2 | 
|  | 256         samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)] | 
|  | 257         # always oversample first sorted (here lowest) values | 
|  | 258         yvec = [dat[i] for i in samplei] # always get first and last | 
|  | 259         xvec = [unifx[i] for i in samplei] # and sample xvec same way | 
|  | 260         maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows) | 
|  | 261     else: | 
|  | 262         yvec = [x for x in dat] | 
|  | 263         maint='QQ %s (n=%d)' % (title,nrows) | 
|  | 264         xvec = unifx | 
|  | 265     if logscale: | 
|  | 266         maint = 'Log%s' % maint | 
|  | 267         mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line | 
|  | 268         ylab = '-log10(%s) Quantiles' % title | 
|  | 269         xlab = '-log10(Uniform 0-1) Quantiles' | 
|  | 270         yvec = [-math.log10(x) for x in yvec if x > 0.0] | 
|  | 271     else: | 
|  | 272         mx = [0,1] | 
|  | 273         ylab = '%s Quantiles' % title | 
|  | 274         xlab = 'Uniform 0-1 Quantiles' | 
|  | 275 | 
|  | 276     xv = ['%f' % x for x in xvec] | 
|  | 277     R.append('xvec = c(%s)' % ','.join(xv)) | 
|  | 278     yv = ['%f' % x for x in yvec] | 
|  | 279     R.append('yvec = c(%s)' % ','.join(yv)) | 
|  | 280     R.append('mx = c(0,%f)' % (math.log10(fn))) | 
|  | 281     R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w)) | 
|  | 282     R.append("par(lab=c(10,10,10))") | 
|  | 283     R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour)) | 
|  | 284     R.append('points(mx,mx,type="l")') | 
|  | 285     R.append('grid(col="lightgray",lty="dotted")') | 
|  | 286     R.append('dev.off()') | 
|  | 287     RRun(rcmd=R,title='makeQQplot',outdir=outdir) | 
|  | 288 | 
|  | 289 | 
|  | 290 | 
|  | 291 def main(): | 
|  | 292     u = """ | 
|  | 293     """ | 
|  | 294     u = """<command interpreter="python"> | 
|  | 295         rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__ | 
|  | 296     </command> | 
|  | 297 | 
|  | 298     </command> | 
|  | 299     """ | 
|  | 300     print >> sys.stdout,'## rgQQ.py. cl=',sys.argv | 
|  | 301     npar = 11 | 
|  | 302     if len(sys.argv) < npar: | 
|  | 303             print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar | 
|  | 304             print >> sys.stdout, u | 
|  | 305             sys.exit(1) | 
|  | 306     in_fname = sys.argv[1] | 
|  | 307     name = sys.argv[2] | 
|  | 308     sample = float(sys.argv[3]) | 
|  | 309     head = None | 
|  | 310     columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns! | 
|  | 311     allout = sys.argv[5] | 
|  | 312     height = int(sys.argv[6]) | 
|  | 313     width = int(sys.argv[7]) | 
|  | 314     logscale = (sys.argv[8].lower() == 'true') | 
|  | 315     outid = sys.argv[9] # this is used to allow multiple output files | 
|  | 316     outdir = sys.argv[10] | 
|  | 317     nan_row = False | 
|  | 318     rows = [] | 
|  | 319     for i, line in enumerate( file( sys.argv[1] ) ): | 
|  | 320         # Skip comments | 
|  | 321         if  line.startswith( '#' ) or ( i == 0 ): | 
|  | 322             if i == 0: | 
|  | 323                  head = line.strip().split("\t") | 
|  | 324             continue | 
|  | 325         if len(line.strip()) == 0: | 
|  | 326             continue | 
|  | 327         # Extract values and convert to floats | 
|  | 328         fields = line.strip().split( "\t" ) | 
|  | 329         row = [] | 
|  | 330         nan_row = False | 
|  | 331         for column in columns: | 
|  | 332             if len( fields ) <= column: | 
|  | 333                 return fail( "No column %d on line %d: %s" % ( column, i, fields ) ) | 
|  | 334             val = fields[column] | 
|  | 335             if val.lower() == "na": | 
|  | 336                 nan_row = True | 
|  | 337             else: | 
|  | 338                 try: | 
|  | 339                     row.append( float( fields[column] ) ) | 
|  | 340                 except ValueError: | 
|  | 341                     return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) ) | 
|  | 342         if not nan_row: | 
|  | 343            rows.append( row ) | 
|  | 344     if i > 1: | 
|  | 345        i = i-1 # remove header row from count | 
|  | 346     if head == None: | 
|  | 347        head = ['Col%d' % (x+1) for x in columns] | 
|  | 348     R = [] | 
|  | 349     for c,column in enumerate(columns): # we appended each column in turn | 
|  | 350         outname = allout | 
|  | 351         if c > 0: # after first time | 
|  | 352             outname = 'primary_%s_col%s_visible_pdf' % (outid,column) | 
|  | 353             outname = os.path.join(outdir,outname) | 
|  | 354         dat = [] | 
|  | 355         nrows = len(rows) # sometimes lots of NA's!! | 
|  | 356         for arow in rows: | 
|  | 357            dat.append(arow[c]) # remember, we appended each col in turn! | 
|  | 358         cname = head[column] | 
|  | 359         makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname), | 
|  | 360                    xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir) | 
|  | 361 | 
|  | 362 | 
|  | 363 | 
|  | 364 if __name__ == "__main__": | 
|  | 365     main() |