Mercurial > repos > xuebing > sharplabtool
diff tools/rgenetics/rgQQ.py @ 0:9071e359b9a3
Uploaded
author | xuebing |
---|---|
date | Fri, 09 Mar 2012 19:37:19 -0500 |
parents | |
children |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/rgenetics/rgQQ.py Fri Mar 09 19:37:19 2012 -0500 @@ -0,0 +1,365 @@ +""" +oct 2009 - multiple output files +Dear Matthias, + +Yes, you can define number of outputs dynamically in Galaxy. For doing +this, you'll have to declare one output dataset in your xml and pass +its ID ($out_file.id) to your python script. Also, set +force_history_refresh="True" in your tool tag in xml, like this: +<tool id="split1" name="Split" force_history_refresh="True"> +In your script, if your outputs are named in the following format, +primary_associatedWithDatasetID_designation_visibility_extension +(_DBKEY), all your datasets will show up in the history pane. +associatedWithDatasetID is the $out_file.ID passed from xml, +designation will be a unique identifier for each output (set in your +script), +visibility can be set to visible if you want the dataset visible in +your history, or notvisible otherwise +extension is the required format for your dataset (bed, tabular, fasta +etc) +DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc) + +One of our tools "MAF to Interval converter" (tools/maf/ +maf_to_interval.xml) already uses this feature. You can use it as a +reference. + +qq.chisq Quantile-quantile plot for chi-squared tests +Description +This function plots ranked observed chi-squared test statistics against the corresponding expected +order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed +means of observed and expected values. This is useful for inspecting the results of whole-genome +association studies for overdispersion due to population substructure and other sources of bias or +confounding. +Usage +qq.chisq(x, df=1, x.max, main="QQ plot", +sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), +xlab="Expected", ylab="Observed", +conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, +slope.one=FALSE, slope.lambda=FALSE, +thin=c(0.25,50), oor.pch=24, col.shade="gray", ...) +Arguments +x A vector of observed chi-squared test values +df The degreees of freedom for the tests +x.max If present, truncate the observed value (Y) axis here +main The main heading +sub The subheading +xlab x-axis label (default "Expected") +ylab y-axis label (default "Observed") +conc Lower and upper probability bounds for concentration band for the plot. Set this +to NA to suppress this +overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating +concentration band +trim Quantile point for trimmed mean calculations for estimation of lambda. Default +is to trim at the median +slope.one Is a line of slope one to be superimpsed? +slope.lambda Is a line of slope lambda to be superimposed? +thin A pair of numbers indicating how points will be thinned before plotting (see +Details). If NA, no thinning will be carried out +oor.pch Observed values greater than x.max are plotted at x.max. This argument sets +the plotting symbol to be used for out-of-range observations +col.shade The colour with which the concentration band will be filled +... Further graphical parameter settings to be passed to points() + +Details +To reduce plotting time and the size of plot files, the smallest observed and expected points are +thinned so that only a reduced number of (approximately equally spaced) points are plotted. The +precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers. +The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning +is to be applied. The second number should be an integer and sets the maximum number of points +to be plotted in this section. +The "concentration band" for the plot is shown in grey. This region is defined by upper and lower +probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a +simultaneous confidence region; the probability that the plot will stray outside the band at some +point exceeds 95 +When required, he dispersion factor is estimated by the ratio of the observed trimmed mean to its +expected value under the chi-squared assumption. +Value +The function returns the number of tests, the number of values omitted from the plot (greater than +x.max), and the estimated dispersion factor, lambda. +Note +All tests must have the same number of degrees of freedom. If this is not the case, I suggest +transforming to p-values and then plotting -2log(p) as chi-squared on 2 df. +Author(s) +David Clayton hdavid.clayton@cimr.cam.ac.uki +References +Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004 +""" + +import sys, random, math, copy,os, subprocess, tempfile +from rgutils import RRun, rexe + +rqq = """ +# modified by ross lazarus for the rgenetics project may 2000 +# makes a pdf for galaxy from an x vector of chisquare values +# from snpMatrix +# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html + qq.chisq <- + function(x, df=1, x.max, + main="QQ plot", + sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), + xlab="Expected", ylab="Observed", + conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, + slope.one=T, slope.lambda=FALSE, + thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf", + h=6,w=6,printpdf=F,...) { + + # Function to shade concentration band + + shade <- function(x1, y1, x2, y2, color=col.shade) { + n <- length(x2) + polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color) + } + + # Sort values and see how many out of range + + obsvd <- sort(x, na.last=NA) + N <- length(obsvd) + if (missing(x.max)) { + Np <- N + } + else { + Np <- sum(obsvd<=x.max) + } + if(Np==0) + stop("Nothing to plot") + + # Expected values + + if (df==2) { + expctd <- 2*cumsum(1/(N:1)) + } + else { + expctd <- qchisq(p=(1:N)/(N+1), df=df) + } + + # Concentration bands + + if (!is.null(conc)) { + if(conc[1]>0) { + e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df) + } + else { + e.low <- rep(0, N) + } + if (conc[2]<1) { + e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df) + } + else { + e.high <- 1.1*rep(max(x),N) + } + } + + # Plot outline + + if (Np < N) + top <- x.max + else + top <- obsvd[N] + right <- expctd[N] + if (printpdf) {pdf(ofname,h,w)} + plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab, + main=main, sub=sub) + + # Thinning + + if (is.na(thin[1])) { + show <- 1:Np + } + else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) { + warning("invalid thin parameter; no thinning carried out") + show <- 1:Np + } + else { + space <- right*thin[1]/floor(thin[2]) + iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df)) + if (max(iat)>thin[2]) + show <- unique(c(iat, (1+max(iat)):Np)) + else + show <- 1:Np + } + Nu <- floor(trim*N) + if (Nu>0) + lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu]) + if (!is.null(conc)) { + if (Np<N) + vert <- c(show, (Np+1):N) + else + vert <- show + if (overdisp) + shade(expctd[vert], lambda*e.low[vert], + expctd[vert], lambda*e.high[vert]) + else + shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert]) + } + points(expctd[show], obsvd[show], ...) + # Overflow + if (Np<N) { + over <- (Np+1):N + points(expctd[over], rep(x.max, N-Np), pch=oor.pch) + } + # Lines + line.types <- c("solid", "dashed", "dotted") + key <- NULL + txt <- NULL + if (slope.one) { + key <- c(key, line.types[1]) + txt <- c(txt, "y = x") + abline(a=0, b=1, lty=line.types[1]) + } + if (slope.lambda && Nu>0) { + key <- c(key, line.types[2]) + txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep="")) + if (!is.null(conc)) { + if (Np<N) + vert <- c(show, (Np+1):N) + else + vert <- show + } + abline(a=0, b=lambda, lty=line.types[2]) + } + if (printpdf) {dev.off()} + # Returned value + + if (!is.null(key)) + legend(0, top, legend=txt, lty=key) + c(N=N, omitted=N-Np, lambda=lambda) + + } + +""" + + + + +def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title', + xvar='Sample',h=8,w=8,logscale=True,outdir=None): + """ + y is data for a qq plot and ends up on the x axis go figure + if sampling, oversample low values - all the top 1% ? + assume we have 0-1 p values + """ + R = [] + colour="maroon" + nrows = len(dat) + dat.sort() # small to large + fn = float(nrows) + unifx = [x/fn for x in range(1,(nrows+1))] + if logscale: + unifx = [-math.log10(x) for x in unifx] # uniform distribution + if sample < 1.0 and len(dat) > maxveclen: + # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points + # oversample part of the distribution + always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5% + skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points + if skip <= 1: + skip = 2 + samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)] + # always oversample first sorted (here lowest) values + yvec = [dat[i] for i in samplei] # always get first and last + xvec = [unifx[i] for i in samplei] # and sample xvec same way + maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows) + else: + yvec = [x for x in dat] + maint='QQ %s (n=%d)' % (title,nrows) + xvec = unifx + if logscale: + maint = 'Log%s' % maint + mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line + ylab = '-log10(%s) Quantiles' % title + xlab = '-log10(Uniform 0-1) Quantiles' + yvec = [-math.log10(x) for x in yvec if x > 0.0] + else: + mx = [0,1] + ylab = '%s Quantiles' % title + xlab = 'Uniform 0-1 Quantiles' + + xv = ['%f' % x for x in xvec] + R.append('xvec = c(%s)' % ','.join(xv)) + yv = ['%f' % x for x in yvec] + R.append('yvec = c(%s)' % ','.join(yv)) + R.append('mx = c(0,%f)' % (math.log10(fn))) + R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w)) + R.append("par(lab=c(10,10,10))") + R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour)) + R.append('points(mx,mx,type="l")') + R.append('grid(col="lightgray",lty="dotted")') + R.append('dev.off()') + RRun(rcmd=R,title='makeQQplot',outdir=outdir) + + + +def main(): + u = """ + """ + u = """<command interpreter="python"> + rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__ + </command> + + </command> + """ + print >> sys.stdout,'## rgQQ.py. cl=',sys.argv + npar = 11 + if len(sys.argv) < npar: + print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar + print >> sys.stdout, u + sys.exit(1) + in_fname = sys.argv[1] + name = sys.argv[2] + sample = float(sys.argv[3]) + head = None + columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns! + allout = sys.argv[5] + height = int(sys.argv[6]) + width = int(sys.argv[7]) + logscale = (sys.argv[8].lower() == 'true') + outid = sys.argv[9] # this is used to allow multiple output files + outdir = sys.argv[10] + nan_row = False + rows = [] + for i, line in enumerate( file( sys.argv[1] ) ): + # Skip comments + if line.startswith( '#' ) or ( i == 0 ): + if i == 0: + head = line.strip().split("\t") + continue + if len(line.strip()) == 0: + continue + # Extract values and convert to floats + fields = line.strip().split( "\t" ) + row = [] + nan_row = False + for column in columns: + if len( fields ) <= column: + return fail( "No column %d on line %d: %s" % ( column, i, fields ) ) + val = fields[column] + if val.lower() == "na": + nan_row = True + else: + try: + row.append( float( fields[column] ) ) + except ValueError: + return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) ) + if not nan_row: + rows.append( row ) + if i > 1: + i = i-1 # remove header row from count + if head == None: + head = ['Col%d' % (x+1) for x in columns] + R = [] + for c,column in enumerate(columns): # we appended each column in turn + outname = allout + if c > 0: # after first time + outname = 'primary_%s_col%s_visible_pdf' % (outid,column) + outname = os.path.join(outdir,outname) + dat = [] + nrows = len(rows) # sometimes lots of NA's!! + for arow in rows: + dat.append(arow[c]) # remember, we appended each col in turn! + cname = head[column] + makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname), + xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir) + + + +if __name__ == "__main__": + main()