diff tools/rgenetics/rgQQ.py @ 0:9071e359b9a3

Uploaded
author xuebing
date Fri, 09 Mar 2012 19:37:19 -0500
parents
children
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/rgenetics/rgQQ.py	Fri Mar 09 19:37:19 2012 -0500
@@ -0,0 +1,365 @@
+"""
+oct 2009 - multiple output files 
+Dear Matthias,
+
+Yes, you can define number of outputs dynamically in Galaxy. For doing
+this, you'll have to declare one output dataset in your xml and pass
+its ID ($out_file.id) to your python script. Also, set
+force_history_refresh="True" in your tool tag in xml, like this:
+<tool id="split1" name="Split" force_history_refresh="True">
+In your script, if your outputs are named in the following format,
+primary_associatedWithDatasetID_designation_visibility_extension
+(_DBKEY), all your datasets will show up in the history pane.
+associatedWithDatasetID is the $out_file.ID passed from xml,
+designation will be a unique identifier for each output (set in your
+script),
+visibility can be set to visible if you want the dataset visible in
+your history, or notvisible otherwise
+extension is the required format for your dataset (bed, tabular, fasta
+etc)
+DBKEY is optional, and can be set if required (e.g. hg18, mm9 etc)
+
+One of our tools "MAF to Interval converter" (tools/maf/
+maf_to_interval.xml) already uses this feature. You can use it as a
+reference.
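The naming convention described above can be sketched as a small helper; the dataset id, designation, and extension values here are illustrative assumptions, not part of this tool.

```python
import os

def galaxy_output_name(dataset_id, designation, extension,
                       visible=True, dbkey=None, new_file_path='.'):
    # Build primary_associatedWithDatasetID_designation_visibility_extension(_DBKEY)
    parts = ['primary', str(dataset_id), designation,
             'visible' if visible else 'notvisible', extension]
    if dbkey:
        parts.append(dbkey)
    return os.path.join(new_file_path, '_'.join(parts))
```

The same pattern (without the optional DBKEY) is what main() below constructs for each extra column plotted.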
+
+qq.chisq Quantile-quantile plot for chi-squared tests
+Description
+This function plots ranked observed chi-squared test statistics against the corresponding expected
+order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed
+means of observed and expected values. This is useful for inspecting the results of whole-genome
+association studies for overdispersion due to population substructure and other sources of bias or
+confounding.
+Usage
+qq.chisq(x, df=1, x.max, main="QQ plot",
+sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
+xlab="Expected", ylab="Observed",
+conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
+slope.one=FALSE, slope.lambda=FALSE,
+thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
+Arguments
+x A vector of observed chi-squared test values
+df The degrees of freedom for the tests
+x.max If present, truncate the observed value (Y) axis here
+main The main heading
+sub The subheading
+xlab x-axis label (default "Expected")
+ylab y-axis label (default "Observed")
+conc Lower and upper probability bounds for the concentration band for the plot. Set this
+to NA to suppress the band
+overdisp If TRUE, an overdispersion factor, lambda, will be estimated and used in calculating
+concentration band
+trim Quantile point for trimmed mean calculations for estimation of lambda. Default
+is to trim at the median
+slope.one Is a line of slope one to be superimposed?
+slope.lambda Is a line of slope lambda to be superimposed?
+thin A pair of numbers indicating how points will be thinned before plotting (see
+Details). If NA, no thinning will be carried out
+oor.pch Observed values greater than x.max are plotted at x.max. This argument sets
+the plotting symbol to be used for out-of-range observations
+col.shade The colour with which the concentration band will be filled
+... Further graphical parameter settings to be passed to points()
+
+Details
+To reduce plotting time and the size of plot files, the smallest observed and expected points are
+thinned so that only a reduced number of (approximately equally spaced) points are plotted. The
+precise behaviour is controlled by the parameter thin, whose value should be a pair of numbers.
+The first number must lie between 0 and 1 and sets the proportion of the X axis over which thinning
+is to be applied. The second number should be an integer and sets the maximum number of points
+to be plotted in this section.
+The "concentration band" for the plot is shown in grey. This region is defined by upper and lower
+probability bounds for each order statistic. The default is to use the 2.5% and 97.5% points. Note
+that this is not a simultaneous confidence region; the probability that the plot will stray outside
+the band at some point exceeds 95%.
+When required, the dispersion factor is estimated by the ratio of the observed trimmed mean to its
+expected value under the chi-squared assumption.
+Value
+The function returns the number of tests, the number of values omitted from the plot (greater than
+x.max), and the estimated dispersion factor, lambda.
+Note
+All tests must have the same number of degrees of freedom. If this is not the case, I suggest
+transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
+Author(s)
+David Clayton <david.clayton@cimr.cam.ac.uk>
+References
+Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
+"""
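The Note above recommends transforming mixed-df results to p values and plotting -2log(p) as chi-squared on 2 df; a minimal sketch of that transform, using natural logs:

```python
import math

def pvals_to_chisq_2df(pvals):
    # -2 ln(p) is chi-squared distributed on 2 df when p is uniform on (0,1]
    return [-2.0 * math.log(p) for p in pvals]
```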
+
+import sys, random, math, copy, os, subprocess, tempfile
+from rgutils import RRun, rexe
+
+
+def fail(msg):
+    """Report a fatal error on stdout and exit (used below when a column is missing or non-numeric)."""
+    print >> sys.stdout, msg
+    sys.exit(1)
+
+rqq = """
+# modified by ross lazarus for the rgenetics project may 2000
+# makes a pdf for galaxy from an x vector of chisquare values
+# from snpMatrix
+# http://www.bioconductor.org/packages/bioc/html/snpMatrix.html
+ qq.chisq <-
+  function(x, df=1, x.max,
+    main="QQ plot",
+    sub=paste("Expected distribution: chi-squared (",df," df)", sep=""),
+    xlab="Expected", ylab="Observed",
+    conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5,
+    slope.one=T, slope.lambda=FALSE,
+    thin=c(0.5,200), oor.pch=24, col.shade="gray", ofname="qqchi.pdf",
+    h=6,w=6,printpdf=F,...) {
+
+    # Function to shade concentration band
+
+    shade <- function(x1, y1, x2, y2, color=col.shade) {
+      n <- length(x2)
+      polygon(c(x1, x2[n:1]), c(y1, y2[n:1]), border=NA, col=color)
+    }
+
+    # Sort values and see how many out of range
+
+    obsvd <- sort(x, na.last=NA)
+    N <- length(obsvd)
+    if (missing(x.max)) {
+      Np <- N
+    }
+    else {
+      Np <- sum(obsvd<=x.max)
+    }
+    if(Np==0)
+      stop("Nothing to plot")
+
+    # Expected values
+
+    if (df==2) {
+      expctd <- 2*cumsum(1/(N:1))
+    }
+    else {
+      expctd <- qchisq(p=(1:N)/(N+1), df=df)
+    }
+
+    # Concentration bands
+
+    if (!is.null(conc)) {
+      if(conc[1]>0) {
+        e.low <- qchisq(p=qbeta(conc[1], 1:N, N:1), df=df)
+      }
+      else {
+        e.low <- rep(0, N)
+      }
+      if (conc[2]<1) {
+        e.high <- qchisq(p=qbeta(conc[2], 1:N, N:1), df=df)
+      }
+      else {
+        e.high <- 1.1*rep(max(x),N)
+      }
+    }
+
+    # Plot outline
+
+    if (Np < N)
+      top <- x.max
+    else
+      top <- obsvd[N]
+    right <- expctd[N]
+    if (printpdf) {pdf(ofname,h,w)}
+    plot(c(0, right), c(0, top), type="n", xlab=xlab, ylab=ylab,
+         main=main, sub=sub)
+
+    # Thinning
+
+    if (is.na(thin[1])) {
+      show <- 1:Np
+    }
+    else if (length(thin)!=2 || thin[1]<0 || thin[1]>1 || thin[2]<1) {
+      warning("invalid thin parameter; no thinning carried out")
+      show <- 1:Np
+    }
+    else {
+      space <- right*thin[1]/floor(thin[2])
+      iat <- round((N+1)*pchisq(q=(1:floor(thin[2]))*space, df=df))
+      if (max(iat)>thin[2])
+        show <- unique(c(iat, (1+max(iat)):Np))
+      else
+        show <- 1:Np
+    }
+    Nu <- floor(trim*N)
+    if (Nu>0)
+      lambda <- mean(obsvd[1:Nu])/mean(expctd[1:Nu])
+    if (!is.null(conc)) {
+      if (Np<N)
+        vert <- c(show, (Np+1):N)
+      else
+        vert <- show
+      if (overdisp)
+        shade(expctd[vert], lambda*e.low[vert],
+              expctd[vert], lambda*e.high[vert])
+      else
+        shade(expctd[vert], e.low[vert], expctd[vert], e.high[vert])
+    }
+    points(expctd[show], obsvd[show], ...)
+    # Overflow
+    if (Np<N) {
+      over <- (Np+1):N
+      points(expctd[over], rep(x.max, N-Np), pch=oor.pch)
+    }
+    # Lines
+    line.types <- c("solid", "dashed", "dotted")
+    key <- NULL
+    txt <- NULL
+    if (slope.one) {
+      key <- c(key, line.types[1])
+      txt <- c(txt, "y = x")
+      abline(a=0, b=1, lty=line.types[1])
+    }
+    if (slope.lambda && Nu>0) {
+      key <- c(key, line.types[2])
+      txt <- c(txt, paste("y = ", format(lambda, digits=4), "x", sep=""))
+      if (!is.null(conc)) {
+        if (Np<N)
+          vert <- c(show, (Np+1):N)
+        else
+          vert <- show
+      }
+      abline(a=0, b=lambda, lty=line.types[2])
+    }
+    if (printpdf) {dev.off()}
+    # Returned value
+
+    if (!is.null(key))
+       legend(0, top, legend=txt, lty=key)
+    c(N=N, omitted=N-Np, lambda=lambda)
+
+  }
+
+"""
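The lambda estimate inside qq.chisq (the ratio of trimmed means of the sorted observed and expected vectors, Nu = floor(trim*N) values each) can be mirrored in Python; this is a sketch of the same arithmetic, not a replacement for the R code.

```python
def estimate_lambda(observed, expected, trim=0.5):
    # Ratio of trimmed means of the sorted vectors, as in qq.chisq:
    # lambda <- mean(obsvd[1:Nu]) / mean(expctd[1:Nu])
    obs = sorted(observed)
    exp_vals = sorted(expected)
    nu = int(trim * len(obs))
    if nu == 0:
        return float('nan')
    return (sum(obs[:nu]) / float(nu)) / (sum(exp_vals[:nu]) / float(nu))
```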
+
+
+def makeQQ(dat=[], sample=1.0, maxveclen=4000, fname='fname',title='title',
+           xvar='Sample',h=8,w=8,logscale=True,outdir=None):
+    """
+    Draw a QQ plot of dat against uniform quantiles, assuming the
+    values are p values in [0,1]. When sampling, oversample the
+    smallest (most significant) values so the low tail is preserved.
+    """
+    R = []
+    colour="maroon"
+    nrows = len(dat)
+    dat.sort() # small to large
+    fn = float(nrows)
+    unifx = [x/fn for x in range(1,(nrows+1))]
+    if logscale:
+        unifx = [-math.log10(x) for x in unifx] # uniform distribution
+    if sample < 1.0 and len(dat) > maxveclen:
+        # now have half a million markers eg - too many to plot all for a pdf - sample to get 10k or so points
+        # oversample part of the distribution
+        always = min(1000,nrows/20) # oversample smaller of lowest few hundred items or 5%
+        skip = int(nrows/float(maxveclen)) # take 1 in skip to get about maxveclen points
+        if skip <= 1:
+            skip = 2
+        samplei = [i for i in range(nrows) if (i < always) or (i % skip == 0)]
+        # always oversample first sorted (here lowest) values
+        yvec = [dat[i] for i in samplei] # always get first and last
+        xvec = [unifx[i] for i in samplei] # and sample xvec same way
+        maint='QQ %s (random %d of %d)' % (title,len(yvec),nrows)
+    else:
+        yvec = [x for x in dat] 
+        maint='QQ %s (n=%d)' % (title,nrows)
+        xvec = unifx
+    if logscale:
+        maint = 'Log%s' % maint
+        mx = [0,math.log10(nrows)] # if 1000, becomes 3 for the null line
+        ylab = '-log10(%s) Quantiles' % title
+        xlab = '-log10(Uniform 0-1) Quantiles'
+        yvec = [-math.log10(x) for x in yvec if x > 0.0]
+    else:
+        mx = [0,1]
+        ylab = '%s Quantiles' % title
+        xlab = 'Uniform 0-1 Quantiles'
+
+    xv = ['%f' % x for x in xvec]
+    R.append('xvec = c(%s)' % ','.join(xv))
+    yv = ['%f' % x for x in yvec]
+    R.append('yvec = c(%s)' % ','.join(yv))
+    R.append('mx = c(%f,%f)' % (mx[0],mx[1])) # null line endpoints computed above for log or linear scale
+    R.append('pdf("%s",h=%d,w=%d)' % (fname,h,w))
+    R.append("par(lab=c(10,10,10))")
+    R.append('qqplot(xvec,yvec,xlab="%s",ylab="%s",main="%s",sub="%s",pch=19,col="%s",cex=0.8)' % (xlab,ylab,maint,title,colour))
+    R.append('points(mx,mx,type="l")')
+    R.append('grid(col="lightgray",lty="dotted")')
+    R.append('dev.off()')
+    RRun(rcmd=R,title='makeQQplot',outdir=outdir)
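The oversampling step in makeQQ (keep every point in the low tail, thin the rest) can be isolated as follows; the 1000-point floor, 1-in-20 cap, and maxveclen default mirror the constants above.

```python
def sample_indices(nrows, maxveclen=4000):
    # Keep the lowest `always` (most significant) points unconditionally,
    # then take every skip-th point from the remainder.
    always = min(1000, nrows // 20)
    skip = max(2, int(nrows / float(maxveclen)))
    return [i for i in range(nrows) if i < always or i % skip == 0]
```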
+
+
+
+def main():
+    u = """<command interpreter="python">
+        rgQQ.py "$input1" "$name" $sample "$cols" $allqq $height $width $logtrans $allqq.id $__new_file_path__
+    </command>
+    """
+    print >> sys.stdout,'## rgQQ.py. cl=',sys.argv
+    npar = 11
+    if len(sys.argv) < npar:
+            print >> sys.stdout, '## error - too few command line parameters - wanting %d' % npar
+            print >> sys.stdout, u
+            sys.exit(1)
+    in_fname = sys.argv[1]
+    name = sys.argv[2]
+    sample = float(sys.argv[3])
+    head = None
+    columns = [int(x) for x in sys.argv[4].strip().split(',')] # work with python columns!
+    allout = sys.argv[5]
+    height = int(sys.argv[6])
+    width = int(sys.argv[7])
+    logscale = (sys.argv[8].lower() == 'true')
+    outid = sys.argv[9] # this is used to allow multiple output files 
+    outdir = sys.argv[10]
+    nan_row = False
+    rows = []
+    for i, line in enumerate( file( sys.argv[1] ) ):
+        # Skip comments
+        if  line.startswith( '#' ) or ( i == 0 ):
+            if i == 0:
+                 head = line.strip().split("\t")
+            continue
+        if len(line.strip()) == 0:
+            continue
+        # Extract values and convert to floats
+        fields = line.strip().split( "\t" )
+        row = []
+        nan_row = False
+        for column in columns:
+            if len( fields ) <= column:
+                return fail( "No column %d on line %d: %s" % ( column, i, fields ) )
+            val = fields[column]
+            if val.lower() == "na":
+                nan_row = True
+            else:
+                try:
+                    row.append( float( fields[column] ) )
+                except ValueError:
+                    return fail( "Value '%s' in column %d on line %d is not numeric" % ( fields[column], column+1, i ) )
+        if not nan_row:
+           rows.append( row )
+    if i > 1:
+       i = i-1 # remove header row from count
+    if head is None:
+       head = ['Col%d' % (x+1) for x in columns]
+    R = []
+    for c,column in enumerate(columns): # we appended each column in turn
+        outname = allout
+        if c > 0: # after first time
+            outname = 'primary_%s_col%s_visible_pdf' % (outid,column)
+            outname = os.path.join(outdir,outname)
+        dat = []
+        nrows = len(rows) # sometimes lots of NA's!!
+        for arow in rows:
+           dat.append(arow[c]) # remember, we appended each col in turn!
+        cname = head[column]        
+        makeQQ(dat=dat,sample=sample,fname=outname,title='%s_%s' % (name,cname),
+                   xvar=cname,h=height,w=width,logscale=logscale,outdir=outdir)
+
+
+
+if __name__ == "__main__":
+    main()