diff old.xml @ 34:c6fdf2c6d0f4 draft
Citations added (thanks John!) and a few more output formats for Alistair Chilcott
author | fubar |
---|---|
date | Thu, 28 Aug 2014 02:33:05 -0400 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/old.xml Thu Aug 28 02:33:05 2014 -0400 @@ -0,0 +1,835 @@ +<tool id="rgedgeRpaired" name="edgeR" version="0.20"> + <description>1 or 2 level models for count data</description> + <requirements> + <requirement type="package" version="2.12">biocbasics</requirement> + <requirement type="package" version="3.0.1">package_r3</requirement> + </requirements> + + <command interpreter="python"> + rgToolFactory.py --script_path "$runme" --interpreter "Rscript" --tool_name "edgeR" + --output_dir "$html_file.files_path" --output_html "$html_file" --output_tab "$outtab" --make_HTML "yes" + </command> + <inputs> + <param name="input1" type="data" format="tabular" label="Select an input matrix - rows are contigs, columns are counts for each sample" + help="Use the HTSeq based count matrix preparation tool to create these matrices from BAM/SAM files and a GTF file of genomic features"/> + <param name="title" type="text" value="edgeR" size="80" label="Title for job outputs" help="Supply a meaningful name here to remind you what the outputs contain"> + <sanitizer invalid_char=""> + <valid initial="string.letters,string.digits"><add value="_" /> </valid> + </sanitizer> + </param> + <param name="treatment_name" type="text" value="Treatment" size="50" label="Treatment Name"/> + <param name="Treat_cols" label="Select columns containing treatment." type="data_column" data_ref="input1" numerical="True" + multiple="true" use_header_names="true" size="120" display="checkboxes"> + <validator type="no_options" message="Please select at least one column."/> + </param> + <param name="control_name" type="text" value="Control" size="50" label="Control Name"/> + <param name="Control_cols" label="Select columns containing control." 
type="data_column" data_ref="input1" numerical="True" + multiple="true" use_header_names="true" size="120" display="checkboxes" optional="true"> + </param> + <param name="subjectids" type="text" optional="true" size="120" value = "" + label="IF SUBJECTS NOT ALL INDEPENDENT! Enter integers to indicate sample pairing for every column in input" + help="Leave blank if no pairing, but eg if data from sample id A99 is in columns 2,4 and id C21 is in 3,5 then enter '1,2,1,2'"> + <sanitizer> + <valid initial="string.digits"><add value="," /> </valid> + </sanitizer> + </param> + <param name="fQ" type="float" value="0.3" size="5" label="Non-differential contig count quantile threshold - zero to analyze all non-zero read count contigs" + help="May be a good or a bad idea depending on the biology and the question. EG 0.3 = sparsest 30% of contigs with at least one read are removed before analysis"/> + <param name="useNDF" type="boolean" truevalue="T" falsevalue="F" checked="false" size="1" + label="Non differential filter - remove contigs below a threshold (1 per million) for half or more samples" + help="May be a good or a bad idea depending on the biology and the question. This was the old default. Quantile based is available as an alternative"/> + <conditional name="DESeq"> + <param name="doDESeq" type="select" + label="Run the same model with DESeq2 and compare findings" + help="DESeq2 is an update to the DESeq package. 
It uses different assumptions and methods to edgeR"> + <option value="F" selected="true">Do not run DESeq2</option> + <option value="T">Run DESeq2 (only works if NO second GLM factor supplied at present)</option> + </param> + <when value="T"> + <param name="DESeq_fitType" type="select"> + <option value="parametric" selected="true">Parametric (default) fit for dispersions</option> + <option value="local">Local fit - use this if parametric fails</option> + <option value="mean">Mean dispersion fit- use this if you really understand what you're doing - read the fine manual</option> + </param> + </when> + <when value="F"> </when> + </conditional> + <param name="doVoom" type="boolean" truevalue="T" checked='false' falsevalue="F" size="1" label="Run the same model with VOOM transformation and limma."/> + <conditional name="camera"> + <param name="doCamera" type="select" label="Run the edgeR implementation of Camera GSEA for up/down gene sets" + help="If yes, you can choose a set of genesets to test and/or supply a gmt format geneset collection from your history"> + <option value="F" selected="true">Do not run GSEA tests with the Camera algorithm</option> + <option value="T">Run GSEA tests with the Camera algorithm</option> + </param> + <when value="T"> + <conditional name="gmtSource"> + <param name="refgmtSource" type="select" + label="Use a gene set (.gmt) from your history and/or use a built-in (MSigDB etc) gene set"> + <option value="indexed" selected="true">Use a built-in gene set</option> + <option value="history">Use a gene set from my history</option> + <option value="both">Add a gene set from my history to a built in gene set</option> + </param> + <when value="indexed"> + <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis"> + <options from_data_table="gseaGMT_3.1"> + <filter type="sort_by" column="2" /> + <validator type="no_options" message="No GMT v3.1 files are available - please install them"/> + 
</options> + </param> + </when> + <when value="history"> + <param name="ownGMT" type="data" format="gmt" label="Select a Gene Set from your history" /> + </when> + <when value="both"> + <param name="ownGMT" type="data" format="gseagmt" label="Select a Gene Set from your history" /> + <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis"> + <options from_data_table="gseaGMT_3.1"> + <filter type="sort_by" column="2" /> + <validator type="no_options" message="No GMT v3.1 files are available - please fix tool_data_table and loc files"/> + </options> + </param> + </when> + </conditional> + </when> + <when value="F"> + </when> + </conditional> + <param name="priordf" type="integer" value="20" size="3" label="prior.df for tagwise dispersion - lower value = more emphasis on each tag's variance. Replaces prior.n and prior.df = prior.n * residual.df" + help="0 = Use edgeR default. Use a small value to 'smooth' small samples. See edgeR docs and note below"/> + <param name="fdrthresh" type="float" value="0.05" size="5" label="P value threshold for FDR filtering for family wise error rate control" + help="Conventional default value of 0.05 recommended"/> + <param name="fdrtype" type="select" label="FDR (Type I error) control method" + help="Use fdr or bh typically to control for the number of tests in a reliable way"> + <option value="fdr" selected="true">fdr</option> + <option value="BH">Benjamini Hochberg</option> + <option value="BY">Benjamini Yekutieli</option> + <option value="bonferroni">Bonferroni</option> + <option value="hochberg">Hochberg</option> + <option value="holm">Holm</option> + <option value="hommel">Hommel</option> + <option value="none">no control for multiple tests</option> + </param> + </inputs> + <outputs> + <data format="tabular" name="outtab" label="${title}.xls"/> + <data format="html" name="html_file" label="${title}.html"/> + </outputs> + <stdio> + <exit_code range="4" level="fatal" 
description="Number of subject ids must match total number of samples in the input matrix" /> + </stdio> + <tests> +<test> +<param name='input1' value='test_bams2mx.xls' ftype='tabular' /> + <param name='treatment_name' value='case' /> + <param name='title' value='edgeRtest' /> + <param name='useNDF' value='' /> + <param name='fdrtype' value='fdr' /> + <param name='priordf' value="0" /> + <param name='fdrthresh' value="0.05" /> + <param name='control_name' value='control' /> + <param name='subjectids' value='' /> + <param name='Treat_cols' value='3,4,5,9' /> + <param name='Control_cols' value='2,6,7,8' /> + <output name='outtab' file='edgeRtest1out.xls' compare='diff' /> + <output name='html_file' file='edgeRtest1out.html' compare='diff' lines_diff='20' /> +</test> +</tests> + +<configfiles> +<configfile name="runme"> +<![CDATA[ +## +## edgeR.Rscript +## updated npv 2011 for R 2.14.0 and edgeR 2.4.0 by ross +## Performs DGE on a count table containing n replicates of two conditions +## +### Original edgeR code by: S.Lunke and A.Kaspi +reallybig = log10(.Machine\$double.xmax) +reallysmall = log10(.Machine\$double.xmin) +library('stringr') +library('gplots') +library('edgeR') + +hmap2 = function(cmat,nsamp=100,outpdfname='heatmap2.pdf', TName='Treatment',group=NA,myTitle='title goes here') +{ + ### Perform clustering for significant pvalues after controlling FWER + samples = colnames(cmat) + gu = unique(group) + if (length(gu) == 2) { + col.map = function(g) {if (g==gu[1]) "#FF0000" else "#0000FF"} + pcols = unlist(lapply(group,col.map)) + } else { + colours = rainbow(length(gu),start=0,end=4/6) + pcols = colours[match(group,gu)] + } + gn = rownames(cmat) + dm = cmat[(! 
is.na(gn)),] + ### remove unlabelled hm rows + nprobes = nrow(dm) + if (nprobes > nsamp) { + dm =dm[1:nsamp,] + } + newcolnames = substr(colnames(dm),1,20) + colnames(dm) = newcolnames + pdf(outpdfname) + heatmap.2(dm,main=myTitle,ColSideColors=pcols,col=topo.colors(100),dendrogram="col",key=T,density.info='none', + Rowv=F,scale='row',trace='none',margins=c(8,8),cexRow=0.4,cexCol=0.5) + dev.off() +} + +hmap = function(cmat,nmeans=4,outpdfname="heatMap.pdf",nsamp=250,TName='Treatment',group=NA,myTitle="Title goes here") +{ + ## for 2 groups only was + ## col.map = function(g) {if (g==TName) "#FF0000" else "#0000FF"} + ## pcols = unlist(lapply(group,col.map)) + gu = unique(group) + colours = rainbow(length(gu),start=0.3,end=0.6) + pcols = colours[match(group,gu)] + nrows = nrow(cmat) + mtitle = paste(myTitle,'Heatmap: n contigs =',nrows) + if (nrows > nsamp) { + cmat = cmat[c(1:nsamp),] + mtitle = paste('Heatmap: Top ',nsamp,' DE contigs (of ',nrows,')',sep='') + } + newcolnames = substr(colnames(cmat),1,20) + colnames(cmat) = newcolnames + pdf(outpdfname) + heatmap(cmat,scale='row',main=mtitle,cexRow=0.3,cexCol=0.4,Rowv=NA,ColSideColors=pcols) + dev.off() +} + +qqPlot = function(descr='Title',pvector, ...) 
+## stolen from https://gist.github.com/703512 +{ + o = -log10(sort(pvector,decreasing=F)) + e = -log10( 1:length(o)/length(o) ) + o[o==-Inf] = reallysmall + o[o==Inf] = reallybig + pdfname = paste(gsub(" ","", descr , fixed=TRUE),'pval_qq.pdf',sep='_') + maint = paste(descr,'QQ Plot') + pdf(pdfname) + plot(e,o,pch=19,cex=1, main=maint, ..., + xlab=expression(Expected~~-log[10](italic(p))), + ylab=expression(Observed~~-log[10](italic(p))), + xlim=c(0,max(e)), ylim=c(0,max(o))) + lines(e,e,col="red") + grid(col = "lightgray", lty = "dotted") + dev.off() +} + +smearPlot = function(DGEList,deTags, outSmear, outMain) + { + pdf(outSmear) + plotSmear(DGEList,de.tags=deTags,main=outMain) + grid(col="blue") + dev.off() + } + +boxPlot = function(rawrs,cleanrs,maint,myTitle) +{ + nc = ncol(rawrs) + for (i in c(1:nc)) {rawrs[(rawrs[,i] < 0),i] = NA} + fullnames = colnames(rawrs) + newcolnames = substr(colnames(rawrs),1,20) + colnames(rawrs) = newcolnames + newcolnames = substr(colnames(cleanrs),1,20) + colnames(cleanrs) = newcolnames + pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"sampleBoxplot.pdf",sep='_') + defpar = par(no.readonly=T) + pdf(pdfname) + l = layout(matrix(c(1,2),1,2,byrow=T)) + print.noquote('raw contig counts by sample:') + print.noquote(summary(rawrs)) + print.noquote('normalised contig counts by sample:') + print.noquote(summary(cleanrs)) + boxplot(rawrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('Raw:',maint)) + grid(col="blue") + boxplot(cleanrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('After ',maint)) + grid(col="blue") + dev.off() + pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"samplehistplot.pdf",sep='_') + nc = ncol(rawrs) + print.noquote(paste('Using ncol rawrs=',nc)) + ncroot = round(sqrt(nc)) + if (ncroot*ncroot < nc) { ncroot = ncroot + 1 } + m = c() + for (i in c(1:nc)) { + rhist = hist(rawrs[,i],breaks=100,plot=F) + m = 
append(m,max(rhist\$counts)) + } + ymax = max(m) + pdf(pdfname) + par(mfrow=c(ncroot,ncroot)) + for (i in c(1:nc)) { + hist(rawrs[,i], main=paste("Contig logcount",i), xlab='log raw count', col="maroon", + breaks=100,sub=fullnames[i],cex=0.8,ylim=c(0,ymax)) + } + dev.off() + par(defpar) + +} + +cumPlot = function(rawrs,cleanrs,maint,myTitle) +{ + pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_') + defpar = par(no.readonly=T) + pdf(pdfname) + par(mfrow=c(2,1)) + lrs = log(rawrs,10) + lim = max(lrs) + hist(lrs,breaks=100,main=paste('Before:',maint),xlab="Reads (log)", + ylab="Count",col="maroon",sub=myTitle, xlim=c(0,lim),las=1) + grid(col="blue") + lrs = log(cleanrs,10) + hist(lrs,breaks=100,main=paste('After:',maint),xlab="Reads (log)", + ylab="Count",col="maroon",sub=myTitle,xlim=c(0,lim),las=1) + grid(col="blue") + dev.off() + par(defpar) +} + +cumPlot1 = function(rawrs,cleanrs,maint,myTitle) +{ + pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_') + pdf(pdfname) + par(mfrow=c(2,1)) + lastx = max(rawrs) + rawe = knots(ecdf(rawrs)) + cleane = knots(ecdf(cleanrs)) + cy = 1:length(cleane)/length(cleane) + ry = 1:length(rawe)/length(rawe) + plot(rawe,ry,type='l',main=paste('Before',maint),xlab="Log Contig Total Reads", + ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle) + grid(col="blue") + plot(cleane,cy,type='l',main=paste('After',maint),xlab="Log Contig Total Reads", + ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle) + grid(col="blue") + dev.off() +} + + + +doGSEA = function(y=NULL,design=NULL,histgmt="", + bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt", + ntest=0, myTitle="myTitle", outfname="GSEA.xls", minnin=5, maxnin=2000,fdrthresh=0.05,fdrtype="BH") +{ + genesets = c() + if (bigmt > "") + { + bigenesets = readLines(bigmt) + genesets = bigenesets + } + if (histgmt > "") + { + hgenesets = readLines(histgmt) + if (bigmt > "") 
{ + genesets = c(genesets,hgenesets) + } else { + genesets = hgenesets + } + } + print.noquote(paste("@@@read",length(genesets), 'genesets from',histgmt,bigmt)) + genesets = strsplit(genesets,'\t') + ##### tabular. genesetid\tURLorwhatever\tgene_1\t..\tgene_n + outf = outfname + head=paste(myTitle,'edgeR GSEA') + write(head,file=outfname,append=F) + ntest=length(genesets) + urownames = toupper(rownames(y)) + upcam = c() + downcam = c() + for (i in 1:ntest) { + gs = unlist(genesets[i]) + g = gs[1] #### geneset_id + u = gs[2] + if (u > "") { u = paste("<a href=\'",u,"\'>",u,"</a>",sep="") } + glist = gs[3:length(gs)] #### member gene symbols + glist = toupper(glist) + inglist = urownames %in% glist + nin = sum(inglist) + if ((nin > minnin) && (nin < maxnin)) { + ### print(paste('@@found',sum(inglist),'genes in glist')) + camres = camera(y=y,index=inglist,design=design) + if (!is.null(camres)) { + rownames(camres) = g + ##### gene set name + camres = cbind(GeneSet=g,URL=u,camres) + if (camres\$Direction == "Up") + { + upcam = rbind(upcam,camres) } else { + downcam = rbind(downcam,camres) + } + } + } + } + uscam = upcam[order(upcam\$PValue),] + unadjp = uscam\$PValue + uscam\$adjPValue = p.adjust(unadjp,method=fdrtype) + nup = max(10,sum((uscam\$adjPValue < fdrthresh))) + dscam = downcam[order(downcam\$PValue),] + unadjp = dscam\$PValue + dscam\$adjPValue = p.adjust(unadjp,method=fdrtype) + ndown = max(10,sum((dscam\$adjPValue < fdrthresh))) + write.table(uscam,file=paste('upCamera',outfname,sep='_'),quote=F,sep='\t',row.names=F) + write.table(dscam,file=paste('downCamera',outfname,sep='_'),quote=F,sep='\t',row.names=F) + print.noquote(paste('@@@@@ Camera up top',nup,'gene sets:')) + write.table(head(uscam,nup),file="",quote=F,sep='\t',row.names=F) + print.noquote(paste('@@@@@ Camera down top',ndown,'gene sets:')) + write.table(head(dscam,ndown),file="",quote=F,sep='\t',row.names=F) +} + + + +edgeIt = function (Count_Matrix,group,outputfilename,fdrtype='fdr',priordf=5, + 
 fdrthresh=0.05,outputdir='.', myTitle='edgeR',libSize=c(),useNDF=F, + filterquantile=0.2, subjects=c(),mydesign=NULL, + doDESeq=T,doVoom=T,doCamera=T,org='hg19', + histgmt="", bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt", + doCook=F,DESeq_fittype="parametric") +{ + if (length(unique(group))!=2){ + print("Number of conditions identified in experiment does not equal 2") + q() + } + require(edgeR) + options(width = 512) + mt = paste(unlist(strsplit(myTitle,'_')),collapse=" ") + allN = nrow(Count_Matrix) + nscut = round(ncol(Count_Matrix)/2) + colTotmillionreads = colSums(Count_Matrix)/1e6 + rawrs = rowSums(Count_Matrix) + nonzerod = Count_Matrix[(rawrs > 0),] + nzN = nrow(nonzerod) + nzrs = rowSums(nonzerod) + zN = allN - nzN + print('**** Quantiles for non-zero row counts:',quote=F) + print(quantile(nzrs,probs=seq(0,1,0.1)),quote=F) + if (useNDF == "T") + { + gt1rpin3 = rowSums(Count_Matrix/expandAsMatrix(colTotmillionreads,dim(Count_Matrix)) >= 1) >= nscut + lo = colSums(Count_Matrix[!gt1rpin3,]) + workCM = Count_Matrix[gt1rpin3,] + cleanrs = rowSums(workCM) + cleanN = length(cleanrs) + ### count the contigs removed by the filter (length(lo) would be the number of samples) + meth = paste("After removing",sum(!gt1rpin3),"contigs with fewer than",nscut,"sample read counts >= 1 per million, there are") + print(paste("Read",allN,"contigs. Removed",zN,"contigs with no reads.",meth,cleanN,"contigs"),quote=F) + maint = paste('Filter >= 1/million reads in >=',nscut,'samples') + } else { + useme = (nzrs > quantile(nzrs,filterquantile)) + workCM = nonzerod[useme,] + lo = colSums(nonzerod[!useme,]) + cleanrs = rowSums(workCM) + cleanN = length(cleanrs) + meth = paste("After filtering at count quantile =",filterquantile,", there are",sep="") + print(paste('Read',allN,"contigs. 
Removed",zN,"with no reads.",meth,cleanN,"contigs"),quote=F) + maint = paste('Filter below',filterquantile,'quantile') + } + cumPlot(rawrs=rawrs,cleanrs=cleanrs,maint=maint,myTitle=myTitle) + allgenes <- rownames(workCM) + print(paste("*** Total low count contigs per sample = ",paste(lo,collapse=',')),quote=F) + rsums = rowSums(workCM) + TName=unique(group)[1] + CName=unique(group)[2] + DGEList = DGEList(counts=workCM, group = group) + DGEList = calcNormFactors(DGEList) + + if (is.null(mydesign)) { + if (length(subjects) == 0) + { + mydesign = model.matrix(~group) + } + else { + subjf = factor(subjects) + mydesign = model.matrix(~subjf+group) + ### we block on subject so make group last to simplify finding it + } + } + print.noquote(paste('Using samples:',paste(colnames(workCM),collapse=','))) + print.noquote('Using design matrix:') + print.noquote(mydesign) + DGEList = estimateGLMCommonDisp(DGEList,mydesign) + comdisp = DGEList\$common.dispersion + DGEList = estimateGLMTrendedDisp(DGEList,mydesign) + if (priordf > 0) { + print.noquote(paste("prior.df =",priordf)) + DGEList = estimateGLMTagwiseDisp(DGEList,mydesign,prior.df = priordf) + } else { + DGEList = estimateGLMTagwiseDisp(DGEList,mydesign) + } + lastcoef=ncol(mydesign) + print.noquote(paste('*** lastcoef = ',lastcoef)) + estpriorn = getPriorN(DGEList) + predLFC1 = predFC(DGEList,prior.count=1,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList)) + predLFC3 = predFC(DGEList,prior.count=3,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList)) + predLFC5 = predFC(DGEList,prior.count=5,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList)) + DGLM = glmFit(DGEList,design=mydesign) + DE = glmLRT(DGLM) + #### always last one - subject is first if needed + logCPMnorm = cpm(DGEList,log=T,normalized.lib.sizes=T) + logCPMraw = cpm(DGEList,log=T,normalized.lib.sizes=F) + uoutput = cbind( + Name=as.character(rownames(DGEList\$counts)), + 
 DE\$table, + adj.p.value=p.adjust(DE\$table\$PValue, method=fdrtype), + Dispersion=DGEList\$tagwise.dispersion,totreads=rsums, + predLFC1=predLFC1[,lastcoef], + predLFC3=predLFC3[,lastcoef], + predLFC5=predLFC5[,lastcoef], + logCPMnorm, + DGEList\$counts + ) + soutput = uoutput[order(DE\$table\$PValue),] + heatlogcpmnorm = logCPMnorm[order(DE\$table\$PValue),] + goodness = gof(DGLM, pcutoff=fdrthresh) + ### count of outliers, not a TRUE/FALSE flag, so the message below reports the real number + noutl = sum(goodness\$outlier) + if (noutl > 0) { + print.noquote(paste('***',noutl,'GLM outliers found')) + print(paste(rownames(DGLM)[(goodness\$outlier)],collapse=','),quote=F) + } else { + print('*** No GLM fit outlier genes found') + } + z = limma::zscoreGamma(goodness\$gof.statistic, shape=goodness\$df/2, scale=2) + pdf(paste(mt,"GoodnessofFit.pdf",sep='_')) + qq = qqnorm(z, panel.first=grid(), main="tagwise dispersion") + abline(0,1,lwd=3) + points(qq\$x[goodness\$outlier],qq\$y[goodness\$outlier], pch=16, col="maroon") + dev.off() + print(paste("Common Dispersion =",comdisp,"CV = ",sqrt(comdisp),"getPriorN = ",estpriorn),quote=F) + uniqueg = unique(group) + sample_colors = match(group,levels(group)) + pdf(paste(mt,"MDSplot.pdf",sep='_')) + sampleTypes = levels(factor(group)) + print.noquote(sampleTypes) + plotMDS.DGEList(DGEList,main=paste("MDS Plot for",myTitle),cex=0.5,col=sample_colors,pch=sample_colors) + legend(x="topleft", legend = sampleTypes,col=c(1:length(sampleTypes)), pch=19) + grid(col="blue") + dev.off() + colnames(logCPMnorm) = paste( colnames(logCPMnorm),'N',sep="_") + print(paste('Raw sample CPM',paste(colSums(logCPMraw,na.rm=T),collapse=','))) + try(boxPlot(rawrs=logCPMraw,cleanrs=logCPMnorm,maint='TMM Normalisation',myTitle=myTitle)) + nreads = soutput\$totreads + print('*** writing output',quote=F) + write.table(soutput,outputfilename, quote=FALSE, sep="\t",row.names=F) + rn = row.names(workCM) + print.noquote('@@ rn') + print.noquote(head(rn)) + reg = "^chr([0-9]+):([0-9]+)-([0-9]+)" + genecards="<a 
href=\'http://www.genecards.org/index.php?path=/Search/keyword/" + ucsc = paste("<a href=\'http://genome.ucsc.edu/cgi-bin/hgTracks?db=",org,sep='') + testreg = str_match(rn,reg) + nreads = uoutput\$totreads + if (sum(!is.na(testreg[,1]))/length(testreg[,1]) > 0.8) + { + print("@@ using ucsc substitution for urls") + urls = paste0(ucsc,"&position=chr",testreg[,2],":",testreg[,3],"-",testreg[,4],"\'>",rn,"</a>") + } else { + print("@@ using genecards substitution for urls") + urls = paste0(genecards,rn,"\'>",rn,"</a>") + } + tt = uoutput + print.noquote("*** edgeR Top tags\n") + tt = cbind(tt,ntotreads=nreads,URL=urls) + tt = tt[order(DE\$table\$PValue),] + print.noquote(tt[1:50,]) + ### Plot MAplot + deTags = rownames(uoutput[uoutput\$adj.p.value < fdrthresh,]) + nsig = length(deTags) + print(paste('***',nsig,'tags significant at adj p=',fdrthresh),quote=F) + if (nsig > 0) { + print('*** deTags',quote=F) + print(head(deTags)) + } + deColours = ifelse(deTags,'red','black') + pdf(paste(mt,"BCV_vs_abundance.pdf",sep='_')) + plotBCV(DGEList, cex=0.3, main="Biological CV vs abundance") + dev.off() + dg = DGEList[order(DE\$table\$PValue),] + outpdfname=paste(mt,"heatmap.pdf",sep='_') + hmap2(heatlogcpmnorm,nsamp=100,TName=TName,group=group,outpdfname=outpdfname,myTitle=myTitle) + outSmear = paste(mt,"Smearplot.pdf",sep='_') + outMain = paste("Smear Plot for ",TName,' Vs ',CName,' (FDR@',fdrthresh,' N = ',nsig,')',sep='') + smearPlot(DGEList=DGEList,deTags=deTags, outSmear=outSmear, outMain = outMain) + qqPlot(descr=myTitle,pvector=DE\$table\$PValue) + if (doDESeq == T) + { + ### DESeq2 + require('DESeq2') + print.noquote(paste('****subjects=',subjects,'length=',length(subjects))) + if (length(subjects) == 0) + { + pdata = data.frame(Name=colnames(workCM),Rx=group,row.names=colnames(workCM)) + deSEQds = DESeqDataSetFromMatrix(countData = workCM, colData = pdata, design = formula(~ Rx)) + } else { + pdata = 
data.frame(Name=colnames(workCM),Rx=group,subjects=subjects,row.names=colnames(workCM)) + deSEQds = DESeqDataSetFromMatrix(countData = workCM, colData = pdata, design = formula(~ subjects + Rx)) + } + deSeqDatsizefac <- estimateSizeFactors(deSEQds) + deSeqDatdisp <- estimateDispersions(deSeqDatsizefac,fitType=DESeq_fittype) + resDESeq <- nbinomWaldTest(deSeqDatdisp, pAdjustMethod=fdrtype) + rDESeq = as.data.frame(results(resDESeq)) + srDESeq = rDESeq[order(rDESeq\$pvalue),] + write.table(srDESeq,paste(mt,'DESeq2_TopTable.xls',sep='_'), quote=FALSE, sep="\t",row.names=F) + topresults.DESeq <- rDESeq[which(rDESeq\$padj < fdrthresh), ] + DESeqcountsindex <- which(allgenes %in% rownames(topresults.DESeq)) + DESeqcounts <- rep(0, length(allgenes)) + DESeqcounts[DESeqcountsindex] <- 1 + pdf(paste(mt,"DESeq2_dispersion_estimates.pdf",sep='_')) + plotDispEsts(resDESeq) + dev.off() + if (doCook) { + pdf(paste(mt,"DESeq2_cooks_distance.pdf",sep='_')) + W <- mcols(resDESeq)\$WaldStatistic_condition_treated_vs_untreated + maxCooks <- mcols(resDESeq)\$maxCooks + idx <- !is.na(W) + plot(rank(W[idx]), maxCooks[idx], xlab="rank of Wald statistic", ylab="maximum Cook's distance per gene", + ylim=c(0,5), cex=.4, col="maroon") + m <- ncol(dds) + p <- 3 + abline(h=qf(.75, p, m - p),col="darkblue") + grid(col="lightgray",lty="dotted") + } + } + counts.dataframe = as.data.frame(c()) + norm.factor = DGEList\$samples\$norm.factors + topresults.edgeR <- soutput[which(soutput\$adj.p.value < fdrthresh), ] + edgeRcountsindex <- which(allgenes %in% rownames(topresults.edgeR)) + edgeRcounts <- rep(0, length(allgenes)) + edgeRcounts[edgeRcountsindex] <- 1 + if (doVoom == T) { + pdf(paste(mt,"voomplot.pdf",sep='_')) + dat.voomed <- voom(DGEList, mydesign, plot = TRUE, normalize.method="quantil", lib.size = NULL) + dev.off() + fit <- lmFit(dat.voomed, mydesign) + fit <- eBayes(fit) + rvoom <- topTable(fit, coef = length(colnames(mydesign)), adj = "BH", n = Inf) + 
write.table(rvoom,paste(mt,'VOOM_topTable.xls',sep='_'), quote=FALSE, sep="\t",row.names=F) + topresults.voom <- rvoom[which(rvoom\$adj.P.Val < fdrthresh), ] + voomcountsindex <- which(allgenes %in% rownames(topresults.voom)) + voomcounts <- rep(0, length(allgenes)) + voomcounts[voomcountsindex] <- 1 + } + if ((doDESeq==T) || (doVoom==T)) { + if ((doVoom==T) && (doDESeq==T)) { + vennmain = paste(mt,'Voom,edgeR and DESeq2 overlap at FDR=',fdrthresh) + counts.dataframe <- data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts, + VOOM_limma = voomcounts, row.names = allgenes) + } else if (doDESeq==T) { + vennmain = paste(mt,'DESeq2 and edgeR overlap at FDR=',fdrthresh) + counts.dataframe <- data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts, row.names = allgenes) + } else if (doVoom==T) { + vennmain = paste(mt,'Voom and edgeR overlap at FDR=',fdrthresh) + counts.dataframe <- data.frame(edgeR = edgeRcounts, VOOM_limma = voomcounts, row.names = allgenes) + } + + if (nrow(counts.dataframe) > 1) { + counts.venn <- vennCounts(counts.dataframe) + vennf = paste(mt,'venn.pdf',sep='_') + pdf(vennf) + vennDiagram(counts.venn,main=vennmain,col="maroon") + dev.off() + } + } ### doDESeq or doVoom + if (doDESeq==T) { + cat("*** DESeq top 50\n") + print(srDESeq[1:50,]) + } + if (doVoom==T) { + cat("*** VOOM top 50\n") + print(rvoom[1:50,]) + } + if (doCamera) { + doGSEA(y=DGEList,design=mydesign,histgmt=histgmt,bigmt=bigmt,ntest=20,myTitle=myTitle, + outfname=paste(mt,"GSEA.xls",sep="_"),fdrthresh=fdrthresh,fdrtype=fdrtype) + } + uoutput + +} +#### Done + +#### sink(stdout(),append=T,type="message") + +doDESeq = $DESeq.doDESeq +### make these 'T' or 'F' +doVoom = $doVoom +doCamera = $camera.doCamera +Out_Dir = "$html_file.files_path" +Input = "$input1" +TreatmentName = "$treatment_name" +TreatmentCols = "$Treat_cols" +ControlName = "$control_name" +ControlCols= "$Control_cols" +outputfilename = "$outtab" +org = "$input1.dbkey" +if (org == "") { org = "hg19"} +fdrtype = "$fdrtype" 
+priordf = $priordf +fdrthresh = $fdrthresh +useNDF = "$useNDF" +fQ = $fQ +myTitle = "$title" +sids = strsplit("$subjectids",',') +subjects = unlist(sids) +nsubj = length(subjects) +builtin_gmt="" +history_gmt="" + +builtin_gmt = "" +history_gmt = "" +DESeq_fittype="" +#if $DESeq.doDESeq == "T" + DESeq_fittype = "$DESeq.DESeq_fitType" +#end if +#if $camera.doCamera == 'T' + #if $camera.gmtSource.refgmtSource == "indexed" or $camera.gmtSource.refgmtSource == "both": + builtin_gmt = "${camera.gmtSource.builtinGMT.fields.path}" + #end if + #if $camera.gmtSource.refgmtSource == "history" or $camera.gmtSource.refgmtSource == "both": + history_gmt = "${camera.gmtSource.ownGMT}" + history_gmt_name = "${camera.gmtSource.ownGMT.name}" + #end if +#end if +if (nsubj > 0) { +if (doDESeq) { + print('WARNING - cannot yet use DESeq2 for 2 way anova - see the docs') + doDESeq = F + } +} +TCols = as.numeric(strsplit(TreatmentCols,",")[[1]])-1 +CCols = as.numeric(strsplit(ControlCols,",")[[1]])-1 +cat('Got TCols=') +cat(TCols) +cat('; CCols=') +cat(CCols) +cat('\n') +useCols = c(TCols,CCols) +if (file.exists(Out_Dir) == F) dir.create(Out_Dir) +Count_Matrix = read.table(Input,header=T,row.names=1,sep='\t') #Load tab file assume header +snames = colnames(Count_Matrix) +nsamples = length(snames) +if (nsubj > 0 & nsubj != nsamples) { +options("show.error.messages"=T) +mess = paste('Fatal error: Supplied subject id list',paste(subjects,collapse=','), + 'has length',nsubj,'but there are',nsamples,'samples',paste(snames,collapse=',')) +write(mess, stderr()) +quit(save="no",status=4) +} + +Count_Matrix = Count_Matrix[,useCols] ### reorder columns +if (length(subjects) != 0) {subjects = subjects[useCols]} +rn = rownames(Count_Matrix) +islib = rn %in% c('librarySize','NotInBedRegions') +LibSizes = Count_Matrix[subset(rn,islib),][1] # take first +Count_Matrix = Count_Matrix[subset(rn,! 
islib),] +group = c(rep(TreatmentName,length(TCols)), rep(ControlName,length(CCols)) ) +group = factor(group, levels=c(ControlName,TreatmentName)) +colnames(Count_Matrix) = paste(group,colnames(Count_Matrix),sep="_") +results = edgeIt(Count_Matrix=Count_Matrix,group=group,outputfilename=outputfilename, + fdrtype='BH',priordf=priordf,fdrthresh=fdrthresh,outputdir='.', + myTitle='edgeR',useNDF=F,libSize=c(),filterquantile=fQ,subjects=subjects, + doDESeq=doDESeq,doVoom=doVoom,doCamera=doCamera,org=org, + histgmt=history_gmt,bigmt=builtin_gmt,DESeq_fittype=DESeq_fittype) +sessionInfo() +]]> +</configfile> +</configfiles> +<help> + +**What it does** + +Performs digital gene expression analysis between a treatment and control on a count matrix. +Optionally adds a term for subject if not all samples are independent or if some other factor needs to be blocked in the design. + +**Input** + +A matrix consisting of non-negative integers. The matrix must have a unique header row identifying the samples, and a unique set of row names +as the first column. Typically the row names are gene symbols or probe ids for downstream use in GSEA and other methods. + +If you have (eg) paired samples and wish to include a term in the GLM to account for some other factor (subject in the case of paired samples), +put in a comma separated list of integer indicators, one for every sample (whether modelled or not!), giving (eg) the subject number, +or leave the field as an empty string if samples are all independent. +If not empty, there must be exactly as many integers in the supplied integer list as there are columns (samples) in the count matrix. +Integers for samples that are not in the analysis *must* be present in the string as filler even if not used. + +So if you have 2 pairs out of 6 samples, you need to put in unique integers for the unpaired ones +eg if you had 6 samples with the first two independent but the second and third pairs each being from independent subjects, 
you might use +8,9,1,1,2,2 +as subject IDs to indicate two paired samples from the same subject in columns 3/4 and 5/6 + +**Output** + +A summary html page with links to the R source code and all the outputs, nice grids of helpful plot thumbnails, and lots +of logging and the top 50 rows of the topTable. + +The main topTables of results are provided as separate excelish tabular files. + +They include adjusted p values and dispersions for each region, raw and cpm sample data counts and shrunken (predicted) log fold change estimates. +These are provided for downstream analyses such as GSEA and are predictions of the logFC you might expect to see +in an independent replication of your original experiment. Higher number means more shrinkage. Shrinkage is more extreme for low coverage features +so downstream analyses are more robust against strong effect size estimates based on relatively little experimental information. + +**Note on prior.N** + +http://seqanswers.com/forums/showthread.php?t=5591 says: + +*prior.n* + +The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion. +You can think of it as like a "weight" for the common value. (It is actually the weight for the common likelihood +in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your +tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the +common likelihood the weight of one observation. + +In answer to your question, it is a good thing to squeeze the tagwise dispersions towards a common value, +or else you will be using very unreliable estimates of the dispersion. I would not recommend using the value that +you obtained from estimateSmoothing()---this is far too small and would result in virtually no moderation +(squeezing) of the tagwise dispersions. How many samples do you have in your experiment? +What is the experimental design? 
If you have few samples (less than 6) then I would suggest a prior.n of at least 10. +If you have more samples, then the tagwise dispersion estimates will be more reliable, +so you could consider using a smaller prior.n, although I would hesitate to use a prior.n less than 5. + + +From Bioconductor Digest, Vol 118, Issue 5, Gordon writes: + +Dear Dorota, + +The important settings are prior.df and trend. + +prior.n and prior.df are related through prior.df = prior.n * residual.df, +and your experiment has residual.df = 36 - 12 = 24. So the old setting of +prior.n=10 is equivalent for your data to prior.df = 240, a very large +value. Going the other way, the new setting of prior.df=10 is equivalent +to prior.n=10/24. + +To recover old results with the current software you would use + + estimateTagwiseDisp(object, prior.df=240, trend="none") + +To get the new default from old software you would use + + estimateTagwiseDisp(object, prior.n=10/24, trend=TRUE) + +Actually the old trend method is equivalent to trend="loess" in the new +software. You should use plotBCV(object) to see whether a trend is +required. + +Note you could also use + + prior.n = getPriorN(object, prior.df=10) + +to map between prior.df and prior.n. + +**Old rant on variable name changes in Bioconductor versions** + +BioC authors sometimes make small mostly cosmetic changes to variable names (eg: from p.value to PValue) +often to make them more internally consistent or self describing. Unfortunately, these improvements +break existing code that relies on the library in ways that can take a while to track down, +increasing downstream tool maintenance effort uselessly. + +Please, don't do that. It hurts us. + + +</help> + +</tool> + +
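Reviewer note on the prior.n / prior.df relation quoted in the help text: the arithmetic in Gordon's reply can be checked with a few lines of code. The sketch below is an illustration only, written in Python (the language of the tool's wrapper script) rather than R. The function names are hypothetical and are not part of edgeR or of this tool. It assumes the quoted relation prior.df = prior.n * residual.df, and it assumes that residual.df is the number of samples minus the number of fitted model coefficients, which is how the quoted "36 - 12 = 24" reads.

```python
def residual_df(n_samples, n_coefficients):
    """Residual degrees of freedom, assumed here to be samples minus fitted coefficients."""
    if n_coefficients >= n_samples:
        raise ValueError("need more samples than model coefficients")
    return n_samples - n_coefficients


def prior_df_from_prior_n(prior_n, resid_df):
    """Old-style prior.n to new-style prior.df, using prior.df = prior.n * residual.df."""
    return prior_n * resid_df


def prior_n_from_prior_df(prior_df, resid_df):
    """New-style prior.df back to old-style prior.n (the inverse of the relation above)."""
    if resid_df <= 0:
        raise ValueError("residual.df must be positive")
    return prior_df / resid_df


if __name__ == "__main__":
    # The worked example from the quoted message: 36 samples, 12 coefficients.
    rdf = residual_df(36, 12)
    print("residual.df =", rdf)
    print("prior.n = 10  corresponds to prior.df =", prior_df_from_prior_n(10, rdf))
    print("prior.df = 10 corresponds to prior.n =", prior_n_from_prior_df(10, rdf))
```

For the tool's default of priordf = 20, the same helper gives the equivalent old-style prior.n for a given design; for a real analysis the value reported by edgeR itself should be preferred over this hand calculation.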