comparison old.xml @ 34:c6fdf2c6d0f4 draft

Citations added (thanks John!) and a few more output formats for Alistair Chilcott
author fubar
date Thu, 28 Aug 2014 02:33:05 -0400
33:ca60c96f0beb 34:c6fdf2c6d0f4
1 <tool id="rgedgeRpaired" name="edgeR" version="0.20">
2 <description>1 or 2 level models for count data</description>
3 <requirements>
4 <requirement type="package" version="2.12">biocbasics</requirement>
5 <requirement type="package" version="3.0.1">package_r3</requirement>
6 </requirements>
7
8 <command interpreter="python">
9 rgToolFactory.py --script_path "$runme" --interpreter "Rscript" --tool_name "edgeR"
10 --output_dir "$html_file.files_path" --output_html "$html_file" --output_tab "$outtab" --make_HTML "yes"
11 </command>
12 <inputs>
13 <param name="input1" type="data" format="tabular" label="Select an input matrix - rows are contigs, columns are counts for each sample"
14 help="Use the HTSeq based count matrix preparation tool to create these matrices from BAM/SAM files and a GTF file of genomic features"/>
15 <param name="title" type="text" value="edgeR" size="80" label="Title for job outputs" help="Supply a meaningful name here to remind you what the outputs contain">
16 <sanitizer invalid_char="">
17 <valid initial="string.letters,string.digits"><add value="_" /> </valid>
18 </sanitizer>
19 </param>
20 <param name="treatment_name" type="text" value="Treatment" size="50" label="Treatment Name"/>
21 <param name="Treat_cols" label="Select columns containing treatment." type="data_column" data_ref="input1" numerical="True"
22 multiple="true" use_header_names="true" size="120" display="checkboxes">
23 <validator type="no_options" message="Please select at least one column."/>
24 </param>
25 <param name="control_name" type="text" value="Control" size="50" label="Control Name"/>
26 <param name="Control_cols" label="Select columns containing control." type="data_column" data_ref="input1" numerical="True"
27 multiple="true" use_header_names="true" size="120" display="checkboxes" optional="true">
28 </param>
29 <param name="subjectids" type="text" optional="true" size="120" value = ""
30 label="IF SUBJECTS NOT ALL INDEPENDENT! Enter integers to indicate sample pairing for every column in input"
31 help="Leave blank if no pairing, but eg if data from sample id A99 is in columns 2,4 and id C21 is in 3,5 then enter '1,2,1,2'">
32 <sanitizer>
33 <valid initial="string.digits"><add value="," /> </valid>
34 </sanitizer>
35 </param>
36 <param name="fQ" type="float" value="0.3" size="5" label="Non-differential contig count quantile threshold - zero to analyze all non-zero read count contigs"
37 help="May be a good or a bad idea depending on the biology and the question. EG 0.3 = sparsest 30% of contigs with at least one read are removed before analysis"/>
38 <param name="useNDF" type="boolean" truevalue="T" falsevalue="F" checked="false" size="1"
39 label="Non differential filter - remove contigs below a threshold (1 per million) for half or more samples"
40 help="May be a good or a bad idea depending on the biology and the question. This was the old default. Quantile based is available as an alternative"/>
41 <conditional name="DESeq">
42 <param name="doDESeq" type="select"
43 label="Run the same model with DESeq2 and compare findings"
44 help="DESeq2 is an update to the DESeq package. It uses different assumptions and methods to edgeR">
45 <option value="F" selected="true">Do not run DESeq2</option>
46 <option value="T">Run DESeq2 (only works if NO second GLM factor supplied at present)</option>
47 </param>
48 <when value="T">
49 <param name="DESeq_fitType" type="select">
50 <option value="parametric" selected="true">Parametric (default) fit for dispersions</option>
51 <option value="local">Local fit - use this if parametric fails</option>
52 <option value="mean">Mean dispersion fit- use this if you really understand what you're doing - read the fine manual</option>
53 </param>
54 </when>
55 <when value="F"> </when>
56 </conditional>
57 <param name="doVoom" type="boolean" truevalue="T" checked='false' falsevalue="F" size="1" label="Run the same model with VOOM transformation and limma."/>
58 <conditional name="camera">
59 <param name="doCamera" type="select" label="Run the edgeR implementation of Camera GSEA for up/down gene sets"
60 help="If yes, you can choose a set of genesets to test and/or supply a gmt format geneset collection from your history">
61 <option value="F" selected="true">Do not run GSEA tests with the Camera algorithm</option>
62 <option value="T">Run GSEA tests with the Camera algorithm</option>
63 </param>
64 <when value="T">
65 <conditional name="gmtSource">
66 <param name="refgmtSource" type="select"
67 label="Use a gene set (.gmt) from your history and/or use a built-in (MSigDB etc) gene set">
68 <option value="indexed" selected="true">Use a built-in gene set</option>
69 <option value="history">Use a gene set from my history</option>
70 <option value="both">Add a gene set from my history to a built in gene set</option>
71 </param>
72 <when value="indexed">
73 <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis">
74 <options from_data_table="gseaGMT_3.1">
75 <filter type="sort_by" column="2" />
76 <validator type="no_options" message="No GMT v3.1 files are available - please install them"/>
77 </options>
78 </param>
79 </when>
80 <when value="history">
81 <param name="ownGMT" type="data" format="gmt" label="Select a Gene Set from your history" />
82 </when>
83 <when value="both">
84 <param name="ownGMT" type="data" format="gseagmt" label="Select a Gene Set from your history" />
85 <param name="builtinGMT" type="select" label="Select a gene set matrix (.gmt) file to use for the analysis">
86 <options from_data_table="gseaGMT_3.1">
87 <filter type="sort_by" column="2" />
88 <validator type="no_options" message="No GMT v3.1 files are available - please fix tool_data_table and loc files"/>
89 </options>
90 </param>
91 </when>
92 </conditional>
93 </when>
94 <when value="F">
95 </when>
96 </conditional>
97 <param name="priordf" type="integer" value="20" size="3" label="prior.df for tagwise dispersion - lower value = more emphasis on each tag's variance. Replaces prior.n and prior.df = prior.n * residual.df"
98 help="0 = Use edgeR default. Use a small value to 'smooth' small samples. See edgeR docs and note below"/>
99 <param name="fdrthresh" type="float" value="0.05" size="5" label="P value threshold for FDR filtering for family wise error rate control"
100 help="Conventional default value of 0.05 recommended"/>
101 <param name="fdrtype" type="select" label="FDR (Type I error) control method"
102 help="Use fdr or bh typically to control for the number of tests in a reliable way">
103 <option value="fdr" selected="true">fdr</option>
104 <option value="BH">Benjamini Hochberg</option>
105 <option value="BY">Benjamini Yekutieli</option>
106 <option value="bonferroni">Bonferroni</option>
107 <option value="hochberg">Hochberg</option>
108 <option value="holm">Holm</option>
109 <option value="hommel">Hommel</option>
110 <option value="none">no control for multiple tests</option>
111 </param>
112 </inputs>
113 <outputs>
114 <data format="tabular" name="outtab" label="${title}.xls"/>
115 <data format="html" name="html_file" label="${title}.html"/>
116 </outputs>
117 <stdio>
118 <exit_code range="4" level="fatal" description="Number of subject ids must match total number of samples in the input matrix" />
119 </stdio>
120 <tests>
121 <test>
122 <param name='input1' value='test_bams2mx.xls' ftype='tabular' />
123 <param name='treatment_name' value='case' />
124 <param name='title' value='edgeRtest' />
125 <param name='useNDF' value='' />
126 <param name='fdrtype' value='fdr' />
127 <param name='priordf' value="0" />
128 <param name='fdrthresh' value="0.05" />
129 <param name='control_name' value='control' />
130 <param name='subjectids' value='' />
131 <param name='Treat_cols' value='3,4,5,9' />
132 <param name='Control_cols' value='2,6,7,8' />
133 <output name='outtab' file='edgeRtest1out.xls' compare='diff' />
134 <output name='html_file' file='edgeRtest1out.html' compare='diff' lines_diff='20' />
135 </test>
136 </tests>
137
138 <configfiles>
139 <configfile name="runme">
140 <![CDATA[
141 ##
142 ## edgeR.Rscript
143 ## updated npv 2011 for R 2.14.0 and edgeR 2.4.0 by ross
144 ## Performs DGE on a count table containing n replicates of two conditions
145 ##
146 ### Original edgeR code by: S.Lunke and A.Kaspi
147 reallybig = log10(.Machine\$double.xmax)
148 reallysmall = log10(.Machine\$double.xmin)
149 library('stringr')
150 library('gplots')
151 library('edgeR')
152
153 hmap2 = function(cmat,nsamp=100,outpdfname='heatmap2.pdf', TName='Treatment',group=NA,myTitle='title goes here')
154 {
155 ### Perform clustering for significant pvalues after controlling FWER
156 samples = colnames(cmat)
157 gu = unique(group)
158 if (length(gu) == 2) {
159 col.map = function(g) {if (g==gu[1]) "#FF0000" else "#0000FF"}
160 pcols = unlist(lapply(group,col.map))
161 } else {
162 colours = rainbow(length(gu),start=0,end=4/6)
163 pcols = colours[match(group,gu)]
164 }
165 gn = rownames(cmat)
166 dm = cmat[(! is.na(gn)),]
167 ### remove unlabelled hm rows
168 nprobes = nrow(dm)
169 if (nprobes > nsamp) {
170 dm =dm[1:nsamp,]
171 }
172 newcolnames = substr(colnames(dm),1,20)
173 colnames(dm) = newcolnames
174 pdf(outpdfname)
175 heatmap.2(dm,main=myTitle,ColSideColors=pcols,col=topo.colors(100),dendrogram="col",key=T,density.info='none',
176 Rowv=F,scale='row',trace='none',margins=c(8,8),cexRow=0.4,cexCol=0.5)
177 dev.off()
178 }
179
180 hmap = function(cmat,nmeans=4,outpdfname="heatMap.pdf",nsamp=250,TName='Treatment',group=NA,myTitle="Title goes here")
181 {
182 ## for 2 groups only was
183 ## col.map = function(g) {if (g==TName) "#FF0000" else "#0000FF"}
184 ## pcols = unlist(lapply(group,col.map))
185 gu = unique(group)
186 colours = rainbow(length(gu),start=0.3,end=0.6)
187 pcols = colours[match(group,gu)]
188 nrows = nrow(cmat)
189 mtitle = paste(myTitle,'Heatmap: n contigs =',nrows)
190 if (nrows > nsamp) {
191 cmat = cmat[c(1:nsamp),]
192 mtitle = paste('Heatmap: Top ',nsamp,' DE contigs (of ',nrows,')',sep='')
193 }
194 newcolnames = substr(colnames(cmat),1,20)
195 colnames(cmat) = newcolnames
196 pdf(outpdfname)
197 heatmap(cmat,scale='row',main=mtitle,cexRow=0.3,cexCol=0.4,Rowv=NA,ColSideColors=pcols)
198 dev.off()
199 }
200
201 qqPlot = function(descr='Title',pvector, ...)
202 ## stolen from https://gist.github.com/703512
203 {
204 o = -log10(sort(pvector,decreasing=F))
205 e = -log10( 1:length(o)/length(o) )
206 o[o==-Inf] = reallysmall
207 o[o==Inf] = reallybig
208 pdfname = paste(gsub(" ","", descr , fixed=TRUE),'pval_qq.pdf',sep='_')
209 maint = paste(descr,'QQ Plot')
210 pdf(pdfname)
211 plot(e,o,pch=19,cex=1, main=maint, ...,
212 xlab=expression(Expected~~-log[10](italic(p))),
213 ylab=expression(Observed~~-log[10](italic(p))),
214 xlim=c(0,max(e)), ylim=c(0,max(o)))
215 lines(e,e,col="red")
216 grid(col = "lightgray", lty = "dotted")
217 dev.off()
218 }
219
220 smearPlot = function(DGEList,deTags, outSmear, outMain)
221 {
222 pdf(outSmear)
223 plotSmear(DGEList,de.tags=deTags,main=outMain)
224 grid(col="blue")
225 dev.off()
226 }
227
228 boxPlot = function(rawrs,cleanrs,maint,myTitle)
229 {
230 nc = ncol(rawrs)
231 for (i in c(1:nc)) {rawrs[(rawrs[,i] < 0),i] = NA}
232 fullnames = colnames(rawrs)
233 newcolnames = substr(colnames(rawrs),1,20)
234 colnames(rawrs) = newcolnames
235 newcolnames = substr(colnames(cleanrs),1,20)
236 colnames(cleanrs) = newcolnames
237 pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"sampleBoxplot.pdf",sep='_')
238 defpar = par(no.readonly=T)
239 pdf(pdfname)
240 l = layout(matrix(c(1,2),1,2,byrow=T))
241 print.noquote('raw contig counts by sample:')
242 print.noquote(summary(rawrs))
243 print.noquote('normalised contig counts by sample:')
244 print.noquote(summary(cleanrs))
245 boxplot(rawrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('Raw:',maint))
246 grid(col="blue")
247 boxplot(cleanrs,varwidth=T,notch=T,ylab='log contig count',col="maroon",las=3,cex.axis=0.35,main=paste('After ',maint))
248 grid(col="blue")
249 dev.off()
250 pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"samplehistplot.pdf",sep='_')
251 nc = ncol(rawrs)
252 print.noquote(paste('Using ncol rawrs=',nc))
253 ncroot = round(sqrt(nc))
254 if (ncroot*ncroot < nc) { ncroot = ncroot + 1 }
255 m = c()
256 for (i in c(1:nc)) {
257 rhist = hist(rawrs[,i],breaks=100,plot=F)
258 m = append(m,max(rhist\$counts))
259 }
260 ymax = max(m)
261 pdf(pdfname)
262 par(mfrow=c(ncroot,ncroot))
263 for (i in c(1:nc)) {
264 hist(rawrs[,i], main=paste("Contig logcount",i), xlab='log raw count', col="maroon",
265 breaks=100,sub=fullnames[i],cex=0.8,ylim=c(0,ymax))
266 }
267 dev.off()
268 par(defpar)
269
270 }
271
272 cumPlot = function(rawrs,cleanrs,maint,myTitle)
273 {
274 pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_')
275 defpar = par(no.readonly=T)
276 pdf(pdfname)
277 par(mfrow=c(2,1))
278 lrs = log(rawrs,10)
279 lim = max(lrs)
280 hist(lrs,breaks=100,main=paste('Before:',maint),xlab="Reads (log)",
281 ylab="Count",col="maroon",sub=myTitle, xlim=c(0,lim),las=1)
282 grid(col="blue")
283 lrs = log(cleanrs,10)
284 hist(lrs,breaks=100,main=paste('After:',maint),xlab="Reads (log)",
285 ylab="Count",col="maroon",sub=myTitle,xlim=c(0,lim),las=1)
286 grid(col="blue")
287 dev.off()
288 par(defpar)
289 }
290
291 cumPlot1 = function(rawrs,cleanrs,maint,myTitle)
292 {
293 pdfname = paste(gsub(" ","", myTitle , fixed=TRUE),"RowsumCum.pdf",sep='_')
294 pdf(pdfname)
295 par(mfrow=c(2,1))
296 lastx = max(rawrs)
297 rawe = knots(ecdf(rawrs))
298 cleane = knots(ecdf(cleanrs))
299 cy = 1:length(cleane)/length(cleane)
300 ry = 1:length(rawe)/length(rawe)
301 plot(rawe,ry,type='l',main=paste('Before',maint),xlab="Log Contig Total Reads",
302 ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle)
303 grid(col="blue")
304 plot(cleane,cy,type='l',main=paste('After',maint),xlab="Log Contig Total Reads",
305 ylab="Cumulative proportion",col="maroon",log='x',xlim=c(1,lastx),sub=myTitle)
306 grid(col="blue")
307 dev.off()
308 }
309
310
311
312 doGSEA = function(y=NULL,design=NULL,histgmt="",
313 bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
314 ntest=0, myTitle="myTitle", outfname="GSEA.xls", minnin=5, maxnin=2000,fdrthresh=0.05,fdrtype="BH")
315 {
316 genesets = c()
317 if (bigmt > "")
318 {
319 bigenesets = readLines(bigmt)
320 genesets = bigenesets
321 }
322 if (histgmt > "")
323 {
324 hgenesets = readLines(histgmt)
325 if (bigmt > "") {
326 genesets = c(genesets,hgenesets)
327 } else {
328 genesets = hgenesets
329 }
330 }
331 print.noquote(paste("@@@read",length(genesets), 'genesets from',histgmt,bigmt))
332 genesets = strsplit(genesets,'\t')
333 ##### tabular. genesetid\tURLorwhatever\tgene_1\t..\tgene_n
334 outf = outfname
335 head=paste(myTitle,'edgeR GSEA')
336 write(head,file=outfname,append=F)
337 ntest=length(genesets)
338 urownames = toupper(rownames(y))
339 upcam = c()
340 downcam = c()
341 for (i in 1:ntest) {
342 gs = unlist(genesets[i])
343 g = gs[1] #### geneset_id
344 u = gs[2]
345 if (u > "") { u = paste("<a href=\'",u,"\'>",u,"</a>",sep="") }
346 glist = gs[3:length(gs)] #### member gene symbols
347 glist = toupper(glist)
348 inglist = urownames %in% glist
349 nin = sum(inglist)
350 if ((nin > minnin) && (nin < maxnin)) {
351 ### print(paste('@@found',sum(inglist),'genes in glist'))
352 camres = camera(y=y,index=inglist,design=design)
353 if (!is.null(camres) && nrow(camres) > 0) {
354 rownames(camres) = g
355 ##### gene set name
356 camres = cbind(GeneSet=g,URL=u,camres)
357 if (camres\$Direction == "Up")
358 {
359 upcam = rbind(upcam,camres) } else {
360 downcam = rbind(downcam,camres)
361 }
362 }
363 }
364 }
365 uscam = upcam[order(upcam\$PValue),]
366 unadjp = uscam\$PValue
367 uscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
368 nup = max(10,sum((uscam\$adjPValue < fdrthresh)))
369 dscam = downcam[order(downcam\$PValue),]
370 unadjp = dscam\$PValue
371 dscam\$adjPValue = p.adjust(unadjp,method=fdrtype)
372 ndown = max(10,sum((dscam\$adjPValue < fdrthresh)))
373 write.table(uscam,file=paste('upCamera',outfname,sep='_'),quote=F,sep='\t',row.names=F)
374 write.table(dscam,file=paste('downCamera',outfname,sep='_'),quote=F,sep='\t',row.names=F)
375 print.noquote(paste('@@@@@ Camera up top',nup,'gene sets:'))
376 write.table(head(uscam,nup),file="",quote=F,sep='\t',row.names=F)
377 print.noquote(paste('@@@@@ Camera down top',ndown,'gene sets:'))
378 write.table(head(dscam,ndown),file="",quote=F,sep='\t',row.names=F)
379 }
380
381
382
383 edgeIt = function (Count_Matrix,group,outputfilename,fdrtype='fdr',priordf=5,
384 fdrthresh=0.05,outputdir='.', myTitle='edgeR',libSize=c(),useNDF=F,
385 filterquantile=0.2, subjects=c(),mydesign=NULL,
386 doDESeq=T,doVoom=T,doCamera=T,org='hg19',
387 histgmt="", bigmt="/data/genomes/gsea/3.1/Abetterchoice_nocgp_c2_c3_c5_symbols_all.gmt",
388 doCook=F,DESeq_fittype="parametric")
389 {
390 if (length(unique(group))!=2){
391 print("Number of conditions identified in experiment does not equal 2")
392 q()
393 }
394 require(edgeR)
395 options(width = 512)
396 mt = paste(unlist(strsplit(myTitle,'_')),collapse=" ")
397 allN = nrow(Count_Matrix)
398 nscut = round(ncol(Count_Matrix)/2)
399 colTotmillionreads = colSums(Count_Matrix)/1e6
400 rawrs = rowSums(Count_Matrix)
401 nonzerod = Count_Matrix[(rawrs > 0),]
402 nzN = nrow(nonzerod)
403 nzrs = rowSums(nonzerod)
404 zN = allN - nzN
405 print('**** Quantiles for non-zero row counts:',quote=F)
406 print(quantile(nzrs,probs=seq(0,1,0.1)),quote=F)
407 if (useNDF == "T")
408 {
409 gt1rpin3 = rowSums(Count_Matrix/expandAsMatrix(colTotmillionreads,dim(Count_Matrix)) >= 1) >= nscut
410 lo = colSums(Count_Matrix[!gt1rpin3,])
411 workCM = Count_Matrix[gt1rpin3,]
412 cleanrs = rowSums(workCM)
413 cleanN = length(cleanrs)
414 meth = paste( "After removing",length(lo),"contigs with fewer than ",nscut," sample read counts >= 1 per million, there are",sep="")
415 print(paste("Read",allN,"contigs. Removed",zN,"contigs with no reads.",meth,cleanN,"contigs"),quote=F)
416 maint = paste('Filter >= 1/million reads in >=',nscut,'samples')
417 } else {
418 useme = (nzrs > quantile(nzrs,filterquantile))
419 workCM = nonzerod[useme,]
420 lo = colSums(nonzerod[!useme,])
421 cleanrs = rowSums(workCM)
422 cleanN = length(cleanrs)
423 meth = paste("After filtering at count quantile =",filterquantile,", there are",sep="")
424 print(paste('Read',allN,"contigs. Removed",zN,"with no reads.",meth,cleanN,"contigs"),quote=F)
425 maint = paste('Filter below',filterquantile,'quantile')
426 }
427 cumPlot(rawrs=rawrs,cleanrs=cleanrs,maint=maint,myTitle=myTitle)
428 allgenes <- rownames(workCM)
429 print(paste("*** Total low count contigs per sample = ",paste(lo,collapse=',')),quote=F)
430 rsums = rowSums(workCM)
431 TName=unique(group)[1]
432 CName=unique(group)[2]
433 DGEList = DGEList(counts=workCM, group = group)
434 DGEList = calcNormFactors(DGEList)
435
436 if (is.null(mydesign)) {
437 if (length(subjects) == 0)
438 {
439 mydesign = model.matrix(~group)
440 }
441 else {
442 subjf = factor(subjects)
443 mydesign = model.matrix(~subjf+group)
444 ### we block on subject so make group last to simplify finding it
445 }
446 }
447 print.noquote(paste('Using samples:',paste(colnames(workCM),collapse=',')))
448 print.noquote('Using design matrix:')
449 print.noquote(mydesign)
450 DGEList = estimateGLMCommonDisp(DGEList,mydesign)
451 comdisp = DGEList\$common.dispersion
452 DGEList = estimateGLMTrendedDisp(DGEList,mydesign)
453 if (priordf > 0) {
454 print.noquote(paste("prior.df =",priordf))
455 DGEList = estimateGLMTagwiseDisp(DGEList,mydesign,prior.df = priordf)
456 } else {
457 DGEList = estimateGLMTagwiseDisp(DGEList,mydesign)
458 }
459 lastcoef=ncol(mydesign)
460 print.noquote(paste('*** lastcoef = ',lastcoef))
461 estpriorn = getPriorN(DGEList)
462 predLFC1 = predFC(DGEList,prior.count=1,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList))
463 predLFC3 = predFC(DGEList,prior.count=3,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList))
464 predLFC5 = predFC(DGEList,prior.count=5,design=mydesign,dispersion=DGEList\$tagwise.dispersion,offset=getOffset(DGEList))
465 DGLM = glmFit(DGEList,design=mydesign)
466 DE = glmLRT(DGLM)
467 #### always last one - subject is first if needed
468 logCPMnorm = cpm(DGEList,log=T,normalized.lib.sizes=T)
469 logCPMraw = cpm(DGEList,log=T,normalized.lib.sizes=F)
470 uoutput = cbind(
471 Name=as.character(rownames(DGEList\$counts)),
472 DE\$table,
473 adj.p.value=p.adjust(DE\$table\$PValue, method=fdrtype),
474 Dispersion=DGEList\$tagwise.dispersion,totreads=rsums,
475 predLFC1=predLFC1[,lastcoef],
476 predLFC3=predLFC3[,lastcoef],
477 predLFC5=predLFC5[,lastcoef],
478 logCPMnorm,
479 DGEList\$counts
480 )
481 soutput = uoutput[order(DE\$table\$PValue),]
482 heatlogcpmnorm = logCPMnorm[order(DE\$table\$PValue),]
483 goodness = gof(DGLM, pcutoff=fdrthresh)
484 noutl = (sum(goodness\$outlier) > 0)
485 if (noutl > 0) {
486 print.noquote(paste('***',noutl,'GLM outliers found'))
487 print(paste(rownames(DGLM)[(goodness\$outlier)],collapse=','),quote=F)
488 } else {
489 print('*** No GLM fit outlier genes found')
490 }
491 z = limma::zscoreGamma(goodness\$gof.statistic, shape=goodness\$df/2, scale=2)
492 pdf(paste(mt,"GoodnessofFit.pdf",sep='_'))
493 qq = qqnorm(z, panel.first=grid(), main="tagwise dispersion")
494 abline(0,1,lwd=3)
495 points(qq\$x[goodness\$outlier],qq\$y[goodness\$outlier], pch=16, col="maroon")
496 dev.off()
497 print(paste("Common Dispersion =",comdisp,"CV = ",sqrt(comdisp),"getPriorN = ",estpriorn),quote=F)
498 uniqueg = unique(group)
499 sample_colors = match(group,levels(group))
500 pdf(paste(mt,"MDSplot.pdf",sep='_'))
501 sampleTypes = levels(factor(group))
502 print.noquote(sampleTypes)
503 plotMDS.DGEList(DGEList,main=paste("MDS Plot for",myTitle),cex=0.5,col=sample_colors,pch=sample_colors)
504 legend(x="topleft", legend = sampleTypes,col=c(1:length(sampleTypes)), pch=19)
505 grid(col="blue")
506 dev.off()
507 colnames(logCPMnorm) = paste( colnames(logCPMnorm),'N',sep="_")
508 print(paste('Raw sample CPM',paste(colSums(logCPMraw,na.rm=T),collapse=',')))
509 try(boxPlot(rawrs=logCPMraw,cleanrs=logCPMnorm,maint='TMM Normalisation',myTitle=myTitle))
510 nreads = soutput\$totreads
511 print('*** writing output',quote=F)
512 write.table(soutput,outputfilename, quote=FALSE, sep="\t",row.names=F)
513 rn = row.names(workCM)
514 print.noquote('@@ rn')
515 print.noquote(head(rn))
516 reg = "^chr([0-9]+):([0-9]+)-([0-9]+)"
517 genecards="<a href=\'http://www.genecards.org/index.php?path=/Search/keyword/"
518 ucsc = paste("<a href=\'http://genome.ucsc.edu/cgi-bin/hgTracks?db=",org,sep='')
519 testreg = str_match(rn,reg)
520 nreads = uoutput\$totreads
521 if (sum(!is.na(testreg[,1]))/length(testreg[,1]) > 0.8)
522 {
523 print("@@ using ucsc substitution for urls")
524 urls = paste0(ucsc,"&amp;position=chr",testreg[,2],":",testreg[,3],"-",testreg[,4],"\'>",rn,"</a>")
525 } else {
526 print("@@ using genecards substitution for urls")
527 urls = paste0(genecards,rn,"\'>",rn,"</a>")
528 }
529 tt = uoutput
530 print.noquote("*** edgeR Top tags\n")
531 tt = cbind(tt,ntotreads=nreads,URL=urls)
532 tt = tt[order(DE\$table\$PValue),]
533 print.noquote(tt[1:50,])
534 ### Plot MAplot
535 deTags = rownames(uoutput[uoutput\$adj.p.value < fdrthresh,])
536 nsig = length(deTags)
537 print(paste('***',nsig,'tags significant at adj p=',fdrthresh),quote=F)
538 if (nsig > 0) {
539 print('*** deTags',quote=F)
540 print(head(deTags))
541 }
542 deColours = ifelse(deTags,'red','black')
543 pdf(paste(mt,"BCV_vs_abundance.pdf",sep='_'))
544 plotBCV(DGEList, cex=0.3, main="Biological CV vs abundance")
545 dev.off()
546 dg = DGEList[order(DE\$table\$PValue),]
547 outpdfname=paste(mt,"heatmap.pdf",sep='_')
548 hmap2(heatlogcpmnorm,nsamp=100,TName=TName,group=group,outpdfname=outpdfname,myTitle=myTitle)
549 outSmear = paste(mt,"Smearplot.pdf",sep='_')
550 outMain = paste("Smear Plot for ",TName,' Vs ',CName,' (FDR@',fdrthresh,' N = ',nsig,')',sep='')
551 smearPlot(DGEList=DGEList,deTags=deTags, outSmear=outSmear, outMain = outMain)
552 qqPlot(descr=myTitle,pvector=DE\$table\$PValue)
553 if (doDESeq == T)
554 {
555 ### DESeq2
556 require('DESeq2')
557 print.noquote(paste('****subjects=',subjects,'length=',length(subjects)))
558 if (length(subjects) == 0)
559 {
560 pdata = data.frame(Name=colnames(workCM),Rx=group,row.names=colnames(workCM))
561 deSEQds = DESeqDataSetFromMatrix(countData = workCM, colData = pdata, design = formula(~ Rx))
562 } else {
563 pdata = data.frame(Name=colnames(workCM),Rx=group,subjects=subjects,row.names=colnames(workCM))
564 deSEQds = DESeqDataSetFromMatrix(countData = workCM, colData = pdata, design = formula(~ subjects + Rx))
565 }
566 deSeqDatsizefac <- estimateSizeFactors(deSEQds)
567 deSeqDatdisp <- estimateDispersions(deSeqDatsizefac,fitType=DESeq_fittype)
568 resDESeq <- nbinomWaldTest(deSeqDatdisp, pAdjustMethod=fdrtype)
569 rDESeq = as.data.frame(results(resDESeq))
570 srDESeq = rDESeq[order(rDESeq\$pvalue),]
571 write.table(srDESeq,paste(mt,'DESeq2_TopTable.xls',sep='_'), quote=FALSE, sep="\t",row.names=F)
572 topresults.DESeq <- rDESeq[which(rDESeq\$padj < fdrthresh), ]
573 DESeqcountsindex <- which(allgenes %in% rownames(topresults.DESeq))
574 DESeqcounts <- rep(0, length(allgenes))
575 DESeqcounts[DESeqcountsindex] <- 1
576 pdf(paste(mt,"DESeq2_dispersion_estimates.pdf",sep='_'))
577 plotDispEsts(resDESeq)
578 dev.off()
579 if (doCook) {
580 pdf(paste(mt,"DESeq2_cooks_distance.pdf",sep='_'))
581 W <- mcols(resDESeq)\$WaldStatistic_condition_treated_vs_untreated
582 maxCooks <- mcols(resDESeq)\$maxCooks
583 idx <- !is.na(W)
584 plot(rank(W[idx]), maxCooks[idx], xlab="rank of Wald statistic", ylab="maximum Cook's distance per gene",
585 ylim=c(0,5), cex=.4, col="maroon")
586 m <- ncol(resDESeq)
587 p <- 3
588 abline(h=qf(.75, p, m - p),col="darkblue")
589 grid(col="lightgray",lty="dotted"); dev.off()
590 }
591 }
592 counts.dataframe = as.data.frame(c())
593 norm.factor = DGEList\$samples\$norm.factors
594 topresults.edgeR <- soutput[which(soutput\$adj.p.value < fdrthresh), ]
595 edgeRcountsindex <- which(allgenes %in% rownames(topresults.edgeR))
596 edgeRcounts <- rep(0, length(allgenes))
597 edgeRcounts[edgeRcountsindex] <- 1
598 if (doVoom == T) {
599 pdf(paste(mt,"voomplot.pdf",sep='_'))
600 dat.voomed <- voom(DGEList, mydesign, plot = TRUE, normalize.method="quantile", lib.size = NULL)
601 dev.off()
602 fit <- lmFit(dat.voomed, mydesign)
603 fit <- eBayes(fit)
604 rvoom <- topTable(fit, coef = length(colnames(mydesign)), adj = "BH", n = Inf)
605 write.table(rvoom,paste(mt,'VOOM_topTable.xls',sep='_'), quote=FALSE, sep="\t",row.names=F)
606 topresults.voom <- rvoom[which(rvoom\$adj.P.Val < fdrthresh), ]
607 voomcountsindex <- which(allgenes %in% rownames(topresults.voom))
608 voomcounts <- rep(0, length(allgenes))
609 voomcounts[voomcountsindex] <- 1
610 }
611 if ((doDESeq==T) || (doVoom==T)) {
612 if ((doVoom==T) && (doDESeq==T)) {
613 vennmain = paste(mt,'Voom,edgeR and DESeq2 overlap at FDR=',fdrthresh)
614 counts.dataframe <- data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts,
615 VOOM_limma = voomcounts, row.names = allgenes)
616 } else if (doDESeq==T) {
617 vennmain = paste(mt,'DESeq2 and edgeR overlap at FDR=',fdrthresh)
618 counts.dataframe <- data.frame(edgeR = edgeRcounts, DESeq2 = DESeqcounts, row.names = allgenes)
619 } else if (doVoom==T) {
620 vennmain = paste(mt,'Voom and edgeR overlap at FDR=',fdrthresh)
621 counts.dataframe <- data.frame(edgeR = edgeRcounts, VOOM_limma = voomcounts, row.names = allgenes)
622 }
623
624 if (nrow(counts.dataframe > 1)) {
625 counts.venn <- vennCounts(counts.dataframe)
626 vennf = paste(mt,'venn.pdf',sep='_')
627 pdf(vennf)
628 vennDiagram(counts.venn,main=vennmain,col="maroon")
629 dev.off()
630 }
631 } ### doDESeq or doVoom
632 if (doDESeq==T) {
633 cat("*** DESeq top 50\n")
634 print(srDESeq[1:50,])
635 }
636 if (doVoom==T) {
637 cat("*** VOOM top 50\n")
638 print(rvoom[1:50,])
639 }
640 if (doCamera) {
641 doGSEA(y=DGEList,design=mydesign,histgmt=histgmt,bigmt=bigmt,ntest=20,myTitle=myTitle,
642 outfname=paste(mt,"GSEA.xls",sep="_"),fdrthresh=fdrthresh,fdrtype=fdrtype)
643 }
644 uoutput
645
646 }
647 #### Done
648
649 #### sink(stdout(),append=T,type="message")
650
651 doDESeq = $DESeq.doDESeq
652 ### make these 'T' or 'F'
653 doVoom = $doVoom
654 doCamera = $camera.doCamera
655 Out_Dir = "$html_file.files_path"
656 Input = "$input1"
657 TreatmentName = "$treatment_name"
658 TreatmentCols = "$Treat_cols"
659 ControlName = "$control_name"
660 ControlCols= "$Control_cols"
661 outputfilename = "$outtab"
662 org = "$input1.dbkey"
663 if (org == "") { org = "hg19"}
664 fdrtype = "$fdrtype"
665 priordf = $priordf
666 fdrthresh = $fdrthresh
667 useNDF = "$useNDF"
668 fQ = $fQ
669 myTitle = "$title"
670 sids = strsplit("$subjectids",',')
671 subjects = unlist(sids)
672 nsubj = length(subjects)
673 builtin_gmt=""
674 history_gmt=""
675
676 builtin_gmt = ""
677 history_gmt = ""
678 DESeq_fittype=""
679 #if $DESeq.doDESeq == "T"
680 DESeq_fittype = "$DESeq.DESeq_fitType"
681 #end if
682 #if $camera.doCamera == 'T'
683 #if $camera.gmtSource.refgmtSource == "indexed" or $camera.gmtSource.refgmtSource == "both":
684 builtin_gmt = "${camera.gmtSource.builtinGMT.fields.path}"
685 #end if
686 #if $camera.gmtSource.refgmtSource == "history" or $camera.gmtSource.refgmtSource == "both":
687 history_gmt = "${camera.gmtSource.ownGMT}"
688 history_gmt_name = "${camera.gmtSource.ownGMT.name}"
689 #end if
690 #end if
691 if (nsubj > 0) {
692 if (doDESeq) {
693 print('WARNING - cannot yet use DESeq2 for 2 way anova - see the docs')
694 doDESeq = F
695 }
696 }
697 TCols = as.numeric(strsplit(TreatmentCols,",")[[1]])-1
698 CCols = as.numeric(strsplit(ControlCols,",")[[1]])-1
699 cat('Got TCols=')
700 cat(TCols)
701 cat('; CCols=')
702 cat(CCols)
703 cat('\n')
704 useCols = c(TCols,CCols)
705 if (file.exists(Out_Dir) == F) dir.create(Out_Dir)
706 Count_Matrix = read.table(Input,header=T,row.names=1,sep='\t') #Load tab file assume header
707 snames = colnames(Count_Matrix)
708 nsamples = length(snames)
709 if (nsubj > 0 & nsubj != nsamples) {
710 options("show.error.messages"=T)
711 mess = paste('Fatal error: Supplied subject id list',paste(subjects,collapse=','),
712 'has length',nsubj,'but there are',nsamples,'samples',paste(snames,collapse=','))
713 write(mess, stderr())
714 quit(save="no",status=4)
715 }
716
717 Count_Matrix = Count_Matrix[,useCols] ### reorder columns
718 if (length(subjects) != 0) {subjects = subjects[useCols]}
719 rn = rownames(Count_Matrix)
720 islib = rn %in% c('librarySize','NotInBedRegions')
721 LibSizes = Count_Matrix[subset(rn,islib),][1] # take first
722 Count_Matrix = Count_Matrix[subset(rn,! islib),]
723 group = c(rep(TreatmentName,length(TCols)), rep(ControlName,length(CCols)) )
724 group = factor(group, levels=c(ControlName,TreatmentName))
725 colnames(Count_Matrix) = paste(group,colnames(Count_Matrix),sep="_")
726 results = edgeIt(Count_Matrix=Count_Matrix,group=group,outputfilename=outputfilename,
727 fdrtype=fdrtype,priordf=priordf,fdrthresh=fdrthresh,outputdir='.',
728 myTitle=myTitle,useNDF=useNDF,libSize=c(),filterquantile=fQ,subjects=subjects,
729 doDESeq=doDESeq,doVoom=doVoom,doCamera=doCamera,org=org,
730 histgmt=history_gmt,bigmt=builtin_gmt,DESeq_fittype=DESeq_fittype)
731 sessionInfo()
732 ]]>
733 </configfile>
734 </configfiles>
735 <help>
736
737 **What it does**
738
739 Performs digital gene expression analysis between a treatment and control on a count matrix.
740 Optionally adds a term for subject if not all samples are independent or if some other factor needs to be blocked in the design.
741
742 **Input**
743
744 A matrix consisting of non-negative integers. The matrix must have a unique header row identifying the samples, and a unique set of row names
745 as the first column. Typically the row names are gene symbols or probe IDs for downstream use in GSEA and other methods.
746
747 If you have (eg) paired samples and wish to include a term in the GLM to account for some other factor (subject in the case of paired samples),
748 supply a comma separated list of integer indicators, one for every sample (whether modelled or not!), giving (eg) the subject number.
749 Leave the list empty if the samples are all independent.
750 If not empty, there must be exactly as many integers in the supplied integer list as there are columns (samples) in the count matrix.
751 Integers for samples that are not in the analysis *must* be present in the string as filler even if not used.
752
753 So if you have 2 pairs out of 6 samples, you need to put in unique integers for the unpaired ones
754 eg if you had 6 samples with the first two independent, and the second and third pairs of columns each taken from a single subject, you might use
755 8,9,1,1,2,2
756 as subject IDs to indicate that columns 3/4 are a pair from one subject and columns 5/6 are a pair from another
757
758 **Output**
759
760 A summary HTML page with links to the R source code and all the outputs, grids of plot thumbnails, extensive
761 logging, and the top 50 rows of the topTable.
762
763 The main topTables of results are provided as separate excelish tabular files.
764
765 They include adjusted p values and dispersions for each region, raw and cpm sample data counts and shrunken (predicted) log fold change estimates.
766 These are provided for downstream analyses such as GSEA and are predictions of the logFC you might expect to see
767 in an independent replication of your original experiment. Higher number means more shrinkage. Shrinkage is more extreme for low coverage features
768 so downstream analyses are more robust against strong effect size estimates based on relatively little experimental information.
769
770 **Note on prior.n**
771
772 http://seqanswers.com/forums/showthread.php?t=5591 says:
773
774 *prior.n*
775
776 The value for prior.n determines the amount of smoothing of tagwise dispersions towards the common dispersion.
777 You can think of it as like a "weight" for the common value. (It is actually the weight for the common likelihood
778 in the weighted likelihood equation). The larger the value for prior.n, the more smoothing, i.e. the closer your
779 tagwise dispersion estimates will be to the common dispersion. If you use a prior.n of 1, then that gives the
780 common likelihood the weight of one observation.
781
782 In answer to your question, it is a good thing to squeeze the tagwise dispersions towards a common value,
783 or else you will be using very unreliable estimates of the dispersion. I would not recommend using the value that
784 you obtained from estimateSmoothing()---this is far too small and would result in virtually no moderation
785 (squeezing) of the tagwise dispersions. How many samples do you have in your experiment?
786 What is the experimental design? If you have few samples (less than 6) then I would suggest a prior.n of at least 10.
787 If you have more samples, then the tagwise dispersion estimates will be more reliable,
788 so you could consider using a smaller prior.n, although I would hesitate to use a prior.n less than 5.
789
790
791 From Bioconductor Digest, Vol 118, Issue 5, Gordon writes:
792
793 Dear Dorota,
794
795 The important settings are prior.df and trend.
796
797 prior.n and prior.df are related through prior.df = prior.n * residual.df,
798 and your experiment has residual.df = 36 - 12 = 24. So the old setting of
799 prior.n=10 is equivalent for your data to prior.df = 240, a very large
800 value. Going the other way, the new setting of prior.df=10 is equivalent
801 to prior.n=10/24.
802
803 To recover old results with the current software you would use
804
805 estimateTagwiseDisp(object, prior.df=240, trend="none")
806
807 To get the new default from old software you would use
808
809 estimateTagwiseDisp(object, prior.n=10/24, trend=TRUE)
810
811 Actually the old trend method is equivalent to trend="loess" in the new
812 software. You should use plotBCV(object) to see whether a trend is
813 required.
814
815 Note you could also use
816
817 prior.n = getPriorN(object, prior.df=10)
818
819 to map between prior.df and prior.n.
820
821 **Old rant on variable name changes in Bioconductor versions**
822
823 BioC authors sometimes make small, mostly cosmetic changes to variable names (eg: from p.value to PValue),
824 often to make them more internally consistent or self describing. Unfortunately, these improvements
825 break existing code that relies on the library in ways that can take a while to track down,
826 increasing downstream tool maintenance effort uselessly.
827
828 Please, don't do that. It hurts us.
829
830
831 </help>
832
833 </tool>
834
835