0
|
1 ---
|
|
2 title: 'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis'
|
|
3 output:
|
13
|
4 html_document:
|
0
|
5 number_sections: false
|
|
6 ---
|
|
7
|
|
8 ```{r setup, include=FALSE, warning=FALSE, message=FALSE}
|
|
9 knitr::opts_chunk$set(error = FALSE, echo = FALSE)
|
|
10 ```
|
10
|
11 ```{r, include=FALSE}
|
|
12 options(tinytex.verbose = TRUE)
|
|
13 ```
|
0
|
14
|
|
15 ```{r}
|
|
16 # Make a directory for saving the figures.
|
|
17 dir.create('figures', showWarnings = FALSE)
|
|
18 ```
|
|
19
|
|
20 # Introduction
|
10
|
21 This report contains step-by-step results from use of the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis [WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) tool. This tool wraps the WGCNA R package into a ready-to-use Rmarkdown file. It performs module discovery and network construction using a dataset and optional trait data matrix provided.
|
0
|
22
|
|
23 If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.
|
|
24
|
|
25 This report was generated on:
|
|
26 ```{r}
|
|
27 format(Sys.time(), "%a %b %d %X %Y")
|
|
28 ```
|
|
29
|
|
30
|
|
31 ## About the Input Data
|
|
32 ### Gene Expression Matrix (GEM)
|
|
33 The gene expression data is an *n* x *m* matrix where *n* rows are the genes, *m* columns are the samples and the elements represent gene expression levels (derived either from Microarray or RNA-Seq). The matrix was provided in a file meething these rules:
|
|
34 - Housed in a comma-separated (CSV) file.
|
|
35 - The rows represent the gene expression levels
|
|
36 - The first column of each row is the gene, transcript or probe name.
|
|
37 - The header contains only the sample names and therefore is one value less than the remaining rows of the file.
|
|
38
|
|
39 ### Trait/Phenotype Matrix
|
|
40 The trait/phenotype data is an *n* x *m* matrix where *n* is the samples and *m* are the features such as experimental condition, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.
|
|
41
|
|
42 ## Parameters provided by the user.
|
|
43 The following describes the input arguments provided to this tool:
|
|
44 ```{r}
|
|
45
|
|
46 if (!is.null(opt$height_cut)) {
|
|
47 print('The cut height for outlier removal of the sample dendrogram:')
|
|
48 print(opt$cut_height)
|
|
49 }
|
|
50
|
|
51 if (!is.null(opt$power)) {
|
|
52 print('The power to which the gene expression data is raised:')
|
|
53 print(opt$power)
|
|
54 }
|
|
55 print('The minimal size for a module:')
|
|
56 print(opt$min_cluster_size)
|
|
57
|
|
58 print('The block size for dividing the GEM to reduce memory requirements:')
|
|
59 print(opt$block_size)
|
|
60
|
|
61 print('The hard threshold when generating the graph file:')
|
|
62 print(opt$hard_threshold)
|
|
63
|
|
64 print('The character string used to identify missing values in the GEM:')
|
|
65 print(opt$missing_value1)
|
|
66
|
|
67 if (!is.null(opt$trait_data)) {
|
|
68 print('The column in the trait data that contains the sample name:')
|
|
69 print(opt$sname_col)
|
10
|
70
|
0
|
71 print('The character string used to identify missing values in the trait data:')
|
|
72 print(opt$missing_value2)
|
10
|
73
|
0
|
74 print('Columns in the trait data that should be treated as categorical:')
|
|
75 print(opt$one_hot_cols)
|
10
|
76
|
0
|
77 print('Columns in the trait data that should be ignored:')
|
|
78 print(opt$ignore_cols)
|
|
79 }
|
|
80 ```
|
|
81
|
|
82 ## If Errors Occur
|
|
83 Please note, that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, results should be thrown out. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:
|
|
84
|
|
85 - Do the formats for the input datasets match the requirements listed above.
|
|
86 - Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e you don't have more than one such as 0.0 and 0, or NA and 0.0)
|
|
87 - If trait data was provided, check that the column specified for the sample name is correct.
|
|
88 - The block size should not exceed 10,000 and should not be lower than 1,000.
|
10
|
89 - Ensure that the sample names and all headers in the trait/phenotype data only contain alpha-numeric and underscore characters.
|
0
|
90
|
|
91
|
|
92 # Expression Data
|
|
93
|
|
94 The content below shows the first 10 rows and 6 columns of the Gene Expression Matrix (GEM) file that was provided.
|
|
95
|
|
96 ```{r}
|
|
97 gem = read.csv(opt$expression_data, header = TRUE, row.names = 1, na.strings = opt$missing_value1)
|
|
98 #table_data = head(gem, 100)
|
|
99 #datatable(table_data)
|
|
100 gem[1:10,1:6]
|
|
101 ```
|
|
102
|
|
103 ```{r}
|
|
104 gemt = as.data.frame(t(gem))
|
|
105 ```
|
|
106
|
|
107 The next step is to check the data for low quality samples or genes. These have too many missing values or consist of genes with zero-variance. Samples and genes are removed if they are low quality. The `goodSamplesGenes` function of WGCNA is used to identify such cases. The following cell indicates if WGCNA identified any low quality genes or samples, and these were removed.
|
|
108
|
|
109
|
|
110 ```{r}
|
|
111 gsg = goodSamplesGenes(gemt, verbose = 3)
|
|
112
|
|
113 if (!gsg$allOK) {
|
|
114 gemt = gemt[gsg$goodSamples, gsg$goodGenes]
|
|
115 } else {
|
|
116 print('all genes are OK!')
|
|
117 }
|
|
118 ```
|
|
119
|
|
120
|
|
121 Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If a height was not specified to trim outlier samples, then the `cutreeDynamic` function is used to automatically find outliers, and then they are removed. If you do not approve of the automatically detected height, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.
|
|
122
|
|
123 ```{r fig.align='center'}
|
|
124 sampleTree = hclust(dist(gemt), method = "average");
|
|
125
|
|
126 plotSampleDendro <- function() {
|
|
127 plot(sampleTree, main = "Sample Clustering Prior to Outlier Removal", sub="", xlab="",
|
|
128 cex.axis = 1, cex.main = 1, cex = 0.5)
|
|
129 }
|
|
130 png('figures/01-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
|
|
131 plotSampleDendro()
|
|
132 invisible(dev.off())
|
|
133 plotSampleDendro()
|
|
134 ```
|
|
135
|
|
136 ```{r}
|
|
137 if (is.null(opt$height_cut)) {
|
|
138 print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")
|
|
139 clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)
|
|
140 keepSamples = (clust!=0)
|
|
141 gemt = gemt[keepSamples, ]
|
|
142 } else {
|
|
143 print("You specified a height for cutting of", opt$height_cut, ". The cutreeStatic function was used.")
|
|
144 clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)
|
|
145 keepSamples = (clust==1)
|
|
146 gemt = gemt[keepSamples, ]
|
|
147 }
|
|
148 n_genes = ncol(gemt)
|
|
149 n_samples = nrow(gemt)
|
|
150 removed = length(which(keepSamples == FALSE))
|
|
151 if (removed == 1) {
|
|
152 print(paste("A total of", removed, "sample was removed"))
|
|
153 } else {
|
|
154 print(paste("A total of", removed, "samples were removed"))
|
|
155 }
|
|
156
|
|
157 # Write out the filtered GEM
|
|
158 write.csv(t(gemt), opt$filtered_GEM, quote=FALSE)
|
|
159 ```
|
|
160 A file named `filtered_GEM.csv` has been created. This file is a comma-separated file containing the original gene expression data but with outlier samples removed. If no outliers were detected this file will be identical to the original.
|
|
161
|
|
162 ```{r fig.align='center'}
|
|
163 sampleTree = hclust(dist(gemt), method = "average");
|
|
164
|
|
165 plotFilteredSampleDendro <- function() {
|
|
166 plot(sampleTree, main = "Sample Clustering After Outlier Removal", sub="", xlab="",
|
|
167 cex.axis = 1, cex.main = 1, cex = 0.5)
|
|
168 }
|
|
169 png('figures/02-filtered-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
|
|
170 plotFilteredSampleDendro()
|
|
171 invisible(dev.off())
|
|
172 plotFilteredSampleDendro()
|
|
173 ```
|
|
174
|
|
175 # Network Module Discovery
|
|
176
|
|
177 The first step in network module discovery is calculating similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. However, the WGCNA authors suggest that raising the GEM to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to which the data should be raised is initially unknown. This is determined using the `pickSoftThreshold` function of WGCNA which iterates through a series of power values (usually between 1 to 20) and tests how well the data approximates scale-free behavior. The following table shows the results of those tests. The meaning of the table headers are:
|
|
178
|
|
179 - Power: The power tested
|
|
180 - SFT.R.sq: This is the scale free index, or the R.squared value of the undelrying regression model. It indicates how well the power-raised data appears scale free. The higher the value the more scale-free.
|
|
181 - slope: The slope of the regression line used to calculate SFT.R.sq
|
|
182 - trunacted.R.sq: The adjusted R.squared measure from the truncated exponential model used to calculate SFT.R.sq
|
|
183 - mean.k: The mean degree (degree is a measure of how connected a gene is to every other gene. The higher the number the more connected.)
|
|
184 - median.k: The median degree
|
|
185 - max.k: The largest degree.
|
|
186
|
|
187 ```{r}
|
|
188 powers = c(1:10, seq(12, 20, 2))
|
|
189 sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)
|
|
190 ```
|
|
191
|
|
192 The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power value for the network should be the value where there is a diminishing change in both the scale-free index and mean connectivity.
|
|
193
|
|
194 ```{r fig.align='center'}
|
|
195 par(mfrow=c(1,2))
|
|
196 th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]
|
|
197
|
|
198 plotPower <- function() {
|
|
199 # Scale-free topology fit index as a function of the soft-thresholding power.
|
|
200 plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
|
|
201 xlab="Soft Threshold (power)",
|
|
202 ylab="Scale Free Topology Model Fit,signed R^2", type="n",
|
|
203 main = paste("Scale Independence"), cex.lab = 0.5);
|
|
204 text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
|
|
205 labels=powers,cex=0.5,col="red");
|
|
206 #abline(h=th, col="blue")
|
10
|
207
|
0
|
208 # Mean connectivity as a function of the soft-thresholding power.
|
|
209 plot(sft$fitIndices[,1], sft$fitIndices[,5],
|
|
210 xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
|
|
211 main = paste("Mean Connectivity"), cex.lab = 0.5)
|
|
212 text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
|
|
213 #abline(h=th, col="blue")
|
|
214 par(mfrow=c(1,1))
|
|
215 }
|
|
216
|
|
217 png('figures/03-power_thresholding.png', width=6 ,height=5, units="in", res=300)
|
|
218 plotPower()
|
|
219 invisible(dev.off())
|
|
220 plotPower()
|
|
221 ```
|
|
222 Using the values in the table above, WGCNA is able to predict the ideal power. This selection is indicated in the following cell and is shown as a blue line on the plots above. If you believe that the power was incorrectly chosen, you can re-run this tool with the same input files and provide the desired power.
|
|
223
|
|
224 ```{r}
|
|
225 print("WGCNA predicted the following power:")
|
|
226 print(sft$powerEstimate)
|
|
227 power = sft$powerEstimate
|
|
228 if (!is.null(opt$power)) {
|
|
229 print("However, you selected to override this by providing a power of:", opt$soft_threshold_power)
|
|
230 print(opt$soft_threshold_power)
|
|
231 power = opt$power
|
|
232 }
|
|
233 ```
|
|
234
|
|
235 Now that a power has been identified, modules can be discovered. Here, the `blockwiseModule` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks is dependent on the block size you provided.
|
|
236
|
|
237 ```{r}
|
|
238 net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
|
|
239 TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
|
|
240 reassignThreshold = 0, mergeCutHeight = 0.25,
|
|
241 numericLabels = TRUE, pamRespectsDendro = FALSE,
|
|
242 verbose = 1, saveTOMs = TRUE,
|
|
243 saveTOMFileBase = "TOM")
|
|
244 blocks = sort(unique(net$blocks))
|
|
245 ```
|
|
246 The following table shows the list of modules that were discovered and their size (i.e. number of genes).
|
|
247
|
|
248 ```{r}
|
|
249 module_labels = labels2colors(net$colors)
|
|
250 module_labels = paste("ME", module_labels, sep="")
|
|
251 module_labels2num = unique(data.frame(label = module_labels, num = net$color, row.names=NULL))
|
|
252 rownames(module_labels2num) = paste0('ME', module_labels2num$num)
|
|
253 modules = unique(as.data.frame(table(module_labels)))
|
|
254 n_modules = length(modules) - 1
|
|
255 module_size_upper = modules[2]
|
|
256 module_size_lower = modules[length(modules)]
|
|
257 colnames(modules) = c('Module', 'Module Size')
|
|
258 #datatable(modules)
|
|
259 modules
|
|
260 ```
|
|
261
|
|
262 Modules consist of a set of genes that have highly similar expression patterns. Therefore, the similarity of genes within a module can be summarized using an "eigengene" vector. This vector is analgous to the first principal component in a PCA analysis. Once each module's eigengene is calculated, they can be compared and displayed in dendrogram to identify which modules are most similar to each other. This is visible in the following plot.
|
|
263
|
|
264 ```{r fig.align='center'}
|
|
265 MEs = net$MEs
|
|
266 colnames(MEs) = module_labels2num[colnames(MEs),]$label
|
|
267
|
|
268 png('figures/04-module_dendrogram.png', width=6 ,height=5, units="in", res=300)
|
|
269 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
|
|
270 dev.off()
|
|
271 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
|
|
272 ```
|
|
273
|
|
274 Alternatively, we can use a heatmap to explore similarity of each module.
|
|
275 ```{r fig.align='center'}
|
|
276 plotModuleHeatmap <- function() {
|
|
277 plotEigengeneNetworks(MEs, "Module Eigengene Heatmap",
|
|
278 marHeatmap = c(2, 3, 2, 2),
|
|
279 plotDendrograms = FALSE)
|
|
280 }
|
|
281 png('figures/05-module_eigengene_heatmap.png', width=4 ,height=4, units="in", res=300)
|
|
282 plotModuleHeatmap()
|
|
283 invisible(dev.off())
|
|
284 plotModuleHeatmap()
|
|
285 ```
|
|
286
|
|
287 We can examine gene similarity within the context of our modules. The following dendrogram clusters genes by their similarity of expression and the modules to which each gene belongs is shown under the graph. When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicate that genes in modules tend to have similar expression.
|
|
288
|
|
289 ```{r}
|
|
290 # Plot the dendrogram and the module colors underneath
|
|
291 for (i in blocks) {
|
|
292 options(repr.plot.width=15, repr.plot.height=10)
|
|
293 colors = module_labels[net$blockGenes[[i]]]
|
|
294 colors = sub('ME','', colors)
|
|
295 plotClusterDendro <- function() {
|
|
296 plotDendroAndColors(net$dendrograms[[i]], colors,
|
|
297 "Module colors", dendroLabels = FALSE, hang = 0.03,
|
10
|
298 addGuide = TRUE, guideHang = 0.05,
|
0
|
299 main=paste('Cluster Dendgrogram, Block', i))
|
|
300 }
|
|
301 png(paste0('figures/06-cluster_dendrogram_block_', i, '.png'), width=6 ,height=4, units="in", res=300)
|
|
302 plotClusterDendro();
|
|
303 invisible(dev.off())
|
|
304 plotClusterDendro();
|
|
305 }
|
|
306
|
|
307 ```
|
|
308
|
|
309 The network is housed in a *n* x *n* similarity matrix known as the the Topological Overlap Matrix (TOM), where *n* is the number of genes and the value in each cell indicates the measure of similarity in terms of correlation of expression and interconnectedness. The following heat maps shows the TOM. Note, that the dendrograms in the TOM heat map may differ from what is shown above. This is because a subset of genes were selected to draw the heat maps in order to save on computational time.
|
|
310
|
|
311 ```{r}
|
|
312 for (i in blocks) {
|
|
313 # Load the TOM from a file.
|
|
314 load(net$TOMFiles[i])
|
|
315 TOM_size = length(which(net$blocks == i))
|
|
316 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
|
|
317 dissTOM = 1-TOM
|
|
318
|
|
319 # For reproducibility, we set the random seed
|
|
320 set.seed(10);
|
|
321 select = sample(dim(TOM)[1], size = 1000);
|
|
322 selectColors = module_labels[net$blockGenes[[i]][select]]
|
|
323 selectTOM = dissTOM[select, select];
|
|
324
|
|
325 # There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
|
|
326 selectTree = hclust(as.dist(selectTOM), method = "average")
|
|
327
|
|
328 # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
|
|
329 # the color palette; setting the diagonal to NA also improves the clarity of the plot
|
|
330 plotDiss = selectTOM^7;
|
|
331 diag(plotDiss) = NA;
|
|
332 colors = sub('ME','', selectColors)
|
10
|
333
|
0
|
334 png(paste0('figures/06-TOM_heatmap_block_', i, '.png'), width=6 ,height=6, units="in", res=300)
|
|
335 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
|
|
336 dev.off()
|
|
337 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
|
|
338 }
|
|
339 ```
|
|
340
|
|
341 ```{r}
|
4
|
342 module_colors = sub('ME','', module_labels)
|
|
343 output = cbind(colnames(gemt), module_labels, module_colors)
|
|
344 colnames(output) = c('Gene', 'Module', 'Color')
|
0
|
345 write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
|
|
346 ```
|
|
347
|
|
348 A file has been generated named `gene_module_file.csv` which contains the list of genes and the modules they belong to.
|
|
349
|
|
350 The TOM serves as both a simialrity matrix and an adjacency matrix. The adjacency matrix is typically identical to a similarity matrix but with values above a set threshold set to 1 and values below set to 0. This is known as hard thresholding. However, WGCNA does not set values above a threshold to zero but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding", because the pairwise weights of all genes contributed to discover of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
|
|
351
|
|
352 Unfortunately, this "soft thresholding" approach can make creation of a graph representation of the network difficult. If we exported the TOM as a connected graph it would result in a fully complete graph and would be difficult to interpret. Therefore, we must still set a hard-threshold if we want to visualize connectivity in graph form. Setting a hard threshold, if too high can result in genes being excluded from the graph and a threshold that is too low can result in too many false edges in the graph.
|
|
353
|
|
354 ```{r}
|
|
355 edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
|
|
356 for (i in blocks) {
|
|
357 # Load the TOM from a file.
|
|
358 load(net$TOMFiles[i])
|
|
359 TOM_size = length(which(net$blocks == i))
|
|
360 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
|
|
361 colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
|
|
362 row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]
|
|
363
|
|
364 cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
|
|
365 edges = rbind(edges, cydata$edgeData)
|
|
366 }
|
|
367
|
|
368 edges$Interaction = 'co'
|
|
369 output = edges[,c('fromNode','toNode','Interaction', 'weight')]
|
|
370 colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
|
|
371 write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
|
|
372 ```
|
|
373
|
|
374 Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.
|
|
375
|
|
376 ```{r}
|
|
377 # Save this image for the next step which is optional if theuser
|
|
378 # provides a trait file.
|
|
379 save.image(file=opt$r_data)
|
|
380 ```
|