annotate aurora_wgcna.Rmd @ 13:53fc3fdf5e9a draft

Uploaded
author spficklin
date Mon, 09 Dec 2019 12:09:10 -0500
parents 96ba1a8fff06
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
1 ---
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
2 title: 'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis'
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
3 output:
13
53fc3fdf5e9a Uploaded
spficklin
parents: 10
diff changeset
4 html_document:
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
5 number_sections: false
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
6 ---
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
7
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
8 ```{r setup, include=FALSE, warning=FALSE, message=FALSE}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
9 knitr::opts_chunk$set(error = FALSE, echo = FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
10 ```
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
11 ```{r, include=FALSE}
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
12 options(tinytex.verbose = TRUE)
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
13 ```
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
14
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
15 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
16 # Make a directory for saving the figures.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
17 dir.create('figures', showWarnings = FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
18 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
19
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
20 # Introduction
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
21 This report contains step-by-step results from use of the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis [WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) tool. This tool wraps the WGCNA R package into a ready-to-use Rmarkdown file. It performs module discovery and network construction using a dataset and optional trait data matrix provided.
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
22
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
23 If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
24
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
25 This report was generated on:
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
26 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
27 format(Sys.time(), "%a %b %d %X %Y")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
28 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
29
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
30
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
31 ## About the Input Data
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
32 ### Gene Expression Matrix (GEM)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
33 The gene expression data is an *n* x *m* matrix where *n* rows are the genes, *m* columns are the samples and the elements represent gene expression levels (derived either from Microarray or RNA-Seq). The matrix was provided in a file meething these rules:
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
34 - Housed in a comma-separated (CSV) file.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
35 - The rows represent the gene expression levels
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
36 - The first column of each row is the gene, transcript or probe name.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
37 - The header contains only the sample names and therefore is one value less than the remaining rows of the file.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
38
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
39 ### Trait/Phenotype Matrix
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
40 The trait/phenotype data is an *n* x *m* matrix where *n* is the samples and *m* are the features such as experimental condition, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
41
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
42 ## Parameters provided by the user.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
43 The following describes the input arguments provided to this tool:
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
44 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
45
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
46 if (!is.null(opt$height_cut)) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
47 print('The cut height for outlier removal of the sample dendrogram:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
48 print(opt$cut_height)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
49 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
50
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
51 if (!is.null(opt$power)) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
52 print('The power to which the gene expression data is raised:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
53 print(opt$power)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
54 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
55 print('The minimal size for a module:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
56 print(opt$min_cluster_size)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
57
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
58 print('The block size for dividing the GEM to reduce memory requirements:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
59 print(opt$block_size)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
60
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
61 print('The hard threshold when generating the graph file:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
62 print(opt$hard_threshold)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
63
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
64 print('The character string used to identify missing values in the GEM:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
65 print(opt$missing_value1)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
66
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
67 if (!is.null(opt$trait_data)) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
68 print('The column in the trait data that contains the sample name:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
69 print(opt$sname_col)
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
70
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
71 print('The character string used to identify missing values in the trait data:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
72 print(opt$missing_value2)
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
73
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
74 print('Columns in the trait data that should be treated as categorical:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
75 print(opt$one_hot_cols)
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
76
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
77 print('Columns in the trait data that should be ignored:')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
78 print(opt$ignore_cols)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
79 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
80 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
81
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
82 ## If Errors Occur
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
83 Please note, that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, results should be thrown out. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
84
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
85 - Do the formats for the input datasets match the requirements listed above.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
86 - Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e you don't have more than one such as 0.0 and 0, or NA and 0.0)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
87 - If trait data was provided, check that the column specified for the sample name is correct.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
88 - The block size should not exceed 10,000 and should not be lower than 1,000.
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
89 - Ensure that the sample names and all headers in the trait/phenotype data only contain alpha-numeric and underscore characters.
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
90
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
91
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
92 # Expression Data
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
93
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
94 The content below shows the first 10 rows and 6 columns of the Gene Expression Matrix (GEM) file that was provided.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
95
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
96 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
97 gem = read.csv(opt$expression_data, header = TRUE, row.names = 1, na.strings = opt$missing_value1)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
98 #table_data = head(gem, 100)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
99 #datatable(table_data)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
100 gem[1:10,1:6]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
101 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
102
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
103 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
104 gemt = as.data.frame(t(gem))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
105 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
106
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
107 The next step is to check the data for low quality samples or genes. These have too many missing values or consist of genes with zero-variance. Samples and genes are removed if they are low quality. The `goodSamplesGenes` function of WGCNA is used to identify such cases. The following cell indicates if WGCNA identified any low quality genes or samples, and these were removed.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
108
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
109
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
110 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
111 gsg = goodSamplesGenes(gemt, verbose = 3)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
112
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
113 if (!gsg$allOK) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
114 gemt = gemt[gsg$goodSamples, gsg$goodGenes]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
115 } else {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
116 print('all genes are OK!')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
117 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
118 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
119
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
120
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
121 Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If a height was not specified to trim outlier samples, then the `cutreeDynamic` function is used to automatically find outliers, and then they are removed. If you do not approve of the automatically detected height, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
122
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
123 ```{r fig.align='center'}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
124 sampleTree = hclust(dist(gemt), method = "average");
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
125
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
126 plotSampleDendro <- function() {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
127 plot(sampleTree, main = "Sample Clustering Prior to Outlier Removal", sub="", xlab="",
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
128 cex.axis = 1, cex.main = 1, cex = 0.5)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
129 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
130 png('figures/01-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
131 plotSampleDendro()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
132 invisible(dev.off())
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
133 plotSampleDendro()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
134 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
135
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
136 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
137 if (is.null(opt$height_cut)) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
138 print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
139 clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
140 keepSamples = (clust!=0)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
141 gemt = gemt[keepSamples, ]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
142 } else {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
143 print("You specified a height for cutting of", opt$height_cut, ". The cutreeStatic function was used.")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
144 clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
145 keepSamples = (clust==1)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
146 gemt = gemt[keepSamples, ]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
147 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
148 n_genes = ncol(gemt)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
149 n_samples = nrow(gemt)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
150 removed = length(which(keepSamples == FALSE))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
151 if (removed == 1) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
152 print(paste("A total of", removed, "sample was removed"))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
153 } else {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
154 print(paste("A total of", removed, "samples were removed"))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
155 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
156
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
157 # Write out the filtered GEM
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
158 write.csv(t(gemt), opt$filtered_GEM, quote=FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
159 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
160 A file named `filtered_GEM.csv` has been created. This file is a comma-separated file containing the original gene expression data but with outlier samples removed. If no outliers were detected this file will be identical to the original.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
161
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
162 ```{r fig.align='center'}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
163 sampleTree = hclust(dist(gemt), method = "average");
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
164
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
165 plotFilteredSampleDendro <- function() {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
166 plot(sampleTree, main = "Sample Clustering After Outlier Removal", sub="", xlab="",
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
167 cex.axis = 1, cex.main = 1, cex = 0.5)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
168 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
169 png('figures/02-filtered-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
170 plotFilteredSampleDendro()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
171 invisible(dev.off())
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
172 plotFilteredSampleDendro()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
173 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
174
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
175 # Network Module Discovery
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
176
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
177 The first step in network module discovery is calculating similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. However, the WGCNA authors suggest that raising the GEM to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to which the data should be raised is initially unknown. This is determined using the `pickSoftThreshold` function of WGCNA which iterates through a series of power values (usually between 1 to 20) and tests how well the data approximates scale-free behavior. The following table shows the results of those tests. The meaning of the table headers are:
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
178
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
179 - Power: The power tested
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
180 - SFT.R.sq: This is the scale free index, or the R.squared value of the undelrying regression model. It indicates how well the power-raised data appears scale free. The higher the value the more scale-free.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
181 - slope: The slope of the regression line used to calculate SFT.R.sq
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
182 - trunacted.R.sq: The adjusted R.squared measure from the truncated exponential model used to calculate SFT.R.sq
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
183 - mean.k: The mean degree (degree is a measure of how connected a gene is to every other gene. The higher the number the more connected.)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
184 - median.k: The median degree
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
185 - max.k: The largest degree.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
186
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
187 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
188 powers = c(1:10, seq(12, 20, 2))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
189 sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
190 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
191
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
192 The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power value for the network should be the value where there is a diminishing change in both the scale-free index and mean connectivity.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
193
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
194 ```{r fig.align='center'}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
195 par(mfrow=c(1,2))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
196 th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
197
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
198 plotPower <- function() {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
199 # Scale-free topology fit index as a function of the soft-thresholding power.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
200 plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
201 xlab="Soft Threshold (power)",
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
202 ylab="Scale Free Topology Model Fit,signed R^2", type="n",
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
203 main = paste("Scale Independence"), cex.lab = 0.5);
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
204 text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
205 labels=powers,cex=0.5,col="red");
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
206 #abline(h=th, col="blue")
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
207
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
208 # Mean connectivity as a function of the soft-thresholding power.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
209 plot(sft$fitIndices[,1], sft$fitIndices[,5],
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
210 xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
211 main = paste("Mean Connectivity"), cex.lab = 0.5)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
212 text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
213 #abline(h=th, col="blue")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
214 par(mfrow=c(1,1))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
215 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
216
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
217 png('figures/03-power_thresholding.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
218 plotPower()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
219 invisible(dev.off())
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
220 plotPower()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
221 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
222 Using the values in the table above, WGCNA is able to predict the ideal power. This selection is indicated in the following cell and is shown as a blue line on the plots above. If you believe that the power was incorrectly chosen, you can re-run this tool with the same input files and provide the desired power.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
223
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
224 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
225 print("WGCNA predicted the following power:")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
226 print(sft$powerEstimate)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
227 power = sft$powerEstimate
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
228 if (!is.null(opt$power)) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
229 print("However, you selected to override this by providing a power of:", opt$soft_threshold_power)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
230 print(opt$soft_threshold_power)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
231 power = opt$power
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
232 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
233 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
234
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
235 Now that a power has been identified, modules can be discovered. Here, the `blockwiseModule` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks is dependent on the block size you provided.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
236
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
237 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
238 net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
239 TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
240 reassignThreshold = 0, mergeCutHeight = 0.25,
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
241 numericLabels = TRUE, pamRespectsDendro = FALSE,
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
242 verbose = 1, saveTOMs = TRUE,
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
243 saveTOMFileBase = "TOM")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
244 blocks = sort(unique(net$blocks))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
245 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
246 The following table shows the list of modules that were discovered and their size (i.e. number of genes).
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
247
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
248 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
249 module_labels = labels2colors(net$colors)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
250 module_labels = paste("ME", module_labels, sep="")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
251 module_labels2num = unique(data.frame(label = module_labels, num = net$color, row.names=NULL))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
252 rownames(module_labels2num) = paste0('ME', module_labels2num$num)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
253 modules = unique(as.data.frame(table(module_labels)))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
254 n_modules = length(modules) - 1
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
255 module_size_upper = modules[2]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
256 module_size_lower = modules[length(modules)]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
257 colnames(modules) = c('Module', 'Module Size')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
258 #datatable(modules)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
259 modules
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
260 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
261
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
262 Modules consist of a set of genes that have highly similar expression patterns. Therefore, the similarity of genes within a module can be summarized using an "eigengene" vector. This vector is analgous to the first principal component in a PCA analysis. Once each module's eigengene is calculated, they can be compared and displayed in dendrogram to identify which modules are most similar to each other. This is visible in the following plot.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
263
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
264 ```{r fig.align='center'}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
265 MEs = net$MEs
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
266 colnames(MEs) = module_labels2num[colnames(MEs),]$label
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
267
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
268 png('figures/04-module_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
269 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
270 dev.off()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
271 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
272 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
273
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
274 Alternatively, we can use a heatmap to explore similarity of each module.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
275 ```{r fig.align='center'}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
276 plotModuleHeatmap <- function() {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
277 plotEigengeneNetworks(MEs, "Module Eigengene Heatmap",
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
278 marHeatmap = c(2, 3, 2, 2),
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
279 plotDendrograms = FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
280 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
281 png('figures/05-module_eigengene_heatmap.png', width=4 ,height=4, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
282 plotModuleHeatmap()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
283 invisible(dev.off())
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
284 plotModuleHeatmap()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
285 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
286
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
287 We can examine gene similarity within the context of our modules. The following dendrogram clusters genes by their similarity of expression and the modules to which each gene belongs is shown under the graph. When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicate that genes in modules tend to have similar expression.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
288
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
289 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
290 # Plot the dendrogram and the module colors underneath
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
291 for (i in blocks) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
292 options(repr.plot.width=15, repr.plot.height=10)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
293 colors = module_labels[net$blockGenes[[i]]]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
294 colors = sub('ME','', colors)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
295 plotClusterDendro <- function() {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
296 plotDendroAndColors(net$dendrograms[[i]], colors,
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
297 "Module colors", dendroLabels = FALSE, hang = 0.03,
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
298 addGuide = TRUE, guideHang = 0.05,
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
299 main=paste('Cluster Dendgrogram, Block', i))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
300 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
301 png(paste0('figures/06-cluster_dendrogram_block_', i, '.png'), width=6 ,height=4, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
302 plotClusterDendro();
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
303 invisible(dev.off())
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
304 plotClusterDendro();
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
305 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
306
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
307 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
308
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
309 The network is housed in a *n* x *n* similarity matrix known as the the Topological Overlap Matrix (TOM), where *n* is the number of genes and the value in each cell indicates the measure of similarity in terms of correlation of expression and interconnectedness. The following heat maps shows the TOM. Note, that the dendrograms in the TOM heat map may differ from what is shown above. This is because a subset of genes were selected to draw the heat maps in order to save on computational time.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
310
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
311 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
312 for (i in blocks) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
313 # Load the TOM from a file.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
314 load(net$TOMFiles[i])
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
315 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
316 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
317 dissTOM = 1-TOM
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
318
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
319 # For reproducibility, we set the random seed
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
320 set.seed(10);
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
321 select = sample(dim(TOM)[1], size = 1000);
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
322 selectColors = module_labels[net$blockGenes[[i]][select]]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
323 selectTOM = dissTOM[select, select];
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
324
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
325 # There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
326 selectTree = hclust(as.dist(selectTOM), method = "average")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
327
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
328 # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
329 # the color palette; setting the diagonal to NA also improves the clarity of the plot
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
330 plotDiss = selectTOM^7;
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
331 diag(plotDiss) = NA;
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
332 colors = sub('ME','', selectColors)
10
96ba1a8fff06 Uploaded
spficklin
parents: 4
diff changeset
333
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
334 png(paste0('figures/06-TOM_heatmap_block_', i, '.png'), width=6 ,height=6, units="in", res=300)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
335 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
336 dev.off()
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
337 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
338 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
339 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
340
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
341 ```{r}
4
b14e4bf568b0 Uploaded
spficklin
parents: 0
diff changeset
342 module_colors = sub('ME','', module_labels)
b14e4bf568b0 Uploaded
spficklin
parents: 0
diff changeset
343 output = cbind(colnames(gemt), module_labels, module_colors)
b14e4bf568b0 Uploaded
spficklin
parents: 0
diff changeset
344 colnames(output) = c('Gene', 'Module', 'Color')
0
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
345 write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
346 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
347
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
348 A file has been generated named `gene_module_file.csv` which contains the list of genes and the modules they belong to.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
349
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
350 The TOM serves as both a simialrity matrix and an adjacency matrix. The adjacency matrix is typically identical to a similarity matrix but with values above a set threshold set to 1 and values below set to 0. This is known as hard thresholding. However, WGCNA does not set values above a threshold to zero but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding", because the pairwise weights of all genes contributed to discover of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
351
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
352 Unfortunately, this "soft thresholding" approach can make creation of a graph representation of the network difficult. If we exported the TOM as a connected graph it would result in a fully complete graph and would be difficult to interpret. Therefore, we must still set a hard-threshold if we want to visualize connectivity in graph form. Setting a hard threshold, if too high can result in genes being excluded from the graph and a threshold that is too low can result in too many false edges in the graph.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
353
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
354 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
355 edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
356 for (i in blocks) {
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
357 # Load the TOM from a file.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
358 load(net$TOMFiles[i])
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
359 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
360 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
361 colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
362 row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
363
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
364 cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
365 edges = rbind(edges, cydata$edgeData)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
366 }
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
367
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
368 edges$Interaction = 'co'
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
369 output = edges[,c('fromNode','toNode','Interaction', 'weight')]
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
370 colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
371 write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
372 ```
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
373
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
374 Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
375
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
376 ```{r}
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
377 # Save this image for the next step which is optional if theuser
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
378 # provides a trait file.
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
379 save.image(file=opt$r_data)
66ef158fa85c Uploaded
spficklin
parents:
diff changeset
380 ```