aurora_wgcna: aurora_wgcna.Rmd annotate

author	spficklin
date	Fri, 22 Nov 2019 19:44:46 -0500
parents
children	b14e4bf568b0

---
title: 'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis'
output:
  pdf_document:
    number_sections: false
---

```{r setup, include=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(error = FALSE, echo = FALSE)
```

```{r}
# Make a directory for saving the figures.
dir.create('figures', showWarnings = FALSE)
```

# Introduction
This report contains step-by-step results from the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis ([WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559)) tool. This tool wraps the WGCNA R package into a ready-to-use R Markdown file. It performs network construction and module discovery using the gene expression dataset and the optional trait data matrix that you provided.

If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.

This report was generated on:
```{r}
format(Sys.time(), "%a %b %d %X %Y")
```


## About the Input Data
### Gene Expression Matrix (GEM)
The gene expression data is an n x m matrix where the n rows are the genes, the m columns are the samples, and the elements represent gene expression levels (derived from either microarray or RNA-Seq experiments). The matrix was provided in a file meeting these rules:

- Housed in a comma-separated (CSV) file.
- Each row contains the expression levels of one gene.
- The first column of each row is the gene, transcript or probe name.
- The header contains only the sample names and therefore has one value fewer than the remaining rows of the file.
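
As a minimal sketch of the layout described by the rules above, the following reads a tiny made-up GEM from an inline string. The gene names, sample names and values are purely illustrative; only the `read.csv` arguments mirror how this report reads the real file.

```r
# Toy GEM in the expected layout: the header lists only the three sample names,
# so it has one value fewer than each data row (illustrative values).
csv_text <- "Sample1,Sample2,Sample3
GeneA,10.2,11.5,NA
GeneB,0.0,3.1,2.7
GeneC,8.8,NA,9.4"

# row.names = 1 takes the gene names from the first column of each data row.
gem <- read.csv(text = csv_text, header = TRUE, row.names = 1, na.strings = "NA")

dim(gem)        # genes x samples
rownames(gem)   # gene names
colnames(gem)   # sample names
```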

### Trait/Phenotype Matrix
The trait/phenotype data is an n x m matrix where the n rows are the samples and the m columns are features such as experimental conditions, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.

## Parameters Provided by the User
The following describes the input arguments provided to this tool:
```{r}
if (!is.null(opt$height_cut)) {
  print('The cut height for outlier removal of the sample dendrogram:')
  print(opt$height_cut)
}

if (!is.null(opt$power)) {
  print('The soft-thresholding power to which the gene correlations are raised:')
  print(opt$power)
}
print('The minimal size for a module:')
print(opt$min_cluster_size)

print('The block size for dividing the GEM to reduce memory requirements:')
print(opt$block_size)

print('The hard threshold when generating the graph file:')
print(opt$hard_threshold)

print('The character string used to identify missing values in the GEM:')
print(opt$missing_value1)

if (!is.null(opt$trait_data)) {
  print('The column in the trait data that contains the sample name:')
  print(opt$sname_col)

  print('The character string used to identify missing values in the trait data:')
  print(opt$missing_value2)

  print('Columns in the trait data that should be treated as categorical:')
  print(opt$one_hot_cols)

  print('Columns in the trait data that should be ignored:')
  print(opt$ignore_cols)
}
```

## If Errors Occur
Please note that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, the results should be discarded. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:

- Do the formats of the input datasets match the requirements listed above?
- Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e., you don't have more than one, such as 0.0 and 0, or NA and 0.0)?
- If trait data was provided, check that the column specified for the sample name is correct.
- The block size should not exceed 10,000 and should not be lower than 1,000.
- Ensure that the sample names and all headers in the trait/phenotype data contain only alphanumeric and underscore characters.
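
Several items on the checklist above can be tested before running the tool. The sketch below is one possible way to do so; `check_inputs` and its arguments are hypothetical names introduced here for illustration and are not part of this tool or of WGCNA. The toy data deliberately contains one problem of each kind.

```r
# Hypothetical pre-flight checks mirroring the checklist above (not part of the tool).
check_inputs <- function(gem, sample_names, trait_headers, block_size) {
  problems <- character(0)

  # The block size should stay within the recommended range.
  if (block_size < 1000 || block_size > 10000) {
    problems <- c(problems, "block size is outside 1,000-10,000")
  }

  # Names may contain only letters, digits and underscores.
  bad <- grep("^[A-Za-z0-9_]+$", c(sample_names, trait_headers),
              invert = TRUE, value = TRUE)
  if (length(bad) > 0) {
    problems <- c(problems, paste("invalid names:", paste(bad, collapse = ", ")))
  }

  # A non-numeric column usually means a missing-value string was not recognized.
  non_numeric <- names(gem)[!vapply(gem, is.numeric, logical(1))]
  if (length(non_numeric) > 0) {
    problems <- c(problems, paste("non-numeric GEM columns:",
                                  paste(non_numeric, collapse = ", ")))
  }
  problems
}

# Toy data: column S2 mixes two missing-value markers, so "NA" is read as text.
gem <- read.csv(text = "S1,S2\ng1,1.5,0.0\ng2,2.5,NA\ng3,0.0,3.1",
                row.names = 1, na.strings = "0.0", stringsAsFactors = FALSE)
problems <- check_inputs(gem, sample_names = colnames(gem),
                         trait_headers = c("Sample", "Treatment Group"),
                         block_size = 500)
print(problems)
```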


# Expression Data

The content below shows the first 10 rows and 6 columns of the Gene Expression Matrix (GEM) file that was provided.

```{r}
gem = read.csv(opt$expression_data, header = TRUE, row.names = 1, na.strings = opt$missing_value1)
#table_data = head(gem, 100)
#datatable(table_data)
gem[1:10,1:6]
```

```{r}
# Transpose the GEM so that samples are rows and genes are columns.
gemt = as.data.frame(t(gem))
```

The next step is to check the data for low-quality samples or genes: those with too many missing values, and genes with zero variance. The `goodSamplesGenes` function of WGCNA is used to identify such cases, and any low-quality samples and genes are removed. The following cell indicates whether WGCNA identified any low-quality genes or samples.


```{r}
gsg = goodSamplesGenes(gemt, verbose = 3)

if (!gsg$allOK) {
  gemt = gemt[gsg$goodSamples, gsg$goodGenes]
} else {
  print('All genes and samples are OK!')
}
```
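
To make the idea behind this quality check concrete, here is a simplified base-R sketch. It is not the WGCNA implementation: `filter_low_quality` is a hypothetical helper, and the 50% missing-value threshold is an assumption chosen for illustration.

```r
# Simplified stand-in for the kind of filtering described above (not WGCNA's code).
# Assumed threshold: drop a gene or sample when more than half of its values are missing.
filter_low_quality <- function(expr, max_missing = 0.5) {
  gene_missing <- colMeans(is.na(expr))
  gene_var <- apply(expr, 2, var, na.rm = TRUE)
  good_genes <- gene_missing <= max_missing & !is.na(gene_var) & gene_var > 0

  sample_missing <- rowMeans(is.na(expr[, good_genes, drop = FALSE]))
  good_samples <- sample_missing <= max_missing

  list(good_genes = good_genes, good_samples = good_samples,
       all_ok = all(good_genes) && all(good_samples))
}

# Samples in rows, genes in columns (the same orientation as `gemt`).
expr <- data.frame(
  g1 = c(1.0, 2.0, 3.0, 4.0),
  g2 = c(5, 5, 5, 5),          # zero variance
  g3 = c(NA, NA, NA, 2.0),     # mostly missing
  g4 = c(0.5, 1.5, NA, 3.5),
  row.names = paste0("S", 1:4)
)
res <- filter_low_quality(expr)
filtered <- expr[res$good_samples, res$good_genes, drop = FALSE]
print(colnames(filtered))
```

In this toy input the zero-variance gene and the mostly missing gene are dropped, while all four samples are kept.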


Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If a height was not specified to trim outlier samples, the `cutreeDynamic` function is used to find outliers automatically, and they are then removed. If you do not approve of the automatic selection, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.

```{r fig.align='center'}
sampleTree = hclust(dist(gemt), method = "average")

plotSampleDendro <- function() {
  plot(sampleTree, main = "Sample Clustering Prior to Outlier Removal", sub="", xlab="",
       cex.axis = 1, cex.main = 1, cex = 0.5)
}
# Save the figure to a file, then draw it again for the report.
png('figures/01-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
plotSampleDendro()
invisible(dev.off())
plotSampleDendro()
```
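
Outlier removal by cutting the sample dendrogram at a fixed height can be illustrated with base R alone. This sketch uses `cutree` as a stand-in for the WGCNA tree-cutting functions used by this report, and the toy data and the cut height of 10 are assumptions chosen so that a single sample stands apart.

```r
# Six toy samples in rows; S6 lies far from the rest (illustrative values).
expr <- rbind(S1 = c(0, 0), S2 = c(1, 0), S3 = c(0, 1),
              S4 = c(1, 1), S5 = c(0.5, 0.5), S6 = c(50, 50))

tree <- hclust(dist(expr), method = "average")

# Cut at an assumed height of 10 and keep the largest resulting cluster.
clusters <- cutree(tree, h = 10)
largest <- as.integer(names(which.max(table(clusters))))
keep <- clusters == largest

expr_filtered <- expr[keep, , drop = FALSE]
rownames(expr)[!keep]   # the sample treated as an outlier
```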

```{r}
if (is.null(opt$height_cut)) {
  print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")
  clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)
  # Samples labeled 0 were not assigned to any cluster and are treated as outliers.
  keepSamples = (clust!=0)
  gemt = gemt[keepSamples, ]
} else {
  # paste0() builds a single string; print() does not concatenate its arguments.
  print(paste0("You specified a height for cutting of ", opt$height_cut, ". The cutreeStatic function was used."))
  clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)
  # Keep only the samples in cluster 1.
  keepSamples = (clust==1)
  gemt = gemt[keepSamples, ]
}
n_genes = ncol(gemt)
n_samples = nrow(gemt)
removed = length(which(keepSamples == FALSE))
if (removed == 1) {
  print(paste("A total of", removed, "sample was removed"))
} else {
  print(paste("A total of", removed, "samples were removed"))
}

# Write out the filtered GEM
write.csv(t(gemt), opt$filtered_GEM, quote=FALSE)
```
A file named `filtered_GEM.csv` has been created. This comma-separated file contains the original gene expression data with any low-quality genes or samples and any outlier samples removed. If nothing was removed, this file will contain the same data as the original.

```{r fig.align='center'}
sampleTree = hclust(dist(gemt), method = "average")

plotFilteredSampleDendro <- function() {
  plot(sampleTree, main = "Sample Clustering After Outlier Removal", sub="", xlab="",
       cex.axis = 1, cex.main = 1, cex = 0.5)
}
png('figures/02-filtered-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
plotFilteredSampleDendro()
invisible(dev.off())
plotFilteredSampleDendro()
```

# Network Module Discovery

The first step in network module discovery is calculating the similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. The WGCNA authors suggest that raising the correlation values to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to use is initially unknown. It is determined using the `pickSoftThreshold` function of WGCNA, which iterates through a series of power values (usually between 1 and 20) and tests how well the resulting network approximates scale-free behavior. The following table shows the results of those tests. The table headers have the following meanings:

- Power: The power tested.
- SFT.R.sq: The scale-free index, or the R-squared value of the underlying regression model. It indicates how well the power-raised data approximates scale-free behavior. The higher the value, the more scale-free.
- slope: The slope of the regression line used to calculate SFT.R.sq.
- truncated.R.sq: The adjusted R-squared measure from the truncated exponential model used to calculate SFT.R.sq.
- mean.k: The mean degree. Degree is a measure of how connected a gene is to every other gene. The higher the number, the more connected.
- median.k: The median degree.
- max.k: The largest degree.

```{r}
# Test powers 1 through 10, then the even powers from 12 to 20.
powers = c(1:10, seq(12, 20, 2))
sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)
```
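
The columns of that table can be approximated with a few lines of base R. The sketch below is a simplified illustration of the idea (raise the absolute correlations to a power, compute each gene's degree, then regress the log frequency of degrees on the log degree); it is not the code `pickSoftThreshold` runs, and the simulated data, the bin count and the `scale_free_fit` helper are assumptions made for this example.

```r
set.seed(1)
expr <- matrix(rnorm(20 * 200), nrow = 20)   # 20 samples x 200 genes (simulated)

scale_free_fit <- function(expr, power, n_bins = 10) {
  # Soft-thresholded adjacency and per-gene degree (connectivity).
  adj <- abs(cor(expr))^power
  diag(adj) <- 0
  k <- rowSums(adj)

  # Bin the degrees, then regress log10 frequency on log10 mean degree per bin.
  bins <- cut(k, breaks = n_bins)
  p_k <- as.numeric(table(bins)) / length(k)
  k_bin <- as.numeric(tapply(k, bins, mean))
  used <- !is.na(k_bin)                      # skip empty bins

  fit <- lm(log10(p_k[used]) ~ log10(k_bin[used]))
  c(Power = power,
    SFT.R.sq = summary(fit)$r.squared,
    slope = unname(coef(fit)[2]),
    mean.k = mean(k), median.k = median(k), max.k = max(k))
}

fit_table <- as.data.frame(t(sapply(1:6, function(p) scale_free_fit(expr, p))))
print(round(fit_table, 3))
```

Because every absolute correlation is at most 1, the mean degree can only shrink as the power grows, which is the trade-off the power selection has to balance against the fit.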

The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power for the network is the value beyond which both the scale-free index and the mean connectivity show diminishing change.

```{r fig.align='center'}
th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]

plotPower <- function() {
  # Set the two-panel layout here so that it applies to whichever device is
  # open, including the PNG file.
  par(mfrow=c(1,2))

  # Scale-free topology fit index as a function of the soft-thresholding power.
  plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       xlab="Soft Threshold (power)",
       ylab="Scale Free Topology Model Fit, signed R^2", type="n",
       main = paste("Scale Independence"), cex.lab = 0.5)
  text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       labels=powers,cex=0.5,col="red")
  #abline(h=th, col="blue")

  # Mean connectivity as a function of the soft-thresholding power.
  plot(sft$fitIndices[,1], sft$fitIndices[,5],
       xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
       main = paste("Mean Connectivity"), cex.lab = 0.5)
  text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
  #abline(h=th, col="blue")
  par(mfrow=c(1,1))
}

png('figures/03-power_thresholding.png', width=6 ,height=5, units="in", res=300)
plotPower()
invisible(dev.off())
plotPower()
```
Using the values in the table above, WGCNA estimates the ideal power. This selection is indicated in the following cell. If you believe that the power was incorrectly chosen, you can re-run this tool with the same input files and provide the desired power.

```{r}
print("WGCNA predicted the following power:")
print(sft$powerEstimate)
power = sft$powerEstimate
if (!is.null(opt$power)) {
  print("However, you selected to override this by providing a power of:")
  print(opt$power)
  power = opt$power
}
```
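
As a rough illustration of how a power can be picked from such a table, the sketch below chooses the lowest power whose signed R-squared exceeds a cutoff. The table values are made up, `pick_power` is a hypothetical helper, and the 0.85 cutoff is an assumption based on my reading of the `pickSoftThreshold` documentation (its `RsquaredCut` argument); check the WGCNA documentation before relying on it.

```r
# Illustrative fit table; the numbers are made up for this sketch.
fit <- data.frame(
  Power    = c(1, 2, 3, 4, 5, 6),
  SFT.R.sq = c(0.10, 0.45, 0.72, 0.86, 0.91, 0.93),
  slope    = c(0.50, -0.40, -0.90, -1.20, -1.40, -1.50)
)

pick_power <- function(fit, r_squared_cut = 0.85) {
  # Signed R^2: positive only when the regression slope is negative.
  signed_r2 <- -sign(fit$slope) * fit$SFT.R.sq
  passing <- fit$Power[signed_r2 > r_squared_cut]
  if (length(passing) == 0) NA else min(passing)
}

pick_power(fit)                        # lowest power passing the assumed cutoff
pick_power(fit, r_squared_cut = 0.95)  # NA when no tested power passes
```

When no tested power passes the cutoff the estimate is missing, and in that situation a power has to be supplied by hand when re-running the tool.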

Now that a power has been identified, modules can be discovered. Here, the `blockwiseModules` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks depends on the block size you provided.

```{r}
net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
                       TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       verbose = 1, saveTOMs = TRUE,
                       saveTOMFileBase = "TOM")
blocks = sort(unique(net$blocks))
```
The following table shows the list of modules that were discovered and their size (i.e., number of genes).

```{r}
module_labels = labels2colors(net$colors)
module_labels = paste("ME", module_labels, sep="")
# Map each color label to its numeric module label.
module_labels2num = unique(data.frame(label = module_labels, num = net$colors, row.names=NULL))
rownames(module_labels2num) = paste0('ME', module_labels2num$num)
modules = unique(as.data.frame(table(module_labels)))
# Count the rows (one per module); length() would count the data frame's columns.
n_modules = nrow(modules) - 1
module_size_upper = modules[2]
module_size_lower = modules[length(modules)]
colnames(modules) = c('Module', 'Module Size')
#datatable(modules)
modules
```

Modules consist of a set of genes that have highly similar expression patterns. Therefore, the similarity of genes within a module can be summarized using an "eigengene" vector. This vector is analogous to the first principal component in a principal component analysis (PCA). Once each module's eigengene is calculated, the eigengenes can be compared and displayed in a dendrogram to identify which modules are most similar to each other. This is visible in the following plot.

```{r fig.align='center'}
MEs = net$MEs
# Rename the eigengene columns from numeric labels to color labels.
colnames(MEs) = module_labels2num[colnames(MEs),]$label

png('figures/04-module_dendrogram.png', width=6 ,height=5, units="in", res=300)
plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
invisible(dev.off())
plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
```
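
The eigengene idea can be sketched with base R on simulated data. This is only an illustration of "first principal component of a module's expression": the data, noise level and variable names are assumptions, and the values WGCNA reports may differ in sign and scaling.

```r
set.seed(42)
n_samples <- 12
signal <- rnorm(n_samples)                 # shared expression pattern

# 15 simulated genes that follow the shared pattern plus a little noise.
module_expr <- sapply(1:15, function(i) signal + rnorm(n_samples, sd = 0.3))

# First principal component of the scaled module expression (samples in rows).
pca <- prcomp(module_expr, center = TRUE, scale. = TRUE)
eigengene <- pca$x[, 1]

# Fraction of the module's variance captured by the eigengene.
var_explained <- pca$sdev[1]^2 / sum(pca$sdev^2)

abs(cor(eigengene, signal))   # how closely the eigengene tracks the shared pattern
```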

Alternatively, we can use a heatmap to explore the similarity of the modules.
```{r fig.align='center'}
plotModuleHeatmap <- function() {
  plotEigengeneNetworks(MEs, "Module Eigengene Heatmap",
                        marHeatmap = c(2, 3, 2, 2),
                        plotDendrograms = FALSE)
}
png('figures/05-module_eigengene_heatmap.png', width=4 ,height=4, units="in", res=300)
plotModuleHeatmap()
invisible(dev.off())
plotModuleHeatmap()
```
66ef158fa85c Uploaded spficklin parents: diff changeset	283
66ef158fa85c Uploaded spficklin parents: diff changeset	284 We can examine gene similarity within the context of our modules. The following dendrogram clusters genes by their similarity of expression and the modules to which each gene belongs is shown under the graph. When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicate that genes in modules tend to have similar expression.
66ef158fa85c Uploaded spficklin parents: diff changeset	285
66ef158fa85c Uploaded spficklin parents: diff changeset	286 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	287 # Plot the dendrogram and the module colors underneath
66ef158fa85c Uploaded spficklin parents: diff changeset	288 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	289 options(repr.plot.width=15, repr.plot.height=10)
66ef158fa85c Uploaded spficklin parents: diff changeset	290 colors = module_labels[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	291 colors = sub('ME','', colors)
66ef158fa85c Uploaded spficklin parents: diff changeset	292 plotClusterDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	293 plotDendroAndColors(net$dendrograms[[i]], colors,
66ef158fa85c Uploaded spficklin parents: diff changeset	294 "Module colors", dendroLabels = FALSE, hang = 0.03,
66ef158fa85c Uploaded spficklin parents: diff changeset	295 addGuide = TRUE, guideHang = 0.05,
66ef158fa85c Uploaded spficklin parents: diff changeset	296 main=paste('Cluster Dendgrogram, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	297 }
66ef158fa85c Uploaded spficklin parents: diff changeset	298 png(paste0('figures/06-cluster_dendrogram_block_', i, '.png'), width=6 ,height=4, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	299 plotClusterDendro();
66ef158fa85c Uploaded spficklin parents: diff changeset	300 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	301 plotClusterDendro();
66ef158fa85c Uploaded spficklin parents: diff changeset	302 }
66ef158fa85c Uploaded spficklin parents: diff changeset	303
66ef158fa85c Uploaded spficklin parents: diff changeset	304 ```
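
As a rough illustration of why co-expressed genes show up as contiguous color blocks, the following sketch clusters a small made-up expression matrix. It is not the method used by this tool: plain `hclust` and `cutree` from base R stand in for WGCNA's module detection, and the matrix `expr`, the gene names and the choice of two clusters are all assumptions made for the example.

```r
# Toy expression data: 20 samples (rows) x 6 genes (columns). Genes a1-a3
# follow one shared profile and genes b1-b3 follow another (made-up data).
set.seed(42)
n_samples <- 20
base_a <- rnorm(n_samples)
base_b <- rnorm(n_samples)
expr <- cbind(
  a1 = base_a + rnorm(n_samples, sd = 0.1),
  a2 = base_a + rnorm(n_samples, sd = 0.1),
  a3 = base_a + rnorm(n_samples, sd = 0.1),
  b1 = base_b + rnorm(n_samples, sd = 0.1),
  b2 = base_b + rnorm(n_samples, sd = 0.1),
  b3 = base_b + rnorm(n_samples, sd = 0.1)
)

# Cluster genes on a correlation-based dissimilarity.
gene_tree <- hclust(as.dist(1 - abs(cor(expr))), method = "average")

# Stand-in for module assignment: cut the tree into two groups.
toy_labels <- cutree(gene_tree, k = 2)

# Labels listed in dendrogram order: genes with the same label sit next to
# each other, which is what appears as a block of color under the tree.
toy_labels[gene_tree$order]
```

If modules did not reflect similarity of expression, the labels in dendrogram order would be interleaved rather than grouped.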
66ef158fa85c Uploaded spficklin parents: diff changeset	305
66ef158fa85c Uploaded spficklin parents: diff changeset	306 The network is housed in an n x n similarity matrix known as the Topological Overlap Matrix (TOM), where n is the number of genes and the value in each cell measures similarity in terms of both correlation of expression and interconnectedness. The following heat maps show the TOM. Note that the dendrograms in the TOM heat map may differ from what is shown above. This is because only a subset of genes was selected to draw the heat maps in order to save on computational time.
66ef158fa85c Uploaded spficklin parents: diff changeset	307
66ef158fa85c Uploaded spficklin parents: diff changeset	308 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	309 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	310 # Load the TOM from a file.
66ef158fa85c Uploaded spficklin parents: diff changeset	311 load(net$TOMFiles[i])
66ef158fa85c Uploaded spficklin parents: diff changeset	312 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded spficklin parents: diff changeset	313 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	314 dissTOM = 1-TOM
66ef158fa85c Uploaded spficklin parents: diff changeset	315
66ef158fa85c Uploaded spficklin parents: diff changeset	316 # For reproducibility, we set the random seed
66ef158fa85c Uploaded spficklin parents: diff changeset	317 set.seed(10);
66ef158fa85c Uploaded spficklin parents: diff changeset	318 select = sample(dim(TOM)[1], size = min(1000, dim(TOM)[1])); # never request more genes than the block holds
66ef158fa85c Uploaded spficklin parents: diff changeset	319 selectColors = module_labels[net$blockGenes[[i]][select]]
66ef158fa85c Uploaded spficklin parents: diff changeset	320 selectTOM = dissTOM[select, select];
66ef158fa85c Uploaded spficklin parents: diff changeset	321
66ef158fa85c Uploaded spficklin parents: diff changeset	322 # There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
66ef158fa85c Uploaded spficklin parents: diff changeset	323 selectTree = hclust(as.dist(selectTOM), method = "average")
66ef158fa85c Uploaded spficklin parents: diff changeset	324
66ef158fa85c Uploaded spficklin parents: diff changeset	325 # Taking the dissimilarity to a power (7 is used here) makes the plot more informative by effectively changing
66ef158fa85c Uploaded spficklin parents: diff changeset	326 # the color palette; setting the diagonal to NA also improves the clarity of the plot
66ef158fa85c Uploaded spficklin parents: diff changeset	327 plotDiss = selectTOM^7;
66ef158fa85c Uploaded spficklin parents: diff changeset	328 diag(plotDiss) = NA;
66ef158fa85c Uploaded spficklin parents: diff changeset	329 colors = sub('ME','', selectColors)
66ef158fa85c Uploaded spficklin parents: diff changeset	330
66ef158fa85c Uploaded spficklin parents: diff changeset	331 png(paste0('figures/06-TOM_heatmap_block_', i, '.png'), width=6 ,height=6, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	332 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	333 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	334 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	335 }
66ef158fa85c Uploaded spficklin parents: diff changeset	336 ```
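
To make "correlation of expression and interconnectedness" more concrete, the following sketch computes a topological overlap matrix by hand for a tiny made-up dataset. It uses the unsigned form of the topological overlap measure; whether this tool's run used that exact variant depends on WGCNA settings that are not shown in this report, so treat it as an illustration only. The matrix `expr`, the power `beta` and the `toy_` names are assumptions made for the example.

```r
# Toy data: 6 samples (rows) x 4 genes (columns); values are arbitrary.
set.seed(1)
expr <- matrix(rnorm(24), nrow = 6,
               dimnames = list(NULL, paste0("gene", 1:4)))

# Unsigned weighted adjacency: |correlation| raised to a power.
beta <- 6  # hypothetical power, chosen only for this example
adj <- abs(cor(expr))^beta
diag(adj) <- 0  # self-connections are excluded from the overlap terms

# Shared-neighbor term: l_ij = sum over u of a_iu * a_uj.
shared <- adj %*% adj
# Connectivity: k_i = sum over u of a_iu.
k <- rowSums(adj)

# Unsigned topological overlap:
#   TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij)
toy_TOM <- (shared + adj) / (outer(k, k, pmin) + 1 - adj)
diag(toy_TOM) <- 1

# Dissimilarity, analogous to dissTOM in the chunk above.
toy_dissTOM <- 1 - toy_TOM
round(toy_TOM, 3)
```

Two genes receive a high overlap when they are strongly connected to each other and also share strongly connected neighbors, which is why the TOM captures more than pairwise correlation alone.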
66ef158fa85c Uploaded spficklin parents: diff changeset	337
66ef158fa85c Uploaded spficklin parents: diff changeset	338 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	339 output = cbind(colnames(gemt), module_labels)
66ef158fa85c Uploaded spficklin parents: diff changeset	340 colnames(output) = c('Gene', 'Module')
66ef158fa85c Uploaded spficklin parents: diff changeset	341 write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	342 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	343
66ef158fa85c Uploaded spficklin parents: diff changeset	344 A file named `gene_module_file.csv` has been generated. It lists each gene and the module to which it belongs.
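
If you want to explore the gene-to-module assignments outside of this report, something along these lines may help. It is only a sketch: it writes a small stand-in file with the same `Gene` and `Module` columns to a temporary location, and the gene and module names in it are made up. Point `read.csv` at your own copy of the file instead.

```r
# Stand-in for the gene/module file (made-up genes and module names).
toy_modules <- data.frame(
  Gene = paste0("gene", 1:6),
  Module = c("blue", "blue", "turquoise", "turquoise", "turquoise", "grey")
)
module_path <- tempfile(fileext = ".csv")  # hypothetical path for the example
write.csv(toy_modules, file = module_path, quote = FALSE, row.names = FALSE)

# Read the file back, then count genes per module and list them by module.
modules_in <- read.csv(module_path, stringsAsFactors = FALSE)
module_sizes <- table(modules_in$Module)
genes_by_module <- split(modules_in$Gene, modules_in$Module)

module_sizes
genes_by_module[["turquoise"]]
```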
66ef158fa85c Uploaded spficklin parents: diff changeset	345
66ef158fa85c Uploaded spficklin parents: diff changeset	346 The TOM serves as both a similarity matrix and an adjacency matrix. An adjacency matrix is typically identical to a similarity matrix, but with values above a set threshold set to 1 and values below it set to 0. This is known as hard thresholding. However, WGCNA does not set values above a threshold to 1 but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding" because the pairwise weights of all genes contribute to the discovery of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
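
The difference between the two kinds of adjacency matrix can be seen on a very small example. The similarity values and the cutoff below are made up for illustration and are not taken from your data.

```r
# Toy similarity matrix for three genes (arbitrary values).
genes <- c("g1", "g2", "g3")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.4,
                0.2, 0.4, 1.0),
              nrow = 3, byrow = TRUE,
              dimnames = list(genes, genes))

# Hard thresholding: values above the cutoff become 1, all others become 0.
cutoff <- 0.5  # hypothetical cutoff
hard_adj <- (sim > cutoff) * 1

# Weighted adjacency: every pairwise value is kept as it is.
weighted_adj <- sim

hard_adj["g1", "g3"]      # the weak pair is dropped entirely
weighted_adj["g1", "g3"]  # the weak pair keeps its (small) weight
```

In the hard-thresholded matrix the weak g1/g3 relationship disappears, while in the weighted matrix it still contributes, only less than the strong g1/g2 relationship.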
66ef158fa85c Uploaded spficklin parents: diff changeset	347
66ef158fa85c Uploaded spficklin parents: diff changeset	348 Unfortunately, this "soft thresholding" approach can make it difficult to create a graph representation of the network. If we exported the TOM as a graph, the result would be a complete graph (every gene connected to every other gene) and would be difficult to interpret. Therefore, we must still set a hard threshold if we want to visualize connectivity in graph form. A threshold that is too high can exclude genes from the graph, and a threshold that is too low can result in too many false edges in the graph.
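
The trade-off can be sketched by counting edges and disconnected genes at several cutoffs. The weight matrix `w`, the cutoffs and the helper `summarize_cutoff` are hypothetical and exist only for this example; they are not part of WGCNA or of this tool.

```r
# Toy weighted network for four genes (arbitrary weights between 0 and 1).
genes <- c("g1", "g2", "g3", "g4")
w <- matrix(0, 4, 4, dimnames = list(genes, genes))
w["g1", "g2"] <- 0.80
w["g1", "g3"] <- 0.30
w["g2", "g3"] <- 0.25
w["g3", "g4"] <- 0.05
w <- w + t(w)  # make the matrix symmetric

# Count the edges kept at a cutoff and the genes left without any edge.
summarize_cutoff <- function(w, cutoff) {
  keep <- w > cutoff & upper.tri(w)  # count each gene pair once
  idx <- which(keep, arr.ind = TRUE)
  connected <- unique(c(rownames(w)[idx[, "row"]], colnames(w)[idx[, "col"]]))
  c(cutoff = cutoff, edges = sum(keep),
    isolated_genes = nrow(w) - length(connected))
}

cutoff_table <- t(sapply(c(0.01, 0.2, 0.5), summarize_cutoff, w = w))
cutoff_table
```

As the cutoff rises, fewer edges survive and more genes drop out of the graph altogether, so a reasonable threshold is usually found by trying a few values and inspecting the result.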
66ef158fa85c Uploaded spficklin parents: diff changeset	349
66ef158fa85c Uploaded spficklin parents: diff changeset	350 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	351 edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
66ef158fa85c Uploaded spficklin parents: diff changeset	352 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	353 # Load the TOM from a file.
66ef158fa85c Uploaded spficklin parents: diff changeset	354 load(net$TOMFiles[i])
66ef158fa85c Uploaded spficklin parents: diff changeset	355 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded spficklin parents: diff changeset	356 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	357 colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	358 row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	359
66ef158fa85c Uploaded spficklin parents: diff changeset	360 cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
66ef158fa85c Uploaded spficklin parents: diff changeset	361 edges = rbind(edges, cydata$edgeData)
66ef158fa85c Uploaded spficklin parents: diff changeset	362 }
66ef158fa85c Uploaded spficklin parents: diff changeset	363
66ef158fa85c Uploaded spficklin parents: diff changeset	364 edges$Interaction = 'co'
66ef158fa85c Uploaded spficklin parents: diff changeset	365 output = edges[,c('fromNode','toNode','Interaction', 'weight')]
66ef158fa85c Uploaded spficklin parents: diff changeset	366 colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
66ef158fa85c Uploaded spficklin parents: diff changeset	367 write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
66ef158fa85c Uploaded spficklin parents: diff changeset	368 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	369
66ef158fa85c Uploaded spficklin parents: diff changeset	370 Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.
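
Before importing the edge list into Cytoscape, it can be useful to take a quick look at it in R. The following sketch assumes a tab-delimited file with the `Source`, `Target`, `Interaction` and `Weight` columns written above; it creates a small stand-in file with made-up genes in a temporary location, so replace `edge_path` with the location of your own file.

```r
# Stand-in for the edge file: same columns and tab separator as the real one.
toy_edges <- data.frame(
  Source = c("geneA", "geneA", "geneB"),
  Target = c("geneB", "geneC", "geneC"),
  Interaction = "co",
  Weight = c(0.42, 0.31, 0.27)
)
edge_path <- tempfile(fileext = ".txt")  # hypothetical path for the example
write.table(toy_edges, file = edge_path, quote = FALSE,
            row.names = FALSE, sep = "\t")

# Read the file back and count how many edges touch each gene (its degree).
edges_in <- read.table(edge_path, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
degree <- sort(table(c(edges_in$Source, edges_in$Target)), decreasing = TRUE)
degree
```

Genes with a high degree are candidates for hub genes, and an edge count that is very large or very small is a hint that the hard threshold may need adjusting.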
66ef158fa85c Uploaded spficklin parents: diff changeset	371
66ef158fa85c Uploaded spficklin parents: diff changeset	372 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	373 # Save this image for the next step, which is optional and depends on
66ef158fa85c Uploaded spficklin parents: diff changeset	374 # whether the user provides a trait file.
66ef158fa85c Uploaded spficklin parents: diff changeset	375 save.image(file=opt$r_data)
66ef158fa85c Uploaded spficklin parents: diff changeset	376 ```

0

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

1 ---

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

2 title: 'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis'

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

3 output:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

4 pdf_document:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

5 number_sections: false

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

6 ---

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

7

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

8 ```{r setup, include=FALSE, warning=FALSE, message=FALSE}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

9 knitr::opts_chunk$set(error = FALSE, echo = FALSE)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

10 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

11

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

12 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

13 # Make a directory for saving the figures.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

14 dir.create('figures', showWarnings = FALSE)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

15 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

16

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

17 # Introduction

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

18 This report contains step-by-step results from use of the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis [WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) tool. This tool wraps the WGCNA R package into a ready-to-use Rmarkdown file. It performs module discovery and network construction using a dataset and optional trait data matrix provided.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

19

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

20 If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

21

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

22 This report was generated on:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

23 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

24 format(Sys.time(), "%a %b %d %X %Y")

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

25 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

26

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

27

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

28 ## About the Input Data

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

29 ### Gene Expression Matrix (GEM)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

30 The gene expression data is an *n* x *m* matrix where *n* rows are the genes, *m* columns are the samples and the elements represent gene expression levels (derived either from Microarray or RNA-Seq). The matrix was provided in a file meething these rules:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

31 - Housed in a comma-separated (CSV) file.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

32 - The rows represent the gene expression levels

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

33 - The first column of each row is the gene, transcript or probe name.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

34 - The header contains only the sample names and therefore is one value less than the remaining rows of the file.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

35

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

36 ### Trait/Phenotype Matrix

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

37 The trait/phenotype data is an *n* x *m* matrix where *n* is the samples and *m* are the features such as experimental condition, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

38

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

39 ## Parameters provided by the user.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

40 The following describes the input arguments provided to this tool:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

41 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

42

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

43 if (!is.null(opt$height_cut)) {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

44 print('The cut height for outlier removal of the sample dendrogram:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

45 print(opt$cut_height)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

46 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

47

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

48 if (!is.null(opt$power)) {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

49 print('The power to which the gene expression data is raised:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

50 print(opt$power)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

51 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

52 print('The minimal size for a module:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

53 print(opt$min_cluster_size)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

54

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

55 print('The block size for dividing the GEM to reduce memory requirements:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

56 print(opt$block_size)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

57

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

58 print('The hard threshold when generating the graph file:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

59 print(opt$hard_threshold)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

60

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

61 print('The character string used to identify missing values in the GEM:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

62 print(opt$missing_value1)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

63

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

64 if (!is.null(opt$trait_data)) {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

65 print('The column in the trait data that contains the sample name:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

66 print(opt$sname_col)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

67

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

68 print('The character string used to identify missing values in the trait data:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

69 print(opt$missing_value2)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

70

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

71 print('Columns in the trait data that should be treated as categorical:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

72 print(opt$one_hot_cols)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

73

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

74 print('Columns in the trait data that should be ignored:')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

75 print(opt$ignore_cols)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

76 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

77 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

78

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

79 ## If Errors Occur

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

80 Please note, that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, results should be thrown out. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

81

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

82 - Do the formats for the input datasets match the requirements listed above.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

83 - Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e you don't have more than one such as 0.0 and 0, or NA and 0.0)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

84 - If trait data was provided, check that the column specified for the sample name is correct.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

85 - The block size should not exceed 10,000 and should not be lower than 1,000.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

86 - Ensure that the sample names and all headers in the trait/phenotype data only contain alpha-numeric and underscore characters.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

87

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

88

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

89 # Expression Data

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

90

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

91 The content below shows the first 10 rows and 6 columns of the Gene Expression Matrix (GEM) file that was provided.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

92

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

93 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

94 gem = read.csv(opt$expression_data, header = TRUE, row.names = 1, na.strings = opt$missing_value1)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

95 #table_data = head(gem, 100)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

96 #datatable(table_data)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

97 gem[1:10,1:6]

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

98 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

99

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

100 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

101 gemt = as.data.frame(t(gem))

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

102 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

103

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

104 The next step is to check the data for low quality samples or genes. These have too many missing values or consist of genes with zero-variance. Samples and genes are removed if they are low quality. The `goodSamplesGenes` function of WGCNA is used to identify such cases. The following cell indicates if WGCNA identified any low quality genes or samples, and these were removed.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

105

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

106

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

107 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

108 gsg = goodSamplesGenes(gemt, verbose = 3)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

109

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

110 if (!gsg$allOK) {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

111 gemt = gemt[gsg$goodSamples, gsg$goodGenes]

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

112 } else {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

113 print('all genes are OK!')

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

114 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

115 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

116

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

117

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

118 Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If a height was not specified to trim outlier samples, then the `cutreeDynamic` function is used to automatically find outliers, and then they are removed. If you do not approve of the automatically detected height, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

119

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

120 ```{r fig.align='center'}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

121 sampleTree = hclust(dist(gemt), method = "average");

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

122

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

123 plotSampleDendro <- function() {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

124 plot(sampleTree, main = "Sample Clustering Prior to Outlier Removal", sub="", xlab="",

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

125 cex.axis = 1, cex.main = 1, cex = 0.5)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

126 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

127 png('figures/01-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

128 plotSampleDendro()

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

129 invisible(dev.off())

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

130 plotSampleDendro()

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

131 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

132

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

133 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

134 if (is.null(opt$height_cut)) {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

135 print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

136 clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

137 keepSamples = (clust!=0)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

138 gemt = gemt[keepSamples, ]

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

139 } else {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

140 print("You specified a height for cutting of", opt$height_cut, ". The cutreeStatic function was used.")

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

141 clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

142 keepSamples = (clust==1)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

143 gemt = gemt[keepSamples, ]

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

144 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

145 n_genes = ncol(gemt)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

146 n_samples = nrow(gemt)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

147 removed = length(which(keepSamples == FALSE))

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

148 if (removed == 1) {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

149 print(paste("A total of", removed, "sample was removed"))

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

150 } else {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

151 print(paste("A total of", removed, "samples were removed"))

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

152 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

153

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

154 # Write out the filtered GEM

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

155 write.csv(t(gemt), opt$filtered_GEM, quote=FALSE)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

156 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

157 A file named `filtered_GEM.csv` has been created. This file is a comma-separated file containing the original gene expression data but with outlier samples removed. If no outliers were detected this file will be identical to the original.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

158

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

159 ```{r fig.align='center'}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

160 sampleTree = hclust(dist(gemt), method = "average");

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

161

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

162 plotFilteredSampleDendro <- function() {

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

163 plot(sampleTree, main = "Sample Clustering After Outlier Removal", sub="", xlab="",

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

164 cex.axis = 1, cex.main = 1, cex = 0.5)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

165 }

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

166 png('figures/02-filtered-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

167 plotFilteredSampleDendro()

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

168 invisible(dev.off())

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

169 plotFilteredSampleDendro()

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

170 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

171

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

172 # Network Module Discovery

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

173

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

174 The first step in network module discovery is calculating similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. However, the WGCNA authors suggest that raising the GEM to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to which the data should be raised is initially unknown. This is determined using the `pickSoftThreshold` function of WGCNA which iterates through a series of power values (usually between 1 to 20) and tests how well the data approximates scale-free behavior. The following table shows the results of those tests. The meaning of the table headers are:

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

175

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

176 - Power: The power tested

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

177 - SFT.R.sq: This is the scale free index, or the R.squared value of the undelrying regression model. It indicates how well the power-raised data appears scale free. The higher the value the more scale-free.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

178 - slope: The slope of the regression line used to calculate SFT.R.sq

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

179 - trunacted.R.sq: The adjusted R.squared measure from the truncated exponential model used to calculate SFT.R.sq

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

180 - mean.k: The mean degree (degree is a measure of how connected a gene is to every other gene. The higher the number the more connected.)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

181 - median.k: The median degree

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

182 - max.k: The largest degree.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

183

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

184 ```{r}

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

185 powers = c(1:10, seq(12, 20, 2))

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

186 sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

187 ```

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

188

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

189 The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power value for the network should be the value where there is a diminishing change in both the scale-free index and mean connectivity.

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

190

66ef158fa85c Uploaded

spficklin

parents:

diff changeset

```{r fig.align='center'}
th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]

plotPower <- function() {
  # Lay out the two plots side by side. par() applies to the active graphics
  # device only, so it is set here to cover both the PNG and the report.
  par(mfrow=c(1,2))

  # Scale-free topology fit index as a function of the soft-thresholding power.
  plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       xlab="Soft Threshold (power)",
       ylab="Scale Free Topology Model Fit, signed R^2", type="n",
       main = paste("Scale Independence"), cex.lab = 0.5);
  text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
       labels=powers,cex=0.5,col="red");
  #abline(h=th, col="blue")

  # Mean connectivity as a function of the soft-thresholding power.
  plot(sft$fitIndices[,1], sft$fitIndices[,5],
       xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
       main = paste("Mean Connectivity"), cex.lab = 0.5)
  text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
  #abline(h=th, col="blue")
  par(mfrow=c(1,1))
}

png('figures/03-power_thresholding.png', width=6 ,height=5, units="in", res=300)
plotPower()
invisible(dev.off())
plotPower()
```
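To make the trade-off concrete, the following minimal sketch recomputes mean connectivity for a range of powers using only base R. It is not the WGCNA implementation: the simulated matrix `expr`, the helper `mean_connectivity`, and the vector `candidate_powers` are hypothetical names introduced for illustration, and the network is assumed to be unsigned.

```r
# Simulated expression matrix: 30 samples (rows) x 50 genes (columns).
set.seed(1)
expr <- matrix(rnorm(30 * 50), nrow = 30, ncol = 50)

# Hypothetical helper: mean connectivity of an unsigned network at one power.
mean_connectivity <- function(expr, power) {
  adjacency <- abs(cor(expr))^power  # soft-thresholded adjacency
  diag(adjacency) <- 0               # ignore self-connections
  mean(rowSums(adjacency))           # average connectivity per gene
}

candidate_powers <- c(1:10, seq(12, 20, by = 2))
connectivity <- sapply(candidate_powers,
                       function(p) mean_connectivity(expr, p))

# Mean connectivity drops quickly at low powers and flattens at high powers.
print(data.frame(power = candidate_powers, mean_connectivity = connectivity))
```

Because every off-diagonal correlation has an absolute value below 1, raising it to a higher power always lowers the connectivity. The useful powers are those where the curve has started to flatten.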

Using the values in the table above, WGCNA estimates the ideal power, which is reported in the following cell. If you believe that the power was chosen incorrectly, you can re-run this tool with the same input files and provide the desired power.

```{r}
print("WGCNA predicted the following power:")
print(sft$powerEstimate)
power = sft$powerEstimate
# Use the power supplied by the user, if any, instead of the WGCNA estimate.
# Assumes the override is passed as opt$soft_threshold_power, the name used
# in the messages below.
if (!is.null(opt$soft_threshold_power)) {
  print("However, you selected to override this by providing a power of:")
  print(opt$soft_threshold_power)
  power = opt$soft_threshold_power
}
```
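One case worth guarding against is an estimate that is missing. As far as we are aware, `pickSoftThreshold` reports `NA` as the power estimate when no candidate power reaches its R^2 cutoff. The sketch below shows one way to combine the estimate, a user override, and a fallback. The helper `resolve_power` and its `fallback` default of 6 are hypothetical and are not part of this tool or of WGCNA.

```r
# Hypothetical helper: choose the power to use for network construction.
#   estimate: value estimated by WGCNA (may be NA or NULL)
#   override: value supplied by the user, or NULL if none was given
#   fallback: placeholder used when neither of the above is usable
resolve_power <- function(estimate, override = NULL, fallback = 6) {
  if (!is.null(override)) {
    return(override)
  }
  if (is.null(estimate) || is.na(estimate)) {
    return(fallback)
  }
  estimate
}

resolve_power(8)                 # uses the estimate
resolve_power(8, override = 12)  # the user override wins
resolve_power(NA)                # falls back to the placeholder
```

The fallback value is only a placeholder. If the estimate is missing, inspecting the plots above and supplying a power explicitly is the safer choice.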

Now that a power has been identified, modules can be discovered. Here, the `blockwiseModules` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks depends on the block size you provided.

```{r}
net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
                       TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       verbose = 1, saveTOMs = TRUE,
                       saveTOMFileBase = "TOM")
blocks = sort(unique(net$blocks))
```

The following table shows the list of modules that were discovered and their size (i.e., number of genes).

```{r}
module_labels = labels2colors(net$colors)
module_labels = paste("ME", module_labels, sep="")
module_labels2num = unique(data.frame(label = module_labels, num = net$colors, row.names=NULL))
rownames(module_labels2num) = paste0('ME', module_labels2num$num)
modules = unique(as.data.frame(table(module_labels)))
# Summarize the modules, leaving out grey, the label WGCNA gives to genes that
# were not assigned to a module (assumed intent of the original "- 1").
assigned = modules[modules$module_labels != 'MEgrey', ]
n_modules = nrow(assigned)
module_size_upper = max(assigned$Freq)
module_size_lower = min(assigned$Freq)
colnames(modules) = c('Module', 'Module Size')
#datatable(modules)
modules
```

A module consists of a set of genes that have highly similar expression patterns. Therefore, the expression of the genes within a module can be summarized using an "eigengene" vector. This vector is analogous to the first principal component in a PCA. Once each module's eigengene is calculated, the eigengenes can be compared and displayed in a dendrogram to identify which modules are most similar to each other. This is visible in the following plot.

```{r fig.align='center'}
MEs = net$MEs
colnames(MEs) = module_labels2num[colnames(MEs),]$label

png('figures/04-module_dendrogram.png', width=6 ,height=5, units="in", res=300)
plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
# Hide the device name that dev.off() would otherwise print in the report.
invisible(dev.off())
plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
```
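To illustrate what an eigengene is, the following sketch computes one for a simulated module with base R only. It follows the general idea of the first principal component rather than reproducing WGCNA's own routine, and all names (`pattern`, `module_expr`, `eigengene`) are hypothetical.

```r
set.seed(2)
n_samples <- 20

# A shared expression pattern plus gene-specific noise for a 10-gene module.
# Rows are samples and columns are genes.
pattern <- rnorm(n_samples)
module_expr <- sapply(1:10, function(g) pattern + rnorm(n_samples, sd = 0.3))

# Standardize each gene, then take the first left singular vector.
scaled <- scale(module_expr)
eigengene <- svd(scaled)$u[, 1]

# The sign of a singular vector is arbitrary, so orient the eigengene
# along the average expression of the module.
if (cor(eigengene, rowMeans(scaled)) < 0) {
  eigengene <- -eigengene
}

# The eigengene matches the first principal component up to sign and scale.
pc1 <- prcomp(module_expr, scale. = TRUE)$x[, 1]
abs(cor(eigengene, pc1))
```

The result is a single value per sample, which is why eigengenes from different modules can be correlated and clustered directly.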

Alternatively, we can use a heatmap to explore the similarity between modules.

```{r fig.align='center'}
plotModuleHeatmap <- function() {
  plotEigengeneNetworks(MEs, "Module Eigengene Heatmap",
                        marHeatmap = c(2, 3, 2, 2),
                        plotDendrograms = FALSE)
}
png('figures/05-module_eigengene_heatmap.png', width=4 ,height=4, units="in", res=300)
plotModuleHeatmap()
invisible(dev.off())
plotModuleHeatmap()
```

We can examine gene similarity within the context of our modules. The following dendrogram clusters genes by their similarity of expression, and the module to which each gene belongs is shown under the graph. When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicates that genes in modules tend to have similar expression.

```{r}
# Plot the dendrogram and the module colors underneath
for (i in blocks) {
  options(repr.plot.width=15, repr.plot.height=10)
  colors = module_labels[net$blockGenes[[i]]]
  colors = sub('ME','', colors)
  plotClusterDendro <- function() {
    plotDendroAndColors(net$dendrograms[[i]], colors,
                        "Module colors", dendroLabels = FALSE, hang = 0.03,
                        addGuide = TRUE, guideHang = 0.05,
                        main=paste('Cluster Dendrogram, Block', i))
  }
  png(paste0('figures/06-cluster_dendrogram_block_', i, '.png'), width=6 ,height=4, units="in", res=300)
  plotClusterDendro();
  invisible(dev.off())
  plotClusterDendro();
}
```
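The link between the dendrogram and the blocks of color can be sketched with base R. The example below simulates two groups of co-expressed genes, clusters them, and cuts the tree into two groups. Note the substitutions: it clusters on correlation rather than the TOM, and it uses `cutree` with a fixed number of groups in place of the dynamic tree cutting that WGCNA performs. All variable names are hypothetical.

```r
set.seed(5)
n_samples <- 20

# Two independent expression patterns, each shared by five genes.
pattern_a <- rnorm(n_samples)
pattern_b <- rnorm(n_samples)
expr <- cbind(
  sapply(1:5, function(g) pattern_a + rnorm(n_samples, sd = 0.2)),
  sapply(1:5, function(g) pattern_b + rnorm(n_samples, sd = 0.2))
)
colnames(expr) <- paste0("gene", 1:10)

# Cluster genes on a correlation-based dissimilarity with average linkage.
dissimilarity <- as.dist(1 - abs(cor(expr)))
tree <- hclust(dissimilarity, method = "average")

# Cut the tree into two groups (a stand-in for dynamic tree cutting).
groups <- cutree(tree, k = 2)

# Group labels in dendrogram order: genes of one group sit next to each other,
# which is what produces contiguous blocks of color under the tree.
groups[tree$order]
```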

The network is housed in an *n* x *n* similarity matrix known as the Topological Overlap Matrix (TOM), where *n* is the number of genes and the value in each cell measures similarity in terms of correlation of expression and interconnectedness. The following heat maps show the TOM. Note that the dendrograms in the TOM heat maps may differ from those shown above. This is because a subset of genes was selected to draw the heat maps in order to save computational time.

```{r}
for (i in blocks) {
  # Load the TOM from a file.
  load(net$TOMFiles[i])
  TOM_size = length(which(net$blocks == i))
  TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
  dissTOM = 1-TOM

  # For reproducibility, we set the random seed
  set.seed(10);
  # Sample at most 1000 genes; a block may contain fewer than that.
  select = sample(dim(TOM)[1], size = min(1000, dim(TOM)[1]));
  selectColors = module_labels[net$blockGenes[[i]][select]]
  selectTOM = dissTOM[select, select];

  # There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
  selectTree = hclust(as.dist(selectTOM), method = "average")

  # Taking the dissimilarity to a power (here, 7) makes the plot more informative by effectively changing
  # the color palette; setting the diagonal to NA also improves the clarity of the plot
  plotDiss = selectTOM^7;
  diag(plotDiss) = NA;
  colors = sub('ME','', selectColors)

  png(paste0('figures/06-TOM_heatmap_block_', i, '.png'), width=6 ,height=6, units="in", res=300)
  TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
  # Hide the device name that dev.off() would otherwise print in the report.
  invisible(dev.off())
  TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
}
```
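For readers who want to see how correlation and interconnectedness combine, the sketch below computes an unsigned topological overlap matrix by hand on a small simulated dataset. It uses the commonly cited definition, in which the overlap of genes *i* and *j* is (shared neighbor weight + adjacency) / (smaller connectivity + 1 - adjacency). It is a base R illustration under that assumption, not the code path WGCNA uses, and the names `expr`, `power`, and `tom` are hypothetical.

```r
set.seed(3)
# Simulated expression matrix: 25 samples (rows) x 8 genes (columns).
expr <- matrix(rnorm(25 * 8), nrow = 25, ncol = 8)
power <- 6  # illustrative soft-thresholding power

# Soft-thresholded adjacency without self-connections.
adjacency <- abs(cor(expr))^power
diag(adjacency) <- 0

# shared[i, j] sums a_iu * a_uj over all other genes u: the weight of the
# neighbors that genes i and j have in common.
shared <- adjacency %*% adjacency

# Connectivity of each gene and the pairwise minimum of the two.
connectivity <- rowSums(adjacency)
min_connectivity <- outer(connectivity, connectivity, pmin)

# Topological overlap and the dissimilarity used for clustering.
tom <- (shared + adjacency) / (min_connectivity + 1 - adjacency)
diag(tom) <- 1
diss_tom <- 1 - tom

round(tom[1:3, 1:3], 4)
```

Two genes score highly when they are strongly correlated with each other and also share strongly connected neighbors, which is why the TOM tends to be less noisy than the raw correlation matrix.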

```{r}
output = cbind(colnames(gemt), module_labels)
colnames(output) = c('Gene', 'Module')
write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
```

A file named `gene_module_file.csv` has been generated, which contains the list of genes and the modules they belong to.

The TOM serves as both a similarity matrix and an adjacency matrix. An adjacency matrix is typically derived from a similarity matrix by setting values above a chosen threshold to 1 and values below it to 0. This is known as hard thresholding. WGCNA, however, does not set values above a threshold to 1 but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding" because the pairwise weights of all genes contribute to the discovery of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
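The difference between the two approaches is easiest to see on a tiny example. The sketch below applies both to the same hypothetical 3 x 3 similarity matrix. The cutoff of 0.8 and the power of 6 are arbitrary values chosen for illustration.

```r
genes <- c("g1", "g2", "g3")
similarity <- matrix(c(1.0, 0.9, 0.4,
                       0.9, 1.0, 0.2,
                       0.4, 0.2, 1.0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(genes, genes))

# Hard thresholding: every pair becomes either connected (1) or not (0).
hard <- (similarity >= 0.8) * 1

# Soft thresholding: every pair keeps a weight; weak pairs shrink toward 0
# but are never removed.
soft <- similarity^6

hard
round(soft, 4)
```

In the hard-thresholded matrix only the g1 and g2 pair survives. In the soft-thresholded matrix that pair keeps a weight of about 0.53, while the two weaker pairs remain with weights close to zero.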

Unfortunately, this "soft thresholding" approach can make it difficult to create a graph representation of the network. If we exported the TOM as a graph, the result would be a complete graph, which would be difficult to interpret. Therefore, we must still set a hard threshold if we want to visualize connectivity in graph form. A threshold that is too high can exclude genes from the graph, and a threshold that is too low can introduce too many false edges.
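The effect of the threshold on graph size can be explored with a small base R sketch. The helper `edges_above` is hypothetical and keeps pairs whose weight is strictly above the threshold, which is an assumption of this example rather than a statement about how WGCNA exports edges. The matrix `tom` is random data standing in for a real TOM.

```r
set.seed(4)
genes <- paste0("gene", 1:6)

# Random symmetric matrix standing in for a TOM.
tom <- matrix(runif(36), nrow = 6, dimnames = list(genes, genes))
tom <- (tom + t(tom)) / 2
diag(tom) <- 1

# Hypothetical helper: list each gene pair once if its weight exceeds the threshold.
edges_above <- function(tom, threshold) {
  keep <- which(upper.tri(tom) & tom > threshold, arr.ind = TRUE)
  data.frame(Source = rownames(tom)[keep[, "row"]],
             Target = colnames(tom)[keep[, "col"]],
             Weight = tom[keep],
             stringsAsFactors = FALSE)
}

# Count the edges and the genes that remain at several thresholds.
for (threshold in c(0.2, 0.5, 0.8)) {
  edges <- edges_above(tom, threshold)
  n_genes <- length(unique(c(edges$Source, edges$Target)))
  cat("threshold", threshold, ":", nrow(edges), "edges,",
      n_genes, "genes retained\n")
}
```

Running a comparison like this on a sample of the real TOM before choosing a threshold can show roughly where genes start to drop out of the graph.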

```{r}
edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
for (i in blocks) {
  # Load the TOM from a file.
  load(net$TOMFiles[i])
  TOM_size = length(which(net$blocks == i))
  TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
  colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
  row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]

  cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
  edges = rbind(edges, cydata$edgeData)
}

edges$Interaction = 'co'
output = edges[,c('fromNode','toNode','Interaction', 'weight')]
colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
```

Using the hard threshold parameter provided, a file named `network_edges.txt` has been generated, which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.

```{r}