aurora_wgcna: aurora_wgcna.Rmd annotate

annotate aurora_wgcna.Rmd @ 6:58b01fa2cc81 draft

Uploaded

author	spficklin
date	Thu, 05 Dec 2019 15:34:52 -0500
parents	b14e4bf568b0
children	96ba1a8fff06

rev	line source
0 66ef158fa85c Uploaded spficklin parents: diff changeset	1 ---
66ef158fa85c Uploaded spficklin parents: diff changeset	2 title: 'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis'
66ef158fa85c Uploaded spficklin parents: diff changeset	3 output:
66ef158fa85c Uploaded spficklin parents: diff changeset	4 pdf_document:
66ef158fa85c Uploaded spficklin parents: diff changeset	5 number_sections: false
66ef158fa85c Uploaded spficklin parents: diff changeset	6 ---
66ef158fa85c Uploaded spficklin parents: diff changeset	7
66ef158fa85c Uploaded spficklin parents: diff changeset	8 ```{r setup, include=FALSE, warning=FALSE, message=FALSE}
66ef158fa85c Uploaded spficklin parents: diff changeset	9 knitr::opts_chunk$set(error = FALSE, echo = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	10 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	11
66ef158fa85c Uploaded spficklin parents: diff changeset	12 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	13 # Make a directory for saving the figures.
66ef158fa85c Uploaded spficklin parents: diff changeset	14 dir.create('figures', showWarnings = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	15 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	16
66ef158fa85c Uploaded spficklin parents: diff changeset	17 # Introduction
66ef158fa85c Uploaded spficklin parents: diff changeset	18 This report contains step-by-step results from use of the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis [WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) tool. This tool wraps the WGCNA R package into a ready-to-use Rmarkdown file. It performs module discovery and network construction using a dataset and optional trait data matrix provided.
66ef158fa85c Uploaded spficklin parents: diff changeset	19
66ef158fa85c Uploaded spficklin parents: diff changeset	20 If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.
66ef158fa85c Uploaded spficklin parents: diff changeset	21
66ef158fa85c Uploaded spficklin parents: diff changeset	22 This report was generated on:
66ef158fa85c Uploaded spficklin parents: diff changeset	23 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	24 format(Sys.time(), "%a %b %d %X %Y")
66ef158fa85c Uploaded spficklin parents: diff changeset	25 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	26
66ef158fa85c Uploaded spficklin parents: diff changeset	27
66ef158fa85c Uploaded spficklin parents: diff changeset	28 ## About the Input Data
66ef158fa85c Uploaded spficklin parents: diff changeset	29 ### Gene Expression Matrix (GEM)
66ef158fa85c Uploaded spficklin parents: diff changeset	30 The gene expression data is an n x m matrix where n rows are the genes, m columns are the samples and the elements represent gene expression levels (derived either from Microarray or RNA-Seq). The matrix was provided in a file meething these rules:
66ef158fa85c Uploaded spficklin parents: diff changeset	31 - Housed in a comma-separated (CSV) file.
66ef158fa85c Uploaded spficklin parents: diff changeset	32 - The rows represent the gene expression levels
66ef158fa85c Uploaded spficklin parents: diff changeset	33 - The first column of each row is the gene, transcript or probe name.
66ef158fa85c Uploaded spficklin parents: diff changeset	34 - The header contains only the sample names and therefore is one value less than the remaining rows of the file.
66ef158fa85c Uploaded spficklin parents: diff changeset	35
66ef158fa85c Uploaded spficklin parents: diff changeset	36 ### Trait/Phenotype Matrix
66ef158fa85c Uploaded spficklin parents: diff changeset	37 The trait/phenotype data is an n x m matrix where n is the samples and m are the features such as experimental condition, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.
66ef158fa85c Uploaded spficklin parents: diff changeset	38
66ef158fa85c Uploaded spficklin parents: diff changeset	39 ## Parameters provided by the user.
66ef158fa85c Uploaded spficklin parents: diff changeset	40 The following describes the input arguments provided to this tool:
66ef158fa85c Uploaded spficklin parents: diff changeset	41 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	42
66ef158fa85c Uploaded spficklin parents: diff changeset	43 if (!is.null(opt$height_cut)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	44 print('The cut height for outlier removal of the sample dendrogram:')
66ef158fa85c Uploaded spficklin parents: diff changeset	45 print(opt$cut_height)
66ef158fa85c Uploaded spficklin parents: diff changeset	46 }
66ef158fa85c Uploaded spficklin parents: diff changeset	47
66ef158fa85c Uploaded spficklin parents: diff changeset	48 if (!is.null(opt$power)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	49 print('The power to which the gene expression data is raised:')
66ef158fa85c Uploaded spficklin parents: diff changeset	50 print(opt$power)
66ef158fa85c Uploaded spficklin parents: diff changeset	51 }
66ef158fa85c Uploaded spficklin parents: diff changeset	52 print('The minimal size for a module:')
66ef158fa85c Uploaded spficklin parents: diff changeset	53 print(opt$min_cluster_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	54
66ef158fa85c Uploaded spficklin parents: diff changeset	55 print('The block size for dividing the GEM to reduce memory requirements:')
66ef158fa85c Uploaded spficklin parents: diff changeset	56 print(opt$block_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	57
66ef158fa85c Uploaded spficklin parents: diff changeset	58 print('The hard threshold when generating the graph file:')
66ef158fa85c Uploaded spficklin parents: diff changeset	59 print(opt$hard_threshold)
66ef158fa85c Uploaded spficklin parents: diff changeset	60
66ef158fa85c Uploaded spficklin parents: diff changeset	61 print('The character string used to identify missing values in the GEM:')
66ef158fa85c Uploaded spficklin parents: diff changeset	62 print(opt$missing_value1)
66ef158fa85c Uploaded spficklin parents: diff changeset	63
66ef158fa85c Uploaded spficklin parents: diff changeset	64 if (!is.null(opt$trait_data)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	65 print('The column in the trait data that contains the sample name:')
66ef158fa85c Uploaded spficklin parents: diff changeset	66 print(opt$sname_col)
66ef158fa85c Uploaded spficklin parents: diff changeset	67
66ef158fa85c Uploaded spficklin parents: diff changeset	68 print('The character string used to identify missing values in the trait data:')
66ef158fa85c Uploaded spficklin parents: diff changeset	69 print(opt$missing_value2)
66ef158fa85c Uploaded spficklin parents: diff changeset	70
66ef158fa85c Uploaded spficklin parents: diff changeset	71 print('Columns in the trait data that should be treated as categorical:')
66ef158fa85c Uploaded spficklin parents: diff changeset	72 print(opt$one_hot_cols)
66ef158fa85c Uploaded spficklin parents: diff changeset	73
66ef158fa85c Uploaded spficklin parents: diff changeset	74 print('Columns in the trait data that should be ignored:')
66ef158fa85c Uploaded spficklin parents: diff changeset	75 print(opt$ignore_cols)
66ef158fa85c Uploaded spficklin parents: diff changeset	76 }
66ef158fa85c Uploaded spficklin parents: diff changeset	77 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	78
66ef158fa85c Uploaded spficklin parents: diff changeset	79 ## If Errors Occur
66ef158fa85c Uploaded spficklin parents: diff changeset	80 Please note, that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, results should be thrown out. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:
66ef158fa85c Uploaded spficklin parents: diff changeset	81
66ef158fa85c Uploaded spficklin parents: diff changeset	82 - Do the formats for the input datasets match the requirements listed above.
66ef158fa85c Uploaded spficklin parents: diff changeset	83 - Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e you don't have more than one such as 0.0 and 0, or NA and 0.0)
66ef158fa85c Uploaded spficklin parents: diff changeset	84 - If trait data was provided, check that the column specified for the sample name is correct.
66ef158fa85c Uploaded spficklin parents: diff changeset	85 - The block size should not exceed 10,000 and should not be lower than 1,000.
66ef158fa85c Uploaded spficklin parents: diff changeset	86 - Ensure that the sample names and all headers in the trait/phenotype data only contain alpha-numeric and underscore characters.
66ef158fa85c Uploaded spficklin parents: diff changeset	87
66ef158fa85c Uploaded spficklin parents: diff changeset	88
66ef158fa85c Uploaded spficklin parents: diff changeset	89 # Expression Data
66ef158fa85c Uploaded spficklin parents: diff changeset	90
66ef158fa85c Uploaded spficklin parents: diff changeset	91 The content below shows the first 10 rows and 6 columns of the Gene Expression Matrix (GEM) file that was provided.
66ef158fa85c Uploaded spficklin parents: diff changeset	92
66ef158fa85c Uploaded spficklin parents: diff changeset	93 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	94 gem = read.csv(opt$expression_data, header = TRUE, row.names = 1, na.strings = opt$missing_value1)
66ef158fa85c Uploaded spficklin parents: diff changeset	95 #table_data = head(gem, 100)
66ef158fa85c Uploaded spficklin parents: diff changeset	96 #datatable(table_data)
66ef158fa85c Uploaded spficklin parents: diff changeset	97 gem[1:10,1:6]
66ef158fa85c Uploaded spficklin parents: diff changeset	98 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	99
66ef158fa85c Uploaded spficklin parents: diff changeset	100 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	101 gemt = as.data.frame(t(gem))
66ef158fa85c Uploaded spficklin parents: diff changeset	102 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	103
66ef158fa85c Uploaded spficklin parents: diff changeset	104 The next step is to check the data for low quality samples or genes. These have too many missing values or consist of genes with zero-variance. Samples and genes are removed if they are low quality. The `goodSamplesGenes` function of WGCNA is used to identify such cases. The following cell indicates if WGCNA identified any low quality genes or samples, and these were removed.
66ef158fa85c Uploaded spficklin parents: diff changeset	105
66ef158fa85c Uploaded spficklin parents: diff changeset	106
66ef158fa85c Uploaded spficklin parents: diff changeset	107 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	108 gsg = goodSamplesGenes(gemt, verbose = 3)
66ef158fa85c Uploaded spficklin parents: diff changeset	109
66ef158fa85c Uploaded spficklin parents: diff changeset	110 if (!gsg$allOK) {
66ef158fa85c Uploaded spficklin parents: diff changeset	111 gemt = gemt[gsg$goodSamples, gsg$goodGenes]
66ef158fa85c Uploaded spficklin parents: diff changeset	112 } else {
66ef158fa85c Uploaded spficklin parents: diff changeset	113 print('all genes are OK!')
66ef158fa85c Uploaded spficklin parents: diff changeset	114 }
66ef158fa85c Uploaded spficklin parents: diff changeset	115 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	116
66ef158fa85c Uploaded spficklin parents: diff changeset	117
66ef158fa85c Uploaded spficklin parents: diff changeset	118 Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If a height was not specified to trim outlier samples, then the `cutreeDynamic` function is used to automatically find outliers, and then they are removed. If you do not approve of the automatically detected height, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.
66ef158fa85c Uploaded spficklin parents: diff changeset	119
66ef158fa85c Uploaded spficklin parents: diff changeset	120 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	121 sampleTree = hclust(dist(gemt), method = "average");
66ef158fa85c Uploaded spficklin parents: diff changeset	122
66ef158fa85c Uploaded spficklin parents: diff changeset	123 plotSampleDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	124 plot(sampleTree, main = "Sample Clustering Prior to Outlier Removal", sub="", xlab="",
66ef158fa85c Uploaded spficklin parents: diff changeset	125 cex.axis = 1, cex.main = 1, cex = 0.5)
66ef158fa85c Uploaded spficklin parents: diff changeset	126 }
66ef158fa85c Uploaded spficklin parents: diff changeset	127 png('figures/01-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	128 plotSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	129 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	130 plotSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	131 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	132
66ef158fa85c Uploaded spficklin parents: diff changeset	133 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	134 if (is.null(opt$height_cut)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	135 print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")
66ef158fa85c Uploaded spficklin parents: diff changeset	136 clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	137 keepSamples = (clust!=0)
66ef158fa85c Uploaded spficklin parents: diff changeset	138 gemt = gemt[keepSamples, ]
66ef158fa85c Uploaded spficklin parents: diff changeset	139 } else {
66ef158fa85c Uploaded spficklin parents: diff changeset	140 print("You specified a height for cutting of", opt$height_cut, ". The cutreeStatic function was used.")
66ef158fa85c Uploaded spficklin parents: diff changeset	141 clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	142 keepSamples = (clust==1)
66ef158fa85c Uploaded spficklin parents: diff changeset	143 gemt = gemt[keepSamples, ]
66ef158fa85c Uploaded spficklin parents: diff changeset	144 }
66ef158fa85c Uploaded spficklin parents: diff changeset	145 n_genes = ncol(gemt)
66ef158fa85c Uploaded spficklin parents: diff changeset	146 n_samples = nrow(gemt)
66ef158fa85c Uploaded spficklin parents: diff changeset	147 removed = length(which(keepSamples == FALSE))
66ef158fa85c Uploaded spficklin parents: diff changeset	148 if (removed == 1) {
66ef158fa85c Uploaded spficklin parents: diff changeset	149 print(paste("A total of", removed, "sample was removed"))
66ef158fa85c Uploaded spficklin parents: diff changeset	150 } else {
66ef158fa85c Uploaded spficklin parents: diff changeset	151 print(paste("A total of", removed, "samples were removed"))
66ef158fa85c Uploaded spficklin parents: diff changeset	152 }
66ef158fa85c Uploaded spficklin parents: diff changeset	153
66ef158fa85c Uploaded spficklin parents: diff changeset	154 # Write out the filtered GEM
66ef158fa85c Uploaded spficklin parents: diff changeset	155 write.csv(t(gemt), opt$filtered_GEM, quote=FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	156 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	157 A file named `filtered_GEM.csv` has been created. This file is a comma-separated file containing the original gene expression data but with outlier samples removed. If no outliers were detected this file will be identical to the original.
66ef158fa85c Uploaded spficklin parents: diff changeset	158
66ef158fa85c Uploaded spficklin parents: diff changeset	159 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	160 sampleTree = hclust(dist(gemt), method = "average");
66ef158fa85c Uploaded spficklin parents: diff changeset	161
66ef158fa85c Uploaded spficklin parents: diff changeset	162 plotFilteredSampleDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	163 plot(sampleTree, main = "Sample Clustering After Outlier Removal", sub="", xlab="",
66ef158fa85c Uploaded spficklin parents: diff changeset	164 cex.axis = 1, cex.main = 1, cex = 0.5)
66ef158fa85c Uploaded spficklin parents: diff changeset	165 }
66ef158fa85c Uploaded spficklin parents: diff changeset	166 png('figures/02-filtered-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	167 plotFilteredSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	168 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	169 plotFilteredSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	170 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	171
66ef158fa85c Uploaded spficklin parents: diff changeset	172 # Network Module Discovery
66ef158fa85c Uploaded spficklin parents: diff changeset	173
66ef158fa85c Uploaded spficklin parents: diff changeset	174 The first step in network module discovery is calculating similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. However, the WGCNA authors suggest that raising the GEM to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to which the data should be raised is initially unknown. This is determined using the `pickSoftThreshold` function of WGCNA which iterates through a series of power values (usually between 1 to 20) and tests how well the data approximates scale-free behavior. The following table shows the results of those tests. The meaning of the table headers are:
66ef158fa85c Uploaded spficklin parents: diff changeset	175
66ef158fa85c Uploaded spficklin parents: diff changeset	176 - Power: The power tested
66ef158fa85c Uploaded spficklin parents: diff changeset	177 - SFT.R.sq: This is the scale free index, or the R.squared value of the undelrying regression model. It indicates how well the power-raised data appears scale free. The higher the value the more scale-free.
66ef158fa85c Uploaded spficklin parents: diff changeset	178 - slope: The slope of the regression line used to calculate SFT.R.sq
66ef158fa85c Uploaded spficklin parents: diff changeset	179 - trunacted.R.sq: The adjusted R.squared measure from the truncated exponential model used to calculate SFT.R.sq
66ef158fa85c Uploaded spficklin parents: diff changeset	180 - mean.k: The mean degree (degree is a measure of how connected a gene is to every other gene. The higher the number the more connected.)
66ef158fa85c Uploaded spficklin parents: diff changeset	181 - median.k: The median degree
66ef158fa85c Uploaded spficklin parents: diff changeset	182 - max.k: The largest degree.
66ef158fa85c Uploaded spficklin parents: diff changeset	183
66ef158fa85c Uploaded spficklin parents: diff changeset	184 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	185 powers = c(1:10, seq(12, 20, 2))
66ef158fa85c Uploaded spficklin parents: diff changeset	186 sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)
66ef158fa85c Uploaded spficklin parents: diff changeset	187 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	188
66ef158fa85c Uploaded spficklin parents: diff changeset	189 The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power value for the network should be the value where there is a diminishing change in both the scale-free index and mean connectivity.
66ef158fa85c Uploaded spficklin parents: diff changeset	190
66ef158fa85c Uploaded spficklin parents: diff changeset	191 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	192 par(mfrow=c(1,2))
66ef158fa85c Uploaded spficklin parents: diff changeset	193 th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]
66ef158fa85c Uploaded spficklin parents: diff changeset	194
66ef158fa85c Uploaded spficklin parents: diff changeset	195 plotPower <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	196 # Scale-free topology fit index as a function of the soft-thresholding power.
66ef158fa85c Uploaded spficklin parents: diff changeset	197 plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
66ef158fa85c Uploaded spficklin parents: diff changeset	198 xlab="Soft Threshold (power)",
66ef158fa85c Uploaded spficklin parents: diff changeset	199 ylab="Scale Free Topology Model Fit,signed R^2", type="n",
66ef158fa85c Uploaded spficklin parents: diff changeset	200 main = paste("Scale Independence"), cex.lab = 0.5);
66ef158fa85c Uploaded spficklin parents: diff changeset	201 text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
66ef158fa85c Uploaded spficklin parents: diff changeset	202 labels=powers,cex=0.5,col="red");
66ef158fa85c Uploaded spficklin parents: diff changeset	203 #abline(h=th, col="blue")
66ef158fa85c Uploaded spficklin parents: diff changeset	204
66ef158fa85c Uploaded spficklin parents: diff changeset	205 # Mean connectivity as a function of the soft-thresholding power.
66ef158fa85c Uploaded spficklin parents: diff changeset	206 plot(sft$fitIndices[,1], sft$fitIndices[,5],
66ef158fa85c Uploaded spficklin parents: diff changeset	207 xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
66ef158fa85c Uploaded spficklin parents: diff changeset	208 main = paste("Mean Connectivity"), cex.lab = 0.5)
66ef158fa85c Uploaded spficklin parents: diff changeset	209 text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
66ef158fa85c Uploaded spficklin parents: diff changeset	210 #abline(h=th, col="blue")
66ef158fa85c Uploaded spficklin parents: diff changeset	211 par(mfrow=c(1,1))
66ef158fa85c Uploaded spficklin parents: diff changeset	212 }
66ef158fa85c Uploaded spficklin parents: diff changeset	213
66ef158fa85c Uploaded spficklin parents: diff changeset	214 png('figures/03-power_thresholding.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	215 plotPower()
66ef158fa85c Uploaded spficklin parents: diff changeset	216 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	217 plotPower()
66ef158fa85c Uploaded spficklin parents: diff changeset	218 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	219 Using the values in the table above, WGCNA is able to predict the ideal power. This selection is indicated in the following cell and is shown as a blue line on the plots above. If you believe that the power was incorrectly chosen, you can re-run this tool with the same input files and provide the desired power.
66ef158fa85c Uploaded spficklin parents: diff changeset	220
66ef158fa85c Uploaded spficklin parents: diff changeset	221 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	222 print("WGCNA predicted the following power:")
66ef158fa85c Uploaded spficklin parents: diff changeset	223 print(sft$powerEstimate)
66ef158fa85c Uploaded spficklin parents: diff changeset	224 power = sft$powerEstimate
66ef158fa85c Uploaded spficklin parents: diff changeset	225 if (!is.null(opt$power)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	226 print("However, you selected to override this by providing a power of:", opt$soft_threshold_power)
66ef158fa85c Uploaded spficklin parents: diff changeset	227 print(opt$soft_threshold_power)
66ef158fa85c Uploaded spficklin parents: diff changeset	228 power = opt$power
66ef158fa85c Uploaded spficklin parents: diff changeset	229 }
66ef158fa85c Uploaded spficklin parents: diff changeset	230 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	231
66ef158fa85c Uploaded spficklin parents: diff changeset	232 Now that a power has been identified, modules can be discovered. Here, the `blockwiseModule` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks is dependent on the block size you provided.
66ef158fa85c Uploaded spficklin parents: diff changeset	233
66ef158fa85c Uploaded spficklin parents: diff changeset	234 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	235 net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
66ef158fa85c Uploaded spficklin parents: diff changeset	236 TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
66ef158fa85c Uploaded spficklin parents: diff changeset	237 reassignThreshold = 0, mergeCutHeight = 0.25,
66ef158fa85c Uploaded spficklin parents: diff changeset	238 numericLabels = TRUE, pamRespectsDendro = FALSE,
66ef158fa85c Uploaded spficklin parents: diff changeset	239 verbose = 1, saveTOMs = TRUE,
66ef158fa85c Uploaded spficklin parents: diff changeset	240 saveTOMFileBase = "TOM")
66ef158fa85c Uploaded spficklin parents: diff changeset	241 blocks = sort(unique(net$blocks))
66ef158fa85c Uploaded spficklin parents: diff changeset	242 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	243 The following table shows the list of modules that were discovered and their size (i.e. number of genes).
66ef158fa85c Uploaded spficklin parents: diff changeset	244
66ef158fa85c Uploaded spficklin parents: diff changeset	245 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	246 module_labels = labels2colors(net$colors)
66ef158fa85c Uploaded spficklin parents: diff changeset	247 module_labels = paste("ME", module_labels, sep="")
66ef158fa85c Uploaded spficklin parents: diff changeset	248 module_labels2num = unique(data.frame(label = module_labels, num = net$color, row.names=NULL))
66ef158fa85c Uploaded spficklin parents: diff changeset	249 rownames(module_labels2num) = paste0('ME', module_labels2num$num)
66ef158fa85c Uploaded spficklin parents: diff changeset	250 modules = unique(as.data.frame(table(module_labels)))
66ef158fa85c Uploaded spficklin parents: diff changeset	251 n_modules = length(modules) - 1
66ef158fa85c Uploaded spficklin parents: diff changeset	252 module_size_upper = modules[2]
66ef158fa85c Uploaded spficklin parents: diff changeset	253 module_size_lower = modules[length(modules)]
66ef158fa85c Uploaded spficklin parents: diff changeset	254 colnames(modules) = c('Module', 'Module Size')
66ef158fa85c Uploaded spficklin parents: diff changeset	255 #datatable(modules)
66ef158fa85c Uploaded spficklin parents: diff changeset	256 modules
66ef158fa85c Uploaded spficklin parents: diff changeset	257 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	258
66ef158fa85c Uploaded spficklin parents: diff changeset	259 Modules consist of a set of genes that have highly similar expression patterns. Therefore, the similarity of genes within a module can be summarized using an "eigengene" vector. This vector is analgous to the first principal component in a PCA analysis. Once each module's eigengene is calculated, they can be compared and displayed in dendrogram to identify which modules are most similar to each other. This is visible in the following plot.
66ef158fa85c Uploaded spficklin parents: diff changeset	260
66ef158fa85c Uploaded spficklin parents: diff changeset	261 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	262 MEs = net$MEs
66ef158fa85c Uploaded spficklin parents: diff changeset	263 colnames(MEs) = module_labels2num[colnames(MEs),]$label
66ef158fa85c Uploaded spficklin parents: diff changeset	264
66ef158fa85c Uploaded spficklin parents: diff changeset	265 png('figures/04-module_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	266 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	267 dev.off()
66ef158fa85c Uploaded spficklin parents: diff changeset	268 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	269 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	270
66ef158fa85c Uploaded spficklin parents: diff changeset	271 Alternatively, we can use a heatmap to explore similarity of each module.
66ef158fa85c Uploaded spficklin parents: diff changeset	272 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	273 plotModuleHeatmap <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	274 plotEigengeneNetworks(MEs, "Module Eigengene Heatmap",
66ef158fa85c Uploaded spficklin parents: diff changeset	275 marHeatmap = c(2, 3, 2, 2),
66ef158fa85c Uploaded spficklin parents: diff changeset	276 plotDendrograms = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	277 }
66ef158fa85c Uploaded spficklin parents: diff changeset	278 png('figures/05-module_eigengene_heatmap.png', width=4 ,height=4, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	279 plotModuleHeatmap()
66ef158fa85c Uploaded spficklin parents: diff changeset	280 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	281 plotModuleHeatmap()
66ef158fa85c Uploaded spficklin parents: diff changeset	282 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	283
66ef158fa85c Uploaded spficklin parents: diff changeset	284 We can examine gene similarity within the context of our modules. The following dendrogram clusters genes by their similarity of expression and the modules to which each gene belongs is shown under the graph. When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicate that genes in modules tend to have similar expression.
66ef158fa85c Uploaded spficklin parents: diff changeset	285
66ef158fa85c Uploaded spficklin parents: diff changeset	286 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	287 # Plot the dendrogram and the module colors underneath
66ef158fa85c Uploaded spficklin parents: diff changeset	288 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	289 options(repr.plot.width=15, repr.plot.height=10)
66ef158fa85c Uploaded spficklin parents: diff changeset	290 colors = module_labels[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	291 colors = sub('ME','', colors)
66ef158fa85c Uploaded spficklin parents: diff changeset	292 plotClusterDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	293 plotDendroAndColors(net$dendrograms[[i]], colors,
66ef158fa85c Uploaded spficklin parents: diff changeset	294 "Module colors", dendroLabels = FALSE, hang = 0.03,
66ef158fa85c Uploaded spficklin parents: diff changeset	295 addGuide = TRUE, guideHang = 0.05,
66ef158fa85c Uploaded spficklin parents: diff changeset	296 main=paste('Cluster Dendgrogram, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	297 }
66ef158fa85c Uploaded spficklin parents: diff changeset	298 png(paste0('figures/06-cluster_dendrogram_block_', i, '.png'), width=6 ,height=4, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	299 plotClusterDendro();
66ef158fa85c Uploaded spficklin parents: diff changeset	300 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	301 plotClusterDendro();
66ef158fa85c Uploaded spficklin parents: diff changeset	302 }
66ef158fa85c Uploaded spficklin parents: diff changeset	303
66ef158fa85c Uploaded spficklin parents: diff changeset	304 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	305
66ef158fa85c Uploaded spficklin parents: diff changeset	306 The network is housed in a n x n similarity matrix known as the the Topological Overlap Matrix (TOM), where n is the number of genes and the value in each cell indicates the measure of similarity in terms of correlation of expression and interconnectedness. The following heat maps shows the TOM. Note, that the dendrograms in the TOM heat map may differ from what is shown above. This is because a subset of genes were selected to draw the heat maps in order to save on computational time.
66ef158fa85c Uploaded spficklin parents: diff changeset	307
66ef158fa85c Uploaded spficklin parents: diff changeset	308 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	309 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	310 # Load the TOM from a file.
66ef158fa85c Uploaded spficklin parents: diff changeset	311 load(net$TOMFiles[i])
66ef158fa85c Uploaded spficklin parents: diff changeset	312 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded spficklin parents: diff changeset	313 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	314 dissTOM = 1-TOM
66ef158fa85c Uploaded spficklin parents: diff changeset	315
66ef158fa85c Uploaded spficklin parents: diff changeset	316 # For reproducibility, we set the random seed
66ef158fa85c Uploaded spficklin parents: diff changeset	317 set.seed(10);
66ef158fa85c Uploaded spficklin parents: diff changeset	318 select = sample(dim(TOM)[1], size = 1000);
66ef158fa85c Uploaded spficklin parents: diff changeset	319 selectColors = module_labels[net$blockGenes[[i]][select]]
66ef158fa85c Uploaded spficklin parents: diff changeset	320 selectTOM = dissTOM[select, select];
66ef158fa85c Uploaded spficklin parents: diff changeset	321
66ef158fa85c Uploaded spficklin parents: diff changeset	322 # There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
66ef158fa85c Uploaded spficklin parents: diff changeset	323 selectTree = hclust(as.dist(selectTOM), method = "average")
66ef158fa85c Uploaded spficklin parents: diff changeset	324
66ef158fa85c Uploaded spficklin parents: diff changeset	325 # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
66ef158fa85c Uploaded spficklin parents: diff changeset	326 # the color palette; setting the diagonal to NA also improves the clarity of the plot
66ef158fa85c Uploaded spficklin parents: diff changeset	327 plotDiss = selectTOM^7;
66ef158fa85c Uploaded spficklin parents: diff changeset	328 diag(plotDiss) = NA;
66ef158fa85c Uploaded spficklin parents: diff changeset	329 colors = sub('ME','', selectColors)
66ef158fa85c Uploaded spficklin parents: diff changeset	330
66ef158fa85c Uploaded spficklin parents: diff changeset	331 png(paste0('figures/06-TOM_heatmap_block_', i, '.png'), width=6 ,height=6, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	332 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	333 dev.off()
66ef158fa85c Uploaded spficklin parents: diff changeset	334 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	335 }
66ef158fa85c Uploaded spficklin parents: diff changeset	336 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	337
66ef158fa85c Uploaded spficklin parents: diff changeset	338 ```{r}
4 b14e4bf568b0 Uploaded spficklin parents: 0 diff changeset	339 module_colors = sub('ME','', module_labels)
b14e4bf568b0 Uploaded spficklin parents: 0 diff changeset	340 output = cbind(colnames(gemt), module_labels, module_colors)
b14e4bf568b0 Uploaded spficklin parents: 0 diff changeset	341 colnames(output) = c('Gene', 'Module', 'Color')
0 66ef158fa85c Uploaded spficklin parents: diff changeset	342 write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	343 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	344
66ef158fa85c Uploaded spficklin parents: diff changeset	345 A file has been generated named `gene_module_file.csv` which contains the list of genes and the modules they belong to.
66ef158fa85c Uploaded spficklin parents: diff changeset	346
66ef158fa85c Uploaded spficklin parents: diff changeset	347 The TOM serves as both a simialrity matrix and an adjacency matrix. The adjacency matrix is typically identical to a similarity matrix but with values above a set threshold set to 1 and values below set to 0. This is known as hard thresholding. However, WGCNA does not set values above a threshold to zero but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding", because the pairwise weights of all genes contributed to discover of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
66ef158fa85c Uploaded spficklin parents: diff changeset	348
66ef158fa85c Uploaded spficklin parents: diff changeset	349 Unfortunately, this "soft thresholding" approach can make creation of a graph representation of the network difficult. If we exported the TOM as a connected graph it would result in a fully complete graph and would be difficult to interpret. Therefore, we must still set a hard-threshold if we want to visualize connectivity in graph form. Setting a hard threshold, if too high can result in genes being excluded from the graph and a threshold that is too low can result in too many false edges in the graph.
66ef158fa85c Uploaded spficklin parents: diff changeset	350
66ef158fa85c Uploaded spficklin parents: diff changeset	351 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	352 edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
66ef158fa85c Uploaded spficklin parents: diff changeset	353 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	354 # Load the TOM from a file.
66ef158fa85c Uploaded spficklin parents: diff changeset	355 load(net$TOMFiles[i])
66ef158fa85c Uploaded spficklin parents: diff changeset	356 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded spficklin parents: diff changeset	357 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	358 colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	359 row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	360
66ef158fa85c Uploaded spficklin parents: diff changeset	361 cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
66ef158fa85c Uploaded spficklin parents: diff changeset	362 edges = rbind(edges, cydata$edgeData)
66ef158fa85c Uploaded spficklin parents: diff changeset	363 }
66ef158fa85c Uploaded spficklin parents: diff changeset	364
66ef158fa85c Uploaded spficklin parents: diff changeset	365 edges$Interaction = 'co'
66ef158fa85c Uploaded spficklin parents: diff changeset	366 output = edges[,c('fromNode','toNode','Interaction', 'weight')]
66ef158fa85c Uploaded spficklin parents: diff changeset	367 colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
66ef158fa85c Uploaded spficklin parents: diff changeset	368 write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
66ef158fa85c Uploaded spficklin parents: diff changeset	369 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	370
66ef158fa85c Uploaded spficklin parents: diff changeset	371 Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.
66ef158fa85c Uploaded spficklin parents: diff changeset	372
66ef158fa85c Uploaded spficklin parents: diff changeset	373 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	374 # Save this image for the next step which is optional if theuser
66ef158fa85c Uploaded spficklin parents: diff changeset	375 # provides a trait file.
66ef158fa85c Uploaded spficklin parents: diff changeset	376 save.image(file=opt$r_data)
66ef158fa85c Uploaded spficklin parents: diff changeset	377 ```

Mercurial > repos > spficklin > aurora_wgcna

annotate aurora_wgcna.Rmd @ 6:58b01fa2cc81 draft