aurora_wgcna: aurora_wgcna.Rmd annotate

annotate aurora_wgcna.Rmd @ 17:8d2a768304e3 draft default tip

Uploaded

author	spficklin
date	Mon, 09 Dec 2019 12:45:11 -0500
parents	53fc3fdf5e9a
children

rev	line source
0 66ef158fa85c Uploaded spficklin parents: diff changeset	1 ---
66ef158fa85c Uploaded spficklin parents: diff changeset	2 title: 'Aurora Galaxy WGCNA Tool: Gene Co-Expression Network Construction & Analysis'
66ef158fa85c Uploaded spficklin parents: diff changeset	3 output:
13 53fc3fdf5e9a Uploaded spficklin parents: 10 diff changeset	4 html_document:
0 66ef158fa85c Uploaded spficklin parents: diff changeset	5 number_sections: false
66ef158fa85c Uploaded spficklin parents: diff changeset	6 ---
66ef158fa85c Uploaded spficklin parents: diff changeset	7
66ef158fa85c Uploaded spficklin parents: diff changeset	8 ```{r setup, include=FALSE, warning=FALSE, message=FALSE}
66ef158fa85c Uploaded spficklin parents: diff changeset	9 knitr::opts_chunk$set(error = FALSE, echo = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	10 ```
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	11 ```{r, include=FALSE}
96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	12 options(tinytex.verbose = TRUE)
96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	13 ```
0 66ef158fa85c Uploaded spficklin parents: diff changeset	14
66ef158fa85c Uploaded spficklin parents: diff changeset	15 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	16 # Make a directory for saving the figures.
66ef158fa85c Uploaded spficklin parents: diff changeset	17 dir.create('figures', showWarnings = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	18 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	19
66ef158fa85c Uploaded spficklin parents: diff changeset	20 # Introduction
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	21 This report contains step-by-step results from use of the [Aurora Galaxy](https://github.com/statonlab/aurora-galaxy-tools) Weighted Gene Co-expression Network Analysis [WGCNA](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) tool. This tool wraps the WGCNA R package into a ready-to-use Rmarkdown file. It performs module discovery and network construction using a dataset and optional trait data matrix provided.
0 66ef158fa85c Uploaded spficklin parents: diff changeset	22
66ef158fa85c Uploaded spficklin parents: diff changeset	23 If you provided trait data, a second report will be available with results comparing the trait values to the identified modules.
66ef158fa85c Uploaded spficklin parents: diff changeset	24
66ef158fa85c Uploaded spficklin parents: diff changeset	25 This report was generated on:
66ef158fa85c Uploaded spficklin parents: diff changeset	26 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	27 format(Sys.time(), "%a %b %d %X %Y")
66ef158fa85c Uploaded spficklin parents: diff changeset	28 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	29
66ef158fa85c Uploaded spficklin parents: diff changeset	30
66ef158fa85c Uploaded spficklin parents: diff changeset	31 ## About the Input Data
66ef158fa85c Uploaded spficklin parents: diff changeset	32 ### Gene Expression Matrix (GEM)
66ef158fa85c Uploaded spficklin parents: diff changeset	33 The gene expression data is an n x m matrix where n rows are the genes, m columns are the samples and the elements represent gene expression levels (derived either from Microarray or RNA-Seq). The matrix was provided in a file meething these rules:
66ef158fa85c Uploaded spficklin parents: diff changeset	34 - Housed in a comma-separated (CSV) file.
66ef158fa85c Uploaded spficklin parents: diff changeset	35 - The rows represent the gene expression levels
66ef158fa85c Uploaded spficklin parents: diff changeset	36 - The first column of each row is the gene, transcript or probe name.
66ef158fa85c Uploaded spficklin parents: diff changeset	37 - The header contains only the sample names and therefore is one value less than the remaining rows of the file.
66ef158fa85c Uploaded spficklin parents: diff changeset	38
66ef158fa85c Uploaded spficklin parents: diff changeset	39 ### Trait/Phenotype Matrix
66ef158fa85c Uploaded spficklin parents: diff changeset	40 The trait/phenotype data is an n x m matrix where n is the samples and m are the features such as experimental condition, biosample properties, traits or phenotype values. The matrix is stored in a comma-separated (CSV) file and has a header.
66ef158fa85c Uploaded spficklin parents: diff changeset	41
66ef158fa85c Uploaded spficklin parents: diff changeset	42 ## Parameters provided by the user.
66ef158fa85c Uploaded spficklin parents: diff changeset	43 The following describes the input arguments provided to this tool:
66ef158fa85c Uploaded spficklin parents: diff changeset	44 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	45
66ef158fa85c Uploaded spficklin parents: diff changeset	46 if (!is.null(opt$height_cut)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	47 print('The cut height for outlier removal of the sample dendrogram:')
66ef158fa85c Uploaded spficklin parents: diff changeset	48 print(opt$cut_height)
66ef158fa85c Uploaded spficklin parents: diff changeset	49 }
66ef158fa85c Uploaded spficklin parents: diff changeset	50
66ef158fa85c Uploaded spficklin parents: diff changeset	51 if (!is.null(opt$power)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	52 print('The power to which the gene expression data is raised:')
66ef158fa85c Uploaded spficklin parents: diff changeset	53 print(opt$power)
66ef158fa85c Uploaded spficklin parents: diff changeset	54 }
66ef158fa85c Uploaded spficklin parents: diff changeset	55 print('The minimal size for a module:')
66ef158fa85c Uploaded spficklin parents: diff changeset	56 print(opt$min_cluster_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	57
66ef158fa85c Uploaded spficklin parents: diff changeset	58 print('The block size for dividing the GEM to reduce memory requirements:')
66ef158fa85c Uploaded spficklin parents: diff changeset	59 print(opt$block_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	60
66ef158fa85c Uploaded spficklin parents: diff changeset	61 print('The hard threshold when generating the graph file:')
66ef158fa85c Uploaded spficklin parents: diff changeset	62 print(opt$hard_threshold)
66ef158fa85c Uploaded spficklin parents: diff changeset	63
66ef158fa85c Uploaded spficklin parents: diff changeset	64 print('The character string used to identify missing values in the GEM:')
66ef158fa85c Uploaded spficklin parents: diff changeset	65 print(opt$missing_value1)
66ef158fa85c Uploaded spficklin parents: diff changeset	66
66ef158fa85c Uploaded spficklin parents: diff changeset	67 if (!is.null(opt$trait_data)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	68 print('The column in the trait data that contains the sample name:')
66ef158fa85c Uploaded spficklin parents: diff changeset	69 print(opt$sname_col)
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	70
0 66ef158fa85c Uploaded spficklin parents: diff changeset	71 print('The character string used to identify missing values in the trait data:')
66ef158fa85c Uploaded spficklin parents: diff changeset	72 print(opt$missing_value2)
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	73
0 66ef158fa85c Uploaded spficklin parents: diff changeset	74 print('Columns in the trait data that should be treated as categorical:')
66ef158fa85c Uploaded spficklin parents: diff changeset	75 print(opt$one_hot_cols)
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	76
0 66ef158fa85c Uploaded spficklin parents: diff changeset	77 print('Columns in the trait data that should be ignored:')
66ef158fa85c Uploaded spficklin parents: diff changeset	78 print(opt$ignore_cols)
66ef158fa85c Uploaded spficklin parents: diff changeset	79 }
66ef158fa85c Uploaded spficklin parents: diff changeset	80 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	81
66ef158fa85c Uploaded spficklin parents: diff changeset	82 ## If Errors Occur
66ef158fa85c Uploaded spficklin parents: diff changeset	83 Please note, that if any of the R code encountered problems, error messages will appear in the report below. If an error occurs anywhere in the report, results should be thrown out. Errors are usually caused by improperly formatted input data or improper input arguments. Use the following checklist to find and correct potential errors:
66ef158fa85c Uploaded spficklin parents: diff changeset	84
66ef158fa85c Uploaded spficklin parents: diff changeset	85 - Do the formats for the input datasets match the requirements listed above.
66ef158fa85c Uploaded spficklin parents: diff changeset	86 - Do the values set for missing values match the values in the input files, and is the missing value used consistently within the input files (i.e you don't have more than one such as 0.0 and 0, or NA and 0.0)
66ef158fa85c Uploaded spficklin parents: diff changeset	87 - If trait data was provided, check that the column specified for the sample name is correct.
66ef158fa85c Uploaded spficklin parents: diff changeset	88 - The block size should not exceed 10,000 and should not be lower than 1,000.
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	89 - Ensure that the sample names and all headers in the trait/phenotype data only contain alpha-numeric and underscore characters.
0 66ef158fa85c Uploaded spficklin parents: diff changeset	90
66ef158fa85c Uploaded spficklin parents: diff changeset	91
66ef158fa85c Uploaded spficklin parents: diff changeset	92 # Expression Data
66ef158fa85c Uploaded spficklin parents: diff changeset	93
66ef158fa85c Uploaded spficklin parents: diff changeset	94 The content below shows the first 10 rows and 6 columns of the Gene Expression Matrix (GEM) file that was provided.
66ef158fa85c Uploaded spficklin parents: diff changeset	95
66ef158fa85c Uploaded spficklin parents: diff changeset	96 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	97 gem = read.csv(opt$expression_data, header = TRUE, row.names = 1, na.strings = opt$missing_value1)
66ef158fa85c Uploaded spficklin parents: diff changeset	98 #table_data = head(gem, 100)
66ef158fa85c Uploaded spficklin parents: diff changeset	99 #datatable(table_data)
66ef158fa85c Uploaded spficklin parents: diff changeset	100 gem[1:10,1:6]
66ef158fa85c Uploaded spficklin parents: diff changeset	101 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	102
66ef158fa85c Uploaded spficklin parents: diff changeset	103 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	104 gemt = as.data.frame(t(gem))
66ef158fa85c Uploaded spficklin parents: diff changeset	105 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	106
66ef158fa85c Uploaded spficklin parents: diff changeset	107 The next step is to check the data for low quality samples or genes. These have too many missing values or consist of genes with zero-variance. Samples and genes are removed if they are low quality. The `goodSamplesGenes` function of WGCNA is used to identify such cases. The following cell indicates if WGCNA identified any low quality genes or samples, and these were removed.
66ef158fa85c Uploaded spficklin parents: diff changeset	108
66ef158fa85c Uploaded spficklin parents: diff changeset	109
66ef158fa85c Uploaded spficklin parents: diff changeset	110 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	111 gsg = goodSamplesGenes(gemt, verbose = 3)
66ef158fa85c Uploaded spficklin parents: diff changeset	112
66ef158fa85c Uploaded spficklin parents: diff changeset	113 if (!gsg$allOK) {
66ef158fa85c Uploaded spficklin parents: diff changeset	114 gemt = gemt[gsg$goodSamples, gsg$goodGenes]
66ef158fa85c Uploaded spficklin parents: diff changeset	115 } else {
66ef158fa85c Uploaded spficklin parents: diff changeset	116 print('all genes are OK!')
66ef158fa85c Uploaded spficklin parents: diff changeset	117 }
66ef158fa85c Uploaded spficklin parents: diff changeset	118 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	119
66ef158fa85c Uploaded spficklin parents: diff changeset	120
66ef158fa85c Uploaded spficklin parents: diff changeset	121 Hierarchical clustering can be used to explore the similarity of expression across the samples of the GEM. The following dendrogram shows the results of that clustering. Outliers typically appear on their own in the dendrogram. If a height was not specified to trim outlier samples, then the `cutreeDynamic` function is used to automatically find outliers, and then they are removed. If you do not approve of the automatically detected height, you can re-run this tool with a desired cut height. The two plots below show the dendrogram before and after outlier removal.
66ef158fa85c Uploaded spficklin parents: diff changeset	122
66ef158fa85c Uploaded spficklin parents: diff changeset	123 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	124 sampleTree = hclust(dist(gemt), method = "average");
66ef158fa85c Uploaded spficklin parents: diff changeset	125
66ef158fa85c Uploaded spficklin parents: diff changeset	126 plotSampleDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	127 plot(sampleTree, main = "Sample Clustering Prior to Outlier Removal", sub="", xlab="",
66ef158fa85c Uploaded spficklin parents: diff changeset	128 cex.axis = 1, cex.main = 1, cex = 0.5)
66ef158fa85c Uploaded spficklin parents: diff changeset	129 }
66ef158fa85c Uploaded spficklin parents: diff changeset	130 png('figures/01-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	131 plotSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	132 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	133 plotSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	134 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	135
66ef158fa85c Uploaded spficklin parents: diff changeset	136 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	137 if (is.null(opt$height_cut)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	138 print("You did not specify a height for cutting the dendrogram. The cutreeDynamic function was used.")
66ef158fa85c Uploaded spficklin parents: diff changeset	139 clust = cutreeDynamic(sampleTree, method="tree", minClusterSize = opt$min_cluster_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	140 keepSamples = (clust!=0)
66ef158fa85c Uploaded spficklin parents: diff changeset	141 gemt = gemt[keepSamples, ]
66ef158fa85c Uploaded spficklin parents: diff changeset	142 } else {
66ef158fa85c Uploaded spficklin parents: diff changeset	143 print("You specified a height for cutting of", opt$height_cut, ". The cutreeStatic function was used.")
66ef158fa85c Uploaded spficklin parents: diff changeset	144 clust = cutreeStatic(sampleTree, cutHeight = opt$height_cut, minSize = opt$min_cluster_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	145 keepSamples = (clust==1)
66ef158fa85c Uploaded spficklin parents: diff changeset	146 gemt = gemt[keepSamples, ]
66ef158fa85c Uploaded spficklin parents: diff changeset	147 }
66ef158fa85c Uploaded spficklin parents: diff changeset	148 n_genes = ncol(gemt)
66ef158fa85c Uploaded spficklin parents: diff changeset	149 n_samples = nrow(gemt)
66ef158fa85c Uploaded spficklin parents: diff changeset	150 removed = length(which(keepSamples == FALSE))
66ef158fa85c Uploaded spficklin parents: diff changeset	151 if (removed == 1) {
66ef158fa85c Uploaded spficklin parents: diff changeset	152 print(paste("A total of", removed, "sample was removed"))
66ef158fa85c Uploaded spficklin parents: diff changeset	153 } else {
66ef158fa85c Uploaded spficklin parents: diff changeset	154 print(paste("A total of", removed, "samples were removed"))
66ef158fa85c Uploaded spficklin parents: diff changeset	155 }
66ef158fa85c Uploaded spficklin parents: diff changeset	156
66ef158fa85c Uploaded spficklin parents: diff changeset	157 # Write out the filtered GEM
66ef158fa85c Uploaded spficklin parents: diff changeset	158 write.csv(t(gemt), opt$filtered_GEM, quote=FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	159 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	160 A file named `filtered_GEM.csv` has been created. This file is a comma-separated file containing the original gene expression data but with outlier samples removed. If no outliers were detected this file will be identical to the original.
66ef158fa85c Uploaded spficklin parents: diff changeset	161
66ef158fa85c Uploaded spficklin parents: diff changeset	162 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	163 sampleTree = hclust(dist(gemt), method = "average");
66ef158fa85c Uploaded spficklin parents: diff changeset	164
66ef158fa85c Uploaded spficklin parents: diff changeset	165 plotFilteredSampleDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	166 plot(sampleTree, main = "Sample Clustering After Outlier Removal", sub="", xlab="",
66ef158fa85c Uploaded spficklin parents: diff changeset	167 cex.axis = 1, cex.main = 1, cex = 0.5)
66ef158fa85c Uploaded spficklin parents: diff changeset	168 }
66ef158fa85c Uploaded spficklin parents: diff changeset	169 png('figures/02-filtered-sample_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	170 plotFilteredSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	171 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	172 plotFilteredSampleDendro()
66ef158fa85c Uploaded spficklin parents: diff changeset	173 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	174
66ef158fa85c Uploaded spficklin parents: diff changeset	175 # Network Module Discovery
66ef158fa85c Uploaded spficklin parents: diff changeset	176
66ef158fa85c Uploaded spficklin parents: diff changeset	177 The first step in network module discovery is calculating similarity of gene expression. This is performed by comparing the expression of every gene with every other gene using a correlation test. However, the WGCNA authors suggest that raising the GEM to a power that best approximates scale-free behavior improves the quality of the final modules. However, the power to which the data should be raised is initially unknown. This is determined using the `pickSoftThreshold` function of WGCNA which iterates through a series of power values (usually between 1 to 20) and tests how well the data approximates scale-free behavior. The following table shows the results of those tests. The meaning of the table headers are:
66ef158fa85c Uploaded spficklin parents: diff changeset	178
66ef158fa85c Uploaded spficklin parents: diff changeset	179 - Power: The power tested
66ef158fa85c Uploaded spficklin parents: diff changeset	180 - SFT.R.sq: This is the scale free index, or the R.squared value of the undelrying regression model. It indicates how well the power-raised data appears scale free. The higher the value the more scale-free.
66ef158fa85c Uploaded spficklin parents: diff changeset	181 - slope: The slope of the regression line used to calculate SFT.R.sq
66ef158fa85c Uploaded spficklin parents: diff changeset	182 - trunacted.R.sq: The adjusted R.squared measure from the truncated exponential model used to calculate SFT.R.sq
66ef158fa85c Uploaded spficklin parents: diff changeset	183 - mean.k: The mean degree (degree is a measure of how connected a gene is to every other gene. The higher the number the more connected.)
66ef158fa85c Uploaded spficklin parents: diff changeset	184 - median.k: The median degree
66ef158fa85c Uploaded spficklin parents: diff changeset	185 - max.k: The largest degree.
66ef158fa85c Uploaded spficklin parents: diff changeset	186
66ef158fa85c Uploaded spficklin parents: diff changeset	187 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	188 powers = c(1:10, seq(12, 20, 2))
66ef158fa85c Uploaded spficklin parents: diff changeset	189 sft = pickSoftThreshold(gemt, powerVector = powers, verbose = 5)
66ef158fa85c Uploaded spficklin parents: diff changeset	190 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	191
66ef158fa85c Uploaded spficklin parents: diff changeset	192 The following plots show how the scale-free index and mean connectivity change as the power is adjusted. The ideal power value for the network should be the value where there is a diminishing change in both the scale-free index and mean connectivity.
66ef158fa85c Uploaded spficklin parents: diff changeset	193
66ef158fa85c Uploaded spficklin parents: diff changeset	194 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	195 par(mfrow=c(1,2))
66ef158fa85c Uploaded spficklin parents: diff changeset	196 th = sft$fitIndices$SFT.R.sq[which(sft$fitIndices$Power == sft$powerEstimate)]
66ef158fa85c Uploaded spficklin parents: diff changeset	197
66ef158fa85c Uploaded spficklin parents: diff changeset	198 plotPower <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	199 # Scale-free topology fit index as a function of the soft-thresholding power.
66ef158fa85c Uploaded spficklin parents: diff changeset	200 plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
66ef158fa85c Uploaded spficklin parents: diff changeset	201 xlab="Soft Threshold (power)",
66ef158fa85c Uploaded spficklin parents: diff changeset	202 ylab="Scale Free Topology Model Fit,signed R^2", type="n",
66ef158fa85c Uploaded spficklin parents: diff changeset	203 main = paste("Scale Independence"), cex.lab = 0.5);
66ef158fa85c Uploaded spficklin parents: diff changeset	204 text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
66ef158fa85c Uploaded spficklin parents: diff changeset	205 labels=powers,cex=0.5,col="red");
66ef158fa85c Uploaded spficklin parents: diff changeset	206 #abline(h=th, col="blue")
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	207
0 66ef158fa85c Uploaded spficklin parents: diff changeset	208 # Mean connectivity as a function of the soft-thresholding power.
66ef158fa85c Uploaded spficklin parents: diff changeset	209 plot(sft$fitIndices[,1], sft$fitIndices[,5],
66ef158fa85c Uploaded spficklin parents: diff changeset	210 xlab="Soft Threshold (power)",ylab="Mean Connectivity", type="n",
66ef158fa85c Uploaded spficklin parents: diff changeset	211 main = paste("Mean Connectivity"), cex.lab = 0.5)
66ef158fa85c Uploaded spficklin parents: diff changeset	212 text(sft$fitIndices[,1], sft$fitIndices[,5], labels=powers, cex=0.5,col="red")
66ef158fa85c Uploaded spficklin parents: diff changeset	213 #abline(h=th, col="blue")
66ef158fa85c Uploaded spficklin parents: diff changeset	214 par(mfrow=c(1,1))
66ef158fa85c Uploaded spficklin parents: diff changeset	215 }
66ef158fa85c Uploaded spficklin parents: diff changeset	216
66ef158fa85c Uploaded spficklin parents: diff changeset	217 png('figures/03-power_thresholding.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	218 plotPower()
66ef158fa85c Uploaded spficklin parents: diff changeset	219 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	220 plotPower()
66ef158fa85c Uploaded spficklin parents: diff changeset	221 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	222 Using the values in the table above, WGCNA is able to predict the ideal power. This selection is indicated in the following cell and is shown as a blue line on the plots above. If you believe that the power was incorrectly chosen, you can re-run this tool with the same input files and provide the desired power.
66ef158fa85c Uploaded spficklin parents: diff changeset	223
66ef158fa85c Uploaded spficklin parents: diff changeset	224 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	225 print("WGCNA predicted the following power:")
66ef158fa85c Uploaded spficklin parents: diff changeset	226 print(sft$powerEstimate)
66ef158fa85c Uploaded spficklin parents: diff changeset	227 power = sft$powerEstimate
66ef158fa85c Uploaded spficklin parents: diff changeset	228 if (!is.null(opt$power)) {
66ef158fa85c Uploaded spficklin parents: diff changeset	229 print("However, you selected to override this by providing a power of:", opt$soft_threshold_power)
66ef158fa85c Uploaded spficklin parents: diff changeset	230 print(opt$soft_threshold_power)
66ef158fa85c Uploaded spficklin parents: diff changeset	231 power = opt$power
66ef158fa85c Uploaded spficklin parents: diff changeset	232 }
66ef158fa85c Uploaded spficklin parents: diff changeset	233 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	234
66ef158fa85c Uploaded spficklin parents: diff changeset	235 Now that a power has been identified, modules can be discovered. Here, the `blockwiseModule` function of WGCNA is called. The dataset is divided into blocks of genes in order to keep memory usage low. The output of that function call is shown below. The number of blocks is dependent on the block size you provided.
66ef158fa85c Uploaded spficklin parents: diff changeset	236
66ef158fa85c Uploaded spficklin parents: diff changeset	237 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	238 net = blockwiseModules(gemt, power = power, maxBlockSize = opt$block_size,
66ef158fa85c Uploaded spficklin parents: diff changeset	239 TOMType = "unsigned", minModuleSize = opt$min_cluster_size,
66ef158fa85c Uploaded spficklin parents: diff changeset	240 reassignThreshold = 0, mergeCutHeight = 0.25,
66ef158fa85c Uploaded spficklin parents: diff changeset	241 numericLabels = TRUE, pamRespectsDendro = FALSE,
66ef158fa85c Uploaded spficklin parents: diff changeset	242 verbose = 1, saveTOMs = TRUE,
66ef158fa85c Uploaded spficklin parents: diff changeset	243 saveTOMFileBase = "TOM")
66ef158fa85c Uploaded spficklin parents: diff changeset	244 blocks = sort(unique(net$blocks))
66ef158fa85c Uploaded spficklin parents: diff changeset	245 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	246 The following table shows the list of modules that were discovered and their size (i.e. number of genes).
66ef158fa85c Uploaded spficklin parents: diff changeset	247
66ef158fa85c Uploaded spficklin parents: diff changeset	248 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	249 module_labels = labels2colors(net$colors)
66ef158fa85c Uploaded spficklin parents: diff changeset	250 module_labels = paste("ME", module_labels, sep="")
66ef158fa85c Uploaded spficklin parents: diff changeset	251 module_labels2num = unique(data.frame(label = module_labels, num = net$color, row.names=NULL))
66ef158fa85c Uploaded spficklin parents: diff changeset	252 rownames(module_labels2num) = paste0('ME', module_labels2num$num)
66ef158fa85c Uploaded spficklin parents: diff changeset	253 modules = unique(as.data.frame(table(module_labels)))
66ef158fa85c Uploaded spficklin parents: diff changeset	254 n_modules = length(modules) - 1
66ef158fa85c Uploaded spficklin parents: diff changeset	255 module_size_upper = modules[2]
66ef158fa85c Uploaded spficklin parents: diff changeset	256 module_size_lower = modules[length(modules)]
66ef158fa85c Uploaded spficklin parents: diff changeset	257 colnames(modules) = c('Module', 'Module Size')
66ef158fa85c Uploaded spficklin parents: diff changeset	258 #datatable(modules)
66ef158fa85c Uploaded spficklin parents: diff changeset	259 modules
66ef158fa85c Uploaded spficklin parents: diff changeset	260 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	261
66ef158fa85c Uploaded spficklin parents: diff changeset	262 Modules consist of a set of genes that have highly similar expression patterns. Therefore, the similarity of genes within a module can be summarized using an "eigengene" vector. This vector is analgous to the first principal component in a PCA analysis. Once each module's eigengene is calculated, they can be compared and displayed in dendrogram to identify which modules are most similar to each other. This is visible in the following plot.
66ef158fa85c Uploaded spficklin parents: diff changeset	263
66ef158fa85c Uploaded spficklin parents: diff changeset	264 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	265 MEs = net$MEs
66ef158fa85c Uploaded spficklin parents: diff changeset	266 colnames(MEs) = module_labels2num[colnames(MEs),]$label
66ef158fa85c Uploaded spficklin parents: diff changeset	267
66ef158fa85c Uploaded spficklin parents: diff changeset	268 png('figures/04-module_dendrogram.png', width=6 ,height=5, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	269 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	270 dev.off()
66ef158fa85c Uploaded spficklin parents: diff changeset	271 plotEigengeneNetworks(MEs, "Module Eigengene Dendrogram", plotHeatmaps = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	272 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	273
66ef158fa85c Uploaded spficklin parents: diff changeset	274 Alternatively, we can use a heatmap to explore similarity of each module.
66ef158fa85c Uploaded spficklin parents: diff changeset	275 ```{r fig.align='center'}
66ef158fa85c Uploaded spficklin parents: diff changeset	276 plotModuleHeatmap <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	277 plotEigengeneNetworks(MEs, "Module Eigengene Heatmap",
66ef158fa85c Uploaded spficklin parents: diff changeset	278 marHeatmap = c(2, 3, 2, 2),
66ef158fa85c Uploaded spficklin parents: diff changeset	279 plotDendrograms = FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	280 }
66ef158fa85c Uploaded spficklin parents: diff changeset	281 png('figures/05-module_eigengene_heatmap.png', width=4 ,height=4, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	282 plotModuleHeatmap()
66ef158fa85c Uploaded spficklin parents: diff changeset	283 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	284 plotModuleHeatmap()
66ef158fa85c Uploaded spficklin parents: diff changeset	285 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	286
66ef158fa85c Uploaded spficklin parents: diff changeset	287 We can examine gene similarity within the context of our modules. The following dendrogram clusters genes by their similarity of expression and the modules to which each gene belongs is shown under the graph. When similar genes appear in the same module, the same colors will be visible in "blocks" under the dendrogram. The presence of blocks of color indicate that genes in modules tend to have similar expression.
66ef158fa85c Uploaded spficklin parents: diff changeset	288
66ef158fa85c Uploaded spficklin parents: diff changeset	289 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	290 # Plot the dendrogram and the module colors underneath
66ef158fa85c Uploaded spficklin parents: diff changeset	291 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	292 options(repr.plot.width=15, repr.plot.height=10)
66ef158fa85c Uploaded spficklin parents: diff changeset	293 colors = module_labels[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	294 colors = sub('ME','', colors)
66ef158fa85c Uploaded spficklin parents: diff changeset	295 plotClusterDendro <- function() {
66ef158fa85c Uploaded spficklin parents: diff changeset	296 plotDendroAndColors(net$dendrograms[[i]], colors,
66ef158fa85c Uploaded spficklin parents: diff changeset	297 "Module colors", dendroLabels = FALSE, hang = 0.03,
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	298 addGuide = TRUE, guideHang = 0.05,
0 66ef158fa85c Uploaded spficklin parents: diff changeset	299 main=paste('Cluster Dendgrogram, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	300 }
66ef158fa85c Uploaded spficklin parents: diff changeset	301 png(paste0('figures/06-cluster_dendrogram_block_', i, '.png'), width=6 ,height=4, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	302 plotClusterDendro();
66ef158fa85c Uploaded spficklin parents: diff changeset	303 invisible(dev.off())
66ef158fa85c Uploaded spficklin parents: diff changeset	304 plotClusterDendro();
66ef158fa85c Uploaded spficklin parents: diff changeset	305 }
66ef158fa85c Uploaded spficklin parents: diff changeset	306
66ef158fa85c Uploaded spficklin parents: diff changeset	307 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	308
66ef158fa85c Uploaded spficklin parents: diff changeset	309 The network is housed in a n x n similarity matrix known as the the Topological Overlap Matrix (TOM), where n is the number of genes and the value in each cell indicates the measure of similarity in terms of correlation of expression and interconnectedness. The following heat maps shows the TOM. Note, that the dendrograms in the TOM heat map may differ from what is shown above. This is because a subset of genes were selected to draw the heat maps in order to save on computational time.
66ef158fa85c Uploaded spficklin parents: diff changeset	310
66ef158fa85c Uploaded spficklin parents: diff changeset	311 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	312 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	313 # Load the TOM from a file.
66ef158fa85c Uploaded spficklin parents: diff changeset	314 load(net$TOMFiles[i])
66ef158fa85c Uploaded spficklin parents: diff changeset	315 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded spficklin parents: diff changeset	316 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	317 dissTOM = 1-TOM
66ef158fa85c Uploaded spficklin parents: diff changeset	318
66ef158fa85c Uploaded spficklin parents: diff changeset	319 # For reproducibility, we set the random seed
66ef158fa85c Uploaded spficklin parents: diff changeset	320 set.seed(10);
66ef158fa85c Uploaded spficklin parents: diff changeset	321 select = sample(dim(TOM)[1], size = 1000);
66ef158fa85c Uploaded spficklin parents: diff changeset	322 selectColors = module_labels[net$blockGenes[[i]][select]]
66ef158fa85c Uploaded spficklin parents: diff changeset	323 selectTOM = dissTOM[select, select];
66ef158fa85c Uploaded spficklin parents: diff changeset	324
66ef158fa85c Uploaded spficklin parents: diff changeset	325 # There’s no simple way of restricting a clustering tree to a subset of genes, so we must re-cluster.
66ef158fa85c Uploaded spficklin parents: diff changeset	326 selectTree = hclust(as.dist(selectTOM), method = "average")
66ef158fa85c Uploaded spficklin parents: diff changeset	327
66ef158fa85c Uploaded spficklin parents: diff changeset	328 # Taking the dissimilarity to a power, say 10, makes the plot more informative by effectively changing
66ef158fa85c Uploaded spficklin parents: diff changeset	329 # the color palette; setting the diagonal to NA also improves the clarity of the plot
66ef158fa85c Uploaded spficklin parents: diff changeset	330 plotDiss = selectTOM^7;
66ef158fa85c Uploaded spficklin parents: diff changeset	331 diag(plotDiss) = NA;
66ef158fa85c Uploaded spficklin parents: diff changeset	332 colors = sub('ME','', selectColors)
10 96ba1a8fff06 Uploaded spficklin parents: 4 diff changeset	333
0 66ef158fa85c Uploaded spficklin parents: diff changeset	334 png(paste0('figures/06-TOM_heatmap_block_', i, '.png'), width=6 ,height=6, units="in", res=300)
66ef158fa85c Uploaded spficklin parents: diff changeset	335 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	336 dev.off()
66ef158fa85c Uploaded spficklin parents: diff changeset	337 TOMplot(plotDiss, selectTree, colors, main = paste('TOM Heatmap, Block', i))
66ef158fa85c Uploaded spficklin parents: diff changeset	338 }
66ef158fa85c Uploaded spficklin parents: diff changeset	339 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	340
66ef158fa85c Uploaded spficklin parents: diff changeset	341 ```{r}
4 b14e4bf568b0 Uploaded spficklin parents: 0 diff changeset	342 module_colors = sub('ME','', module_labels)
b14e4bf568b0 Uploaded spficklin parents: 0 diff changeset	343 output = cbind(colnames(gemt), module_labels, module_colors)
b14e4bf568b0 Uploaded spficklin parents: 0 diff changeset	344 colnames(output) = c('Gene', 'Module', 'Color')
0 66ef158fa85c Uploaded spficklin parents: diff changeset	345 write.csv(output, file = opt$gene_module_file, quote=FALSE, row.names=FALSE)
66ef158fa85c Uploaded spficklin parents: diff changeset	346 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	347
66ef158fa85c Uploaded spficklin parents: diff changeset	348 A file has been generated named `gene_module_file.csv` which contains the list of genes and the modules they belong to.
66ef158fa85c Uploaded spficklin parents: diff changeset	349
66ef158fa85c Uploaded spficklin parents: diff changeset	350 The TOM serves as both a simialrity matrix and an adjacency matrix. The adjacency matrix is typically identical to a similarity matrix but with values above a set threshold set to 1 and values below set to 0. This is known as hard thresholding. However, WGCNA does not set values above a threshold to zero but leaves the values as they are, hence the word "weighted" in the WGCNA name. Additionally, it does not use a threshold at all, so no elements are set to 0. This approach is called "soft thresholding", because the pairwise weights of all genes contributed to discover of modules. The name "soft thresholding" may be a misnomer, however, because no thresholding in the traditional sense actually occurs.
66ef158fa85c Uploaded spficklin parents: diff changeset	351
66ef158fa85c Uploaded spficklin parents: diff changeset	352 Unfortunately, this "soft thresholding" approach can make creation of a graph representation of the network difficult. If we exported the TOM as a connected graph it would result in a fully complete graph and would be difficult to interpret. Therefore, we must still set a hard-threshold if we want to visualize connectivity in graph form. Setting a hard threshold, if too high can result in genes being excluded from the graph and a threshold that is too low can result in too many false edges in the graph.
66ef158fa85c Uploaded spficklin parents: diff changeset	353
66ef158fa85c Uploaded spficklin parents: diff changeset	354 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	355 edges = data.frame(fromNode= c(), toNode=c(), weight=c(), direction=c(), fromAltName=c(), toAltName=c())
66ef158fa85c Uploaded spficklin parents: diff changeset	356 for (i in blocks) {
66ef158fa85c Uploaded spficklin parents: diff changeset	357 # Load the TOM from a file.
66ef158fa85c Uploaded spficklin parents: diff changeset	358 load(net$TOMFiles[i])
66ef158fa85c Uploaded spficklin parents: diff changeset	359 TOM_size = length(which(net$blocks == i))
66ef158fa85c Uploaded spficklin parents: diff changeset	360 TOM = as.matrix(TOM, nrow=TOM_size, ncol=TOM_size)
66ef158fa85c Uploaded spficklin parents: diff changeset	361 colnames(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	362 row.names(TOM) = colnames(gemt)[net$blockGenes[[i]]]
66ef158fa85c Uploaded spficklin parents: diff changeset	363
66ef158fa85c Uploaded spficklin parents: diff changeset	364 cydata = exportNetworkToCytoscape(TOM, threshold = opt$hard_threshold)
66ef158fa85c Uploaded spficklin parents: diff changeset	365 edges = rbind(edges, cydata$edgeData)
66ef158fa85c Uploaded spficklin parents: diff changeset	366 }
66ef158fa85c Uploaded spficklin parents: diff changeset	367
66ef158fa85c Uploaded spficklin parents: diff changeset	368 edges$Interaction = 'co'
66ef158fa85c Uploaded spficklin parents: diff changeset	369 output = edges[,c('fromNode','toNode','Interaction', 'weight')]
66ef158fa85c Uploaded spficklin parents: diff changeset	370 colnames(output) = c('Source', 'Target', 'Interaction', 'Weight')
66ef158fa85c Uploaded spficklin parents: diff changeset	371 write.table(output, file = opt$network_edges_file, quote=FALSE, row.names=FALSE, sep="\t")
66ef158fa85c Uploaded spficklin parents: diff changeset	372 ```
66ef158fa85c Uploaded spficklin parents: diff changeset	373
66ef158fa85c Uploaded spficklin parents: diff changeset	374 Using the hard threshold parameter provided, a file has been generated named `network_edges.txt` which contains the list of edges. You can import this file into [Cytoscape](https://cytoscape.org/) for visualization. If you would like a larger graph, you must re-run the tool with a smaller threshold.
66ef158fa85c Uploaded spficklin parents: diff changeset	375
66ef158fa85c Uploaded spficklin parents: diff changeset	376 ```{r}
66ef158fa85c Uploaded spficklin parents: diff changeset	377 # Save this image for the next step which is optional if theuser
66ef158fa85c Uploaded spficklin parents: diff changeset	378 # provides a trait file.
66ef158fa85c Uploaded spficklin parents: diff changeset	379 save.image(file=opt$r_data)
66ef158fa85c Uploaded spficklin parents: diff changeset	380 ```

Mercurial > repos > spficklin > aurora_wgcna

annotate aurora_wgcna.Rmd @ 17:8d2a768304e3 draft default tip