Mercurial > repos > pravs > protein_rna_correlation
changeset 1:796a42e10f77 draft
planemo upload
author | pravs |
---|---|
date | Sun, 17 Jun 2018 04:28:12 -0400 |
parents | fc89f8c3b777 |
children | 412403eec79c |
files | test_data/PE_abundance_GE_abundance_pearson.html |
diffstat | 1 files changed, 0 insertions(+), 56 deletions(-) |
--- a/test_data/PE_abundance_GE_abundance_pearson.html Sun Jun 17 04:20:06 2018 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,56 +0,0 @@ -<html><body> -<h1>Association between proteomics and transcriptomics data</h1> - <font color='blue'><h3>Input data summary</h3></font> <ul> <li>Abbrebiations used: PE (Proteomics) and GE (Transcriptomics) </li> <li>Input PE data dimension (Row Column): 3597 58 </li> <li>Input GE data dimension (Row Column): 191650 14 </li> <li>Protein ID fetched from column: 7 </li> <li>Transcript ID fetched from column: 1 </li> <li>Protein ID type: ensembl_peptide_id_version </li> <li>Transcript ID type: ensembl_transcript_id_version </li> <li>Protein expression data fetched from column: 13 </li> <li>Transcript expression data fetched from column: 10 </li><li>Total Protein ID mapped: 3582 </li> <li>Total Protein ID unmapped: 15 </li> <li>Total Transcript ID mapped: 3582 </li> <li>Total Transcript ID unmapped: 188068 </li></ul><font color='blue'><h3>Download mapped unmapped data</h3></font> <ul><li>Protein mapped data: <a href=" output_fold/PE_mapped.tsv " target="_blank"> Link</a> </li> <li>Protein unmapped data: <a href=" output_fold/PE_unmapped.tsv " target="_blank"> Link</a> </li> <li>Transcript mapped data: <a href=" output_fold/GE_mapped.tsv " target="_blank"> Link</a> </li> <li>Transcript unmapped data: <a href=" output_fold/GE_unmapped.tsv " target="_blank"> Link</a> </li><li>Protein abundance data: <a href=" output_fold/PE_abundance.tsv " target="_blank"> Link</a> </li> <li>Transcript abundance data: <a href=" output_fold/GE_abundance.tsv " target="_blank"> Link</a> </li></ul><ul> <li>Number of entries in Transcriptome data used for correlation: 3582 </li> <li>Number of entries in Proteome data used for correlation: 3582 </li></ul><font color='blue'><h3>Filtering</h3></font> Checking for NA or Inf or -Inf in either Transcriptome or Proteome data, if found, remove those entry<br> <ul> <li>Number of NA found: 88 </li> <li>Number 
of Inf or -Inf found: 559 </li></ul><ul><li>Protein excluded data with NA or Inf or -Inf: <a href=" output_fold/PE_excluded_NA_Inf.tsv " target="_blank"> Link</a> </li> <li>Transcript excluded data with NA or Inf or -Inf: <a href=" output_fold/GE_excluded_NA_Inf.tsv " target="_blank"> Link</a> </li></ul><font color='blue'><h3>Filtered data summary</h3></font> Excluding entires with abundance values: NA/Inf/-Inf<br> <ul> <li>Number of entries in Transcriptome data remained: 2949 </li> <li>Number of entries in Proteome data remained: 2949 </li></ul><font color='blue'><h3>Proteome data summary</h3></font> - <table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>Parameter</th><th>Value</th></tr><tr><td> </td><td> Min. :-2.98277 </td></tr> -<tr><td> </td><td> 1st Qu.:-0.40393 </td></tr> -<tr><td> </td><td> Median :-0.07986 </td></tr> -<tr><td> </td><td> Mean : 0.00000 </td></tr> -<tr><td> </td><td> 3rd Qu.: 0.26061 </td></tr> -<tr><td> </td><td> Max. :15.13211 </td></tr> -</table> -<font color='blue'><h3>Transcriptome data summary</h3></font> - <table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>Parameter</th><th>Value</th></tr><tr><td> </td><td> Min. :-8.33003 </td></tr> -<tr><td> </td><td> 1st Qu.:-0.06755 </td></tr> -<tr><td> </td><td> Median : 0.09635 </td></tr> -<tr><td> </td><td> Mean : 0.00000 </td></tr> -<tr><td> </td><td> 3rd Qu.: 0.18103 </td></tr> -<tr><td> </td><td> Max. 
: 8.50430 </td></tr> -</table> -<font color='blue'><h3>Distribution of Proteome and Transcripome abundance (Box plot and Density plot)</h3></font> - <img src="AbundancePlot.png"><font color='blue'><h3>Scatter plot between Proteome and Transcriptome Abundance</h3></font> - <img src="AbundancePlot_scatter.png"><font color='blue'><h3>Correlation with all data</h3></font> - <table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>Parameter</th><th>Method 1</th><th>Method 2</th><th>Method 3</th></tr><tr><td>Correlation method used</td><td> Pearson's product-moment correlation </td><td> Spearman's rank correlation rho </td><td> Kendall's rank correlation tau </td></tr> <tr><td>Correlation</td><td> -0.003584536 </td><td> 0.01866248 </td><td> 0.01280742 </td></tr> <tr><td>Pvalue</td><td> 0.8457255 </td><td> 0.3110035 </td><td> 0.314683 </td></tr></table> -<font color="red">*Note that <u>correlation</u> is <u>sensitive to outliers</u> in the data. 
So it is important to analyze outliers/influential observations in the data.<br> Below we use <u>cook's distance based approach</u> to identify such influential observations.</font><font color='blue'><h3>Linear Regression model fit between Proteome and Transcriptome data</h3></font> - <p>Assuming a linear relationship between Proteome and Transcriptome data, we here fit a linear regression model.</p> - <table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>Parameter</th><th>Value</th></tr><tr><td>Formula</td><td> PE_abundance~GE_abundance </td></tr> - <tr><td colspan='2' align='center'> <b>Coefficients</b></td> </tr> - <tr><td> (Intercept) </td><td> 1.727289e-16 (Pvalue: 1 ) </td></tr> - <tr><td> GE_abundance </td><td> -0.003584536 (Pvalue: 0.8457255 ) </td></tr> - <tr><td colspan='2' align='center'> <b>Model parameters</b></td> </tr> - <tr><td>Residual standard error</td><td> 1.000163 ( 2947 degree of freedom)</td></tr> - <tr><td>F-statistic</td><td> 0.0378662 ( on 1 and 2947 degree of freedom)</td></tr> - <tr><td>R-squared</td><td> 1.28489e-05 </td></tr> - <tr><td>Adjusted R-squared</td><td> -0.0003264749 </td></tr> -</table> -<font color='blue'><h3>Plotting various regression diagnostics plots</h3></font> -<u><font color='brown'><h4>Residuals vs Fitted plot</h4></font></u> - <img src="PE_GE_lm_1.png"> <br><br>This plot checks for linear relationship assumptions. If a horizontal line is observed without any distinct patterns, it indicates a linear relationship<br><u><font color='brown'><h4>Normal Q-Q plot of residuals</h4></font></u> - <img src="PE_GE_lm_2.png"> <br><br>This plot checks whether residuals are normally distributed or not. 
It is good if the residuals points follow the straight dashed line i.e., do not deviate much from dashed line.<br><u><font color='brown'><h4>Scale-Location (or Spread-Location) plot</h4></font></u> - <img src="PE_GE_lm_3.png"> <br><br>This plot checks for homogeneity of residual variance (homoscedasticity). A horizontal line observed with equally spread residual points is a good indication of homoscedasticity.<br><u><font color='brown'><h4>Residuals vs Leverage plot</h4></font></u> - <img src="PE_GE_lm_3.png"> <br><br>This plot is useful to identify any influential cases, that is outliers or extreme values that might influence the regression results upon inclusion or exclusion from the analysis.<br><font color='blue'><h3>Identify influential observations</h3></font> -<p><b>Cook’s distance</b> computes the influence of each data point/observation on the predicted outcome. i.e. this measures how much the observation is influencing the fitted values.<br>In general use, those observations that have a <b>cook’s distance > than 4 times the mean</b> may be classified as <b>influential.</b></p><img src="PE_GE_lm_cooksd.png"> <br>In the above plot, observations above red line (4*mean cook's distance) are influential, marked in <b>*</b>. Genes that are outliers could be important. 
These observations influences the correlation values and regression coefficients<br><br><table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>Parameter</th><th>Value</th></tr><tr><td>Mean cook's distance</td><td> 0.0002988385 </td></tr> - <tr><td>Total influential observations (cook's distance > 4 * mean cook's distance)</td><td> 90 </td></tr> - <tr><td>Total influential observations (cook's distance > 3 * mean cook's distance)</td><td> 116 </td></tr> - </table> <font color="brown"><h4>Top 10 influential observations (cook's distance > 4 * mean cook's distance)</h4></font><a href=" ./output_fold/PE_GE_influential_observation.tsv " target="_blank">Download entire list</a><table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>PE_ID</th><th>PE_abundance</th><th>GE_ID</th><th>GE_abundance</th><th>cooksd</th></tr><tr> <td> ENSMUSP00000107109.2 </td> <td> -0.4719799 </td> <td> ENSMUST00000001126 </td> <td> -5.301664 </td> <td> 0.001213545 </td></tr><tr> <td> ENSMUSP00000151536.1 </td> <td> 3.113811 </td> <td> ENSMUST00000001256 </td> <td> -0.6348804 </td> <td> 0.00230483 </td></tr><tr> <td> ENSMUSP00000150261.1 </td> <td> 2.914045 </td> <td> ENSMUST00000001583 </td> <td> 0.4988006 </td> <td> 0.001801232 </td></tr><tr> <td> ENSMUSP00000111204.1 </td> <td> 2.850989 </td> <td> ENSMUST00000002073 </td> <td> 0.09635024 </td> <td> 0.001391751 </td></tr><tr> <td> ENSMUSP00000089336.4 </td> <td> 1.219945 </td> <td> ENSMUST00000002391 </td> <td> -2.47573 </td> <td> 0.001781417 </td></tr><tr> <td> ENSMUSP00000030805.7 </td> <td> -0.8313093 </td> <td> ENSMUST00000003469 </td> <td> 3.660597 </td> <td> 0.001650483 </td></tr><tr> <td> ENSMUSP00000011492.8 </td> <td> -0.3735374 </td> <td> ENSMUST00000004326 </td> <td> -7.366491 </td> <td> 0.001556623 </td></tr><tr> <td> ENSMUSP00000029658.7 </td> <td> 9.120211 </td> <td> ENSMUST00000004473 </td> <td> 
0.09635024 </td> <td> 0.01423993 </td></tr><tr> <td> ENSMUSP00000099904.4 </td> <td> -1.913743 </td> <td> ENSMUST00000004673 </td> <td> 0.9756628 </td> <td> 0.001209039 </td></tr><tr> <td> ENSMUSP00000081956.8 </td> <td> 3.674308 </td> <td> ENSMUST00000005607 </td> <td> 1.306612 </td> <td> 0.006223403 </td></tr></table><font color='blue'><h3>Scatter plot between Proteome and Transcriptome Abundance, after removal of outliers/influential observations</h3></font> - <img src="AbundancePlot_scatter_without_outliers.png"><font color='blue'><h3>Correlation with removal of outliers / influential observations</h3></font> - <p>We removed the influential observations and reestimated the correlation values.</p><table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> <tr bgcolor="#c3f0d6"><th>Parameter</th><th>Method 1</th><th>Method 2</th><th>Method 3</th></tr><tr><td>Correlation method used</td><td> Pearson's product-moment correlation </td><td> Spearman's rank correlation rho </td><td> Kendall's rank correlation tau </td></tr> <tr><td>Correlation</td><td> 0.01485058 </td><td> 0.0246989 </td><td> 0.01689519 </td></tr> <tr><td>Pvalue</td><td> 0.4273403 </td><td> 0.1867467 </td><td> 0.1918906 </td></tr></table> -<font color='blue'><h3>Heatmap of PE and GE abundance values</h3></font> -<img src="PE_GE_heatmap.png"><font color='blue'><h3>Kmean clustering</h3></font> -Number of Clusters: 5<br><a href=" output_fold/PE_GE_kmeans_clusterpoints.txt " target="_blank">Download cluster list</a><br><img src="PE_GE_kmeans.png"><font color='blue'><h3>Other regression model fitting</h3></font> -<ul> - <li>MAE:mean absolute error</li> - <li>MSE: mean squared error</li> - <li>RMSE:root mean squared error ( sqrt(MSE) )</li> - <li>MAPE:mean absolute percentage error</li> - </ul> - <h4><a href="PE_GE_modelfit.pdf" target="_blank">Comparison of model fits</a></h4><table class="embedded-table" border=1 cellspacing=0 cellpadding=5 style="table-layout:auto; "> 
<tr bgcolor="#c3f0d6"><th>Model</th><th>MAE</th><th>MSE</th><th>RMSE</th><th>MAPE</th><th>Diagnostics Plot</th></tr><tr><td>Linear regression with all data</td><td> 0.5463329 </td><td> 0.9996481 </td><td> 0.999824 </td><td> 0.9996321 </td><td> <a href="PE_GE_lm.pdf" target="_blank">Link</a> </td></tr> <tr><td>Linear regression with removal of outliers</td><td> 0.5404805 </td><td> 1.006281 </td><td> 1.003136 </td><td> 1.455637 </td><td> <a href="PE_GE_lm_without_outliers.pdf" target="_blank">Link</a> </td></tr> <tr><td>Resistant regression (lqs / least trimmed squares method)</td><td> 0.5407598 </td><td> 1.007932 </td><td> 1.003958 </td><td> 1.537172 </td><td> <a href="PE_GE_lqs.pdf" target="_blank">Link</a> </td></tr> <tr><td>Robust regression (rlm / Huber M-estimator method)</td><td> 0.5404879 </td><td> 1.005054 </td><td> 1.002524 </td><td> 1.411806 </td><td> <a href="PE_GE_rlm.pdf" target="_blank">Link</a> </td></tr> <tr><td>Polynomial regression with degree 2</td><td> 0.546322 </td><td> 0.9996472 </td><td> 0.9998236 </td><td> 0.9993865 </td><td> <a href="PE_GE_poly2.pdf" target="_blank">Link</a> </td></tr> <tr><td>Polynomial regression with degree 3</td><td> 0.5469588 </td><td> 0.9976384 </td><td> 0.9988185 </td><td> 1.043158 </td><td> <a href="PE_GE_poly3.pdf" target="_blank">Link</a> </td></tr> <tr><td>Polynomial regression with degree 4</td><td> 0.5467885 </td><td> 0.9975077 </td><td> 0.9987531 </td><td> 1.041541 </td><td> <a href="PE_GE_poly4.pdf" target="_blank">Link</a> </td></tr> <tr><td>Polynomial regression with degree 5</td><td> 0.5467813 </td><td> 0.9975076 </td><td> 0.998753 </td><td> 1.041209 </td><td> <a href="PE_GE_poly5.pdf" target="_blank">Link</a> </td></tr> <tr><td>Polynomial regression with degree 6</td><td> 0.5465911 </td><td> 0.996652 </td><td> 0.9983246 </td><td> 1.056632 </td><td> <a href="PE_GE_poly6.pdf" target="_blank">Link</a> </td></tr> <tr><td>Generalized additive models</td><td> 0.5463695 </td><td> 0.9976796 </td><td> 0.9988391 
</td><td> 1.032766 </td><td> <a href="PE_GE_gam.pdf" target="_blank">Link</a> </td></tr> </table> \ No newline at end of file
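
Note on the influential-observation rule in the removed report: it flags observations whose Cook's distance exceeds 4 times the mean Cook's distance of the linear fit PE_abundance~GE_abundance. The tool's own implementation is not part of this changeset, so the following is only a minimal Python sketch of that rule for a simple linear regression, not the tool's code. The function names (`cooks_distances`, `influential_indices`) and the toy data are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def cooks_distances(x, y):
    """Cook's distance of every point for the least-squares fit y ~ x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    distances = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of this point
        distances.append((e ** 2 / (p * s2)) * h / (1.0 - h) ** 2)
    return distances


def influential_indices(x, y, multiplier=4.0):
    """Indices whose Cook's distance exceeds multiplier * mean distance."""
    d = cooks_distances(x, y)
    cutoff = multiplier * mean(d)
    return [i for i, di in enumerate(d) if di > cutoff]


# Hypothetical toy data (not taken from the report): the last point is an outlier.
ge = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pe = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 20.0]
print(influential_indices(ge, pe))  # prints [7]
```

Passing `multiplier=3.0` corresponds to the looser "3 * mean cook's distance" count that the report also lists.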