# HG changeset patch
# User pravs
# Date 1529224092 14400
# Node ID 796a42e10f77c7061e3439bf1fe16ba3a406f471
# Parent fc89f8c3b777c252b63f6d0bd4f452eb228ac1ab
planemo upload
diff -r fc89f8c3b777 -r 796a42e10f77 test_data/PE_abundance_GE_abundance_pearson.html
--- a/test_data/PE_abundance_GE_abundance_pearson.html Sun Jun 17 04:20:06 2018 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,56 +0,0 @@
-
-Association between proteomics and transcriptomics data
- Input data summary
- Abbreviations used: PE (Proteomics) and GE (Transcriptomics)
- Input PE data dimension (Row Column): 3597 58
- Input GE data dimension (Row Column): 191650 14
- Protein ID fetched from column: 7
- Transcript ID fetched from column: 1
- Protein ID type: ensembl_peptide_id_version
- Transcript ID type: ensembl_transcript_id_version
- Protein expression data fetched from column: 13
- Transcript expression data fetched from column: 10
- Total Protein ID mapped: 3582
- Total Protein ID unmapped: 15
- Total Transcript ID mapped: 3582
- Total Transcript ID unmapped: 188068
Download mapped unmapped data
- Protein mapped data: Link
- Protein unmapped data: Link
- Transcript mapped data: Link
- Transcript unmapped data: Link
- Protein abundance data: Link
- Transcript abundance data: Link
- Number of entries in Transcriptome data used for correlation: 3582
- Number of entries in Proteome data used for correlation: 3582
Filtering
Checking for NA, Inf, or -Inf in either the Transcriptome or Proteome data; any entries found are removed
- Number of NA found: 88
- Number of Inf or -Inf found: 559
- Protein excluded data with NA or Inf or -Inf: Link
- Transcript excluded data with NA or Inf or -Inf: Link
Filtered data summary
Excluding entries with abundance values: NA/Inf/-Inf
- Number of entries in Transcriptome data remained: 2949
- Number of entries in Proteome data remained: 2949
Proteome data summary
- Parameter | Value |
---|
| Min. :-2.98277 |
- | 1st Qu.:-0.40393 |
- | Median :-0.07986 |
- | Mean : 0.00000 |
- | 3rd Qu.: 0.26061 |
- | Max. :15.13211 |
-
-Transcriptome data summary
- Parameter | Value |
---|
| Min. :-8.33003 |
- | 1st Qu.:-0.06755 |
- | Median : 0.09635 |
- | Mean : 0.00000 |
- | 3rd Qu.: 0.18103 |
- | Max. : 8.50430 |
-
-Distribution of Proteome and Transcriptome abundance (Box plot and Density plot)
- Scatter plot between Proteome and Transcriptome Abundance
- Correlation with all data
- Parameter | Method 1 | Method 2 | Method 3 |
---|
Correlation method used | Pearson's product-moment correlation | Spearman's rank correlation rho | Kendall's rank correlation tau |
Correlation | -0.003584536 | 0.01866248 | 0.01280742 |
Pvalue | 0.8457255 | 0.3110035 | 0.314683 |
-*Note that correlation is sensitive to outliers in the data, so it is important to analyze outliers/influential observations.
Below we use a Cook's distance based approach to identify such influential observations.
Linear Regression model fit between Proteome and Transcriptome data
- Assuming a linear relationship between Proteome and Transcriptome data, we here fit a linear regression model.
- Parameter | Value |
---|
Formula | PE_abundance~GE_abundance |
- Coefficients |
- (Intercept) | 1.727289e-16 (Pvalue: 1 ) |
- GE_abundance | -0.003584536 (Pvalue: 0.8457255 ) |
- Model parameters |
- Residual standard error | 1.000163 ( 2947 degrees of freedom) |
- F-statistic | 0.0378662 ( on 1 and 2947 degrees of freedom) |
- R-squared | 1.28489e-05 |
- Adjusted R-squared | -0.0003264749 |
-
-Plotting various regression diagnostics plots
-Residuals vs Fitted plot
-
This plot checks for linear relationship assumptions. If a horizontal line is observed without any distinct patterns, it indicates a linear relationship
Normal Q-Q plot of residuals
-
This plot checks whether the residuals are normally distributed. The fit is good if the residual points follow the straight dashed line, i.e., do not deviate much from it.
Scale-Location (or Spread-Location) plot
-
This plot checks for homogeneity of residual variance (homoscedasticity). A horizontal line observed with equally spread residual points is a good indication of homoscedasticity.
Residuals vs Leverage plot
-
This plot is useful to identify any influential cases, that is outliers or extreme values that might influence the regression results upon inclusion or exclusion from the analysis.
Identify influential observations
-Cook’s distance measures the influence of each data point/observation on the predicted outcome, i.e., how much the observation influences the fitted values.
In general use, observations with a Cook’s distance greater than 4 times the mean may be classified as influential.
In the above plot, observations above the red line (4 * mean Cook's distance) are influential and are marked with *. Genes that are outliers could be important. These observations influence the correlation values and regression coefficients.
Parameter | Value |
---|
Mean cook's distance | 0.0002988385 |
- Total influential observations (cook's distance > 4 * mean cook's distance) | 90 |
- Total influential observations (cook's distance > 3 * mean cook's distance) | 116 |
-
Top 10 influential observations (cook's distance > 4 * mean cook's distance)
Download entire list PE_ID | PE_abundance | GE_ID | GE_abundance | cooksd |
---|
ENSMUSP00000107109.2 | -0.4719799 | ENSMUST00000001126 | -5.301664 | 0.001213545 |
ENSMUSP00000151536.1 | 3.113811 | ENSMUST00000001256 | -0.6348804 | 0.00230483 |
ENSMUSP00000150261.1 | 2.914045 | ENSMUST00000001583 | 0.4988006 | 0.001801232 |
ENSMUSP00000111204.1 | 2.850989 | ENSMUST00000002073 | 0.09635024 | 0.001391751 |
ENSMUSP00000089336.4 | 1.219945 | ENSMUST00000002391 | -2.47573 | 0.001781417 |
ENSMUSP00000030805.7 | -0.8313093 | ENSMUST00000003469 | 3.660597 | 0.001650483 |
ENSMUSP00000011492.8 | -0.3735374 | ENSMUST00000004326 | -7.366491 | 0.001556623 |
ENSMUSP00000029658.7 | 9.120211 | ENSMUST00000004473 | 0.09635024 | 0.01423993 |
ENSMUSP00000099904.4 | -1.913743 | ENSMUST00000004673 | 0.9756628 | 0.001209039 |
ENSMUSP00000081956.8 | 3.674308 | ENSMUST00000005607 | 1.306612 | 0.006223403 |
Scatter plot between Proteome and Transcriptome Abundance, after removal of outliers/influential observations
- Correlation with removal of outliers / influential observations
- We removed the influential observations and reestimated the correlation values.
Parameter | Method 1 | Method 2 | Method 3 |
---|
Correlation method used | Pearson's product-moment correlation | Spearman's rank correlation rho | Kendall's rank correlation tau |
Correlation | 0.01485058 | 0.0246989 | 0.01689519 |
Pvalue | 0.4273403 | 0.1867467 | 0.1918906 |
-Heatmap of PE and GE abundance values
-K-means clustering
-Number of Clusters: 5
Download cluster list
Other regression model fitting
-
- - MAE: mean absolute error
- - MSE: mean squared error
- - RMSE: root mean squared error ( sqrt(MSE) )
- - MAPE: mean absolute percentage error
-
- Model | MAE | MSE | RMSE | MAPE | Diagnostics Plot |
---|
Linear regression with all data | 0.5463329 | 0.9996481 | 0.999824 | 0.9996321 | Link |
Linear regression with removal of outliers | 0.5404805 | 1.006281 | 1.003136 | 1.455637 | Link |
Resistant regression (lqs / least trimmed squares method) | 0.5407598 | 1.007932 | 1.003958 | 1.537172 | Link |
Robust regression (rlm / Huber M-estimator method) | 0.5404879 | 1.005054 | 1.002524 | 1.411806 | Link |
Polynomial regression with degree 2 | 0.546322 | 0.9996472 | 0.9998236 | 0.9993865 | Link |
Polynomial regression with degree 3 | 0.5469588 | 0.9976384 | 0.9988185 | 1.043158 | Link |
Polynomial regression with degree 4 | 0.5467885 | 0.9975077 | 0.9987531 | 1.041541 | Link |
Polynomial regression with degree 5 | 0.5467813 | 0.9975076 | 0.998753 | 1.041209 | Link |
Polynomial regression with degree 6 | 0.5465911 | 0.996652 | 0.9983246 | 1.056632 | Link |
Generalized additive models | 0.5463695 | 0.9976796 | 0.9988391 | 1.032766 | Link |
\ No newline at end of file