# HG changeset patch
# User pravs
# Date 1529224092 14400
# Node ID 796a42e10f77c7061e3439bf1fe16ba3a406f471
# Parent fc89f8c3b777c252b63f6d0bd4f452eb228ac1ab
planemo upload

diff -r fc89f8c3b777 -r 796a42e10f77 test_data/PE_abundance_GE_abundance_pearson.html
--- a/test_data/PE_abundance_GE_abundance_pearson.html Sun Jun 17 04:20:06 2018 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,56 +0,0 @@

Association between proteomics and transcriptomics data

Input data summary

Abbreviations used: PE (Proteomics) and GE (Transcriptomics)
Input PE data dimension (Row Column): 3597 58
Input GE data dimension (Row Column): 191650 14
Protein ID fetched from column: 7
Transcript ID fetched from column: 1
Protein ID type: ensembl_peptide_id_version
Transcript ID type: ensembl_transcript_id_version
Protein expression data fetched from column: 13
Transcript expression data fetched from column: 10
Total Protein ID mapped: 3582
Total Protein ID unmapped: 15
Total Transcript ID mapped: 3582
Total Transcript ID unmapped: 188068

Download mapped and unmapped data

Protein mapped data: Link
Protein unmapped data: Link
Transcript mapped data: Link
Transcript unmapped data: Link
Protein abundance data: Link
Transcript abundance data: Link

Number of entries in Transcriptome data used for correlation: 3582
Number of entries in Proteome data used for correlation: 3582

Filtering

Checking for NA, Inf, or -Inf in either the Transcriptome or Proteome data; any entry containing one of these values is removed.

Number of NA found: 88
Number of Inf or -Inf found: 559

Protein excluded data with NA or Inf or -Inf: Link
Transcript excluded data with NA or Inf or -Inf: Link
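
The filtering step described above can be sketched as follows. This is an illustrative Python sketch, not the code that produced this report; the function name `drop_non_finite` and the use of `None` to stand in for NA are assumptions. A pair is dropped if either its proteome or its transcriptome value is missing or non-finite, so both datasets keep the same number of entries.

```python
import math

def drop_non_finite(pe, ge):
    """Keep only index-aligned pairs where both values are finite numbers."""
    kept_pe, kept_ge = [], []
    for p, g in zip(pe, ge):
        # None stands in for NA here (an assumption of this sketch).
        if p is None or g is None:
            continue
        # math.isfinite is False for NaN, Inf and -Inf.
        if math.isfinite(p) and math.isfinite(g):
            kept_pe.append(p)
            kept_ge.append(g)
    return kept_pe, kept_ge

# Toy data: three of the five pairs contain NA, -Inf or Inf.
pe = [0.5, None, 1.2, float("inf"), -0.3]
ge = [0.1, 0.4, float("-inf"), 0.2, 0.9]
pe_clean, ge_clean = drop_non_finite(pe, ge)
print(len(pe_clean), len(ge_clean))
```

Filtering pairs rather than each dataset separately is what keeps the two "remained" counts equal.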

Filtered data summary

Excluding entries with abundance values: NA/Inf/-Inf

Number of entries in Transcriptome data remained: 2949
Number of entries in Proteome data remained: 2949

Proteome data summary

Parameter	Value
Min.	-2.98277
1st Qu.	-0.40393
Median	-0.07986
Mean	0.00000
3rd Qu.	0.26061
Max.	15.13211

Transcriptome data summary

Parameter	Value
Min.	-8.33003
1st Qu.	-0.06755
Median	0.09635
Mean	0.00000
3rd Qu.	0.18103
Max.	8.50430

Distribution of Proteome and Transcriptome abundance (Box plot and Density plot)

Scatter plot between Proteome and Transcriptome Abundance

Correlation with all data

Parameter	Method 1	Method 2	Method 3
Correlation method used	Pearson's product-moment correlation	Spearman's rank correlation rho	Kendall's rank correlation tau
Correlation	-0.003584536	0.01866248	0.01280742
Pvalue	0.8457255	0.3110035	0.314683
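
For readers unfamiliar with the three coefficients in the table above, here is a minimal pure-Python sketch of how each is defined. It is not the code behind this report: the function names are hypothetical, p-values are not computed, and the Kendall variant shown is tau-b, which is an assumption about how ties are handled.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson's product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant and discordant pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from the denominator
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom

# Toy data, not taken from the report.
ge = [1, 2, 3, 4, 5]
pe = [2, 1, 4, 3, 5]
print(pearson(ge, pe), spearman(ge, pe), kendall_tau_b(ge, pe))
```

Spearman and Kendall depend only on the ordering of the values, which is why they are less sensitive to extreme abundance values than Pearson.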

*Note that correlation is sensitive to outliers in the data, so it is important to analyze outliers and influential observations.
Below we use a Cook's distance based approach to identify such influential observations.

Linear Regression model fit between Proteome and Transcriptome data

Assuming a linear relationship between Proteome and Transcriptome data, we here fit a linear regression model.

Parameter	Value
Formula	PE_abundance~GE_abundance
Coefficients
(Intercept)	1.727289e-16 (Pvalue: 1)
GE_abundance	-0.003584536 (Pvalue: 0.8457255)
Model parameters
Residual standard error	1.000163 (2947 degrees of freedom)
F-statistic	0.0378662 (on 1 and 2947 degrees of freedom)
R-squared	1.28489e-05
Adjusted R-squared	-0.0003264749
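
In the table above, the GE_abundance slope equals the Pearson correlation, and R-squared equals its square. That is the expected result when both variables are standardized to zero mean and unit variance. The Python sketch below illustrates this on toy data. It assumes z-scored inputs (inferred from the zero means in the data summaries, not stated in the report), and the helper names `zscore` and `fit_line` are hypothetical.

```python
import math

def zscore(values):
    """Standardize to mean 0 and sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def fit_line(x, y):
    """Ordinary least squares for y ~ x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Toy data, not taken from the report; its Pearson correlation is 0.8.
ge = zscore([1, 2, 3, 4, 5])
pe = zscore([2, 1, 4, 3, 5])
slope, intercept, r_squared = fit_line(ge, pe)
print(slope, intercept, r_squared)
```

On standardized data the intercept is zero up to floating-point error, consistent with the reported intercept of about 1.7e-16.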

Plotting various regression diagnostics plots

Residuals vs Fitted plot

This plot checks the linearity assumption. If the residuals scatter around a horizontal line without any distinct pattern, a linear relationship is a reasonable assumption.

Normal Q-Q plot of residuals

This plot checks whether the residuals are normally distributed. It is a good sign if the residual points follow the straight dashed line, i.e., do not deviate much from it.

Scale-Location (or Spread-Location) plot

This plot checks for homogeneity of residual variance (homoscedasticity). A horizontal line observed with equally spread residual points is a good indication of homoscedasticity.

Residuals vs Leverage plot

This plot is useful for identifying influential cases, that is, outliers or extreme values that might change the regression results depending on whether they are included in or excluded from the analysis.

Identify influential observations

Cook’s distance measures the influence of each observation on the predicted outcome, i.e., how much the fitted values change when that observation is left out.
As a common rule of thumb, observations with a Cook’s distance greater than 4 times the mean Cook’s distance may be classified as influential.

In the above plot, observations above the red line (4 * mean Cook's distance) are influential and are marked with *. Genes that are outliers could be biologically important. These observations influence the correlation values and regression coefficients.

Parameter	Value
Mean Cook's distance	0.0002988385
Total influential observations (Cook's distance > 4 * mean Cook's distance)	90
Total influential observations (Cook's distance > 3 * mean Cook's distance)	116
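
As a rough illustration of the thresholding rule above, the Python sketch below computes Cook's distance for a simple linear regression using the standard closed form D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2, with p = 2 coefficients, and flags observations above a multiple of the mean. It is a sketch under stated assumptions, not the code that produced this report; the function names and the toy data are hypothetical.

```python
def cooks_distance(x, y):
    """Cook's distance of each observation for the fit y ~ x."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    return [e * e / (p * s2) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverage)]

def influential(distances, multiplier=4):
    """Indices whose distance exceeds multiplier * mean distance."""
    cutoff = multiplier * sum(distances) / len(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]

# Toy data: points on a line, except the last one, which is far off it.
x = list(range(10))
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 30]
distances = cooks_distance(x, y)
flagged = influential(distances)
print(flagged)
```

Passing `multiplier=3` gives the looser cutoff that the table also reports.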

Top 10 influential observations (Cook's distance > 4 * mean Cook's distance)

Download entire list

PE_ID	PE_abundance	GE_ID	GE_abundance	cooksd
ENSMUSP00000107109.2	-0.4719799	ENSMUST00000001126	-5.301664	0.001213545
ENSMUSP00000151536.1	3.113811	ENSMUST00000001256	-0.6348804	0.00230483
ENSMUSP00000150261.1	2.914045	ENSMUST00000001583	0.4988006	0.001801232
ENSMUSP00000111204.1	2.850989	ENSMUST00000002073	0.09635024	0.001391751
ENSMUSP00000089336.4	1.219945	ENSMUST00000002391	-2.47573	0.001781417
ENSMUSP00000030805.7	-0.8313093	ENSMUST00000003469	3.660597	0.001650483
ENSMUSP00000011492.8	-0.3735374	ENSMUST00000004326	-7.366491	0.001556623
ENSMUSP00000029658.7	9.120211	ENSMUST00000004473	0.09635024	0.01423993
ENSMUSP00000099904.4	-1.913743	ENSMUST00000004673	0.9756628	0.001209039
ENSMUSP00000081956.8	3.674308	ENSMUST00000005607	1.306612	0.006223403

Scatter plot between Proteome and Transcriptome Abundance, after removal of outliers/influential observations

Correlation with removal of outliers / influential observations

We removed the influential observations and re-estimated the correlation values.

Parameter	Method 1	Method 2	Method 3
Correlation method used	Pearson's product-moment correlation	Spearman's rank correlation rho	Kendall's rank correlation tau
Correlation	0.01485058	0.0246989	0.01689519
Pvalue	0.4273403	0.1867467	0.1918906

Heatmap of PE and GE abundance values

K-means clustering

Number of Clusters: 5
Download cluster list

Other regression model fitting

MAE: mean absolute error
MSE: mean squared error
RMSE: root mean squared error (sqrt(MSE))
MAPE: mean absolute percentage error
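
The four metrics defined above can be written out as in the Python sketch below. It is illustrative only: the function name `regression_errors` is hypothetical, and MAPE is expressed as a fraction rather than multiplied by 100, which is an assumption about the convention used in this report. MAPE is undefined when an actual value is zero.

```python
import math

def regression_errors(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # Fractional form; raises ZeroDivisionError if any actual value is 0.
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Toy data, not taken from the report.
metrics = regression_errors([2, 4, 5], [1, 5, 5])
print(metrics)
```

Because MAPE divides by the actual value, it can become large when abundances are close to zero, even if MAE and RMSE barely change.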

Comparison of model fits

Model	MAE	MSE	RMSE	MAPE	Diagnostics Plot
Linear regression with all data	0.5463329	0.9996481	0.999824	0.9996321	Link
Linear regression with removal of outliers	0.5404805	1.006281	1.003136	1.455637	Link
Resistant regression (lqs / least trimmed squares method)	0.5407598	1.007932	1.003958	1.537172	Link
Robust regression (rlm / Huber M-estimator method)	0.5404879	1.005054	1.002524	1.411806	Link
Polynomial regression with degree 2	0.546322	0.9996472	0.9998236	0.9993865	Link
Polynomial regression with degree 3	0.5469588	0.9976384	0.9988185	1.043158	Link
Polynomial regression with degree 4	0.5467885	0.9975077	0.9987531	1.041541	Link
Polynomial regression with degree 5	0.5467813	0.9975076	0.998753	1.041209	Link
Polynomial regression with degree 6	0.5465911	0.996652	0.9983246	1.056632	Link
Generalized additive models	0.5463695	0.9976796	0.9988391	1.032766	Link

\ No newline at end of file