## Sunday, August 16, 2015

### Using ANOVA to get correlation between categorical and continuous variables

How do you calculate the correlation between a categorical variable and a continuous variable?

This is the question I faced when checking the correlation of PEER-inferred factors against known covariates (e.g. batch).

One solution I found: use ANOVA and take the R-squared between the categorical input and the continuous output.
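The idea can be sketched with R's built-in PlantGrowth data set (continuous weight, 3-level treatment group), which happens to be the same data behind the example output shown further below:

```r
# Minimal sketch: R^2 between a categorical predictor and a continuous outcome.
# Uses the built-in PlantGrowth data set (weight ~ group, 3 treatment groups).
data(PlantGrowth)
fit <- aov(weight ~ group, data = PlantGrowth)
r2  <- summary.lm(fit)$r.squared   # proportion of variance explained by group
round(r2, 4)                       # 0.2641
```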

Here is my R code snippet:

```r
## Correlation of inferred factors vs. known covariates
# Name PEER factors
colnames(factors) <- paste0("PEER_top_factor_", 1:bestK)
# Continuous known covariates:
covs2 <- subset(covs, select = c(RIN, PMI, Age))
# Re-generate the batch categorical variable from the individual binary
# indicators (PEER requires binary indicators as input)
covs2 <- cbind(covs2, batch = paste0("batch", apply(covs[, 1:6], 1, which.max)))

library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars <- covs2; yvars <- as.data.frame(factors)
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y ~ x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y ~ x))$`Pr(>F)`[1]  # first row is the predictor; drops the residuals' NA
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

require(pheatmap)
pheatmap(-log10(t(pvalue)),
         color = colorRampPalette(c("white", "blue"))(10),
         cluster_row = FALSE, cluster_col = FALSE,
         display_numbers = as.matrix(t(round(r2, 2))),
         filename = "peer.factor.correlation.pdf")
```

The core of the snippet is the aov() call. As it shows, we can use the aov() function in R to run ANOVA, and summarize its result with summary.lm(), which produces output like:

```
> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4180 -0.0060  0.2627  1.3690

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
```

R^2 and the p-value are shown at the end of the output.

Note: the summary.lm() object doesn't contain the p-value directly, but we can compute it from the stored F statistic (a vector of the F value and its two degrees of freedom):

```
> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1], F[2], F[3], lower.tail=FALSE)
```
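As a sanity check, the p-value recovered this way matches the Pr(>F) that anova() reports. A small sketch using the built-in PlantGrowth data (the same example whose summary is shown above):

```r
# Verify: the p-value recovered from the F statistic equals anova()'s Pr(>F).
data(PlantGrowth)
fit   <- aov(weight ~ group, data = PlantGrowth)
Fstat <- as.numeric(summary.lm(fit)$fstatistic)   # c(F value, numdf, dendf)
p_from_F <- pf(Fstat[1], Fstat[2], Fstat[3], lower.tail = FALSE)
p_anova  <- anova(lm(weight ~ group, data = PlantGrowth))$`Pr(>F)`[1]
all.equal(p_from_F, p_anova)   # TRUE; both are ~0.01591
```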

The table below is a nice summary of the methods applicable to each combination of data types.

| PREDICTOR VARIABLE(S) | OUTCOME: Categorical | OUTCOME: Continuous |
|---|---|---|
| Categorical | Chi-square, Log-linear, Logistic | t-test, ANOVA (Analysis of Variance), Linear regression |
| Continuous | Logistic regression | Linear regression, Pearson correlation |
| Mixture of Categorical and Continuous | Logistic regression | Linear regression, Analysis of Covariance |

Ref: http://www.tulane.edu/~panda2/Analysis2/sidebar/stats.htm
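As a quick illustration of the categorical-predictor / categorical-outcome cell, a chi-square test of independence in R might look like this (the built-in mtcars variables here are chosen purely for illustration):

```r
# Chi-square test of independence on a 2-way contingency table:
# transmission type (am) vs. engine shape (vs) in the built-in mtcars data.
tbl <- table(mtcars$am, mtcars$vs)
chisq.test(tbl)$p.value   # small p-value would suggest the two are associated
```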

The next thing I need to refresh my mind on is how computing correlation with cor() differs from the ANOVA method above.

I know the correlation coefficient r can be inferred from the regression coefficient and the standard deviations of the two variables. For example, if we know sd(x) and sd(y), and regressing y ~ x gives the line y = b0 + b1*x, then we can calculate r as

r = b1 * SD(x) / SD(y)

When x and y are both standardized (mean 0, sd 1), r = b1.
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html
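The identity above is easy to verify on toy data (the data here are simulated, purely for illustration):

```r
# Check the identity r = b1 * sd(x) / sd(y) against cor() on simulated data.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
b1 <- coef(lm(y ~ x))["x"]          # slope of the fitted regression line
r  <- unname(b1 * sd(x) / sd(y))    # correlation recovered from the slope
all.equal(r, cor(x, y))             # TRUE
```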
