Sunday, August 16, 2015

Using ANOVA to get correlation between categorical and continuous variables

How to calculate the correlation between categorical variables and continuous variables?

This is the question I was facing when attempting to check the correlation of PEER inferred factors vs. known covariates (e.g. batch).

One solution I found is, I can use ANOVA to calculate the R-square between categorical input and continuous output.

Here is my R code snip:

## correlation of inferred factors vs. known factors
# name PEER factors
colnames(factors)=paste0("PEER_top_factor_",1:bestK)
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))

library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y~x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y~x))$`Pr(>F)`[1]
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

require(pheatmap);
pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")

I highlighted the core part in yellow color. As it shows, we can use aov() function in R to run ANOVA. Its result can be summarized with summary.lm() function, which show output like:

> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

R^2 and p-value are shown at the end of output.

Note: the summary.lm() object doesn't contain value of p-value directly. But we can compute p-value in command like:

> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1],F[2],F[3])

Below table is a nice summary the methods applicable to corresponding data type.

PREDICTOR VARIABLE (S)
OUTCOME VARIABLE
CategoricalContinuous
CategoricalChi Square, Log linear, Logistict-test, ANOVA (Analysis of Varirance)Linear regression
ContinuousLogistic regressionLinear regression,  Pearson correlation
Mixture of Categorical and ContinuousLogistic regressionLinear regressionAnalysis of Covariance
Ref:http://www.tulane.edu/~panda2/Analysis2/sidebar/stats.htm

Next thing I need to refresh my mind is how different in calculating the correlation using cor() and the above ANOVA method above.

I know the correlation coefficient r can be inferred from the coefficient and sd of two variables. For example, we know sd(x) and sd(y), then when regressing y~x, we got regression line e.g. y=b0 + b1x. Then we can calculate r as

r = b1 * SDx / SDy

When x and y are in standard normal distribution, e.g. u=0, sd=1, then r=b1.
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html

2 comments:

  1. Seemed to be very useful. But to be honest, it's a little bit difficult for me to understand.
    -Caroline
    http://www.creativebiomart.net/

    ReplyDelete

  2. Most companies use telephone interviews as part of their recruitment process to whittle down a pile of CVs into a manageable interview shortlist. Visit interviewguidance.com

    ReplyDelete