How to calculate the correlation between categorical variables and continuous variables?
This is the question I was facing when attempting to check the correlation of PEER-inferred factors vs. known covariates (e.g. batch).
One solution I found is to use ANOVA to calculate the R-squared between a categorical input and a continuous output.
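As a minimal illustration of the idea, using R's built-in PlantGrowth dataset (categorical group, continuous weight) rather than my PEER data:

```r
# R-squared between a categorical predictor (group) and a
# continuous outcome (weight), via ANOVA on built-in PlantGrowth data
fit <- aov(weight ~ group, data = PlantGrowth)
summary.lm(fit)$r.squared  # ~0.264: fraction of variance in weight explained by group
```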
Here is my R code snippet:
## correlation of inferred factors vs. known factors
# name PEER factors
colnames(factors)=paste0("PEER_top_factor_",1:bestK)
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# regenerate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))
library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y ~ x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)
pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y ~ x))$`Pr(>F)`[1]
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)
require(pheatmap);
pheatmap(log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")
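For readers without the PEER data at hand, here is a self-contained toy version of the same nested loop, using base-R sapply() in place of plyr::laply() and made-up covariate and factor names:

```r
# Toy sketch of the nested loop above: an R-squared matrix between
# made-up categorical covariates (xvars) and continuous factors (yvars)
set.seed(1)
xvars <- data.frame(batch = sample(c("b1", "b2"), 50, replace = TRUE),
                    Sex   = sample(c("M", "F"),  50, replace = TRUE))
yvars <- data.frame(PEER_1 = rnorm(50), PEER_2 = rnorm(50))
r2 <- sapply(yvars, function(y)
        sapply(xvars, function(x) summary.lm(aov(y ~ x))$r.squared))
dim(r2)  # 2 x 2: one row per covariate, one column per factor
```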
The core part is the call to aov(): as it shows, we can use the aov() function in R to run ANOVA. Its result can be summarized with the summary.lm() function, which shows output like:
> summary.lm(results)

Call:
   aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4180 -0.0060  0.2627  1.3690

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,  Adjusted R-squared: 0.2096
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
R-squared and the p-value are shown at the end of the output.
Note: the summary.lm() object doesn't contain the p-value directly, but we can compute it from the F statistic (note the upper tail is needed):
> F <- summary.lm(results)$fstatistic
> F <- as.numeric(F)
> pf(F[1], F[2], F[3], lower.tail = FALSE)
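As a sanity check (again with the PlantGrowth example), the p-value recomputed from the F statistic should match the one reported by anova(lm(...)):

```r
# Recompute the overall p-value from the F statistic and compare it
# with the one reported by anova(lm(...)), using PlantGrowth data
results <- aov(weight ~ group, data = PlantGrowth)
F  <- as.numeric(summary.lm(results)$fstatistic)
p1 <- pf(F[1], F[2], F[3], lower.tail = FALSE)
p2 <- anova(lm(weight ~ group, data = PlantGrowth))$`Pr(>F)`[1]
all.equal(p1, p2)  # TRUE
```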
The table below is a nice summary of the methods applicable to each combination of data types.

PREDICTOR VARIABLE(S)                  | OUTCOME: Categorical              | OUTCOME: Continuous
Categorical                            | Chi-square, log-linear, logistic  | t-test, ANOVA (analysis of variance), linear regression
Continuous                             | Logistic regression               | Linear regression, Pearson correlation
Mixture of categorical and continuous  | Logistic regression               | Linear regression, analysis of covariance
Ref: http://www.tulane.edu/~panda2/Analysis2/sidebar/stats.htm
The next thing I need to refresh my mind on is how calculating the correlation with cor() differs from the ANOVA method above.
I know the correlation coefficient r can be inferred from the regression slope and the standard deviations of the two variables. For example, if we know sd(x) and sd(y), and regressing y ~ x gives the line y = b0 + b1*x, then we can calculate r as
r = b1 * sd(x) / sd(y)
When x and y are both standardized (mean 0, sd 1), r = b1.
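This identity is easy to check numerically on simulated data:

```r
# Check: Pearson r recovered from the regression slope b1
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
b1 <- coef(lm(y ~ x))[["x"]]
r  <- b1 * sd(x) / sd(y)
all.equal(r, cor(x, y))  # TRUE
# After standardizing both variables, the slope itself equals r
b1z <- coef(lm(scale(y) ~ scale(x)))[[2]]
all.equal(b1z, cor(x, y))  # TRUE
```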
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/Rtutorials/oneway.html