## Wednesday, May 02, 2012

### data transformation: variance-stabilizing transformation

It felt good to read this wiki article in the morning; now I understand a bit more about why we usually use a logarithm (e.g. log2(x)) to transform the data before we do regression.

The reason to do a variance-stabilizing transformation is to limit/remove the relationship between the mean and the variance. One of the most important assumptions of linear regression is constant variance (also known as homoskedasticity); see the Assumptions section of the Linear regression wiki (http://en.wikipedia.org/wiki/Linear_regression#Assumptions). Violating this assumption leads to "less precise parameter estimates and misleading inferential quantities such as standard errors" (from the wiki). I've not fully understood this part yet, but the wiki article points out several ways to fix the problem, among which is the variance-stabilizing transformation (VST), e.g. the logarithm. I'd like to try the other ways, such as Bayesian linear regression, in a future study.
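To see the "misleading standard errors" part concretely, here is a small simulation of my own (a numpy sketch, not from the wiki): when the error sd grows with x, the textbook OLS standard-error formula can understate the slope's real sampling variability.

```python
# Sketch (my own simulation): heteroskedastic errors make the usual OLS
# standard error of the slope misleading.  Here sd(err) grows with x, so the
# noisiest points also have the most leverage on the slope.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 100)
X = np.column_stack([np.ones_like(x), x])
sxx = np.sum((x - x.mean()) ** 2)

slopes, naive_ses = [], []
for _ in range(2000):
    y = 1.0 + x + rng.normal(0.0, 0.1 + x**2, size=x.size)  # sd rises with x
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (x.size - 2)        # usual pooled residual variance
    slopes.append(beta[1])
    naive_ses.append(np.sqrt(s2 / sxx))      # textbook SE of the slope

print(np.std(slopes), np.mean(naive_ses))    # true spread > reported SE
```

The slope estimate itself stays unbiased; it is the reported standard error that is too small, which is exactly the "misleading inferential quantities" problem.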

OK. Another key question is: how to choose a proper VST?

Here is a very nice one-page math note: http://www.stat.ufl.edu/~winner/sta6207/transform.pdf

To keep it simple: if the variance of X is a function of its mean, V(X) = g(u), where u is the mean of X (i.e. u = E(X)), then the proper VST is f(x) = ∫ 1/sqrt(g(u)) du (up to scale and shift). The idea is the delta method: Var(f(X)) ≈ f'(u)^2 * g(u), which is constant when f'(u) = 1/sqrt(g(u)).

For example, the Anscombe transform maps a Poisson-distributed variable (where the variance equals the mean, so g(u) = u) to an approximately Gaussian one: integrating 1/sqrt(u) gives 2*sqrt(u), and Anscombe's exact version is A(x) = 2*sqrt(x + 3/8).

If V(X) = g(u) = u^2, then 1/sqrt(g(u)) = 1/u, and the integral of 1/u is log(u). That's the transformation we usually use for microarray and RNA-seq data. But do we ever test the V(X) = u^2 relationship?
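Both recipes can be checked numerically. Here is a quick sketch of my own (numpy, not from the post): the Anscombe transform should give sd ≈ 1 for Poisson data at any rate, and the log should give a constant sd when the sd is proportional to the mean.

```python
# Sketch (my own check): verify two VSTs numerically.
# If g(u) = u (Poisson), Anscombe's 2*sqrt(x + 3/8) gives sd ~ 1 at any rate.
# If sd = 0.1 * mean, i.e. g(u) = (0.1*u)^2, then log(x) gives sd ~ 0.1.
import numpy as np

rng = np.random.default_rng(0)

# Poisson: raw sd grows like sqrt(mean); Anscombe-transformed sd stays ~1
pois_sds = {lam: (2 * np.sqrt(rng.poisson(lam, 100_000) + 3/8)).std()
            for lam in (10, 100, 1000)}

# sd proportional to mean: raw sd grows linearly; log-transformed sd stays ~0.1
log_sds = {u: np.log(rng.normal(u, 0.1 * u, 100_000)).std()
           for u in (10.0, 100.0, 1000.0)}

print(pois_sds)  # all close to 1
print(log_sds)   # all close to 0.1
```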

We can test the homoskedasticity of the residuals with the Breusch–Pagan test; bptest {lmtest} is the R function. Here is an example output:
```
> library('lmtest')
> ## generate a regressor
> x <- rep(c(-1,1), 50)
> ## generate heteroskedastic and homoskedastic disturbances
> err1 <- rnorm(100, sd=rep(c(1,2), 50))
> err2 <- rnorm(100)
> ## generate a linear relationship
> y1 <- 1 + x + err1
> y2 <- 1 + x + err2
> ## perform Breusch-Pagan test
> bptest(y1 ~ x)
studentized Breusch-Pagan test
data:  y1 ~ x
BP = 15.867, df = 1, p-value = 6.795e-05
> bptest(y2 ~ x)
studentized Breusch-Pagan test
data:  y2 ~ x
BP = 7e-04, df = 1, p-value = 0.9784
```

The residuals of y1 ~ x are significantly heteroskedastic (p < 0.05), while for y2 ~ x there is no evidence against homoskedasticity.
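To understand what the test is actually computing, here is my own numpy/scipy sketch of the studentized (Koenker) Breusch–Pagan statistic: n times the R² from regressing the squared OLS residuals on the same regressors, compared to a chi-squared distribution. This is an illustration of the idea, not the lmtest source; I use the blog's setup scaled up to 400 points.

```python
# Sketch (my own illustration): the studentized Breusch-Pagan statistic is
# n * R^2 from the auxiliary regression of squared residuals on the
# regressors; under homoskedasticity it is asymptotically chi-squared with
# (number of non-constant regressors) degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.tile([-1.0, 1.0], 200)
y1 = 1 + x + rng.normal(0, np.tile([1.0, 2.0], 200))  # heteroskedastic
y2 = 1 + x + rng.normal(0, 1, 400)                    # homoskedastic

def bp_test(x, y):
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2
    # auxiliary regression: squared residuals on the same regressors
    g, *_ = np.linalg.lstsq(X, e2, rcond=None)
    fitted = X @ g
    r2 = 1 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    bp = len(y) * r2
    return bp, stats.chi2.sf(bp, df=1)

print(bp_test(x, y1))  # small p-value: variance depends on x
print(bp_test(x, y2))  # large p-value: no evidence of heteroskedasticity
```

So the test is really asking: do the squared residuals have any linear relationship with the regressors? That is exactly the mean-variance dependence a VST is supposed to remove.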