There is no shame in writing a note about something (probably) everyone knows and you thought you knew, but are actually not 100% sure about. Multiple testing is such a piece in my knowledge map.

Some terms first:

**- Type I error (false positive) and Type II error (false negative):**
When we do a hypothesis test, we can categorize the result into the following 2x2 table:

| Judgement of null hypothesis (*H*_{0}) | *H*_{0} is valid/true | *H*_{0} is invalid/false |
| --- | --- | --- |
| Reject | Type I error (false positive) | Correct inference (true positive) |
| Fail to reject | Correct inference (true negative) | Type II error (false negative) |

A Type I error is "you reject a true thing". If the true thing is the null hypothesis (H0), which is what people usually assume (e.g. no difference, no effect), then rejecting it (i.e. claiming there is a difference) when it actually holds is a false positive. The same logic applies to the Type II error, or false negative.

Also note that people use the Greek letter α for the Type I error rate and β for the Type II error rate. α is also the significance level of a test, e.g. 5%. So when a single test reaches a p-value of 0.05, we can intuitively understand that there is a 5% chance we are making a mistake, or that 5% of the cases we call significant are actually not. β is related to the power of a test: power = 1 − β, the probability of detecting a true positive among all truly positive cases.

**- Sensitivity and Specificity**
| Total tests (m) | *H*_{0} is valid/true | *H*_{0} is invalid/false |
| --- | --- | --- |
| Reject (R) | V | S |
| Fail to reject | U | T |

Sensitivity = S / (S+T) = power = 1-β

Specificity = U / (U+V) = 1-α
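To make the two formulas concrete, here is a tiny Python sketch with made-up counts (the numbers V, S, U, T are purely illustrative, not from any real study):

```python
# Hypothetical counts from m = 100 tests (illustrative numbers only):
# V = false positives, S = true positives, U = true negatives, T = false negatives
V, S, U, T = 4, 36, 56, 4

sensitivity = S / (S + T)   # power = 1 - beta
specificity = U / (U + V)   # 1 - alpha

print(sensitivity)  # fraction of real effects we detected
print(specificity)  # fraction of true nulls we correctly kept
```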

**- Why does multiple testing matter?**
It matters because we usually perform the same hypothesis test not just once but many, many times. If your chance of making an error in a single test is α, then your chance of making one or more errors in m tests is

Pr(at least one error)=1−(1−α)^{m}

So, when m is large, the chance is nearly 100%. That's why we need to adjust the p-values for the number of hypothesis tests performed, i.e. to control the Type I error rate.
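A quick Python check of how fast this probability grows (the function name is my own):

```python
# Probability of at least one Type I error in m independent tests,
# each run at significance level alpha: 1 - (1 - alpha)^m
def prob_at_least_one_error(alpha, m):
    return 1 - (1 - alpha) ** m

for m in [1, 10, 100, 1000]:
    print(m, round(prob_at_least_one_error(0.05, m), 4))
```

With α = 0.05, already at m = 100 the probability of at least one false positive is above 99%.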

**- How to control the Type I error rate in multiple testing?**

There are many different criteria for controlling Type I errors, such as:

**Per-comparison error rate (PCER)**: the expected number of Type I errors divided by the number of hypotheses, PCER = E(V)/m.

**Per-family error rate (PFER)**: the expected number of Type I errors, PFER = E(V).

**Family-wise error rate (FWER)**: the probability of at least one type I error, FWER = P(V ≥ 1)

**False discovery rate (FDR)**: the expected proportion of Type I errors among the rejected hypotheses, FDR = E(V/R | R > 0)P(R > 0)

**Positive false discovery rate (pFDR)**: the rate that discoveries are
false, pFDR = E(V/R | R > 0)

**- Controlling Family-Wise Error Rate**
Many procedures have been developed to control the family-wise error rate P(V ≥ 1), including Bonferroni, Holm (1979), Hochberg (1988), and Šidák. They fall into two types: single-step (e.g. Bonferroni) and sequential adjustment (e.g. Holm or Hochberg). The Bonferroni correction controls the overall Type I error rate regardless of the dependence between tests, at the cost of being conservative. It rejects any hypothesis with p-value ≤ α/m. Equivalently, when doing the correction, simply multiply each nominal p-value by m (capping at 1) to get the adjusted p-values. In R, it's the following function

p.adjust(p, method = "bonferroni")
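As a sketch of the arithmetic this performs, here is a hand-rolled Python version (my own minimal illustration, not the R implementation):

```python
def bonferroni(pvals):
    # Multiply each p-value by the number of tests, capping at 1 --
    # the same arithmetic as R's p.adjust(p, method = "bonferroni").
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

print(bonferroni([0.01, 0.02, 0.3]))
```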

The sequential corrections are slightly more powerful than the Bonferroni test. The Holm step-down procedure is the easiest to understand. Suppose you have one thousand p-values. First, sort them from low to high. Multiply the smallest p-value by 1000; if that adjusted p-value is less than 0.05, the corresponding gene shows evidence of differential expression (so far, no different from Bonferroni). Then multiply the second smallest p-value by 999 (not 1000) and see if it is less than 0.05. Multiply the third smallest p-value by 998, the fourth smallest by 997, and so on, comparing each adjusted p-value to 0.05. Finally, ensure that each adjusted p-value is at least as large as every preceding one; if it is not, set it equal to the largest of the preceding adjusted p-values. This is the algorithm of the Holm step-down procedure. In R, it's

p.adjust(p, method = "holm")
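The steps above can be sketched in Python (again a minimal illustration of the same arithmetic as R's `p.adjust(p, method = "holm")`; the function name is my own):

```python
def holm(pvals):
    # Step-down: sort ascending, multiply the j-th smallest (1-based)
    # p-value by (m - j + 1), enforce monotonicity with a running
    # maximum, cap at 1, and return values in the original order.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for j, i in enumerate(order):
        running_max = max(running_max, pvals[i] * (m - j))
        adjusted[i] = min(1.0, running_max)
    return adjusted

print(holm([0.01, 0.02, 0.03]))
```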

**- Controlling FDR**

FWER control is appropriate when you want to guard against
ANY false positives. However, in many cases (particularly in genomics) we
can live with a certain number of false positives. In these cases, the more relevant quantity to control is
the false discovery rate (FDR), which is designed to control the proportion
of false positives (V) among the set of rejected hypotheses (R). FDR control has generated a lot of interest
because it offers a more balanced trade-off between error rate control and power than
traditional family-wise error rate control.

Procedures controlling FDR include Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001), Benjamini & Hochberg (2000) and two-stage Benjamini & Hochberg (2006).

Here are the steps for Benjamini & Hochberg FDR:

1. sort the nominal p-values from smallest to largest: p1 ≤ p2 ≤ … ≤ pm

2. find the largest rank j with pj ≤ (j/m) × δ, where δ is the FDR level to control.

3. declare the tests of ranks 1, 2, …, j significant; the adjusted p-value at rank j is pj × m/j (made monotone by taking cumulative minima from the largest rank down).
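These steps can be sketched in Python (a minimal illustration of the adjusted-p-value arithmetic, matching what R's `p.adjust(p, method = "BH")` computes; the function name is my own):

```python
def benjamini_hochberg(pvals):
    # Adjusted p-value of the j-th smallest (1-based) is p_j * m / j,
    # made monotone with a cumulative minimum from the largest rank
    # down; results are returned in the original order.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for j in range(m - 1, -1, -1):
        i = order[j]
        running_min = min(running_min, pvals[i] * m / (j + 1))
        adjusted[i] = running_min
    return adjusted

print(benjamini_hochberg([0.005, 0.04, 0.03, 0.2]))
```

Every test whose adjusted p-value is at or below the chosen δ is then declared significant.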

Reference:

http://www.r-bloggers.com/adjustment-for-multiple-comparison-tests-with-r-resources-on-the-web/

http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf

http://www.stat.berkeley.edu/~mgoldman/Section0402.pdf