In the output peaks.xls file from MACS(v1.4), there are 3 columns: -10log10(pvalue), fold_enrichment, and FDR(%). We can use these three columns to sort/filter peaks. Here are what they means:
The '−10*log10(Pvalue)' column lists the transformed P value of each peak, which makes peak sorting easier. For example, a P value of 1e−5 would be transformed to 50. The 'fold_enrichment' column shows the ratio of the ChIP-seq read count to the local value of lambda within each peak. The 'FDR(%)' column contains the empirical FDR percentage for each peak. For example, the fourth peak in the list has an 'FDR(%)' value of '6.45' and '−10*log10(Pvalue)' value of '83.67'; using the same P value cutoff of 4.4e−09 = 1083.67/−10, the ratio of the number of peaks identified by MACS after and before exchanging control and ChIP-seq samples is 6.45:100. The FDR column is only available when the control sample is available.
As the MACS paper (Zhang et al., Genome Biology, 2008) clearly stated, MACS uses Poisson distribution to measure the distribution of ChIP-seq tag. Poisson distribution is characterized as having only one parameter, λ, for both mean and variance. The origin of using Poisson distribution to measure ChIPseq tag can be referred to a Nature paper (see its supplementary note):
The number of sequence reads required to map a chromatin feature can be estimated from a simple model.Xianjun: For those who have difficulty to understand the formula, I would suggest to just understand it like this: if there is a single Poisson distribution, the λ should be R/N; when there are two parts/distribution, we just divide N into Nf and N(1-f) and give different weight to them. For the unenriched part, you want to weight the enriched bins with weight e (because it's more enriched). That is R/(Nfe+N(1-f))=M; For the enriched part, you want to weight unenriched bins with 1/e, then the λ is R/(Nf+N(1-f)/e)=eM. (Thanks to Shikui and Sowmya for helping understanding Poisson distribution and the formula!)
Suppose that the genome is divided into N non-overlapping bins of fixed size, that a fraction f of these bins contain a particular chromatin feature and that one performs ChIP-Seq with an antibody that enriches the sequence in these bins by a factor of e. If one collects a total of R sequence reads, the number of reads in a bin should approximately follow a Poisson distribution with mean eM for bins containing the feature and M for the other bins, where M = R/N(ef+(1-f)).
MACS uses a dynamic λ to compensate the local fluctuations and biases even in control sample:
The question is: how to select MACS peaks based on the measurements?
To illustrate the relationship of the three measurements, I plot them as below. Note: the figures below are generated by Excel. Looking not so nice, but only for illustration purpose.
X-axis: -10log10(p-value); Y-axis: FDR (%)
But we can choose a smaller FDR cutoff (e.g. 5%), since it's underestimated due to a small control size in our case. That would correspond to a p-value of 10^-50, significant enough.
Here is the fold_enrichment histogram if we choose FDR cutoff as 10%.