In the output peaks.xls file from MACS(v1.4), there are 3 columns: -10log10(pvalue), fold_enrichment, and FDR(%). We can use these three columns to sort/filter peaks. Here are what they means:

The '−10*log10(Pvalue)' column lists the transformed P value of each peak, which makes peak sorting easier. For example, a P value of 1e−5 would be transformed to 50. The 'fold_enrichment' column shows the ratio ofthe ChIP-seq read counttothe local value of lambdawithin each peak. The 'FDR(%)' column contains the empirical FDR percentage for each peak. For example, the fourth peak in the list has an 'FDR(%)' value of '6.45' and '−10*log10(Pvalue)' value of '83.67'; using the same P value cutoff of 4.4e−09 = 1083.67/−10, the ratio of the number of peaks identified by MACS after and before exchanging control and ChIP-seq samples is 6.45:100. The FDR column is only available when the control sample is available.

As the MACS paper (

*Zhang et al., Genome Biology, 2008*) clearly stated, MACS uses Poisson distribution to measure the distribution of ChIP-seq tag. Poisson distribution is characterized as having only one parameter, λ, for both mean and variance. The origin of using Poisson distribution to measure ChIPseq tag can be referred to a*Nature*paper (see its supplementary note):The number of sequence reads required to map a chromatin feature can be estimated from a simple model.

Suppose that the genome is divided intoNnon-overlapping bins of fixed size, that a fractionfof these bins contain a particular chromatin feature and that one performs ChIP-Seq with an antibody that enriches the sequence in these bins by a factor ofe. If one collects a total ofRsequence reads, the number of reads in a bin should approximately follow a Poisson distribution with meaneMfor bins containing the feature andMfor the other bins, whereM =R/N(ef+(1-f)).

**Xianjun**: For those who have difficulty to understand the formula, I would suggest to just understand it like this: if there is a single Poisson distribution, the λ should be R/N; when there are two parts/distribution, we just divide N into Nf and N(1-f) and give different weight to them. For the unenriched part, you want to weight the enriched bins with weight e (because it's more enriched). That is R/(Nfe+N(1-f))=M; For the enriched part, you want to weight unenriched bins with 1/e, then the λ is R/(Nf+N(1-f)/e)=eM. (Thanks to Shikui and Sowmya for helping understanding Poisson distribution and the formula!)

MACS uses a dynamic λ to compensate the local fluctuations and biases even in control sample:

λ

_{local }= max(λ_{BG}, [λ_{1k},] λ_{5k}, λ_{10k})
where λ

_{BG }is a uniform estimation for the whole genome, λ_{1k}, λ_{5k }and λ_{10k }are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case λ_{1k }is not used). λ_{local }captures the influence of local biases, and is robust against occasional low tag counts at small local regions.**Xianjun**: That's why you will get different p-value and fold_enrichment for the same peak when using different control samples. MACS uses λ_{local }to calculate the*p*-value of each candidate peak (**Xianjun**: Once you have λ, you have the Poisson distribution; once you have a distribution and the observed value, which is the observed reads count in the peak region for this case, then you can get p-value based on the distribution. See wikipedia for p-value computation) and removes potential false positives due to local biases (that is, peaks significantly under λ_{BG}, but not under λ_{local}). Candidate peaks with*p*-values below a user-defined threshold*p*-value (default 10^{-5}) are called, and the ratio between the ChIP-Seq tag count and λ_{local }is reported as the*fold_enrichment*.
The question is: how to select MACS peaks based on the measurements?

To illustrate the relationship of the three measurements, I plot them as below. Note: the figures below are generated by Excel. Looking not so nice, but only for illustration purpose.

X-axis: -10log10(p-value); Y-axis: FDR (%)

But we can choose a smaller FDR cutoff (e.g. 5%), since it's underestimated due to a small control size in our case. That would correspond to a p-value of 10^-50, significant enough.

Here is the fold_enrichment histogram if we choose FDR cutoff as 10%.

## No comments:

## Post a Comment