Thursday, September 07, 2017

try to be consistent

A lesson worth noting from today: be consistent with chromosome order when sorting different files.

I got the same error message as in this post:

In short, the chromosome order in my sorted BAM file header looks like chr10, chr11, chr12 ... chr19, chr1, chr20 ... chr22, chr2, chr3 ... chrM, chrX, chrY. This order is determined by the aligner, which generates the @SQ header lines based on how the reference sequences are arranged in the FASTA file from which the aligner index was built. That is, the order in @SQ is the same as the order in your genome.fa.fai (index) file.
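For instance, plain byte-order sort and natural "version" sort disagree on chromosome names, and mixing the two orders across files is exactly what triggers such errors:

```shell
# Lexical (byte) order groups chr1 with chr10..chr19, like my BAM header:
printf 'chr1\nchr2\nchr10\nchrM\n' | LC_ALL=C sort
# Natural "version" order keeps chr2 before chr10:
printf 'chr1\nchr2\nchr10\nchrM\n' | sort -V
```

If two files disagree, re-sort one side to match the other (e.g. sort the FASTA before building the index) rather than mixing conventions.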

Thursday, July 20, 2017

Educational papers in Bioinformatics from Nature Methods and Nature Biotechnology

Nature Methods and Nature Biotechnology have run superb monthly bioinformatics columns for years. For example, Nature Methods has both "Points of Significance", covering topics in statistics, and "Points of View", on data visualization. And Nature Biotechnology's Primer column covers a collection of bioinformatics topics, such as "What is a hidden Markov model?".

Today I compiled all the articles from the three columns above into one PDF. Here is the link

Monday, May 15, 2017

Kingdoms and domains of life

Notes of revisiting the topic:

  1. Linnaeus originally placed all living things into either the plant or the animal kingdom. Nowadays we place living things (viruses excluded) into six kingdoms (Animalia, Plantae, Fungi, Protista, Archaea/Archaebacteria, and Bacteria/Eubacteria in US textbooks) or five (Animalia, Plantae, Fungi, Protista and Monera in other countries).
  2. A protist is any eukaryotic organism that is not an animal, plant, or fungus. In other words, protists are all eukaryotes outside those three kingdoms, including kelp, nori, algae, amoebae, etc. So kelp is not a plant. Good to know!
  3. Archaea/Archaebacteria vs. Bacteria/Eubacteria: Archaea (/ar-kee/, the ending sounds like IKEA) and Bacteria are both prokaryotes. Archaea are similar to Bacteria in structure (e.g. no membrane-bound organelles or nucleus, a circular chromosome) but closer to eukaryotes in biochemistry (e.g. transcription and translation resemble those of Eukarya: transcription factors bound to the promoter, an RNA polymerase resembling Pol II). Archaea were originally found in extreme environments, e.g. the hot springs of Yellowstone, but are now found everywhere, e.g. in the human gut.
  4. Not all extremophiles are archaeons. Thermus aquaticus, originally found in Yellowstone National Park, is not an archaeon but a thermophilic bacterium. The famous Taq enzyme for PCR comes from this bacterium. T. aquaticus was first discovered by Thomas Brock in 1969; its DNA polymerase was first isolated in 1976; in 1988 Kary Mullis and colleagues purified the enzyme and pointed out that it could be used for PCR, and five years later (1993) this work earned the Nobel Prize in Chemistry.
  5. The six kingdoms are grouped into three domains (Bacteria, Archaea, and Eukarya), although both Bacteria and Archaea are prokaryotes (no nuclear membrane). The main characteristics scientists focus on when deciding which kingdom an organism belongs to are summarized below:
       • Cell type: organisms are either prokaryotic or eukaryotic (nuclear membrane or not).
       • Cell wall: some organisms have cell walls and others do not.
       • Body type: organisms can be unicellular or multicellular.
       • Nutrition: organisms are autotrophs or heterotrophs, depending on whether they make their own food or get nutrients by consuming other organisms.

Tuesday, January 24, 2017

Simple note for github vs. Rstudio

Having used both RStudio and GitHub for a while, I never thought to combine them. Hadley Wickham has already put up detailed instructions on how to link GitHub with RStudio; I'll just list a few quick notes from my own experience.

1. The "project" concept in RStudio is the key. I was not very organized before, often working on files from several projects at the same time, which is actually very distracting. It seems that only files organized into a project can be pushed to GitHub via RStudio. Very good!

2. How to install an R package from a private GitHub repo? The trick is here.
## add your PAT (Personal Access Token, generated under your GitHub account settings) to your ~/.Renviron file
# echo "GITHUB_PAT=xxxxxxxxxxYOUR-PAT-GOES-HERExxxxxxxxxx" >> ~/.Renviron
# library(devtools); install_github("sterding/powerEQTL")

Output, Reads, Cluster Numbers etc. for Illumina

Tuesday, December 20, 2016

remove and disable .DS_Store in mounted disk

Do you see this when you try to extend a path by pressing the TAB key? What are those .DS_Store files?

They are generated by macOS (see [1] for details). Because I mount my remote disk via SMB, macOS Finder creates such a configuration file whenever you open the mounted folder locally in Finder.

It's harmless to remove those files [2].

How to remove those already in your folder? [3]
find ./ -name ".DS_Store" -depth -exec rm {} \;

How to stop generating .DS_Store file on Network drive? [4]
defaults write DSDontWriteNetworkStores -bool true


Wednesday, November 09, 2016


Regarding the question of what IBD and IBS are, here is my understanding:

A SNP (or a segment of DNA) is identical by state (IBS) in two or more individuals if they carry the same allele. For example, if person A has genotype A/C at a SNP position while person B has A/T, then that position has IBS = 1. IBD is IBS where the shared alleles come from a common ancestor. In the example above, if A and B's parents have genotypes A/C (mom) and A/T (dad), we don't know whether the shared "A" in A and B came from mom or dad; in that case it's just IBS, not IBD. But if the parents' genotypes are A/G (mom) and C/T (dad), we can be largely sure the "A" came from mom, not dad, so it is both IBD and IBS. In practice, for a pair of individuals without parental genotypes, we can estimate the IBD likelihood at each SNP from their IBS information. I think the math behind it is Bayes' theorem.
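The IBS count in that example can be sketched with a bit of awk (toy genotypes only; real pipelines use plink for this):

```shell
# IBS at one SNP = number of alleles two genotypes share by state (0, 1 or 2).
# Person A is A/C, person B is A/T: they share one "A", so IBS = 1.
echo "A/C A/T" | awk '{
  split($1, a, "/"); split($2, b, "/")
  n = 0; used[1] = 0; used[2] = 0
  for (i = 1; i <= 2; i++)
    for (j = 1; j <= 2; j++)
      if (!used[j] && a[i] == b[j]) { used[j] = 1; n++; break }
  print "IBS =", n
}'
```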

PLINK uses IBS-based clustering to remove sample outliers (those with too-large IBS distance from the rest); see here:
PLINK uses IBD to detect samples that are too close to each other (e.g. relatives, contamination); see here:

Below is a nice piece on how to use IBD to detect relatives, from Anderson, C. et al. Data quality control in genetic case-control association studies. Nat Protoc. 5, 1564-1573 (2010):
The expectation is that IBD = 1 for duplicates or monozygotic twins, IBD = 0.5 for first-degree relatives, IBD = 0.25 for second-degree relatives and IBD = 0.125 for third- degree relatives. Owing to genotyping error, LD and population structure, there is often some variation around these theoretical values and it is typical to remove one individual from each pair with an IBD value of > 0.1875, which is halfway between the expected IBD for third- and second-degree relatives. For these reasons an IBD value of > 0.98 identifies duplicates.

Another review worth reading: Joseph E. Powell, Peter M. Visscher & Michael E. Goddard. Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics 11, 800-805 (2010).

The following slide may make IBD vs. IBS easier to understand:

Friday, July 08, 2016

Best way to draw heatmap for publication

Here are two tips I can share if you are also working on a big dataset toward a high-quality heatmap:

1. Don't generate PDFs with pheatmap() or heatmap.2(), because (i) the file gets unnecessarily HUGE when the heatmap has many data points, enough to kill your Illustrator; and (ii) annoying grey boxes get added to the grid (see here). Instead, use the basic image() function with zero margins (e.g. par(mar=c(0,0,0,0))) to generate a high-resolution PNG (or JPEG, TIFF) and place it in Adobe Illustrator, where you can freely add legends and annotations.

2. When you use image(), first rotate your matrix 90 degrees clockwise, so that the conventional printed layout of the matrix matches what you see in the image (e.g. the top-left corner maps to x[1,1] of your matrix). An elegant piece of code for clockwise rotation can be found here:
rotate <- function(x) t(apply(x, 2, rev))

Tuesday, February 16, 2016

My 15 practical tips for a bioinformatician

Tips below are based on lessons learned from my own mistakes over years of research. It's purely personal opinion, and the order doesn't mean anything. If you think I should include something else, please comment below.
  1. Always set a seed when you run tools with a random component, e.g. bedtools shuffle, random, etc. You (or your boss one day) will want your work to be reproducible.
  2. Set your own temporary folder (via --tmp, $TMPDIR, etc., depending on the program). By default many tools, e.g. sort, use the system /tmp as the temporary folder, which may have a quota too small for your big NGS data.
  3. Always use valid names for your variables and for the columns and rows of a data frame. Otherwise you can hit unexpected problems, e.g. a '-' in a name will be converted to '.' in R unless you specify check.names=F.
  4. Always put a README file in your data folders; for why and how, read this:
  5. Always comment your code properly, for yourself and for others; you will very likely read your own ugly code again.
  6. Always back up your code regularly, using GitHub, svn, Time Machine, or simply copy/paste, whatever works.
  7. Always clean up intermediate or unnecessary data; it is easy to shock your boss and yourself with how much data you generate (most of it probably useless).
  8. Don't save to *.sam when you can use *.bam, and always compress your fastq (and other large plain-text files). The same applies to any format with a compressed alternative; you cannot imagine how much data (aka "digital garbage") you will soon generate.
  9. Use parallelism as much as you can, e.g. "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting, since multi-core CPUs/GPUs are common nowadays.
  10. When a project is completed, remember to clean up the project folder, incl. removing unnecessary code/data/intermediate files, and burn a CD or host the project somewhere in the cloud. You never know when you, your boss, or your collaborators will need the data again.
  12. When you learn some tips from others by Google search, remember to list the URL for future reference and also for acknowledging others' credit. This applies to this post, of course :)
  13. Keep learning, otherwise you will be out soon. Just like the rapid development of NGS techniques, computing skills are also evolving quickly. Always catch up with the new skills/tools/techniques.
  14. When you learn some tips from others, remember to share something you learned to the community as well, as that's how the community grows healthily. 
  15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians who usually sit in front of computer for hours. Only good health can last your career long. More reading:
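A couple of the tips above in concrete form (the temp path is just an example):

```shell
# Tip 2: point $TMPDIR at a directory you know has space (path is an example)
export TMPDIR="$HOME/tmp" && mkdir -p "$TMPDIR"

# Tip 9: multi-threaded, byte-order sort; LC_ALL=C also makes the order reproducible
printf 'chr2\t10\nchr1\t5\nchr1\t2\n' | LC_ALL=C sort --parallel=4 --buffer-size=1M -k1,1 -k2,2n
```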

Thursday, January 21, 2016

Consecutive tabs in indentation can cause problems

When I tested the following code in the terminal, I got an error at the second echo: the shell asked whether to list all files in the current folder, e.g. "Display all 148 possibilities? (y or n)", rather than moving forward. I couldn't see any error in the code.

for i in 1 2 3 4 5; do
    echo $i
    for j in eRNA promoter exon random; do
        echo $j
    done
done

I figured it out when I displayed all the whitespace characters, as shown below:

[screenshot: the same code with whitespace characters displayed]

Do you see the difference?

The "working" code, which I copied from a Linux document, uses different whitespace for the nested indentation (a tab followed by two spaces) than the "not working" part, which I typed in RStudio using two tabs. I realized it's due to the tab-autocompletion feature of the shell: pressing Tab a second time tells the shell to list all matching files.
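By the way, you don't need a screenshot to see the whitespace; GNU cat can display it (tabs show up as ^I, line ends as $):

```shell
# cat -A makes tabs (^I) and line ends ($) visible, so the two kinds of
# indentation are easy to tell apart
printf 'for i in 1 2 3; do\n\techo $i\ndone\n' | cat -A
```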

To avoid the problem, simply enable the "Insert spaces for tab" option in your editor. I use RStudio; here is where you can find the option:

Tuesday, January 12, 2016

Calculate the odds of winning Powerball in R

This Wednesday’s Powerball grand prize has already climbed to $1.5 BILLION. If you choose the cash payout, it would be $930 million. And it keeps increasing…
So, what are the odds of winning the jackpot prize?
Here is the game rule, according to the official description:

…we draw five white balls out of a drum with 69 balls and one red ball
out of a drum with 26 red balls.

We can calculate the total number of different combinations in R:

> choose(69,5)*26
[1] 292201338

If we are super, super lucky and win the jackpot at its $930 million cash value, and given that we pay 39.6% federal tax, how much do we expect back on a $2 investment? (Of course, everyone expects to win the $1.5 billion jackpot.)

> 930e6*(1-0.396)/(choose(69,5)*26)
[1] 1.92

Actually it’s not a good investment. (Thanks for the comment below; I had made a mistake: $930 million should be 930e6, not 930e9.)
If we want a 100% guarantee, we have to buy all 292 million combinations. In that case, do we come out ahead?
[Powerball prize table]

# this is what we pay
> choose(69,5)*26*2
[1] 584402676

# this is what we earn in total, before tax
> 930000000 + 1000000*choose(25,1) + 50000*choose(5,4)*choose(69-5,1) + 100*choose(5,4)*choose(69-5,1)*choose(25,1) + 100*choose(5,3)*choose(69-5,2) + 7*choose(5,3)*choose(69-5,2)*choose(25,1) + 7*choose(5,2)*choose(69-5,3) + 4*choose(5,1)*choose(69-5,4) + 4*choose(69-5,5)
[1] 1023466048

# Over $1 billion! Then we pay ~40% tax. The rate may differ for the minor prizes, but let's apply it to everything for simplicity.
> 1023466048 * (1-0.396)
[1] 618173493

That’s still more than what we paid. So why don’t we do it?
Remember, the prize is split when multiple people pick the same winning numbers, and we can’t know how many will. :D

Just some fun! :)

1. Fascinating Math Behind Why You Won’t Win Powerball
2. Tax for lottery

Wednesday, December 09, 2015

Dopamine is the "Kim Kardashian of molecules" in the brain -- I like this!

The main dopaminergic pathways of the human brain. (Wikipedia)

It's also interesting to learn about the impressive people in the Kardashian family.

Kim Kardashian's step father, Bruce Jenner (and now Caitlyn Jenner) 


SNP allele coding schema

There are (so far) three different SNP allele coding schemes. No idea why scientists have to make things so confusing sometimes...

Here you go:
  • TOP/BOT from Illumina: Simply put, for unambiguous cases (incl. A/G, A/C, T/G, T/C), A is always allele A (or allele 1) and lies on the TOP strand, while T is also allele A (or allele 1) but on the BOT strand; the other variant is allele B (or allele 2). For ambiguous cases (A/T and G/C), the strand is easy to assign, but the A/B assignment is ambiguous, so Illumina uses "sequence walking": the first unambiguous base pair in the flanking sequence determines allele A/B. A detailed explanation is in their tech doc.
  • Forward/Reverse from dbSNP: dbSNP uses the orientation determined by mapping the SNP's flanking sequence (from the longest submission) to a contig sequence using BLAST. A refSNP can have multiple submissions, each with its own orientation depending on whether its flanking sequence comes from the plus or minus strand. The submission with the longest flanking sequence is taken as the exemplar sequence for the refSNP during the BLAST analysis for the current build. So the strand of a dbSNP refSNP is determined by (1) the sequence of the longest submission (the exemplar) and (2) the strand of the contig it maps to. A good example can be found here. See more details here:
  • Plus/Minus strand for genome annotation, HapMap, and the 1000 Genomes Project: The plus (+) strand is defined as the strand whose 5′ end is at the tip of the short arm (see Ref #3 below). SNP alleles reported on the (+) strand are called "plus" alleles and those on the (−) strand "minus" alleles. This also applies to imputed results from 1000 Genomes, e.g. a0 is the REF (reference) allele and a1 the ALT (alternate) allele.
Several notes:
  1. dbSNP does not use the TOP/BOT schema:
  2. Minor/major allele is a relative term; it can change between populations/datasets. So it makes no sense to assume that allele 2 is the minor allele, that G in A>G is the minor allele, and so on. The minor allele can even be the reference allele.
  3. In the Illumina genotyping final report (e.g. "FinalReport.txt"), use "Allele1 - Forward" and "Allele2 - Forward" to construct the .lgen file for plink. See this post:  -- Forward refers to dbSNP, Plus to 1000 Genomes. See #6 below.
  4. SNPs on the dbSNP reverse strand need to be flipped before merging with imputed SNPs, using the --flip option in plink.
  5. Allele 1 (in the allele 1/2 setting) needs to be forced to the reference allele before merging with imputed SNPs, using the --reference-allele option in plink.
  6. Important note: FinalReport.txt from Illumina's genotyping tool GenomeStudio can also export columns like "Allele1 - Forward", "Allele2 - Forward", "Allele1 - Plus", "Allele2 - Plus". I called Illumina tech support and confirmed that Forward refers to dbSNP and Plus to the 1000 Genomes (+) strand. Be cautious: Forward and +/- can change between dbSNP versions and genome builds. (The full description is here; login required.) To know which version your alleles refer to, you have to check the manifest file used by your base caller.

Friday, December 04, 2015

My note on multiple testing

It's no shame to take notes on something (probably) everyone knows and you thought you knew, but aren't actually 100% sure about. Multiple testing is such a piece of my knowledge map.

Some terms first:
- Type I error (false positive) and Type II error (false negative): 
When we do a hypothesis test, we can categorize the result into the following 2x2 table:
Table of error types:

                                       Null hypothesis (H0) is
                                       True                     False
  Judgement:   Reject H0               Type I error             Correct inference
                                       (False Positive)         (True Positive)
               Fail to reject H0       Correct inference        Type II error
                                       (True Negative)          (False Negative)
A Type I error is rejecting a true thing. If the true thing is the null hypothesis (H0), which usually states the default assumption (e.g. no difference, no effect), and you reject it (i.e. claim there is a difference), that is a false positive. Similar logic gives the Type II error, or false negative.

Also note that people use the Greek letter α for the Type I error rate and β for the Type II error rate. α is also the significance level of a test, e.g. 5%. So when a single test reaches a p-value of 0.05, we can intuitively understand that there is a 5% chance we are making a mistake, i.e. 5% of the cases we call significant are actually not. β relates to the power of a test: power = the probability of detecting a true positive among all truly positive cases.

- Sensitivity and Specificity
                                       Null hypothesis (H0) is
  Total tests (m)                      True        False
  Judgement:   Reject (R)              V           S
               Fail to reject          U           T

Sensitivity = S / (S+T) = power = 1-β
Specificity = U / (U+V) = 1-α

- Why does multiple testing matter?
It matters because we usually perform the same hypothesis test not just once, but many many times. If the chance of making an error in a single test is α, then the chance of making one or more errors in m independent tests is
Pr(at least one error) = 1 − (1 − α)^m
So when m is large, this chance approaches 100%. That's why we need to adjust the p-values for the number of hypothesis tests performed, i.e. control the Type I error rate.
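Plugging in α = 0.05 shows how quickly this blows up (a quick awk check):

```shell
# Pr(at least one false positive) = 1 - (1 - alpha)^m, for alpha = 0.05
awk 'BEGIN {
  a = 0.05
  n = split("1 10 100 1000", m, " ")
  for (i = 1; i <= n; i++)
    printf "m = %4d  Pr = %.3f\n", m[i], 1 - (1 - a)^m[i]
}'
```

Already at m = 100 the chance of at least one false positive is about 99.4%.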

- How to control type I error rate in multiple test?
There are many different ways to control the type I errors, such as
Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses, PCER = E(V)/m
Per-family error rate (PFER): the expected number of Type I errors, PFER = E(V).
Family-wise error rate (FWER): the probability of at least one type I error, FWER = P(V ≥ 1)
False discovery rate (FDR) is the expected proportion of Type I errors among the rejected hypotheses, FDR = E(V/R | R>0)P(R>0)
Positive false discovery rate (pFDR): the rate that discoveries are false, pFDR = E(V/R | R > 0)

- Controlling Family-Wise Error Rate
Many procedures have been developed to control the family-wise error rate P(V ≥ 1), including Bonferroni, Holm (1979), Hochberg (1988), and Sidak. They come in two types: single-step (e.g. Bonferroni) and sequential (e.g. Holm or Hochberg). The Bonferroni correction controls the overall Type I error when all tests are independent: it rejects any hypothesis with p-value ≤ α/m. Equivalently, multiply each nominal p-value by m to get the adjusted p-values. In R, it's the following function
p.adjust(p, method = "bonferroni")
Sequential corrections are slightly more powerful than Bonferroni. The Holm step-down procedure is the easiest to understand. First, sort your p-values from low to high; say there are a thousand. Multiply the smallest p-value by 1000; if that adjusted p-value is less than 0.05, the gene shows evidence of differential expression (so far, no different from Bonferroni). Then multiply the 2nd smallest p-value by 999 (not 1000) and compare it to 0.05; multiply the third smallest by 998, the fourth smallest by 997, and so on, comparing each adjusted p-value to 0.05. Finally, ensure that each adjusted p-value is at least as large as every preceding one; if it is not, set it equal to the largest of the preceding adjusted p-values. That is the Holm step-down procedure. In R, it's
p.adjust(p, method = "holm")

- Controlling FDR
FWER is appropriate when you want to guard against ANY false positives. However, in many cases (particularly in genomics) we can live with a certain number of false positives, and the more relevant quantity to control is the false discovery rate (FDR): the proportion of false positives (V) among the rejected hypotheses (R). FDR control has generated a lot of interest because it trades off error-rate control against power in a more balanced way than traditional family-wise error rate control.

Procedures controlling FDR include Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001), Benjamini & Hochberg (2000) and two-stage Benjamini & Hochberg (2006).

Here are the steps of the Benjamini & Hochberg FDR procedure:
1. Sort the nominal p-values from small to large: p1 ≤ p2 ≤ … ≤ pm.
2. Find the largest rank j with pj ≤ (j/m) × δ, where δ is the FDR level to control.
3. Declare the tests of rank 1, 2, …, j significant; the adjusted p-value for rank j is pj × m / j.
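The steps above can be sketched with sort + awk on toy p-values (in practice, just use p.adjust(p, method="BH") in R):

```shell
# BH-adjusted p-values: p_adj(i) = min over j >= i of p_(j) * m / j
printf '0.001\n0.01\n0.03\n0.04\n0.20\n' |
sort -n |
awk '{ p[NR] = $1 }
END {
  m = NR; cummin = 1
  for (i = m; i >= 1; i--) {   # walk from the largest p down,
    adj = p[i] * m / i         # taking the running minimum
    if (adj < cummin) cummin = adj
    padj[i] = cummin
  }
  for (i = 1; i <= m; i++) print p[i], padj[i]
}'
```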


Wednesday, November 18, 2015

DNA and RNA structure

Let's review the basics we learned in molecular biology! :) Actually, I wonder whether someone could design a toy for kids to learn DNA structure, something like a puzzle where each unit is a magnetic piece and pieces can only be assembled in certain orientations. That would be cool!


Each strand of the DNA double helix is a chain of bases linked to a sugar-phosphate backbone, and the two strands are connected by hydrogen bonds, as below:
The phosphate group links the 5' carbon of an upstream sugar to the 3' carbon of a downstream sugar. The 5-carbon sugar is called deoxyribose in DNA and ribose in RNA; the two are distinguished by the absence or presence of an oxygen atom at the 2' carbon, as highlighted in red in the following figure.

RNA structure is very similar to single-stranded DNA, except that:
  • The sugar in RNA is ribose (as opposed to deoxyribose) and has an –OH at the 2' carbon, highlighted in red in the figure below (DNA sugars have –H at that position).
  • Thymine in DNA is replaced by uracil in RNA; T has a methyl (–CH3) group in place of the H atom shown in red in U.

Now let's see the cap structure of RNA:

A cap structure is most commonly added to the 5' end of RNA, known as mRNA capping. The 5' cap, the brown part on the left of the following figure, is like a guanosine (G) nucleotide but with a methyl group added at the N7 position of the guanine. Its 5' end is connected to the 5' end of the RNA via a triphosphate bridge.

The following figure also contains a 2' cap, which occurs rarely, in some eukaryotic and viral genomes.

Note that mitochondrial mRNAs don't have caps.

Friday, October 23, 2015

Steps to push code to a github private repository

  • If you are a first-time user:
Step 1: Create a new repository on (assuming you already have a GitHub account), private or public. In the example below, the repository is named PREDICT-HD.

Step 2: [optional] Generate a public key and add it to your GitHub account.
Follow this help page exactly:

Step 3: Push the code to GitHub from your local computer:
cd ~/project/PREDICT-HD/src
git init
git add *
git commit -m "my first commit"

git config --global "Your Name"
git config --global ""
git remote add origin
git push origin master
  • If a collaborator has already pushed a commit before you push yours:
git add xxx
git commit -m ""
git pull
git push origin master

  • If you want to remove files from the git repo only, without physically deleting them from the file system:
git rm --cached file1.txt

Monday, October 05, 2015

A parallel and fast way to download multiple files

We can write a short script to download multiple files from the command line, e.g.

for i in X Y Z; do wget$i.url; done

If we want the downloads to run in the background (pseudo-parallel), we can use wget's -b option.

But this is still not fast enough, and parallelizing with wget -b gives no notice when a download is done.

Here is my solution: axel + parallel

parallel -a urls.file axel

Let's say I want to download all brain sample bigwig files of H3K4me1 marks from the Roadmap Epigenomics project. Here is the code:

mark=H3K4me1
> url.$mark   # generate an empty file
for i in E071 E074 E068 E069 E072 E067 E073 E070 E082 E081; do
  echo $i-$mark.pval.signal.bigwig >> url.$mark   # prepend the Roadmap download URL to each name
done
parallel -a url.$mark axel -n 5

Regarding what axel is and how fast it is compared to wget, please refer to this link:

Friday, September 11, 2015

PLINK2 vs. SNAP for linkage disequilibrium (LD) calculation

Among the ways to find SNPs in linkage disequilibrium (LD), two have been widely used in the literature: PLINK2 and SNAP.

With PLINK2, SNPs in LD can be computed with parameters like:
plink2 --r2 dprime --ld-window-kb 1000 --ld-window 10 --ld-window-r2 0.8
SNAP is a web-based service with pre-calculated data; you can set the population, distance limit, and r2 threshold.

Note that PLINK2 has one more control: --ld-window

In its manual (, it says that by default, when a limited window report is requested, "every pair of variants with at least (10-1) variants between them, or more than 1000 kilobases apart", is ignored.

That means PLINK2 limits SNP pairs by two kinds of radius: genomic distance (--ld-window-kb) and number of SNPs apart (--ld-window), while SNAP only has the distance limit. Both have the r2 threshold.

Sunday, August 16, 2015

Using ANOVA to get correlation between categorical and continuous variables

How to calculate the correlation between categorical variables and continuous variables?

This is the question I faced when checking the correlation of PEER-inferred factors vs. known covariates (e.g. batch).

One solution I found: use ANOVA to calculate the R-squared between a categorical input and a continuous output.

Here is my R code snippet:

## correlation of inferred factors vs. known factors
# name PEER factors
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))

# ref:
# sketch: fit a one-way ANOVA of y on x with aov(), then extract R-squared
# and the overall F-test p-value from summary.lm()
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y ~ x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    f <- summary.lm(aov(y ~ x))$fstatistic
    pf(f[1], f[2], f[3], lower.tail = FALSE)
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")

The core part is the aov() call: we can use the aov() function in R to run the ANOVA, and its result can be summarized with summary.lm(), which shows output like:

> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

The R-squared and p-value are shown at the end of the output.

Note: the summary.lm() object doesn't contain the p-value directly, but we can compute it from the F statistic (upper tail):

> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1],F[2],F[3],lower.tail=FALSE)

The table below is a nice summary of which method applies to which combination of data types:

X \ Y                 Categorical                         Continuous                                Mixture
Categorical           Chi-square, log-linear, logistic    t-test, ANOVA (analysis of variance)      Linear regression
Continuous            Logistic regression                 Linear regression, Pearson correlation
Mixture of both       Logistic regression                 Linear regression                         Analysis of covariance

Next, I need to refresh my memory on how the correlation calculated by cor() differs from the ANOVA method above.

I know the correlation coefficient r can be derived from the regression coefficient and the SDs of the two variables. For example, given sd(x) and sd(y), regressing y ~ x gives a regression line y = b0 + b1x, and then we can calculate r as

r = b1 * SDx / SDy

When x and y are standardized (mean 0, sd 1), r = b1.
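A quick numeric check of r = b1 * SDx / SDy, on made-up data (the x and y vectors are arbitrary):

```shell
# Compare Pearson r with b1 * SD(x)/SD(y) on toy data; the two should match.
awk 'BEGIN {
  n = split("1 2 3 4 5", x, " "); split("2 4 5 4 5", y, " ")
  for (i = 1; i <= n; i++) {
    sx += x[i]; sy += y[i]
    sxx += x[i]*x[i]; syy += y[i]*y[i]; sxy += x[i]*y[i]
  }
  b1 = (n*sxy - sx*sy) / (n*sxx - sx*sx)   # regression slope of y ~ x
  r  = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy))
  printf "r = %.4f  b1*SDx/SDy = %.4f\n", r, b1 * sqrt((n*sxx - sx*sx) / (n*syy - sy*sy))
}'
```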

Wednesday, August 12, 2015

use getline to capture system command output in awk

I just learnt this today: awk has its own pipe (|) and getline, just like the Unix shell. If I want to call a system command inside awk and capture its output (note: system() won't work, as it only returns the exit status), I can pipe the command's output to getline.

For example,

$cat > test.txt
aa bb cc
11 22 33
44 55 cc

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd | getline a; print a}'


$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; system(cmd);}'

Note that if we only state cmd by itself, nothing is printed: naming the string doesn't execute it (unlike system()).

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd;}'

If you want to capture multiple lines of output and process them in awk, you can do:

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | cut -f2 -d\" \""; while( (cmd | getline a) >0) print a;}'