Friday, September 13, 2019

Note on the case-insensitive file system in Mac OS

I recently noticed that when my script wrote a file (e.g. condition_pd_vs_hc.pdf) into the Linux server disk mounted in my Mac. It will override another existing file although with different file name (e.g. CONDITION_PD_vs_HC.pdf). It turned out that my Mac OS filesystem is HFS+ and the disk was formated as "Mac OS Extended" (by default). Unlike "Mac OS Extended (Case-sensitive, Journaled)", "Mac OS Extended" is not case-sensitive, but case-presertive. "This means that the file system will consider foo and FoO to be the same, but when you create a new file it will remember which letters where capitalized and which were not." (quoted from

I totally don't get the point to have a case-insensitive but case-preserving file system.  Bummer!

Monday, November 19, 2018

TERM_SWAP: job killed after reaching LSF swap usage limit.

When your job was killed due to this error, it means you've exceeded the swap usage set for the queue.

bsub -v option can set the limit:

   -v swap_limit
               Set the total process virtual memory limit to
               swap_limit for the whole job. The default is
               no limit. Exceeding the limit causes the job to

               By default, the limit is specified in KB. Use
               LSF_UNIT_FOR_LIMITS in lsf.conf to specify a
               larger unit for the limit (MB, GB, TB, PB, or EB).

Without administrator permission, one cannot change the lsf.conf. 

So as a user what you can do is to switch to use a different queue with larger swap limit. Here is the command to check the limit for a queue, e.g. 

$bqueues -l normal | grep -A 1 SWAP
0 M 8 G 12 G

Here is what I got from our cluster:

$ for i in `bqueues | grep -v QUEUE_NAME | cut -f1 -d' '`; do echo $i; bqueues -l $i | grep -A 1 SWAP; done
     39 G      47 G 
    488 G     586 G 
      0 M      23 G      29 G 
      0 M       2 T       2 T 
      0 M     498 G     500 G 
      0 M     498 G     500 G 
      0 M       8 G      12 G 
      0 M       9 G      12 G 
      0 M       8 G      10 G 
      0 M       8 G      10 G 
      0 M       9 G      12 G 
      0 M       8 G      12 G 
      0 M       8 G      10 G 
      0 M     248 G     254 G 
      0 M     117 G     125 G 
      0 M     250 G     250 G 
      0 M     250 G     250 G 
      0 M       4 G     250 G 
      0 M       8 G      12 G 
      0 M      12 G      12 G 

Monday, October 15, 2018

Making Art in R

Amazing artworks people made in R:

See their source code and more arts at:

Tuesday, September 25, 2018

PCA plot with fill, color, and shape all together

When I plotted the PCA results (e.g. scatter plot for PC1 and PC2) and was about to annotate the dataset with different covariates (e.g. gender, diagnosis, and ethic group), I noticed that it's not straightforward to annotate >2 covariates at the same time using ggplot.

Here is what works for me in ggplot:

pcaData <- plotPCA(vsd, intgroup = c( "Diagnosis", "Ethnicity", "Sex"), returnData = TRUE) # vsd and plotPCA are part of DESeq2 package, nothing with my example below. 
percentVar <- round(100 * attr(pcaData, "percentVar")) 
ggplot(pcaData, aes(x = PC1, y = PC2, color = factor(Diagnosis), shape = factor(Ethnicity))) + 
geom_point(size =3, aes(fill=factor(Diagnosis), alpha=as.character(Sex))) + 
geom_point(size =3) + 
scale_shape_manual(values=c(21,22)) + 
scale_alpha_manual(values=c("F"=0, "M"=1)) + 
xlab(paste0("PC1: ", percentVar[1], "% variance")) + 
ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
ggtitle("PCA of all genes, no covariate adjusted")

I also found that you can use the male and female symbol (♂ ♀) as shapes in your plot. Here is how:

df <- data.frame(x = runif(10), y = runif(10), sex = sample(c("m","f"), 10, rep = T)) 
qplot(x, y, data = df, shape = sex, size = I(5)) + 
scale_shape_manual(values = c("m" = "\u2642", f = "\u2640"))

I've not figured out a way to combine the two ideas above.

Friday, June 15, 2018

About snRNAs and snoRNAs

Some quick survey about snRNAs and snoRNAs:

- snRNA

small nuclear RNAs, ~150nt,  also called uRNA (due to rich of uridylic residues), functions in RNA splicing
There are 1916 snRNAs in the human genome. Their length distribution as below:
 20  1
 40 * 20
 60 ********** 209
 80 ****** 129
100 ************************************************************ 1217
120 ***** 95
140 **** 83
160 ****** 116
180 ** 42
200  3
220  1

- snoRNAs

small nucleolar RNAs, functions in guiding the modification of other ncRNAs, e,g. mainly rRNAs, snRNAs, tRNA. 
snoRNAs form cluster and 
In human genome (GENCODE v19), there are 1457 snoRNA genes. Length distribution as below:
 40 * 9
 60 *********************************** 252
 80 *************************************** 275
100 ************************************************************ 426
120 ************************************************ 342
140 ******** 59
160 ** 11
180 *** 18
200 ******** 56
220  3
240  1
260 * 4
280  0
300  0
320  1

Some of them have many copies and cluster in a imprinting gene locus (, e.g.  SNORD113 and SNORD114 are clustered and located in intron of MEG8, a long ncRNA only maternally expressed. 
And the number of snoRNA copies in the cluster is associated with diseases. e.g. Loss of the 29 copies of SNORD116 (HBII-85) from this region has been identified as a cause of Prader-Willi syndrome whereas gain of additional copies of SNORD115 has been linked to autism.

Friday, June 08, 2018

Download all KEGG pathway KGML files for SPIA analysis

Most people know KEGG pathway, but not everyone knows that it costs at least $2000 to subscribe its database. If you want to save the cost a bit, you can manually download the KEGG pathway KGML files and install in SPIA. Here I have a workaround to download all KEGG pathway files using their REST API.
## Claim: this is my personal trick. I recommend people to subscribe their KEGG FTP download to support the authors.

# change folder to the folder where your SPIA installed
cd /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SPIA
# download all XML files for all human pathway
curl -s | awk '{split($1,a,":"); print "curl -s"a[2]"/kgml -o extdata/keggxml/hsa/"a[2]".xml"}' | bash

# then switch to R console


Monday, March 26, 2018

Brace expansion

Let's say I want to download the chromHMM result for all brain tissues from Roadmap project. The brain tissues are numbers from E067-E074, E081, and E082 according to Anshul's table:

So, what's the wildcard to match all the ten brain tissues? E067-E074, E081, and E082

Here is the trick:


There are two types of brace expansion in Bash:
1. String list: {string1,string2,...,stringN}
2. Range list: {<START>..<END>}

You can combine them or nest them, for example
{{67..74},81,81} ==> 67,68,69,70,71,72,73,74,81,82
{A..C}{0..2}  ==> A0, B0, C0, A1, B1, C1, A2, B2, C2
{{A..C},{0..2}}  ==> A, B, C, 0, 1, 2

In the new Bash 4.0, you can even do more, for example
- padding them, e.g. {0001..5} ==> 0001 0002 0003 0004 0005
- specify an increment using ranges, e.g. {1..10..2} ==> 1 3 5 7 9; {a..z..3} ==> a d g j m p s v y
- Using a prefix: 0{1..5} ==> 01 02 03 04 05
- or postfix: ---{A..E}---  ==> ---A--- ---B--- ---C--- ---D--- ---E---

Anyone understand the fun of code below:

function braceify {
    [[ $1 == +([[:digit:]]) ]] || return
    typeset -a a
    read -ra a < <(factor "$1")
    eval "echo $(printf '{$(printf ,%%.s {1..%s})}' "${a[@]:1}")"

printf 'eval printf "$arg"%s' "$(braceify 1000000)"

See explain here:

Thursday, September 07, 2017

try to be consistent

Worthy to note the lesson I learned today: try to be consistent when sorting with different chromosome order.

I got the same error message as the post:

In short, the chromosome order in my sorted BAM file header is like this chr10, chr11, chr12... chr19, chr1, chr20, ... chr22, chr2, chr3 ....chrM, chrX, chrY. The order is determined by the aligner which generates @SQ headers based on how the reference sequences are arranged in the FASTA file from which you built its aligner indices. That means, the order in @SQ is same as the order in your genome.fa.fai (index) file.

Thursday, July 20, 2017

Educational papers in Bioinformatics from Nature Methods and Nature Biotechnology

Nature Methods and Nature Biotechnology have this super-great monthly column in bioinformatics for years. For example, Nature Methods has both the "Points of Significance" covering topics in statistics and "Points of View" in data visualization. And Nature Biotechnology's Primer column covers a collection of topics in Bioinformatics, such as "What's a hidden Markov model?". 

Today I've compiled all the articles from above three columns together and made a PDF for you. Here is the link

Monday, May 15, 2017

Kingdoms and domains of life

Notes of revisiting the topic:

  1. Linnaeus originally placed all living things into either plant or animal kingdom, and nowaday we place living things (e.g. not virus) into six (Animalia 动物界, Plantae 植物界, Fungi 真菌界, Protista 原生生物界, Archaea/Archaeabacteria 古细菌界, and Bacteria/Eubacteria 细菌界in US's textbook) or five (Animalia, Plantae, Fungi, Protista and Monera in other countries) kingdoms.
  2. Protista is any eukaryotic organism that is not an animal, plant, or fungi. 原生生物是除动物植物真菌之外的一切真核生物,包括海带、紫菜、藻类、变形虫等。所以海带不是植物,长知识了!
  3. Archaea/Archaeabacteria 古细菌界 vs. Bacteria/Eubacteria 细菌界: Archaea (/ar-kee/, ending sounds like IKEA) and Bacteria are both prokaryotes. Archaea is similar to Bacteria in structure (e.g. No membrane-bound organelles or nucleus, circular chromosome), but is closer to eukaryote in biochemistry (e.g. similar transcription and translation to Eukarya, e.g. TFs bound to the promoter, RNA pol II). Archaea was originally found in the extreme environment, e.g. extreme hot spring in Yellowstone, but now it's found everywhere in the world, e.g. in the gut of a human. 
  4. Not all extremophiles are Archaeons. Thermus aquaticus, originally found in Yellowstone National Park, is not an Archaea, but a thermophilic Bacteria. The famous Taq enzyme for PCR is from the bacteria. T. aquaticus 最初是由Thomas Brock在1969年发现的,1976年DNA聚合酶polymerase 首次从T. aquaticus中分离出来,1988年Kary Mullis等发现并提纯了这个酶,指出其可用于PCR,5年后(1993)因此获得诺贝尔化学奖。
  5. Six kingdoms are grouped into three domains (Bacteria, Archaea, and Eukarya), although both Bacteria and Archaea are Prokarya (原核,无核膜). The main characteristics that scientists focus on when determining which kingdom an organism belongs to are summarized below:
       • Cell Type: Organisms are either prokaryotic or eukaryotic (有无核膜).
       • Cell Walls: Some organisms have cell walls and others do not (有无细胞壁).
       • Body Type: Organisms can be unicellular or multicellular (单细胞 or 多细胞).
       • Nutrition: Many organisms are autotrophs and others are heterotrophs, respectively depending on whether they make their own food or receive nutrients by consuming other organisms (自养 or 它养.

Tuesday, January 24, 2017

Simple note for github vs. Rstudio

Having used both Rstudio and GitHub for a while, but never thought to combine them together. Hadley Wickham already put a detailed instruction on how to link GitHub with Rstudio ( I just list few quick note from my own experience.

1. The "project" concept of Rstudio is the key. I was not very organized before, very often just working on multiple files from different projects at the same time. Actually it's very distractive. It seems that only files organized into a project can be pushed to GitHub via Rstudio. Very good!

2. How to install R package in private GitHub repo? The trick is here.
## add your PAT (Personal Access Token, get here: into your ~/.Renviron file
# echo "GITHUB_PAT=xxxxxxxxxxYOUR-PAT-GO-HERExxxxxxxxxx" >> ~/.Renviron
# library(devtools);install_github("sterding/powerEQTL")

Output, Reads, Cluster Numbers etc. for Illumina

Tuesday, December 20, 2016

remove and disable .DS_Store in mounted disk

Do you see this when you want to extend your path by typing the TAB key? What are those .DS_Store files?

They are generated by MacOS (see [1] for detail). Because I mount my remote disk via SMB, MacOS Finder will generate such a configure file whenever you open the mounted folder in Finder locally.

It's harmless to remove those files [2].

How to remove those already in your folder? [3]
find ./ -name ".DS_Store" -depth -exec rm {} \;

How to stop generating .DS_Store file on Network drive? [4]
defaults write DSDontWriteNetworkStores -boolean true


Wednesday, November 09, 2016


Regarding the question what's IBD and IBS, here is my understanding:

A SNP (or a segment of DNA) is identical by state (IBS) if they have the same alleles in two of more individuals. For example, if person A has genotype A/C at a SNP position while person B has A/T at the position, then that SNP position has a IBS=1 status. IBD is a IBS that their shared alleles are from a common ancestor. That is, in my above example, if A and B’s parents have genotype A/C(mom) and A/T (dad), then we don’t know if the shared “A” allele in A and B came from mom or dad. In that case, it’s just an IBS, not an IBD. But if their parents’ genotypes are A/G (mom) and C/T (dad), then we know for largely sure the “A” is from mom, not from dad. So, it’s both an IBD and IBS. In practice, for a pair of individuals, without knowing their parents, we could estimate their IBD likelihood for each SNP using their IBS information. I think the math behind is Bayesian theorem. 

PLINK use IBS-based clustering to remove sample outliers (which have too-far IBS distance from the rest), see here:
PLINK use IBD to detect samples too-close to each other (e.g. relatives, contamination). See here:

Below is a nice piece for how to use IBD to detect relative. From Anderson, C. et al. Data quality control in genetic case-control association studies. Nat Protoc. 5, 1564–1573 (2010).
The expectation is that IBD = 1 for duplicates or monozygotic twins, IBD = 0.5 for first-degree relatives, IBD = 0.25 for second-degree relatives and IBD = 0.125 for third- degree relatives. Owing to genotyping error, LD and population structure, there is often some variation around these theoretical values and it is typical to remove one individual from each pair with an IBD value of > 0.1875, which is halfway between the expected IBD for third- and second-degree relatives. For these reasons an IBD value of > 0.98 identifies duplicates.

Another review on this is also worthy reading: Joseph E. Powell, Peter M. Visscher & Michael E. Goddard. Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics 11, 800-805 (2010)

The following slide (from could be easier to understand IBD vs. IBS:

Friday, July 08, 2016

Best way to draw heatmap for publication

Here are two tips I can share if you were also working on a big dataset towards a high quality heatmap:

1. Don't generate PDF using pheatmap() or heatmap.2() as (i) the file is unnecessarily SUPER large if you have a lot of data points in the heatmap, so that you can kill your Illustrator; (ii) annoying grey boxes added to the grip (see here). Use basic image() with zero margins (e.g. par(mar=c(0,0,0,0))) to generate high-resolution PNG (or JPEG, TIFF) and place in Adobe Illustrator. You can freely add legend/annotation there easily.

2. When you use image(), rotate your matrix 90 degree clockwise first, so that the conventional printed layout of the matrix is same what you see in the image (e.g. top-left corner --> x(1,1) in your matrix etc.). An elegant piece of code for clockwise rotation can be found here:
rotate <- function(x) t(apply(x, 2, rev))

Tuesday, February 16, 2016

My 15 practical tips for a bioinformatician

Tips below are based on the lessons I learnt from making mistakes during my years of research. It's purely personal opinion. Order doesn't mean anything. If you think I should include something else, please comment below.
  1. Always set a seed number when you run tools with random option, e.g. bedtools shuffle, random etc.; You (or your boss in a day) want your work reproducible. 
  2. Set your own temporary folder (via --tmp, $TMPDIR etc., depending on your program). By default, many tools, e.g. sort, use the system /tmp as temporary folder, which may have limited quote that is not enough for your big NGS data. 
  3. Always use a valid name for your variables, column and rows of data frame. Otherwise, it can bring up unexpected problem, e.g. a '-' in the name will be transferred to '.' in R unless you specify check.names=F.
  4. Always make a README file for the folder of your data; For why and how, read this:
  5. Always comment your code properly, for yourself and for others, as you very likely will read your ugly code again.
  6. Always backup your code timely, using github, svn, Time Machine, or simply copy/paste whatever.
  7. Always clean up the intermediate or unnecessary data, as you can easily shock your boss and yourself by generating so much data (and perhaps most of them are useless). 
  8. Don't save into *.sam if you can use *.bam. Always zip your fastq (and other large plain files) as much as you. This applies to other file format if you can use the compressed one. As you cannot imagine how much data (aka "digital garbage") you will generate soon.
  9. Using parallel as much as you can, e.g. using "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting (, as multi-core CPU/GPU is common nowaday.
  10. When a project is completed, remember to clean up your project folder, incl. removing the unnecessary code/data/intermediate files, and burn a CD or host it somewhere in cloud for the project. You never know when you, your boss or your collaborators will need the data again;
  11. Make your code sustainable as possible as you can. Remember the 3 major features of OOP: Inheritance, Encapsulation, Polymorphism. (URL)
  12. When you learn some tips from others by Google search, remember to list the URL for future reference and also for acknowledging others' credit. This applies to this post, of course :)
  13. Keep learning, otherwise you will be out soon. Just like the rapid development of NGS techniques, computing skills are also evolving quickly. Always catch up with the new skills/tools/techniques.
  14. When you learn some tips from others, remember to share something you learned to the community as well, as that's how the community grows healthily. 
  15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians who usually sit in front of computer for hours. Only good health can last your career long. More reading:

Thursday, January 21, 2016

Continuous tab in indentation could cause problem

When I tested the following code in terminal and got an error at the 2nd echo. It asked to list all files in the current folder, e.g. “Display all 148 possibilities? (y or n)”, rather than moving forwards. I cannot tell any error in the code.

for i in 1 2 3 4 5
    echo $i
    for j in eRNA promoter exon random
        echo $j

I figure it out when I display all the white characters, as shown below:

enter image description here

Do you see the difference?

The “working” code, which I copied from the Linux document ( has different characters for the nested indentation (-> .. means a tab and two spaces) than the two tabs in the “not working” part (which I typed in Rstudio). I realized that it’s because the tab-autocompletion feature in Linux. Pressing tab a second time will tell the shell to provide a list of matching files.

To avoid the problem, simply click on the “Insert spaces for tab” option in your editor. I used Rstudio, here is where you can find the option:

Tuesday, January 12, 2016

Calculate the odd of winning Powerball in R

This Wednesday’s Powerball grand prize already climbed up to $1.5 BILLION. If you choose to cash out, it would be $930 million. And it keeps increasing…
So, what’s the odd of winning the jackpot prize?
Here is the game rule according to

…we draw five white balls out of a drum with 69 balls and one red ball
out of a drum with 26 red balls.

We can calculate the total number of different combinations in R:

> choose(69,5)*26
[1] 292201338

If we are super super lucky to win the Jackpot of $930 million cash value, given that we have to pay 39.6% as federal tax, how much we expect to return for a $2 investment? (Of course, everyone expect to win the $1.5 billion jackpot)

> 930e6*(1-0.396)/(choose(69,5)*26)
[1] 1.92

Actually it’s not a good investment. (Thanks for the comment below. I made a mistake; 930million should be 930e6, not 930e9).
If we want to be 100% guaranteed, we have to buy all 292 million combinations. In that case, can we earn?
enter image description here

# this is what we pay
> choose(69,5)*26*2
[1] 584,402,676

# this is what we earn in total, before tax
> 930000000 + 1000000*choose(25,1) + 50000*choose(5,4)*choose(69-5,1) + 100*choose(5,4)*choose(69-5,1)*choose(25,1) + 100*choose(5,3)*choose(69-5,2) + 7*choose(5,3)*choose(69-5,2)*choose(25,1) + 7*choose(5,2)*choose(69-5,3) + 4*choose(5,1)*choose(69-5,4) + 4*choose(69-5,5)
[1] 1,023,466,048

# Nearly $5 billion!!! Then we need to pay 40% of tax. Maybe not for the minor prize, let's simplify it for all.
> 1023466048 * (1-0.396)
[1] 618,173,493

That’s still more than what we paid. Why don’t we do that?
Remember, people share the prize if multiple persons got the same winning number, which we don’t know. :D

Just some fun! :)

1. Fascinating Math Behind Why You Won’t Win Powerball
2. Tax for lottery (

Wednesday, December 09, 2015

Dopamine is the "Kim Kardashian of molecule" in brain -- I like this!

The main dopaminergic pathways of the human brain. (Wikipedia)

, and it's interesting to know the impressive people in the Kardashian family. 

Kim Kardashian's step father, Bruce Jenner (and now Caitlyn Jenner) 


SNP allele coding schema

There are three types of different SNP allele coding schemes (so far). No idea why the scientists have to make things so confusing sometimes...

Here you go:
  • TOP/BOT from Illumina: Simply saying, for unambiguous cases (incl. A/G, A/C, T/G, T/C), A is always allele A (or allele 1) and on TOP strand, and T is also allele A (or allele 1) but on BOT strand. The other variation is allele B (or allele 2). For ambiguous cases (e.g. A/T and G/C), strand is easy to assign, but allele A and B is ambiguous. So, Illumina use a 'sequence walking' to assign allele A/B by the first unambiguous pair of the flanking sequence. Detail explanation in their tech doc
  • Forward/Reverse for dbSNP: dbSNP use the orientation determined by mapping SNP flanking sequence (of the longest submission) to a contig sequence using BLAST. A refSNP can have multiple submissions, each of which might have their own orientation depending on whether the flanking sequence is from plus strand or minus strand. The submission with the longest flanking will be taken as instantiate sequence for the refSNP during BLAST analysis for the current build. So, the strand of a dbSNP refSNP is determined by (1) the sequence of the longest submission (or the exemplar), and (2) the strand of contig it maps to. A good example can be found here. See more details here:
  • Plus/Minus strand for Genome Annotation, HapMap, 1000 Genome Project: The plus (+) strand is defined as the strand with its 5′ end at the tip of the short arm (see Ref #3 below). SNP alleles reported on the same strand as the (+) strand are called ‘plus’ alleles and those on the (−) strand are called ‘minus’ alleles. This also applies to the imputed results from 1000 Genome, e.g. a0 is for REF (reference) allele and a1 for ALT (alternate) allele
Several notes:
  1. dbSNP does not use TOP/BOT schema:
  2. Minor/Major alleles are just a relative term. They can change between different populations/dataset. So, it really makes no sense to use/think that allele2 is minor allele, G in A>G is minor allele, or so. Minor allele can be reference allele. 
  3. In the Illumina genotyping final report (e.g. "FinalReport.txt"), use "Allele1 - Forward" and "Allele2 - Forward" to construct .lgen file for plink. See post:  -- Forward is from dbSNP, Plus is from 1000 Genome. See #6 below. 
  4. SNPs from dbSNP reverse strand need to flip before merging with imputed SNPs, using flip_strand option in plink.
  5. Allele 1 (in allele 1/2 setting) need to be forced to reference allele before merging with imputed SNPs, using --reference-allele option in plink.
  6. Important note: The FinalReport.txt from Illumina genotype tool GenomeStudio can also export other columns like "Allele1 - Forward", "Allele2 - Forward", "Allele1 - Plus", "Allele2 - Plus". I called Illumina tech that the Forward is for dbSNP, and Plus is for 1000 Genome. Be cautious that dbSNP and +/- can also be change between different version of dbSNP or genome building. (Full description is here, login required). To get to know which version your allele is referred to, you have to get the manifest file from your base caller.