Monday, June 08, 2020

Be careful of "sort -k1,1 -k2,2n -u"

When you attempted to sort and extract the unique genomic regions using "sort -k1,1 -k2,2n -u", you might make a mistake by missing the region with the same chr and start, but different end position.

The right way should be  "sort -k1,1 -k2,2n -k3,3n -u" or  "sort -k1,1 -k2,2n | sort -u"

Tuesday, April 14, 2020

Innate vs. Adaptive Immunity



Innate Immunity

Adaptive immunity

1.PresenceInnate immunity is something already present in the body.Adaptive immunity is created in response to exposure to a foreign substance.
3.ResponseFights any foreign invaderFight only specific infection
4.ResponseRapidSlow (1-2 weeks)
5.PotencyLimited and Lower potencyHigh potency
6.Time spanOnce activated against a specific type of antigen, the immunity remains throughout the life.The span of developed immunity can be lifelong or short.
7.InheritanceInnate type of immunity is generally inherited from parents and passed to offspring.Adaptive immunity is not passed from the parents to offspring, hence it cannot be inherited.
8.MemoryCannot react with equal potency upon repeated exposure to the same pathogen.Adaptive system can remember the specific pathogens which have encountered before.
9.PresencePresent at birthDevelops during a person’s lifetime and can be short-lived.
10.Allergic ReactionNoneImmediate and Delay hypersensitivity
11.Used AgainstFor microbesMicrobes and non-microbial substances called antigens
12.MemoryNo memoryLong term memory
14.SpeedFaster responseSlower response
15.Complement system activationAlternative and lectin pathwaysClassical pathway
16.Anatomic and physiological barriersSkin, Mucous membranes, Temp, pH, chemicals, etc.Lymph nodes, spleen, mucosal associated lymphoid tissue.
17.CompositionThe innate immune system is composed of physical and chemical barriers, phagocytic leukocytes, dendritic cells, natural killer cells, and plasma proteins.Adaptive immune system is composed of B cells and T cells.
18.DevelopmentEvolutionary, older and is found in both vertebrates and invertebrates.Adaptive immunity system has been developed recently and is found only in the vertebrates.
19.ExampleWhite blood cells fighting bacteria, causing redness and swelling, when you have a cut.Chickenpox vaccination so that we don’t get chickenpox because adaptive immunity system has remembered the foreign body.

Monday, December 16, 2019

Illustrator: how to fill shapes with a 45 degree line pattern?

Use a pattern ...
There are a bunch of line patterns loaded with Illustrator by default (Open Swatch Library → Patterns → Basic Graphics → Basic Graphics Lines).
You can use them as a second fill using the appearance panel and use blending etc to get the effect you want. You can add a Transform effect to that specific fill (make sure to check "Transform Patterns") to get the rotation & scale you want:
enter image description here
...and the same with a different blending mode:
enter image description here
If the default line patterns don't work for you then you can, of course, make your own pattern; which should be as easy as creating a small section of the lines you want (you could do it with a single line if you really wanted to) and dragging them to the Swatches panel, then double-clicking to enter the pattern editor:
enter image description here
Read more here:

Friday, September 13, 2019

Note on the case-insensitive file system in Mac OS

I recently noticed that when my script wrote a file (e.g. condition_pd_vs_hc.pdf) into the Linux server disk mounted in my Mac. It will override another existing file although with different file name (e.g. CONDITION_PD_vs_HC.pdf). It turned out that my Mac OS filesystem is HFS+ and the disk was formated as "Mac OS Extended" (by default). Unlike "Mac OS Extended (Case-sensitive, Journaled)", "Mac OS Extended" is not case-sensitive, but case-presertive. "This means that the file system will consider foo and FoO to be the same, but when you create a new file it will remember which letters where capitalized and which were not." (quoted from

I totally don't get the point to have a case-insensitive but case-preserving file system.  Bummer!

Monday, November 19, 2018

TERM_SWAP: job killed after reaching LSF swap usage limit.

When your job was killed due to this error, it means you've exceeded the swap usage set for the queue.

bsub -v option can set the limit:

   -v swap_limit
               Set the total process virtual memory limit to
               swap_limit for the whole job. The default is
               no limit. Exceeding the limit causes the job to

               By default, the limit is specified in KB. Use
               LSF_UNIT_FOR_LIMITS in lsf.conf to specify a
               larger unit for the limit (MB, GB, TB, PB, or EB).

Without administrator permission, one cannot change the lsf.conf. 

So as a user what you can do is to switch to use a different queue with larger swap limit. Here is the command to check the limit for a queue, e.g. 

$bqueues -l normal | grep -A 1 SWAP
0 M 8 G 12 G

Here is what I got from our cluster:

$ for i in `bqueues | grep -v QUEUE_NAME | cut -f1 -d' '`; do echo $i; bqueues -l $i | grep -A 1 SWAP; done
     39 G      47 G 
    488 G     586 G 
      0 M      23 G      29 G 
      0 M       2 T       2 T 
      0 M     498 G     500 G 
      0 M     498 G     500 G 
      0 M       8 G      12 G 
      0 M       9 G      12 G 
      0 M       8 G      10 G 
      0 M       8 G      10 G 
      0 M       9 G      12 G 
      0 M       8 G      12 G 
      0 M       8 G      10 G 
      0 M     248 G     254 G 
      0 M     117 G     125 G 
      0 M     250 G     250 G 
      0 M     250 G     250 G 
      0 M       4 G     250 G 
      0 M       8 G      12 G 
      0 M      12 G      12 G 

Monday, October 15, 2018

Making Art in R

Amazing artworks people made in R:

See their source code and more arts at:

Tuesday, September 25, 2018

PCA plot with fill, color, and shape all together

When I plotted the PCA results (e.g. scatter plot for PC1 and PC2) and was about to annotate the dataset with different covariates (e.g. gender, diagnosis, and ethic group), I noticed that it's not straightforward to annotate >2 covariates at the same time using ggplot.

Here is what works for me in ggplot:

pcaData <- plotPCA(vsd, intgroup = c( "Diagnosis", "Ethnicity", "Sex"), returnData = TRUE) # vsd and plotPCA are part of DESeq2 package, nothing with my example below. 
percentVar <- round(100 * attr(pcaData, "percentVar")) 
ggplot(pcaData, aes(x = PC1, y = PC2, color = factor(Diagnosis), shape = factor(Ethnicity))) + 
geom_point(size =3, aes(fill=factor(Diagnosis), alpha=as.character(Sex))) + 
geom_point(size =3) + 
scale_shape_manual(values=c(21,22)) + 
scale_alpha_manual(values=c("F"=0, "M"=1)) + 
xlab(paste0("PC1: ", percentVar[1], "% variance")) + 
ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
ggtitle("PCA of all genes, no covariate adjusted")

I also found that you can use the male and female symbol (♂ ♀) as shapes in your plot. Here is how:

df <- data.frame(x = runif(10), y = runif(10), sex = sample(c("m","f"), 10, rep = T)) 
qplot(x, y, data = df, shape = sex, size = I(5)) + 
scale_shape_manual(values = c("m" = "\u2642", f = "\u2640"))

I've not figured out a way to combine the two ideas above.

Friday, June 15, 2018

About snRNAs and snoRNAs

Some quick survey about snRNAs and snoRNAs:

- snRNA

small nuclear RNAs, ~150nt,  also called uRNA (due to rich of uridylic residues), functions in RNA splicing
There are 1916 snRNAs in the human genome. Their length distribution as below:
 20  1
 40 * 20
 60 ********** 209
 80 ****** 129
100 ************************************************************ 1217
120 ***** 95
140 **** 83
160 ****** 116
180 ** 42
200  3
220  1

- snoRNAs

small nucleolar RNAs, functions in guiding the modification of other ncRNAs, e,g. mainly rRNAs, snRNAs, tRNA. 
snoRNAs form cluster and 
In human genome (GENCODE v19), there are 1457 snoRNA genes. Length distribution as below:
 40 * 9
 60 *********************************** 252
 80 *************************************** 275
100 ************************************************************ 426
120 ************************************************ 342
140 ******** 59
160 ** 11
180 *** 18
200 ******** 56
220  3
240  1
260 * 4
280  0
300  0
320  1

Some of them have many copies and cluster in a imprinting gene locus (, e.g.  SNORD113 and SNORD114 are clustered and located in intron of MEG8, a long ncRNA only maternally expressed. 
And the number of snoRNA copies in the cluster is associated with diseases. e.g. Loss of the 29 copies of SNORD116 (HBII-85) from this region has been identified as a cause of Prader-Willi syndrome whereas gain of additional copies of SNORD115 has been linked to autism.

Friday, June 08, 2018

Download all KEGG pathway KGML files for SPIA analysis

Most people know KEGG pathway, but not everyone knows that it costs at least $2000 to subscribe its database. If you want to save the cost a bit, you can manually download the KEGG pathway KGML files and install in SPIA. Here I have a workaround to download all KEGG pathway files using their REST API.
## Claim: this is my personal trick. I recommend people to subscribe their KEGG FTP download to support the authors.

# change folder to the folder where your SPIA installed
cd /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SPIA
# download all XML files for all human pathway
curl -s | awk '{split($1,a,":"); print "curl -s"a[2]"/kgml -o extdata/keggxml/hsa/"a[2]".xml"}' | bash

# then switch to R console


Monday, March 26, 2018

Brace expansion

Let's say I want to download the chromHMM result for all brain tissues from Roadmap project. The brain tissues are numbers from E067-E074, E081, and E082 according to Anshul's table:

So, what's the wildcard to match all the ten brain tissues? E067-E074, E081, and E082

Here is the trick:


There are two types of brace expansion in Bash:
1. String list: {string1,string2,...,stringN}
2. Range list: {<START>..<END>}

You can combine them or nest them, for example
{{67..74},81,81} ==> 67,68,69,70,71,72,73,74,81,82
{A..C}{0..2}  ==> A0, B0, C0, A1, B1, C1, A2, B2, C2
{{A..C},{0..2}}  ==> A, B, C, 0, 1, 2

In the new Bash 4.0, you can even do more, for example
- padding them, e.g. {0001..5} ==> 0001 0002 0003 0004 0005
- specify an increment using ranges, e.g. {1..10..2} ==> 1 3 5 7 9; {a..z..3} ==> a d g j m p s v y
- Using a prefix: 0{1..5} ==> 01 02 03 04 05
- or postfix: ---{A..E}---  ==> ---A--- ---B--- ---C--- ---D--- ---E---

Anyone understand the fun of code below:

function braceify {
    [[ $1 == +([[:digit:]]) ]] || return
    typeset -a a
    read -ra a < <(factor "$1")
    eval "echo $(printf '{$(printf ,%%.s {1..%s})}' "${a[@]:1}")"

printf 'eval printf "$arg"%s' "$(braceify 1000000)"

See explain here:

Thursday, September 07, 2017

try to be consistent

Worthy to note the lesson I learned today: try to be consistent when sorting with different chromosome order.

I got the same error message as the post:

In short, the chromosome order in my sorted BAM file header is like this chr10, chr11, chr12... chr19, chr1, chr20, ... chr22, chr2, chr3 ....chrM, chrX, chrY. The order is determined by the aligner which generates @SQ headers based on how the reference sequences are arranged in the FASTA file from which you built its aligner indices. That means, the order in @SQ is same as the order in your genome.fa.fai (index) file.

Thursday, July 20, 2017

Educational papers in Bioinformatics from Nature Methods and Nature Biotechnology

Nature Methods and Nature Biotechnology have this super-great monthly column in bioinformatics for years. For example, Nature Methods has both the "Points of Significance" covering topics in statistics and "Points of View" in data visualization. And Nature Biotechnology's Primer column covers a collection of topics in Bioinformatics, such as "What's a hidden Markov model?". 

Today I've compiled all the articles from above three columns together and made a PDF for you. Here is the link

Monday, May 15, 2017

Kingdoms and domains of life

Notes of revisiting the topic:

  1. Linnaeus originally placed all living things into either plant or animal kingdom, and nowaday we place living things (e.g. not virus) into six (Animalia 动物界, Plantae 植物界, Fungi 真菌界, Protista 原生生物界, Archaea/Archaeabacteria 古细菌界, and Bacteria/Eubacteria 细菌界in US's textbook) or five (Animalia, Plantae, Fungi, Protista and Monera in other countries) kingdoms.
  2. Protista is any eukaryotic organism that is not an animal, plant, or fungi. 原生生物是除动物植物真菌之外的一切真核生物,包括海带、紫菜、藻类、变形虫等。所以海带不是植物,长知识了!
  3. Archaea/Archaeabacteria 古细菌界 vs. Bacteria/Eubacteria 细菌界: Archaea (/ar-kee/, ending sounds like IKEA) and Bacteria are both prokaryotes. Archaea is similar to Bacteria in structure (e.g. No membrane-bound organelles or nucleus, circular chromosome), but is closer to eukaryote in biochemistry (e.g. similar transcription and translation to Eukarya, e.g. TFs bound to the promoter, RNA pol II). Archaea was originally found in the extreme environment, e.g. extreme hot spring in Yellowstone, but now it's found everywhere in the world, e.g. in the gut of a human. 
  4. Not all extremophiles are Archaeons. Thermus aquaticus, originally found in Yellowstone National Park, is not an Archaea, but a thermophilic Bacteria. The famous Taq enzyme for PCR is from the bacteria. T. aquaticus 最初是由Thomas Brock在1969年发现的,1976年DNA聚合酶polymerase 首次从T. aquaticus中分离出来,1988年Kary Mullis等发现并提纯了这个酶,指出其可用于PCR,5年后(1993)因此获得诺贝尔化学奖。
  5. Six kingdoms are grouped into three domains (Bacteria, Archaea, and Eukarya), although both Bacteria and Archaea are Prokarya (原核,无核膜). The main characteristics that scientists focus on when determining which kingdom an organism belongs to are summarized below:
       • Cell Type: Organisms are either prokaryotic or eukaryotic (有无核膜).
       • Cell Walls: Some organisms have cell walls and others do not (有无细胞壁).
       • Body Type: Organisms can be unicellular or multicellular (单细胞 or 多细胞).
       • Nutrition: Many organisms are autotrophs and others are heterotrophs, respectively depending on whether they make their own food or receive nutrients by consuming other organisms (自养 or 它养.

Tuesday, January 24, 2017

Simple note for github vs. Rstudio

Having used both Rstudio and GitHub for a while, but never thought to combine them together. Hadley Wickham already put a detailed instruction on how to link GitHub with Rstudio ( I just list few quick note from my own experience.

1. The "project" concept of Rstudio is the key. I was not very organized before, very often just working on multiple files from different projects at the same time. Actually it's very distractive. It seems that only files organized into a project can be pushed to GitHub via Rstudio. Very good!

2. How to install R package in private GitHub repo? The trick is here.
## add your PAT (Personal Access Token, get here: into your ~/.Renviron file
# echo "GITHUB_PAT=xxxxxxxxxxYOUR-PAT-GO-HERExxxxxxxxxx" >> ~/.Renviron
# library(devtools);install_github("sterding/powerEQTL")

Output, Reads, Cluster Numbers etc. for Illumina

Tuesday, December 20, 2016

remove and disable .DS_Store in mounted disk

Do you see this when you want to extend your path by typing the TAB key? What are those .DS_Store files?

They are generated by MacOS (see [1] for detail). Because I mount my remote disk via SMB, MacOS Finder will generate such a configure file whenever you open the mounted folder in Finder locally.

It's harmless to remove those files [2].

How to remove those already in your folder? [3]
find ./ -name ".DS_Store" -depth -exec rm {} \;

How to stop generating .DS_Store file on Network drive? [4]
defaults write DSDontWriteNetworkStores -boolean true


Wednesday, November 09, 2016


Regarding the question what's IBD and IBS, here is my understanding:

A SNP (or a segment of DNA) is identical by state (IBS) if they have the same alleles in two of more individuals. For example, if person A has genotype A/C at a SNP position while person B has A/T at the position, then that SNP position has a IBS=1 status. IBD is a IBS that their shared alleles are from a common ancestor. That is, in my above example, if A and B’s parents have genotype A/C(mom) and A/T (dad), then we don’t know if the shared “A” allele in A and B came from mom or dad. In that case, it’s just an IBS, not an IBD. But if their parents’ genotypes are A/G (mom) and C/T (dad), then we know for largely sure the “A” is from mom, not from dad. So, it’s both an IBD and IBS. In practice, for a pair of individuals, without knowing their parents, we could estimate their IBD likelihood for each SNP using their IBS information. I think the math behind is Bayesian theorem. 

PLINK use IBS-based clustering to remove sample outliers (which have too-far IBS distance from the rest), see here:
PLINK use IBD to detect samples too-close to each other (e.g. relatives, contamination). See here:

Below is a nice piece for how to use IBD to detect relative. From Anderson, C. et al. Data quality control in genetic case-control association studies. Nat Protoc. 5, 1564–1573 (2010).
The expectation is that IBD = 1 for duplicates or monozygotic twins, IBD = 0.5 for first-degree relatives, IBD = 0.25 for second-degree relatives and IBD = 0.125 for third- degree relatives. Owing to genotyping error, LD and population structure, there is often some variation around these theoretical values and it is typical to remove one individual from each pair with an IBD value of > 0.1875, which is halfway between the expected IBD for third- and second-degree relatives. For these reasons an IBD value of > 0.98 identifies duplicates.

Another review on this is also worthy reading: Joseph E. Powell, Peter M. Visscher & Michael E. Goddard. Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics 11, 800-805 (2010)

The following slide (from could be easier to understand IBD vs. IBS:

Friday, July 08, 2016

Best way to draw heatmap for publication

Here are two tips I can share if you were also working on a big dataset towards a high quality heatmap:

1. Don't generate PDF using pheatmap() or heatmap.2() as (i) the file is unnecessarily SUPER large if you have a lot of data points in the heatmap, so that you can kill your Illustrator; (ii) annoying grey boxes added to the grip (see here). Use basic image() with zero margins (e.g. par(mar=c(0,0,0,0))) to generate high-resolution PNG (or JPEG, TIFF) and place in Adobe Illustrator. You can freely add legend/annotation there easily.

2. When you use image(), rotate your matrix 90 degree clockwise first, so that the conventional printed layout of the matrix is same what you see in the image (e.g. top-left corner --> x(1,1) in your matrix etc.). An elegant piece of code for clockwise rotation can be found here:
rotate <- function(x) t(apply(x, 2, rev))

Tuesday, February 16, 2016

My 15 practical tips for a bioinformatician

Tips below are based on the lessons I learnt from making mistakes during my years of research. It's purely personal opinion. Order doesn't mean anything. If you think I should include something else, please comment below.
  1. Always set a seed number when you run tools with random option, e.g. bedtools shuffle, random etc.; You (or your boss in a day) want your work reproducible. 
  2. Set your own temporary folder (via --tmp, $TMPDIR etc., depending on your program). By default, many tools, e.g. sort, use the system /tmp as temporary folder, which may have limited quote that is not enough for your big NGS data. 
  3. Always use a valid name for your variables, column and rows of data frame. Otherwise, it can bring up unexpected problem, e.g. a '-' in the name will be transferred to '.' in R unless you specify check.names=F.
  4. Always make a README file for the folder of your data; For why and how, read this:
  5. Always comment your code properly, for yourself and for others, as you very likely will read your ugly code again.
  6. Always backup your code timely, using github, svn, Time Machine, or simply copy/paste whatever.
  7. Always clean up the intermediate or unnecessary data, as you can easily shock your boss and yourself by generating so much data (and perhaps most of them are useless). 
  8. Don't save into *.sam if you can use *.bam. Always zip your fastq (and other large plain files) as much as you. This applies to other file format if you can use the compressed one. As you cannot imagine how much data (aka "digital garbage") you will generate soon.
  9. Using parallel as much as you can, e.g. using "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting (, as multi-core CPU/GPU is common nowaday.
  10. When a project is completed, remember to clean up your project folder, incl. removing the unnecessary code/data/intermediate files, and burn a CD or host it somewhere in cloud for the project. You never know when you, your boss or your collaborators will need the data again;
  11. Make your code sustainable as possible as you can. Remember the 3 major features of OOP: Inheritance, Encapsulation, Polymorphism. (URL)
  12. When you learn some tips from others by Google search, remember to list the URL for future reference and also for acknowledging others' credit. This applies to this post, of course :)
  13. Keep learning, otherwise you will be out soon. Just like the rapid development of NGS techniques, computing skills are also evolving quickly. Always catch up with the new skills/tools/techniques.
  14. When you learn some tips from others, remember to share something you learned to the community as well, as that's how the community grows healthily. 
  15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians who usually sit in front of computer for hours. Only good health can last your career long. More reading:

Thursday, January 21, 2016

Continuous tab in indentation could cause problem

When I tested the following code in terminal and got an error at the 2nd echo. It asked to list all files in the current folder, e.g. “Display all 148 possibilities? (y or n)”, rather than moving forwards. I cannot tell any error in the code.

for i in 1 2 3 4 5
    echo $i
    for j in eRNA promoter exon random
        echo $j

I figure it out when I display all the white characters, as shown below:

enter image description here

Do you see the difference?

The “working” code, which I copied from the Linux document ( has different characters for the nested indentation (-> .. means a tab and two spaces) than the two tabs in the “not working” part (which I typed in Rstudio). I realized that it’s because the tab-autocompletion feature in Linux. Pressing tab a second time will tell the shell to provide a list of matching files.

To avoid the problem, simply click on the “Insert spaces for tab” option in your editor. I used Rstudio, here is where you can find the option: