Tuesday, December 29, 2020

compression level in samtools and gzip

samtools fastq can convert bam to fastq format, e.g. samtools fastq input.bam -o output.fastq

The output file will be automatically compressed if the file names have a .gz or .bgzf extension, e.g. 

samtools fastq input.bam -o output1.fastq.gz

Alternatively, you can also pipe the stdout to compressor explicitly, e.g.

samtools fastq input.bam | gzip > output2.fastq.gz

Interestingly, I noticed that output2.fastq.gz is significantly smaller than output1.fastq.gz, even though the uncompressed file content is the same. 

Actually, this is because of the different default compression ratio used in samtools and gzip. 

In samtools fastq, its default compression level is 1 (out of [0..9]) while gzip's default compression level is 6 (out of [1..9]). 

Wednesday, December 09, 2020

managing multiple SSH keys

 Let's say you have already generated an SSH key for GitHub, as instructed here:


Now your .ssh folder will be like this:

PHS015945:.ssh xd010$ ll

-rw-r--r--  1 xd010  staff   165B Dec  9 23:21 config

-rw-------  1 xd010  staff   411B Dec  9 23:12 id_ed25519

-rw-r--r--  1 xd010  staff   100B Dec  9 23:12 id_ed25519.pub

where config file will be like:

Host *

  AddKeysToAgent yes

  UseKeychain yes

  IdentityFile ~/.ssh/id_ed25519

Now, you want to ssh to your HPC server without a password. You will follow instructions like this http://www.linuxproblem.org/art_9.html, e.g. 

a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa): 
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A

Now append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:

a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password: 

You will find that you are still asked to enter the password when you want to ssh to your HPC. Where's the problem?

Sunday, November 29, 2020

How to take ultra high resolution screenshot?

This is very useful esp. when you want to use a high-resolution screenshot for a poster/printout. I currently don't find a way in Safari to save the screenshot in Responsive Mode. 

See stepwise instructions below from David Augustat's blog: https://davidaugustat.com/web/take-ultra-high-resolution-screenshots-in-chrome

Friday, November 27, 2020

Conda - A must-have for bioinformatician

As a bioinformatician, when you are given a new system (Mac, HPC, or any Linux environment), the first thing to do is probably to configure your work environment, install required tools, etc. 

Here is what I did for the new Mac:

1. This step is optional. Apple changed to default shell to zsh since the last OS. If you like bash, you can change the default zsh shell from bash:

chsh -s /bin/bash

2. Install Miniconda. Miniconda is a free minimal installer for conda, incl. only the basic ones (e.g. conda, python, zlib, etc). If you are under Linux, download the one for Linux

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

bash Miniconda3-latest-MacOSX-x86_64.sh

3. Install commonly used tools with bioconda:

# update bash (the one coming with MacOS is too old)

conda install -c conda-forge bash

# samtools etc.

conda install -c bioconda samtools plink plink2

# UCSC Kent Utilities etc.

conda install -c bioconda ucsc-bedgraphtobigwig ucsc-wigtobigwig ucsc-liftover ucsc-bigwigsummary

# GNU coreutils etc.

conda install -c bioconda coreutils gawk wget

# R and R package etc.

conda install -c r r-base

conda install -c r r-essentials

conda install -c r r-tidyverse

# Google cloud gsutil

conda install -c conda-forge gsutil 

Life is much easier after conda!

Wednesday, November 04, 2020

Explaination of "samtools flagstat" output

==> CRR119891.sam.flagstat <==

192013910 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

38830640 + 0 supplementary

0 + 0 duplicates

191681231 + 0 mapped (99.83% : N/A)

153183270 + 0 paired in sequencing

76591635 + 0 read1

76591635 + 0 read2

118230182 + 0 properly paired (77.18% : N/A)

152555312 + 0 with itself and mate mapped

295279 + 0 singletons (0.19% : N/A)

10160332 + 0 with mate mapped to a different chr

5532624 + 0 with mate mapped to a different chr (mapQ>=5)

Here are the notes from our careful check:

1. What's secondary, supplementary, and singleton? I was using "bwa mem" with default parameters. In that case, it only reports one alignment if the read can be mapped to multiple locations with equal quality. See discussion here: https://www.biostars.org/p/304614/#304620. That's why you don't see the secondary alignment in above example. Supplementary alignments are those alignments that are part of a chimeric alignment. For example, in a circRNA case, one read is broke into two pieces, mapped to different locations in back-splicing order. Then one of the two fragment will be supplementary (the other is not). Singleton is a mapped read whose mate is unmapped. So its flag has 0x1 and 0x8 but 0x4 is unset. Note that a singleton's mate (which is unmapped) can have a flag of 0110.  

2. What's the total number of reads? The number in the output is NOT for reads, but rather for alignment in the SAM files. The samtools flagstat only check the FLAG, not the read ID. So, in the above example, the total number of reads (R1+R2) should be 192013910 - 38830640. (If there is secondary or duplicate, it should also be excluded). It's also indicated by "paired in sequencing", 153183270 (R1 +R2). This can also be calculated by checking the read ID column e.g. cat CRR119891.sam | cut -f1 | sort -u | wc -l, then times 2 for paired-end reads.

3. How many reads are unmapped? The total alignment is sum of "mapped" + "singletons" + unmapped. So, the number of unmapped reads should be calcualted as 192013910-191681231-295279 = 37400 in above case. The unmapped reads can also be counted by checking the RNAME column in the SAM file (e.g. cat CRR119891.sam | awk '$3=="*"' | wc -l). 

4. Mappability? The above mapped % is based on "mapped" / "total" alignment, not based on reads. If for reads, it should be calculated based on above two numbers. 

Monday, June 08, 2020

Be careful of "sort -k1,1 -k2,2n -u"

When you attempted to sort and extract the unique genomic regions using "sort -k1,1 -k2,2n -u", you might make a mistake by missing the region with the same chr and start, but different end position.

The right way should be  "sort -k1,1 -k2,2n -k3,3n -u" or  "sort -k1,1 -k2,2n | sort -u"

Tuesday, April 14, 2020

Innate vs. Adaptive Immunity



Innate Immunity

Adaptive immunity

1.PresenceInnate immunity is something already present in the body.Adaptive immunity is created in response to exposure to a foreign substance.
3.ResponseFights any foreign invaderFight only specific infection
4.ResponseRapidSlow (1-2 weeks)
5.PotencyLimited and Lower potencyHigh potency
6.Time spanOnce activated against a specific type of antigen, the immunity remains throughout the life.The span of developed immunity can be lifelong or short.
7.InheritanceInnate type of immunity is generally inherited from parents and passed to offspring.Adaptive immunity is not passed from the parents to offspring, hence it cannot be inherited.
8.MemoryCannot react with equal potency upon repeated exposure to the same pathogen.Adaptive system can remember the specific pathogens which have encountered before.
9.PresencePresent at birthDevelops during a person’s lifetime and can be short-lived.
10.Allergic ReactionNoneImmediate and Delay hypersensitivity
11.Used AgainstFor microbesMicrobes and non-microbial substances called antigens
12.MemoryNo memoryLong term memory
14.SpeedFaster responseSlower response
15.Complement system activationAlternative and lectin pathwaysClassical pathway
16.Anatomic and physiological barriersSkin, Mucous membranes, Temp, pH, chemicals, etc.Lymph nodes, spleen, mucosal associated lymphoid tissue.
17.CompositionThe innate immune system is composed of physical and chemical barriers, phagocytic leukocytes, dendritic cells, natural killer cells, and plasma proteins.Adaptive immune system is composed of B cells and T cells.
18.DevelopmentEvolutionary, older and is found in both vertebrates and invertebrates.Adaptive immunity system has been developed recently and is found only in the vertebrates.
19.ExampleWhite blood cells fighting bacteria, causing redness and swelling, when you have a cut.Chickenpox vaccination so that we don’t get chickenpox because adaptive immunity system has remembered the foreign body.
Ref: https://microbiologyinfo.com/difference-between-innate-and-adaptive-immunity/

Monday, December 16, 2019

Illustrator: how to fill shapes with a 45 degree line pattern?

Reference: https://graphicdesign.stackexchange.com/a/93398
Use a pattern ...
There are a bunch of line patterns loaded with Illustrator by default (Open Swatch Library → Patterns → Basic Graphics → Basic Graphics Lines).
You can use them as a second fill using the appearance panel and use blending etc to get the effect you want. You can add a Transform effect to that specific fill (make sure to check "Transform Patterns") to get the rotation & scale you want:
enter image description here
...and the same with a different blending mode:
enter image description here
If the default line patterns don't work for you then you can, of course, make your own pattern; which should be as easy as creating a small section of the lines you want (you could do it with a single line if you really wanted to) and dragging them to the Swatches panel, then double-clicking to enter the pattern editor:
enter image description here
Read more here:

Friday, September 13, 2019

Note on the case-insensitive file system in Mac OS

I recently noticed that when my script wrote a file (e.g. condition_pd_vs_hc.pdf) into the Linux server disk mounted in my Mac. It will override another existing file although with different file name (e.g. CONDITION_PD_vs_HC.pdf). It turned out that my Mac OS filesystem is HFS+ and the disk was formated as "Mac OS Extended" (by default). Unlike "Mac OS Extended (Case-sensitive, Journaled)", "Mac OS Extended" is not case-sensitive, but case-presertive. "This means that the file system will consider foo and FoO to be the same, but when you create a new file it will remember which letters where capitalized and which were not." (quoted from https://apple.stackexchange.com/a/22304).

I totally don't get the point to have a case-insensitive but case-preserving file system.  Bummer!

Monday, November 19, 2018

TERM_SWAP: job killed after reaching LSF swap usage limit.

When your job was killed due to this error, it means you've exceeded the swap usage set for the queue.

bsub -v option can set the limit:

   -v swap_limit
               Set the total process virtual memory limit to
               swap_limit for the whole job. The default is
               no limit. Exceeding the limit causes the job to

               By default, the limit is specified in KB. Use
               LSF_UNIT_FOR_LIMITS in lsf.conf to specify a
               larger unit for the limit (MB, GB, TB, PB, or EB).

Without administrator permission, one cannot change the lsf.conf. 

So as a user what you can do is to switch to use a different queue with larger swap limit. Here is the command to check the limit for a queue, e.g. 

$bqueues -l normal | grep -A 1 SWAP
0 M 8 G 12 G

Here is what I got from our cluster:

$ for i in `bqueues | grep -v QUEUE_NAME | cut -f1 -d' '`; do echo $i; bqueues -l $i | grep -A 1 SWAP; done
     39 G      47 G 
    488 G     586 G 
      0 M      23 G      29 G 
      0 M       2 T       2 T 
      0 M     498 G     500 G 
      0 M     498 G     500 G 
      0 M       8 G      12 G 
      0 M       9 G      12 G 
      0 M       8 G      10 G 
      0 M       8 G      10 G 
      0 M       9 G      12 G 
      0 M       8 G      12 G 
      0 M       8 G      10 G 
      0 M     248 G     254 G 
      0 M     117 G     125 G 
      0 M     250 G     250 G 
      0 M     250 G     250 G 
      0 M       4 G     250 G 
      0 M       8 G      12 G 
      0 M      12 G      12 G 

Monday, October 15, 2018

Making Art in R

Amazing artworks people made in R:

See their source code and more arts at:

More about generative arts in R here:

Tuesday, September 25, 2018

PCA plot with fill, color, and shape all together

When I plotted the PCA results (e.g. scatter plot for PC1 and PC2) and was about to annotate the dataset with different covariates (e.g. gender, diagnosis, and ethic group), I noticed that it's not straightforward to annotate >2 covariates at the same time using ggplot.

Here is what works for me in ggplot:

pcaData <- plotPCA(vsd, intgroup = c( "Diagnosis", "Ethnicity", "Sex"), returnData = TRUE) # vsd and plotPCA are part of DESeq2 package, nothing with my example below. 
percentVar <- round(100 * attr(pcaData, "percentVar")) 
ggplot(pcaData, aes(x = PC1, y = PC2, color = factor(Diagnosis), shape = factor(Ethnicity))) + 
geom_point(size =3, aes(fill=factor(Diagnosis), alpha=as.character(Sex))) + 
geom_point(size =3) + 
scale_shape_manual(values=c(21,22)) + 
scale_alpha_manual(values=c("F"=0, "M"=1)) + 
xlab(paste0("PC1: ", percentVar[1], "% variance")) + 
ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
ggtitle("PCA of all genes, no covariate adjusted")

I also found that you can use the male and female symbol (♂ ♀) as shapes in your plot. Here is how:

df <- data.frame(x = runif(10), y = runif(10), sex = sample(c("m","f"), 10, rep = T)) 
qplot(x, y, data = df, shape = sex, size = I(5)) + 
scale_shape_manual(values = c("m" = "\u2642", f = "\u2640"))
(Reference: https://github.com/kmiddleton/rexamples/blob/master/ggplot2%20male-female%20symbols.R)

I've not figured out a way to combine the two ideas above.

Friday, June 15, 2018

About snRNAs and snoRNAs

Some quick survey about snRNAs and snoRNAs:

- snRNA

small nuclear RNAs, ~150nt,  also called uRNA (due to rich of uridylic residues), functions in RNA splicing
There are 1916 snRNAs in the human genome. Their length distribution as below:
 20  1
 40 * 20
 60 ********** 209
 80 ****** 129
100 ************************************************************ 1217
120 ***** 95
140 **** 83
160 ****** 116
180 ** 42
200  3
220  1

- snoRNAs

small nucleolar RNAs, functions in guiding the modification of other ncRNAs, e,g. mainly rRNAs, snRNAs, tRNA. 
snoRNAs form cluster and 
In human genome (GENCODE v19), there are 1457 snoRNA genes. Length distribution as below:
 40 * 9
 60 *********************************** 252
 80 *************************************** 275
100 ************************************************************ 426
120 ************************************************ 342
140 ******** 59
160 ** 11
180 *** 18
200 ******** 56
220  3
240  1
260 * 4
280  0
300  0
320  1

Some of them have many copies and cluster in a imprinting gene locus (http://www.geneimprint.com/site/genes-by-species), e.g.  SNORD113 and SNORD114 are clustered and located in intron of MEG8, a long ncRNA only maternally expressed. 
And the number of snoRNA copies in the cluster is associated with diseases. e.g. Loss of the 29 copies of SNORD116 (HBII-85) from this region has been identified as a cause of Prader-Willi syndrome whereas gain of additional copies of SNORD115 has been linked to autism.

Friday, June 08, 2018

Download all KEGG pathway KGML files for SPIA analysis

Most people know KEGG pathway, but not everyone knows that it costs at least $2000 to subscribe its database. If you want to save the cost a bit, you can manually download the KEGG pathway KGML files and install in SPIA. Here I have a workaround to download all KEGG pathway files using their REST API.
## Claim: this is my personal trick. I recommend people to subscribe their KEGG FTP download to support the authors.

# change folder to the folder where your SPIA installed
cd /Library/Frameworks/R.framework/Versions/3.4/Resources/library/SPIA
# download all XML files for all human pathway
curl -s http://rest.kegg.jp/list/pathway/hsa | awk '{split($1,a,":"); print "curl -s http://rest.kegg.jp/get/"a[2]"/kgml -o extdata/keggxml/hsa/"a[2]".xml"}' | bash

# then switch to R console


Monday, March 26, 2018

Brace expansion

Let's say I want to download the chromHMM result for all brain tissues from Roadmap project. The brain tissues are numbers from E067-E074, E081, and E082 according to Anshul's table: https://docs.google.com/spreadsheets/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15

So, what's the wildcard to match all the ten brain tissues? E067-E074, E081, and E082

Here is the trick:

wget http://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E0{{67..74},81,81}_15_coreMarks_segments.bed

There are two types of brace expansion in Bash:
1. String list: {string1,string2,...,stringN}
2. Range list: {<START>..<END>}

You can combine them or nest them, for example
{{67..74},81,81} ==> 67,68,69,70,71,72,73,74,81,82
{A..C}{0..2}  ==> A0, B0, C0, A1, B1, C1, A2, B2, C2
{{A..C},{0..2}}  ==> A, B, C, 0, 1, 2

In the new Bash 4.0, you can even do more, for example
- padding them, e.g. {0001..5} ==> 0001 0002 0003 0004 0005
- specify an increment using ranges, e.g. {1..10..2} ==> 1 3 5 7 9; {a..z..3} ==> a d g j m p s v y
- Using a prefix: 0{1..5} ==> 01 02 03 04 05
- or postfix: ---{A..E}---  ==> ---A--- ---B--- ---C--- ---D--- ---E---

Anyone understand the fun of code below:

function braceify {
    [[ $1 == +([[:digit:]]) ]] || return
    typeset -a a
    read -ra a < <(factor "$1")
    eval "echo $(printf '{$(printf ,%%.s {1..%s})}' "${a[@]:1}")"

printf 'eval printf "$arg"%s' "$(braceify 1000000)"

See explain here: http://wiki.bash-hackers.org/syntax/expansion/brace

Thursday, September 07, 2017

try to be consistent

Worthy to note the lesson I learned today: try to be consistent when sorting with different chromosome order.

I got the same error message as the post: https://github.com/arq5x/bedtools/issues/109

In short, the chromosome order in my sorted BAM file header is like this chr10, chr11, chr12... chr19, chr1, chr20, ... chr22, chr2, chr3 ....chrM, chrX, chrY. The order is determined by the aligner which generates @SQ headers based on how the reference sequences are arranged in the FASTA file from which you built its aligner indices. That means, the order in @SQ is same as the order in your genome.fa.fai (index) file.

Thursday, July 20, 2017

Educational papers in Bioinformatics from Nature Methods and Nature Biotechnology

Nature Methods and Nature Biotechnology have this super-great monthly column in bioinformatics for years. For example, Nature Methods has both the "Points of Significance" covering topics in statistics and "Points of View" in data visualization. And Nature Biotechnology's Primer column covers a collection of topics in Bioinformatics, such as "What's a hidden Markov model?". 

Today I've compiled all the articles from above three columns together and made a PDF for you. Here is the link

Monday, May 15, 2017

Kingdoms and domains of life

Notes of revisiting the topic:

  1. Linnaeus originally placed all living things into either plant or animal kingdom, and nowaday we place living things (e.g. not virus) into six (Animalia 动物界, Plantae 植物界, Fungi 真菌界, Protista 原生生物界, Archaea/Archaeabacteria 古细菌界, and Bacteria/Eubacteria 细菌界in US's textbook) or five (Animalia, Plantae, Fungi, Protista and Monera in other countries) kingdoms.
  2. Protista is any eukaryotic organism that is not an animal, plant, or fungi. 原生生物是除动物植物真菌之外的一切真核生物,包括海带、紫菜、藻类、变形虫等。所以海带不是植物,长知识了!
  3. Archaea/Archaeabacteria 古细菌界 vs. Bacteria/Eubacteria 细菌界: Archaea (/ar-kee/, ending sounds like IKEA) and Bacteria are both prokaryotes. Archaea is similar to Bacteria in structure (e.g. No membrane-bound organelles or nucleus, circular chromosome), but is closer to eukaryote in biochemistry (e.g. similar transcription and translation to Eukarya, e.g. TFs bound to the promoter, RNA pol II). Archaea was originally found in the extreme environment, e.g. extreme hot spring in Yellowstone, but now it's found everywhere in the world, e.g. in the gut of a human. 
  4. Not all extremophiles are Archaeons. Thermus aquaticus, originally found in Yellowstone National Park, is not an Archaea, but a thermophilic Bacteria. The famous Taq enzyme for PCR is from the bacteria. T. aquaticus 最初是由Thomas Brock在1969年发现的,1976年DNA聚合酶polymerase 首次从T. aquaticus中分离出来,1988年Kary Mullis等发现并提纯了这个酶,指出其可用于PCR,5年后(1993)因此获得诺贝尔化学奖。
  5. Six kingdoms are grouped into three domains (Bacteria, Archaea, and Eukarya), although both Bacteria and Archaea are Prokarya (原核,无核膜). The main characteristics that scientists focus on when determining which kingdom an organism belongs to are summarized below:
       • Cell Type: Organisms are either prokaryotic or eukaryotic (有无核膜).
       • Cell Walls: Some organisms have cell walls and others do not (有无细胞壁).
       • Body Type: Organisms can be unicellular or multicellular (单细胞 or 多细胞).
       • Nutrition: Many organisms are autotrophs and others are heterotrophs, respectively depending on whether they make their own food or receive nutrients by consuming other organisms (自养 or 它养.

Tuesday, January 24, 2017

Simple note for github vs. Rstudio

Having used both Rstudio and GitHub for a while, but never thought to combine them together. Hadley Wickham already put a detailed instruction on how to link GitHub with Rstudio (http://r-pkgs.had.co.nz/git.html). I just list few quick note from my own experience.

1. The "project" concept of Rstudio is the key. I was not very organized before, very often just working on multiple files from different projects at the same time. Actually it's very distractive. It seems that only files organized into a project can be pushed to GitHub via Rstudio. Very good!

2. How to install R package in private GitHub repo? The trick is here.
## add your PAT (Personal Access Token, get here: https://github.com/settings/tokens) into your ~/.Renviron file
# echo "GITHUB_PAT=xxxxxxxxxxYOUR-PAT-GO-HERExxxxxxxxxx" >> ~/.Renviron
# library(devtools);install_github("sterding/powerEQTL")

Output, Reads, Cluster Numbers etc. for Illumina