Thursday, October 07, 2021

A bug related to R factor

Noted a bug in my code today. Sometimes you need to put a certain level (e.g. healthy control) in the first position of your covariate's levels, so that it becomes the reference level.

Here is my old code:

levels(dds[[variable]]) = union(variable_REF, levels(dds[[variable]]))

Note that this can cause problems. For example, suppose your diagnosis variable has two levels: HC and AD. By default, factor() sorts the levels alphabetically (i.e. AD, then HC). If you then assign c("HC", "AD") to levels(), you are not reordering the levels but relabeling the values: it is effectively c(AD = "HC", HC = "AD"), so every AD sample becomes "HC" and vice versa. This is not what you want.

What you wanted is actually this:

dds[[variable]]=factor(dds[[variable]], levels= union(variable_REF, unique(dds[[variable]]))) 

* Too bad that Blogger does not support Markdown yet. 

Saturday, July 03, 2021

Note (2) for DESeq2 time series data analysis

More notes on using the LRT to test time-series data. Thanks to Jie for the discussion. 
  1. Swapping the levels of the time factor won't change the LRT results. When time is a factor, the LRT does not treat it as a trajectory analysis but as a factor analysis (i.e. a condition-specific difference at ANY time point). 
  2. Subsetting to only two time points {t0, ti} in the LRT gives different numbers of DE genes for different time points i. See the example below: when testing only {0, 60} minutes, 7 DE genes are found, but including all time points yields only 4. One explanation is that the likelihood ratio shrinks when you include time points with little or no condition-specific difference. Another possibility is that there is a large time effect at t2 independent of the condition: adding t2 adds the "time" term to both the full and reduced models of the LRT, which can result in a smaller ratio.
  3. Converting the time covariate from a factor (categorical variable) to a continuous variable can give different results. A categorical time variable in the LRT does not consider slope or trajectory, but it can detect genes with a large condition-specific difference at one specific time point even when the overall slope does not change. A continuous variable instead tests for a change in slope, which is more like a true time-series analysis. Sometimes we may need to do both. 

Monday, April 19, 2021

Note for DESeq2 time course analysis

In many cases, we need to perform differential expression across time course data, e.g. finding genes that react in a condition-specific manner over time, compared to a set of baseline samples. DESeq2 has such an implementation for time-course experiments.

There are a number of ways to analyze time-series experiments, depending on the biological question of interest. In order to test for any differences over multiple time points, one can use a design including the time factor, and then test using the likelihood ratio test as described in the following section, where the time factor is removed in the reduced formula. For a control and treatment time series, one can use a design formula containing the condition factor, the time factor, and the interaction of the two. In this case, using the likelihood ratio test with a reduced model which does not contain the interaction terms will test whether the condition induces a change in gene expression at any time point after the reference level time point (time 0). An example of the latter analysis is provided in our RNA-seq workflow.

Below is the example in the DESeq2 workflow using the LRT to test the interaction term (i.e. any condition-specific changes at any time point after time 0):

library("fission")
data("fission")
ddsTC <- DESeqDataSet(fission, ~ strain + minute + strain:minute)

ddsTC <- DESeq(ddsTC, test="LRT", reduced = ~ strain + minute)
resTC <- results(ddsTC)

Several notes from the DESeq2 forum:

  1. By default, the results() function returns the LRT p-value and the MLE log2FC for the difference between Mut and WT at the last time point, controlling for baseline.
  2. To get the log2FC for the difference between Mut and WT at a different time point, you have to specify it manually, e.g. "strainmut.minute15" is the difference between Mut and WT at minute 15, controlling for baseline. 
  3. The table of log2 fold changes of 60 minutes vs 0 minutes for the WT strain is results(ddsTC, name="minute_60_vs_0").
  4. For the mut strain, it is the sum of the WT term above and the interaction term, which is an additional effect beyond the effect for the reference level (WT): results(ddsTC, contrast=list(c("minute_60_vs_0","strainmut.minute60"))).
  5. "strainmut.minute15" is the difference between Mut and WT at minute 15, controlling for baseline. If you add "strain_mut_vs_wt" to this, you get the LFC for Mut vs WT at minute 15 NOT controlling for baseline, i.e. the observed difference at minute 15 between the two groups (because you added back the difference that was present at time 0).

Two kinds of hypothesis tests in DEseq2:

Wald test: to test whether an estimated log2 fold change is equal to zero; the LFC estimate is divided by its standard error to give a z-statistic.

LRT (likelihood ratio test) between a full model and a reduced model: to test if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero.

Tuesday, December 29, 2020

compression level in samtools and gzip

samtools fastq can convert BAM to FASTQ format, e.g. samtools fastq input.bam -o output.fastq

The output file is automatically compressed if the file name has a .gz or .bgzf extension, e.g. 

samtools fastq input.bam -o output1.fastq.gz

Alternatively, you can pipe the stdout to a compressor explicitly, e.g.

samtools fastq input.bam | gzip > output2.fastq.gz

Interestingly, I noticed that output2.fastq.gz is significantly smaller than output1.fastq.gz, even though the uncompressed file content is the same. 

Actually, this is because samtools and gzip use different default compression levels. 

In samtools fastq, the default compression level is 1 (out of [0..9]), while gzip's default compression level is 6 (out of [1..9]). 
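To see the effect of the compression level directly, here is a small experiment with gzip alone (the input is toy text generated with seq, not a real FASTQ):

```shell
# generate ~600 KB of moderately compressible toy text
seq 1 100000 > toy.txt

# compress at level 1 (samtools' default for .gz output) and at gzip's default level 6
gzip -1 -c toy.txt > toy.l1.gz
gzip -6 -c toy.txt > toy.l6.gz

# compare the sizes; the level-1 file is typically noticeably larger
wc -c toy.l1.gz toy.l6.gz
rm -f toy.txt toy.l1.gz toy.l6.gz
```

If I read the man page correctly, samtools fastq also accepts a -c INT option to set the compression level of its own .gz output, so you can keep the convenience of -o with better compression.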

Wednesday, December 09, 2020

managing multiple SSH keys

 Let's say you have already generated an SSH key for GitHub, as instructed here:

Now your .ssh folder will look like this:

PHS015945:.ssh xd010$ ll

-rw-r--r--  1 xd010  staff   165B Dec  9 23:21 config

-rw-------  1 xd010  staff   411B Dec  9 23:12 id_ed25519

-rw-r--r--  1 xd010  staff   100B Dec  9 23:12 id_ed25519.pub

where the config file looks like this:

Host *

  AddKeysToAgent yes

  UseKeychain yes

  IdentityFile ~/.ssh/id_ed25519

Now, you want to ssh to your HPC server without a password, so you follow the usual instructions, e.g. 

a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa): 
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A

Now append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:

a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password: 

You will find that you are still asked to enter the password when you want to ssh to your HPC. Where's the problem?
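A likely culprit, assuming a config like the one shown above: once any IdentityFile is set (here under Host *), ssh stops trying the default key files, so the new id_rsa key is never offered to the HPC. A per-host config is the usual fix — a sketch, where "hpc" and hpc.example.org are placeholders for your server:

```
Host github.com
  IdentityFile ~/.ssh/id_ed25519

Host hpc
  HostName hpc.example.org
  User b
  IdentityFile ~/.ssh/id_rsa

Host *
  AddKeysToAgent yes
  UseKeychain yes
```

Then `ssh hpc` picks up the right key; `ssh -v hpc` shows which identities are actually offered.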

Sunday, November 29, 2020

How to take ultra high resolution screenshot?

This is very useful, especially when you want a high-resolution screenshot for a poster or printout. I haven't yet found a way in Safari to save a screenshot in Responsive Design Mode. 

See the step-by-step instructions from David Augustat's blog:

Friday, November 27, 2020

Conda - A must-have for bioinformaticians

As a bioinformatician, when you are given a new system (Mac, HPC, or any Linux environment), the first thing to do is probably to configure your work environment and install the required tools. 

Here is what I did for the new Mac:

1. This step is optional. Apple changed the default shell to zsh starting with macOS Catalina. If you prefer bash, you can switch the default shell from zsh back to bash:

chsh -s /bin/bash

2. Install Miniconda. Miniconda is a free minimal installer for conda, including only the basics (e.g. conda, Python, zlib). If you are on Linux, download the Linux installer.



3. Install commonly used tools with bioconda:

# update bash (the one coming with MacOS is too old)

conda install -c conda-forge bash

# samtools etc.

conda install -c bioconda samtools plink plink2

# UCSC Kent Utilities etc.

conda install -c bioconda ucsc-bedgraphtobigwig ucsc-wigtobigwig ucsc-liftover ucsc-bigwigsummary

# GNU utils etc.

conda install -c conda-forge coreutils gawk wget gzip datamash parallel make readline sed

# R and R package etc.

conda install -c r r-base

conda install -c r r-essentials

conda install -c r r-tidyverse

# Google cloud gsutil

conda install -c conda-forge gsutil 

Life is much easier after conda!

Wednesday, November 04, 2020

Explanation of "samtools flagstat" output

==> CRR119891.sam.flagstat <==

192013910 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

38830640 + 0 supplementary

0 + 0 duplicates

191681231 + 0 mapped (99.83% : N/A)

153183270 + 0 paired in sequencing

76591635 + 0 read1

76591635 + 0 read2

118230182 + 0 properly paired (77.18% : N/A)

152555312 + 0 with itself and mate mapped

295279 + 0 singletons (0.19% : N/A)

10160332 + 0 with mate mapped to a different chr

5532624 + 0 with mate mapped to a different chr (mapQ>=5)

Here are the notes from our careful check:

1. What's secondary, supplementary, and singleton? I was using "bwa mem" with default parameters. In that case it reports only one alignment when a read maps to multiple locations with equal quality (see the discussion here:). That's why you don't see secondary alignments in the example above. Supplementary alignments are the parts of a chimeric alignment: for example, in a circRNA case a read is broken into two pieces mapped to different locations in back-splicing order, and one of the two fragments is flagged supplementary (the other is not). A singleton is a mapped read whose mate is unmapped, so its FLAG has 0x1 and 0x8 set but 0x4 unset; the singleton's unmapped mate, conversely, has 0x1 and 0x4 set but 0x8 unset.  

2. What's the total number of reads? The number in the output is NOT the number of reads, but the number of alignment records in the SAM file; samtools flagstat only checks the FLAG, not the read ID. So in the example above, the total number of reads (R1+R2) is 192013910 - 38830640 = 153183270 (any secondary or duplicate records would also have to be excluded), which matches "paired in sequencing". It can also be calculated from the read ID column, e.g. cat CRR119891.sam | cut -f1 | sort -u | wc -l, then multiplied by 2 for paired-end reads.

3. How many reads are unmapped? The total is "mapped" plus unmapped records, and the unmapped records consist of the unmapped mates of singletons plus reads from pairs where both ends are unmapped. So the number of reads with no placement here is 192013910 - 191681231 - 295279 = 37400. These can also be counted from the RNAME column (e.g. cat CRR119891.sam | awk '$3=="*"' | wc -l); note that a singleton's unmapped mate inherits the RNAME of its mapped mate, so it is not caught by this check. 

4. Mappability? The mapped % above is "mapped" / "total" computed over alignment records, not reads. For a per-read rate, compute it from the two read-level numbers above. 
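The bookkeeping in notes 2-4 is easy to double-check with shell arithmetic, using the numbers from the flagstat output above:

```shell
total=192013910        # total alignment records
supplementary=38830640
mapped=191681231
singletons=295279

# note 2: number of reads = records minus supplementary records
echo "reads (R1+R2): $((total - supplementary))"      # 153183270, matching "paired in sequencing"

# note 3: records with RNAME "*" = total - mapped - singleton mates
echo "unplaced records: $((total - mapped - singletons))"  # 37400
```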

Monday, June 08, 2020

Be careful of "sort -k1,1 -k2,2n -u"

When you try to sort and deduplicate genomic regions with "sort -k1,1 -k2,2n -u", you can silently lose regions that share the same chr and start but have different end positions, because -u deduplicates on the sort keys only.

The right way is "sort -k1,1 -k2,2n -k3,3n -u" (for three-column BED) or "sort -u | sort -k1,1 -k2,2n" (whole-line deduplication first, then coordinate sort).
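A two-line demonstration of the pitfall, with two regions sharing chr and start but differing in end:

```shell
printf 'chr1\t100\t200\nchr1\t100\t300\n' > regions.bed

# -u compares only the sort keys (chr and start): one region is silently dropped
sort -k1,1 -k2,2n -u regions.bed | wc -l            # 1

# keying on all three columns keeps both regions
sort -k1,1 -k2,2n -k3,3n -u regions.bed | wc -l     # 2
rm -f regions.bed
```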

Tuesday, April 14, 2020

Innate vs. Adaptive Immunity



Innate immunity vs. adaptive immunity, property by property:

- Presence: Innate immunity is something already present in the body; adaptive immunity is created in response to exposure to a foreign substance.
- Target: Innate immunity fights any foreign invader; adaptive immunity fights only the specific infection.
- Response speed: Innate is rapid; adaptive is slow (1-2 weeks).
- Potency: Innate is limited and of lower potency; adaptive has high potency.
- Time span: Once activated against a specific type of antigen, the immunity remains throughout life; the span of developed adaptive immunity can be lifelong or short.
- Inheritance: Innate immunity is generally inherited from parents and passed to offspring; adaptive immunity is not passed from parents to offspring and cannot be inherited.
- Memory: Innate immunity cannot react with equal potency upon repeated exposure to the same pathogen (no memory); the adaptive system remembers the specific pathogens it has encountered before (long-term memory).
- Presence at birth: Innate immunity is present at birth; adaptive immunity develops during a person's lifetime and can be short-lived.
- Allergic reaction: None for innate; immediate and delayed hypersensitivity for adaptive.
- Used against: Innate acts on microbes; adaptive acts on microbes and non-microbial substances called antigens.
- Complement system activation: Alternative and lectin pathways (innate); classical pathway (adaptive).
- Anatomic and physiological barriers: Skin, mucous membranes, temperature, pH, chemicals, etc. (innate); lymph nodes, spleen, mucosa-associated lymphoid tissue (adaptive).
- Composition: The innate immune system is composed of physical and chemical barriers, phagocytic leukocytes, dendritic cells, natural killer cells, and plasma proteins; the adaptive immune system is composed of B cells and T cells.
- Development: Innate immunity is evolutionarily older and found in both vertebrates and invertebrates; adaptive immunity developed more recently and is found only in vertebrates.
- Example: White blood cells fighting bacteria at a cut, causing redness and swelling (innate); chickenpox vaccination so that we don't get chickenpox, because the adaptive immune system remembers the foreign body (adaptive).

Monday, December 16, 2019

Illustrator: how to fill shapes with a 45 degree line pattern?

Use a pattern ...
There are a bunch of line patterns loaded with Illustrator by default (Open Swatch Library → Patterns → Basic Graphics → Basic Graphics Lines).
You can use them as a second fill using the appearance panel and use blending etc to get the effect you want. You can add a Transform effect to that specific fill (make sure to check "Transform Patterns") to get the rotation & scale you want:
...and the same with a different blending mode:
If the default line patterns don't work for you then you can, of course, make your own pattern; which should be as easy as creating a small section of the lines you want (you could do it with a single line if you really wanted to) and dragging them to the Swatches panel, then double-clicking to enter the pattern editor:
Read more here:

Friday, September 13, 2019

Note on the case-insensitive file system in Mac OS

I recently noticed that when my script wrote a file (e.g. condition_pd_vs_hc.pdf) to a Linux server disk mounted on my Mac, it overrode another existing file with a different name (e.g. CONDITION_PD_vs_HC.pdf). It turned out that my Mac OS filesystem is HFS+ and the disk was formatted as "Mac OS Extended" (the default). Unlike "Mac OS Extended (Case-sensitive, Journaled)", "Mac OS Extended" is not case-sensitive, but case-preserving. "This means that the file system will consider foo and FoO to be the same, but when you create a new file it will remember which letters were capitalized and which were not." (quoted from

I totally don't get the point to have a case-insensitive but case-preserving file system.  Bummer!
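If you want to check a volume before trusting it, here is a small probe (a sketch; pass the mount point you care about):

```shell
probe_case() {
    # create a lowercase file in a scratch dir, then look for the uppercase name
    dir=$(mktemp -d "${1:-.}/casetest.XXXXXX")
    touch "$dir/foo"
    if [ -e "$dir/FOO" ]; then echo "case-insensitive"; else echo "case-sensitive"; fi
    rm -rf "$dir"
}
probe_case /tmp
```

On a default "Mac OS Extended" volume this prints case-insensitive; on a typical Linux filesystem, case-sensitive.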

Monday, November 19, 2018

TERM_SWAP: job killed after reaching LSF swap usage limit.

When your job was killed due to this error, it means you've exceeded the swap usage set for the queue.

bsub -v option can set the limit:

   -v swap_limit
               Set the total process virtual memory limit to
               swap_limit for the whole job. The default is
               no limit. Exceeding the limit causes the job to
               terminate.

               By default, the limit is specified in KB. Use
               LSF_UNIT_FOR_LIMITS in lsf.conf to specify a
               larger unit for the limit (MB, GB, TB, PB, or EB).

Without administrator permission, one cannot change the lsf.conf. 

So, as a user, what you can do is switch to a different queue with a larger swap limit. Here is the command to check the limit for a queue, e.g. 

$bqueues -l normal | grep -A 1 SWAP
0 M 8 G 12 G

Here is what I got from our cluster:

$ for i in `bqueues | grep -v QUEUE_NAME | cut -f1 -d' '`; do echo $i; bqueues -l $i | grep -A 1 SWAP; done
     39 G      47 G 
    488 G     586 G 
      0 M      23 G      29 G 
      0 M       2 T       2 T 
      0 M     498 G     500 G 
      0 M     498 G     500 G 
      0 M       8 G      12 G 
      0 M       9 G      12 G 
      0 M       8 G      10 G 
      0 M       8 G      10 G 
      0 M       9 G      12 G 
      0 M       8 G      12 G 
      0 M       8 G      10 G 
      0 M     248 G     254 G 
      0 M     117 G     125 G 
      0 M     250 G     250 G 
      0 M     250 G     250 G 
      0 M       4 G     250 G 
      0 M       8 G      12 G 
      0 M      12 G      12 G 

Monday, October 15, 2018

Making Art in R

Amazing artworks people made in R:

See their source code and more arts at:

More about generative arts in R here:

Tuesday, September 25, 2018

PCA plot with fill, color, and shape all together

When I plotted the PCA results (e.g. a scatter plot of PC1 vs PC2) and wanted to annotate the dataset with different covariates (e.g. sex, diagnosis, and ethnic group), I noticed that it's not straightforward to annotate more than two covariates at the same time in ggplot.

Here is what works for me in ggplot:

pcaData <- plotPCA(vsd, intgroup = c("Diagnosis", "Ethnicity", "Sex"), returnData = TRUE) # vsd and plotPCA are from the DESeq2 package; they have nothing to do with the ggplot trick below. 
percentVar <- round(100 * attr(pcaData, "percentVar")) 
ggplot(pcaData, aes(x = PC1, y = PC2, color = factor(Diagnosis), shape = factor(Ethnicity))) + 
geom_point(size =3, aes(fill=factor(Diagnosis), alpha=as.character(Sex))) + 
geom_point(size =3) + 
scale_shape_manual(values=c(21,22)) + 
scale_alpha_manual(values=c("F"=0, "M"=1)) + 
xlab(paste0("PC1: ", percentVar[1], "% variance")) + 
ylab(paste0("PC2: ", percentVar[2], "% variance")) + 
ggtitle("PCA of all genes, no covariate adjusted")

I also found that you can use the male and female symbol (♂ ♀) as shapes in your plot. Here is how:

df <- data.frame(x = runif(10), y = runif(10), sex = sample(c("m","f"), 10, rep = T)) 
qplot(x, y, data = df, shape = sex, size = I(5)) + 
scale_shape_manual(values = c("m" = "\u2642", f = "\u2640"))

I've not figured out a way to combine the two ideas above.

Friday, June 15, 2018

About snRNAs and snoRNAs

Some quick survey about snRNAs and snoRNAs:

- snRNA

small nuclear RNAs, ~150 nt, also called U-RNAs (because they are rich in uridylic acid residues); they function in RNA splicing.
There are 1916 snRNAs in the human genome. Their length distribution is below:
 20  1
 40 * 20
 60 ********** 209
 80 ****** 129
100 ************************************************************ 1217
120 ***** 95
140 **** 83
160 ****** 116
180 ** 42
200  3
220  1

- snoRNAs

small nucleolar RNAs; they guide chemical modifications of other ncRNAs, mainly rRNAs, snRNAs, and tRNAs. 
snoRNAs often occur in genomic clusters (see below).
In the human genome (GENCODE v19), there are 1457 snoRNA genes. Length distribution below:
 40 * 9
 60 *********************************** 252
 80 *************************************** 275
100 ************************************************************ 426
120 ************************************************ 342
140 ******** 59
160 ** 11
180 *** 18
200 ******** 56
220  3
240  1
260 * 4
280  0
300  0
320  1
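Histograms like the two above can be generated with a short awk script. A sketch, binning feature lengths from a three-column BED file into 20-nt bins (the file name and the toy input below are made up for illustration):

```shell
# toy annotation: three features with lengths 150, 110, and 95 nt
printf 'chr1\t100\t250\nchr1\t500\t610\nchr2\t0\t95\n' > toy.bed

# bin feature lengths into 20-nt bins and draw one star per feature
awk '{ len = $3 - $2
       bin = (int((len - 1) / 20) + 1) * 20
       n[bin]++ }
     END { for (b in n)
             printf "%4d %s %d\n", b, substr("**********", 1, n[b]), n[b] }' toy.bed | sort -n
rm -f toy.bed
```

With real counts in the hundreds (like the 1217 above), you would scale the stars, e.g. one star per 20 features.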

Some of them have many copies clustered in an imprinted gene locus, e.g. SNORD113 and SNORD114 are clustered in introns of MEG8, a long ncRNA expressed only from the maternal allele. 
The number of snoRNA copies in a cluster is associated with disease, e.g. loss of the 29 copies of SNORD116 (HBII-85) from this region has been identified as a cause of Prader-Willi syndrome, whereas gain of additional copies of SNORD115 has been linked to autism.

Friday, June 08, 2018

Download all KEGG pathway KGML files for SPIA analysis

Most people know the KEGG pathway database, but not everyone knows that subscribing to it costs at least $2000. To save the cost a bit, you can manually download the KEGG pathway KGML files and install them for SPIA. Here is a workaround to download all KEGG pathway files using the KEGG REST API.
## Claim: this is my personal trick. I recommend people subscribe to their KEGG FTP download to support the authors.

# change folder to the folder where your SPIA installed, for example in my Mac
cd /Library/Frameworks/R.framework/Versions/4.0/Resources/library/SPIA
# download all XML and png files for all human pathway
curl -s | awk '{split($1,a,":"); print "curl -s"a[2]"/kgml -o extdata/keggxml/hsa/"a[2]".xml"}' | bash

curl -s | awk '{split($1,a,":"); print "curl -s"a[2]"/image -o extdata/keggxml/hsa/"a[2]".png"}' | bash

# then switch to R console


Monday, March 26, 2018

Brace expansion

Let's say I want to download the chromHMM results for all brain tissues from the Roadmap project. The brain tissues are numbered E067-E074, E081, and E082 according to Anshul's table:

So, what's the pattern to match all ten brain tissues (E067-E074, E081, and E082)?

Here is the trick:


There are two types of brace expansion in Bash:
1. String list: {string1,string2,...,stringN}
2. Range list: {<START>..<END>}

You can combine them or nest them, for example
{{67..74},81,82} ==> 67,68,69,70,71,72,73,74,81,82
{A..C}{0..2}  ==> A0, A1, A2, B0, B1, B2, C0, C1, C2
{{A..C},{0..2}}  ==> A, B, C, 0, 1, 2

In Bash 4.0 and newer, you can do even more, for example
- padding them, e.g. {0001..5} ==> 0001 0002 0003 0004 0005
- specify an increment using ranges, e.g. {1..10..2} ==> 1 3 5 7 9; {a..z..3} ==> a d g j m p s v y
- Using a prefix: 0{1..5} ==> 01 02 03 04 05
- or postfix: ---{A..E}---  ==> ---A--- ---B--- ---C--- ---D--- ---E---
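So the ten Roadmap brain tissues can be matched by combining a prefix with a nested list:

```shell
# expands to E067 E068 ... E074 E081 E082
echo E0{{67..74},81,82}

# e.g. drive a download loop (the file name pattern is a placeholder)
for id in E0{{67..74},81,82}; do
    echo "would fetch ${id}_15_coreMarks_mnemonics.bed.gz"
done
```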

Can anyone work out the fun of the code below?

function braceify {
    [[ $1 == +([[:digit:]]) ]] || return
    typeset -a a
    read -ra a < <(factor "$1")
    eval "echo $(printf '{$(printf ,%%.s {1..%s})}' "${a[@]:1}")"
}

printf 'eval printf "$arg"%s' "$(braceify 1000000)"

See the explanation here:

Thursday, September 07, 2017

try to be consistent

Worth noting the lesson I learned today: be consistent about chromosome sort order.

I got the same error message as the post:

In short, the chromosome order in my sorted BAM file's header is chr10, chr11, chr12, ..., chr19, chr1, chr20, chr21, chr22, chr2, chr3, ..., chrM, chrX, chrY. The order is determined by the aligner, which generates the @SQ headers based on how the reference sequences are arranged in the FASTA file from which you built the aligner indices. That is, the order in @SQ is the same as the order in your genome.fa.fai (index) file.
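To put an interval file into the same chromosome order as the reference (and hence the BAM header), one approach is to rank chromosomes by their position in the .fai. A sketch with toy data; real file names will differ:

```shell
# toy index (chromosomes in reference order) and a shuffled BED file
printf 'chr1\t1000\nchr2\t900\nchrM\t16571\n' > genome.fa.fai
printf 'chrM\t5\t10\nchr1\t7\t9\nchr2\t1\t4\nchr1\t2\t3\n' > input.bed

# prepend each line's chromosome rank from the .fai, sort by rank then start, drop the rank
awk 'NR==FNR { rank[$1] = FNR; next } { print rank[$1] "\t" $0 }' genome.fa.fai input.bed |
    sort -k1,1n -k3,3n | cut -f2-
rm -f genome.fa.fai input.bed
```

If I remember its options right, bedtools sort -faidx genome.fa.fai does the same job.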

Thursday, July 20, 2017

Educational papers in Bioinformatics from Nature Methods and Nature Biotechnology

Nature Methods and Nature Biotechnology have run superb monthly bioinformatics columns for years. Nature Methods has "Points of Significance", covering topics in statistics, and "Points of View", on data visualization; Nature Biotechnology's Primer column covers a collection of bioinformatics topics, such as "What is a hidden Markov model?". 

Today I've compiled all the articles from the three columns above into one PDF. Here is the link