Tuesday, February 16, 2016

My 15 practical tips for a bioinformatician

The tips below are based on lessons I learnt from making mistakes during my years of research. They are purely personal opinions, and the order doesn't mean anything. If you think I should include something else, please comment below.
  1. Always set a seed when you run tools with a random component, e.g. bedtools shuffle, random etc.; you (or your boss one day) will want your work to be reproducible. 
  2. Set your own temporary folder (via --tmp, $TMPDIR etc., depending on your program). By default, many tools, e.g. sort, use the system /tmp as the temporary folder, which may have a limited quota that is not enough for your big NGS data. 
  3. Always use valid names for your variables and for the columns and rows of a data frame. Otherwise it can cause unexpected problems, e.g. a '-' in a name will be converted to '.' in R unless you specify check.names=F.
  4. Always make a README file for each data folder; for why and how, read this: http://stackoverflow.com/questions/2304863/how-to-write-a-good-readme
  5. Always comment your code properly, for yourself and for others, as you very likely will read your ugly code again.
  6. Always back up your code regularly, using GitHub, SVN, Time Machine, or simply copy/paste, whatever works.
  7. Always clean up intermediate or unnecessary data, as you can easily shock your boss (and yourself) with how much data you generate, most of it probably useless. 
  8. Don't save to *.sam if you can use *.bam. Always compress your fastq (and other large plain-text files) as much as you can. The same goes for any other format that has a compressed alternative; you cannot imagine how much data (aka "digital garbage") you will soon generate.
  9. Use parallelization as much as you can, e.g. "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting (https://www.biostars.org/p/66927/), as multi-core CPUs/GPUs are common nowadays.
  10. When a project is completed, remember to clean up your project folder, incl. removing unnecessary code/data/intermediate files, and burn a CD or host the project somewhere in the cloud. You never know when you, your boss or your collaborators will need the data again.
  11. Make your code as sustainable as you can. Remember the 3 major features of OOP: Inheritance, Encapsulation, Polymorphism. (URL)
  12. When you learn some tips from others by Google search, remember to list the URL for future reference and also for acknowledging others' credit. This applies to this post, of course :)
  13. Keep learning, otherwise you will soon fall behind. Just like NGS techniques, computing skills and tools evolve quickly; always catch up with the new ones.
  14. When you learn tips from others, remember to share what you learned with the community as well, as that's how the community grows healthily. 
  15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians, who usually sit in front of a computer for hours. Only good health can sustain a long career. More reading: https://www.washingtonpost.com/news/wonk/wp/2015/06/02/medical-researchers-have-figured-out-how-much-time-is-okay-to-spend-sitting-each-day/
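A couple of the tips above (seeding and parallel sorting) can be sketched directly in shell. This is only an illustration: bash's RANDOM stands in for a tool's seed option (e.g. bedtools shuffle -seed), and the sort flags are GNU-specific.

```shell
# Tip 1: the same seed gives the same "random" draw, so results are reproducible
RANDOM=42; a=$RANDOM
RANDOM=42; b=$RANDOM
[ "$a" = "$b" ] && echo "same seed, same draw"

# Tips 2 & 9: point sort at your own temp dir and use several cores (GNU sort)
mkdir -p ./mytmp
printf '3\n1\n2\n' | LC_ALL=C sort -T ./mytmp --parallel=2
```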

Thursday, January 21, 2016

Consecutive tabs in indentation can cause problems

I tested the following code in the terminal and got an error at the 2nd echo: instead of moving forward, the shell asked to list all files in the current folder, e.g. “Display all 148 possibilities? (y or n)”. I could not see any error in the code.

for i in 1 2 3 4 5
do
    echo $i
    for j in eRNA promoter exon random
    do
        echo $j
    done
done

I figured it out when I displayed all the whitespace characters, as shown below:

[screenshot: the same code with whitespace characters made visible]

Do you see the difference?

The “working” code, which I copied from the Linux documentation (http://www.tldp.org/LDP/abs/html/nestedloops.html), uses different characters for the nested indentation (-> .. means a tab followed by two spaces) than the two consecutive tabs in the “not working” part (which I typed in RStudio). I realized that this is because of the tab-autocompletion feature in the shell: pressing tab a second time tells the shell to list all matching files.

To avoid the problem, simply enable the “Insert spaces for tab” option in your editor. In RStudio, which I used, the option is under Tools > Global Options > Code.
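To see the invisible difference yourself, cat -A makes tabs visible as ^I and line endings as $ (loop.sh below is a throwaway example file):

```shell
# write a loop indented with a real tab, then make the whitespace visible
printf 'for i in 1 2 3\ndo\n\techo $i\ndone\n' > loop.sh
cat -A loop.sh   # the tab shows up as ^I, each line ends with $
bash loop.sh
```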

Tuesday, January 12, 2016

Calculate the odds of winning Powerball in R

This Wednesday’s Powerball grand prize has already climbed to $1.5 BILLION. If you choose to cash out, it would be $930 million. And it keeps increasing…
So, what are the odds of winning the jackpot?
Here is the game rule according to Powerball.com:

…we draw five white balls out of a drum with 69 balls and one red ball
out of a drum with 26 red balls.

We can calculate the total number of different combinations in R:

> choose(69,5)*26
[1] 292201338

If we are super, super lucky and win the jackpot of $930 million cash value, given that we have to pay 39.6% federal tax, how much do we expect in return for a $2 investment? (Of course, everyone expects to win the $1.5 billion jackpot.)

> 930e6*(1-0.396)/(choose(69,5)*26)
[1] 1.92

Actually, it’s not a good investment. (Thanks for the comment below; I made a mistake earlier: $930 million should be 930e6, not 930e9.)
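As a sanity check outside R, the same combination count can be reproduced in shell with a small awk choose() function (a sketch; awk has no built-in binomial coefficient):

```shell
# choose(n,k) computed incrementally; each intermediate product stays exact
awk 'function choose(n,k,  r,i){r=1; for(i=1;i<=k;i++) r=r*(n-k+i)/i; return r}
BEGIN{print choose(69,5)*26}'
# -> 292201338
```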
If we want a 100% guarantee, we have to buy all 292 million combinations. In that case, can we profit?
[image: Powerball prize tiers]

# this is what we pay
> choose(69,5)*26*2
[1] 584402676

# this is what we earn in total, before tax
> 930000000 + 1000000*choose(25,1) + 50000*choose(5,4)*choose(69-5,1) + 100*choose(5,4)*choose(69-5,1)*choose(25,1) + 100*choose(5,3)*choose(69-5,2) + 7*choose(5,3)*choose(69-5,2)*choose(25,1) + 7*choose(5,2)*choose(69-5,3) + 4*choose(5,1)*choose(69-5,4) + 4*choose(69-5,5)
[1] 1023466048

# Over $1 billion in total! But we need to pay ~40% tax. Maybe not on the minor prizes, but let's simplify and apply it to all.
> 1023466048 * (1-0.396)
[1] 618173493

That’s still more than what we paid. Why don’t we do that?
Remember, the prize is shared if multiple people pick the same winning numbers, and we don't know how many will. :D

Just some fun! :)

1. Fascinating Math Behind Why You Won’t Win Powerball
2. Tax for lottery (http://classroom.synonym.com/much-federal-taxes-held-lottery-winnings-20644.html)

Wednesday, December 09, 2015

Dopamine is the "Kim Kardashian of molecules" in the brain -- I like this!

The main dopaminergic pathways of the human brain. (Wikipedia)

It's also interesting to learn about the impressive people in the Kardashian family. 

Kim Kardashian's stepfather, Bruce Jenner (now Caitlyn Jenner) 

  1. http://www.slate.com/articles/health_and_science/science/2013/07/what_is_dopamine_love_lust_sex_addiction_gambling_motivation_reward.html
  2. http://www.theguardian.com/science/2013/feb/03/dopamine-the-unsexy-truth
  3. http://neuroskeptic.blogspot.co.uk/2010/05/do-it-like-you-dopamine-it.html

SNP allele coding schema

There are (so far) three different SNP allele coding schemes. No idea why scientists have to make things so confusing sometimes...

Here you go:
  • TOP/BOT from Illumina: Simply put, for unambiguous cases (incl. A/G, A/C, T/G, T/C), A is always allele A (or allele 1) and lies on the TOP strand, while T is also allele A (or allele 1) but lies on the BOT strand; the other variant is allele B (or allele 2). For ambiguous cases (e.g. A/T and G/C), the strand is easy to assign, but alleles A and B are ambiguous, so Illumina uses 'sequence walking' to assign alleles A/B based on the first unambiguous pair in the flanking sequence. A detailed explanation is in their tech doc
  • Forward/Reverse for dbSNP: dbSNP uses the orientation determined by mapping the SNP flanking sequence (of the longest submission) to a contig sequence using BLAST. A refSNP can have multiple submissions, each of which may have its own orientation depending on whether its flanking sequence comes from the plus or minus strand. The submission with the longest flanking sequence is taken as the instantiated (exemplar) sequence for the refSNP during the BLAST analysis for the current build. So, the strand of a dbSNP refSNP is determined by (1) the sequence of the longest submission (the exemplar), and (2) the strand of the contig it maps to. A good example can be found here. See more details here:  http://www.ncbi.nlm.nih.gov/books/NBK44455/#Build.define_the_term__orientation_as_us
  • Plus/Minus strand for Genome Annotation, HapMap, 1000 Genome Project: The plus (+) strand is defined as the strand with its 5′ end at the tip of the short arm (see Ref #3 below). SNP alleles reported on the same strand as the (+) strand are called ‘plus’ alleles and those on the (−) strand are called ‘minus’ alleles. This also applies to the imputed results from 1000 Genome, e.g. a0 is for REF (reference) allele and a1 for ALT (alternate) allele
Several notes:
  1. dbSNP does not use TOP/BOT schema: http://www.ncbi.nlm.nih.gov/books/NBK44393/#Submit.does_dbsnp_use_topbot_nomenclatur_1
  2. Minor/major alleles are just relative terms: they can change between populations/datasets. So it really makes no sense to assume that allele 2 is the minor allele, that G in A>G is the minor allele, and so on. The minor allele can even be the reference allele. 
  3. In the Illumina genotyping final report (e.g. "FinalReport.txt"), use "Allele1 - Forward" and "Allele2 - Forward" to construct .lgen file for plink. See post: https://www.biostars.org/p/16982/ (need to confirm)
  4. SNPs on the dbSNP reverse strand need to be flipped before merging with imputed SNPs, using the --flip option in plink.
  5. Allele 1 (in the allele 1/2 setting) needs to be forced to the reference allele before merging with imputed SNPs, using the --reference-allele option in plink.
  1. https://triticeaetoolbox.org/wheat/snps.php
  2. http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
  3. http://www.sciencedirect.com/science/article/pii/S0168952512000704
  4. https://mathgen.stats.ox.ac.uk/impute/1000GP%20Phase%203%20haplotypes%206%20October%202014.html
  5. http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook
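A side note on note #4: flipping a SNP to the other strand is just taking the complement of its alleles, which a one-liner can illustrate:

```shell
# complement an A/G SNP to the other strand: A/G becomes T/C
echo "AG" | tr 'ACGT' 'TGCA'
# -> TC
```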

Friday, December 04, 2015

My note on multiple testing

It's no shame to take notes on something (probably) everyone knows, that you thought you knew, but are actually not 100% sure about. Multiple testing is such a piece in my knowledge map.

Some terms first:
- Type I error (false positive) and Type II error (false negative): 
When we do a hypothesis test, we can categorize the result into the following 2x2 table:
Table of error types:

                       | H0 is true               | H0 is false
Reject H0              | Type I error             | Correct inference
                       | (False Positive)         | (True Positive)
Fail to reject H0      | Correct inference        | Type II error
                       | (True Negative)          | (False Negative)
A Type I error is "rejecting a true thing". If the true thing is a null hypothesis (H0), which is what people usually assume (e.g. no difference, no effect), then rejecting it (claiming there is a difference) when it is actually true is a false positive. The same logic gives the Type II error, or false negative.

Also note that people use the Greek letter α for the type I error rate and β for the type II error rate. α is also the significance level of a test, e.g. 5%. So when a single test reaches p-value 0.05, we can intuitively understand that there is a 5% chance we made a mistake, or that 5% of cases we thought significant actually are not. β relates to the power of a test: power = the ability to detect true positives among all real positive cases = 1 − β.

- Sensitivity and Specificity
Total tests (m)        | H0 is true | H0 is false
Reject H0 (R)          | V          | S
Fail to reject H0      | U          | T
Sensitivity = S / (S+T)  = power = 1-β
Specificity = U / (U+V) = 1-α

- Why multiple testing matters?
It matters because we usually perform the same hypothesis test not just once, but many, many times. If the chance of making an error in a single test is α, then the chance of making one or more errors in m independent tests is
Pr(at least one error) = 1 − (1 − α)^m
So, when m is large, this chance approaches 100%. That's why we need to adjust the p-values for the number of hypothesis tests performed, i.e. control the type I error rate.
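Plugging in α = 0.05 shows how quickly this approaches 1 (a quick awk check):

```shell
# chance of at least one false positive over m tests at alpha = 0.05
awk 'BEGIN{a=0.05; for (m=1; m<=100; m*=10) printf "m=%d: %.3f\n", m, 1-(1-a)^m}'
# -> m=1: 0.050
#    m=10: 0.401
#    m=100: 0.994
```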

- How to control the type I error rate in multiple testing?
There are many different ways to control the type I errors, such as
Per comparison error rate (PCER): the expected value of the number of Type I errors over the number of hypotheses, PCER = E(V)/m
Per-family error rate (PFER): the expected number of Type I errors, PFER = E(V).
Family-wise error rate (FWER): the probability of at least one type I error, FWER = P(V ≥ 1)
False discovery rate (FDR) is the expected proportion of Type I errors among the rejected hypotheses, FDR = E(V/R | R>0)P(R>0)
Positive false discovery rate (pFDR): the rate that discoveries are false, pFDR = E(V/R | R > 0)

- Controlling Family-Wise Error Rate
Many procedures have been developed to control the family-wise error rate P(V ≥ 1), including Bonferroni, Holm (1979), Hochberg (1988), and Šidák. They come in two types: single-step (e.g. Bonferroni) and sequential (e.g. Holm or Hochberg). The Bonferroni correction controls the overall type I error when all tests are independent: it rejects any hypothesis with p-value ≤ α/m. Equivalently, simply multiply each nominal p-value by m to get the adjusted p-values. In R, it's the following function:
p.adjust(p, method = "bonferroni")
Sequential corrections are slightly more powerful than the Bonferroni test. The Holm step-down procedure is the easiest to understand. Say you have a thousand p-values. First, sort them from low to high. Multiply the smallest p-value by one thousand; if that adjusted p-value is less than 0.05, that gene shows evidence of differential expression (for this gene there is no difference from the Bonferroni correction). Then multiply the 2nd smallest p-value by 999 (not one thousand) and see if it is less than 0.05; multiply the third smallest by 998, the fourth smallest by 997, etc., and compare each adjusted p-value to 0.05. Finally, ensure that every adjusted p-value is at least as large as any preceding adjusted p-value; if it is not, set it equal to the largest of the preceding ones. That is the algorithm of the Holm step-down procedure. In R, it's
p.adjust(p, method = "holm")

- Controlling FDR
FWER is appropriate when you want to guard against ANY false positives. However, in many cases (particularly in genomics) we can live with a certain number of false positives. In these cases, the more relevant quantity to control is the false discovery rate (FDR), which is designed to control the proportion of false positives (V) among the set of rejected hypotheses (R). FDR control has generated a lot of interest due to its more balanced trade-off between error rate control and power compared with traditional family-wise error rate control.

Procedures controlling FDR include Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001), Benjamini & Hochberg (2000) and two-stage Benjamini & Hochberg (2006).

Here are the steps for Benjamini & Hochberg FDR:
1. sort nominal p-values from small to big: p1 ≤ p2 ≤ … ≤ pm
2. find the highest rank j with pj ≤ (j/m) × δ, where δ is the controlled FDR level. 
3. declare the tests of rank 1, 2, …, j significant; the adjusted p-value of the test at rank i is pi × m/i. 
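The three steps can be sketched in awk with a handful of hypothetical p-values (δ = 0.05):

```shell
printf '0.001\n0.008\n0.039\n0.041\n0.042\n0.060\n0.074\n0.205\n' |
sort -g |
awk -v d=0.05 '{p[NR]=$1} END{
  m=NR; j=0
  # highest rank j whose p-value passes p_j <= (j/m)*delta
  for(i=1;i<=m;i++) if(p[i] <= (i/m)*d) j=i
  print "tests of rank 1.." j " are significant"
}'
# -> tests of rank 1..2 are significant
```

Note that ranks 1 and 2 are called significant even though only rank 1 would survive Bonferroni (0.008 × 8 > 0.05): every test up to the highest passing rank is declared significant.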


Wednesday, November 18, 2015

DNA and RNA structure

Let's review the basics we learned in Molecular Biology! :) Actually, I wonder if someone could design a toy for kids to learn DNA structure, something like a puzzle where each unit is a magnetic piece and pieces can only be assembled/aligned in a certain direction. That would be cool!

Reference: http://passel.unl.edu/pages/informationmodule.php?idinformationmodule=957882007&topicorder=4&maxto=7

Each strand of the DNA double helix is a chain of bases linked to a sugar-phosphate backbone; the two strands are connected by hydrogen bonds, as below:
The phosphate group links the 5th carbon of an upstream sugar to the 3rd carbon of a downstream sugar. The 5-carbon sugar is called deoxyribose in DNA and ribose in RNA; the two can be distinguished by the absence or presence of an oxygen atom at the 2' carbon, as highlighted in red in the following figure. 

RNA structure is very similar to that of single-stranded DNA, except that
  • The sugar in RNA is a ribose sugar (as opposed to deoxy-ribose) and has an –OH at the 2' C position highlighted in red in the figure below (DNA sugars have –H at that position)
  • Thymine in DNA is replaced by Uracil in RNA. T has a methyl (-CH3) group instead of the H atom shown in red in U.
Reference: http://tigger.uic.edu/classes/phys/phys461/phys450/ANJUM04/

Now let's see the cap structure of RNA:

A cap structure is most commonly added to the 5' end of RNA, known as mRNA capping. The 5' cap, the brown part on the left in the following figure, is like a guanosine (G) nucleotide but with a methyl group added at position 7 of the guanine. Its 5' end is connected to the 5' end of the RNA via a triphosphate bridge. 

The following figure also contains a 2' cap, which occurs only rarely, in some eukaryotic and viral genomes.

Note that mitochondrial mRNAs don't have a cap.
Reference: https://en.wikipedia.org/wiki/Five-prime_cap

Friday, October 23, 2015

Steps to push code to a github private repository

  • If you are the first time user:
Step 1: Create a new repository on github.com (assuming you already have a GitHub account), private or public. In the example below, the repository is named PREDICT-HD.

Step 2: [optional] Generate a public SSH key and add it to your GitHub account.
Follow this help page exactly: https://help.github.com/articles/generating-ssh-keys/

Step 3: Push the code to GitHub from your local computer:
cd ~/project/PREDICT-HD/src
git init
git add *
git commit -m "my first commit"

git config --global user.name "Your Name"
git config --global user.email you@example.com
git remote add origin git@github.com:sterding/PREDICT-HD.git
git push origin master
  • If a collaborator has already pushed a commit before you push yours:
git add xxx
git commit -m "your commit message"
git pull
git push origin master

Monday, October 05, 2015

A parallel and fast way to download multiple files

We can write a short loop to download multiple files easily on the command line, e.g.

for i in X Y Z; do wget http://www.site.com/folder/$i.url; done

If we want them to run in background (so that in a pseudo-parallel way), we can use -b option for wget.

But this is still not fast enough, and running wget -b in parallel won't give any notice once a download is done.

Here is my solution: axel + parallel

parallel -a urls.file axel

Let's say I want to download all brain sample bigwig files of H3K4me1 marks from the Roadmap Epigenomics project. Here is the code:

mark=H3K4me1
> url.$mark # generate an empty file
for i in E071 E074 E068 E069 E072 E067 E073 E070 E082 E081
do
  echo http://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/pval/$i-$mark.pval.signal.bigwig >> url.$mark
done
parallel -a url.$mark axel -n 5 

Regarding what axel is and how fast it is compared to wget, please refer to this link: http://www.cyberciti.biz/tips/download-accelerator-for-linux-command-line-tools.html
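If axel or GNU parallel isn't available, xargs -P gives a similar pseudo-parallel effect. A sketch with a hypothetical urls.file (swap `echo GET` for `wget -q` in real use):

```shell
# run up to 4 download processes at once, one URL each
printf 'http://example.com/a\nhttp://example.com/b\n' > urls.file
xargs -P 4 -n 1 echo GET < urls.file
```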

Friday, September 11, 2015

PLINK2 vs. SNAP for linkage disequilibrium (LD) calculation

Among the various ways to compute SNPs in linkage disequilibrium (LD), two have been used widely in the literature: PLINK2 and SNAP.

With PLINK2, SNPs in LD can be calculated using parameters like:
plink --r2 dprime --ld-window-kb 1000 --ld-window 10 --ld-window-r2 0.8
SNAP is a web-based service with pre-calculated data: https://www.broadinstitute.org/mpg/snap. You can set the population, distance limit, r2 threshold.

Note that PLINK2 has one more control: --ld-window

In its manual (https://www.cog-genomics.org/plink2/ld), it says: "By default, when a limited window report is requested, every pair of variants with at least (10-1) variants between them, or more than 1000 kilobases apart, is ignored".

That means PLINK2 limits SNP pairs by two kinds of radius: the genomic distance (--ld-window-kb) and the number of SNPs apart (--ld-window), while SNAP only has the distance limit. Both have an r2 threshold.
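The two radii amount to a simple predicate per SNP pair. A sketch with hypothetical index/position values, mirroring --ld-window 10 and --ld-window-kb 1000:

```shell
# i1/i2: variant indices; bp1/bp2: positions. A pair is reported only if
# fewer than (10-1) variants lie between them AND they are <= 1000 kb apart.
awk -v i1=5 -v i2=12 -v bp1=100000 -v bp2=600000 'BEGIN{
  in_window = (i2 - i1 - 1 < 9) && (bp2 - bp1 <= 1000000)
  print (in_window ? "reported" : "ignored")
}'
# -> reported
```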

Sunday, August 16, 2015

Using ANOVA to get correlation between categorical and continuous variables

How to calculate the correlation between categorical variables and continuous variables?

This is the question I was facing when attempting to check the correlation of PEER inferred factors vs. known covariates (e.g. batch).

One solution I found is to use ANOVA and take the R-squared between the categorical input and the continuous output.

Here is my R code snippet:

## correlation of inferred factors vs. known factors
# name PEER factors
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))

# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y ~ x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    f <- summary.lm(aov(y ~ x))$fstatistic
    pf(f[1], f[2], f[3], lower.tail = FALSE)
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")

The core part is the aov() call: we can use R's aov() function to run the ANOVA, and its result can be summarized with the summary.lm() function, which shows output like:

> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

The R-squared and p-value are shown at the end of the output.

Note: the summary.lm() object doesn't contain the p-value directly, but we can compute it from the F statistic (note lower.tail=FALSE, since the p-value is the upper tail):

> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1],F[2],F[3],lower.tail=FALSE)

The table below is a nice summary of the methods applicable to each combination of data types.

X \ Y                       | Categorical                      | Continuous
Categorical                 | Chi-square, log-linear, logistic | t-test, ANOVA, linear regression
Continuous                  | Logistic regression              | Linear regression, Pearson correlation
Mix of categorical & cont.  | Logistic regression              | Linear regression, analysis of covariance

The next thing I need to refresh my mind on is how computing the correlation with cor() differs from the ANOVA method above.

I know the correlation coefficient r can be derived from the regression coefficient and the SDs of the two variables. For example, if we know sd(x) and sd(y), then regressing y ~ x gives a regression line y = b0 + b1*x, and we can calculate r as

r = b1 * SDx / SDy

When x and y follow the standard normal distribution, i.e. mean=0, sd=1, then r = b1.
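The identity can be verified numerically on toy data (an awk sketch, since shell is what this blog mostly uses; working with sums of squares, because the (n-1) factors in the sample SDs cancel):

```shell
awk 'BEGIN{
  split("1 2 3 4 5", x, " "); split("2 4 5 4 6", y, " "); n=5
  for(i=1;i<=n;i++){sx+=x[i]; sy+=y[i]}
  mx=sx/n; my=sy/n
  # sums of squared deviations and cross-products
  for(i=1;i<=n;i++){vx+=(x[i]-mx)^2; vy+=(y[i]-my)^2; cxy+=(x[i]-mx)*(y[i]-my)}
  b1=cxy/vx                      # slope of y ~ x
  r=cxy/sqrt(vx*vy)              # Pearson correlation
  printf "b1*SDx/SDy=%.4f  r=%.4f\n", b1*sqrt(vx)/sqrt(vy), r
}'
# -> b1*SDx/SDy=0.8528  r=0.8528
```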

Wednesday, August 12, 2015

use getline to capture system command output in awk

I just learnt this today: awk also has its own pipe (|) and getline, just like the unix shell. If I want to call a system command in awk and capture its output (note: system() won't work, as it only returns the exit status), I can pipe the output to getline. 

For example,

$cat > test.txt
aa bb cc
11 22 33
44 55 cc

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd | getline a; print a}'


$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; system(cmd);}'

Note that if we only use cmd on its own, it won't print anything, because evaluating the command string by itself doesn't send anything to the console (unlike system()). 

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd;}'

If you want to output the multiple lines and process them in awk, you can do

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | cut -f2 -d\" \""; while( (cmd | getline a) >0) print a;}'

Reference: http://stackoverflow.com/questions/1960895/awk-assigning-system-commands-output-to-variable

Wednesday, August 05, 2015

Dopamine, L-DOPA, catecholamines, TH, Nuur1 etc.

Catecholamine is the collective name for three neurotransmitters: epinephrine (adrenaline), norepinephrine (noradrenaline) and dopamine.

L-DOPA is the precursor of dopamine. The corresponding drug is called levodopa.

L-DOPA is produced from the amino acid L-tyrosine by the enzyme tyrosine hydroxylase (TH).

Tyrosine cannot pass the BBB (blood brain barrier), but L-DOPA can.

L-DOPA can be converted to dopamine, which can be further converted to epinephrine (adrenaline), norepinephrine (noradrenaline).

Dopaminergic neurons can be divided into 11 cell groups according to where they are located, as defined with a histochemical fluorescence method. These include A8-A16, Aaq and the telencephalic group. Among them, A9 is what the substantia nigra pars compacta (SNpc) refers to.

The number of TH-positive neurons in the SN declines with age, while the α-synuclein level increases with age. The same inverse relationship is also seen in the surviving DA neurons of PD patients.

It has also been shown that down-regulation of TH is linked to reduced expression of a transcription factor called Nurr1 (an orphan nuclear receptor, encoded by the gene NR4A2).

Nurr1 is shown to regulate a category of nuclear-encoded mitochondrial genes. (http://www.ncbi.nlm.nih.gov/pubmed/23341612)

Tuesday, July 28, 2015

Note from GEUVADIS paper

Note from GEUVADIS papers:

Lappalainen et al. Nature 2013 : Transcriptome and genome sequencing uncovers functional variation in humans, http://dx.doi.org/10.1038/nature12531
‘t Hoen et al. Nature Biotechnology 2013: Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories, http://dx.doi.org/10.1038/nbt.2702

1. how to detect sample outlier?
a. before alignment: distance of k-mer profile
b. after alignment: Spearman rank correlation between samples --> D-statistics (i.e.  the median correlation of one sample against all the other samples)
c. gender mismatch: XIST vs. chrY
d. ASE bias rate among heterozygous sites

2. eQTL
a. exon/gene quantification
b. filter out lowly expressed ones (e.g. 0 in >50% samples)
c. for each group, normalize with PEER, adding mean
    c1. use subset (??e.g. chr20, or chr20- 22??) using K=????0,1,,3,,57,10,13,15,20 for each dataset
    c2. run eQTL and number of genes (eGenes) for each K.
    c3. get the optimal K = K(with most number of eQTL genes)
    c4. run PEER on 20,000 exons to get covariates for the final normalization
    c5. final PEER normalization using all dataset, residual + mean as final quantification
d. transform the final quantification to standard normal distribution (by ?)
e. eQTL using Matrix-eQTL: linear regression of quantification ~ genotypes + genotype_covariates
3. Differential expression analysis
a. TMM normalization (from edgeR)
b. filter: genes with more than 5 counts per million in at least 1 sample were analyzed in pairwise population comparisons
c. tweeDEseq (good for large samples), significance: FDR < 0.05 and log2 fold change greater than 2

Wednesday, June 24, 2015

median filter in AWK

Here is what median filter does from wikipedia:
To demonstrate, using a window size of three with one entry immediately preceding and following each entry, a median filter will be applied to the following simple 1D signal:
x = [2 80 6 3]
So, the median filtered output signal y will be:
y[1] = Median[2 2 80] = 2
y[2] = Median[2 80 6] = Median[2 6 80] = 6
y[3] = Median[80 6 3] = Median[3 6 80] = 6
y[4] = Median[6 3 3] = Median[3 3 6] = 3
i.e. y = [2 6 6 3].
Note that, in the example above, because there is no entry preceding the first value, the first value is repeated, as with the last value, to obtain enough entries to fill the window. This is one way of handling missing window entries at the boundaries of the signal.
Here is my awk code to implement this:
#!/usr/bin/gawk -f
# gawk script to filter noise by sliding a window and taking the mean or median per window
# (asort() requires gawk)
# Author: Xianjun Dong
# Date: 2015-06-23
# Usage: _filter.awk values.txt
# bigWigSummary input.bigwig chr10 101130293 101131543 104 | _filter.awk -vW=5 -vtype=median
BEGIN{
  if(W=="") W=5;
  if(type=="") type="median";
  half=int(W/2);
}
{
  for(i=1;i<=NF;i++) {
    # fill the window, repeating the first/last value at the boundaries
    for(j=1;j<=W;j++) {k=i-half+j-1; if(k<1) k=1; if(k>NF) k=NF; array[j]=$k;}
    if(type=="median") {asort(array); x=array[half+1];}
    if(type=="mean") {x=0; for(j=1;j<=W;j++) x+=array[j]; x=x/W;}
    printf("%s\t", x);
  }
  print "";
}
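The wiki example can also be reproduced with a standalone one-liner (a sketch hard-coded to W=3, repeating the edge values at the boundaries):

```shell
echo "2 80 6 3" | awk '{
  for(i=1;i<=NF;i++){
    a=(i-1<1)?$1:$(i-1)       # left neighbor, repeating the first value at the edge
    b=$i
    c=(i+1>NF)?$NF:$(i+1)     # right neighbor, repeating the last value at the edge
    # median of three: the value that is neither the min nor the max
    if((a<=b&&b<=c)||(c<=b&&b<=a)) m=b
    else if((b<=a&&a<=c)||(c<=a&&a<=b)) m=a
    else m=c
    printf "%s ", m
  }
  print ""
}'
# -> 2 6 6 3
```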

Friday, May 22, 2015

Be cautious of using grep --colour=always

I noticed that grep --colour=always embeds additional escape codes (for coloring; for example, ^[[01;31m^[[K    ^[[m^[[K) in your text, which can lead to downstream confusion. Here is one such example:

$ echo 1 2 3 | grep 2
1 2 3
$ echo 1 2 3 | grep 2 | cat -v
1 2 3
$ echo 1 2 3 | grep --colour=always 2 
1 2 3
$ echo 1 2 3 | grep --colour=always 2 | cat -v
1 ^[[01;31m^[[K2^[[m^[[K 3

See my answer for more detail.

The solution is to use --colour=auto or not to use the --colour option at all.

$ echo 1 2 3 | grep --colour=auto 2 
1 2 3
$ echo 1 2 3 | grep --colour=auto 2 | cat -v
1 2 3

Check your .bashrc to see if you set alias grep='grep --colour=always' like I did.
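If colored output has already polluted a file, the escape sequences can be stripped afterwards (a sketch; the printf simulates grep --colour=always output):

```shell
# build a colored line like grep emits, then delete every ESC[...m / ESC[...K code
ESC=$(printf '\033')
printf '1 \033[01;31m\033[K2\033[m\033[K 3\n' | sed "s/${ESC}\[[0-9;]*[mK]//g"
# -> 1 2 3
```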

Thursday, May 21, 2015

Which version should I install? 32 bit or 64 bit? Ubuntu or CentOS?

I previously posted two relevant articles on this topic:

Unix vs. Linux
Which version of tools I should install? i386 vs. x86_64, and 32bit vs. 64bit kernel (http://onetipperday.blogspot.com/2013/02/software-vs-hardware-i386-vs-x8664-and.html)

However, this can still be confusing sometimes... for example, when you visit the SRA toolkit download page: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

OK. Here is the short answer.

1. How to tell whether my CPU architecture is 32-bit or 64-bit?

$ uname -a

If the output contains i386 or i586, it's 32-bit; if it contains x86_64, it's 64-bit.

Ref: http://www.cyberciti.biz/faq/linux-how-to-find-if-processor-is-64-bit-or-not/

2. How to tell whether my OS (operating system) kernel is 32-bit or 64-bit?

Note that the OS kernel can differ from the hardware architecture above. Usually the OS version follows the hardware, but a 32-bit kernel can run on a 64-bit machine (not the other way around), so the two might not be consistent. The MacBook Pro I am using runs a 32-bit kernel on a 64-bit processor (see my previous post). 
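On Linux, two quick commands answer questions 1 and 2:

```shell
uname -m          # hardware architecture, e.g. x86_64 or i686
getconf LONG_BIT  # word size of the kernel/userland: 32 or 64
```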

3. How to tell which OS I am using? CentOS or Ubuntu?

In my case, I used:
$ ls -d /etc/[A-Za-z]*[_-][rv]e[lr]* | grep -v "lsb" | cut -d'/' -f3 | cut -d'-' -f1 | cut -d'_' -f1

it returns redhat, the distribution family that CentOS is derived from.

## For more detail, please refer to terdon's answer at http://askubuntu.com/a/459425

Friday, May 15, 2015

How to correctly set color in the image() function?

Sometimes we want to make our own heatmap using the image() function. I recently found it tricky to set the color option there, as the manual has very little information on col:

I posted my question on BioStars. The short answer is: unless breaks is set, the range of Z is evenly cut into N intervals (where N = the number of colors) and each value in Z is assigned the color of the corresponding interval.

For example, when x=c(3,1,2,1) and col=c("blue","red",'green','yellow'), the minimum of x is assigned the first color and the maximum the last color. Any value in between maps proportionally to a color. In this case, 2 is in the middle; according to the principle that intervals are closed on the right and open on the left, it's assigned to "red". So that's why we see the colors yellow-->blue-->red-->blue.


image(1,1:length(x), matrix(x, nrow=1, ncol=length(x)), col=c("blue","red",'green','yellow'))
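The right-closed interval rule can be mimicked outside R as well (an awk sketch of the same mapping; the clamp handles the minimum, which sits on no interval's open left edge):

```shell
awk 'BEGIN{
  n=split("3 1 2 1", x, " "); split("blue red green yellow", col, " "); N=4
  min=1; max=3
  for(i=1;i<=n;i++){
    v=(x[i]-min)/(max-min)*N      # position among N right-closed intervals
    c=int(v); if(v>c) c++         # ceiling -> interval index
    if(c<1) c=1                   # the minimum itself goes to the first color
    printf "%s ", col[c]
  }
  print ""
}'
# -> yellow blue red blue
```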

In practice, unless we want to manually define the color break points, we just supply a color ramp (e.g. built with colorRampPalette() from a first and last color), and image() automatically finds a color for each value in Z.

# collist: a vector of colors, e.g. colorRampPalette(c("blue","yellow"))(100)
image(1:ncol(x), 1:nrow(x), as.matrix(t(x)), col=collist, asp=1)

If we want to manually define the color break points, we can clip x into [xmin, xmax] and index into a long color ramp:

xmin=0; xmax=100;
x[x<xmin]=xmin; x[x>xmax]=xmax;  # clip values outside the range
ColorRamp <- colorRampPalette(c("blue","yellow"))(10000)  # e.g. a 10000-step ramp
ColorLevels <- seq(from=xmin, to=xmax, length=10000)
# take the sub-ramp that covers the actual range of x
ColorRamp_ex <- ColorRamp[round(1+(min(x)-xmin)*10000/(xmax-xmin)) : round((max(x)-xmin)*10000/(xmax-xmin))]
par(mar=c(2,0,2,0), oma=c(3,3,3,3))
# the heatmap itself
image(t(as.matrix(x)), col=ColorRamp_ex, las=1, xlab="", ylab="", cex.axis=1, xaxt="n", yaxt="n")
# the color key
image(as.matrix(ColorLevels), col=ColorRamp, xlab="", ylab="", cex.axis=1, xaxt="n", yaxt="n")

Friday, May 08, 2015

Tips and Tools you may need for working on BIG data

Nowadays everyone is talking about big data. As a genomic scientist, I often feel hungry for a collection of tools specialized for the medium-to-big data we deal with every day.

Here are some tips I found useful when getting, processing, or visualizing large data sets:

1. How to download data faster than wget?

We usually use wget to download data to the local disk. For large files, faster alternatives such as axel or aria2 can open multiple connections per download.
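A small wrapper that falls back gracefully; the aria2c -x/-s flags (connections per server / splits) and axel -n are taken from their manuals, so verify them locally:

```shell
# Hedged sketch: use the fastest downloader available, falling back to wget.
fetch() {
  url="$1"
  if command -v aria2c >/dev/null 2>&1; then
    aria2c -x 8 -s 8 "$url"        # 8 connections per server, 8 splits
  elif command -v axel >/dev/null 2>&1; then
    axel -n 8 "$url"               # 8 parallel connections
  else
    wget -c "$url"                 # -c resumes a partial download
  fi
}
# usage: fetch http://example.org/big.fastq.gz
```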


2. Process the data in parallel with hidden option in GNU commands

  • If you have many files to process and they are independent, you can process them in parallel. GNU has a command called parallel. Pierre Lindenbaum wrote a nice notebook, "GNU Parallel in Bioinformatics", worth reading. 
  • Many commonly used commands also have a hidden option to run in a parallel way. For example, GNU sort command has --parallel=N to set it with multiple cores. 
  • You can add -F (fixed strings) when running grep -f with a large pattern file. People also suggest setting export LC_ALL=C to get roughly a 2x speedup.
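As a concrete sketch of per-file parallelism: gzip many files with 4 parallel jobs. GNU parallel would be `parallel gzip ::: *.fastq`; xargs -P is a portable fallback available on any POSIX-ish system (the demo_fastq directory and file names are made up for the example):

```shell
# Create four tiny demo FASTQ files, then compress them 4 at a time.
mkdir -p demo_fastq
for i in 1 2 3 4; do
  printf '@read1\nACGT\n+\nIIII\n' > "demo_fastq/sample$i.fastq"
done
ls demo_fastq/*.fastq | xargs -n 1 -P 4 gzip   # -P 4: up to 4 gzip jobs at once
ls demo_fastq                                  # sample1.fastq.gz ... sample4.fastq.gz
```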

3. In R, there are several must-have tips for large data, e.g. data.table
  • If using read.table(), set stringsAsFactors = F and colClasses. See the example here
  • use fread(), not read.table(). Some more details here. So far, fread() does not support reading *.gz files directly; use fread('zcat file.gz') instead.
  • use data.table rather than data.frame. Learn the difference online here.
  • There is a nice CRAN Task View on processing data in parallel in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html, but I have not tried them in practice. Hopefully there will be some easy tutorials, or I'll become less procrastinating and learn some of them ... At least I can start with foreach()
  • http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r
4. How to open scatter plot with too many points in Illustrator?

This is really a problem for me, as we usually have figures with >30k dots (each dot is a gene). Even though the dots heavily overlap each other, opening the file in Illustrator is extremely slow. Here is a tip: http://tex.stackexchange.com/questions/39974/problem-with-a-very-heavy-eps-image-scatter-plot-too-heavy-as-eps
From that, probably a better idea is to "compress" the data before plotting, e.g. merging dots that overlap by more than some percentage.
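A minimal sketch of that idea, assuming a whitespace-separated x/y file (points.txt here is a made-up example): round coordinates to 2 decimals and drop duplicates, so thousands of overlapping dots collapse into one drawable point each.

```shell
# Three points, two of which overlap at 2-decimal resolution.
printf '1.001 2.002\n1.002 2.001\n3.5 4.5\n' > points.txt
# Keep only the first point seen in each rounded cell.
awk '{ k = sprintf("%.2f %.2f", $1, $2); if (!(k in seen)) { seen[k]; print k } }' points.txt
# prints:
# 1.00 2.00
# 3.50 4.50
```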

Still working on the post...

Thursday, May 07, 2015

A clarification on the Illumina TruSeq Small RNA prep kit manual

In the TruSeq® Small RNA Library Prep Guide, below Figure 1, there is a sentence: "The RNA 3' adapter is modified to target microRNAs and other small RNAs that have a 3' hydroxyl group resulting from enzymatic cleavage by Dicer or other RNA processing enzymes." It's correct, but could be very misleading if you are not aware of the diversity of the transcriptome (scroll down for more detail). I want to emphasize that the 3' hydroxyl group (and the 5' phosphate group) is NOT specific to microRNAs or any small RNAs, and it doesn't necessarily result from enzymatic cleavage by Dicer. Sonication can also break a full-length mRNA (with 5' cap and 3' polyA) into truncated RNA pieces with free 5'-phosphate and 3'-hydroxyl ends. I called Illumina to confirm that the 3' and 5' ligation steps don't guarantee the selection of miRNAs (more accurately, they select any RNAs with 5'-phosphate and 3'-hydroxyl ends). The last step of gel purification is the key to selecting (or, more accurately, enriching) miRNAs.

OK. Here is what I learned from my colleagues about the different RNA species in the transcriptome:

There are 4 species of 5' ends in the transcriptome, where the latter 3 are intermediates of transcription (or partial products of degradation).
  • me7Gppp-------------------------3' (1) 
  •                 p------------- 3' (2) 
  •                 OH-------------3' (3) 
  •        ppp---------------------3' (4) 
Only group (2) will ligate to the 5' adapter. The 3' end can also have different forms, at least two:
  • 5' ----------------- AAAAAA (1) 
  • 5' ---------- OH (2) 
Also note there are two enzymes used to modify the 5' ends: CIP and TAP. CIP (calf intestinal alkaline phosphatase) removes the 5' phosphate group. TAP (tobacco acid pyrophosphatase) removes the 5' cap structure (the 5'-5' triphosphate linkage) and leaves a mono-phosphate at the 5' end. So, applying CIP first and then TAP will convert (2) and (4) above to (3), then convert (1) to (2). That's one way to capture the 5' cap structure, with the same purpose as CAGE (though CAGE uses the type IIs restriction enzyme MmeI and the type III restriction enzyme EcoP15I)