Wednesday, November 09, 2016

IBS vs. IBD

Regarding the question of what IBD and IBS are, here is my understanding:

A SNP (or a segment of DNA) is identical by state (IBS) in two or more individuals if they carry the same allele. For example, if person A has genotype A/C at a SNP position while person B has A/T at that position, then that SNP position has IBS=1 (one shared allele). IBD (identical by descent) is IBS where the shared allele was inherited from a common ancestor. In my example above, if A and B are siblings whose parents have genotypes A/C (mom) and A/T (dad), then we don't know whether the shared "A" allele in A and B came from mom or dad. In that case, it's just IBS, not necessarily IBD. But if their parents' genotypes are A/G (mom) and C/T (dad), then we know almost for sure that the "A" came from mom, not from dad, so it's both IBD and IBS. In practice, for a pair of individuals whose parents are not genotyped, we can estimate the IBD likelihood at each SNP from the IBS information; I think the math behind it is Bayes' theorem.

PLINK uses IBS-based clustering to remove sample outliers (samples whose IBS distance from the rest is too large); see here: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml
PLINK uses IBD to detect samples that are too close to each other (e.g. relatives, contamination); see here: http://pngu.mgh.harvard.edu/~purcell/plink/ibdibs.shtml

Below is a nice piece on how to use IBD to detect relatives, from Anderson, C. et al. Data quality control in genetic case-control association studies. Nat Protoc. 5, 1564–1573 (2010):
The expectation is that IBD = 1 for duplicates or monozygotic twins, IBD = 0.5 for first-degree relatives, IBD = 0.25 for second-degree relatives and IBD = 0.125 for third-degree relatives. Owing to genotyping error, LD and population structure, there is often some variation around these theoretical values and it is typical to remove one individual from each pair with an IBD value of > 0.1875, which is halfway between the expected IBD for third- and second-degree relatives. For these reasons an IBD value of > 0.98 identifies duplicates.
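As a practical sketch of that rule in R: PLINK's --genome report stores the genome-wide IBD estimate in the PI_HAT column, so related pairs can be flagged in a few lines (a minimal sketch; the default output name plink.genome is assumed):

g <- read.table("plink.genome", header = TRUE)  # PLINK --genome output, default name
related    <- subset(g, PI_HAT > 0.1875)  # closer than halfway between 2nd/3rd degree
duplicates <- subset(g, PI_HAT > 0.98)    # duplicates or monozygotic twins
related[, c("FID1", "IID1", "FID2", "IID2", "PI_HAT")]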

Another review on this topic is also worth reading: Joseph E. Powell, Peter M. Visscher & Michael E. Goddard. Reconciling the analysis of IBD and IBS in complex trait studies. Nature Reviews Genetics 11, 800–805 (2010)


The following slide (from http://ibgwww.colorado.edu/workshop2005/cdrom/ScriptsA/evans/IBDestimation/IBD--2005.pdf) may make IBD vs. IBS easier to understand:

Friday, July 08, 2016

Best way to draw heatmap for publication

Here are two tips I can share if you are also trying to turn a big dataset into a high-quality heatmap:

1. Don't generate a PDF with pheatmap() or heatmap.2(), because (i) the file is unnecessarily SUPER large when the heatmap has many data points, which can kill your Illustrator; and (ii) annoying grey boxes get added to the grid (see here). Instead, use the basic image() function with zero margins (e.g. par(mar=c(0,0,0,0))) to generate a high-resolution PNG (or JPEG, TIFF) and place it in Adobe Illustrator, where you can easily add legends and annotations.

2. When you use image(), rotate your matrix 90 degrees clockwise first, so that the conventional printed layout of the matrix matches what you see in the image (e.g. top-left corner --> x[1,1] of your matrix). An elegant one-liner for clockwise rotation can be found here:
rotate <- function(x) t(apply(x, 2, rev))
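Putting both tips together, here is a minimal sketch of the whole workflow (the matrix m, its size, the colors, and the file name are placeholders; rotate() is the helper defined above):

m <- matrix(rnorm(1e6), nrow = 1000)   # toy data standing in for your big matrix
png("heatmap_raw.png", width = 2000, height = 2000, res = 300)
par(mar = c(0, 0, 0, 0))               # tip 1: zero margins, the image fills the device
image(rotate(m), col = colorRampPalette(c("blue", "white", "red"))(100),
      axes = FALSE, useRaster = TRUE)  # tip 2: rotate() fixes the orientation
dev.off()                              # place heatmap_raw.png in Illustrator for legends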


Tuesday, February 16, 2016

My 15 practical tips for a bioinformatician


The tips below are based on lessons I learned from making mistakes during my years of research. They are purely personal opinions, and the order doesn't mean anything. If you think I should include something else, please comment below.
  1. Always set a seed when you run tools with a random component, e.g. bedtools shuffle, random, etc. You (or your boss one day) will want your work to be reproducible (see the R snippet after this list). 
  2. Set your own temporary folder (via --tmp, $TMPDIR, etc., depending on your program). By default many tools, e.g. sort, use the system /tmp as the temporary folder, which may have a limited quota that is not enough for your big NGS data. 
  3. Always use valid names for your variables and for the columns and rows of a data frame. Otherwise they can cause unexpected problems, e.g. a '-' in a name will be converted to '.' in R unless you specify check.names=F (also illustrated in the snippet after this list).
  4. Always make a README file for each data folder; for why and how, read this: http://stackoverflow.com/questions/2304863/how-to-write-a-good-readme
  5. Always comment your code properly, for yourself and for others, as you very likely will read your ugly code again.
  6. Always back up your code regularly, using GitHub, SVN, Time Machine, or even simple copy/paste.
  7. Always clean up intermediate or unnecessary data, as you can easily shock your boss (and yourself) with how much data you generate, most of it probably useless. 
  8. Don't save to *.sam if you can use *.bam, and always compress your fastq (and other large plain-text files) as much as you can. The same goes for any other format with a compressed alternative. You cannot imagine how much data (aka "digital garbage") you will soon generate.
  9. Use parallelization as much as you can, e.g. "LC_ALL=C sort --parallel=24 --buffer-size=5G" for sorting (https://www.biostars.org/p/66927/), as multi-core CPUs/GPUs are common nowadays.
  10. When a project is completed, remember to clean up your project folder, incl. removing unnecessary code/data/intermediate files, and burn a CD or host the project somewhere in the cloud. You never know when you, your boss, or your collaborators will need the data again.
  11. Make your code as sustainable as possible. Remember the 3 major features of OOP: inheritance, encapsulation, polymorphism. (URL)
  12. When you learn a tip from others via a Google search, remember to note the URL for future reference and to acknowledge their credit. This applies to this post, of course :)
  13. Keep learning, otherwise you will soon fall behind. Just like NGS techniques, computing skills and tools evolve rapidly; always catch up with the new ones.
  14. When you learn tips from others, remember to share what you've learned with the community as well; that's how the community grows healthily. 
  15. Last but not least, stand up and move around after sitting for 1-2 hours. This is especially important for us bioinformaticians, who usually sit in front of a computer for hours. Only good health can sustain a long career. More reading: https://www.washingtonpost.com/news/wonk/wp/2015/06/02/medical-researchers-have-figured-out-how-much-time-is-okay-to-spend-sitting-each-day/
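To make tips 1 and 3 concrete, here is a tiny R illustration (the column name gene-id and the toy data are made up):

set.seed(1234)   # tip 1: fix the seed so any sampling step is reproducible
sample(10, 3)    # returns the same 3 numbers on every run now

df <- read.table(text = "gene-id\tvalue\nTP53\t1", header = TRUE)
colnames(df)     # "gene.id" -- the '-' was silently converted to '.'
df <- read.table(text = "gene-id\tvalue\nTP53\t1", header = TRUE, check.names = FALSE)
colnames(df)     # "gene-id" -- tip 3: check.names=F keeps the original name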

Thursday, January 21, 2016

Continuous tab in indentation could cause problem

When I tested the following code in the terminal, I got an error at the second echo: instead of moving forward, the shell asked to list all files in the current folder ("Display all 148 possibilities? (y or n)"). I could not spot any error in the code.

for i in 1 2 3 4 5
do
    echo $i
    for j in eRNA promoter exon random
    do
        echo $j
    done
done

I figured it out when I displayed all the whitespace characters, as shown below:

[Screenshot: the same code with whitespace characters made visible]

Do you see the difference?

The "working" code, which I copied from the Linux documentation (http://www.tldp.org/LDP/abs/html/nestedloops.html), uses different characters for the nested indentation ("-> .." means a tab followed by two spaces) than the two consecutive tabs in the "not working" version, which I typed in RStudio. I realized it's caused by the shell's tab-completion feature: pressing tab a second time tells the shell to list all matching files.

To avoid the problem, simply enable the "Insert spaces for tab" option in your editor. I use RStudio; here is where you can find the option:

Tuesday, January 12, 2016

Calculate the odds of winning Powerball in R

This Wednesday's Powerball grand prize has already climbed to $1.5 BILLION. If you choose to cash out, it would be $930 million. And it keeps increasing…
So, what are the odds of winning the jackpot prize?
Here is the game rule according to Powerball.com:

…we draw five white balls out of a drum with 69 balls and one red ball
out of a drum with 26 red balls.

We can calculate the total number of different combinations in R:

> choose(69,5)*26
[1] 292201338

If we are super, super lucky and win the jackpot's $930 million cash value, given that we have to pay 39.6% federal tax, how much do we expect back on a $2 investment? (Of course, everyone expects to win the $1.5 billion jackpot.)

> 930e6*(1-0.396)/(choose(69,5)*26)
[1] 1.92

Actually it's not a good investment. (Thanks for the comment below; I made a mistake earlier: $930 million is 930e6, not 930e9.)
If we want to be 100% guaranteed to win, we have to buy all 292 million combinations. In that case, can we come out ahead?

# this is what we pay
> choose(69,5)*26*2
[1] 584402676

# this is what we earn in total, before tax
> 930000000 + 1000000*choose(25,1) + 50000*choose(5,4)*choose(69-5,1) + 100*choose(5,4)*choose(69-5,1)*choose(25,1) + 100*choose(5,3)*choose(69-5,2) + 7*choose(5,3)*choose(69-5,2)*choose(25,1) + 7*choose(5,2)*choose(69-5,3) + 4*choose(5,1)*choose(69-5,4) + 4*choose(69-5,5)
[1] 1023466048

# Over $1 billion in total prizes! Then we need to pay ~40% tax. It may not apply to the minor prizes, but let's simplify and apply it to everything.
> 1023466048 * (1-0.396)
[1] 618173493

That's still more than the $584 million we paid. So why doesn't everyone do this?
Remember, the prize is shared if multiple people pick the same winning numbers, and we can't know how many will. :D
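As a sanity check, the same arithmetic expressed as expected value per $2 ticket (pre-tax and with the simplified flat 39.6% tax; prize sharing still ignored):

total_prizes <- 1023466048              # total payout if we buy every combination, pre-tax
n_tickets    <- choose(69, 5) * 26      # 292201338 combinations
total_prizes / n_tickets                # ~3.50: expected pre-tax return per $2 ticket
total_prizes * (1 - 0.396) / n_tickets  # ~2.12: still above $2 after the flat tax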

Just some fun! :)


References:
1. Fascinating Math Behind Why You Won’t Win Powerball
(http://www.wired.com/2016/01/the-fascinating-math-behind-why-you-wont-win-powerball/)
2. Tax for lottery (http://classroom.synonym.com/much-federal-taxes-held-lottery-winnings-20644.html)

Wednesday, December 09, 2015

Dopamine is the "Kim Kardashian of molecules" in the brain -- I like this!


The main dopaminergic pathways of the human brain. (Wikipedia)

It's also interesting to learn about the impressive people in the Kardashian family. 

Kim Kardashian's stepfather, Bruce Jenner (now Caitlyn Jenner) 

References:
  1. http://www.slate.com/articles/health_and_science/science/2013/07/what_is_dopamine_love_lust_sex_addiction_gambling_motivation_reward.html
  2. http://www.theguardian.com/science/2013/feb/03/dopamine-the-unsexy-truth
  3. http://neuroskeptic.blogspot.co.uk/2010/05/do-it-like-you-dopamine-it.html

SNP allele coding schema

There are (so far) three different SNP allele coding schemes. No idea why scientists have to make things so confusing sometimes...

Here you go:
  • TOP/BOT from Illumina: Simply put, for unambiguous cases (incl. A/G, A/C, T/G, T/C), A is always allele A (or allele 1) and sits on the TOP strand, while T is also allele A (or allele 1) but on the BOT strand; the other allele is allele B (or allele 2). For ambiguous cases (A/T and C/G), the alleles alone cannot settle allele A/B, so Illumina uses 'sequence walking': the first unambiguous base pair in the flanking sequence determines TOP/BOT and thus allele A/B. Detailed explanation in their tech doc
  • Forward/Reverse for dbSNP: dbSNP uses the orientation determined by mapping the SNP flanking sequence (of the longest submission) to a contig sequence using BLAST. A refSNP can have multiple submissions, each with its own orientation depending on whether the flanking sequence comes from the plus or minus strand. The submission with the longest flanking sequence is taken as the exemplar (instantiated) sequence for the refSNP in the BLAST analysis of the current build. So the strand of a dbSNP refSNP is determined by (1) the sequence of the longest submission (the exemplar) and (2) the strand of the contig it maps to. A good example can be found here. See more details here: http://www.ncbi.nlm.nih.gov/books/NBK44455/#Build.define_the_term__orientation_as_us
  • Plus/Minus strand for genome annotation, HapMap, and the 1000 Genomes Project: The plus (+) strand is defined as the strand with its 5′ end at the tip of the short arm (see Ref #3 below). SNP alleles reported on the (+) strand are called 'plus' alleles and those on the (−) strand 'minus' alleles. This also applies to imputed results from 1000 Genomes, e.g. a0 is the REF (reference) allele and a1 the ALT (alternate) allele
Several notes:
  1. dbSNP does not use TOP/BOT schema: http://www.ncbi.nlm.nih.gov/books/NBK44393/#Submit.does_dbsnp_use_topbot_nomenclatur_1
  2. Minor/major alleles are relative terms; they can change between populations/datasets. So it really makes no sense to assume that allele 2 is the minor allele, that G in A>G is the minor allele, or the like. The minor allele can even be the reference allele. 
  3. In the Illumina genotyping final report (e.g. "FinalReport.txt"), use "Allele1 - Forward" and "Allele2 - Forward" to construct a .lgen file for PLINK. See this post: https://www.biostars.org/p/16982/ -- Forward refers to dbSNP, Plus to 1000 Genomes. See note #6 below. 
  4. SNPs reported on the dbSNP reverse strand need to be flipped before merging with imputed SNPs, using the --flip option in PLINK.
  5. Allele 1 (in the allele 1/2 setting) needs to be forced to the reference allele before merging with imputed SNPs, using the --reference-allele option in PLINK.
  6. Important note: the FinalReport.txt from Illumina's genotyping tool GenomeStudio can also export columns like "Allele1 - Forward", "Allele2 - Forward", "Allele1 - Plus", "Allele2 - Plus". I called Illumina tech support and was told that Forward refers to dbSNP and Plus to the 1000 Genomes (+) strand. Be cautious: Forward and +/- can change between dbSNP versions and genome builds. (Full description is here, login required.) To know which build your alleles refer to, you have to check the manifest file from your base caller. 
References:
  1. https://triticeaetoolbox.org/wheat/snps.php
  2. http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
  3. http://www.sciencedirect.com/science/article/pii/S0168952512000704
  4. https://mathgen.stats.ox.ac.uk/impute/1000GP%20Phase%203%20haplotypes%206%20October%202014.html
  5. http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook

Friday, December 04, 2015

My note on multiple testing

It's not a shame to take notes on something that (probably) everyone knows and that you thought you knew but are actually not 100% sure about. Multiple testing is such a piece in my knowledge map.

Some terms first:
- Type I error (false positive) and Type II error (false negative): 
When we do a hypothesis test, we can categorize the result into the following 2x2 table:
Table of error types:

                                      Null hypothesis (H0) is
                                      Valid/True            Invalid/False
Judgement of H0   Reject              Type I error          Correct inference
                                      (False Positive)      (True Positive)
                  Fail to reject      Correct inference     Type II error
                                      (True Negative)       (False Negative)
A Type I error means rejecting a true thing. If the true thing is a null hypothesis (H0), which is what people usually assume (e.g. no difference, no effect), and you reject it (i.e. claim there is a difference), that's a false positive. The same logic gives the Type II error, or false negative.

Also note that people use the Greek letter α for the Type I error rate and β for the Type II error rate. α is also the significance level of a test, e.g. 5%. So when a single test reaches p-value 0.05, we can intuitively say that with 5% chance we are making a mistake, i.e. 5% of the cases we call significant are actually not. β relates to the power of a test: power = the ability to detect true positives among all real positive cases = 1 − β.

- Sensitivity and Specificity
                                      Null hypothesis (H0) is
Total tests (m)                       Valid/True      Invalid/False
Judgement of H0   Reject (R)          V               S
                  Fail to reject      U               T

Sensitivity = S / (S+T) = power = 1-β
Specificity = U / (U+V) = 1-α

- Why does multiple testing matter?
It matters because we usually perform the same hypothesis test not just once but many, many times. If the chance of making an error in a single test is α, then the chance of making one or more errors in m independent tests is
Pr(at least one error) = 1 − (1 − α)^m
So when m is large, this chance approaches 100%. That's why we need to adjust the p-values for the number of hypothesis tests performed, i.e. control the Type I error rate.
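A quick numeric illustration of the formula in R:

alpha <- 0.05
m <- c(1, 10, 100, 1000)
1 - (1 - alpha)^m  # 0.05, 0.40, 0.994, ~1: with 100 tests an error is almost certain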

- How to control the Type I error rate in multiple testing?
There are many ways to control Type I errors, such as:
Per-comparison error rate (PCER): the expected number of Type I errors divided by the number of hypotheses, PCER = E(V)/m
Per-family error rate (PFER): the expected number of Type I errors, PFER = E(V)
Family-wise error rate (FWER): the probability of at least one Type I error, FWER = P(V ≥ 1)
False discovery rate (FDR): the expected proportion of Type I errors among the rejected hypotheses, FDR = E(V/R | R > 0)P(R > 0)
Positive false discovery rate (pFDR): the rate at which discoveries are false, pFDR = E(V/R | R > 0)

- Controlling Family-Wise Error Rate
Many procedures have been developed to control the family-wise error rate P(V ≥ 1), including Bonferroni, Holm (1979), Hochberg (1988), and Šidák. They come in two types: single-step (e.g. Bonferroni) and sequential adjustment (e.g. Holm or Hochberg). The Bonferroni correction controls the overall Type I error when all tests are independent: it rejects any hypothesis with p-value ≤ α/m, so to get adjusted p-values you simply multiply each nominal p-value by m. In R it's the following function:
p.adjust(p, method = "bonferroni")
Sequential corrections are slightly more powerful than the Bonferroni test. The Holm step-down procedure is the easiest to understand. Say you have a thousand p-values: sort them from low to high, and multiply the smallest by one thousand. If that adjusted p-value is less than 0.05, the gene shows evidence of differential expression; so far this is no different from Bonferroni. Then multiply the 2nd smallest p-value by 999 (not one thousand) and check whether it is less than 0.05; multiply the third smallest by 998, the fourth smallest by 997, and so on, comparing each adjusted p-value to 0.05. Finally, ensure that each adjusted p-value is at least as large as every preceding one; if it is not, set it equal to the largest of the preceding adjusted p-values. That is the Holm step-down procedure. In R it's:
p.adjust(p, method = "holm")
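A toy comparison of the two corrections on made-up (already sorted) p-values; the manual Holm line reproduces p.adjust's result:

p <- c(0.0001, 0.0004, 0.019, 0.095, 0.201)       # sorted ascending
m <- length(p)
cbind(bonferroni  = pmin(1, p * m),               # single-step: multiply every p by m
      holm_manual = pmin(1, cummax(p * (m:1))),   # step-down: multiply by m, m-1, ...
      holm        = p.adjust(p, method = "holm")) # R's built-in gives the same numbers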

- Controlling FDR
FWER control is appropriate when you want to guard against ANY false positives. However, in many cases (particularly in genomics) we can live with a certain number of false positives, and the more relevant quantity to control is the false discovery rate (FDR): the proportion of false positives (V) among the set of rejected hypotheses (R). FDR control has generated a lot of interest because it offers a more balanced trade-off between error-rate control and power than the traditional family-wise error rate control.

Procedures controlling FDR include Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001), Benjamini & Hochberg (2000) and two-stage Benjamini & Hochberg (2006).

Here are the steps of the Benjamini & Hochberg FDR procedure (see the R sketch below):
1. sort the nominal p-values from small to large: p1 ≤ p2 ≤ … ≤ pm
2. find the highest rank j with pj ≤ (j/m) × δ, where δ is the FDR level being controlled
3. declare the tests of rank 1, 2, …, j significant; their adjusted p-values are pj × m/j
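And the same steps in R, checked against p.adjust (same toy p-values as above; note that p.adjust additionally enforces monotonicity of the adjusted values):

p <- c(0.0001, 0.0004, 0.019, 0.095, 0.201)     # already sorted ascending
m <- length(p); delta <- 0.05
j <- max(which(p <= (seq_len(m) / m) * delta))  # step 2: highest rank passing threshold
j                                               # tests of rank 1..j are significant
cbind(naive = p * m / seq_len(m),               # step 3, rank by rank
      BH    = p.adjust(p, method = "BH"))       # built-in BH adjustment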

Reference:
http://www.r-bloggers.com/adjustment-for-multiple-comparison-tests-with-r-resources-on-the-web/
http://www.gs.washington.edu/academics/courses/akey/56008/lecture/lecture10.pdf
http://www.stat.berkeley.edu/~mgoldman/Section0402.pdf

Wednesday, November 18, 2015

DNA and RNA structure

Let's review the basics we learned in molecular biology! :) Actually I wonder whether someone could design a toy for kids to learn DNA structure, something like a puzzle where each unit is a magnetic piece and the pieces can only be assembled/aligned in certain directions. That would be cool!

Reference: http://passel.unl.edu/pages/informationmodule.php?idinformationmodule=957882007&topicorder=4&maxto=7

Each strand of the DNA double helix is a chain of bases linked to a sugar-phosphate backbone, and the two strands are connected by hydrogen bonds, as shown below.
The phosphate group links the 5th carbon of an upstream sugar to the 3rd carbon of a downstream sugar. The 5-carbon sugar is called deoxyribose in DNA and ribose in RNA; the two can be distinguished by the absence or presence of an oxygen atom at the 2' carbon, highlighted in red in the following figure. 

RNA structure is very similar to that of single-stranded DNA, except that:
  • The sugar in RNA is a ribose sugar (as opposed to deoxy-ribose) and has an –OH at the 2' C position highlighted in red in the figure below (DNA sugars have –H at that position)
  • Thymine in DNA is replaced by Uracil in RNA. T has a methyl (-CH3) group instead of the H atom shown in red in U.
Reference: http://tigger.uic.edu/classes/phys/phys461/phys450/ANJUM04/

Now let's see the cap structure of RNA:

A cap is most commonly added to the 5' end of RNA, known as mRNA capping. The 5' cap, the brown part on the left of the following figure, is a guanosine (G) nucleotide with a methyl group added at position 7 of the guanine (7-methylguanosine). Its 5' end is connected to the 5' end of the RNA via a triphosphate bridge. 

The following figure also contains a 2' cap, which occurs only rarely, in some eukaryotes and viral genomes.

Note that mitochondrial mRNAs don't have a cap.
Reference: https://en.wikipedia.org/wiki/Five-prime_cap

Friday, October 23, 2015

Steps to push code to a github private repository

  • If you are a first-time user:
Step 1: Create a new repository on github.com (assuming you already have a GitHub account), private or public. In the example below, the repository is named PREDICT-HD.

Step 2: [optional] Generate an SSH public key and add it to your GitHub account.
Follow this help page exactly: https://help.github.com/articles/generating-ssh-keys/

Step 3: Push the code to GitHub from your local computer:
cd ~/project/PREDICT-HD/src
git init
git add *

# set your identity first if this is your first commit on this machine
git config --global user.name "Your Name"
git config --global user.email you@example.com

git commit -m "my first commit"
git remote add origin git@github.com:sterding/PREDICT-HD.git
git push origin master
  • If a collaborator has already pushed a commit before you push yours:
git add xxx
git commit -m "your message"
git pull
git push origin master

Monday, October 05, 2015

A parallel and fast way to download multiple files

We can write a short loop to download multiple files from the command line, e.g.

for i in X Y Z; do wget http://www.site.com/folder/$i.url; done

If we want the downloads to run in the background (pseudo-parallel), we can use the -b option of wget.

But this is still not fast enough, and parallelizing with wget -b won't give any notice once a download is done.

Here is my solution: axel + parallel

parallel -a urls.file axel

Say I want to download the bigwig files of the H3K4me1 mark for all brain samples from the Roadmap Epigenomics project. Here is the code:

mark=H3K4me1
> url.$mark # to generate an empty file
for i in E071 E074 E068 E069 E072 E067 E073 E070 E082 E081;
do 
  echo http://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/pval/$i-$mark.pval.signal.bigwig >> url.$mark
done
parallel -a url.$mark axel -n 5 

Regarding what axel is and how fast it is compared with wget, please refer to this link: http://www.cyberciti.biz/tips/download-accelerator-for-linux-command-line-tools.html

Friday, September 11, 2015

PLINK2 vs. SNAP for linkage disequilibrium (LD) calculation

Among the various ways to find SNPs in linkage disequilibrium (LD), two methods have been used in much of the literature: PLINK2 and SNAP.

With PLINK2, SNPs in LD can be computed with parameters like:
plink --r2 dprime --ld-window-kb 1000 --ld-window 10 --ld-window-r2 0.8
SNAP is a web-based service with pre-calculated data: https://www.broadinstitute.org/mpg/snap. You can set the population, the distance limit, and the r2 threshold.

Note that PLINK2 has one more control: --ld-window

Its manual (https://www.cog-genomics.org/plink2/ld) says: "By default, when a limited window report is requested, every pair of variants with at least (10-1) variants between them, or more than 1000 kilobases apart, is ignored."

That means PLINK2 limits SNP pairs by two kinds of radius, the genomic distance (--ld-window-kb) and the number of SNPs apart (--ld-window), while SNAP only has the distance limit. Both have an r2 threshold.

Sunday, August 16, 2015

Using ANOVA to get correlation between categorical and continuous variables

How to calculate the correlation between categorical variables and continuous variables?

This is the question I faced when attempting to check the correlation of PEER-inferred factors vs. known covariates (e.g. batch).

One solution I found: use ANOVA to calculate the R-squared between a categorical input and a continuous output.

Here is my R code snippet:

## correlation of inferred factors vs. known factors
# name PEER factors
colnames(factors)=paste0("PEER_top_factor_",1:bestK)
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))

library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y~x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y~x))$`Pr(>F)`[1]
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

require(pheatmap);
pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")

The core part is summary.lm(aov(y~x))$r.squared: we can use the aov() function in R to run ANOVA, and its result can be summarized with summary.lm(), which shows output like:

> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

R^2 and p-value are shown at the end of output.

Note: the summary.lm() object doesn't contain the p-value directly, but we can compute it from the F statistic (using the upper tail of the F distribution):

> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1],F[2],F[3],lower.tail=FALSE)  # upper tail of the F distribution = p-value

The table below is a nice summary of which method applies to which data type.

                                           OUTCOME VARIABLE
PREDICTOR VARIABLE(S)                      Categorical                 Continuous
Categorical                                Chi Square, Log linear,     t-test, ANOVA (Analysis of Variance),
                                           Logistic regression         Linear regression
Continuous                                 Logistic regression         Linear regression, Pearson correlation
Mixture of Categorical and Continuous      Logistic regression         Linear regression, Analysis of Covariance

Ref: http://www.tulane.edu/~panda2/Analysis2/sidebar/stats.htm

The next thing I need to refresh my mind on is how calculating correlation with cor() differs from the ANOVA method above.

We know the correlation coefficient r can be recovered from the regression coefficient and the standard deviations of the two variables. Say we know sd(x) and sd(y); regressing y ~ x gives a line y = b0 + b1*x, and then

r = b1 * SDx / SDy

When x and y are both standardized (mean 0, sd 1), r = b1.
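A quick check of that identity on simulated data:

set.seed(1)
x <- rnorm(100); y <- 2 * x + rnorm(100)
b1 <- unname(coef(lm(y ~ x))[2])                     # regression slope
c(cor = cor(x, y), from_slope = b1 * sd(x) / sd(y))  # the two numbers agree exactly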
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html

Wednesday, August 12, 2015

use getline to capture system command output in awk

I just learned this today: awk also has its own pipe (|) and getline, just like the Unix shell. If I want to call a system command in awk and capture its output (note: system() won't work, as it only returns the exit status), I can pipe the command's output into getline. 

For example,

$cat > test.txt
aa bb cc
11 22 33
44 55 cc

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd | getline a; print a}'
55

or 

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; system(cmd);}'
55

Note that using cmd alone won't print anything, because a bare string expression doesn't execute the command or send anything to the console (unlike system()). 

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | head -n1 | cut -f2 -d\" \""; cmd;}'

If you want to output multiple lines and process them in awk, you can do:

$awk 'BEGIN{cmd="grep cc test.txt | sort -k1,1 | cut -f2 -d\" \""; while( (cmd | getline a) >0) print a;}'
55
bb

Reference: http://stackoverflow.com/questions/1960895/awk-assigning-system-commands-output-to-variable

Wednesday, August 05, 2015

Dopamine, L-DOPA, catecholamines, TH, Nuur1 etc.

Catecholamines (儿茶酚胺) is the collective name for three neurotransmitters: epinephrine (adrenaline), norepinephrine (noradrenaline), and dopamine.

L-DOPA is their precursor. The corresponding drug is called levodopa.

L-DOPA is produced from the amino acid L-tyrosine by the enzyme tyrosine hydroxylase (TH).

Tyrosine cannot pass the BBB (blood brain barrier), but L-DOPA can.

L-DOPA can be converted to dopamine, which can be further converted to norepinephrine (noradrenaline) and then epinephrine (adrenaline).

Dopaminergic neurons can be divided into 11 cell groups according to where they are located, as identified by a histochemical fluorescence method: A8–A16, Aaq, and the telencephalic group. Among them, A9 is the group the substantia nigra pars compacta (SNpc) refers to.

The number of TH-positive neurons in the SN declines with age, while the α-synuclein level increases with age. This inverse relationship is also seen in the surviving DA neurons of PD patients.

It has also been shown that the down-regulation of TH is linked to reduced expression of the transcription factor Nurr1 (an orphan nuclear receptor encoded by the gene NR4A2).

Nurr1 is shown to regulate a category of nuclear-encoded mitochondrial genes. (http://www.ncbi.nlm.nih.gov/pubmed/23341612)

Tuesday, July 28, 2015

Note from GEUVADIS paper

Notes from the GEUVADIS papers:

Lappalainen et al. Nature 2013 : Transcriptome and genome sequencing uncovers functional variation in humans, http://dx.doi.org/10.1038/nature12531
 
‘t Hoen et al. Nature Biotechnology 2013: Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories, http://dx.doi.org/10.1038/nbt.2702

1. How to detect sample outliers?
a. before alignment: distance of k-mer profile
b. after alignment: Spearman rank correlation between samples --> D-statistics (i.e.  the median correlation of one sample against all the other samples)
c. gender mismatch: XIST vs. chrY
d. ASE bias rate among heterozygous sites



2. eQTL
a. exon/gene quantification
b. filter out lowly expressed ones (e.g. 0 in >50% samples)
c. for each group, normalize with PEER, adding mean
    c1. use subset (??e.g. chr20, or chr20- 22??) using K=????0,1,,3,,57,10,13,15,20 for each dataset
    c2. run eQTL and number of genes (eGenes) for each K.
    c3. get the optimal K = K(with most number of eQTL genes)
    c4. run PEER on 20,000 exons to get covariates for the final normalization
    c5. final PEER normalization using all dataset, residual + mean as final quantification
d. transform the final quantification to standard normal distribution (by ?)
e. eQTL using Matrix-eQTL: linear regression of quantification ~ genotypes + genotype_covariates
3. Differential expression analysis
a. TMM normalization (from edgeR)
b. filter: genes with more than 5 counts per million in at least 1 sample were analyzed in pairwise population comparisons
c. tweeDEseq (good for large samples), significance: FDR < 0.05 and log2 fold change greater than 2

Wednesday, June 24, 2015

median filter in AWK

Here is what a median filter does, from Wikipedia:
To demonstrate, using a window size of three with one entry immediately preceding and following each entry, a median filter will be applied to the following simple 1D signal:
x = [2 80 6 3]
So, the median filtered output signal y will be:
y[1] = Median[2 2 80] = 2
y[2] = Median[2 80 6] = Median[2 6 80] = 6
y[3] = Median[80 6 3] = Median[3 6 80] = 6
y[4] = Median[6 3 3] = Median[3 3 6] = 3
i.e. y = [2 6 6 3].
Note that, in the example above, because there is no entry preceding the first value, the first value is repeated, as with the last value, to obtain enough entries to fill the window. This is one way of handling missing window entries at the boundaries of the signal.
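As a quick cross-check, R's built-in runmed() computes the same running median (its boundary handling follows its own end rule, which may differ from the repeat-the-edge rule above):

x <- c(2, 80, 6, 3)
runmed(x, k = 3)  # inner values: 6 and 6, matching the worked example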
Here is my awk code implementing this:
#!/bin/awk -f
# awk script to filter noise by sliding a window and taking mean or median per window
# Author: Xianjun Dong
# Date: 2015-06-23
# Usage: _filter.awk values.txt
# bigWigSummary input.bigwig chr10 101130293 101131543 104 | _filter.awk -vW=5 -vtype=median
BEGIN{
  if(W=="") W=5;
  if(type=="") type="median";
  half=(W-1)/2;
}

{
  for(i=1;i<=NF;i++) {
    array[half+1]=$i
    for(k=1;k<=half;k++){
      array[half+1-k]=(i<=k)?$1:$(i-k);
      array[half+1+k]=((i+k)>NF)?$NF:$(i+k);
    }
    if(type=="median") {asort(array); x=array[half+1];}
    if(type=="mean") {x=0; for(j=1;j<=W;j++) x+=array[j]; x=x/W;}
    printf("%s\t", x);
  }
  print "";
}

Friday, May 22, 2015

Be cautious of using grep --colour=always

I noticed that grep --colour=always embeds extra escape codes for coloring (e.g. ^[[01;31m^[[K ... ^[[m^[[K) in your text, which can confuse downstream steps. Here is one such example:
https://stackoverflow.com/questions/18315380/invisible-0131-number-in-bash

$ echo 1 2 3 | grep 2
1 2 3
$ echo 1 2 3 | grep 2 | cat -v
1 2 3
$ echo 1 2 3 | grep --colour=always 2 
1 2 3
$ echo 1 2 3 | grep --colour=always 2 | cat -v
1 ^[[01;31m^[[K2^[[m^[[K 3

See my answer for more detail.

The solution is to use --colour=auto, or not to use the --colour option at all.


$ echo 1 2 3 | grep --colour=auto 2 
1 2 3
$ echo 1 2 3 | grep --colour=auto 2 | cat -v
1 2 3

Check your .bashrc to see if you set alias grep='grep --colour=always' like I did.

Thursday, May 21, 2015

Which version should I install? 32 bit or 64 bit? Ubuntu or CentOS?

I posted two relevant articles on this topic before:

Unix vs. Linux
 (http://onetipperday.blogspot.com/2014/07/unix-vs-linux.html)
Which version of tools I should install? i386 vs. x86_64, and 32bit vs. 64bit kernel (http://onetipperday.blogspot.com/2013/02/software-vs-hardware-i386-vs-x8664-and.html)

However, this can still be confusing sometimes... for example, when you visit the SRA toolkit download page: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

OK. Here is the short answer.

1. How to tell if my CPU architecture is 32-bit or 64-bit?

$ uname -a

If the output contains i386 or i586, it's 32-bit. If it contains x86_64, it's 64-bit.

Ref: http://www.cyberciti.biz/faq/linux-how-to-find-if-processor-is-64-bit-or-not/

2. How to tell if my OS (operating system) kernel is 32-bit or 64-bit?

Note that the OS kernel can differ from the hardware architecture above. Usually the OS version follows the hardware, but a 32-bit kernel can run on a 64-bit machine (not the other way around), so the two might not be consistent. The MacBook Pro I am using runs a 32-bit kernel on a 64-bit processor (see my previous post). 

3. How to tell which OS I am using? CentOS or Ubuntu?

In my case, I used:
$ ls -d /etc/[A-Za-z]*[_-][rv]e[lr]* | grep -v "lsb" | cut -d'/' -f3 | cut -d'-' -f1 | cut -d'_' -f1

it returns redhat; CentOS is a Red Hat derivative, so it carries the /etc/redhat-release file.

## For more detail, please refer to terdon's answer at http://askubuntu.com/a/459425

Friday, May 15, 2015

How to correctly set color in the image() function?

Sometimes we want to make our own heatmap using the image() function. I recently found it tricky to set the color option there, as the manual has very little information on col:

I posted my question on Biostars. The short answer is: unless breaks is set, the range of z is evenly cut into N intervals (where N = the number of colors), and each value in z is assigned the color of its interval.

For example, when x=c(3,1,2,1) and col=c("blue","red",'green','yellow'), the minimum of x gets the first color and the maximum the last color; any value in between is mapped proportionally to a color. Here the range [1,3] is cut into four intervals, and since intervals are closed on the right and open on the left, the value 2 falls in the second interval and is assigned "red". That's why we see the colors yellow --> blue --> red --> blue.

x=c(3,1,2,1)

image(1,1:length(x), matrix(x, nrow=1, ncol=length(x)), col=c("blue","red",'green','yellow'))

In practice, unless we want to manually define the color break points, we just set the colors from first to last, and image() automatically assigns the values in z to them.

collist<-c(0,1)
image(1:ncol(x),1:nrow(x), as.matrix(t(x)), col=collist, asp=1)

If we want to manually define the color break points, we need to

x=matrix(rnorm(100),nrow=10)*100
xmin=0; xmax=100;
x[x<xmin]=xmin; x[x>xmax]=xmax;  # clip values into [xmin, xmax]
collist<-c("#053061","#2166AC","#4393C3","#92C5DE","#D1E5F0","#F7F7F7","#FDDBC7","#F4A582","#D6604D","#B2182B","#67001F")
ColorRamp<-colorRampPalette(collist)(10000)  # 10000-step ramp spanning [xmin, xmax]
ColorLevels<-seq(from=xmin, to=xmax, length=10000)
# take only the part of the ramp covered by the data, so colors stay anchored to [xmin, xmax]
ColorRamp_ex <- ColorRamp[round(1+(min(x)-xmin)*10000/(xmax-xmin)) : round( (max(x)-xmin)*10000/(xmax-xmin) )]
par(mar=c(2,0,2,0), oma=c(3,3,3,3))
layout(matrix(seq(2),nrow=2,ncol=1),widths=c(1),heights=c(3,0.5))  # heatmap on top, color bar below
image(t(as.matrix(x)), col=ColorRamp_ex, las=1, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
image(as.matrix(ColorLevels),col=ColorRamp, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")  # the color bar
axis(1,seq(xmin,xmax,10),seq(xmin,xmax,10))