Friday, May 08, 2015

Tips and Tools you may need for working on BIG data

Nowadays everyone is talking about big data. As a genomic scientist, I often find myself hungry for a collection of tools better suited to the medium-to-big data we deal with every day.

Here are some tips I have found useful when getting, processing, or visualizing large data sets:

1. How to download data faster than wget?

We can use wget to download data to a local disk. If the file is large, we can use a faster alternative such as axel or aria2, both of which can split a download across multiple connections.
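As a rough sketch of how this might look in practice, the snippet below picks whichever downloader is installed and prints the command it would run. The URL is a made-up placeholder and the choice of 4 connections is arbitrary; pick_downloader is just an illustrative helper name, not part of any tool.

```shell
# Pick the first available multi-connection downloader, falling back to wget.
# pick_downloader is a hypothetical helper written for this example.
pick_downloader() {
  if command -v aria2c >/dev/null 2>&1; then
    # -x: max connections per server, -s: number of pieces to split into
    echo "aria2c -x 4 -s 4"
  elif command -v axel >/dev/null 2>&1; then
    # -n: number of connections
    echo "axel -n 4"
  else
    # -c: resume a partially downloaded file
    echo "wget -c"
  fi
}

# Placeholder URL, for illustration only.
URL="https://example.org/data/big_file.bam"

# Print the command instead of running it, so nothing is downloaded here.
echo "$(pick_downloader) $URL"
```

Drop the echo on the last line (and put in a real URL) to actually start the download.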

2. Process the data in parallel with hidden options in GNU commands

  • If you have many files to process and they are independent, you can process them in parallel. GNU has a command called parallel. Pierre Lindenbaum wrote a nice notebook, "GNU Parallel in Bioinformatics", which is worth reading.
  • Many commonly used commands also have a hidden option to run in parallel. For example, the GNU sort command has --parallel=N to use multiple cores.
  • You can set -F when running grep -f on a large seed file, so that the patterns are treated as fixed strings rather than regular expressions. People also suggest setting export LC_ALL=C to get about a 2x speedup.
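Here is a small sketch of the sort and grep options above on a toy file. It assumes GNU grep and GNU sort (the --parallel option is not available in every sort implementation); the file names and gene IDs are invented for the example.

```shell
# Build a small, throwaway example in a temporary directory.
tmp=$(mktemp -d)
printf 'geneB\t5\ngeneA\t3\ngeneC\t9\ngeneA\t1\n' > "$tmp/data.txt"
printf 'geneA\ngeneC\n' > "$tmp/seeds.txt"

# -F treats every line of seeds.txt as a fixed string, not a regex.
# LC_ALL=C makes grep and sort compare raw bytes instead of locale-aware text.
matches=$(LC_ALL=C grep -F -f "$tmp/seeds.txt" "$tmp/data.txt")

# --parallel=2 lets GNU sort use up to two threads.
# Sort by gene name, then numerically by the second column.
sorted=$(printf '%s\n' "$matches" | LC_ALL=C sort --parallel=2 -k1,1 -k2,2n)
printf '%s\n' "$sorted"

# Clean up the temporary files.
rm -rf "$tmp"
```

On a file this small the options make no measurable difference; the point is only to show where they go on the command line.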

3. In R, there are several must-have tips for large data, e.g. data.table
  • If using read.table(), set stringsAsFactors = FALSE and colClasses. See the example here
  • Use fread(), not read.table(). Some more details here. So far, fread() does not support reading *.gz files directly; use fread('zcat file.gz') instead.
  • Use data.table rather than data.frame. Learn the difference online here.
  • There is a nice View on how to process data in parallel in R, but I have not tried any of it in practice. Hopefully there will be some easy tutorials there, or I will procrastinate less about learning some of them... At least I can start with foreach().
4. How to open a scatter plot with too many points in Illustrator?

This is a real problem for me, as we usually have figures with >30k dots (i.e. each dot is a gene). Even though the dots overlap heavily, opening such a figure in Illustrator is extremely slow. Here is a tip:
From that, a better idea is probably to "compress" the data before plotting it, for example by merging points that overlap by more than some percentage.
or this one:
or this one:
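
One simple way to try the "compress before plotting" idea is to snap points to a grid and keep only the first point that lands in each cell. This is only a sketch of that one approach, not the method from the tips above: thin_points is a hypothetical helper, the input is assumed to be one "x y" pair per line, and the cell size has to be tuned to the dot size in the final figure.

```shell
# Keep one point per grid cell. $1 is the cell size, in the same units as the data.
# thin_points is a hypothetical helper written for this example.
thin_points() {
  awk -v cell="$1" '
    # floor() so that negative coordinates are binned consistently
    function floor(v,   i) { i = int(v); return (v < i) ? i - 1 : i }
    {
      key = floor($1 / cell) "," floor($2 / cell)
      # Print a point only the first time its cell is seen.
      if (!(key in seen)) { seen[key] = 1; print }
    }'
}

# Toy input: three nearly overlapping points and one distant point.
printf '1.0 1.0\n1.2 1.1\n9.0 9.5\n1.1 1.3\n' | thin_points 0.5
```

The thinned table can then be plotted as usual, which should leave far fewer objects for Illustrator to load.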

Still working on the post...


  1. don't forget

  2. Cool tips here. Thanks!

    By the way can you write up and share some tips on time management as a bioinformatician? Some jobs just take too long to run and I am hoping to better make use of my time. Would love to hear some tips from you. Thanks!