Thursday, November 29, 2012

measuring expression of smallRNA genes

As the early note mentioned, it's not a good idea to use bias-correction based method like cufflinks to calculate the RPKM/FPKM values for small RNA (simply because it's not designed for that purpose).

Since the smallRNA is generally shorter, the usual RNA-seq protocol does not capture them properly (only the 'tail' of the distribution of fragment length). A good solution is to use smallRNA RNA-seq protocol. I meant using short reads length and without size selection.

Based on that, we may be able to use tools like cufflinks/DESeq to calculate the RPKM/FPKM. But again, I am not sure cufflinks will still apply the bias-correction to the small RNA genes when the reads length is short enough. As this post discussed, "The Best way to normalize miRNA HTS data", methods can be:

- RPM ( Read Per Million )
- RPKM (Reads per kilobase per million mapped) : Mortazi et al, Nat. Methods, 2008 (
- Trimmed mean of M-values : Robinson, Oshlack, Genome Biology 2010 (
- upper-quantile :

Simon (one of the authors of DESeq) suggested to use TMM or DESeq (which are reads count-based methods) for differential expression analysis, at least for small RNA genes.

As the DEseq document described, to get reads count you need alignment (SAM/BAM) and the annotation file as input for some tools, like
  • the function summarizeOverlaps in the BioConductor GenomicRanges package,
  • the geneCounts function in the Bioconductor easyRNASeq package,
  • the htseq-count script distributed with the HTSeq Python package (Note:  htseq-count produces one count file for each sample. DESeq offers the function newCountDataSetFromHTSeqCount to get an analysis started from these files)  
  • or the bedtools multicov -count -bams sample1.bam sample2.bam sample3.bam -bed genes.bed
Note the naming issue in Refseq or UCSC that same locus has been assigned with multiple gene names (e.g. SMN1/2), and htseq-count requires unique gene_id in the annotation. You might want to make sure of this before using htseq-count. One solution to this is to use Ensembl GTF.

Also, be aware of the difference between bedtools multicov and htseq-count, esp. when you use multi-mappers. Read details here:

No comments:

Post a Comment