Wednesday, April 18, 2012

sorting BAM makes the file smaller

I'm not the only one who has noticed this. See here: http://seqanswers.com/forums/archive/index.php/t-13652.html

As Heng said, "BAM is compressed. Sorting helps to give a better compression ratio because similar sequences are grouped together."... So the smaller size does not come from removing the unmapped reads (sorting only moves them to the end).

The tip: always sort the output BAM after converting from SAM, e.g.

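# samtools 0.1.x syntax: the last argument to sort is an output prefix, ".bam" is appended
samtools view -Sbu in.sam | samtools sort - in.sorted
mv in.sorted.bam in.bam
# samtools 1.x equivalent: samtools sort -o in.bam in.sam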

Sorted BAM is smaller and, once indexed with samtools index, can be searched quickly by region.
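
Heng's explanation can be checked with a toy experiment that needs no sequencing data. The sketch below is only an illustration under loud assumptions: plain-text "reads" drawn from a random genome stand in for BAM records, gzip stands in for BAM's BGZF compression, and the genome size, read count and file names are all made up for the demo.

```shell
#!/bin/sh
# Toy illustration only: text "reads" + gzip stand in for BAM records + BGZF.
tmp=$(mktemp -d)

# Build a random 100 kb "genome", then draw 10,000 reads of 100 bp
# from random positions. Output: position <TAB> sequence, unsorted.
awk 'BEGIN {
    srand(42)
    split("A C G T", base, " ")
    genome = ""
    for (i = 0; i < 1000; i++) {
        chunk = ""
        for (j = 0; j < 100; j++) chunk = chunk base[int(rand() * 4) + 1]
        genome = genome chunk
    }
    for (r = 0; r < 10000; r++) {
        pos = int(rand() * (100000 - 100)) + 1
        print pos "\t" substr(genome, pos, 100)
    }
}' > "$tmp/unsorted.txt"

# "Coordinate sort": order the reads by their numeric start position.
sort -n -k1,1 "$tmp/unsorted.txt" > "$tmp/sorted.txt"

# Compress both orderings and compare the byte counts.
unsorted_size=$(gzip -c "$tmp/unsorted.txt" | wc -c | tr -d ' ')
sorted_size=$(gzip -c "$tmp/sorted.txt" | wc -c | tr -d ' ')
echo "unsorted: $unsorted_size bytes, sorted: $sorted_size bytes"

rm -rf "$tmp"
```

Both files hold exactly the same reads; only the order differs. After sorting, neighbouring reads overlap, so the compressor keeps finding long repeats close by, which is the effect Heng describes.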

---------------------------
Other tips are:

1. SAM->BAM conversion does not require a sorted header, or any header at all.
If there is a header: samtools view -Sb in.sam > in.bam
If there is no header: samtools view -Sbt genome.fa.fai in.sam > in.bam
But the records in in.bam will not follow the order of the sequence dictionary (genome.fa.fai) unless you sort the file with samtools sort.

2. When given BAM input, cufflinks requires a proper header in the BAM file, especially the line

@HD VN:1.0 SO:coordinate

Without that line, cufflinks cannot tell that your BAM (or SAM) is sorted, even if it is; the @HD line is the only way to pass it that information. So I guess

@HD VN:1.0 SO:unsorted 

won't work. 
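
Both tips come down to looking at header lines before running anything, so here is a small sketch of how that check might be scripted. The two function names are hypothetical helpers of my own, not part of samtools or cufflinks, and they only inspect SAM text on stdin.

```shell
#!/bin/sh
# Hypothetical helpers (not part of samtools or cufflinks).

# Succeeds if the SAM text on stdin starts with a header line ("@...").
sam_has_header() {
    head -n 1 | grep -q '^@'
}

# Succeeds if the header on stdin has an @HD line declaring SO:coordinate.
is_coordinate_sorted() {
    awk '/^@HD/ && /SO:coordinate/ { found = 1 } END { exit !found }'
}

# Small self-contained demo on an inline header.
if printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' | is_coordinate_sorted; then
    echo "header declares coordinate order"
fi

# Usage on real files, assuming samtools is installed:
#   samtools view -H in.bam | is_coordinate_sorted || echo "no SO:coordinate in @HD"
#   sam_has_header < in.sam || echo "no header: use samtools view -t genome.fa.fai"
```

If the check fails on a BAM you know is sorted, samtools reheader can swap in a corrected header without rewriting the alignments.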
