Wednesday, April 18, 2012

sorting BAM makes the file smaller

Yes, I'm not the only one who has noticed this. See here:

As Heng said, "BAM is compressed. Sorting helps to give a better compression ratio because similar sequences are grouped together." So the smaller size is not due to unmapped reads being removed; they are kept and simply put at the end of the sorted file.
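
Heng's explanation can be checked with a small experiment. The sketch below is a toy model, not BAM itself: it samples overlapping reads from a random reference and compresses the same reads in arrival order and in position-sorted order with zlib (BAM's BGZF blocks use the same DEFLATE algorithm). The genome size, read length and read count are made-up values chosen only for illustration.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical toy data: a random 200 kb "reference" and 20,000 reads of 50 bp.
GENOME_LEN, READ_LEN, N_READS = 200_000, 50, 20_000
genome = "".join(random.choice("ACGT") for _ in range(GENOME_LEN))

# Each read is (start position, sequence), sampled uniformly along the reference,
# so the list is in random order, much like raw aligner output.
starts = [random.randrange(GENOME_LEN - READ_LEN) for _ in range(N_READS)]
reads = [(s, genome[s:s + READ_LEN]) for s in starts]


def compressed_size(records):
    """Return the DEFLATE-compressed size in bytes of the sequences, in the given order."""
    text = "\n".join(seq for _, seq in records).encode("ascii")
    return len(zlib.compress(text, 6))


by_position = sorted(reads)  # tuples sort by start position first

unsorted_size = compressed_size(reads)
sorted_size = compressed_size(by_position)
print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

After sorting, neighbouring reads overlap on the reference, so the compressor finds long repeated substrings within its window. In random order the overlapping reads are usually too far apart to be matched.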

The tip is: always sort the output BAM after converting from SAM, e.g. (samtools 0.1.x syntax, where the last argument to sort is an output prefix)

samtools view -Sbu in.sam | samtools sort - in.sorted
mv in.sorted.bam in.bam

Sorted BAM is smaller and better for searching, since a coordinate-sorted file can be indexed with samtools index.
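
The searching benefit can be illustrated with a toy example. Once records are ordered by coordinate, a region lookup can jump to the right place instead of scanning every read. This is only a sketch of the idea, using binary search over a hypothetical list of start positions on one chromosome; it is not how the BAM index is actually laid out.

```python
from bisect import bisect_left, bisect_right

# Hypothetical alignment start positions on one chromosome, already coordinate-sorted.
sorted_starts = [105, 230, 231, 480, 912, 1500, 1501, 2750, 4100]


def starts_in_region(starts, begin, end):
    """Return start positions p with begin <= p <= end, using binary search.

    Only valid if `starts` is sorted; unsorted input would need a full scan.
    """
    lo = bisect_left(starts, begin)
    hi = bisect_right(starts, end)
    return starts[lo:hi]


print(starts_in_region(sorted_starts, 231, 1500))  # [231, 480, 912, 1500]
```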

Other tips are:

1. SAM->BAM conversion does not require a header that declares a sort order, or even a header at all.
If there is a header: samtools view -Sb in.sam > in.bam
If there is no header: samtools view -Sbt genome.fa.fai in.sam > in.bam
But the records in in.bam will not follow the order of the sequence dictionary (genome.fa.fai) unless you sort the file with samtools sort.

2. When given BAM input, cufflinks requires a proper header in the BAM file, especially the line

@HD VN:1.0 SO:coordinate

Without this line, cufflinks cannot tell from the file that your BAM (or SAM) is sorted, even if it is; it only knows if you provide the information through the @HD line. So, I guess

@HD VN:1.0 SO:unsorted 

won't work. 
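
A quick way to see what cufflinks would see is to dump the header (samtools view -H in.bam) and look at the SO tag on the @HD line. The helper below is a hypothetical sketch that parses such header text; the function name and the sample headers are made up for illustration, and it assumes the tab-separated fields that the SAM format uses in header lines.

```python
def header_sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return None  # @HD line present but without an SO tag
    return None  # no @HD line at all


# Hypothetical headers, in the tab-separated form printed by `samtools view -H`.
declared = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
undeclared = "@SQ\tSN:chr1\tLN:1000\n"

print(header_sort_order(declared))    # coordinate
print(header_sort_order(undeclared))  # None
```

Anything other than "coordinate" here, including a missing @HD line, means the file does not declare itself as coordinate-sorted, whatever order the records are actually in.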
