Tuesday, July 17, 2012

deeply understanding the SAM tags

First of all, read the SAM specification file carefully, which ought to contains all pieces of information you need. [http://samtools.sourceforge.net/SAM1.pdf]

According to the common questions asked (e.g. "SAM format: single vs multiple hits?", "For SAM files, if a query is aligned to multiple positions would I have multiple entries for the same query or would I get multiple alignment positions in the same entry for the query?"), I'd like to address based on my understanding to the SAM specification.

1. How to distinguish single hit vs. multiple hits from SAM file?

There are two ways to distinguish it. One is by the FLAG field; multi-mapped reads contain bit of 0x100 in their FLAG tag (e.g. 256 or 272 depending on which strand it mapped to), so use snip like awk '{if (and($2,0x100)) print}' can extract the secondarily-mapped hits (hits except the first one). The other way to tell multiple hits is to use the optional tag NH (if any). NH is "Number of reported alignments that contains the query in the current record", which should be >1 for multiple-mapped reads. The snip code to extract multi-mapped reads can be awk '/NH:i:[1-9]/ {print}'

Update: The snip code to extract unique mappers can be
awk '{for(i=1;i<=NF;i++) if($i~/NH:i:1\t/ || $i~/NH:i:1$/) print}'
awk '/NH:i:1[\t$]/ {print}'
or more elegantly,
fgrep -w NH:i:1
so, multimapper can be extraced by 
fgrep -vw NH:i:1

2. How to call variation (e.g. SNPs) from SAM file?

The variation information is already stored in the SAM file (since it's alignment result). Two places to get the info: 1) from CIGAR line. The SAM specification file already tells the details of CIGAR line. To note here is the 'M/I/D/N', which means alignment match (can be a sequence match or mismatch), insertion to the reference, deletion from the reference and skipped region from the reference. My understanding is I is similar as N, only difference is in length. "For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not de ned". To keep in mind is the M can contain mismatch. So, to specify what the mismatch is, there is another tag in the SAM file, called MD (String for mismatching positions). Here is the footnote for MD in the specification file: "The MD eld aims to achieve SNP/indel calling without looking at the reference. For example, a string '10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches followed by an A on the reference which is different from the aligned read base; the next 5 reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are matches. The MD eld ought to match the CIGAR string. "

3. How to take advantage of the header section (e.g. @RG) properly?

ADD: @HD line is necessary if running cufflinks on bam file. Use "samtools reheader" for that if the BAM file converted by "samtools view -bSt $BOWTIE_INDEXES/genome.fai" does not contain @HD line. I wonder the "samtools sort" should generate the @HD line. 
ADD2: how to use @RG tag?

4. Other tags in the SAM files
AS:Alignment score generated by aligner. For example, if you use bowtie2, the score "can be negative. Can be greater than 0 in --local mode (but not in --end-to-end mode)."
NM: number of mismatches (Edit distance to the reference), including mismatches in xxI in CIGAR line, and mismatches in xxM of CIGAR line (which is in MD string). For example, if CIGAR=22M3I4M and MD=25T0, then NM=4; if CIGAR=6M1D23M, MD=1T4^T1C21, then NM=3.
CC: chromosome name of next alignment, '=' if on the same chr.
CP: start position of next alignment
HI: hit index (increasing from from 0 to NH-1). The best hit does not have to be the first one. See example below.

$grep HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 ~/scratch/mouse_adult_wt_smallRNAseq_76nt_strand_ox/accepted_hits.sam
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 256 chr1 18909796 0 20M2D9M * 0 0 TAGCCTCTGTCAGCACTCCTGAGTTCAGA B;8:98@@;@=+?=8BBD=3=B=CBDDD= AS:i:-16 XN:i:0 XM:i:1 XO:i:1 XG:i:2 NM:i:3 MD:Z:20^GG7A1 YT:Z:UU NH:i:10 CC:Z:chr10 CP:i:73888540 HI:i:0
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 256 chr10 73888540 0 20M2D9M * 0 0 TAGCCTCTGTCAGCACTCCTGAGTTCAGA B;8:98@@;@=+?=8BBD=3=B=CBDDD= AS:i:-15 XN:i:0 XM:i:1 XO:i:1 XG:i:2 NM:i:3 MD:Z:20^GG8C0 YT:Z:UU NH:i:10 CC:Z:chr14 CP:i:25161322 HI:i:1
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 16 chr14 25161322 0 29M * 0 0 TCTGAACTCAGGAGTGCTGACAGAGGCTA =DDDBC=B=3=DBB8=?+=@;@@89:8;B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:29 YT:Z:UU NH:i:10 CC:Z:chr17 CP:i:22276988 HI:i:2
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 256 chr17 22276988 0 23M1D6M * 0 0 TAGCCTCTGTCAGCACTCCTGAGTTCAGA B;8:98@@;@=+?=8BBD=3=B=CBDDD= AS:i:-18 XN:i:0 XM:i:2 XO:i:1 XG:i:1 NM:i:3 MD:Z:21G1^A4A1 YT:Z:UU NH:i:10 CC:Z:chr18 CP:i:90147338 HI:i:3
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 272 chr18 90147338 0 6M1D23M * 0 0 TCTGAACTCAGGAGTGCTGACAGAGGCTA =DDDBC=B=3=DBB8=?+=@;@@89:8;B AS:i:-18 XN:i:0 XM:i:2 XO:i:1 XG:i:1 NM:i:3 MD:Z:1T4^T1C21 YT:Z:UU NH:i:10 CC:Z:chr6 CP:i:9831220 HI:i:4
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 272 chr6 9831220 0 5M1D24M * 0 0 TCTGAACTCAGGAGTGCTGACAGAGGCTA =DDDBC=B=3=DBB8=?+=@;@@89:8;B AS:i:-18 XN:i:0 XM:i:2 XO:i:1 XG:i:1 NM:i:3 MD:Z:1T0G2^C24 YT:Z:UU NH:i:10 CC:Z:= CP:i:89789445 HI:i:5
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 256 chr6 89789445 0 22M3I4M * 0 0 TAGCCTCTGTCAGCACTCCTGAGTTCAGA B;8:98@@;@=+?=8BBD=3=B=CBDDD= AS:i:-18 XN:i:0 XM:i:1 XO:i:1 XG:i:3 NM:i:4 MD:Z:25T0 YT:Z:UU NH:i:10 CC:Z:chr9 CP:i:17586725 HI:i:6
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 256 chr9 17586725 0 24M1D5M * 0 0 TAGCCTCTGTCAGCACTCCTGAGTTCAGA B;8:98@@;@=+?=8BBD=3=B=CBDDD= AS:i:-18 XN:i:0 XM:i:2 XO:i:1 XG:i:1 NM:i:3 MD:Z:24^G2C0A1 YT:Z:UU NH:i:10 CC:Z:= CP:i:115726831 HI:i:7
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 272 chr9 115726831 0 8M2D21M * 0 0 TCTGAACTCAGGAGTGCTGACAGAGGCTA =DDDBC=B=3=DBB8=?+=@;@@89:8;B AS:i:-15 XN:i:0 XM:i:1 XO:i:1 XG:i:2 NM:i:3 MD:Z:0G7^CC21 YT:Z:UU NH:i:10 CC:Z:chrX CP:i:24026352 HI:i:8
HWUSI-EAS1533_0026_FC:6:71:14711:9709#0 272 chrX 24026352 0 4M3I22M * 0 0 TCTGAACTCAGGAGTGCTGACAGAGGCTA =DDDBC=B=3=DBB8=?+=@;@@89:8;B AS:i:-14 XN:i:0 XM:i:0 XO:i:1 XG:i:3 NM:i:3 MD:Z:26 YT:Z:UU NH:i:10 HI:i:9


  1. Anonymous1:00 PM

    Hi, I want to extract multi-mapped reads from sam file generated from bwa (bwsw).
    I am working with 454 data and I have used bwa 0.5.9.

    The problem is that the flag to indicate that the read is a multiple hit not appear
    (256 or NH:i: ),
    So, I can not extract the reads that has multiple hits

    so I have two question?
    ## I'ts fine is all flags I can see in the sam file are 0, 4, 16?
    ## How can I extract the reads that have multiple Hits?

    e.g: I have two reads mapping in the same contig

    @SQ SN:Contig1276 LN:500

    I think that I can filt it using a Perl script that checking the MAPQ(5 column)?
    (In that case the higher is 27)
    It is correct?

    Thanks in advance,


  2. grep -v "^@" file.sam | cut -f 1 | sort | uniq -dc

  3. Anonymous3:19 AM

    I would like Kyle's answer to be correct, I'd really love. But it isn't always.