Here is another example of a confusing ID system in biology and bioinformatics.
You would expect each SNP to have one unique ID in the dbSNP database, right? That is not the case. Look at this example:
chr1 13837 13838 - rs7164031, rs79531918
chr1 13837 13838 + rs200683566, rs28391190, rs28428499, rs71252448, rs79817774
At the same location, multiple SNP IDs are assigned, on both strands.
Here is the explanation I got from the NCBI user service:
This is expected for many reasons:
- short probes could have multiple mapping locations on the genome
- certain genes could have duplicate/pseudogene/paralogs
- variations found in repeat regions would be difficult to map to a unique
An rsID represents a cluster of reported variations submitted to dbSNP.
In ideal situation, they can be mapped to a unique location in the genome.
There are cases where such unique mapping is NOT attainable.
But I am still not clear on how a cluster of variants is defined. [UPDATE to add]
P.S. A code snippet to merge SNPs at the same location into a single line:
grep single snp137.bed | sort -k1,1 -k2,2n -k6,6 | bedtools groupby -g 1,2,3,6 -c 4 -o collapse