## Monday, June 11, 2012

### Uniq and basic set theory

Imagine that I have two files:
aquatic - contains a list of aquatic animals
starfish
whale
nemo
crab
dolphin

mammal - contains a list of mammals
whale
RMS
batman
dolphin
scooby-doo

Given aquatic and mammal are two different sets, let’s use sort and uniq to play with a few basic set theory operations:
Union ( A U B - members in either A or B )
aquatic U mammal = {batman, crab, dolphin, nemo, RMS, scooby-doo, starfish, whale}
sort aquatic mammal | uniq
Intersection ( A ∩ B - members in both A and B )
aquatic ∩ mammal = {dolphin, whale}
sort aquatic mammal | uniq -d
Symmetric Difference ( A ^ B - members in A or B but not both )
aquatic ^ mammal = {batman, crab, nemo, RMS, scooby-doo, starfish}
sort aquatic mammal | uniq -u
Relative Complement ( A \ B - members in A but not in B )
aquatic \ mammal = {crab, nemo, starfish}
sort aquatic mammal | uniq -d | sort aquatic - | uniq -u

• "sort aquatic mammal | uniq -d" performs an intersection: aquatic ∩ mammal = {dolphin, whale}.
• "sort aquatic - | uniq -u" reads that intersection from standard input (the "-") and performs a symmetric difference with aquatic: aquatic ^ {dolphin, whale} = {crab, nemo, starfish}.
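The four operations above can be reproduced end to end with a small script. This is only a sketch: the scratch directory and the variable names are my own, and LC_ALL=C is set as an assumption so that the sort order does not depend on the machine's locale (in that locale, uppercase "RMS" sorts before the lowercase names).

```shell
#!/bin/sh
# Assumption: force byte-order sorting so results are the same everywhere.
export LC_ALL=C

# Work in a scratch directory (hypothetical location) and clean it up on exit.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Recreate the two sample files from the post.
printf '%s\n' starfish whale nemo crab dolphin > aquatic
printf '%s\n' whale RMS batman dolphin scooby-doo > mammal

# Union: every distinct line from either file.
union=$(sort aquatic mammal | uniq)
# Intersection: lines that occur more than once, i.e. in both files.
intersection=$(sort aquatic mammal | uniq -d)
# Symmetric difference: lines that occur exactly once.
symdiff=$(sort aquatic mammal | uniq -u)
# Relative complement: intersect first, then remove those lines from aquatic.
complement=$(sort aquatic mammal | uniq -d | sort aquatic - | uniq -u)

printf 'union:\n%s\n\n' "$union"
printf 'intersection:\n%s\n\n' "$intersection"
printf 'symmetric difference:\n%s\n\n' "$symdiff"
printf 'aquatic \\ mammal:\n%s\n' "$complement"
```

All of these rely on each file listing a given animal at most once; a name repeated inside a single file would be miscounted by uniq -d and uniq -u.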
UPDATED: I found a cleaner, more elegant one-liner for the relative complement:
sort aquatic mammal mammal | uniq -u
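Why this works: listing mammal twice guarantees that every mammal line appears at least twice in the sorted stream, so uniq -u discards all of them and only the lines exclusive to aquatic survive. A minimal sketch that wraps the one-liner in a helper and checks it against the two-step pipeline; the function name set_minus and the scratch directory are hypothetical, and LC_ALL=C is an assumption to keep the ordering predictable:

```shell
#!/bin/sh
export LC_ALL=C   # assumption: locale-independent sort order

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' starfish whale nemo crab dolphin > aquatic
printf '%s\n' whale RMS batman dolphin scooby-doo > mammal

# set_minus A B: print the lines of A that are not in B (hypothetical helper).
# B is passed to sort twice so none of its lines can ever be unique.
set_minus() {
    sort "$1" "$2" "$2" | uniq -u
}

one_step=$(set_minus aquatic mammal)
two_step=$(sort aquatic mammal | uniq -d | sort aquatic - | uniq -u)
reverse=$(set_minus mammal aquatic)

printf 'aquatic \\ mammal:\n%s\n\n' "$one_step"
printf 'mammal \\ aquatic:\n%s\n' "$reverse"
```

Swapping the arguments gives the complement in the other direction, which the two-step pipeline would need a second rewrite to do.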
“sort -u”: a shorthand for “sort | uniq”
sort -u is equivalent to sort | uniq: both sort the input and eliminate duplicate lines. Therefore, you may replace:
sort aquatic mammal | uniq
with:
sort -u aquatic mammal
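A quick way to convince yourself of the equivalence is to compare the two outputs directly. This is a small sketch on the sample files; the scratch directory and variable names are my own, and LC_ALL=C is an assumption:

```shell
#!/bin/sh
export LC_ALL=C   # assumption: locale-independent sort order

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf '%s\n' starfish whale nemo crab dolphin > aquatic
printf '%s\n' whale RMS batman dolphin scooby-doo > mammal

with_pipe=$(sort aquatic mammal | uniq)
with_flag=$(sort -u aquatic mammal)

if [ "$with_pipe" = "$with_flag" ]; then
    echo "sort -u and sort | uniq agree"
else
    echo "outputs differ"
fi
```

The shorthand only covers plain uniq. The intersection and symmetric difference still need the pipe, because sort has no counterpart to uniq -d or uniq -u.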
This is quite useful for operations like AND, OR, or +/-. For example, to subtract file2 from file1:
sort file1 file2 | uniq -u # outputs the lines that appear in exactly one of file1 and file2
This equals file1 minus file2 only when every line of file2 also appears in file1. If file2 may contain lines that are not in file1, it's safer to use:
sort file1 file2 file2 | uniq -u
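The doubled-file2 form still assumes that file1 has no repeated lines of its own: a line that occurs twice in file1 is dropped by uniq -u even when it is absent from file2. Below is a minimal sketch of that caveat with made-up file contents (the fruit names and the scratch directory are assumptions for illustration). It deduplicates file1 with sort -u first as one possible workaround:

```shell
#!/bin/sh
export LC_ALL=C   # assumption: locale-independent sort order

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical data: "apple" is listed twice in file1 and never in file2.
printf '%s\n' apple apple banana cherry > file1
printf '%s\n' banana date > file2

# Naive subtraction: "apple" occurs twice, so uniq -u throws it away.
naive=$(sort file1 file2 file2 | uniq -u)

# Deduplicate file1 first, then subtract; "-" is the piped, deduplicated file1.
deduped=$(sort -u file1 | sort - file2 file2 | uniq -u)

printf 'naive:\n%s\n\n' "$naive"
printf 'deduplicated first:\n%s\n' "$deduped"
```

Here the naive form prints only cherry, while the deduplicated form prints apple and cherry, which is the expected result of file1 minus file2.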