Wednesday, August 29, 2012

quick sort in the bash command line

There are many varieties of sorting algorithms (see the Wikipedia article on sorting algorithms for details). The time complexity of the fast ones is O(n log n).

In unix, we usually use the 'sort' command to sort input. I cannot find direct info on the computational complexity of the unix sort command, but it is definitely quite slow, especially when the dataset is large.
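Before reaching for anything fancier, it may be worth tuning 'sort' itself. GNU coreutils sort (8.6 and later) accepts flags for a bigger in-memory buffer, multiple threads, and a temp directory; this is a hedged sketch, and the flag values here are arbitrary examples (BSD sort may not support '--parallel'):

```shell
# A sketch assuming GNU coreutils sort >= 8.6; values are illustrative.
# -S 2G      : allow a 2 GB in-memory buffer before spilling to temp files
# --parallel : sort with several threads
# -T /tmp    : put the temporary spill files on a (hopefully fast) disk
sort -S 2G --parallel=4 -T /tmp data.tsv -o data.tsv.sorted
```

On a multi-core machine with enough RAM this alone can make a large difference, with no change to the rest of a pipeline.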

Here are several feasible solutions for quick sorting I googled:

1. "split & merge"

split -l5000000 data.tsv '_tmp';
for FILE in _tmp*; do sort "$FILE" -o "$FILE" & done; wait

Followed by:
sort -m _tmp* -o data.tsv.sorted

arnab suggested splitting the large file into small pieces, sorting them individually, and then using 'sort -m' to merge them. I've tried it, and it works very well; at least the 'sort -m' step is worth trying. (Note: each input to '--merge' must already be sorted. Here is what 'info sort' says:

Merge the given files by sorting them as a group. Each input file
must always be individually sorted. It always works to sort
instead of merge; merging is provided because it is faster, in the
case where it works.)
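The whole recipe can be combined into one self-contained sketch. The chunk size and the '_tmp' prefix are arbitrary choices carried over from the snippets above; note the 'wait', which makes sure every background sort has finished before the merge starts:

```shell
# Split-and-merge sort sketch: split, sort chunks in parallel, merge.
split -l5000000 data.tsv '_tmp'          # cut the input into 5M-line chunks
for FILE in _tmp*; do
    sort "$FILE" -o "$FILE" &            # sort each chunk in the background
done
wait                                     # block until all chunk sorts finish
sort -m _tmp* -o data.tsv.sorted         # merge the already-sorted chunks
rm -f _tmp*                              # clean up the temporary chunks
```

Since each input to 'sort -m' is already sorted, the merge step is a single linear pass over the data, which is where the speedup comes from.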

2. This is another implementation of quicksort in pure bash. I've not tested it yet.
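Since I haven't tested that implementation, here is my own hedged sketch of what a pure-bash quicksort can look like (an illustration, not the linked code): it recursively partitions its integer arguments around the first element and prints them in order. It's a toy; bash recursion is far too slow to compete with sort(1) on real data.

```shell
# Toy quicksort over integer arguments, printed one per line.
qsort() {
    [ $# -eq 0 ] && return           # empty input: nothing to do
    local pivot="$1"; shift          # first argument is the pivot
    local lo=() hi=() x
    for x in "$@"; do                # partition around the pivot
        if [ "$x" -lt "$pivot" ]; then lo+=("$x"); else hi+=("$x"); fi
    done
    qsort "${lo[@]}"                 # sort the smaller elements
    echo "$pivot"
    qsort "${hi[@]}"                 # sort the rest
}

qsort 3 1 4 1 5 9 2 6
```

For strings rather than integers, the '-lt' test would need to become a '<' comparison inside '[[ ]]'.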

3. BSD already seems to have library routines for other fast sorting algorithms:
