Wednesday, July 27, 2011

An efficient way to do dataset intersection

The main message is to use "match" to get the indices of the rows you need and then select the rows by index, instead of selecting them by row name, which is much slower. Here is an example:
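A minimal sketch of the idea, not the original code from this post. The matrix `x`, its column names, and the generated keys are all made up for illustration. Column 2 holds a key, and columns 4 onward depend only on that key:

```r
set.seed(1)
n    <- 2000
keys <- sample(paste0("g", 1:50), n, replace = TRUE)

# Hypothetical character matrix: column 2 is the key,
# columns 4 and 5 are fully determined by the key.
x <- cbind(
  id    = as.character(seq_len(n)),
  key   = keys,
  other = as.character(sample(100, n, replace = TRUE)),
  a     = toupper(keys),
  b     = paste0(keys, "_info")
)
rownames(x) <- keys

# Unique keys from column 2 (unname drops the row names carried along).
u <- unique(unname(x[, 2]))

# Select by row name: for duplicated names, the first matching row is taken.
by_name <- x[u, 4:ncol(x), drop = FALSE]

# Select by index: look up the first row of each key with match().
idx      <- match(u, x[, 2])
by_index <- x[idx, 4:ncol(x), drop = FALSE]

# Both routes pick the same rows.
same <- identical(by_name, by_index)
```

To compare the two routes on your own data, wrap each selection in `system.time()`.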

In the example above, we know that rows with the same value in the 2nd column also have the same values in columns 4 through the end. So, instead of running unique on the whole matrix, take the unique values of the 2nd column and then get the row index of each one with match. match(a, b) returns only the index of the first occurrence of each element of a in b. For example:
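A small illustration of the first-occurrence behaviour, using made-up vectors `a` and `b`:

```r
b <- c("x", "y", "x", "z", "y")
a <- c("y", "x", "w")

# Position in b of the first occurrence of each element of a; NA if absent.
first_pos <- match(a, b)          # 2 1 NA

# Index of the first row for each distinct value of b.
uniq_pos <- match(unique(b), b)   # 1 2 4
```

Because only the first occurrence is returned, `match(unique(b), b)` gives exactly one index per distinct value, which is what makes it usable for deduplication.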

This tip also helps when intersecting two big data frames. For example:
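A sketch of how this might look, assuming two hypothetical data frames `df1` and `df2` that share an `id` column whose values are unique within each frame (match only returns the first occurrence, so duplicated ids would be silently ignored):

```r
df1 <- data.frame(id = c("a", "b", "c", "d"), v1 = 1:4,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = c("d", "b", "e"), v2 = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# Keys present in both data frames.
common <- intersect(df1$id, df2$id)

# Row positions of the common keys in each data frame.
i1 <- match(common, df1$id)
i2 <- match(common, df2$id)

# Select by index and line the two sides up on the shared key.
both <- cbind(df1[i1, , drop = FALSE], v2 = df2$v2[i2])
```

Here `both` has one row per shared id, with the columns of `df1` followed by `v2` from `df2`.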
