Quickest algorithm for finding sets with high intersection

Posted by conradlee on Stack Overflow See other posts from Stack Overflow or by conradlee
Published on 2010-04-23T08:34:38Z Indexed on 2010/04/23 8:53 UTC
Read the original article Hit count: 321

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs (i.e., all integer sets have a cardinality of 100).

I want to find all pairs of integer sets that have an intersection of 15 or greater.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algorithm should I use to compute set intersection)? I prefer answers that relate C/C++ (especially STL), but also any more general, algorithmic insights are welcome.

Update Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

© Stack Overflow or respective owner

Related posts about set

Related posts about intersection