How to pick random (small) data samples using Map/Reduce?

Posted by Andrei Savu on Stack Overflow
Published on 2010-03-25

Filed under: mapreduce | hadoop

I want to write a map/reduce job that selects a small random sample of rows from a large dataset, based on a row-level condition, while minimizing the number of intermediate keys.

Pseudocode:

for each row:
  if row matches condition:
    if the bucket is not yet full:
      add row.id to the bucket

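For context, the shape I have in mind looks like reservoir sampling run independently in each mapper, so that every mapper emits at most SAMPLE_SIZE intermediate keys. Below is a minimal sketch, assuming a Hadoop Streaming mapper that reads tab-separated rows on stdin; SAMPLE_SIZE, matches_condition, and the id-in-first-field layout are placeholders, not part of the actual job:

    #!/usr/bin/env python
    import random
    import sys

    SAMPLE_SIZE = 1000  # placeholder: how many ids each mapper keeps

    def matches_condition(line):
        # placeholder for the row-level condition
        return True

    reservoir, seen = [], 0
    for line in sys.stdin:
        if not matches_condition(line):
            continue
        seen += 1
        row_id = line.split('\t', 1)[0]  # assume the id is the first field
        if len(reservoir) < SAMPLE_SIZE:
            reservoir.append(row_id)     # fill the bucket first
        else:
            # replace with probability SAMPLE_SIZE / seen, so every
            # matching row is equally likely to stay in the reservoir
            j = random.randrange(seen)
            if j < SAMPLE_SIZE:
                reservoir[j] = row_id

    for row_id in reservoir:
        print(row_id)

A single reducer could run the same loop over the mapper outputs to cut the result down to the final size, though strictly speaking, merging per-mapper reservoirs is only exactly uniform if weighted by each mapper's matching-row count.
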
Have you done something like this? Is there any well-known algorithm?

A sample containing sequential rows is also good enough.
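
In case it helps, here is what I mean by a sequential sample: jump to a random offset and take the next k matching rows (a rough sketch; all names are hypothetical):

    import itertools
    import random

    def sequential_sample(rows, k, total_rows, matches_condition):
        # total_rows only needs to be approximate; skip to a random
        # offset, then collect the next k matching rows
        start = random.randrange(max(total_rows - k, 1))
        matching = (r for r in itertools.islice(rows, start, None)
                    if matches_condition(r))
        return list(itertools.islice(matching, k))

    # e.g. sequential_sample(range(10**6), 100, 10**6, lambda r: r % 2 == 0)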

Thanks.
