K-means on MapReduce in Python

Posted by user3616059 on Stack Overflow
Published on 2014-06-10T09:15:25Z Indexed on 2014/06/10 9:24 UTC

I am going to write a mapper and a reducer for the k-means algorithm. I think the best approach is to put the distance calculation in the mapper and send each row to the reducer with the cluster ID as the key and the row's coordinates as the value. The reducer would then update the centroids. I am writing this in Python.
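The split described above could be sketched roughly as follows. This is only a minimal sketch under several assumptions that are not in the question: rows are comma-separated numbers, distance is Euclidean, and the current centroids are already available to the mapper as a list (in a real streaming job they would have to be loaded from a file shipped with the job). The function names `nearest_centroid`, `map_lines`, and `reduce_lines` are hypothetical.

```python
from itertools import groupby


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_lines(lines, centroids):
    """Mapper logic: emit 'cluster_id<TAB>row' for every non-empty input row."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: each row is a comma-separated list of numbers.
        point = [float(v) for v in line.split(",")]
        yield f"{nearest_centroid(point, centroids)}\t{line}"


def reduce_lines(lines):
    """Reducer logic: input must be sorted by key, as Hadoop Streaming delivers it.

    Emits 'cluster_id<TAB>new_centroid', where the new centroid is the mean
    of all rows assigned to that cluster.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for cluster_id, group in groupby(pairs, key=lambda kv: kv[0]):
        points = [[float(v) for v in value.split(",")] for _, value in group]
        centroid = [sum(dim) / len(points) for dim in zip(*points)]
        yield cluster_id + "\t" + ",".join(f"{c:.6f}" for c in centroid)
```

In an actual mapper script the body would be something like `for out in map_lines(sys.stdin, centroids): print(out)`, and likewise `reduce_lines(sys.stdin)` in the reducer script. Keeping the logic in plain generator functions makes it easy to check locally with a few hand-written rows before running it on a cluster.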

As you know, I have to use Hadoop Streaming, which passes data through STDIN and STDOUT. As far as I know, when the mapper prints key + "\t" + value, that record is sent to the reducer. The reducer receives the data and calculates the new centroids, but when it prints them, I think they are not sent back to the mapper to compute new clusters; they just go to STDOUT. K-means, however, is an iterative algorithm. So my question is: is Hadoop Streaming poorly suited to iterative programs, and should we use mrjob for them instead?
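Since one streaming job is a single map pass followed by a single reduce pass, the iteration would have to live outside the job, in a driver that reruns it with the latest centroids. The sketch below shows only that outer loop, under assumptions: `run_job` is a hypothetical callable standing in for "submit one MapReduce pass and read back the new centroids", and `local_job` is an in-memory stand-in used purely for illustration, not a Hadoop call.

```python
def max_shift(old, new):
    """Largest Euclidean distance any centroid moved between two iterations."""
    return max(
        sum((a - b) ** 2 for a, b in zip(o, n)) ** 0.5
        for o, n in zip(old, new)
    )


def run_kmeans(run_job, centroids, max_iter=20, tol=1e-4):
    """Rerun one map + reduce pass until the centroids stop moving.

    run_job is a hypothetical callable: it takes the current centroids and
    returns the updated ones (on a cluster it would launch one job).
    Returns the final centroids and the number of passes executed.
    """
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_centroids = run_job(centroids)
        shift = max_shift(centroids, new_centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, iteration


def local_job(points):
    """Build an in-memory stand-in for one map + reduce pass (illustration only)."""
    def job(centroids):
        clusters = {}
        for p in points:
            # "Map": assign the point to its nearest centroid.
            cid = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters.setdefault(cid, []).append(p)
        # "Reduce": average each cluster; keep the old centroid if it is empty.
        return [
            [sum(dim) / len(clusters[i]) for dim in zip(*clusters[i])]
            if i in clusters else centroids[i]
            for i in range(len(centroids))
        ]
    return job
```

For example, `run_kmeans(local_job(points), initial_centroids)` repeats the pass until no centroid moves by more than `tol`. The same loop shape would apply whichever tool launches the individual jobs.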

© Stack Overflow or respective owner