Converting Python collaborative filtering code to use MapReduce

Posted by Neil Kodner on Stack Overflow
Published on 2010-05-21T11:04:57Z

Using Python, I'm computing cosine similarity across items.

Given event data where each record represents a purchase (user, item), I have a list of all items 'bought' by my users.

Given this input data

(user,item)
X,1
X,2
Y,1
Y,2
Z,2
Z,3

I build a python dictionary

{1: ['X','Y'], 2 : ['X','Y','Z'], 3 : ['Z']}
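
For reference, a rough sketch of how that dictionary can be built from the (user, item) purchase events above (the purchases list and variable names here are illustrative):

from collections import defaultdict

# hypothetical list of (user, item) purchase events from the example
purchases = [('X', 1), ('X', 2), ('Y', 1), ('Y', 2), ('Z', 2), ('Z', 3)]

# map each item to the list of users who bought it
item_users = defaultdict(list)
for user, item in purchases:
    item_users[item].append(user)

# dict(item_users) == {1: ['X', 'Y'], 2: ['X', 'Y', 'Z'], 3: ['Z']}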

From that dictionary, I generate a bought/not-bought matrix, stored as another dictionary (bnb).

{1 : [1,1,0], 2 : [1,1,1], 3 : [0,0,1]} 
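
One way bnb can be derived from the item-to-users dictionary above (a sketch; it assumes a fixed ordering of users defines the positions in each vector):

# a fixed ordering of users defines the columns of each 0/1 vector
users = sorted({u for buyers in item_users.values() for u in buyers})

# bnb[item][i] is 1 if users[i] bought the item, else 0
bnb = {item: [1 if u in buyers else 0 for u in users]
       for item, buyers in item_users.items()}

# bnb == {1: [1, 1, 0], 2: [1, 1, 1], 3: [0, 0, 1]}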

From there, I compute the similarity between items (1, 2) by calculating the cosine between (1,1,0) and (1,1,1), which yields 0.816496.
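
The coSim function itself isn't shown here; a plain-Python version consistent with that number would look something like this:

import math

def coSim(a, b):
    # cosine similarity between two equal-length 0/1 vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

coSim([1, 1, 0], [1, 1, 1])   # 0.81649658...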

I'm doing this by:

items=[1,2,3]
for item in items:
  for sub in items:
    if sub >= item:    #as to not calculate similarity on the inverse
      sim = coSim( bnb[item], bnb[sub] )

I think the brute-force approach is killing me, and it only gets slower as the data grows. On my trusty laptop, this calculation runs for hours with 8,500 users and 3,500 items.

I'm trying to compute pairwise similarity for every item in my dict, and it's taking longer than I'd like. I think this is a good candidate for MapReduce, but I'm having trouble 'thinking' in terms of key/value pairs.
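
For illustration, one common key/value framing for 0/1 data like this is to map over each user's basket and emit item pairs, letting the reducer sum them: those sums are exactly the dot products (for distinct pairs) and the squared norms (for an item paired with itself) that the cosine needs. The sketch below uses plain Python functions and a tiny local driver standing in for an actual Hadoop job; all names are illustrative:

from itertools import combinations
from collections import defaultdict
from math import sqrt

def mapper(user, items):
    # one record per item (squared norm) and per item pair (dot product)
    for item in items:
        yield (item, item), 1
    for a, b in combinations(sorted(items), 2):
        yield (a, b), 1

def reducer(pair, counts):
    yield pair, sum(counts)

# tiny local driver over the example data; Hadoop would do the shuffle/group
user_items = {'X': [1, 2], 'Y': [1, 2], 'Z': [2, 3]}
grouped = defaultdict(list)
for user, items in user_items.items():
    for key, value in mapper(user, items):
        grouped[key].append(value)
sums = dict(kv for key, values in grouped.items() for kv in reducer(key, values))

# cosine(a, b) = co-occurrences / sqrt(count_a * count_b) for 0/1 vectors
cosine = {(a, b): sums[(a, b)] / sqrt(sums[(a, a)] * sums[(b, b)])
          for (a, b) in sums if a != b}
print(cosine[(1, 2)])   # 0.81649658...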

Alternatively, is the issue with my approach itself, making this not necessarily a good candidate for MapReduce?
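
Separately from MapReduce, the nested pure-Python loop over list-based vectors may itself be the bottleneck. A vectorized sketch with NumPy (assuming bnb and items as defined above) computes the whole item-item similarity matrix in one matrix product:

import numpy as np

# rows = items, columns = users, 0/1 entries taken from the bnb dictionary
M = np.array([bnb[item] for item in items], dtype=float)

# normalize each row to unit length, then one matrix product yields
# every pairwise cosine similarity at once
norms = np.sqrt((M * M).sum(axis=1))
Mn = M / norms[:, None]
sim = Mn.dot(Mn.T)

sim[0, 1]   # cosine between items 1 and 2 -> 0.81649658...

At 3,500 items by 8,500 users the dense matrix is on the order of a couple hundred MB as float64, so it should still fit in memory; scipy.sparse is an option if it doesn't.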
