Document Similarity: Comparing two documents efficiently

Posted by seanieb on Stack Overflow
Published on 2010-03-13T10:24:55Z

I have a loop that calculates the similarity between two documents. For each document, it collects all the tokens and their scores and places them in a dictionary. It then compares the two dictionaries.

This is what I have so far. It works, but it is super slow:

# Doc A
# note the trailing comma: the query parameter is passed as a one-element tuple
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
doca = cursor1.fetchall()
# convert the (token, tfidf_norm) rows to a dictionary
doca_dic = dict(doca)

# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
docb = cursor2.fetchall()
# convert the (token, tfidf_norm) rows to a dictionary
docb_dic = dict(docb)

# loop through each token in doca and see if one matches in docb
similarity = 0
for x in doca_dic:
    if x in docb_dic:
        # calculate the similarity by summing the products of the tfidf_norm values
        similarity += doca_dic[x] * docb_dic[x]
print("similarity")
print(similarity)
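
For reference, the comparison step on its own is a dot product over the tokens the two dictionaries share. Below is a minimal sketch of that step as a standalone function, with no database involved. The function name `sparse_dot` and the sample scores are made up for illustration, and iterating over the smaller dictionary is just one possible tweak, not a measured fix for the slowness.

```python
def sparse_dot(doc_a, doc_b):
    """Sum the products of the scores for tokens present in both dicts."""
    # Iterate over the smaller dict so fewer membership tests are needed.
    if len(doc_a) > len(doc_b):
        doc_a, doc_b = doc_b, doc_a
    return sum(score * doc_b[token]
               for token, score in doc_a.items()
               if token in doc_b)


# Made-up scores, purely for illustration (not real tf-idf values).
doca_dic = {"apple": 0.5, "pear": 0.25, "plum": 0.125}
docb_dic = {"apple": 0.5, "plum": 0.5, "kiwi": 0.75}

similarity = sparse_dot(doca_dic, docb_dic)
print(similarity)
```

Only "apple" and "plum" appear in both dictionaries, so only those two products contribute to the sum.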

I'm pretty new to Python, hence this mess. I need to speed it up, and any help would be appreciated. Thanks.
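
One direction that might be worth trying is to let the database compute the sum with a self-join, so that only a single number comes back per document pair instead of two full token lists. The sketch below is not the original setup: it uses an in-memory SQLite database as a stand-in for MySQL, a hypothetical table name `token_index`, a hypothetical helper `similarity_in_sql`, and made-up rows. With MySQLdb the placeholders would be `%s` rather than `?`. Whether this is actually faster would depend on the indexes on `doc_id` and `token`, which the post does not describe.

```python
import sqlite3

# Stand-in database: in-memory SQLite instead of the MySQL server in the post.
# The table name token_index is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_index (doc_id INTEGER, token TEXT, tfidf_norm REAL)"
)
# Made-up rows, purely for illustration.
conn.executemany(
    "INSERT INTO token_index VALUES (?, ?, ?)",
    [
        (1, "apple", 0.5), (1, "pear", 0.25), (1, "plum", 0.125),
        (2, "apple", 0.5), (2, "plum", 0.5), (2, "kiwi", 0.75),
    ],
)


def similarity_in_sql(conn, doc_a_id, doc_b_id):
    """Sum the score products over tokens shared by the two documents."""
    row = conn.execute(
        "SELECT SUM(a.tfidf_norm * b.tfidf_norm) "
        "FROM token_index AS a "
        "JOIN token_index AS b ON a.token = b.token "
        "WHERE a.doc_id = ? AND b.doc_id = ?",
        (doc_a_id, doc_b_id),
    ).fetchone()
    # SUM over zero matching rows is NULL (None in Python); treat that as 0.0.
    return row[0] if row[0] is not None else 0.0


print(similarity_in_sql(conn, 1, 2))
```

The join matches rows of document A to rows of document B on `token`, which plays the same role as the membership test in the Python loop.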

© Stack Overflow or respective owner
