Very fast document similarity

Posted by peyton on Stack Overflow
Published on 2010-05-13T18:23:30Z Indexed on 2010/05/13 18:34 UTC

Filed under: cosine | search | similarity | Performance

Hello,

I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical.

My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!
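
For reference, here is a minimal Python sketch of cosine similarity between sparse term vectors. The function names (`term_vector`, `cosine_similarity`), the whitespace tokenizer, and the use of raw term counts are assumptions made for illustration and are not part of the question; a real system would likely use TF-IDF weights and a proper tokenizer.

```python
import math
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector as a sparse mapping term -> count (illustrative)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    # Iterate over the smaller vector; a term missing from either side adds 0.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because only terms present in both vectors contribute to the dot product, the cost depends on the number of distinct terms in the shorter document, not on the vocabulary size.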

I have already considered the following ways to speed things up:

- Pre-normalize all the term vectors
- Calculate one term vector per group (~10,000 groups) rather than per e-mail (~1,000,000 e-mails); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know!
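
The two ideas above can be sketched together in Python as follows. This is only an illustration under assumptions: the helper names (`normalize`, `group_vectors`, `best_group`) and the sample data are hypothetical, vectors are plain dicts of term weights, and a group vector is taken to be the normalized sum of its e-mails' vectors. Once every stored vector has unit length, ranking groups by cosine similarity reduces to ranking by dot product.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector to unit length (an empty vector stays empty)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def group_vectors(emails):
    """emails: iterable of (group_name, term_vector) pairs.

    Returns group_name -> unit-length vector built from the summed e-mail vectors.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for group, vec in emails:
        for term, weight in vec.items():
            sums[group][term] += weight
    return {group: normalize(vec) for group, vec in sums.items()}


def best_group(query, groups):
    """Return the group whose pre-normalized vector has the largest dot product."""
    q = normalize(query)

    def score(group_vec):
        return sum(w * group_vec.get(t, 0.0) for t, w in q.items())

    return max(groups, key=lambda g: score(groups[g]))


# Hypothetical sample data.
emails = [
    ("work", {"meeting": 2, "report": 1}),
    ("work", {"report": 3}),
    ("family", {"dinner": 2, "birthday": 1}),
]
groups = group_vectors(emails)
print(best_group({"report": 1, "deadline": 1}, groups))
```

One possible reason for caution: a single summed vector blurs together any distinct sub-topics inside a group, so a group with very mixed content may match less sharply than its individual e-mails would.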

I have a few questions:

- If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.
- Is there some clever way to only consider vectors which are likely to be close to the query document?
- Is there some way to be more frugal about the amount of memory I'm using for all these vectors?
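
To make these questions concrete, here is a rough Python sketch of one common approach, an inverted index over sparse vectors. It is an assumption-laden illustration, not a recommendation from the original post: the names (`build_inverted_index`, `top_matches`) and the sample vectors are hypothetical, and stored vectors are assumed to be unit-length dicts keyed by term.

```python
from collections import defaultdict


def build_inverted_index(vectors):
    """vectors: id -> unit-length sparse vector.

    Returns term -> list of (id, weight); only non-zero weights are stored.
    """
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def top_matches(query, index, k=3):
    """Score only the vectors that share at least one term with the query."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        # A term the index has never seen simply has no postings.
        for doc_id, weight in index.get(term, ()):
            scores[doc_id] += q_weight * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical sample data: three pre-normalized vectors.
vectors = {"a": {"x": 0.6, "y": 0.8}, "b": {"y": 1.0}, "c": {"z": 1.0}}
index = build_inverted_index(vectors)
matches = top_matches({"y": 1.0, "new": 1.0}, index)
```

In this sketch an unseen term such as `new` contributes nothing to any score and leaves the stored vectors untouched, as long as the weights are plain term counts; with IDF weighting the document frequencies would shift as new e-mails arrive. Vectors with no term in common with the query (`c` here) are never visited, and memory grows with the number of non-zero weights rather than with vocabulary size times document count.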

Thanks!

© Stack Overflow or respective owner

Related posts about cosine

Find cosine similarity in R
as seen on Stack Overflow
I'm wondering if there is a built-in function in R that can find the cosine similarity (or cosine distance) between two arrays? Currently, I implemented my own function, but I can't help but think that R should already come with one :) Thanks, Derek

java cosine similarity problem
as seen on Stack Overflow
Hi again :) I developed a Java program to calculate cosine similarity on the basis of TF*IDF. It worked very well. But there is one problem.... :( For example: if I have the following two matrices and I want to calculate cosine similarity, it does not work, as the rows are not the same length doc 1 1 2 3 4…

How do you efficiently implement a document similarity search system?
as seen on Stack Overflow
How do you implement a "similar items" system for items described by a set of tags? In my database, I have three tables, Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article I want to find the five most similar articles to implement…

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python
as seen on Stack Overflow
I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple. Implementing a vanilla version of n-grams (where it is possible to define how many grams to use), along with a simple implementation of tf-idf and Cosine similarity…

Binary Cosine Coefficient
as seen on Stack Overflow
I was given the following formula for calculating this: sim = |Q ∩ D| / (√|Q| · √|D|). I went ahead and implemented a class to compare strings consisting of a series of words #pragma once #include <vector> #include <string> #include <iostream> #include <vector> using namespace…

Related posts about search

"Error in the Site Data Web Service." when performing crawl
as seen on Server Fault
Installed SharePoint Services v3 (SP2, October 2009 cumulative updates, Language Pack), attached to a content database I had previously (all works). Installed Search Server 2008 Express (with language pack) on top of WSS and crawl does not work. However it works for newly created web application +…

Search Alternative Search Engines from within Bing’s Search Page
as seen on How-To Geek
So you love using Bing Search but may still be curious to see what another search engine will provide if used. Now you can search using another search engine from within the Bing Search page and enjoy numbered results using two simple user scripts. Note: These user scripts may also be added to other…

CONVERT(int, (datepart(month, @search)), (datepart(day, @search)), DateAdd(year, Years.Year - (datepart(year, @search)))
as seen on Stack Overflow
In the query, the top part gets all the years that will run in the stored procedure. That works fine. At first I just wanted to run the queries for yesterday's date for all the years, but now I realize I want the user to select a date that will be passed in a parameter, @search. Booked <= CONVERT(int…

Improving MOSS Search: synonyms and Best Bets - Managing synonyms in MOSS Search
as seen on ASP-PHP.net
The MOSS search engine allows a list of synonyms to be configured. In this article we will see how to carry out this task and what it can offer your users. We will also see how to automate this configuration a little further by using code or tools…

Using a MOSS 2007 Search with SPS 2003 - How to use a MOSS Search with an SPS portal
as seen on ASP-PHP.net
Microsoft Office SharePoint Server 2007 (MOSS) provides many features that are not available in SharePoint Portal Server 2003 (SPS). This is particularly true of the search engine. This search engine can nevertheless be used without waiting for an upgrade of the site…