How do you efficiently implement a document similarity search system?
Posted by Björn Lindqvist on Stack Overflow
Published on 2010-02-03T10:36:32Z
How do you implement a "similar items" system for items described by a set of tags?
In my database, I have three tables: Article, ArticleTag and Tag. Each Article is related to a number of Tags via a many-to-many relationship. For each Article, I want to find the five most similar articles in order to implement an "if you like this article, you will like these too" system.
I am familiar with cosine similarity, and the results it gives are very good. But it is far too slow. For each article, I need to iterate over all other articles, calculate the cosine similarity for each article pair, and then select the five articles with the highest similarity score.
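The brute-force approach described above can be sketched roughly as follows. This is a minimal sketch, not the actual implementation: it assumes tags are unweighted (each article is a binary tag vector, so the cosine similarity reduces to the number of shared tags divided by the square root of the product of the two tag counts) and that the tag sets have already been loaded from the database into a dictionary. The names `article_tags`, `cosine_similarity` and `most_similar`, and the sample data, are hypothetical.

```python
import heapq
import math


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity of two tag sets treated as binary vectors."""
    if not tags_a or not tags_b:
        return 0.0
    shared = len(tags_a & tags_b)
    return shared / math.sqrt(len(tags_a) * len(tags_b))


def most_similar(article_id, article_tags, k=5):
    """Return the k (article_id, score) pairs most similar to article_id.

    Every other article is scored, so the cost of one lookup grows
    linearly with the size of the corpus.
    """
    target = article_tags[article_id]
    scored = (
        (other_id, cosine_similarity(target, tags))
        for other_id, tags in article_tags.items()
        if other_id != article_id
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical sample data: article id -> set of tag names.
article_tags = {
    1: {"python", "search", "indexing"},
    2: {"python", "search"},
    3: {"cooking", "recipes"},
    4: {"search", "indexing", "databases"},
}
```

Calling `most_similar(1, article_tags, k=2)` ranks article 2 ahead of article 4, since both share two tags with article 1 but article 2 has fewer tags overall. The full scan inside `most_similar` is the part that becomes expensive as the number of articles grows.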
With 200k articles and 30k tags, it takes me half a minute to calculate the similar articles for a single article. So I need another algorithm that produces results roughly as good as cosine similarity, but that can run in real time and does not require me to iterate over the whole document corpus each time.
Maybe someone can suggest an off-the-shelf solution for this? Most of the search engines I have looked at do not support document similarity searching.
© Stack Overflow or respective owner