What algorithms do "the big ones" use to cluster news?

Posted by marco92w on Stack Overflow
Published on 2009-05-27T18:27:31Z Indexed on 2010/03/22 0:21 UTC

I want to cluster texts for a news website.

At the moment I use this algorithm to find related articles. I have also found that PHP's similar_text() gives very good results.

What sort of algorithms do "the big ones" (Google News, Topix, Techmeme, Wikio, Megite, etc.) use? Of course, nobody outside those companies knows exactly how their algorithms work, since they are kept secret. But maybe someone knows roughly how they work?

The algorithm I use at the moment is very slow because it only compares two articles at a time. To get the relations between all 5,000 articles, you need to compare every pair, which is about 12,500,000 comparisons. This is quite a lot.
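
To make the 12,500,000 figure concrete: comparing every unordered pair among n articles takes n * (n - 1) / 2 comparisons. A minimal sketch of that arithmetic follows; the function name is only illustrative and is not part of the algorithm mentioned above.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n articles, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


print(pairwise_comparisons(5000))  # 12497500, roughly 12.5 million
```

The count grows quadratically, so doubling the number of articles roughly quadruples the work.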

Are there alternatives that reduce the number of necessary comparisons? (I'm not looking for improvements to my own algorithm.) What do "the big ones" do? I'm sure they don't compare every article with every other one, 12,500,000 times for 5,000 news items.
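
To illustrate the kind of alternative I mean (this is only a sketch of one generic idea, and I am not claiming any of the sites above work this way): an inverted index from words to articles lets you run the expensive comparison only on pairs that share at least one reasonably rare word. The function name, the `max_df` cutoff, and the sample articles are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(articles: dict[str, str], max_df: float = 0.5) -> set[tuple[str, str]]:
    """Return pairs of article ids that share at least one non-common word.

    max_df is a hypothetical cutoff: words that occur in more than this
    fraction of all articles are ignored, since they would pair up almost
    everything.
    """
    # Inverted index: word -> ids of the articles containing it.
    index: dict[str, set[str]] = defaultdict(set)
    for article_id, text in articles.items():
        for word in set(text.lower().split()):
            index[word].add(article_id)

    limit = max_df * len(articles)
    pairs: set[tuple[str, str]] = set()
    for ids in index.values():
        if len(ids) > limit:
            continue  # word is too common to be a useful signal
        # Only articles listed under the same word become candidates.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


articles = {
    "a1": "apple unveils new iphone",
    "a2": "new iphone reviewed by critics",
    "a3": "storm hits florida coast",
    "a4": "florida coast braces for storm",
}
print(sorted(candidate_pairs(articles)))  # [('a1', 'a2'), ('a3', 'a4')]
```

Here only 2 of the 6 possible pairs would go on to the slow similarity function. How much this saves on real news text depends heavily on the vocabulary and the cutoff.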

It would be great if somebody could shed some light on this topic.

© Stack Overflow or respective owner
