What algorithms do "the big ones" use to cluster news?

Posted by marco92w on Stack Overflow
Published on 2009-05-27T18:27:31Z Indexed on 2010/03/22 0:21 UTC

Filed under: mashups | clustering | algorithm

I want to cluster texts for a news website.

At the moment I use this algorithm to find related articles, but I found out that PHP's similar_text() gives very good results, too.

What sort of algorithms do "the big ones" (Google News, Topix, Techmeme, Wikio, Megite etc.) use? Of course, nobody outside these companies knows exactly how the algorithms work, since they are secret. But maybe someone knows roughly how they work?

The algorithm I use at the moment is very slow because it only compares two articles at a time. To get the relations between 5,000 articles you need about 12,500,000 comparisons (5,000 × 4,999 / 2 pairs), which is quite a lot.
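To make the arithmetic concrete, here is a minimal Python sketch of the all-pairs approach described above. It is an illustration only: `pair_count` and `all_pairs_similarity` are hypothetical names, and `difflib.SequenceMatcher` is used as a rough stand-in for a string similarity measure, not as an equivalent of PHP's similar_text() or of the algorithm the question refers to.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n articles: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs_similarity(articles: list[str]) -> dict[tuple[int, int], float]:
    """Compare every article with every other one (the slow approach)."""
    scores = {}
    for i, j in combinations(range(len(articles)), 2):
        # Assumption: SequenceMatcher.ratio() stands in for the real measure.
        scores[(i, j)] = SequenceMatcher(None, articles[i], articles[j]).ratio()
    return scores


if __name__ == "__main__":
    print(pair_count(5000))  # 12497500, i.e. "about 12,500,000"
```

The number of comparisons grows quadratically, so doubling the number of articles roughly quadruples the work.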

Are there alternatives that reduce the number of necessary comparisons? [I am not looking for improvements to my algorithm.] What do "the big ones" do? I'm sure they don't compare every article to every other one, which would mean 12,500,000 comparisons for 5,000 news articles.

It would be great if somebody could say something about this topic.
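As one illustration of the kind of alternative asked about above, the sketch below builds an inverted index (term to article ids) and only proposes pairs of articles that share at least one not-too-common term. This is a generic candidate-generation idea, not a claim about what Google News or the other sites do. The function name `candidate_pairs`, the whitespace tokenization, and the `max_postings` cutoff are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(articles: list[str], max_postings: int = 50) -> set[tuple[int, int]]:
    """Return index pairs of articles that share at least one term."""
    # Inverted index: term -> set of article ids containing that term.
    index = defaultdict(set)
    for doc_id, text in enumerate(articles):
        for term in set(text.lower().split()):
            index[term].add(doc_id)

    pairs = set()
    for doc_ids in index.values():
        # Skip very common terms: they would pair almost everything.
        if len(doc_ids) > max_postings:
            continue
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


if __name__ == "__main__":
    news = [
        "apple launches phone",
        "apple phone review",
        "election results tonight",
        "tonight football final",
    ]
    print(sorted(candidate_pairs(news)))
```

Only the candidate pairs would then be passed to the expensive similarity function. How much this saves depends on the vocabulary and on the cutoff chosen for common terms.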

© Stack Overflow or respective owner
