How do I express subtle relationships in my data?

Posted by Chuck H on Programmers
Published on 2012-12-10T18:19:49Z Indexed on 2012/12/10 23:17 UTC

"A" is related to "B" and "C". How do I show that "B" and "C" might, by this context, be related as well?

Example:

Here are a few headlines about a recent Broadway play:

1 - David Mamet's Glengarry Glen Ross, Starring Al Pacino, Opens on Broadway
2 - Al Pacino in 'Glengarry Glen Ross': What did the critics think?
3 - Al Pacino earns lackluster reviews for Broadway turn
4 - Theater Review: Glengarry Glen Ross Is Selling Its Stars Hard
5 - Glengarry Glen Ross; Hey, Who Killed the Klieg Lights?

Problem:

Running a fuzzy string match over these records establishes some of the relationships but misses others that a human reader would pick out from context, even in a much larger dataset.

How do I find the relationship that suggests #3 is related to #4? Both of them can be easily connected to #1, but not to each other.

Is there a (Googlable) name for this kind of data or structure? What kind of algorithm am I looking for?
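
One way to frame the question, as a rough sketch rather than a definitive answer: treat each headline as a node, link two headlines when they look directly related, and then group everything that is reachable through a chain of links. Searchable names for that idea include "connected components", "transitive closure", "union-find", and "single-linkage clustering". The code below assumes a very crude notion of "directly related" (sharing at least two non-trivial words). The names `STOPWORDS`, `tokens`, `group_headlines`, and `min_shared` are made up for illustration, and the last two headlines in the list are invented filler, not part of the example above.

```python
import re
from itertools import combinations

# Assumption: a tiny hand-picked stopword list, just enough for this example.
STOPWORDS = {"a", "an", "the", "in", "on", "for", "is", "its",
             "what", "did", "who", "of", "to", "s"}

HEADLINES = [
    "David Mamet's Glengarry Glen Ross, Starring Al Pacino, Opens on Broadway",
    "Al Pacino in 'Glengarry Glen Ross': What did the critics think?",
    "Al Pacino earns lackluster reviews for Broadway turn",
    "Theater Review: Glengarry Glen Ross Is Selling Its Stars Hard",
    "Glengarry Glen Ross; Hey, Who Killed the Klieg Lights?",
    # The two headlines below are invented, unrelated filler.
    "City Council Approves New Budget for Road Repairs",
    "Local Bakery Wins Award for Best Sourdough",
]


def tokens(headline):
    """Return the set of lowercase words in a headline, minus stopwords."""
    words = re.findall(r"[a-z]+", headline.lower())
    return {w for w in words if w not in STOPWORDS}


def group_headlines(headlines, min_shared=2):
    """Group headline indices that are linked directly or through a chain.

    Two headlines are linked directly when they share at least
    `min_shared` tokens (an assumed, deliberately simple rule).
    Groups are the connected components of that link graph,
    computed with a small union-find structure.
    """
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bags = [tokens(h) for h in headlines]
    for i, j in combinations(range(len(headlines)), 2):
        if len(bags[i] & bags[j]) >= min_shared:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(headlines)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    for group in group_headlines(HEADLINES):
        print([i + 1 for i in group])  # print 1-based headline numbers
```

With this rule, #3 and #4 share no tokens at all, yet they land in the same group because each links to #1 (and #2). The trade-off is that chaining can also glue unrelated topics together through one ambiguous headline, so on 1,000 real headlines the linking rule and threshold would likely need tuning.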

Goal:

Given 1,000 headlines, I want a system that automatically suggests that these 5 items are all probably about the same thing.

To be honest, it's been so long since I've programmed that I'm at a loss as to how to articulate this problem properly. (I don't know what I don't know, if that makes sense.)

This is a personal project and I'm writing it in Python. Thanks in advance for any help, advice, and pointers!

© Programmers or respective owner
