How does Amazon's Statistically Improbable Phrases work?

Posted by ??iu on Stack Overflow
Published on 2010-01-05T22:13:49Z Indexed on 2010/05/13 15:44 UTC

How does something like Statistically Improbable Phrases work?

According to Amazon:

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

For instance, the SIPs for Joel's first book are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules.
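Amazon does not publish the actual scoring method, so the following is only a minimal sketch of one way to read the description above: count every 2- and 3-word phrase in a book, then compare its rate in that book with its rate across the whole collection. The function names, the log-ratio score, and the `min_count` and `smoothing` parameters are all my own assumptions, not anything Amazon has documented.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return all n-word phrases in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_counts(text, sizes=(2, 3)):
    """Count every 2- and 3-word phrase in one text."""
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(text, n))
    return counts


def improbability_scores(book, corpus, min_count=2, smoothing=1.0):
    """Score each phrase in `book` by log(in-book rate / corpus-wide rate).

    `corpus` is a list of texts standing in for "all Search Inside! books".
    The log-ratio is an assumed scoring function, chosen for simplicity.
    """
    book_counts = phrase_counts(book)
    corpus_counts = Counter()
    for text in corpus:
        corpus_counts.update(phrase_counts(text))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that are rare even within this book
        book_rate = count / book_total
        # Smoothing keeps the rate non-zero if `book` is not part of `corpus`.
        corpus_rate = (corpus_counts[phrase] + smoothing) / (corpus_total + smoothing)
        scores[phrase] = math.log(book_rate / corpus_rate)
    return scores
```

Sorting the result by score in descending order gives a ranked list in the spirit of "display SIPs in order of their improbability score". A real system would presumably also need to deal with stop words, sentence boundaries, and a corpus far too large to count in memory like this.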

One complication is that SIPs are phrases of either 2 or 3 words, so candidate phrases can overlap with or contain each other.
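One possible way to handle the containment case, sketched below, is to walk the candidates from highest to lowest score and skip any phrase that contains, or is contained in, a phrase already kept. This greedy rule and the helper names `contains` and `select_phrases` are assumptions for illustration; the scores in the usage example are made up.

```python
def contains(longer, shorter):
    """True if the words of `shorter` appear contiguously inside `longer`."""
    long_words, short_words = longer.split(), shorter.split()
    span = len(short_words)
    return any(long_words[i:i + span] == short_words
               for i in range(len(long_words) - span + 1))


def select_phrases(scores, limit=10):
    """Greedily keep the top-scoring phrases, skipping nested ones."""
    chosen = []
    for phrase in sorted(scores, key=scores.get, reverse=True):
        # Skip a phrase if it nests inside, or wraps around, a kept phrase.
        if any(contains(kept, phrase) or contains(phrase, kept) for kept in chosen):
            continue
        chosen.append(phrase)
        if len(chosen) == limit:
            break
    return chosen


# Hypothetical scores, for illustration only.
example_scores = {
    "own dog food": 3.0,
    "dog food": 2.5,
    "own dog": 2.4,
    "bug count": 2.0,
}
print(select_phrases(example_scores))
```

Here "dog food" and "own dog" are dropped because both sit inside "own dog food". Note that this only covers containment. Two phrases that merely share a word at their edges (such as "own dog" and "dog food" on their own) would both be kept, and deciding what to do there would need positions in the text, not just the phrase strings.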

© Stack Overflow or respective owner
