How does Amazon's Statistically Improbable Phrases work?

Posted by ??iu on Stack Overflow See other posts from Stack Overflow or by ??iu
Published on 2010-01-05T22:13:49Z Indexed on 2010/05/13 15:44 UTC
Read the original article Hit count: 338

Filed under:

algorithm

|

platform-agnostic

|

nlp

|

natural-language

How does something like Statistically Improbable Phrases work?

According to amazon:

Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books in the Search Inside! program. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.

SIPs are not necessarily improbable within a particular book, but they are improbable relative to all books in Search Inside!. For example, most SIPs for a book on taxes are tax related. But because we display SIPs in order of their improbability score, the first SIPs will be on tax topics that this book mentions more often than other tax books. For works of fiction, SIPs tend to be distinctive word combinations that often hint at important plot elements.

For instance, for Joel's first book, the SIPs are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules

One interesting complication is that these are phrases of either 2 or 3 words. This makes things a little more interesting because these phrases can overlap with or contain each other.

© Stack Overflow or respective owner

Related posts about algorithm

Jpeg Algorithm vs BMP Algorithm?

as seen on Super User - Search for 'Super User'
I'm just wonder, what the differences are between creating a BMP file algorithm and JPG file algorithm ? If you know the others images' format algorithm, please post them. Thanks. >>> More
word disambiguation algorithm (Lesk algorithm)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hii.. Can anybody help me to find an algorithm in Java code to find synonyms of a search word based on the context and I want to implement the algorithm with WordNet database. For example, "I am running a Java program". From the context, I want to find the synonyms for the word "running", but the… >>> More
Search algorithm (with a sort algorithm already implemented)

as seen on Stack Overflow - Search for 'Stack Overflow'
Hello, Im doing a Java application and Im facing some doubts in which concerns performance. I have a PriorityQueue which guarantees me the element removed is the one with greater priority. That PriorityQueue has instances of class Event (which implements Comparable interface). Each Event is associated… >>> More
Is there any algorithm for finding LINES by PIXEL COLORS on picture?

as seen on Stack Overflow - Search for 'Stack Overflow'
So I have Image like this I want to get something like this (I hevent drawn all lines I want but I hope you can get my idea) I need algorithm for finding all straight lines on it by just reading colors of pixels. No hard math, no Haar, no Hough. Some algorithm which would be based on points… >>> More
collsion issues with quadtree [on hold]

as seen on Game Development - Search for 'Game Development'
So i implemented a Quad tree in Java for my 2D game and everything works fine except for when i run my collision detection algorithm, which checks if a object has hit another object and which side it hit.My problem is 80% of the time the collision algorithm works but sometimes the objects just go… >>> More

Related posts about platform-agnostic

Daylight saving time - do and don'ts

as seen on Stack Overflow - Search for 'Stack Overflow'
I am hoping to make this question and the answers to it the definitive guide to dealing with daylight saving time, in particular for dealing with the actual change overs. Many systems are dependent on keeping accurate time, the problem is with changes to time due to daylight savings - moving the… >>> More
Cloud Agnostic Architecture?

as seen on Stack Overflow - Search for 'Stack Overflow'
Hi, I'm doing some architecture work on a new solution which will initially run in Windows Azure. However I'd like the solution (or at least the architecture/design) to be Cloud Agnostic (to whatever extent is realistic). Has anyone done any work on this front or seen any good white papers/blog posts… >>> More
How does Amazon's Statistically Improbable Phrases work?

as seen on Stack Overflow - Search for 'Stack Overflow'
How does something like Statistically Improbable Phrases work? According to amazon: Amazon.com's Statistically Improbable Phrases, or "SIPs", are the most distinctive phrases in the text of books in the Search Inside!™ program. To identify SIPs, our computers scan the text of all books… >>> More
Determining whether a file is a duplicate

as seen on Stack Overflow - Search for 'Stack Overflow'
Is there a reliable way to determine whether or not two files are the same? For example, two files with the same size and type may or may not be the same binarilly (yeah, I know it's not really a word). I assume that comparing one or two checksums of the files will help, but I wonder: How reliable… >>> More
What is the worst 'gotcha' you've experienced?

as seen on Stack Overflow - Search for 'Stack Overflow'
I'd like to hear some of the more pernicious 'gotchas' that exist out there. Any language, system, or library is fine. >>> More