How does Amazon's Statistically Improbable Phrases work?
- by ??iu
How does something like Statistically Improbable Phrases work?
According to Amazon:
  Amazon.com's Statistically Improbable
  Phrases, or "SIPs", are the most
  distinctive phrases in the text of
  books in the Search Inside!™ program.
  To identify SIPs, our computers scan
  the text of all books in the Search
  Inside! program. If they find a phrase
  that occurs a large number of times in
  a particular book relative to all
  Search Inside! books, that phrase is a
  SIP in that book.
  
  SIPs are not necessarily improbable
  within a particular book, but they are
  improbable relative to all books in
  Search Inside!. For example, most SIPs
  for a book on taxes are tax related.
  But because we display SIPs in order
  of their improbability score, the
  first SIPs will be on tax topics that
  this book mentions more often than
  other tax books. For works of fiction,
  SIPs tend to be distinctive word
  combinations that often hint at
  important plot elements.
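
Amazon does not say how the improbability score is computed, so the following is only a rough sketch of one way to read the description above: count every 2- and 3-word phrase, then compare each phrase's relative frequency in one book against its relative frequency across all books. The log-ratio score, the add-one smoothing, the `min_count` cutoff, and the names `ngrams` and `sip_scores` are my own assumptions, not Amazon's method.

```python
import math
import re
from collections import Counter

def ngrams(text, sizes=(2, 3)):
    """Yield every 2- and 3-word phrase in text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    for n in sizes:
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def sip_scores(book, corpus, min_count=2):
    """Score phrases by how much more frequent they are in `book` than in `corpus`.

    `corpus` is a list of book texts (it may include `book` itself).
    The score is an assumed log-ratio of relative frequencies.
    """
    book_counts = Counter(ngrams(book))
    corpus_counts = Counter()
    for other in corpus:
        corpus_counts.update(ngrams(other))
    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < min_count:
            continue  # ignore phrases that barely occur in this book
        p_book = count / book_total
        # Add-one smoothing so a phrase unseen in the corpus cannot divide by zero.
        p_corpus = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = math.log(p_book / p_corpus)
    return scores
```

Sorting the result by descending score would give a ranked list similar in spirit to what Amazon displays, though a real system would presumably need something smarter than raw counts.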
For instance, the SIPs for Joel's first book are: leaky abstractions, antialiased text, own dog food, bug count, daily builds, bug database, software schedules.
One complication is that these phrases are either 2 or 3 words long, so they can overlap with or contain each other.
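
To make the containment problem concrete, here is a minimal sketch of one possible way to deal with it: walk the phrases from highest to lowest score and skip any phrase that contains, or is contained in, one already kept. The greedy strategy, the tie-break in favor of the longer phrase, and the names `contains` and `drop_nested` are assumptions of mine. It does not handle phrases that merely share a word at the edges.

```python
def contains(longer, shorter):
    """True if the word list `shorter` appears contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def drop_nested(scores):
    """Greedily keep high-scoring phrases, skipping nested duplicates.

    `scores` maps phrase -> score. Ties go to the longer phrase (an assumption).
    """
    kept = []
    ranked = sorted(scores, key=lambda p: (-scores[p], -len(p.split()), p))
    for phrase in ranked:
        words = phrase.split()
        if any(contains(k, words) or contains(words, k) for k in kept):
            continue  # a phrase nested with this one was already kept
        kept.append(words)
    return [" ".join(k) for k in kept]

candidates = {"own dog food": 5.0, "dog food": 5.0, "own dog": 4.0, "bug count": 3.0}
print(drop_nested(candidates))
```

With these made-up scores, "dog food" and "own dog" are dropped because both sit inside "own dog food".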