Ngram IDF smoothing

Posted by adi92 on Stack Overflow
Published on 2010-06-10T18:47:58Z

I am trying to use IDF scores to find interesting phrases in my pretty huge corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others.
The problem I am running into is that some 3-grams and 4-grams in my data have a super-high IDF even though their component unigrams and bigrams have a really low IDF.
For example, "you've never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
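To make the mismatch concrete, here is a small sketch using the plain definition idf = log(N / df). The corpus size and every document frequency below are made-up numbers chosen only for illustration; they are not counts from my data.

```python
import math

def idf(doc_freq, num_docs):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

NUM_DOCS = 1_000_000  # made-up corpus size

# Made-up document frequencies: the trigram is rare, its words are common.
doc_freqs = {
    "you've never tried": 3,
    "you've": 200_000,
    "never": 300_000,
    "tried": 100_000,
}

scores = {gram: idf(df, NUM_DOCS) for gram, df in doc_freqs.items()}
for gram, score in scores.items():
    print(f"{gram!r}: idf = {score:.2f}")
```

With numbers like these, the trigram's IDF dwarfs that of any of its words, even though nothing about the phrase is unusual.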
I need to come up with a function that takes the document frequencies of an n-gram and of all its component (n-k)-grams and returns a more meaningful measure of how well the phrase distinguishes its parent document from the rest.
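One possible shape for such a function, purely as a sketch: average the IDF of the sub-grams at each order and mix the orders with fixed weights. The name `interpolated_idf`, the weights, and all counts are hypothetical, and I have no evidence yet that this particular mix ranks phrases well.

```python
import math

def idf(doc_freq, num_docs):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

def interpolated_idf(doc_freqs_by_order, num_docs, weights):
    """Weighted mix of the mean IDF at each n-gram order.

    doc_freqs_by_order maps an order (1, 2, 3, ...) to the document
    frequencies of the phrase's sub-grams of that order; the highest
    order holds the phrase itself. weights maps each order to a
    non-negative weight, and the weights are assumed to sum to 1.
    """
    total = 0.0
    for order, freqs in doc_freqs_by_order.items():
        mean_idf = sum(idf(df, num_docs) for df in freqs) / len(freqs)
        total += weights[order] * mean_idf
    return total

NUM_DOCS = 1_000_000  # made-up corpus size
# Made-up counts for "you've never tried" and its sub-grams.
doc_freqs_by_order = {
    3: [3],
    2: [40_000, 25_000],
    1: [200_000, 300_000, 100_000],
}
weights = {3: 0.5, 2: 0.3, 1: 0.2}  # hypothetical, would need tuning

raw = idf(doc_freqs_by_order[3][0], NUM_DOCS)
smoothed = interpolated_idf(doc_freqs_by_order, NUM_DOCS, weights)
print(f"raw idf = {raw:.2f}, interpolated = {smoothed:.2f}")
```

The lower orders pull the score of a phrase made of common words down, while a phrase whose words are themselves rare keeps a high score.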
If I were dealing with probabilities, I would try interpolation or backoff models. However, I am not sure what assumptions or intuitions those models rely on to perform well, and therefore how well they would carry over to IDF scores.
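For comparison, a backoff-style heuristic might look like the sketch below. It only borrows the idea from backoff language models: trust the highest order whose counts are large enough, otherwise fall back to shorter sub-grams with a penalty. The function name, the `min_df` threshold, the `discount` factor, and all counts are assumptions of mine, not an established method for IDF.

```python
import math

def idf(doc_freq, num_docs):
    """Plain inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

def backoff_idf(doc_freqs_by_order, num_docs, min_df=5, discount=0.4):
    """Use the highest order whose sub-grams all occur in >= min_df documents.

    Each step down to a lower order multiplies the score by `discount`.
    Returns 0.0 if no order is attested well enough.
    """
    penalty = 1.0
    for order in sorted(doc_freqs_by_order, reverse=True):
        freqs = doc_freqs_by_order[order]
        if min(freqs) >= min_df:
            mean_idf = sum(idf(df, num_docs) for df in freqs) / len(freqs)
            return penalty * mean_idf
        penalty *= discount  # too sparse at this order, back off
    return 0.0

NUM_DOCS = 1_000_000  # made-up corpus size
# Made-up counts: the trigram appears in only 3 documents.
sparse_phrase = {3: [3], 2: [40_000, 25_000], 1: [200_000, 300_000, 100_000]}
# Made-up counts: a trigram attested in 50 documents.
attested_phrase = {3: [50], 2: [400, 900], 1: [5_000, 300_000, 2_000]}

sparse_score = backoff_idf(sparse_phrase, NUM_DOCS)
attested_score = backoff_idf(attested_phrase, NUM_DOCS)
print(f"sparse = {sparse_score:.2f}, attested = {attested_score:.2f}")
```

The obvious weakness is that backoff treats a low document frequency as unreliable evidence, whereas for IDF a low document frequency is exactly the signal being looked for, so the threshold would need care.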
Does anybody have any better ideas?

