Tracking/Counting Word Frequency

Posted by Joel Martinez on Stack Overflow
Published on 2010-05-17T20:49:33Z

I'd like to get some community consensus on a good design for storing and querying word frequency counts. I'm building an application that parses text inputs and stores how many times each word has appeared (over time). So given the following inputs:

  • "To Kill a Mocking Bird"
  • "Mocking a piano player"

It would store the following values:

Word    Count
-------------
To      1
Kill    1
A       2
Mocking 2
Bird    1
Piano   1
Player  1
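
For illustration, the counting step above might be sketched in Python roughly as follows. This is only a sketch under assumptions not stated in the question: words are split on whitespace and compared case-insensitively (which is what merging "a" and "A" in the table suggests), and the function name `count_words` is hypothetical.

```python
from collections import Counter

def count_words(texts):
    """Return case-insensitive word counts across all input strings."""
    counts = Counter()
    for text in texts:
        # Assumption: whitespace tokenization, lowercased for comparison.
        counts.update(text.lower().split())
    return counts

counts = count_words(["To Kill a Mocking Bird", "Mocking a piano player"])
print(counts["mocking"])  # 2
print(counts["bird"])     # 1
```

A `Counter` returns 0 for words it has never seen, so querying an arbitrary word does not need a special case.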

Later, I need to be able to quickly query the count for any given word.

My current plan is to simply store the words and counts in a database and rely on caching the word count values, but I suspect that I won't get enough cache hits to make this a viable solution in the long term.
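
To make the plan concrete, here is a minimal sketch of what "database plus cache" could look like. Everything specific in it is an assumption for illustration rather than part of the actual application: SQLite as the database, an in-memory connection, the table name `word_counts`, the helper names `record` and `get_count`, and `functools.lru_cache` standing in for the cache.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")  # assumption: in-memory DB for the demo
conn.execute(
    "CREATE TABLE word_counts (word TEXT PRIMARY KEY, count INTEGER NOT NULL)"
)

def record(text):
    """Increment the stored count for every word in text."""
    words = [(w,) for w in text.lower().split()]
    with conn:  # commit both statements as one transaction
        # Create a zero row for unseen words, then bump every occurrence.
        conn.executemany("INSERT OR IGNORE INTO word_counts VALUES (?, 0)", words)
        conn.executemany(
            "UPDATE word_counts SET count = count + 1 WHERE word = ?", words
        )
    get_count.cache_clear()  # crude invalidation so cached counts are not stale

@lru_cache(maxsize=10_000)
def get_count(word):
    """Look up the count for one word; 0 if it has never been recorded."""
    row = conn.execute(
        "SELECT count FROM word_counts WHERE word = ?", (word.lower(),)
    ).fetchone()
    return row[0] if row else 0

record("To Kill a Mocking Bird")
record("Mocking a piano player")
print(get_count("Mocking"))  # 2
print(get_count("zebra"))    # 0
```

In this sketch the primary key on `word` means each lookup is a single index search, so the cache only helps if the same words are queried repeatedly between writes. Clearing the whole cache on every write is the simplest correct policy, and it also shows why a low hit rate would be a concern.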

Can anyone suggest algorithms, data structures, or any other ideas that might make this solution perform well?

© Stack Overflow or respective owner