Writing a spell checker similar to "did you mean"

Posted by user888734 on Programmers See other posts from Programmers or by user888734
Published on 2014-05-10T20:16:34Z Indexed on 2014/06/03 3:40 UTC
Read the original article Hit count: 170

Filed under:
|
|
|

I'm hoping to write a spellchecker for search queries in a web application - not unlike Google's "Did you mean?" The algorithm will be loosely based on this: http://catalog.ldc.upenn.edu/LDC2006T13

In short, it generates correction candidates and scores them on how often they appear (along with adjacent words in the search query) in an enormous dataset of known n-grams - Google Web 1T - which contains well over 1 billion 5-grams.

I'm not using the Web 1T dataset, but building my n-gram sets from my own documents - about 200k docs, and I'm estimating tens or hundreds of millions of n-grams will be generated.

This kind of process is pushing the limits of my understanding of basic computing performance - can I simply load my n-grams into memory in a hashtable or dictionary when the app starts? Is the only limiting factor the amount of memory on the machine?

Or am I barking up the wrong tree? Perhaps putting all my n-grams in a graph database with some sort of tree query optimisation? Could that ever be fast enough?

© Programmers or respective owner

Related posts about c#

Related posts about algorithms