How to structure an index for type-ahead over an extremely large dataset using Lucene or similar?

Posted by Pete on Stack Overflow
Published on 2010-05-04T20:33:03Z Indexed on 2010/05/05 2:48 UTC

I have a dataset of more than 200 million records and want to build a dedicated backend to power a type-ahead solution. Lucene is of interest given its popularity and license, but I'm open to other open-source suggestions as well. I'm looking for advice, tales from the trenches, or, better still, direct instruction on how much hardware I will need and how the software should be structured. Requirements:

Must have:

  • Starts-with (prefix) matching: if I type 'st', it should match 'Stephen'.
  • Very fast results; I'd say 500 ms is the upper bound.
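
A minimal sketch of the prefix requirement above, to pin down the expected behaviour. It is not Lucene and not a proposal for 200 million records: it assumes the terms fit in memory and uses a sorted `java.util.TreeSet` as a stand-in for a real index. The class name `PrefixLookup` and its methods are hypothetical, and lookups are made case-insensitive by lowercasing, so suggestions come back in lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration only: a sorted in-memory set standing in for a real index.
final class PrefixLookup {
    private final TreeSet<String> terms = new TreeSet<>();

    // Store terms lowercased so that 'st' matches 'Stephen'.
    void add(String term) {
        terms.add(term.toLowerCase(Locale.ROOT));
    }

    // Return up to 'limit' terms that start with 'prefix', in alphabetical order.
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; all matches are contiguous after it.
        for (String t : terms.tailSet(p, true)) {
            if (!t.startsWith(p) || out.size() >= limit) {
                break;
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixLookup idx = new PrefixLookup();
        idx.add("Stephen");
        idx.add("Steve");
        idx.add("Bestseller");
        System.out.println(idx.suggest("st", 10));
    }
}
```

Because the set is sorted, all terms sharing a prefix sit next to each other, so a lookup is one seek plus a short scan. The same property is what sorted term dictionaries and tries rely on at larger scale.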

Nice to have:

  • The ability to feed relevance information into the indexing process so that, for example, more popular terms are returned ahead of others rather than in plain alphabetical order (Google style).
  • In-word substring matching, so that, for example, 'st' would match 'bestseller'.
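
The two nice-to-have items above can be sketched together, again as an in-memory illustration and not as a Lucene implementation. The assumption here is that each term carries a numeric popularity weight supplied at index time, and that in-word matching is obtained by indexing every suffix of a term, which turns a substring lookup into a prefix lookup. `WeightedInfixLookup`, `add`, and `suggest` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: suffix indexing plus weight-based ranking.
final class WeightedInfixLookup {
    // Maps each suffix of each term to the terms that contain it.
    private final TreeMap<String, Set<String>> suffixes = new TreeMap<>();
    // Popularity weight per term, supplied at index time (an assumption).
    private final Map<String, Long> weights = new HashMap<>();

    void add(String term, long weight) {
        String t = term.toLowerCase(Locale.ROOT);
        weights.put(t, weight);
        for (int i = 0; i < t.length(); i++) {
            suffixes.computeIfAbsent(t.substring(i), k -> new HashSet<>()).add(t);
        }
    }

    // Return up to 'limit' terms containing 'fragment', most popular first.
    List<String> suggest(String fragment, int limit) {
        String f = fragment.toLowerCase(Locale.ROOT);
        Set<String> hits = new HashSet<>();
        // A term contains the fragment exactly when one of its suffixes starts with it.
        for (Map.Entry<String, Set<String>> e : suffixes.tailMap(f, true).entrySet()) {
            if (!e.getKey().startsWith(f)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        List<String> ranked = new ArrayList<>(hits);
        // Highest weight first; ties broken alphabetically for stable output.
        ranked.sort(Comparator.comparingLong((String t) -> weights.get(t))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    public static void main(String[] args) {
        WeightedInfixLookup idx = new WeightedInfixLookup();
        idx.add("Stephen", 50);
        idx.add("Bestseller", 90);
        System.out.println(idx.suggest("st", 10));
    }
}
```

The trade-off is index size: a term of length n contributes n suffix entries, which is a large part of why in-word matching is harder to provide at this scale than plain prefix matching.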

Note:

  • This index will be used purely for type-ahead and does not need to serve standard search queries.
  • I am not looking for advice on how to set up the front end or AJAX, as long as the index can be queried as a service or directly via Java code.

Upvotes for any useful information that gets me closer to an enterprise-level type-ahead solution.

© Stack Overflow or respective owner
