How to structure an index for type-ahead on an extremely large dataset using Lucene or similar?

Posted by Pete on Stack Overflow
Published on 2010-05-04T20:33:03Z

Filed under: lucene | typeahead | autocomplete | index

I have a dataset of 200 million+ records and want to build a dedicated backend to power a type-ahead solution. Lucene is of interest given its popularity and license type, but I'm open to other open-source suggestions as well. I am looking for advice, tales from the trenches, or, even better, direct instruction on what I will need in terms of hardware and software structure.

Requirements:

Must have:

The ability to do prefix ("starts with") matching: if I type 'st', it should match 'Stephen'.
The ability to return results very quickly; I'd say 500 ms is an upper bound.
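
To make the prefix requirement concrete, here is a minimal in-memory sketch in plain Java. It is not Lucene code and says nothing about how Lucene stores terms; the class name `PrefixIndex` and its methods are hypothetical, chosen only for illustration. It relies on the fact that in a sorted set, all terms sharing a prefix sit next to each other, so a lookup is one seek plus a short scan. Terms are lowercased so that 'st' matches 'Stephen'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical in-memory prefix index; illustrative only, not a Lucene API. */
class PrefixIndex {
    // Sorted set of lowercased terms: terms sharing a prefix are contiguous.
    private final TreeSet<String> terms = new TreeSet<>();

    /** Adds a term, lowercased so that matching is case-insensitive. */
    void add(String term) {
        terms.add(term.toLowerCase(Locale.ROOT));
    }

    /** Returns up to {@code limit} terms starting with {@code prefix}, alphabetically. */
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // tailSet(p, true) begins at the first term >= p; stop at the first
        // term that no longer starts with p, or once the limit is reached.
        for (String t : terms.tailSet(p, true)) {
            if (!t.startsWith(p) || out.size() >= limit) {
                break;
            }
            out.add(t);
        }
        return out;
    }
}
```

With 'Stephen', 'Stanley' and 'Bestseller' added, `suggest("st", 10)` returns the two names that start with 'st' and skips 'bestseller'. This sketch ignores memory limits and persistence, which would matter at 200 million+ records.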

Nice to have:

The ability to feed relevance information into the indexing process so that, for example, more popular terms are returned ahead of others rather than in plain alphabetical order (Google style).
In-word substring matching, so that, for example, 'st' would also match 'bestseller'.
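
The two nice-to-have items can be sketched together in plain Java, under stated assumptions. The class `WeightedSuggester`, its methods, and the numeric weights are all hypothetical; this is not a Lucene API. The sketch indexes every suffix of each term, so an in-word match becomes a prefix match on some suffix, and it orders results by a popularity weight supplied at indexing time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical suggester: in-word matching via suffixes, ranked by weight. */
class WeightedSuggester {
    // Maps every lowercased suffix to the terms that end with it.
    private final TreeMap<String, TreeSet<String>> suffixToTerms = new TreeMap<>();
    // Popularity weight per term (an assumed input, e.g. a query count).
    private final Map<String, Long> weights = new HashMap<>();

    void add(String term, long weight) {
        String t = term.toLowerCase(Locale.ROOT);
        weights.put(t, weight);
        // Index each suffix so that "st" can find "bestseller" through "stseller".
        for (int i = 0; i < t.length(); i++) {
            suffixToTerms.computeIfAbsent(t.substring(i), k -> new TreeSet<>()).add(t);
        }
    }

    /** Returns up to {@code limit} terms containing {@code fragment}, heaviest first. */
    List<String> suggest(String fragment, int limit) {
        String f = fragment.toLowerCase(Locale.ROOT);
        TreeSet<String> matches = new TreeSet<>();
        // Suffixes starting with f are contiguous in the sorted map.
        for (Map.Entry<String, TreeSet<String>> e : suffixToTerms.tailMap(f, true).entrySet()) {
            if (!e.getKey().startsWith(f)) {
                break;
            }
            matches.addAll(e.getValue());
        }
        List<String> ranked = new ArrayList<>(matches);
        // Highest weight first; ties fall back to alphabetical order.
        ranked.sort(Comparator.comparingLong((String s) -> weights.get(s))
                .reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

The trade-off is size: the number of index entries grows with the total length of all terms rather than with the number of terms, so this approach would need careful sizing for a dataset of this scale.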

Note:

This index will purely be used for type ahead, and does not need to serve standard search queries.
I am not worried about getting advice on how to set up the front end or AJAX, as long as the index can be queried as a service or directly via Java code.

Upvotes for any useful information that gets me closer to an enterprise-level type-ahead solution.

© Stack Overflow or respective owner
