Full Text Search like Google

Posted by Eduardo on Stack Overflow
Published on 2009-12-30T00:35:32Z Indexed on 2010/05/22 9:50 UTC


I would like to implement full-text search in my offline (Android) application to search a user-generated list of notes.

I would like it to behave just like Google, since most people are already used to querying Google.

My initial requirements are:

  • Fast: like Google, or as fast as possible, with 100,000 documents of 200 words each.
  • Searching for two words should return only documents that contain both words, not just one, unless the OR operator is used.
  • Case insensitive (a.k.a. normalization): if a note contains the word 'Hello' and I search for 'hello', it should match.
  • Diacritic insensitive: if a note contains the word 'así', a search for 'asi' should match. In Spanish, many people either omit diacritical marks or place them incorrectly.
  • Stop word elimination: to keep the index small, meaningless words like 'and', 'the' or 'for' should not be indexed at all.
  • Dictionary substitution (a.k.a. stemming): similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.
  • Phrase search: if I have the text 'Hello world!', a search for '"world hello"' should not match it, but a search for '"hello world"' should.
  • Search all fields (in multi-field documents) when no field is specified, not just a default field.
  • Auto-completion while typing that suggests popular searches (just like Google Suggest).
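
To make a few of these requirements concrete, here is a minimal sketch in plain Java (standard library only, not Lucene) of case folding, diacritic folding, stop-word removal and AND-by-default matching. The class name `NoteIndex`, its methods and the three-word stop list are hypothetical, chosen only for illustration. Stemming, phrase search, multi-field search and auto-completion are not covered.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory inverted index; an illustration, not a Lucene replacement. */
class NoteIndex {
    // Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
    private static final Set<String> STOP_WORDS = Set.of("and", "the", "for");

    // term -> sorted ids of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Strips diacritics, lowercases, splits on non-alphanumerics, drops stop words. */
    static List<String> analyze(String text) {
        // NFD splits 'í' into 'i' plus a combining accent; \p{M} then removes the accent.
        String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
        List<String> terms = new ArrayList<>();
        for (String token : folded.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Indexes one document under the given id. */
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns ids of documents that contain every query term (AND semantics). */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else {
                result.retainAll(docs);         // later terms intersect it
            }
        }
        // A query with no usable terms (empty or only stop words) matches nothing.
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }
}
```

Because the same `analyze` step runs at index time and at query time, a note containing 'Hello, así es' is found by the query 'HELLO asi'. Note that this folding also maps 'ñ' to 'n', which may or may not be wanted for Spanish text. In Lucene the equivalent steps are usually expressed as an analyzer chain of token filters rather than hand-written code.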

How can I configure a full-text search engine to behave as much like Google as possible?

(I am mostly interested in open-source Java solutions, and in Lucene in particular.)

© Stack Overflow or respective owner
