How to index a string like "aaa.bbb.ddd-fff" in Lucene?

Posted by user46703 on Stack Overflow
Published on 2010-05-27T21:59:31Z

Hi,

I have to index a lot of documents that contain reference numbers like "aaa.bbb.ddd-fff". The structure can change, but it is always some arbitrary numbers or characters joined by ".", "/", "-", "_" or some other delimiter.

The users want to be able to search for any of the substrings, like "aaa" or "ddd", and also for combinations like "aaa.bbb" or "ddd-fff". The best I have been able to come up with is my own token filter, modeled after the synonym filter in "Lucene in Action", which emits multiple terms for each input. In my case it returns "aaa.bbb", "bbb.ddd", "bbb.ddd-fff" and all other combinations of the substrings. This works pretty well, but when I index large documents (100 MB) that contain lots of such strings I tend to get out-of-memory exceptions, because my filter returns multiple terms for each input string.
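To make the expansion concrete, here is a minimal sketch in plain Java of the kind of term generation described above. It is not the actual filter and does not use the Lucene API; the class name `ReferenceTerms` and the method `expand` are made up for illustration, and it assumes that any character that is not a letter or digit counts as a delimiter. It returns every contiguous run of segments with the original delimiters kept.

```java
import java.util.ArrayList;
import java.util.List;

public class ReferenceTerms {

    /**
     * Hypothetical helper: returns every contiguous run of segments in ref,
     * keeping the original delimiters between them.
     */
    public static List<String> expand(String ref) {
        // Record the [start, end) offsets of each alphanumeric segment.
        // Assumption: anything that is not a letter or digit is a delimiter.
        List<int[]> segments = new ArrayList<>();
        int i = 0;
        while (i < ref.length()) {
            if (Character.isLetterOrDigit(ref.charAt(i))) {
                int start = i;
                while (i < ref.length() && Character.isLetterOrDigit(ref.charAt(i))) {
                    i++;
                }
                segments.add(new int[] {start, i});
            } else {
                i++;
            }
        }

        // Emit one term for every (first segment, last segment) pair.
        List<String> terms = new ArrayList<>();
        for (int a = 0; a < segments.size(); a++) {
            for (int b = a; b < segments.size(); b++) {
                terms.add(ref.substring(segments.get(a)[0], segments.get(b)[1]));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // The 4 segments of this reference produce 10 terms.
        System.out.println(expand("aaa.bbb.ddd-fff"));
    }
}
```

A reference with n segments produces n(n+1)/2 terms under this scheme, so the number of terms grows quadratically with the number of segments in each reference.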

Is there a better way to index these strings?

© Stack Overflow or respective owner
