I am trying to develop a complex textual search engine.
I have thousands of textual pages from many books.
I need to search pages that contain specified complex logical criterias.
These criterias can contain virtually any compination of the following:
A: Full words.
B: Word roots (semilar to stems; i.e. all words with certain key letters).
C: Word templates (in some languages are filled in certain templates to form various part of speech such as adjactives, past/present verbs...).
D: Logical connectives: AND/OR/XOR/NOT/IF/IFF and parentheses to state priorities.
Now, would it be faster to have the pages' full text in database (not indexed) and search though them all using SQL and Regular Expressions ?
Or would it be better to construct indexes of word/root/template-page-location tuples.
Hence, we can boost searching for individual words/roots/templates.
However, it gets tricky as we interdouce logical connectives into our query.
I thought of doing the following steps in such cases:
1: Seperately search for each individual words/roots/templates in the specified query.
2: On priority bases, we merge two result lists (from step 1) at a time depedning on the logical connective
For example, if we are searching for "he AND (is OR was)":
1: We shall search for "he", "is" and "was" seperately and get result lists for each word.
2: Merge the result lists of "is" and "was" using the merging function OR-MERGE
3: Merge the merged result list from the OR-MERGE function with the one of "he" using the merging function AND-MERGE
The result of step 3 is then returned as the result of the specified query.
What do you think gurues ? Which is faster ? Any better ideas ?
Thank you all in advance.