I have a working Lucene index supporting a suggestion service. When a user types into a search box, the service queries the index on the SUGGESTION_FIELD. Each entry in SUGGESTION_FIELD can be in one of many supported languages, and each is indexed using an appropriate language-specific analyzer. To record which analyzer was used, there is a second field per entry which stores the LOCALE. So at query time I can do something like the code below to run a language-specific query using the appropriate analyzer:
QueryParser parser = new QueryParser(Version.LUCENE_33, SUGGESTION_FIELD, getLangaugeAnalyzer(locale));
return searcher.search(parser.parse("SUGGESTION_FIELD:" + queryString + " AND LOCALE:"
+ locale), 100);
This works. But now the client wants to be able to search using multiple languages at once.
My question: what would be the fastest querying solution, bearing in mind that a suggestion service needs to be very fast?
Sol. #1. The simplest solution would seem to be to run the query multiple times, once for each locale, applying the corresponding language analyzer each time, and then merge the results from each query in some sensible fashion.
Sol. #2. Alternatively, I could re-index using a separate field for each locale, such as:
SUGGESTION_FIELD_en, SUGGESTION_FIELD_fr, SUGGESTION_FIELD_es, etc.
using a different analyzer for each field (via PerFieldAnalyzerWrapper), and then query with a more complex query string such as:
"SUGGESTION_FIELD_en:" + queryString + " AND SUGGESTION_FIELD_fr:" + queryString + " AND SUGGESTION_FIELD_es:" + queryString
Please help if you think you can 🙂
Your query is going to be something like this: (sugField:queryString1 AND locale:loc1) OR (sugField:queryString2 AND locale:loc2) OR … This is a top-level BooleanQuery with subordinate BooleanQueries added with occurs=SHOULD, where each subordinate query has its terms added with occurs=MUST. Here queryString1, queryString2, etc. are the outputs of the different language analyzers for the same input: the string the user entered.
Each subordinate query involves mandatory terms (from your query string) that are rare in the index, and Lucene knows this at the outset (it knows the document count for each Term in the index), so it will first constrain the result by the queryString terms and then intersect that with the locale terms. This will be very efficient no matter how large your index is.
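To illustrate why this intersection is cheap, here is a simplified model in plain Java. It is not Lucene's actual scorer code: the class name PostingIntersection and the sorted int[] doc-id lists are assumptions made for the sketch. The rare list (documents containing the analyzed token) drives the loop, and the long list (documents in one locale) is only probed with a forward-moving binary search, so the work grows with the size of the rare list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of intersecting two sorted doc-id lists (not Lucene internals).
class PostingIntersection {

    /** Walks the shorter (rarer) list and probes the longer one by binary search. */
    static List<Integer> intersect(int[] a, int[] b) {
        int[] rare = a.length <= b.length ? a : b;
        int[] common = (rare == a) ? b : a;

        List<Integer> hits = new ArrayList<Integer>();
        int from = 0; // entries of 'common' before this index are already ruled out
        for (int doc : rare) {
            int pos = Arrays.binarySearch(common, from, common.length, doc);
            if (pos >= 0) {
                hits.add(doc);
                from = pos + 1;
            } else {
                from = -pos - 1; // insertion point: first entry greater than doc
            }
            if (from >= common.length) {
                break; // the longer list is exhausted, no further matches possible
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] suggestionTermDocs = {7, 42, 1001};                            // rare token
        int[] localeTermDocs = {1, 2, 3, 7, 8, 9, 42, 43, 500, 1000, 1002};  // one locale
        System.out.println(intersect(suggestionTermDocs, localeTermDocs));   // prints [7, 42]
    }
}
```

With three documents on the rare side, only three probes into the locale list are needed, however long that list becomes.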
As for the different analyzers, I suggest you don’t use the QueryParser, but create the entire query programmatically. This is good general advice whenever the query is not typed in by hand, and in your case it is the only way to gain control over the analysis step. Run your query string through each of the language-specific analyzers and add their output tokens as TermQueries to the subordinate BooleanQueries.
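A minimal sketch of that approach against the Lucene 3.x API follows. Several things here are assumptions rather than details from the question: the class name MultiLocaleQueryBuilder, the literal field names "SUGGESTION_FIELD" and "LOCALE" (substitute the values of your own constants), and that LOCALE was indexed as a single un-analyzed term so that an exact TermQuery matches it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper; the field names below are assumed, not taken from the question.
class MultiLocaleQueryBuilder {
    static final String SUGGESTION_FIELD = "SUGGESTION_FIELD";
    static final String LOCALE = "LOCALE";

    /** Builds (tokens AND locale1) OR (tokens AND locale2) OR ... without QueryParser. */
    static Query build(String userInput, Map<String, Analyzer> analyzersByLocale)
            throws IOException {
        BooleanQuery top = new BooleanQuery();
        for (Map.Entry<String, Analyzer> entry : analyzersByLocale.entrySet()) {
            BooleanQuery perLocale = new BooleanQuery();
            int tokenCount = 0;

            // Analyze the raw input with this locale's analyzer.
            TokenStream ts = entry.getValue()
                    .tokenStream(SUGGESTION_FIELD, new StringReader(userInput));
            try {
                CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Every analyzed token is mandatory within its locale.
                    perLocale.add(new TermQuery(
                            new Term(SUGGESTION_FIELD, termAtt.toString())), Occur.MUST);
                    tokenCount++;
                }
                ts.end();
            } finally {
                ts.close();
            }

            if (tokenCount == 0) {
                // Skip this locale: a locale-only clause would match every document in it.
                continue;
            }
            perLocale.add(new TermQuery(new Term(LOCALE, entry.getKey())), Occur.MUST);
            top.add(perLocale, Occur.SHOULD); // any one locale may match
        }
        return top;
    }
}
```

The result can be passed straight to searcher.search(query, 100). The map would hold one entry per locale the client selected, each paired with the same analyzer that was used at index time for that locale.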