For a script I need to compare ad titles against a Lucene index.
This index contains a couple of keywords and the action to take if the ad matches.
For example:
(keyword,action,new_category,optional)
"red volvo","recategorize","cars","red"
The idea is that I need to query the whole ad title against the keyword field. Both (query and index) are analyzed with my own analyzer which has stemming, lowercasing, etc.
The problem I’m having is with partial matches. For example:
“I am selling a red horse” is matching “red volvo”.
If it were the other way around (the ads were indexed and I would need to query by the keyword) I could do:
q=+red +volvo
But that’s not an option due to the huge amount of ads I need to process.
So, the concrete question: is there a way to force all tokens in a field to be matched against the query?
I could use a KeywordAnalyzer so the whole ‘red volvo’ is seen as one token, but I cannot analyze the whole ad title as a single keyword, because it won’t match anything.
Given that you do want to catch the phrase “red volvo” exactly, but never just “red” or “volvo”, I think you are on the right track with indexing it using the keyword analyzer. But you want to search with a query that is longer than the field you're searching, which is sort of the reverse of the typical use case.
I hesitate to recommend it, but I think the right way to go about this query might be to use a different analyzer to query than the one you use to create the index.
If the phrases indexed are of a predictable size, say 2-5 words, then using a ShingleFilter could produce the terms you need from a long query to search it as a Keyword.
Something like this: a WhitespaceTokenizer followed by a ShingleFilter with a maximum shingle size of 5.
This will split only on whitespace and then produce search terms of 1 to 5 tokens in length, so the example “I am selling a red horse” will produce terms like “I”, “am”, “I am”, “red horse”, “I am selling”, “am selling a red horse”, etc.
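To make the shingling idea concrete, here is a minimal plain-Java sketch of what the query side would produce. It does not use the Lucene API; it only mimics the described behaviour (split on whitespace, emit every run of 1 to 5 consecutive tokens joined by a single space) and then looks each shingle up in a set of keywords. The class and method names (ShingleMatcher, shingles, matchingKeywords) are made up for illustration, and the keyword set stands in for the keyword-analyzed index field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: a stand-in for a whitespace tokenizer plus shingle filter.
class ShingleMatcher {

    // Emit every run of minSize..maxSize consecutive whitespace-separated tokens,
    // joined by a single space.
    static List<String> shingles(String text, int minSize, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= tokens.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(tokens[start + size - 1]);
                if (size >= minSize) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    // Return the shingles of the title that are exactly equal to an indexed keyword.
    static List<String> matchingKeywords(String title, Set<String> keywords, int maxSize) {
        List<String> hits = new ArrayList<>();
        for (String shingle : shingles(title, 1, maxSize)) {
            if (keywords.contains(shingle)) {
                hits.add(shingle);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("red volvo"));
        System.out.println(matchingKeywords("I am selling a red horse", keywords, 5));
        System.out.println(matchingKeywords("I am selling a red volvo", keywords, 5));
    }
}
```

Because each keyword is compared as a whole against each shingle, “red horse” can no longer match “red volvo”; only a title that actually contains the two words next to each other does.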
I think a whitespace tokenizer is probably the best choice for making this work with keywords, but if the text contains whitespace characters other than spaces, or more than one space in a row, you may run into problems.
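One possible way to sidestep that problem, offered here as an assumption rather than something tested against Lucene, is to normalize whitespace in each keyword before it is indexed, so the stored keyword uses the same single-space separator that the shingles are joined with. A minimal plain-Java sketch (KeywordNormalizer and normalize are hypothetical names):

```java
// Hypothetical helper: collapse whitespace so "red \t volvo " is stored as "red volvo".
class KeywordNormalizer {

    // Trim the ends and replace every run of whitespace with one space.
    // Note: \s in Java covers space, tab, newline, etc., but not the non-breaking space.
    static String normalize(String keyword) {
        return keyword.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("red \t  volvo ") + "]");
    }
}
```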