I am editing a Lucene.NET implementation (2.3.2) at work to include stemming and automatic wildcarding (appending * to the ends of words).
I have found that exact words with wildcarding don’t work (so stack* works for stackoverflow, but stackoverflow* does not get a hit), and was wondering what causes this and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make the query always a prefix query so I don’t have to do any nasty appending of “*” to queries, so we will see if anything becomes clear then.
Edit: Only words that get stemmed fail when wildcarded. For example, Silicate* doesn’t work, but silic* does.
The reason it doesn’t work is that you stem the content, which changes the Term that gets stored in the index.
For example consider the word “valve”. The snowball analyzer will stem it down to “valv”.
So at search time, since you stem the input query, both “valve” and “valves” will be stemmed down to “valv”. A TermQuery using the stemmed Term “valv” will yield a match on both “valve” and “valves” occurrences.
But now, since the Term stored in the index is “valv”, a query for “valve*” will not match anything. That is because the QueryParser does not run the Analyzer on wildcard queries.
There is the AnalyzingQueryParser that can handle some of these cases, but I don’t think it was in the 2.3.x versions of Lucene. Anyway, it’s not a universal fit; the documentation says:
The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.
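To make the mismatch concrete, here is a minimal sketch in plain Python that imitates the behaviour described above. It does not use Lucene at all: toy_stem is a hypothetical stand-in for a real stemmer (it only strips a trailing "s" and then a trailing "e"), and build_index, term_query and prefix_query are made-up helper names. The only assumption carried over from the answer is that term queries are analyzed (stemmed) while prefix queries are only lowercased and otherwise compared raw against the stored terms.

```python
def toy_stem(word):
    """Hypothetical stemmer: lowercase, then strip a trailing 's' and a trailing 'e'."""
    word = word.lower()
    if word.endswith("s"):
        word = word[:-1]
    if word.endswith("e"):
        word = word[:-1]
    return word


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


def term_query(index, word):
    """Term query: the input goes through the same stemmer as the content."""
    return index.get(toy_stem(word), set())


def prefix_query(index, pattern):
    """Prefix query: the input is lowercased but NOT stemmed (assumed behaviour)."""
    prefix = pattern.rstrip("*").lower()
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


docs = {1: "replace the valve", 2: "two valves leak"}
index = build_index(docs)

print(term_query(index, "valves"))    # both documents: the query is stemmed to "valv"
print(prefix_query(index, "valv*"))   # both documents: "valv" is a prefix of the stored term
print(prefix_query(index, "valve*"))  # nothing: no stored term starts with "valve"
```

In this sketch the index only ever contains "valv", so any prefix longer than the stem cannot match, which is the same pattern as stackoverflow* and Silicate* in the question.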