I am using Lucene 3.6.1. I have a BooleanQuery in which some clauses are marked Occur.MUST_NOT. When I extract terms from this query, it also extracts the terms that must not occur. This happens because of the following code in BooleanQuery.java:
    @Override
    public void extractTerms(Set<Term> terms) {
        for (BooleanClause clause : clauses) {
            clause.getQuery().extractTerms(terms);
        }
    }
I am using these terms to present the user with a set of terms that can be added to or removed from the query. If the user has explicitly specified that some term or phrase is not desired (e.g., by adding -"foo bar" to a query), I don’t want to show those terms to the user. What might make more sense is code like this:
    @Override
    public void extractTerms(Set<Term> terms) {
        for (BooleanClause clause : clauses) {
            if (!clause.isProhibited()) {
                clause.getQuery().extractTerms(terms);
            }
        }
    }
What is the design rationale for the existing implementation? When does it make sense? What’s the best way to get around this problem, assuming I don’t want negated terms, but don’t know where in the query tree they occur?
Gene: maybe you can open a LUCENE Jira ticket for this?
I actually think extractTerms should do as you suggest. For example, if I make a simple highlighter that uses this method (which I’ve done before), I don’t want the negative portions either. I’m guessing that, in general, this is the expected behavior for most uses of this method.
At the very least, it’s currently inconsistent: SpanNotQuery, for example, also has a “negative” portion, but it excludes that portion from extractTerms.
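In the meantime, one way to work around this without patching Lucene might be to walk the query tree yourself and skip prohibited clauses. The following is only a minimal sketch against the 3.6 API: the class name PositiveTerms and its extract method are hypothetical helpers, not part of Lucene. It only descends into BooleanQuery, so negated clauses nested inside other wrapper queries would not be filtered, and multi-term queries (wildcard, prefix, and so on) should be rewritten first, since extractTerms may throw UnsupportedOperationException on them.

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Hypothetical helper (not part of Lucene): collects terms from a query
// while ignoring any BooleanQuery clause marked Occur.MUST_NOT.
public final class PositiveTerms {

    private PositiveTerms() {
    }

    public static void extract(Query query, Set<Term> terms) {
        if (query instanceof BooleanQuery) {
            // Recurse so that prohibited clauses are skipped at any depth
            // of nested BooleanQuery instances.
            for (BooleanClause clause : ((BooleanQuery) query).clauses()) {
                if (!clause.isProhibited()) {
                    extract(clause.getQuery(), terms);
                }
            }
        } else {
            // Assumption: any other query type is handled by its own
            // extractTerms implementation.
            query.extractTerms(terms);
        }
    }

    public static void main(String[] args) {
        // Equivalent to the query: +body:foo -body:bar
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("body", "foo")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("body", "bar")), BooleanClause.Occur.MUST_NOT);

        Set<Term> terms = new HashSet<Term>();
        extract(query, terms);
        System.out.println(terms);
    }
}
```

Because the helper checks isProhibited() at every BooleanQuery level, you don't need to know in advance where in the tree the negated terms occur.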