Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include

Asked: May 27, 2026 (2026-05-27T02:57:16+00:00)

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.

I’m running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, …

In total, there are 36 combinations that I want to try.

How can I do this easily and gracefully?

I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:

public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String field, final Reader reader) {
        // Filters wrap one another from the inside out, starting at the tokenizer.
        return new NameFilter(
            new CaseNumberFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader)
                        )
                    ), StopAnalyzer.ENGLISH_STOP_WORDS)
            )
        );
    }
}

What I’d like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don’t want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.

Any ideas on how to do this gracefully?

EDIT: By “gracefully”, I mean that I can easily create a new Analyzer in some sort of loop:

analyzer = new MyAnalyzer(doStemming, doStopping, ...)

where doStemming and doStopping change with each loop iteration.

1 Answer

Editorial Team · Answer 1 · 2026-05-27T02:57:17+00:00

Add some boolean fields to the custom Analyzer class that can be set and unset on the fly. Then, in the tokenStream() function, use those fields to decide which filters to apply.

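// Package names below assume Lucene 2.9/3.0, which matches the
// StopFilter(boolean, TokenStream, Set) constructor used here;
// later versions moved these classes to other packages.
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class MyAnalyzer extends Analyzer {

    public static final String[] STOP_WORDS = ...;

    private final Set<?> customStopSet;

    // Each flag switches one optional filter on or off.
    private boolean doStemming = false;
    private boolean doStopping = false;

    public MyAnalyzer() {
        super();
        customStopSet = StopFilter.makeStopSet(STOP_WORDS);
    }

    public void setDoStemming(boolean val) {
        this.doStemming = val;
    }

    public void setDoStopping(boolean val) {
        this.doStopping = val;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {

        // First, tokenize and convert to lower case
        TokenStream out = new LowerCaseTokenizer(reader);

        // Wrap the stream in each optional filter only if its flag is set
        if (this.doStopping) {
            out = new StopFilter(true, out, customStopSet);
        }

        if (this.doStemming) {
            out = new PorterStemFilter(out);
        }

        return out;
    }
}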

There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can’t have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
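The manual camelCase splitting mentioned above could look roughly like the following. This is only a sketch: the class name CamelCaseSplitter, the method split, and the regular expression are illustrative assumptions, not the code actually used in the Indexer. It inserts a space at each lower-to-upper boundary and before the last capital of an acronym run, and it does not depend on Lucene.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or of the original Indexer.
class CamelCaseSplitter {

    // Zero-width boundary: a lowercase letter or digit followed by an uppercase
    // letter, or an uppercase letter followed by an uppercase+lowercase pair
    // (so "XMLFile" splits into "XML" and "File").
    private static final Pattern BOUNDARY =
            Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Returns the text with a space inserted at every camelCase boundary.
    static String split(String text) {
        return BOUNDARY.matcher(text).replaceAll(" ");
    }
}
```

Running the text through split before handing it to the analyzer preserves the case information that LowerCaseTokenizer would otherwise destroy.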

(I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
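With everything driven by flags, the experiment loop only needs to enumerate the flag settings. A minimal sketch, assuming every option is a plain on/off switch: the class FlagCombinations and its method all are hypothetical names, and the bitmask approach yields 2^n combinations for n flags. Since 36 is not a power of two, the 36 combinations in the question must include at least one option with more than two settings, which this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the experiment loop; not part of Lucene.
class FlagCombinations {

    // Returns every on/off assignment for flagCount boolean flags.
    // Bit i of the counter decides the value of flag i.
    static List<boolean[]> all(int flagCount) {
        List<boolean[]> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << flagCount); mask++) {
            boolean[] flags = new boolean[flagCount];
            for (int bit = 0; bit < flagCount; bit++) {
                flags[bit] = (mask & (1 << bit)) != 0;
            }
            result.add(flags);
        }
        return result;
    }
}
```

Each boolean[] from FlagCombinations.all(2) could then be passed to setDoStemming and setDoStopping on the analyzer before indexing and evaluating that configuration.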
