The Archive Base

Editorial Team
Asked: May 27, 2026


Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.

I’m running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, …

In total, there are 36 combinations that I want to try.

How can I do this easily and gracefully?

I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:

public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String field, final Reader reader) {
        // Filters wrap one another from the inside out, starting at the tokenizer.
        return new NameFilter(
            new CaseNumberFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader)
                        )
                    ),
                    StopAnalyzer.ENGLISH_STOP_WORDS
                )
            )
        );
    }
}

What I’d like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don’t want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.

Any ideas on how to do this gracefully?

EDIT: By “gracefully”, I mean that I can easily create a new Analyzer in some sort of loop:

analyzer = new MyAnalyzer(doStemming, doStopping, ...)

where doStemming and doStopping change with each loop iteration.

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 2:57 am

    Add instance fields to the custom Analyzer class that can be set and unset on the fly. Then, in the tokenStream() method, use those fields to decide which filters to apply.

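    // Imports assume the Lucene 2.9/3.x package layout.
    import java.io.Reader;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class MyAnalyzer extends Analyzer {

        public static final String[] STOP_WORDS = ...;

        private final Set customStopSet;

        // Each flag switches one optional filter on or off.
        private boolean doStemming = false;
        private boolean doStopping = false;

        public MyAnalyzer() {
            super();
            customStopSet = StopFilter.makeStopSet(STOP_WORDS);
        }

        public void setDoStemming(boolean val) {
            this.doStemming = val;
        }

        public void setDoStopping(boolean val) {
            this.doStopping = val;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {

            // Tokenize at non-letters and convert to lower case in one step
            TokenStream out = new LowerCaseTokenizer(reader);

            // Wrap the stream in each optional filter only if its flag is set
            if (this.doStopping) {
                out = new StopFilter(true, out, customStopSet);
            }

            if (this.doStemming) {
                out = new PorterStemFilter(out);
            }

            return out;
        }
    }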
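
To drive the setters above from a loop, one option is to enumerate every on/off assignment with a bitmask. The sketch below is plain Java with no Lucene dependency; the class name FlagCombinations and the method allCombinations are hypothetical names made up for illustration. It only covers boolean flags (n flags give 2^n combinations), so an experiment with 36 combinations presumably has at least one option with more than two values, and that option would need its own nested loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists every on/off assignment of n flags.
class FlagCombinations {

    /** Returns 2^n rows; row[i] is the value of flag i in that combination. */
    public static List<boolean[]> allCombinations(int n) {
        List<boolean[]> combos = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] flags = new boolean[n];
            for (int bit = 0; bit < n; bit++) {
                // Flag i is on when bit i of the mask is set
                flags[bit] = (mask & (1 << bit)) != 0;
            }
            combos.add(flags);
        }
        return combos;
    }

    public static void main(String[] args) {
        for (boolean[] flags : allCombinations(2)) {
            boolean doStemming = flags[0];
            boolean doStopping = flags[1];
            // In the real experiment, this is where the analyzer would be configured,
            // e.g. analyzer.setDoStemming(doStemming); analyzer.setDoStopping(doStopping);
            // followed by re-indexing with that analyzer.
            System.out.println("doStemming=" + doStemming + ", doStopping=" + doStopping);
        }
    }
}
```

Because the flags are plain fields, a single MyAnalyzer instance can be reconfigured between runs, although building a fresh index for each combination is probably the safer way to keep the runs independent.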
    

    There is one gotcha: LowerCaseTokenizer takes the reader as input and returns a TokenStream. This is fine for the filters that follow (StopFilter, PorterStemFilter), because they take a TokenStream as input and return one as output, so they chain together nicely. However, it also means that no TokenStream-based filter can run before the LowerCaseTokenizer, because at that point there is only a Reader. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.

    (I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
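
The manual splitting step is not shown above, so here is a minimal sketch of what it could look like, assuming the text is available as a String before it reaches the analyzer. The class name CamelCaseSplitter, the method splitCamelCase, and the boundary regex are all illustrative assumptions, not the code actually used in the Indexer.

```java
// Hypothetical pre-processing helper, not part of Lucene.
class CamelCaseSplitter {

    // Zero-width split points:
    //  1. between a lower-case letter or digit and an upper-case letter (tokenStream)
    //  2. between an upper-case run and a capitalized word (XMLParser)
    private static final String BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Inserts a space at every camelCase boundary and leaves case untouched. */
    public static String splitCamelCase(String text) {
        return String.join(" ", text.split(BOUNDARY));
    }

    public static void main(String[] args) {
        System.out.println(splitCamelCase("tokenStream"));
        System.out.println(splitCamelCase("XMLParser"));
    }
}
```

The split has to happen while the original capitalization is still present, which is why it cannot be moved behind LowerCaseTokenizer. A boolean flag on the caller can decide whether splitCamelCase is applied at all, in the same way the analyzer flags work.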

    Is there a better answer?
