The Archive Base Latest Questions

Editorial Team
Asked: May 25, 2026

I need to use a tokenizer that splits words on whitespace but that doesn’t

I need a tokenizer that splits words on whitespace, but does not split when the whitespace is within double parentheses. Here is an example:

My input-> term1 term2 term3 ((term4 term5)) term6  

should produce this list of tokens:

term1, term2, term3, ((term4 term5)), term6.  

I think I can obtain this behaviour by extending Lucene's WhitespaceTokenizer. How can I write this extension?
Are there other solutions?

Thanks in advance.

1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 6:53 pm

    I haven't tried extending the Tokenizer, but here is a solution (a nice one, I think) that uses a regular expression:

    \w+|\(\([\w\s]*\)\)
    

    It is paired with a method that collects every match of the regular expression in a string and returns the matches as an array. Code example:

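    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class Regex_ComandLine {

        public static void main(String[] args) {
            String input = "term1 term2 term3 ((term4 term5)) term6";    // your input
            String[] parsedInput = splitByMatchedGroups(input, "\\w+|\\(\\([\\w\\s]*\\)\\)");

            // Print one token per line.
            for (String arg : parsedInput) {
                System.out.println(arg);
            }
        }

        // Collects every match of patternString in string, in order of appearance.
        static String[] splitByMatchedGroups(String string, String patternString) {
            List<String> matchList = new ArrayList<>();
            Matcher regexMatcher = Pattern.compile(patternString).matcher(string);

            // find() advances to the next match; group() returns the matched text.
            while (regexMatcher.find()) {
                matchList.add(regexMatcher.group());
            }

            return matchList.toArray(new String[0]);
        }
    }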

    The output:

    term1
    term2
    term3
    ((term4 term5))
    term6
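One caveat with the pattern above: `\w+` matches only letters, digits, and underscores, so a term such as `foo-bar` would come out as two tokens. If the goal is to split strictly on whitespace, a variant along the following lines may be closer. This is only a sketch: the class name `WhitespaceAwareTokens`, the method `tokenize`, and the pattern `\(\(.*?\)\)|\S+` are illustrative assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps "((...))" groups whole, otherwise splits on whitespace.
public class WhitespaceAwareTokens {

    // First alternative: a "((...))" group, matched lazily up to the first "))".
    // Second alternative: any run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\\(\\(.*?\\)\\)|\\S+");

    static String[] tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Punctuation inside a term no longer causes a split.
        System.out.println(Arrays.toString(tokenize("foo-bar ((a b)) baz.qux")));
        // prints [foo-bar, ((a b)), baz.qux]
    }
}
```

Note that an unclosed `((` simply falls through to the second alternative and is split on whitespace like any other text, and a group glued to a preceding term (for example `x((a b))`) is not kept whole by this sketch.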
    

    Hope this helps.

    Please note that passing the same pattern to the usual split():

    String[] parsedInput = input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");
    

    will not give you what you want, because split() treats every match of the pattern as a delimiter and returns only the text between the matches (here, the whitespace).
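To make the difference between split() and find() concrete, here is a small side-by-side sketch using the same pattern and input as above. The class name `SplitVersusFind` and the method names `viaSplit` and `viaFind` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitVersusFind {

    // Same pattern as in the answer: a word, or a "((...))" group of words and spaces.
    static final String TOKEN_PATTERN = "\\w+|\\(\\([\\w\\s]*\\)\\)";

    // split() uses each match as a delimiter and keeps only the text between matches.
    static String[] viaSplit(String input) {
        return input.split(TOKEN_PATTERN);
    }

    // find() walks the input and collects the matches themselves.
    static String[] viaFind(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(TOKEN_PATTERN).matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        // An empty leading string followed by the four single spaces between the terms.
        System.out.println(Arrays.toString(viaSplit(input)));
        // The five tokens: term1, term2, term3, ((term4 term5)), term6.
        System.out.println(Arrays.toString(viaFind(input)));
    }
}
```

With split(), the array holds an empty string (the text before `term1`) and then the single spaces that separated the terms; the trailing empty string after `term6` is dropped by split() itself.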
