I need to use a tokenizer that splits words on whitespace but that doesn’t

Asked: May 25, 2026

I need to use a tokenizer that splits words on whitespace but that doesn't split if the whitespace is within double parentheses. Here is an example:

My input: term1 term2 term3 ((term4 term5)) term6

should produce this list of tokens:

term1, term2, term3, ((term4 term5)), term6.

I think I can get this behaviour by extending Lucene's WhitespaceTokenizer. How can I write this extension?
Are there any other solutions?

Thanks in advance.

1 Answer

Editorial Team · Answer 1 · 2026-05-25T18:53:55+00:00

I haven't tried extending the Tokenizer, but here is a neat (I think) solution using a regular expression:

\w+|\(\([\w\s]*\)\)

And a method that collects every substring matched by the regex and returns them as an array. Code example:

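import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Regex_ComandLine {

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";    // your input
        // Either a run of word characters, or a ((...)) group of word characters and whitespace
        String[] parsedInput = splitByMatchedGroups(input, "\\w+|\\(\\([\\w\\s]*\\)\\)");

        for (String arg : parsedInput) {
            System.out.println(arg);
        }
    }

    // Returns every substring of 'string' matched by 'patternString', in order of appearance
    static String[] splitByMatchedGroups(String string, String patternString) {
        List<String> matchList = new ArrayList<>();
        Matcher regexMatcher = Pattern.compile(patternString).matcher(string);

        // find() advances to the next match; group() is the full text of that match
        while (regexMatcher.find()) {
            matchList.add(regexMatcher.group());
        }

        return matchList.toArray(new String[0]);
    }

}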

The output:

term1
term2
term3
((term4 term5))
term6

Hope this helps.
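One caveat about the pattern above: \w+ only matches letters, digits and underscores, so it is not quite "split on whitespace". A token like term-1 would come out as term and 1, and a group such as ((term4, term5)) would be broken up because of the comma. If the terms can contain other characters, a variant along these lines may be closer to what the question asks for. This is only a sketch: the class name WhitespaceTokens and the method tokenize are made up for illustration, and it assumes a ((...)) group always starts a token and is never nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class WhitespaceTokens {

    // First alternative: a "((...))" group; the lazy ".*?" stops at the first "))".
    // Second alternative: any run of non-whitespace characters.
    // The group alternative is listed first so it wins when a token starts with "((".
    private static final Pattern TOKEN = Pattern.compile("\\(\\(.*?\\)\\)|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [term-1, ((term4, term5)), a.b]
        System.out.println(tokenize("term-1 ((term4, term5)) a.b"));
    }
}
```

On the input from the question it gives the same five tokens as the original pattern. An unclosed (( simply falls back to the \S+ branch and is split on whitespace like any other text.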

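Please note that the same pattern passed to the usual split():

String[] parsedInput = input.split("\\w+|\\(\\([\\w\\s]*\\)\\)");

will not give you the tokens, because split() treats the pattern as the delimiter: it throws away everything the pattern matches and returns only the text in between (here, just the spaces).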
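To make the difference concrete, here is a small side-by-side sketch. The class and method names (SplitVersusFind, viaSplit, viaFind) are invented for this example, and viaFind assumes Java 9 or later because it uses Matcher.results(); on older versions the while (find()) loop shown earlier does the same job.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original answer.
class SplitVersusFind {

    static final String TOKEN_REGEX = "\\w+|\\(\\([\\w\\s]*\\)\\)";

    // split() uses the regex as a delimiter: matches are removed,
    // and the pieces between them are returned.
    static String[] viaSplit(String input) {
        return input.split(TOKEN_REGEX);
    }

    // Matcher.results() (Java 9+) streams the matches themselves.
    static List<String> viaFind(String input) {
        return Pattern.compile(TOKEN_REGEX)
                .matcher(input)
                .results()
                .map(r -> r.group())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        // Five elements: an empty string followed by four single spaces
        System.out.println(Arrays.toString(viaSplit(input)));
        // prints [term1, term2, term3, ((term4 term5)), term6]
        System.out.println(viaFind(input));
    }
}
```

So with split() the array is not empty, but it holds the separators rather than the terms, which is why the matching approach is the one to use here.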
