I need a tokenizer that splits words on whitespace but does not split when the whitespace is within double parentheses. Here is an example:
My input-> term1 term2 term3 ((term4 term5)) term6
should produce this list of tokens:
term1, term2, term3, ((term4 term5)), term6.
I think I can get this behaviour by extending Lucene's WhitespaceTokenizer. How can I write this extension?
Are there any other solutions?
Thanks in advance.
I haven’t tried extending the Tokenizer, but I have what I think is a nice solution using a regular expression:
And a method that splits a string into the groups matched by the regex and returns them as an array. Code example:
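The original snippet did not survive here, so the following is only a minimal sketch of the approach described above, not the answerer's exact code. The pattern and the names ParenAwareSplitter and tokenize are assumptions made for illustration. It returns a List rather than an array, and it assumes each "((" starts a new token and is closed by a matching "))".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
final class ParenAwareSplitter {

    // Try a "((...))" group first (lazy, so it stops at the first "))"),
    // otherwise fall back to a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\\(\\(.*?\\)\\)|\\S+");

    // Collects every match of TOKEN, in order of appearance.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Because the matching is done with find() rather than split(), the regex describes the tokens themselves instead of the delimiters between them.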
The output:
Hope this helps.
Please note that the usual split() will return nothing, or not what you want, because it only looks at delimiters.
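To illustrate the second failure mode, here is a small sketch of what a plain whitespace split does with the sample input from the question. The class name NaiveSplitDemo and the method naiveSplit are made up for this example.

```java
// Hypothetical demo class showing why a plain split() is not enough.
final class NaiveSplitDemo {

    // Splits on any run of whitespace, ignoring parentheses entirely.
    static String[] naiveSplit(String input) {
        return input.split("\\s+");
    }

    public static void main(String[] args) {
        String input = "term1 term2 term3 ((term4 term5)) term6";
        // The parenthesized group is cut in two: "((term4" and "term5))".
        for (String part : naiveSplit(input)) {
            System.out.println(part);
        }
    }
}
```

split() only knows where the delimiters are, so it cannot tell a space inside "((...))" from a space between tokens.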