I’m learning regular expressions and thought I was starting to get a grip, but then…
I tried to split a string and I need help to understand such a simple thing as:
String input = "abcde";
System.out.println("[a-z] - " + Arrays.toString(input.split("[a-z]")));
System.out.println("\\w - " + Arrays.toString(input.split("\\w")));
System.out.println("\\w*? - " + Arrays.toString(input.split("\\w*?")));
System.out.println("\\w+? - " + Arrays.toString(input.split("\\w+?")));
The output is
[a-z] - []
\w - []
\w*? - [, a, b, c, d, e]
\w+? - []
Why does neither of the first two lines split the String on any character?
The third expression, \w*? (the question mark prevents greediness), works as I expected, splitting the String on every character. The star, zero or more matches, returns an empty array.
I’ve tried the expression in Notepad++ and in a program, and it shows 5 matches, as in:
Scanner ls = new Scanner(input);
while (ls.hasNext())
    System.out.format("%s ", ls.findInLine("\\w"));
Output is: a b c d e
This really puzzles me.
If you split a string with a regex, you are essentially telling it where the string should be cut. This necessarily cuts away whatever the regex matches. So if you split at \w, every character is a split point, and the substrings between them (all empty) are returned. Java automatically removes trailing empty strings, as described in the documentation.
This also explains why the lazy match \w*? will give you every character: it matches at every position between (and before and after) the characters, and each of those matches is zero-width. What’s left are the characters of the string themselves.
Let’s break it down:
[a-z], \w, \w+?
Your string is "abcde", and the matches are the five single characters a, b, c, d and e, which leaves you with the substrings between the matches, all of which are empty.
The above three regexes behave the same in this regard, as each of their matches covers only a single character. \w+? does so because it lacks any other constraint that might make the +? try matching more than the bare minimum (it’s lazy, after all).
\w*?
In this case the matches are zero-width and fall between the characters, leaving you with the following substrings: "", "a", "b", "c", "d", "e", "".
Java throws the trailing empty one away, though.
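If you want to see this for yourself, here is a small sketch that lists where each regex matches and then splits with a negative limit, which tells split to keep trailing empty strings. The class name SplitExplained and the helper names matches and splitKeepingEmpties are made up for this illustration; only Pattern, Matcher and String.split are standard API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class SplitExplained {

    // Lists every match as "start-end:text" so zero-width matches stay visible.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    // A negative limit makes split keep trailing empty strings.
    static String[] splitKeepingEmpties(String regex, String input) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "abcde";
        // Show the match positions for all four expressions from the question.
        for (String regex : new String[] {"[a-z]", "\\w", "\\w+?", "\\w*?"}) {
            System.out.println(regex + " matches " + matches(regex, input));
        }
        // Default split drops the trailing empty strings; limit -1 keeps them.
        System.out.println(Arrays.toString(input.split("\\w")));
        System.out.println(Arrays.toString(splitKeepingEmpties("\\w", input)));
    }
}
```

The first three expressions each report five one-character matches, while \w*? reports six zero-width matches at positions 0 through 5. Splitting on \w with the default limit yields an empty array, and with limit -1 it yields six empty strings. One caveat: from Java 8 on, a zero-width match at the very start of the input no longer produces a leading empty string, so on a current JVM the \w*? split will most likely print [a, b, c, d, e] rather than the [, a, b, c, d, e] shown in the question.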