I’m using a Scanner with a delimiter and I’ve come across a strange behaviour I’d like to understand.
I’m using this program:
Scanner sc = new Scanner("Aller à : Navigation, rechercher");
sc.useDelimiter("\\s+|\\s*\\p{Punct}+\\s*");
String word = "";
while (sc.hasNext()) {
    word = sc.next();
    System.out.println(word);
}
The output is:
Aller
à

Navigation
rechercher
So first, I don’t understand why I’m getting a blank token. The documentation says:
Depending upon the type of delimiting pattern, empty tokens may be returned. For example, the pattern “\s+” will return no empty tokens since it matches multiple instances of the delimiter. The delimiting pattern “\s” could return empty tokens since it only passes one space at a time.
I’m using \\s+, so why does it return a blank token?
Then there is another thing I’d like to understand concerning the regex. If I change the delimiter to the “reversed” regex:
sc.useDelimiter("\\s*\\p{Punct}+\\s*|\\s+");
The output is correct and I get:
Aller
à
Navigation
rechercher
Why does it work this way?
EDIT:
With this case:
Scanner sc = new Scanner("(23 ou 24 minutes pour les épisodes avec introduction) (approx.)1");
sc.useDelimiter("\\s*\\p{Punct}+\\s*|\\s+"); //second regex
I still have a blank token between introduction and approx. Is it possible to avoid it?
I have a feeling that you are causing two delimiter matches in places where there’s a blank space followed by punctuation. Why not simply use
[\\s\\p{Punct}]+
instead? This regex
\\s+|\\p{Punct}+
will first match the whitespace and swallow it, then match the punctuation as the next delimiter. That gives two delimiters next to each other with nothing in between (the empty token).
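To see the difference side by side, here is a minimal sketch that runs the three delimiters discussed above against the same inputs. The class name DelimiterDemo and the helper tokens are made up for this illustration, and the second input is a shortened excerpt of the string from the EDIT, so treat it as a way to experiment rather than a definitive test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the original question.
class DelimiterDemo {

    // Collects every token the Scanner returns for the given delimiter regex,
    // including empty tokens.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The a-grave is written as a Unicode escape so the file compiles
        // regardless of the source encoding.
        String first = "Aller \u00e0 : Navigation, rechercher";
        // Assumption: a shortened excerpt of the second input from the question.
        String second = "(avec introduction) (approx.)1";

        String[] delimiters = {
            "\\s+|\\s*\\p{Punct}+\\s*",   // original: whitespace alternative first
            "\\s*\\p{Punct}+\\s*|\\s+",   // "reversed": punctuation alternative first
            "[\\s\\p{Punct}]+"            // one character class, as suggested above
        };

        for (String delimiter : delimiters) {
            System.out.println("Delimiter: " + delimiter);
            for (String input : new String[] {first, second}) {
                List<String> found = tokens(input, delimiter);
                // Printing the size makes an empty token easy to spot.
                System.out.println("  " + found.size() + " tokens: " + found);
            }
        }
    }
}
```

With the original delimiter, the first input should yield five tokens, one of them empty, because the space before the colon is consumed on its own and the colon then starts a second, adjacent delimiter. The reversed regex fixes that input but should still produce an empty token for ") (" in the second one, since the match stops after the space and "(" becomes a separate delimiter. The character class treats any run of whitespace and punctuation as a single delimiter, so neither input should produce an empty token.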