I’m parsing a text file made from this Wikipedia article; basically, I pressed Ctrl+A and copy/pasted all the content into a text file (I use it as an example).
I’m trying to build a list of words with their counts, and for that I use a Scanner with this delimiter:
sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");
It works great for my needs, but while analysing the result I saw something that looks like a blank token (again…). The character comes right after (nynorsk) in the article (funnily enough, when I copy/paste it here the character disappears; in gedit I can press → and ← and the cursor doesn’t move).
After further research I found out that this token was actually the POP DIRECTIONAL FORMATTING character (U+202C).
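For reference, here is a minimal sketch that reproduces the behaviour with the delimiter above. The class name `BlankTokenDemo`, the helper `tokens`, and the sample string are made up for illustration; the real input is the pasted article text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class BlankTokenDemo {
    // Splits the text with the delimiter from the question and collects the tokens.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: U+202C placed right after "(nynorsk)", as in the article.
        String sample = "(nynorsk)\u202C and more";
        for (String token : tokens(sample)) {
            // Print the code points so that invisible characters become visible.
            StringBuilder hex = new StringBuilder();
            token.codePoints().forEach(cp -> hex.append(String.format("U+%04X ", cp)));
            System.out.println("[" + token + "] " + hex.toString().trim());
        }
    }
}
```

The second token consists only of U+202C, which is neither Java whitespace nor ASCII punctuation, so the delimiter does not swallow it.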
It’s not the only directional character: looking at the Character documentation, Java seems to define constants for them (for example DIRECTIONALITY_POP_DIRECTIONAL_FORMAT).
So I’m wondering if there is a standard way to detect these characters and, if possible, a way that can easily be integrated into the delimiter pattern.
I’d like to avoid making my own list because I fear I would forget some of them.
You could always go the other way round and use a whitelist rather than a blacklist:
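A minimal sketch of the whitelist idea, assuming that a "word" is any run of Unicode letters and digits; everything else, including invisible formatting characters such as U+202C, then counts as a delimiter. The class name `WhitelistTokenizer` and the helper `tokens` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WhitelistTokenizer {
    // Delimiter = one or more characters that are NOT a letter (\p{L}) or a number (\p{N}).
    private static final String NOT_A_WORD_CHAR = "[^\\p{L}\\p{N}]+";

    // Returns the letter/digit runs found in the text, in order.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(NOT_A_WORD_CHAR);
            while (sc.hasNext()) {
                result.add(sc.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample with U+202C right after the closing parenthesis.
        System.out.println(tokens("(nynorsk)\u202C and more"));
    }
}
```

With this delimiter the directional character never becomes a token, and no list of special characters has to be maintained. One caveat: text that uses decomposed accents would also be split at the combining marks, so adding `\p{M}` to the whitelist may be needed for such input. If a blacklist is still preferred, adding `\p{Cf}` (the Unicode "format" category, which contains U+202C) to the original character class should cover the directional formatting characters as well.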