I’m parsing a text file made from this Wikipedia article , basically I made

Question

Asked: June 4, 2026 (2026-06-04T13:57:07+00:00)

I’m parsing a text file made from this Wikipedia article; basically, I pressed Ctrl+A and copy/pasted all the content into a text file (I use it as an example).
I’m trying to build a list of words with their counts, and for that I use a Scanner with this delimiter:

    sc.useDelimiter("[\\p{javaWhitespace}\\p{Punct}]+");

It works great for my needs, but while analysing the result I saw something that looks like a blank token (again…). The character comes right after (nynorsk) in the article (funnily, when I copy/paste it here the character disappears; in gedit I can press → and ← and the cursor doesn’t move).

After further research I found out that this token is actually the POP DIRECTIONAL FORMATTING character (U+202C).

It’s not the only directional character; looking at the Character documentation, Java seems to define them.
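
For illustration, here is a minimal sketch of how such a character could be checked against the directionality constants in `java.lang.Character`. The class and method names (`DirectionalCheck`, `isDirectionalFormatting`) are hypothetical, and the check only covers the five embedding/override/pop constants; it is an assumption that these are the ones of interest, and other invisible characters such as U+200E are not caught by it.

```java
class DirectionalCheck {
    // Hypothetical helper: true for explicit bidi embedding, override and
    // pop formatting characters (U+202A..U+202E), false for anything else.
    public static boolean isDirectionalFormatting(int codePoint) {
        switch (Character.getDirectionality(codePoint)) {
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT_EMBEDDING:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_EMBEDDING:
            case Character.DIRECTIONALITY_LEFT_TO_RIGHT_OVERRIDE:
            case Character.DIRECTIONALITY_RIGHT_TO_LEFT_OVERRIDE:
            case Character.DIRECTIONALITY_POP_DIRECTIONAL_FORMAT:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // U+202C is the character found after "(nynorsk)".
        System.out.println(isDirectionalFormatting(0x202C));
        // Its general category is "Format" (Cf).
        System.out.println(Character.getType(0x202C) == Character.FORMAT);
        // An ordinary letter is not a directional formatting character.
        System.out.println(isDirectionalFormatting('a'));
    }
}
```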

So I’m wondering if there is a standard way to detect these characters, and if possible a way that can be easily integrated in the delimiter pattern.

I’d like to avoid making my own list because I fear I will forget some of them.

1 Answer

Editorial Team · Answer 1 · 2026-06-04T13:57:09+00:00

You could always go the other way round and use a whitelist rather than a blacklist:

    sc.useDelimiter("[^\\p{L}]+");
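
As a rough sketch of how that whitelist delimiter could be used for the word count, the example below treats every run of non-letter characters as a separator, so U+202C is dropped along with whitespace and punctuation. The class name `WordCount`, the method `countWords` and the sample string are made up for illustration and are not taken from the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

class WordCount {
    // Hypothetical helper: counts tokens made only of Unicode letters.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (Scanner sc = new Scanner(text)) {
            // Anything that is not a letter (\p{L}) acts as a delimiter.
            sc.useDelimiter("[^\\p{L}]+");
            while (sc.hasNext()) {
                counts.merge(sc.next(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample text with U+202C right after "(nynorsk)".
        String sample = "Norsk (nynorsk)\u202C, norsk (bokm\u00E5l); nynorsk!";
        System.out.println(countWords(sample));
    }
}
```

Note that this pattern also splits on digits, apostrophes and hyphens, which may or may not be what you want. If you would rather keep the original blacklist, adding the Unicode "Format" category to it should also cover U+202C, since that character belongs to `Cf`: `"[\\p{javaWhitespace}\\p{Punct}\\p{Cf}]+"`.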
