Normally, in order to remove non-word characters from a String the replaceAll method can


Asked: June 10, 2026


Normally, in order to remove non-word characters from a String the replaceAll method can be used:

String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");

The above returns a cleaned string “somestringwithnonwordssuchas”.

However, if the string contains Cyrillic characters they get recognised as non-word, and get removed from the string. It is expected that Cyrillic characters would remain. Hence the question.

What is the proper way to remove non-word characters regardless of the language, assuming the string is UTF-8 encoded?


1 Answer

Editorial Team · Answer 1 · 2026-06-10T04:31:59+00:00

Try [^\\p{L}]. It matches every Unicode code point that is not a letter. Note that, unlike \\W, it also strips digits and underscores; [^\\p{L}\\p{N}_] keeps them.
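As a minimal sketch of that suggestion (the class and method names here are hypothetical, chosen only for illustration), the pattern can be compiled once and reused. The sample input is the text "Привет, мир! Hello 42;" written with Unicode escapes, so the result does not depend on the source file encoding:

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative and not from the question.
final class LettersOnly {
    // \p{L} is any Unicode letter, so the negated class matches everything else.
    private static final Pattern NON_LETTERS = Pattern.compile("[^\\p{L}]");

    // Variant that also keeps Unicode digits (\p{N}) and the underscore, closer to \w.
    private static final Pattern NON_WORD_LIKE = Pattern.compile("[^\\p{L}\\p{N}_]");

    static String lettersOnly(String input) {
        return NON_LETTERS.matcher(input).replaceAll("");
    }

    static String lettersDigitsUnderscore(String input) {
        return NON_WORD_LIKE.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Cyrillic "Privet, mir!" followed by some ASCII text and digits.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440! Hello 42;";
        System.out.println(lettersOnly(sample));
        System.out.println(lettersDigitsUnderscore(sample));
    }
}
```

The first method drops the digits along with the punctuation and spaces; the second keeps them, which is usually closer to what \\W was being used for.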

The Pattern class documentation has a thorough description of the available character classes. Note that the predefined classes (such as \\w and \\W) and the POSIX character classes are ASCII-only by default, so they won’t help much here; you’ll need the Unicode-specific classes.

There is also the UNICODE_CHARACTER_CLASS flag (available since Java 7, or inline as (?U)), which makes the predefined and POSIX classes conform to Annex C (Compatibility Properties) of Unicode Technical Standard #18, basically making them equivalent to their closest Unicode-aware counterparts.
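A short sketch of the flag-based approach, keeping the original \\W pattern from the question. The class and method names are hypothetical, and the sample input is again Cyrillic text written with Unicode escapes:

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class UnicodeWordClean {
    // With UNICODE_CHARACTER_CLASS, \W excludes letters, digits and
    // connector punctuation (such as '_') from any script, not just ASCII.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    static String wordCharsOnly(String input) {
        return NON_WORD.matcher(input).replaceAll("");
    }

    // Same behaviour via the embedded (?U) flag, usable directly with String.replaceAll.
    static String wordCharsOnlyInline(String input) {
        return input.replaceAll("(?U)\\W", "");
    }

    public static void main(String[] args) {
        // Cyrillic "Privet, mir" followed by digits, an underscore and a semicolon.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440 42_x;";
        System.out.println(wordCharsOnly(sample));
        System.out.println(wordCharsOnlyInline(sample));
    }
}
```

The Java documentation notes that this flag may impose a performance penalty, so for hot paths the explicit \\p{L}-style classes may be the better choice.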
