Normally, in order to remove non-word characters from a String the replaceAll method can be used:
String cleanWords = "some string with non-words such as ';'".replaceAll("\\W", "");
The above returns a cleaned string “somestringwithnonwordssuchas”.
However, if the string contains Cyrillic characters, they are treated as non-word characters and removed from the string. The expectation is that Cyrillic characters remain, hence the question.
What is a proper way to deal with the task of removing non-word characters regardless of the language, assuming that string has UTF-8 encoding?
Try [^\\p{L}]. That should match every Unicode code point except for letters.
The Pattern class has a pretty thorough description of the possible character classes. Note that the POSIX character classes are ASCII-only by default and won’t help you much; you’ll need to use the Unicode-specific classes.
Note that there’s the UNICODE_CHARACTER_CLASS flag, which changes the behavior of the POSIX classes to conform to this section of the Unicode Standard (basically making them equivalent to their closest Unicode-aware equivalents).