I am using PMD, Checkstyle, FindBugs, etc. in Sonar. I would like a rule that verifies that Java code contains no characters that are not valid UTF-8.
E.g. the character � should not be allowed.
I could not find a rule for this in the above plugins, but I guess a custom rule can be made in Sonar.
Here is a regular expression that matches only valid UTF-8 byte sequences:
I derived it from RFC 3629, "UTF-8, a transformation format of ISO 10646", section 4, "Syntax of UTF-8 Byte Sequences".
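Since the question is about Java, here is a rough Java rendering of the RFC 3629 section 4 grammar, as a sketch rather than the exact expression from this answer. Java regular expressions work on chars, not bytes, so it assumes the bytes are first decoded as ISO-8859-1, which maps each byte to exactly one char. The names `Utf8RegexCheck` and `firstInvalidOffset` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Utf8RegexCheck {
    // One well-formed UTF-8 character, following the ABNF in RFC 3629, section 4.
    private static final Pattern UTF8_CHAR = Pattern.compile(
          "[\\x00-\\x7F]"                          // UTF8-1 (ASCII)
        + "|[\\xC2-\\xDF][\\x80-\\xBF]"            // UTF8-2 (C0 and C1 would be overlong)
        + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"       // UTF8-3, excluding overlong forms
        + "|[\\xE1-\\xEC][\\x80-\\xBF]{2}"
        + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"       // excluding surrogates U+D800..U+DFFF
        + "|[\\xEE-\\xEF][\\x80-\\xBF]{2}"
        + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"    // UTF8-4, excluding overlong forms
        + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"
        + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}");  // nothing above U+10FFFF

    /** Returns the offset of the first byte that does not start a valid UTF-8 character, or -1. */
    static int firstInvalidOffset(byte[] bytes) {
        // Assumption: ISO-8859-1 decoding is used only to expose each byte as one char.
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher m = UTF8_CHAR.matcher(latin1);
        int pos = 0;
        while (pos < latin1.length()) {
            m.region(pos, latin1.length());
            if (!m.lookingAt()) {
                return pos;          // no valid character starts here
            }
            pos = m.end();           // continue after the matched character
        }
        return -1;
    }
}
```

Matching one character at a time in a loop, instead of wrapping the whole alternation in `(...)*`, keeps the matcher from recursing once per character on large files and also yields the position of the first offending byte.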
Factorizing the above gives the slightly shorter:
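One way such a factorization might look in Java is sketched below; it is my own reconstruction, not necessarily the exact shorter form from this answer. It merges the lead bytes that share the same continuation pattern and pulls the common trailing bytes out of each group. As before, the bytes are assumed to be viewed through ISO-8859-1 so that one byte equals one char, and `Utf8FactoredRegex`, `TAIL`, and `isValidUtf8` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Utf8FactoredRegex {
    // UTF8-tail from RFC 3629: a continuation byte.
    private static final String TAIL = "[\\x80-\\xBF]";

    // Same language as the unfactored grammar, with shared tails pulled out.
    private static final Pattern UTF8_CHAR = Pattern.compile(
          "[\\x00-\\x7F]"
        + "|[\\xC2-\\xDF]" + TAIL
        // 3-byte forms: restricted second byte for E0 and ED, then one common tail
        + "|(?:\\xE0[\\xA0-\\xBF]|[\\xE1-\\xEC\\xEE\\xEF]" + TAIL + "|\\xED[\\x80-\\x9F])" + TAIL
        // 4-byte forms: restricted second byte for F0 and F4, then two common tails
        + "|(?:\\xF0[\\x90-\\xBF]|[\\xF1-\\xF3]" + TAIL + "|\\xF4[\\x80-\\x8F])" + TAIL + "{2}");

    /** True if the whole byte array is a sequence of well-formed UTF-8 characters. */
    static boolean isValidUtf8(byte[] bytes) {
        // Assumption: ISO-8859-1 decoding is used only to expose each byte as one char.
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher m = UTF8_CHAR.matcher(latin1);
        int pos = 0;
        while (pos < latin1.length()) {
            m.region(pos, latin1.length());
            if (!m.lookingAt()) {
                return false;
            }
            pos = m.end();
        }
        return true;
    }
}
```

The factored form is shorter but harder to check against the RFC line by line, so keeping the unfactored version around as a reference is probably worthwhile.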
This simple Perl script demonstrates usage:
It produces the following output: