This regular expression is supposed to match all non-ASCII characters, i.e. everything outside code points 0-127:
/[^\x00-\x7F]/i
Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.
Generating this range may be simple with range(0, 1114111). Then I should convert each decimal number to hexadecimal with the dechex() function.
After that, how can I convert the hexadecimal number to the actual character? And how can I exclude the characters that are already in the ASCII range?
It depends on how you are going to do the matching and on whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.
If you do use the /u modifier, then you must use UTF-8 encoding for both the regular expression and the subject, and the regex engine will automatically interpret each legal UTF-8 byte sequence as a single character. In this mode the regular expression [^\x00-\x7F] will match every character outside the ASCII range (the Basic Latin block), that is, every code point above U+007F, including those greater than 255. You will also need to generate the UTF-8 representation of each character (given its code point) manually.
If you do not use the /u modifier, then the regex engine will be dumb: it will treat each byte as a separate character, which means that you have to work at the byte level rather than the character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^\x00-\x7F] regex (because it would only match individual bytes in the string, not whole characters) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of arbitrary characters you will again need custom code that depends on the specific encoding.
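As a rough illustration of the UTF-8 route, the sketch below encodes each code point by hand and runs the pattern over the whole range with the /u modifier. The helper name codePointToUtf8() and the variable names are made up for this example, and the code assumes a PHP 7 or later build. The dechex() step turns out to be unnecessary, since the integer code point can be encoded directly, and the surrogate range U+D800 to U+DFFF is skipped because it has no valid UTF-8 form. If the mbstring extension is available, mb_chr($cp, 'UTF-8') should be able to replace the hand-written encoder.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException(sprintf('Not encodable: U+%04X', $cp));
    }
    if ($cp < 0x80) {            // 1 byte: plain ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))   // 4 bytes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Single quotes keep the backslashes intact so that PCRE sees \x00 and \x7F.
$pattern = '/[^\x00-\x7F]/u';

$matched = 0;      // how many code points the pattern matched
$unexpected = [];  // code points where the result differs from "cp > 0x7F"

for ($cp = 0; $cp <= 0x10FFFF; $cp++) {
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        continue;  // surrogates cannot be encoded in UTF-8
    }
    $isMatch = preg_match($pattern, codePointToUtf8($cp)) === 1;
    if ($isMatch) {
        $matched++;
    }
    if ($isMatch !== ($cp > 0x7F)) {
        $unexpected[] = $cp;
    }
}

printf("matched %d code points, %d unexpected results\n", $matched, count($unexpected));

// Byte mode versus UTF-8 mode on "é" (U+00E9, encoded as the two bytes C3 A9):
// without /u each byte counts as a character, with /u the pair is one character.
$bytesMatched = preg_match_all('/[^\x00-\x7F]/', codePointToUtf8(0xE9));
$charsMatched = preg_match_all('/[^\x00-\x7F]/u', codePointToUtf8(0xE9));
printf("without /u: %d matches, with /u: %d match\n", $bytesMatched, $charsMatched);
```

The loop makes over a million preg_match() calls, so it is meant as a one-off curiosity check, not something to run on every request. Collecting the disagreements in $unexpected makes it easy to see whether the engine treats any code point differently from the simple "greater than 0x7F" rule.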