In the original code (a Drupal core module), a previous developer commented out this line:
if (preg_match('/[^\x{80}-\x{F7} a-z0-9@_.\'-]/i', $name)) {
and added this one instead:
if (preg_match('/[^\x{80}-\x{F7} a-z0-9@_.\'-]/iu', $name)) {
Can you help me understand the difference between these two? What does the u modifier do? In the PHP docs I found:
u (PCRE8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
So I guess the previous developer had problems with special characters being interpreted incorrectly, or something similar. I'm a bit puzzled, please advise.
The modifier is needed to process UTF-8 encoded input properly. A pattern like \xC1 should match the Unicode character U+00C1 (Á). When you encode Á in UTF-8 you get the two bytes \xC3\x81, so without the modifier \xC1 is compared byte by byte and doesn't match. The "u" modifier makes the engine treat both pattern and subject as UTF-8, so \xC1 means the code point U+00C1 and it does match.
Basically, when you work with UTF-8 encoded text, this is what will happen:
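A minimal sketch of that difference, assuming a PHP build with PCRE UTF-8 support (the variable names here are only for illustration). The subject is written as explicit bytes so the example does not depend on the encoding of the source file, and the pattern is single-quoted so that PCRE, not PHP, interprets \xC1:

```php
<?php
// "Á" (U+00C1) encoded as UTF-8 is the two bytes C3 81.
$subject = "\xC3\x81";

// Without "u": PCRE compares bytes, so \xC1 means the single byte 0xC1,
// which does not occur in the subject.
$withoutU = preg_match('/\xC1/', $subject);

// With "u": pattern and subject are treated as UTF-8, so \xC1 means the
// code point U+00C1, which is exactly what the subject contains.
$withU = preg_match('/\xC1/u', $subject);

var_dump($withoutU); // int(0)
var_dump($withU);    // int(1)
```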
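In your case the first regular expression, [^\x{80}-\x{F7} ...] without the u modifier, works byte by byte. Every byte of a multi-byte UTF-8 sequence falls in the range 0x80 – 0xF7, so the negated class matches no (non-ASCII) UTF-8 encoded text, and any such text passes the check. The second expression works on code points: it matches Unicode characters outside of the range U+0080 – U+00F7 (and outside the listed ASCII characters), so a name containing Cyrillic, Greek, Arabic, Hebrew, … characters is matched by the pattern and therefore flagged rather than let through.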
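To see this with the two patterns from the question, here is a small sketch. The helper flagsIllegalChar() and the sample names are hypothetical and not part of Drupal; the non-ASCII names are spelled out as UTF-8 bytes so the result does not depend on the file encoding:

```php
<?php
// The two patterns from the question: byte mode vs. UTF-8 mode.
$byteMode = '/[^\x{80}-\x{F7} a-z0-9@_.\'-]/i';
$utf8Mode = '/[^\x{80}-\x{F7} a-z0-9@_.\'-]/iu';

// Hypothetical helper: true when the pattern finds a disallowed character.
function flagsIllegalChar(string $pattern, string $name): bool
{
    return preg_match($pattern, $name) === 1;
}

$names = [
    'john.doe',                          // plain ASCII
    "Ren\xC3\xA9",                       // "René", é is U+00E9
    "\xD0\x98\xD0\xB2\xD0\xB0\xD0\xBD",  // "Иван", Cyrillic U+0418 U+0432 U+0430 U+043D
    'bad#name',                          // "#" is not in the allowed set
];

$results = [];
foreach ($names as $name) {
    $results[$name] = [
        'byte' => flagsIllegalChar($byteMode, $name),
        'utf8' => flagsIllegalChar($utf8Mode, $name),
    ];
    printf(
        "%-10s byte mode: %-8s UTF-8 mode: %s\n",
        $name,
        $results[$name]['byte'] ? 'flagged' : 'ok',
        $results[$name]['utf8'] ? 'flagged' : 'ok'
    );
}
```

In this sketch only the Cyrillic name is treated differently: the byte-mode pattern accepts it, while the UTF-8 pattern flags it. The ASCII name and "René" pass both patterns, and 'bad#name' is flagged by both.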