I’m trying to collapse repeated whitespace characters in a UTF-8 string in PHP using a regex.
This regex
$txt = preg_replace( '/\s+/i' , ' ', $txt );
usually works fine, but some of the strings contain the Cyrillic letter “Р”, which is corrupted after the replacement.
After a little research I realized that the letter is encoded in UTF-8 as the two bytes \xD0 \xA0, and since \xA0 is the non-breaking space in Latin-1 (ISO-8859-1, not ASCII), the regex replaces that byte with \x20 and the character is no longer valid.
Any ideas how to do this properly in PHP with regex?
This is described at http://www.php.net/manual/en/function.preg-replace.php#106981
If you want to match characters from any script, whether European, Russian, Chinese, Japanese, Korean or anything else, just use:
...u', '…', $string) with the u (Unicode) modifier.
For further information, the complete list of preg_* modifiers can be found at:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
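Applied to the regex from the question, that would look roughly like the sketch below. The function name collapse_whitespace() and the sample string are only illustrative assumptions, and the "\u{0420}" escape needs PHP 7.0 or later. Note that with the u modifier preg_replace() returns null when the subject is not valid UTF-8, so the sketch falls back to the original string in that case.

```php
<?php
// Hypothetical helper: collapse every run of whitespace into one space
// without breaking multi-byte UTF-8 characters.
function collapse_whitespace(string $txt): string
{
    // The u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \s is tested against whole characters instead of single bytes.
    $result = preg_replace('/\s+/u', ' ', $txt);

    // preg_replace() returns null on error (e.g. invalid UTF-8 input).
    // Assumption: returning the input unchanged is acceptable here.
    return $result === null ? $txt : $result;
}

$letter = "\u{0420}";              // Cyrillic "Р", bytes D0 A0 in UTF-8
$txt    = "A  \t" . $letter . "\n\n B";

$clean = collapse_whitespace($txt);

echo bin2hex($letter), "\n";       // d0a0
echo $clean, "\n";                 // A Р B
```

The i modifier from the original pattern is dropped because \s has no case to ignore.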