I’ve found this regex in a script I’m customizing. Can someone tell me what it’s doing?
function test( $text) {
$regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
return preg_replace($regex, '$1', $text);
}
Inside of the capturing group there are four options:
- [\x00-\x7F]
- [\xC0-\xDF][\x80-\xBF]
- [\xE0-\xEF][\x80-\xBF]{2}
- [\xF0-\xF7][\x80-\xBF]{3}

If none of these patterns are matched at a given location, then any character will be matched by the . that is outside of the capturing group.

The preg_replace call will iterate over $text finding all non-overlapping matches, replacing each match with whatever was captured.

There are two possibilities here, either the entire match was inside the capturing group so the replacement doesn’t change $text, or the . at the end matched a single character and that character is removed from $text.

Here are some basic examples:

- If a character in \xF8-\xFF appears in the text, it will always be removed
- A character in \xC0-\xDF will be removed unless followed by a character in \x80-\xBF
- A character in \xE0-\xEF will be removed unless followed by two characters in \x80-\xBF
- A character in \xF0-\xF7 will be removed unless followed by three characters in \x80-\xBF
- A character in \x80-\xBF will be removed unless it was matched as a part of one of the above cases
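
The removal rules above can be checked with a short script. This is only a sketch: the function name strip_malformed_bytes is made up for illustration (the regex itself is copied from the question), and the inputs are built with PHP's double-quoted "\x.." escapes so that each one is a raw byte.

```php
<?php
// Hypothetical name; the regex is the one from the question, unchanged.
function strip_malformed_bytes($text) {
    $regex = '/( [\x00-\x7F] | [\xC0-\xDF][\x80-\xBF] | [\xE0-\xEF][\x80-\xBF]{2} | [\xF0-\xF7][\x80-\xBF]{3} ) | ./x';
    // $1 is empty whenever the trailing "." branch matched, so that byte is dropped.
    return preg_replace($regex, '$1', $text);
}

// Plain ASCII and complete multi-byte sequences pass through unchanged.
var_dump(strip_malformed_bytes("abc") === "abc");                               // bool(true)
var_dump(strip_malformed_bytes("caf" . "\xC3\xA9") === "caf" . "\xC3\xA9");     // bool(true)
var_dump(strip_malformed_bytes("\xE2\x82\xAC") === "\xE2\x82\xAC");             // bool(true)

// A byte in \xF8-\xFF is always removed.
var_dump(strip_malformed_bytes("a" . "\xFF" . "b") === "ab");                   // bool(true)

// A lead byte without its continuation byte is removed.
var_dump(strip_malformed_bytes("a" . "\xC3" . "b") === "ab");                   // bool(true)

// A stray continuation byte is removed.
var_dump(strip_malformed_bytes("a" . "\x80" . "b") === "ab");                   // bool(true)

// A truncated three-byte sequence loses both of its bytes.
var_dump(strip_malformed_bytes("x" . "\xE2\x82") === "x");                      // bool(true)
```

In the last case the \xE2 is removed first because only one continuation byte follows it, and the leftover \x82 is then removed as a stray continuation byte.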