I’m filtering all user input to remove the following characters:
http://www.w3.org/TR/unicode-xml/#Charlist (“not suitable characters for use with markup”).
So I have these two functions:
if (!function_exists("mb_trim")) {
    function mb_trim($str)
    {
        return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
    }
}

function sanitize($str)
{
    // Clones of grave and acute
    $str = preg_replace("/[\x{0340}-\x{0341}]+/u", "", $str);
    // Obsolete characters for Khmer
    $str = preg_replace("/[\x{17A3}]+/u", "", $str);
    $str = preg_replace("/[\x{17D3}]+/u", "", $str);
    // Line and paragraph separator
    $str = preg_replace("/[\x{2028}]+/u", "", $str);
    $str = preg_replace("/[\x{2029}]+/u", "", $str);
    // BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)
    $str = preg_replace("/[\x{202A}-\x{202E}]+/u", "", $str);
    // Activate/Inhibit Symmetric swapping
    $str = preg_replace("/[\x{206A}-\x{206B}]+/u", "", $str);
    // Activate/Inhibit Arabic form shaping
    $str = preg_replace("/[\x{206C}-\x{206D}]+/u", "", $str);
    // Activate/Inhibit National digit shapes
    $str = preg_replace("/[\x{206E}-\x{206F}]+/u", "", $str);
    // Interlinear annotation characters
    $str = preg_replace("/[\x{FFF9}-\x{FFFB}]+/u", "", $str);
    // Byte Order Mark
    $str = preg_replace("/[\x{FEFF}]+/u", "", $str);
    // Object replacement character
    $str = preg_replace("/[\x{FFFC}]+/u", "", $str);
    // Scoping for Musical Notation
    $str = preg_replace("/[\x{1D173}-\x{1D17A}]+/u", "", $str);
    $str = mb_trim($str);
    if (mb_check_encoding($str)) {
        return $str;
    } else {
        return false;
    }
}
I don't know much about regular expressions, so what I want to know is:
- Is the mb_trim function correct for trimming multi-byte strings?
- Is it possible to combine all the regular expressions in the sanitize
function into a single preg_replace call?
Thanks
You can do it with one preg_replace by combining them into a single character class, like so:
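A minimal sketch of what such a combined character class could look like, assuming the same code points as in the question's sanitize(). The function name sanitize_combined is hypothetical, and moving the encoding check to the top is my own assumption (preg_replace with the /u modifier returns null on invalid UTF-8, so checking afterwards is too late):

```php
<?php
// Hypothetical single-pass variant of sanitize(); the name is illustrative only.
function sanitize_combined($str)
{
    // Assumption: validate first, because preg_replace() with /u
    // returns null when the subject is not valid UTF-8.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return false;
    }

    // Single quotes keep \x{...} literal so PCRE interprets it, not PHP.
    $pattern = '/['
        . '\x{0340}\x{0341}'    // clones of grave and acute
        . '\x{17A3}\x{17D3}'    // obsolete characters for Khmer
        . '\x{2028}\x{2029}'    // line and paragraph separators
        . '\x{202A}-\x{202E}'   // BIDI embedding controls
        . '\x{206A}-\x{206F}'   // symmetric swapping, Arabic shaping, digit shapes
        . '\x{FFF9}-\x{FFFC}'   // interlinear annotation, object replacement
        . '\x{FEFF}'            // byte order mark
        . '\x{1D173}-\x{1D17A}' // scoping for musical notation
        . ']+/u';
    $str = preg_replace($pattern, '', $str);

    // Same trimming expression as mb_trim() in the question.
    return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
}
```

Adjacent ranges from the question (206A-206B, 206C-206D, 206E-206F, and FFF9-FFFB plus FFFC) are merged here into single ranges, which matches the same set of characters.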