I’m using the following regex to strip out non-printing control characters from user input before inserting the values into the database.
preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $value)
Is there a problem with using this on UTF-8 strings? It seems to remove all non-ASCII characters entirely.
Part of the problem is that you aren’t treating the target as a UTF-8 string; you need the /u modifier for that. Also, in UTF-8 any non-ASCII character is represented by two or more bytes, all of them in the range \x80..\xFF, so your character class strips every byte of every non-ASCII character. Try matching \p{Cc} with the u modifier instead: \p{Cc} is the Unicode property for control characters, and the u causes both the regex and the target string to be treated as UTF-8.
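A minimal sketch of that approach, for illustration only. The helper name strip_control_chars and the sample input are assumptions, not part of the original post, and the exact pattern is one plausible way to combine \p{Cc} with the u modifier.

```php
<?php
// Hypothetical helper name, chosen for this example.
function strip_control_chars(string $value): string
{
    // \p{Cc} matches Unicode control characters (U+0000-U+001F and U+007F-U+009F).
    // The u modifier makes PCRE treat both the pattern and the subject as UTF-8.
    $clean = preg_replace('/\p{Cc}+/u', '', $value);

    // With the u modifier, preg_replace() returns null when the subject
    // is not valid UTF-8, so that case is surfaced instead of silently ignored.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Sample input: accented letters plus a NUL byte and a unit separator.
$input = "caf\u{00E9}\x00 na\u{00EF}ve\x1F";

// The byte-oriented pattern from the question drops both bytes of each accented letter.
echo preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $input), "\n"; // prints "caf nave"

// The Unicode-aware pattern removes only the control characters.
echo strip_control_chars($input), "\n"; // prints "café naïve"
```

Note that \p{Cc} also covers tab, line feed, and carriage return, just as the original \x00-\x1F range did. If those should survive (for multi-line text fields, say), a pattern such as '/[^\P{Cc}\t\n\r]+/u' is one way to exclude them.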