I’ve found a Perl regexp that checks whether a string is valid UTF-8 (the regexp is from the W3C site).
    $field =~
      m/\A(
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*\z/x;
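To make clear what the pattern accepts and rejects, here is a minimal sketch that wraps it in a helper and runs it over a few byte strings. The helper name `is_valid_utf8` and the sample inputs are my own illustration, not part of the W3C snippet.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the byte string matches the W3C pattern, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x ? 1 : 0;
}

# Illustrative inputs, given as raw bytes rather than decoded characters.
my @samples = (
    [ "caf\xC3\xA9",  'UTF-8 two-byte sequence' ],
    [ "caf\xE9",      'lone latin1 byte' ],
    [ "\xC0\xAF",     'overlong encoding of a slash' ],
    [ "\xED\xA0\x80", 'encoded surrogate' ],
);

for my $sample (@samples) {
    my ($bytes, $label) = @$sample;
    printf "%-30s %s\n", $label, is_valid_utf8($bytes) ? 'valid' : 'invalid';
}
```

Only the first sample should be reported as valid. Note that the ASCII branch admits tab, LF and CR but no other control characters, so a string containing, say, a NUL byte is rejected even though it is well-formed UTF-8.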
But I’m not sure how to port it to MySQL, as MySQL doesn’t seem to support hex escapes for characters in regular expressions (see this question).
Any thoughts on how to port the regexp to MySQL?
Or do you know of another way to check whether a string is valid UTF-8?
UPDATE:
I need this check to work inside MySQL itself, because I have to run it on the server to correct broken tables. I can’t pass the data through a script, as the database is around 1 TB.
I’ve managed to repair my database using a test that works only if your data can be represented in a one-byte encoding; in my case that was latin1.
It relies on the fact that MySQL replaces bytes that aren’t valid UTF-8 with ‘?’ when converting to latin1.
Here is what the check looks like: