I’ve found a Perl regexp that can check if a string is UTF-8 (the


Asked: May 14, 2026 (2026-05-14T00:50:35+00:00)

I’ve found a Perl regexp that can check whether a string is valid UTF-8 (the regexp is from the W3C site).

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;

But I’m not sure how to port it to MySQL, as it seems that MySQL doesn’t support hex representation of characters (see this question).

Any thoughts on how to port the regexp to MySQL?
Or maybe you know another way to check whether a string is valid UTF-8?

UPDATE:
I need this check to work in MySQL, because I have to run it on the server to correct broken tables. I can’t pass the data through a script, as the database is around 1 TB.

1 Answer

Editorial Team · Answer 1 · 2026-05-14T00:50:36+00:00

I managed to repair my database using a test that works only if your data can be represented in a one-byte encoding; in my case it was latin1.

I relied on the fact that MySQL changes bytes that aren’t valid UTF-8 to ‘?’ when converting to latin1.

Here is what the check looks like:

SELECT (
         CONVERT(
           CONVERT(
              potentially_broken_column
           USING latin1)
         USING utf8)
       !=
       potentially_broken_column) AS INVALID ....
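
To list only the affected rows, the same round-trip comparison can be moved into a WHERE clause. This is a minimal sketch, not the exact query used for the repair: the table name `my_table` and the key column `id` are hypothetical placeholders. As with the check above, it assumes every legitimate character in the column can be represented in latin1.

```sql
-- Hypothetical names: my_table and id are placeholders, not from the original post.
-- A row is reported when converting the column to latin1 and back to utf8
-- does not give back the stored value, i.e. some bytes were replaced by '?'.
SELECT id,
       potentially_broken_column
FROM my_table
WHERE CONVERT(
        CONVERT(potentially_broken_column USING latin1)
      USING utf8)
      != potentially_broken_column;
```

On a table of this size it may be worth trying the query on a small range of `id` values first, since the comparison has to read every row it covers. Any valid character that falls outside latin1 would also be flagged, which is why the check only suits data that fits a one-byte encoding.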
