Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 203347
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 11, 20262026-05-11T17:22:14+00:00 2026-05-11T17:22:14+00:00

My Perl app and MySQL database now handle incoming UTF-8 data properly, but I

  • 0

My Perl app and MySQL database now handle incoming UTF-8 data properly, but I have to convert the pre-existing data. Some of the data appears to have been encoded as CP-1252 and not decoded as such before being encoded as UTF-8 and stored in MySQL. I’ve read the O’Reilly article Turning MySQL data in latin1 to utf8 utf-8, but although it’s frequently referenced, it’s not a definitive solution.

I’ve looked at Encode::DoubleEncodedUTF8 and Encoding::FixLatin, but neither worked on my data.

This is what I’ve done so far:

#Return the $bytes from the DB using BINARY()
my $characters = decode('utf-8', $bytes);
my $good = decode('utf-8', encode('cp-1252', $characters));

That fixes most of the cases, but if run against proplerly-encoded records, it mangles them. I’ve tried using Encode::Guess and Encode::Detect, but they cannot distinguish between the properly encoded and the misencoded records. So I just undo the conversion if the \x{FFFD} character is found after the conversion.

Some records, though, are only partially converted. Here’s an example where the left curly quotes are properly converted, but the right curly quotes get mangled.

perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "\xC3\xA2\xE2\x82\xAC\xC5\x93four score\xC3\xA2\xE2\x82\xAC\xC2\x9D")))'

And and here’s an example where a right single quote did not convert:

perl -CO -MEncode -e 'print decode("utf-8", encode("cp-1252", decode("utf-8", "bob\xC3\xAF\xC2\xBF\xC2\xBDs")))'

Am I also dealing with double encoded data here? What more must I do to convert these records?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-11T17:22:14+00:00Added an answer on May 11, 2026 at 5:22 pm

    With the “four score” example, it almost certainly is doubly-encoded data. It looks like either:

    1. cp1252 data that was run through a cp1252 to utf8 process twice, or
    2. utf8 data that was run through a cp1252 to utf8 process

    (Naturally, both cases look identical)

    Now, that’s what you expected, so why didn’t your code work?

    First, I’d like to refer you to this table which shows the conversion from cp1252 to unicode. The important thing I want you to note is that there are some bytes (such as 0x9D) which are not valid in cp1252.

    When I imagine writing a cp1252 to utf8 converter, therefore, I need to do something with those bytes that aren’t in cp1252. The only sensible thing I can think of is to transform the unknown bytes into unicode characters at the same value. In fact, this appears to be what happened. Let’s take your “four score” example back one step at a time.

    First, since it is valid utf-8, let’s decode with:

    $ perl -CO -MEncode -e '$a=decode("utf-8", 
      "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
      "four score" .
      "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
      for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
    

    This yields this sequence of unicode code points:

    e2 20ac 153 66 6f 75 72 20 73 63 6f 72 65 e2 20ac 9d
    

    (“fmt” is a unix command that just reformats text so that we have nice line breaks with long data)

    Now, let’s represent each of these as a byte in cp1252, but when the unicode character can’t be represented in cp1252, let’s just replace it with a byte that has the same numeric value. (Instead of the default, which is to replace it with a question mark) We should then, if we’re correct about what happened to the data, have a valid utf8 byte stream.

    $ perl -CO -MEncode -e '$a=decode("utf-8",
      "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
      "four score" .
      "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
      $a=encode("cp-1252", $a, sub { chr($_[0]) } );
      for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
    

    That third argument to encode – when it’s a sub – tells what to do with unrepresentable characters.

    This yields:

    e2 80 9c 66 6f 75 72 20 73 63 6f 72 65 e2 80 9d
    

    Now, this is a valid utf8 byte stream. Can’t tell that by inspection? Well, let’s ask perl to decode this byte stream as utf8:

    $ perl -CO -MEncode -e '$a=decode("utf-8",
      "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
      "four score" .
      "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
      $a=encode("cp-1252", $a, sub { chr($_[0]) } );
      $a=decode("utf-8", $a, 1);
      for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
    

    Passing “1” as the third argument to decode ensures that our code will croak if the byte stream is invalid. This yields:

    201c 66 6f 75 72 20 73 63 6f 72 65 201d
    

    Or printed:

    $ perl -CO -MEncode -e '$a=decode("utf-8",
      "\xC3\xA2\xE2\x82\xAC\xC5\x93" .
      "four score" .
      "\xC3\xA2\xE2\x82\xAC\xC2\x9D");
      $a=encode("cp-1252", $a, sub { chr($_[0]) } );
      $a=decode("utf-8", $a, 1);
      print "$a\n"'
    “four score”
    

    So I think that the full algorithm should be this:

    1. Grab the byte stream from mysql. Assign this to $bytestream.
    2. While $bytestream is a valid utf8 byte stream:
      1. Assign the current value of $bytestream to $good
      2. If $bytestream is all-ASCII (i.e., every byte is less than 0x80), break out of this “while … valid utf8” loop.
      3. Set $bytestream to the result of “demangle($bytestream)”, where demangle is given below. This routine undoes the cp1252-to-utf8 converter we think this data has suffered from.
    3. Put $good back in the database if it isn’t undef. If $good was never assigned, assume $bytestream was a cp1252 byte stream and convert it to utf8. (Of course, optimize and don’t do this if the loop in step 2 didn’t change anything, etc.)

    .

    sub demangle {
      my($a) = shift;
      eval { # the non-string form of eval just traps exceptions
             # so that we return undef on exception
        local $SIG{__WARN__} = sub {}; # No warning messages
        $a = decode("utf-8", $a, 1);
        encode("cp-1252", $a, sub {$_[0] <= 255 or die $_[0]; chr($_[0])});
      }
    }
    

    This is based on the assumption that it’s actually very rare for a string that isn’t all-ASCII to be a valid utf-8 byte stream unless it really is utf-8. That is, it’s not the sort of thing that happens accidentally.

    EDITED TO ADD:

    Note that this technique does not help too much with your “bob’s” example, unfortunately. I think that that string also went through two rounds of cp1252-to-utf8 conversion, but unfortunately there was also some corruption. Using the same technique as before, we first read the byte sequence as utf8 and look at the sequence of unicode character references we get:

    $ perl -CO -MEncode -e '$a=decode("utf-8",
      "bob\xC3\xAF\xC2\xBF\xC2\xBDs");
      for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
    

    This yields:

    62 6f 62 ef bf bd 73
    

    Now, it just so happens that for the three bytes ef bf bd, unicode and cp1252 agree. So representing this sequence of unicode code points in cp1252 is just:

    62 6f 62 ef bf bd 73
    

    That is, the same sequence of numbers. Now, this is in fact a valid utf-8 byte stream, but what it decodes to may surprise you:

    $ perl -CO -MEncode -e '$a=decode("utf-8",
      "bob\xC3\xAF\xC2\xBF\xC2\xBDs");
      $a=encode("cp-1252", $a, sub { chr(shift) } );
      $a=decode("utf-8", $a, 1);
      for $c (split(//,$a)) {printf "%x ",ord($c);}' | fmt
    
    62 6f 62 fffd 73
    

    That is, the utf-8 byte stream, though a legitimate utf-8 byte stream, encoded the character 0xFFFD, which is generally used for “untranslatable character”. I suspect that what happened here is that the first *-to-utf8 transformation saw a character it didn’t recognize and replaced it with “untranslatable”. There’s no way to then programmatically recover the original character.

    A consequence is that you can’t detect whether a stream of bytes is valid utf8 (needed for that algorithm I gave above) simply by doing a decode and then looking for 0xFFFD. Instead, you should use something like this:

    sub is_valid_utf8 {
      defined(eval { decode("utf-8", $_[0], 1) })
    }
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

My App needs to connect to and grab data from a MySQL database sitting
I'm putting together a small web app that writes to a database (Perl CGI
I have a mod_perl2 based web app that requires a connection to a mysql
I have a MySQL database that has a few very simple tables. I would
I'm attempting to move a web app we have (written in Perl) from an
I have written a personal web app that uses charCodeAt() to convert text that
I have a Perl app that runs some perforce operations, in the end I
I have a Perl web app and would like to log all uncaught exceptions
I have a very simple online ordering app in Perl. The user fills in
If I have this perl app: print `someshellscript.sh`; that prints bunch of stuff and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.