I have a set of data that contains garbled text fields because of encoding

Question

0

Asked: May 13, 20262026-05-13T23:50:29+00:00 2026-05-13T23:50:29+00:00

I have a set of data that contains garbled text fields because of encoding

0

I have a set of data that contains garbled text fields because of encoding errors during many import/exports from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Strangely enough, the errors are not consistent: the word ‘München‘ appears as ‘MÃ¼nchen‘ in some place and also as ‘MÃœnchen‘ in somewhere else.

Is there a trick in SQL server to correct this kind of crap? The first thing that I can think of is to exploit the COLLATE clause, so that Ã¼ is interpreted as ü, but I don’t exactly know how. If it isn’t possible to make it in the DB level, do you know any tool that helps for a bulk correction? (no manual find/replace tool, but a tool that guesses the garbled text somehow and correct them)

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-13T23:50:30+00:00

I have been in exactly the same position. The production MySQL server was set up to be latin1, old data was latin1, new data was utf8 but stored to latin1 columns, then utf8 columns were added… Each row could contain any number of encodings.

The big problem is that there is no single one solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my Utf8Voodoo class, there is a huge array of the bytes from 127 to 255, a.k.a. the legacy single-byte encoding non-ASCII characters.

// ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
// site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
// characters in 0x80-0x9F), but mb_detect_encoding never returns that
// encoding when ISO-8859-* is in the detect list, so we cannot use it.
// CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
// just a few times, but enough to make it pretty much impossible to
// automagically detect exactly which non-ISO encoding was used. Hence the
// need for "likely bytes" in addition to the "magic bytes" below.

/**
 * This array contains the magic bytes that determine possible encodings.
 * It works by elimination: the most specific byte patterns (the array's
 * keys) are listed first. When a match is found, the possible encodings
 * are that entry's value.
 */
public static $legacyEncodingsMagicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);

/**
 * This array contains the bytes that make it more likely for a string to
 * be a certain encoding. The keys are the pattern, the values are arrays
 * with (encoding => likeliness score modifier).
 */
public static $legacyEncodingsLikelyBytes = array(
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x80 | -      | -      | €      | Ç
    '/\x80/' => array(
        'Windows-1252' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x93 | -      | -      | “      | ô
    // 0x94 | -      | -      | ”      | ö
    // 0x95 | -      | -      | •      | ò
    // 0x96 | -      | -      | –      | û
    // 0x97 | -      | -      | —      | ù
    // 0x99 | -      | -      | ™      | Ö
    '/[\x93-\x97\x99]/' => array(
        'Windows-1252' => +1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x86 | -      | -      | †      | å
    // 0x87 | -      | -      | ‡      | ç
    // 0x89 | -      | -      | ‰      | ë
    // 0x8A | -      | -      | Š      | è
    // 0x8C | -      | -      | Œ      | î
    // 0x8E | -      | -      | Ž      | Ä
    // 0x9A | -      | -      | š      | Ü
    // 0x9C | -      | -      | œ      | £
    // 0x9E | -      | -      | ž      | ×
    '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
        'Windows-1252' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA4 | ¤      | €      | ¤      | ñ
    '/\xA4/' => array(
        'ISO-8859-15' => +10,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0xA6 | ¦      | Š      | ¦      | ª
    // 0xBD | ½      | œ      | ½      | ¢
    '/[\xA6\xBD]/' => array(
        'ISO-8859-15' => -1,
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x82 | -      | -      | ‚      | é
    // 0xA7 | §      | §      | §      | º
    // 0xFD | ý      | ý      | ý      | ²
    '/[\x82\xA7\xCF\xFD]/' => array(
        'CP850' => +1
    ),
    // Byte | ISO-1  | ISO-15 | W-1252 | CP850
    // 0x91 | -      | -      | ‘      | æ
    // 0x92 | -      | -      | ’      | Æ
    // 0xB0 | °      | °      | °      | ░
    // 0xB1 | ±      | ±      | ±      | ▒
    // 0xB2 | ²      | ²      | ²      | ▓
    // 0xB3 | ³      | ³      | ³      | │
    // 0xB9 | ¹      | ¹      | ¹      | ╣
    // 0xBA | º      | º      | º      | ║
    // 0xBB | »      | »      | »      | ╗
    // 0xBC | ¼      | Œ      | ¼      | ╝
    // 0xC1 | Á      | Á      | Á      | ┴
    // 0xC2 | Â      | Â      | Â      | ┬
    // 0xC3 | Ã      | Ã      | Ã      | ├
    // 0xC4 | Ä      | Ä      | Ä      | ─
    // 0xC5 | Å      | Å      | Å      | ┼
    // 0xC8 | È      | È      | È      | ╚
    // 0xC9 | É      | É      | É      | ╔
    // 0xCA | Ê      | Ê      | Ê      | ╩
    // 0xCB | Ë      | Ë      | Ë      | ╦
    // 0xCC | Ì      | Ì      | Ì      | ╠
    // 0xCD | Í      | Í      | Í      | ═
    // 0xCE | Î      | Î      | Î      | ╬
    // 0xD9 | Ù      | Ù      | Ù      | ┘
    // 0xDA | Ú      | Ú      | Ú      | ┌
    // 0xDB | Û      | Û      | Û      | █
    // 0xDC | Ü      | Ü      | Ü      | ▄
    // 0xDF | ß      | ß      | ß      | ▀
    // 0xE7 | ç      | ç      | ç      | þ
    // 0xE8 | è      | è      | è      | Þ
    '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
        'CP850' => -1
    ),
/* etc. */

Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have a set of data that contains garbled text fields because of encoding

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply