The Archive Base

Editorial Team
Asked: May 13, 2026

I have a set of data that contains garbled text fields because of encoding

I have a set of data that contains garbled text fields because of encoding errors introduced during many imports and exports from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Strangely enough, the errors are not consistent: the word 'München' appears as 'MÃ¼nchen' in some places and as 'MÃœnchen' in others.

Is there a trick in SQL Server to correct this kind of crap? The first thing I can think of is to exploit the COLLATE clause, so that Ã¼ is interpreted as ü, but I don't know exactly how. If it isn't possible at the DB level, do you know of any tool that helps with bulk correction? (Not a manual find/replace tool, but one that somehow guesses what the garbled text should be and corrects it.)

1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 11:50 pm

    I have been in exactly the same position. The production MySQL server was set up to be latin1, old data was latin1, new data was utf8 but stored to latin1 columns, then utf8 columns were added… Each row could contain any number of encodings.
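With rows in that state, a cheap first pass can separate the easy cases before any guessing starts. The sketch below is only an illustration, not code from my class: `triageEncoding` and `undoDoubleUtf8` are hypothetical names, and it assumes that the "utf8 stored to latin1 columns" damage went through Windows-1252 (which is what MySQL's latin1 mostly amounts to). It only accepts a repair when the result is valid UTF-8 and converts back to exactly the garbled input.

```php
<?php
// Hypothetical helper: sort a raw column value into a coarse bucket.
function triageEncoding(string $bytes): string
{
    // Only bytes 0x00-0x7F: identical in every encoding discussed here.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ascii';
    }
    // Well-formed UTF-8 rarely occurs by accident in single-byte text.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'utf8';
    }
    // Everything else needs the legacy single-byte heuristics.
    return 'legacy';
}

// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252".
function undoDoubleUtf8(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, leave it for the heuristics
    }
    // Map every character back to the single byte it was misread from.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    $changed    = $candidate !== $text;
    $validUtf8  = mb_check_encoding($candidate, 'UTF-8');
    // Reject lossy conversions: decoding the candidate must give the input back.
    $roundTrips = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text;

    return ($changed && $validUtf8 && $roundTrips) ? $candidate : $text;
}
```

A string that is already correct UTF-8 comes back unchanged, because its Windows-1252 form is not valid UTF-8. Rows that were mangled more than once would need the second function applied repeatedly, and rows that `triageEncoding` labels `legacy` still need the byte heuristics.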

    The big problem is that there is no single solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my Utf8Voodoo class, there is a huge array of the bytes from 128 to 255 (0x80 to 0xFF), a.k.a. the non-ASCII characters of the legacy single-byte encodings.

    // ISO-8859-15 has the Euro sign, but ISO-8859-1 has also been used on the
    // site. Sigh. Windows-1252 has the Euro sign at 0x80 (and other printable
    // characters in 0x80-0x9F), but mb_detect_encoding never returns that
    // encoding when ISO-8859-* is in the detect list, so we cannot use it.
    // CP850 has accented letters and currency symbols in 0x80-0x9F. It occurs
    // just a few times, but enough to make it pretty much impossible to
    // automagically detect exactly which non-ISO encoding was used. Hence the
    // need for "likely bytes" in addition to the "magic bytes" below.
    
    /**
     * This array contains the magic bytes that determine possible encodings.
     * It works by elimination: the most specific byte patterns (the array's
     * keys) are listed first. When a match is found, the possible encodings
     * are that entry's value.
     */
    public static $legacyEncodingsMagicBytes = array(
        '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
        '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
        '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
    );
    
    /**
     * This array contains the bytes that make it more likely for a string to
     * be a certain encoding. The keys are the pattern, the values are arrays
     * with (encoding => likeliness score modifier).
     */
    public static $legacyEncodingsLikelyBytes = array(
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x80 | -      | -      | €      | Ç
        '/\x80/' => array(
            'Windows-1252' => +10,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x93 | -      | -      | “      | ô
        // 0x94 | -      | -      | ”      | ö
        // 0x95 | -      | -      | •      | ò
        // 0x96 | -      | -      | –      | û
        // 0x97 | -      | -      | —      | ù
        // 0x99 | -      | -      | ™      | Ö
        '/[\x93-\x97\x99]/' => array(
            'Windows-1252' => +1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x86 | -      | -      | †      | å
        // 0x87 | -      | -      | ‡      | ç
        // 0x89 | -      | -      | ‰      | ë
        // 0x8A | -      | -      | Š      | è
        // 0x8C | -      | -      | Œ      | î
        // 0x8E | -      | -      | Ž      | Ä
        // 0x9A | -      | -      | š      | Ü
        // 0x9C | -      | -      | œ      | £
        // 0x9E | -      | -      | ž      | ×
        '/[\x86\x87\x89\x8A\x8C\x8E\x9A\x9C\x9E]/' => array(
            'Windows-1252' => -1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0xA4 | ¤      | €      | ¤      | ñ
        '/\xA4/' => array(
            'ISO-8859-15' => +10,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0xA6 | ¦      | Š      | ¦      | ª
        // 0xBD | ½      | œ      | ½      | ¢
        '/[\xA6\xBD]/' => array(
            'ISO-8859-15' => -1,
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x82 | -      | -      | ‚      | é
        // 0xA7 | §      | §      | §      | º
        // 0xCF | Ï      | Ï      | Ï      | ¤
        // 0xFD | ý      | ý      | ý      | ²
        '/[\x82\xA7\xCF\xFD]/' => array(
            'CP850' => +1
        ),
        // Byte | ISO-1  | ISO-15 | W-1252 | CP850
        // 0x91 | -      | -      | ‘      | æ
        // 0x92 | -      | -      | ’      | Æ
        // 0xB0 | °      | °      | °      | ░
        // 0xB1 | ±      | ±      | ±      | ▒
        // 0xB2 | ²      | ²      | ²      | ▓
        // 0xB3 | ³      | ³      | ³      | │
        // 0xB9 | ¹      | ¹      | ¹      | ╣
        // 0xBA | º      | º      | º      | ║
        // 0xBB | »      | »      | »      | ╗
        // 0xBC | ¼      | Œ      | ¼      | ╝
        // 0xC1 | Á      | Á      | Á      | ┴
        // 0xC2 | Â      | Â      | Â      | ┬
        // 0xC3 | Ã      | Ã      | Ã      | ├
        // 0xC4 | Ä      | Ä      | Ä      | ─
        // 0xC5 | Å      | Å      | Å      | ┼
        // 0xC8 | È      | È      | È      | ╚
        // 0xC9 | É      | É      | É      | ╔
        // 0xCA | Ê      | Ê      | Ê      | ╩
        // 0xCB | Ë      | Ë      | Ë      | ╦
        // 0xCC | Ì      | Ì      | Ì      | ╠
        // 0xCD | Í      | Í      | Í      | ═
        // 0xCE | Î      | Î      | Î      | ╬
        // 0xD9 | Ù      | Ù      | Ù      | ┘
        // 0xDA | Ú      | Ú      | Ú      | ┌
        // 0xDB | Û      | Û      | Û      | █
        // 0xDC | Ü      | Ü      | Ü      | ▄
        // 0xDF | ß      | ß      | ß      | ▀
        // 0xE7 | ç      | ç      | ç      | þ
        // 0xE8 | è      | è      | è      | Þ
        '/[\x91\x92\xB0-\xB3\xB9-\xBC\xC1-\xC5\xC8-\xCE\xD9-\xDC\xDF\xE7\xE8]/' => array(
            'CP850' => -1
        ),
        /* etc. */
    );
    

    Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.
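To give an idea of that loop, here is a minimal sketch rather than the actual Utf8Voodoo code. The function name `guessLegacyEncoding`, the trimmed-down tables, and the tie-breaking rule (the first candidate in the list wins a tie) are all assumptions made for the example.

```php
<?php
// Trimmed copies of the two tables above, just enough for a demo.
$magicBytes = array(
    '/[\x81\x8D\x8F\x90\x9D]/' => array('CP850'),
    '/[\x80\x82-\x8C\x8E\x91-\x9C\x9E\x9F]/' => array('Windows-1252', 'CP850'),
    '/./' => array('ISO-8859-15', 'ISO-8859-1', 'Windows-1252', 'CP850'),
);
$likelyBytes = array(
    '/\x80/' => array('Windows-1252' => +10),
    '/[\x93-\x97\x99]/' => array('Windows-1252' => +1),
    '/\xA4/' => array('ISO-8859-15' => +10),
    '/[\x82\xA7\xCF\xFD]/' => array('CP850' => +1),
);

// Hypothetical function: returns the best guess, or null for an empty string.
function guessLegacyEncoding(string $bytes, array $magicBytes, array $likelyBytes): ?string
{
    // Elimination step: the first (most specific) matching pattern wins.
    $candidates = array();
    foreach ($magicBytes as $pattern => $encodings) {
        if (preg_match($pattern, $bytes)) {
            $candidates = $encodings;
            break;
        }
    }
    if ($candidates === array()) {
        return null;
    }

    $scores = array_fill_keys($candidates, 0);

    // strlen() and [] work on bytes, which is exactly what is needed here.
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $byte = $bytes[$i];
        if (ord($byte) < 0x80) {
            continue; // ASCII says nothing about the encoding
        }
        foreach ($likelyBytes as $pattern => $modifiers) {
            if (!preg_match($pattern, $byte)) {
                continue;
            }
            foreach ($modifiers as $encoding => $delta) {
                // Ignore encodings the magic bytes already ruled out.
                if (isset($scores[$encoding])) {
                    $scores[$encoding] += $delta;
                }
            }
        }
    }

    // Highest score wins; on a tie the earlier candidate is kept (assumption).
    $best = $candidates[0];
    foreach ($candidates as $encoding) {
        if ($scores[$encoding] > $scores[$best]) {
            $best = $encoding;
        }
    }
    return $best;
}
```

The winning encoding can then be passed to `mb_convert_encoding($bytes, 'UTF-8', $best)` or `iconv()`. Because every matching byte adds to the score, a byte that occurs several times counts several times, which may or may not be what you want for your data.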
