I have a set of data that contains garbled text fields because of encoding errors during many import/exports from one database to another. Most of the errors were caused by converting UTF-8 to ISO-8859-1. Strangely enough, the errors are not consistent: the word ‘München‘ appears as ‘München‘ in some place and also as ‘MÜnchen‘ in somewhere else.
Is there a trick in SQL server to correct this kind of crap? The first thing that I can think of is to exploit the COLLATE clause, so that ü is interpreted as ü, but I don’t exactly know how. If it isn’t possible to make it in the DB level, do you know any tool that helps for a bulk correction? (no manual find/replace tool, but a tool that guesses the garbled text somehow and correct them)
I have been in exactly the same position. The production MySQL server was set up to be latin1, old data was latin1, new data was utf8 but stored to latin1 columns, then utf8 columns were added… Each row could contain any number of encodings.
The big problem is that there is no single one solution that corrects everything, because a lot of legacy encodings use the same bytes for different characters. That means you will have to resort to heuristics. In my
Utf8Voodooclass, there is a huge array of the bytes from 127 to 255, a.k.a. the legacy single-byte encoding non-ASCII characters.Then you loop over the bytes (not characters) in the strings and keep the scores. Let me know if you want some more info.