I have a PostgreSQL database I would like to convert to UTF-8.
The problem is that it is currently SQL_ASCII, so it hasn’t been doing any kind of encoding conversion on its input, and as a result the tables have ended up holding data in a mix of encodings. One row might contain values encoded as UTF-8, another might be ISO-8859-x, or Windows-125x, etc.
This has made performing a dump of the database, and converting it to UTF-8 with the intention of importing it into a fresh UTF-8 database, difficult. If the data were all of one encoding type, I could just run the dump file through iconv, but I don’t think that approach works here.
Is the problem fundamentally down to knowing how each piece of data is encoded? Here, where that is not known, can it be worked out, or even guessed? Ideally I’d love a script which would take a file, any file, and spit out valid UTF-8.
This is exactly the problem that Encoding::FixLatin was written to solve*.
If you install the Perl module then you’ll also get the fix_latin command-line utility, which works as a filter: it reads the mixed-encoding data on standard input and writes UTF-8 to standard output.
Read the ‘Limitations’ section of the documentation to understand how it works.
[*] Note I’m assuming that when you say ISO-8859-x you mean ISO-8859-1 and when you say CP125x you mean CP1252 – because the mix of ASCII, UTF-8, Latin-1 and WinLatin-1 is a common case. But if you really do have a mixture of eastern and western encodings then sorry but you’re screwed 🙁
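To give a feel for the kind of guessing involved, here is a minimal Python sketch of the general idea: leave ASCII and well-formed UTF-8 sequences alone, and treat any other high byte as CP1252 (falling back to Latin-1 for the few bytes CP1252 leaves undefined). This is not the module's actual code, and the names to_utf8 and _fix are made up for the illustration.

```python
import re

# One well-formed multi-byte UTF-8 sequence (2 to 4 bytes), or failing that,
# a single byte >= 0x80. Alternatives are tried in order, so a valid UTF-8
# sequence always wins over the lone-byte fallback.
_UTF8_OR_HIGH_BYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
    rb"|[\x80-\xFF]"
)


def _fix(match):
    chunk = match.group(0)
    if len(chunk) > 1:
        # Already a valid UTF-8 sequence: pass it through untouched.
        return chunk
    try:
        # Assumption: a stray high byte is CP1252 (a superset of the
        # printable part of ISO-8859-1).
        return chunk.decode("cp1252").encode("utf-8")
    except UnicodeDecodeError:
        # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in CP1252;
        # fall back to Latin-1, which maps every byte.
        return chunk.decode("latin-1").encode("utf-8")


def to_utf8(data: bytes) -> bytes:
    """Best-effort conversion of mixed ASCII/UTF-8/Latin-1/CP1252 bytes."""
    return _UTF8_OR_HIGH_BYTE.sub(_fix, data)
```

You would call it on raw bytes, for example the contents of a dump file opened in binary mode. The guess can be wrong: a run of Latin-1 bytes that happens to look like a valid UTF-8 sequence is passed through as UTF-8, which is the sort of ambiguity the ‘Limitations’ section is about.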