A software is producing UTF-8 files, but writing content to the file that isn’t unicode. I can’t change that software and have to take the output as it is now. Don’ t know if this will show up here correctly, but an german umlaut “ä” is shown in the file as “ä”.
If I open the file in Notepad++, it tells me the file is UTF-8 (without BOM) encoded. Now, if I say “convert to ANSI” in Notepad and then switch the file encoding back to UTF-8 (without converting), the German umlauts in the file are correct. How can I achieve the exact same behaviour in Perl? Whatever I tried up to now, the umlaut mess just got worse.
To reproduce, create yourself an UTF-8 encoded file and write content to it:
Ok, I’ll try. Create yourself a UTF-8 file and write this to it:
Männer Schüle Vöogel SüÃ
Then, on an UTF-8 mysql database, create a table with varchar field an UTF8_unicode encoding. Now, use this script:
use utf8;
use DBI;
use Encode;
if (open FILE, "test.csv") {
my $db = DBI->connect(
'DBI:mysql:your_db;host=127.0.0.1;mysql_compression=1', 'root', 'Yourpass',
{ PrintError => 1 }
);
my $sql="";
my $sql = qq{SET NAMES 'utf8';};
$db->do($sql);
while (my $line = <FILE>) {
my $sth = $db->prepare("INSERT IGNORE INTO testtable (testline) VALUES (?);");
$sth->execute($line);
}
}
The exact contents of file will get written to the database. But, the output I expect in database is with German umlauts:
Männer Schüler Vögel Süß
So, how can I convert that correctly?
Sounds like something is converting it a second time, assuming it to be something like ISO 8859-15 and then converting that to UTF-8. You can reverse this by converting UTF-8 to ISO 8859-15 (or whichever encoding seems to make sense for your data).
As seen on http://www.fileformat.info/info/unicode/char/E4/index.htm the bytes 0xC3 0xA4 are the valid UTF-8 encoding of
ä. When viewed as ISO 8859-15 (or 8859-1, or Windows-1252, or a number of other 8-bit encodings) they display the stringä.