I am importing the contents of an Excel-generated CSV file into an XML document like this:
$csv = fopen($csvfile, 'r'); // the mode must be a quoted string
$words = array();
while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}
The inserted data are English/German expressions.
I insert these values into an XML structure and output the XML as follows:
$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary) -> ownerDocument;
$dom -> formatOutput = true;
header('Content-Type: text/xml; charset=utf-8'); // MIME type and charset for XML output (Content-Encoding is for compression, not charsets)
echo $dom -> saveXML();
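For illustration, the "do things" step could look roughly like the sketch below. The element names `entry`, `en` and `de` and the sample `$words` array are assumptions, not taken from the actual code.

```php
<?php
// Hypothetical sample data in the same shape as $words from the CSV loop.
$words = [
    ['en' => 'Austria', 'de' => 'Österreich'],
    ['en' => 'Egypt', 'de' => 'Ägypten'],
];

$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
foreach ($words as $word) {
    // 'entry', 'en' and 'de' are assumed element names.
    $entry = $dictionary->addChild('entry');
    // Property assignment creates the child element and escapes
    // special characters such as "&" in the value.
    $entry->en = $word['en'];
    $entry->de = $word['de'];
}

$dom = dom_import_simplexml($dictionary)->ownerDocument;
$dom->encoding = 'UTF-8'; // declare UTF-8 in the XML prolog
$dom->formatOutput = true;
$xml = $dom->saveXML();
```

Building the string first (instead of echoing directly) makes it easy to inspect the result before sending any headers.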
This is working fine, yet I am encountering one really strange problem. When the first letter of a string is an umlaut (as in Österreich or Ägypten), the character is omitted, resulting in gypten or sterreich. If the umlaut is in the middle of the string (Russische Föderation), it is transferred correctly. The same goes for characters like ß or é.
All files are UTF-8 encoded and served in UTF-8.
This seems rather strange and bug-like to me, yet maybe I am missing something; there are a lot of smart people around here.
OK, so this seems to be a bug in fgetcsv. I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.
This is (a not-yet-optimized version of) what I am doing:
The getCSVValues function is coming from here and is needed to deal with CSV lines like this (commas!):
It looks like:
Quite a bit of a workaround, but it seems to work fine.
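As a rough illustration of this kind of manual processing, here is a minimal sketch. It is not the getCSVValues function referenced above; `splitCsvLine` and `readWordPairs` are hypothetical names, and the sketch assumes comma-separated, double-quoted fields without line breaks inside a field. It works byte by byte, which is safe for UTF-8 because the delimiter and quote are ASCII.

```php
<?php
/**
 * Hypothetical helper: splits one CSV line into fields, honouring
 * double-quoted fields that may contain commas and doubled quotes ("").
 */
function splitCsvLine(string $line, string $delimiter = ',', string $quote = '"'): array
{
    $fields = [];
    $field = '';
    $inQuotes = false;
    $length = strlen($line);

    for ($i = 0; $i < $length; $i++) {
        $char = $line[$i];
        if ($inQuotes) {
            if ($char !== $quote) {
                $field .= $char;
            } elseif ($i + 1 < $length && $line[$i + 1] === $quote) {
                $field .= $quote; // doubled quote inside a quoted field
                $i++;
            } else {
                $inQuotes = false; // closing quote
            }
        } elseif ($char === $quote) {
            $inQuotes = true; // opening quote
        } elseif ($char === $delimiter) {
            $fields[] = $field; // field finished
            $field = '';
        } else {
            $field .= $char;
        }
    }
    $fields[] = $field; // last field on the line
    return $fields;
}

/** Hypothetical helper: reads English/German pairs without fgetcsv(). */
function readWordPairs(string $path): array
{
    $words = [];
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $pair = splitCsvLine(rtrim($line, "\r\n"));
        if (count($pair) >= 2) {
            $words[] = ['en' => $pair[0], 'de' => $pair[1]];
        }
    }
    return $words;
}
```

Calling `readWordPairs($csvfile)` would then yield the same array shape as the original fgetcsv loop.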
EDIT:
There’s also a filed bug for this; apparently the behaviour depends on the locale settings.
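Since the PHP manual notes that fgetcsv takes the locale settings into account, one thing that might be worth trying is switching LC_CTYPE to a UTF-8 locale before reading. The sketch below is untested against the original setup: `readPairsWithUtf8Locale` is a hypothetical name, and the locale names are assumptions that depend on what is installed on the server (`locale -a` lists them).

```php
<?php
/**
 * Hypothetical helper: reads English/German pairs with fgetcsv() after
 * switching LC_CTYPE to a UTF-8 locale, then restores the old setting.
 */
function readPairsWithUtf8Locale(string $path): array
{
    // Passing "0" returns the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');
    // setlocale() tries each candidate in turn and returns false
    // if none of them is installed; the names here are assumptions.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'de_DE.UTF-8', 'C.UTF-8');

    $words = [];
    $csv = fopen($path, 'r');
    // Length 0 means "no line length limit"; the remaining arguments
    // are the usual separator, enclosure and escape characters.
    while (($pair = fgetcsv($csv, 0, ',', '"', '\\')) !== false) {
        if (count($pair) < 2) {
            continue; // skip blank or malformed rows
        }
        $words[] = ['en' => $pair[0], 'de' => $pair[1]];
    }
    fclose($csv);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous); // restore the original locale
    }
    return $words;
}
```

If none of the candidate locales exists on the machine, the function falls back to whatever locale was already active, so the leading-umlaut problem may persist in that case.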