I have a PHP script that imports and parses XML files and saves the data into the database:
- Database collation: utf8_general_ci, charset: utf8
- Page’s charset: utf-8
- XML files: ANSI, contains smart quotes (from MS Word)
So during import I do a utf8_encode() on the text from the XML files prior to saving into the database and subsequently displaying on the page.
But when successfully imported, and saved into DB,
- Database: smart quotes are saved as a ? character (viewed from CMD)
- Page: smart quotes are displayed as boxes
Any ideas as to why the smart quotes are not being converted correctly, even when using utf8_encode()?
EDIT:
@Tomalak: The XML files are actually .txt, no XML declaration (<?xml ... ?>), and no root element. My script actually adds a root element just so the parser works:
utf8_encode('<article>' . file_get_contents($xmlfile) . '</article>');
Seems like I need to add an XML declaration..? If so, what should it look like?
If your XML string (i.e. file contents) is not encoded as UTF-8, you need an XML declaration that denotes the file encoding. If an XML declaration is missing, the parser will assume UTF-8.
As long as you do not use “special” characters (i.e. anything outside of the ASCII range), it will work without a declaration even if your file is not really UTF-8-encoded. This is because UTF-8 is byte-compatible with ASCII. But as soon as you use characters that come from one of the legacy code pages, like the “smart quotes”, it will break, because those characters are represented by different bytes in UTF-8.
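To make the byte difference concrete, here is a minimal sketch. It assumes the mbstring extension is available, and the sample string is made up for illustration. It also shows why utf8_encode() does not help here: that function treats its input as ISO-8859-1, where bytes 0x93 and 0x94 are control characters rather than curly quotes.

```php
<?php
// Bytes 0x93 and 0x94 are the left and right curly quotes in Windows-1252.
$cp1252 = "\x93quoted\x94";

// Converting from the real source encoding gives U+201C and U+201D.
$correct = mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252');

// Treating the same bytes as ISO-8859-1 (the assumption utf8_encode() makes)
// gives the C1 control characters U+0093 and U+0094 instead.
$wrong = mb_convert_encoding($cp1252, 'UTF-8', 'ISO-8859-1');

echo bin2hex($correct), "\n"; // e2809c71756f746564e2809d
echo bin2hex($wrong), "\n";   // c29371756f746564c294
```

The second result is valid UTF-8, but it encodes invisible control characters, which would be consistent with boxes showing up on the page.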
In your case there are text files in a legacy encoding that you wrap with a root element to turn them into well-formed XML. Therefore you need to add the XML declaration yourself, e.g. <?xml version="1.0" encoding="Windows-1252"?>, in front of the root element you add.
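A minimal sketch of how that could look in the import script, assuming the files really are Windows-1252 and that the DOM extension is installed. The helper name loadLegacyFragment is hypothetical, and the literal string stands in for file_get_contents($xmlfile).

```php
<?php
// Hypothetical helper: wrap a legacy-encoded fragment in a root element and
// declare its encoding so the parser knows how to decode the bytes.
function loadLegacyFragment(string $rawBytes, string $encoding = 'Windows-1252'): DOMDocument
{
    $xml = '<?xml version="1.0" encoding="' . $encoding . '"?>'
         . '<article>' . $rawBytes . '</article>';

    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('Fragment is not well-formed XML');
    }
    return $doc;
}

// Stand-in for file_get_contents($xmlfile); 0x93 and 0x94 are curly quotes.
$doc = loadLegacyFragment("<p>\x93Hello\x94</p>");

// Strings read back from the DOM are UTF-8, so no utf8_encode() call is needed.
$text = $doc->documentElement->textContent;
echo bin2hex($text), "\n"; // e2809c48656c6c6fe2809d
```

Note that the raw bytes go into the string untouched. Running utf8_encode() over them first, as in the question, would convert them under the wrong assumption before the parser ever sees the declaration.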
This way you instruct the DOMDocument how to interpret the bytes in your string. I assumed Windows-1252 for you because you said ANSI and mentioned the curly quotes.
In fact, 95% of the time this is what people really mean, even on Linux and even if they say ISO-8859-1 (or latin-1), which is almost, but not exactly the same thing.
To be extra sure you can open your text files in a hex editor, spot a few special characters and compare their byte values with the suspected encoding. For Windows-1252, the expected byte values for the curly quotes would be:
- “ 147 (0x93)
- ” 148 (0x94)
Once the meaning of the individual bytes in your string is declared, DOMDocument can make sense of them and does the right thing.
When it comes to the ? in the DB, I strongly suspect there is some automagic encoding conversion going on. I admit that I don’t know enough about PHP/MySQL/Unicode integration to say for sure.
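If no hex editor is at hand, the byte check described above can be roughly scripted. This is only a sketch: nonAsciiBytes is a hypothetical helper, and the sample string stands in for the contents of one of your text files.

```php
<?php
// Hypothetical helper: list each distinct non-ASCII byte in a string with its
// decimal and hex value, for comparison against a code page table.
function nonAsciiBytes(string $bytes): array
{
    $found = [];
    foreach (array_unique(str_split($bytes)) as $char) {
        $value = ord($char);
        if ($value > 127) {
            $found[] = sprintf('%d (0x%02X)', $value, $value);
        }
    }
    return $found;
}

// Stand-in for file_get_contents($xmlfile).
$sample = "She said \x93hi\x94 and \x93bye\x94.";
print_r(nonAsciiBytes($sample)); // lists "147 (0x93)" and "148 (0x94)"
```

If the values reported for your curly quotes are 147 and 148, that supports the Windows-1252 guess. Multi-byte sequences starting with 0xE2 would instead suggest the file is already UTF-8.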