I have an HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù
I have a PHP script that exports these form submissions into CSV format. When I look at the CSV file in a text editor (vim or Notepad, for example) the characters look fine, but when it is opened with Open Office or Word, I get some funky results: �����
I am also passing these submissions to Salesforce and am getting an error: “The entity “Atilde” was referenced, but not declared.”
What can I do to ensure portability of my CSV file? What’s the proper way to handle the encoding?
My HTML file's content type is set as: Content-Type: text/html; charset=utf-8
Data is being stored in MySQL as latin1_swedish_ci collation.
Total encoding confusion! 🙂
The table character set
The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.
The connection character set
The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in). You can set it by executing SET NAMES "utf8" after connecting.
Page character set
The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.
Recommendations
Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.
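As a rough sketch of what "the same encoding everywhere" could look like in PHP: the helper below builds a PDO connection string that requests a UTF-8 connection, which has the same effect as running SET NAMES after connecting. The function name, host and database name are hypothetical, and PDO is only one option (mysqli offers set_charset for the same purpose). Note that MySQL's "utf8" covers only 3-byte characters; "utf8mb4" is its full UTF-8 character set.

```php
<?php
// Hypothetical helper: builds a PDO DSN that requests the given connection
// character set, so no separate SET NAMES query is needed.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

echo buildMysqlDsn('localhost', 'forms'), "\n";

// Usage sketch (needs a real database and credentials, so left commented out):
// $pdo = new PDO(buildMysqlDsn('localhost', 'forms'), $user, $password);
// header('Content-Type: text/html; charset=utf-8'); // page character set
```

With the table, the connection and the page all on UTF-8, PHP never has to convert text on its way between the database and the browser.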
However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application opening the file to guess the encoding, and as you’ve seen, the guess can be wrong.
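To illustrate what a wrong guess does, here is a small sketch. It takes the UTF-8 bytes for "è" and reinterprets them as Latin-1, which is roughly what an application does when it assumes the wrong encoding. The helper name is made up for this example, and it assumes the mbstring extension is available.

```php
<?php
// "è" encoded as UTF-8 is two bytes: 0xC3 0xA8.
$utf8 = "\xC3\xA8";

// Hypothetical helper: treats every input byte as one Latin-1 character and
// re-encodes the result as UTF-8 so the misreading can be printed.
function misreadAsLatin1(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex($utf8), "\n";          // c3a8
echo misreadAsLatin1($utf8), "\n";  // Ã¨ (two characters instead of one)
```

A text editor that detects UTF-8 shows the single character, while a program that assumes a one-byte encoding shows two unrelated ones.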
Your best bet is to use Latin-1 for the CSV file. I’d still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.
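A minimal sketch of such an export step, assuming the rows arrive from the database as UTF-8 strings. The function name is hypothetical, and mb_convert_encoding (from the mbstring extension) is used for the conversion here; characters that do not exist in Latin-1 come out as "?".

```php
<?php
// Hypothetical helper: turns rows of UTF-8 strings into Latin-1 encoded CSV text.
function rowsToLatin1Csv(array $rows): string
{
    // Write the CSV into a temporary in-memory stream first.
    $fh = fopen('php://temp', 'r+');
    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly.
        fputcsv($fh, $row, ',', '"', '\\');
    }
    rewind($fh);
    $utf8Csv = stream_get_contents($fh);
    fclose($fh);

    // Convert the finished CSV text from UTF-8 to Latin-1 in one step.
    return mb_convert_encoding($utf8Csv, 'ISO-8859-1', 'UTF-8');
}

$csv = rowsToLatin1Csv([
    ['name', 'city'],
    ["Jos\xC3\xA9", "Z\xC3\xBCrich"], // "José", "Zürich" as UTF-8 bytes
]);
echo bin2hex($csv), "\n";
```

The returned string can be written to a file or sent as a download; each accented letter now occupies a single byte.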
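If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you’ll need to run the text through utf8_decode to convert it to Latin-1. Note that utf8_decode is deprecated as of PHP 8.2; mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8') performs the same conversion.
That entity problem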
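This sounds like you’re passing HTML code in an XML context, and is unrelated to character sets: XML predefines only five entities (&amp;, &lt;, &gt;, &quot; and &apos;), so an HTML entity such as &Atilde; is undeclared there. Try running the text through html_entity_decode before sending it.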
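A possible shape for that step, as a sketch only. It assumes the text was HTML-entity-encoded somewhere earlier (for example by htmlentities) and that the XML payload is UTF-8; the function name is invented for this example. After decoding, it re-escapes only the characters XML itself requires.

```php
<?php
// Hypothetical helper: converts HTML-entity-encoded text into text that is
// safe to place inside a UTF-8 XML document.
function htmlTextToXmlText(string $htmlEncoded): string
{
    // Turn named HTML entities such as &Atilde; or &eacute; into real characters.
    $plain = html_entity_decode($htmlEncoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');

    // Re-escape only what XML needs: &, <, > and quotes.
    return htmlspecialchars($plain, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

echo htmlTextToXmlText('Caf&eacute; &amp; bar &lt;b&gt;'), "\n"; // Café &amp; bar &lt;b&gt;
```

If an XML library such as DOMDocument or XMLWriter builds the payload, the second step is usually handled by the library, and only the html_entity_decode call is needed.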