I’m using PHP against a UTF-8 compliant database. Here’s how input goes in:
- user types input into textarea
- textarea value encoded with JavaScript escape()
- passed via HTTP post
- decoded with PHP rawurldecode()
- passed through HTMLPurifier with default settings
- escaped for MySQL and stored in database
It comes out in the usual way, and I run unescape() on page load. This is to allow people to, say, copy and paste directly from a Word document and have the smart quotes show up.
But HTMLPurifier seems to be clobbering special characters that escape() turns into a simple two-digit % expression, like Ö, which escapes to %D6. Smart quotes, by contrast, escape to something like %u201C and go into the database that way. HTMLPurifier takes out both the special character and the one immediately following it.
I need to change something in this process. Perhaps I need to change multiple things.
What can I do to not get special characters clobbered?
escape isn’t safe for non-ASCII characters. Use encodeURIComponent instead.
I assume that you use XMLHttpRequest? If not, make sure that the page containing the form is served as UTF-8.
If you access the value through $_POST, you should not decode it, since PHP has already done that. Decoding it a second time will mess up the data.
Make sure you don’t have magic quotes turned on. Make sure that the database stores tables as UTF-8 (both the character set and the collation must be UTF-8). Make sure that the connection between PHP and MySQL is UTF-8 (use SET NAMES utf8 if you don’t use PDO).
Finally, make sure that the page is served as UTF-8 when you output the string again.
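To illustrate the first point, here is a minimal sketch comparing what escape() and encodeURIComponent() produce for the two kinds of characters from the question. The helper buildPostBody and the field name "comment" are made up for the example and are not from the original post. It relies on escape() being available as a global, which is the case in browsers and in Node.js.

```javascript
// Characters from the question, written as escapes so the file encoding
// of this snippet does not matter.
const oUmlaut = '\u00D6';   // Ö
const leftQuote = '\u201C'; // left smart quote

// escape() emits a single Latin-1 byte for Ö and a non-standard %uXXXX
// form for the smart quote. Neither is UTF-8 percent-encoding.
console.log(escape(oUmlaut));   // %D6
console.log(escape(leftQuote)); // %u201C

// encodeURIComponent() always percent-encodes the UTF-8 bytes.
console.log(encodeURIComponent(oUmlaut));   // %C3%96
console.log(encodeURIComponent(leftQuote)); // %E2%80%9C

// Hypothetical helper: build a form-encoded POST body for one field.
// The field name "comment" is an assumption for illustration only.
function buildPostBody(text) {
  return 'comment=' + encodeURIComponent(text);
}

const body = buildPostBody(oUmlaut + ' ' + leftQuote + 'quoted\u201D');
console.log(body); // comment=%C3%96%20%E2%80%9Cquoted%E2%80%9D

// decodeURIComponent() stands in here for the decoding the server does
// once on receipt; the round trip gives back the original text.
console.log(decodeURIComponent(body.slice('comment='.length)));
```

With this encoding the server receives valid UTF-8 after its single automatic decode, so there should be no need for rawurldecode() on the way in or unescape() on the way out.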