Here’s the workflow:
- user types in Word; Word changes a single apostrophe to a “smart quote”
- user pastes the text from Word into a form on a web page; the page the form is in is encoded in UTF-8
- the data gets saved into a MySQL database with the encoding latin1
- when retrieved from the database by a PHP app (which assumes the database encoding is UTF-8) and displayed in a UTF-8 web page, the quote displays as ’
I realise there’s a mismatch between the encoding of the input and output pages and the database. That I’m going to fix.
Shouldn’t the character survive the trip to and from the database anyway?
And how does a single character (0x92 if I’m not confused) go through that process and come out the other end as three characters?
Can someone talk me through what’s happening to the bytes at each stage of the process?
Step 1:
Word converts ' to ’ (Unicode codepoint U+2019, RIGHT SINGLE QUOTATION MARK).
Step 2:
’ is encoded into UTF-8 as E2 80 99.
Step 3:
This appears to be where the problem occurs. It looks like the UTF-8 string is stored without conversion in the latin-1-encoded MySQL field: E2 80 99 in latin-1 is ’.
Step 4:
Either here or in the previous step, that falsely used latin-1 string is converted to UTF-8. ’ in UTF-8 is C3 A2 E2 82 AC E2 84 A2. This will display on a UTF-8-encoded website as ’.
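The steps above can be traced byte by byte with a short script. This is only a sketch in Python rather than the PHP/MySQL stack from the question, and it assumes MySQL's latin1 behaves like Windows-1252 (Python's "cp1252" codec), since strict ISO-8859-1 would map 0x80 and 0x99 to invisible control characters instead of € and ™. The variable names and the hexdump helper are made up for illustration.

```python
# Trace U+2019 through the pipeline described in steps 1-4.
# Assumption: MySQL "latin1" is modelled with Python's "cp1252" codec.

def hexdump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'E2 80 99'."""
    return " ".join(f"{b:02X}" for b in data)

quote = "\u2019"                            # Step 1: RIGHT SINGLE QUOTATION MARK

cp1252_byte = quote.encode("cp1252")        # the single byte Word would use in cp1252
utf8_bytes = quote.encode("utf-8")          # Step 2: what the UTF-8 form submits

misread = utf8_bytes.decode("cp1252")       # Step 3: bytes treated as latin1 text
double_encoded = misread.encode("utf-8")    # Step 4: that text converted to UTF-8
displayed = double_encoded.decode("utf-8")  # what the UTF-8 page renders

# Reversing the wrong conversion recovers the original character.
repaired = displayed.encode("cp1252").decode("utf-8")

print("cp1252 byte:   ", hexdump(cp1252_byte))
print("UTF-8 bytes:   ", hexdump(utf8_bytes))
print("misread as:    ", misread)
print("double-encoded:", hexdump(double_encoded))
print("displayed as:  ", displayed)
print("repaired:      ", repaired)
```

So the 0x92 from the question is the character's single-byte Windows-1252 form, which never enters this pipeline: the UTF-8 form already sends three bytes, and each of those three bytes is then mistaken for a character of its own. The last line of the script suggests the damage is reversible as long as no bytes were lost along the way.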