On our website, some Mac users have trouble when they copy and paste text from PDF files into a textarea (handled by TinyMCE). All accented characters are corrupted and become, for example, e? for é, i? for î, etc. I cannot reproduce this problem on a Windows computer.
When I wrote the content of the textarea to a file (before inserting it into the database), I discovered that the initial é is visually different from a regular é (in Vim, see below).

Indeed:
// the corrupted é - first line of the screenshot
echo bin2hex($char); // displays 65cc81
// regular é
echo bin2hex('é'); // displays c3a9
After a lot of searching, here is where I am:
It seems that Mac OS copies accented Unicode characters as a combination of two characters: in our example, e + ́ (a plain e followed by a combining acute accent). So far, I haven't found any way to replace the corrupted é with the real one, to avoid e? in the database.
And I’m a little desperate.
The process of normalizing the representation to one form or the other (a single precomposed é, or e followed by a combining accent) is called, well, normalization. In PHP there's the Normalizer class for that; sending all input through it is a good idea. You likely want to normalize to form C, Canonical Decomposition followed by Canonical Composition.
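As a rough sketch of what that could look like, assuming PHP 7+ with the intl extension loaded (which is what provides Normalizer); the helper name normalize_input is made up for illustration and is not part of any library:

```php
<?php
// Sketch only: assumes the intl extension is installed and enabled.

/**
 * Hypothetical helper: returns $text normalized to form C (NFC).
 * Normalizer::normalize() returns false on failure (for example on
 * invalid UTF-8), in which case the input is returned unchanged.
 */
function normalize_input(string $text): string
{
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    return $normalized === false ? $text : $normalized;
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT, the bytes from the question.
$decomposed = "e\xCC\x81";

echo bin2hex($decomposed), "\n";                  // 65cc81
echo bin2hex(normalize_input($decomposed)), "\n"; // c3a9
```

You would call something like this on the textarea content before writing it to the database, so that only the precomposed form gets stored.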
Should that class not be available on your system, there’s the Patchwork UTF-8 library.