Even today, one sees character encoding problems with significant frequency. Take for example this recent job post:

(Note: This is an example, not a spam job post… 🙂)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
- What causes this particular, common encoding issue?
- As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
This occurs when the conversion between characters and bytes has taken place using the wrong charset. Computers handle data as bytes, but to present the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion is based on a charset, of which there are many.
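To make the byte/character conversion concrete, here is a minimal sketch in Python (chosen only for brevity; the sample text é is my own assumption, not something from the question). It shows the same character turning into different bytes under two charsets, and what happens when the bytes are decoded with the wrong one:

```python
# The same text maps to different bytes depending on the charset.
text = "é"

utf8_bytes = text.encode("utf-8")      # two bytes: C3 A9
latin1_bytes = text.encode("latin-1")  # one byte: E9

# Decoding with the charset that was used for encoding restores the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different charset raises no error here;
# it silently yields other characters.
wrong = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(), latin1_bytes.hex(), round_trip, wrong)
```

Note that the wrong decode does not fail, which is why this kind of bug often goes unnoticed until someone looks at the output.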
In the particular ’ example, this is the typical CP1252 rendering of the Unicode character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019, ’) after it has been encoded as UTF-8. In UTF-8, that character consists of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, you’ll see that those bytes represent exactly the characters â, € and ™.
This can be caused by the website not having read in the original source properly (it should have used UTF-8 for this), or by it serving a UTF-8 page with the wrong charset=CP1252 attribute in the Content-Type response header (or with the attribute missing; on Windows machines the default charset of CP1252 would then be used).
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, a URL, a network socket, etc.) using a known and predefined charset. Then, ensure that you consistently store, write and send them using a Unicode charset, preferably UTF-8.
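The mechanism and the advice above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the sample string is made up, and io.BytesIO stands in for whatever byte source you really have (a file, a socket, an HTTP body).

```python
import io

original = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# UTF-8 encodes U+2019 as the three bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as CP1252 produces the three-character garbage.
mojibake = utf8_bytes.decode("cp1252")

# Reading: name the charset explicitly instead of relying on a platform default.
stream = io.BytesIO("It\u2019s fine".encode("utf-8"))  # stand-in byte source
reader = io.TextIOWrapper(stream, encoding="utf-8")
decoded = reader.read()

# Writing: again name the charset explicitly, preferably UTF-8.
out = io.BytesIO()
writer = io.TextIOWrapper(out, encoding="utf-8")
writer.write(decoded)
writer.flush()

print(utf8_bytes.hex(), mojibake, decoded)
```

The point of the second half is that the charset is stated at every boundary where bytes become characters or the reverse, so no default can sneak in.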
If you’re familiar with Java (your question history confirms this), you may find this article useful.