I’m writing a JSON parser in C++ and have run into a problem when parsing JSON strings.
The JSON specification states that JSON strings can contain Unicode characters in the form of:
"here comes a unicode character: \u05d9 !"
My JSON parser tries to map JSON strings to std::string, so usually one character of the JSON string becomes one character of the std::string. However, for those Unicode characters, I really don’t know what to do:
Should I just put the raw byte values in my std::string like so:
std::string mystr;
mystr.push_back('\x05'); // high byte of U+05D9
mystr.push_back('\xd9'); // low byte of U+05D9
Or should I interpret those two bytes with a library like iconv and store the UTF-8 encoded result in my string instead?
Should I use a std::wstring to store all the characters? What then on *NIX OSes, where wchar_t is 4 bytes long?
I sense something is wrong with my solutions, but I fail to understand what. What should I do in that situation?
After some digging, and thanks to comments from H2CO3 and Philipp, I finally understood how this is supposed to work:
Reading RFC 4627, Section 3 (Encoding): it appears a JSON octet stream can be encoded in UTF-8, UTF-16, or UTF-32 (in either their BE or LE variants, for the last two).
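For what it's worth, that same section of the RFC also describes how a parser can tell these encodings apart: since the first two characters of a JSON text are always ASCII, the pattern of zero bytes in the first four octets gives the encoding away. Here is a rough sketch of that check. The names JsonEncoding and detect_json_encoding are mine (hypothetical), and I assume there is no byte order mark and that inputs shorter than four bytes are UTF-8:

```cpp
#include <cassert>
#include <cstddef>

enum class JsonEncoding { Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Guess the encoding of a JSON text from the zero-byte pattern of its
// first four octets. Assumption: no BOM, and the text starts with two
// ASCII characters (as RFC 4627 guarantees for an object or array).
JsonEncoding detect_json_encoding(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size < 4) return JsonEncoding::Utf8;  // assumption: too short to be UTF-16/32

    const bool z0 = data[0] == 0, z1 = data[1] == 0;
    const bool z2 = data[2] == 0, z3 = data[3] == 0;

    if (z0 && z1 && z2 && !z3) return JsonEncoding::Utf32BE;   // 00 00 00 xx
    if (z0 && !z1 && z2 && !z3) return JsonEncoding::Utf16BE;  // 00 xx 00 xx
    if (!z0 && z1 && z2 && z3) return JsonEncoding::Utf32LE;   // xx 00 00 00
    if (!z0 && z1 && !z2 && z3) return JsonEncoding::Utf16LE;  // xx 00 xx 00
    return JsonEncoding::Utf8;                                 // xx xx xx xx
}
```

In practice most JSON you meet is UTF-8, so a parser that only supports UTF-8 could use a check like this just to reject the other encodings cleanly.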
Once that is clear, Section 2.5 (Strings) explains how to handle those \uXXXX values in JSON strings, with more complete explanations for characters not in the Basic Multilingual Plane.
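To make this concrete for the std::string case: each \uXXXX escape holds one 16-bit code unit as four hex digits, and a character outside the Basic Multilingual Plane is written as two consecutive escapes forming a UTF-16 surrogate pair. One way to deal with this, if you decide to store UTF-8 in the std::string, is sketched below. All function names (append_utf8, read_hex4, decode_unicode_escapes) are my own and hypothetical, the input is assumed to be the string body without its quotes, and other escapes such as \n are simply copied through to keep it short:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Read the four hex digits that follow "\u", starting at s[pos].
std::uint32_t read_hex4(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::runtime_error("truncated \\u escape");
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        value <<= 4;
        if (c >= '0' && c <= '9') value |= static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') value |= static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') value |= static_cast<std::uint32_t>(c - 'A' + 10);
        else throw std::runtime_error("bad hex digit in \\u escape");
    }
    return value;
}

// Replace every \uXXXX escape in a JSON string body with its UTF-8 bytes.
// Other escapes (\n, \\, ...) are copied unchanged in this sketch.
std::string decode_unicode_escapes(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        if (in[i] != '\\') { out.push_back(in[i++]); continue; }
        if (i + 1 >= in.size()) throw std::runtime_error("dangling backslash");
        if (in[i + 1] != 'u') {  // some other escape: copy both characters
            out.append(in, i, 2);
            i += 2;
            continue;
        }
        std::uint32_t cp = read_hex4(in, i + 2);
        i += 6;
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: a low one must follow
            if (in.compare(i, 2, "\\u") != 0) throw std::runtime_error("lone high surrogate");
            const std::uint32_t low = read_hex4(in, i + 2);
            if (low < 0xDC00 || low > 0xDFFF) throw std::runtime_error("invalid low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 6;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("lone low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this, the \u05d9 from the question ends up as the two UTF-8 bytes 0xD7 0x99 in the std::string, not as the raw 0x05 0xD9 I first had in mind.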
Hope this helps.