I have a string that I retrieved from an HTML page using Boost’s regex_search(). Unfortunately, the Japanese characters in the page are written as \u codes, and regex_search treats these as ordinary characters in the string.
So my question is: how does one go about converting these codes to normal Unicode text (UTF-8, obviously)?
The underlying problem seems to be that fstream has no regard for UTF-8. It looks like Boost has its own implementation of fstream, but switching to it had no effect on my program, and I couldn’t find any settings to make Boost’s fstream work with UTF-8 (although today is my first day working with Boost, so I could have missed them).
As a final note: I’m running this on Linux, but I’d certainly appreciate a portable solution over a system-specific one.
Thanks all, I really appreciate the help 😀
fstream is a narrow-character-only stream (it’s a typedef for basic_fstream<char>). std::wfstream would be the type you’re looking for, although to be perfectly portable to, for example, Windows, you may have to introduce C++11 dependencies or rely on boost.locale. (Windows has no Unicode locales, but supports the locale-independent Unicode conversions introduced by C++11. GCC on Linux did not support those conversions before GCC 5, but has plenty of Unicode locales to choose from.)
Your steps would be:
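1. Parse the input string, turning its \u escapes into a wide-character string.
2. Write that string to a UTF-8 std::wofstream (or convert to UTF-8 first, and then write to std::ofstream).
To illustrate the last step: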
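A minimal sketch of that step, not necessarily the exact program that was posted here. It assumes a UTF-8 locale named en_US.UTF-8 is installed; write_utf8 and test.txt are just placeholder names.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Writes `text` to `path` through a wide stream imbued with a UTF-8 locale.
// Returns false if the named locale is not installed or the file cannot be
// opened or written. The default locale name is an assumption; use whichever
// UTF-8 locale your system provides.
bool write_utf8(const std::string& path, const std::wstring& text,
                const char* locale_name = "en_US.UTF-8") {
    std::locale utf8;
    try {
        utf8 = std::locale(locale_name);  // throws if the locale is unknown
    } catch (const std::runtime_error&) {
        return false;
    }
    std::wofstream f;
    f.imbue(utf8);  // the locale's codecvt facet converts wchar_t to UTF-8
    f.open(path.c_str());
    if (!f) return false;
    f << text;
    f.flush();
    return f.good();
}
```

With such a locale available, calling write_utf8("test.txt", L"\u65e5\u672c\u8a9e\n")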
produces a file (on Linux) that contains
e6 97 a5 e6 9c ac e8 aa 9e 0a
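For the decoding itself (turning the \u escapes into real characters), here is a rough, locale-independent sketch that goes straight from the escaped narrow string to UTF-8 bytes, so the result can be written to a plain std::ofstream. The helper names (append_utf8, read_escape, decode_u_escapes) are made up for this example, and it assumes every escape is a backslash, a 'u', and exactly four hex digits, with characters outside the BMP written as surrogate pairs.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of code point `cp` to `out`.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Returns true and stores the 16-bit value in `cp` if s[i..i+5] is "\uXXXX".
bool read_escape(const std::string& s, std::size_t i, unsigned long& cp) {
    if (i + 6 > s.size() || s[i] != '\\' || s[i + 1] != 'u') return false;
    cp = 0;
    for (std::size_t k = i + 2; k < i + 6; ++k) {
        const char c = s[k];
        unsigned long digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return false;  // not a hex digit, so not an escape
        cp = cp * 16 + digit;
    }
    return true;
}

// Replaces every \uXXXX escape in `in` with its UTF-8 bytes; all other bytes
// (including malformed escapes) are copied through unchanged.
std::string decode_u_escapes(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned long cp = 0, low = 0;
        if (!read_escape(in, i, cp)) {
            out += in[i++];
            continue;
        }
        i += 6;
        // Combine a high surrogate with the low surrogate that follows it.
        // A lone surrogate is encoded as-is, which is not valid UTF-8; a
        // stricter version would reject it.
        if (cp >= 0xD800 && cp <= 0xDBFF && read_escape(in, i, low) &&
            low >= 0xDC00 && low <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 6;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Passing the escaped text \u65e5\u672c\u8a9e followed by a newline through decode_u_escapes should give the same ten bytes as in the dump above, with no dependency on which locales are installed.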