I have a string that I have retrieved from an HTML page using boost’s

Question

Asked: June 18, 2026 (2026-06-18T10:33:26+00:00)

I have a string that I retrieved from an HTML page using Boost’s regex_search(). Unfortunately, the Japanese characters in the page are written as \u escape codes (for example \u65e5), and regex_search treats these as ordinary characters in the string: a backslash, a "u", and four hex digits.

So my question is: how does one convert these escape codes to normal Unicode text (UTF-8, obviously)?

This is a fundamental issue with fstream having absolutely no regard for UTF-8. It looks like Boost has its own implementation of fstream, but switching to it had no effect on my program, and I couldn’t find any extra settings to configure Boost’s fstream to work with UTF-8 (although today is my first day working with Boost, so I could have missed them).

As a final note: I’m running this on Linux, but I’d certainly appreciate a portable solution over a system-specific one.

Thanks all, I really appreciate the help 😀

1 Answer

Editorial Team · Answer 1 · 2026-06-18T10:33:28+00:00

fstream is a narrow-character-only stream (it is a typedef for basic_fstream<char>). std::wfstream is the type you are looking for. To be fully portable, for example to Windows, you may have to introduce C++11 dependencies or rely on boost.locale: Windows has no Unicode locales but supports the locale-independent Unicode conversions introduced by C++11, while GCC on Linux has plenty of Unicode locales to choose from but only gained the C++11 Unicode conversions in GCC 5.

Your steps would be:

1. Parse the string to obtain the hexadecimal values of the code points.
2. Store them as wide characters.
3. Write them to a std::wofstream (or convert to UTF-8 first, and then write to a std::ofstream).
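The first two steps could be sketched roughly as follows. This is only an illustration under some assumptions: the function names (hex_value, read_escape, decode_u_escapes) are made up for this example, the escapes are assumed to have the exact form \uXXXX with four hex digits, and every byte outside an escape is assumed to be plain ASCII. It also stores code points in a std::u32string rather than a std::wstring, because wchar_t is only 16 bits wide on Windows; that is a substitution for portability, not part of the steps above.

```cpp
#include <cstddef>
#include <string>

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Tries to read a "\uXXXX" escape starting at pos. On success stores the
// 16-bit value in `unit` and returns true. The caller advances the position.
static bool read_escape(const std::string& s, std::size_t pos, char32_t& unit) {
    if (pos + 6 > s.size() || s[pos] != '\\' || s[pos + 1] != 'u') return false;
    char32_t value = 0;
    for (std::size_t i = pos + 2; i < pos + 6; ++i) {
        int d = hex_value(s[i]);
        if (d < 0) return false;
        value = value * 16 + static_cast<char32_t>(d);
    }
    unit = value;
    return true;
}

// Hypothetical helper: replaces \uXXXX escapes with the code points they name.
// Text that is not a well-formed escape is copied through unchanged.
std::u32string decode_u_escapes(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        char32_t unit = 0;
        if (!read_escape(in, i, unit)) {
            // Not an escape: copy the byte through (assumed to be ASCII).
            out.push_back(static_cast<unsigned char>(in[i]));
            ++i;
            continue;
        }
        i += 6;
        // A high surrogate followed by a low surrogate escape encodes a
        // single code point above U+FFFF.
        char32_t low = 0;
        if (unit >= 0xD800 && unit <= 0xDBFF &&
            read_escape(in, i, low) && low >= 0xDC00 && low <= 0xDFFF) {
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            i += 6;
        }
        out.push_back(unit);
    }
    return out;
}
```

Called on the 18-character input \u65e5\u672c\u8a9e, this returns the three code points U+65E5, U+672C and U+8A9E. If the surrounding text may already contain UTF-8 bytes, the pass-through branch would need to decode those as well.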

To illustrate the last step:

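#include <fstream>
#include <locale>

int main()
{
    // Any installed UTF-8 locale works; std::locale throws
    // std::runtime_error if the named locale is not available.
    std::locale::global(std::locale("en_US.utf8"));
    std::wofstream f("test.txt");
    f.imbue(std::locale()); // convert wide characters using the global locale

    // U+65E5 U+672C U+8A9E spells 日本語
    f << wchar_t(0x65e5) << wchar_t(0x672c) << wchar_t(0x8a9e) << '\n';
}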

This produces a file (on Linux) that contains the bytes e6 97 a5 e6 9c ac e8 aa 9e 0a, which is the UTF-8 encoding of 日本語 followed by a newline.
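For the alternative route (convert to UTF-8 first, then write to a std::ofstream), a hand-written encoder avoids any dependency on installed locales. The following is a minimal sketch, not a library API: append_utf8, to_utf8 and write_utf8_file are hypothetical names, and the choice to replace invalid code points with U+FFFD is an assumption made for this example.

```cpp
#include <fstream>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Surrogates and values above U+10FFFF are replaced with U+FFFD.
void append_utf8(std::string& out, char32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Encodes a whole sequence of code points as UTF-8.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}

// Writes the text as UTF-8 through a narrow stream.
// Binary mode keeps the bytes untouched; returns false on I/O failure.
bool write_utf8_file(const std::string& path, const std::u32string& text) {
    std::ofstream f(path, std::ios::binary);
    f << to_utf8(text);
    return static_cast<bool>(f);
}
```

Encoding the code points U+65E5, U+672C, U+8A9E and a newline this way yields the same ten bytes as the std::wofstream version, without relying on a UTF-8 locale being installed.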
