I have some UTF-8 strings in memory (this is part of a bigger system)

Question

Asked: June 11, 2026

I have some UTF-8 strings in memory (this is part of a bigger system), which are basically names of places in European countries. I’m trying to write them to a text file on my Linux machine (Fedora). When I write these name strings (char pointers) to a file, the file appears to be saved in an extended ASCII format.

I then copy this file to my Windows machine, where I need to load these names into a MySQL DB. When I open the text file in Notepad++, it again defaults the encoding to ANSI. I can switch the encoding to UTF-8, and almost all the characters then look as expected except the following three: Ő, ő and ű. They are displayed within the text as &#336, &#337 and &#369.

Does anyone have any thoughts on what might be wrong? I know that these characters are not part of the extended ASCII symbols. The way I’m writing to the file is something like this:

#include <fstream>

// Create the output file stream.
std::ofstream fs("sample.txt");

// Loop through the list of UTF-8 encoded strings.
if (fs.is_open()) {
    for (int i = 0; i < num_strs; i++) {
        fs << str_names[i]; // unsigned char pointer to the i-th name, encoded as UTF-8
        fs << "\n";
    }
}
fs.close();

Everything looks good even with characters like ú and ö and ß. The issue is with the above 3 characters alone. Any thoughts/suggestions/comments on this? Thanks!

As an example, a string like “Gyömrő” shows up as “Gyömr&#369”.

1 Answer

Editorial Team · Answer 1 · 2026-06-11T17:15:33+00:00

You need to identify at which stage the unexpected HTML entities such as &#336 are introduced. My best guess is that they are already in the string you are writing to the file. Use a debugger, or add test code that counts the & characters in the string.
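The check suggested above could look roughly like the following. This is only a sketch: the function name count_numeric_refs is made up for illustration, and it assumes the strings can be viewed as std::string. It counts "&#" followed by a digit rather than every &, so that ordinary ampersands in place names do not trigger a false alarm.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts substrings of the form "&#" followed by at
// least one decimal digit. A trailing ';' is not required, because the
// question shows references such as "&#336" without one.
std::size_t count_numeric_refs(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 2 < s.size(); ++i) {
        if (s[i] == '&' && s[i + 1] == '#' &&
            std::isdigit(static_cast<unsigned char>(s[i + 2]))) {
            ++count;
        }
    }
    return count;
}
```

Calling this on each name just before the fs << line, and logging any non-zero result, would show whether the references are already present before the file is written.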

That would mean your data source does not strictly use UTF-8 for non-ASCII characters, but occasionally uses HTML entities. This is odd, but possible if the data comes from an HTML file (or something similar).
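If the references do turn out to be in the input, one possible remedy is to convert them to real UTF-8 before writing. The sketch below is an assumption about how that might be done, not something taken from the question: the names append_utf8 and decode_decimal_refs are hypothetical, only decimal references ("&#337", with or without a trailing ';') are handled, and anything that does not parse as a valid code point is copied through unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of code point cp (assumed valid) to out.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: replaces decimal references such as "&#337" or
// "&#337;" with the UTF-8 bytes of the referenced code point.
std::string decode_decimal_refs(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t j = i + 2;  // index of the first possible digit
        unsigned long cp = 0;
        if (in[i] == '&' && i + 1 < in.size() && in[i + 1] == '#') {
            // Read at most 7 digits, which is enough for U+10FFFF.
            while (j < in.size() && j < i + 9 &&
                   std::isdigit(static_cast<unsigned char>(in[j]))) {
                cp = cp * 10 + static_cast<unsigned long>(in[j] - '0');
                ++j;
            }
        }
        const bool valid = j > i + 2 && cp > 0 && cp <= 0x10FFFF &&
                           !(cp >= 0xD800 && cp <= 0xDFFF);
        if (valid) {
            append_utf8(out, cp);
            if (j < in.size() && in[j] == ';') ++j;  // optional terminator
            i = j;
        } else {
            out += in[i];  // not a reference: copy the byte unchanged
            ++i;
        }
    }
    return out;
}
```

Fixing the data at its source is usually preferable; a conversion like this is mainly useful when the upstream producer cannot be changed.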

Also, you might want to look at your output file in hex mode (there’s a nice plugin for Notepad++). This should help you understand what UTF-8 really means at the byte level: the 128 ASCII symbols each use one byte with a value of 0-127. Every other symbol uses 2 to 4 bytes, all with values above 127 (the original UTF-8 design allowed sequences of up to 6 bytes, but RFC 3629 limits them to 4). HTML entities are not really an encoding; they are more of an escape sequence, like ‘\n’ or ‘\r’.
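The same byte-level view can be produced in code before the file ever leaves the Linux machine. The following is a minimal sketch with made-up helper names (to_hex and utf8_sequence_length); it assumes the string is available as a std::string.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of s as space-separated,
// uppercase, two-digit hex values.
std::string to_hex(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned int>(static_cast<unsigned char>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: returns how many bytes a UTF-8 sequence has, judged
// by its first byte, or 0 if the byte cannot start a sequence.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // ASCII, values 0-127
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // e.g. the Latin letters with accents discussed here
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;  // continuation byte (0x80-0xBF) or invalid lead byte
}
```

A correctly encoded ő (U+0151) appears as the two bytes C5 91, whereas the literal text &#337 appears as the five ASCII bytes 26 23 33 33 37. Seeing the latter in the dump means the reference was written as plain text rather than as an encoded character.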
