I have some UTF-8 strings in memory (this is part of a bigger system), which are basically names of places in European countries. What I'm trying to do is write them to a text file. I'm on my Linux machine (Fedora). When I write these name strings (char pointers) to a file, the file appears to be saved in extended ASCII format.
Now I copy this file to my Windows machine, where I need to load these names into a MySQL DB. When I open the text file in Notepad++, it again defaults the encoding to ANSI. I can switch the encoding to UTF-8, and then almost all the characters look as expected except the following 3 characters: Ő, ő and ű. They are displayed within the text as the HTML entities &#336;, &#337; and &#369;.
Does anyone have any thoughts on what might be wrong? I know that these are not part of the extended ASCII symbols. The way I'm writing this to the file is something like:
// create out file stream
std::ofstream fs("sample.txt");
// loop through utf-8 formatted string list
if (fs.is_open()) {
    for (int i = 0; i < num_strs; i++) {
        fs << str_name; // unsigned char pointer representing name in utf-8 format
        fs << "\n";
    }
}
fs.close();
Everything looks good even with characters like ú and ö and ß. The issue is with the above 3 characters alone. Any thoughts/suggestions/comments on this? Thanks!
As an example, a string like "Gyömrő" shows up as "Gyömr&#337;".
You need to identify at which stage the unexpected HTML entities (such as &#336;) are introduced. My best guess is that they are already in the string you are writing to the file. Use a debugger or add testing code that counts the &s in the string.
If so, your source of information does not strictly use UTF-8 for non-ASCII characters, but occasionally uses HTML entities. This is odd, but possible if your data source is an HTML file (or something like that).
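The check suggested above could be sketched roughly as follows. The function name count_numeric_entities is made up for illustration, and it assumes the entities are numeric references starting with "&#"; named entities would need a different search.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the original code): counts how many
// times the marker "&#" occurs in a byte string. A non-zero result on the
// in-memory name suggests the entities were present before the file write.
std::size_t count_numeric_entities(const std::string& s) {
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = s.find("&#", pos)) != std::string::npos) {
        ++count;
        pos += 2;  // continue searching after this marker
    }
    return count;
}
```

Calling it on each name just before the fs << line (after converting the unsigned char pointer to a std::string) would show whether the data is already escaped at that point.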
Also, you might want to look at your output file in HEX mode (there's a nice plugin for Notepad++). This should help you understand what UTF-8 really means at the byte level: the 128 ASCII symbols each use one byte with a value of 0-127. All other symbols use 2 to 4 bytes, and every byte of such a sequence is greater than 127. HTML entities are not really an encoding, but rather an escape sequence like '\n' or '\r', built entirely from ASCII characters.
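If a hex editor is not at hand, the same byte-level view can be approximated in code. This is only a sketch: to_hex and utf8_sequence_length are hypothetical helper names, and the length function only inspects the lead byte without validating the rest of the sequence.

```cpp
#include <string>

// Hypothetical helper: renders each byte as two uppercase hex digits,
// separated by spaces, so escaped and real UTF-8 text can be compared.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}

// Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its
// first byte alone. Returns 0 for a byte that cannot start a sequence
// (for example a continuation byte). Overlong forms are not rejected.
int utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}
```

A real UTF-8 "ő" (U+0151) comes out as the two bytes C5 91, whereas the escaped form "&#337;" is six plain ASCII bytes: 26 23 33 33 37 3B.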