My problem is as follows. I am reading in an XML file, some of whose text nodes contain the UTF-8 encoded opening and closing double quotes. The text is extracted, shortened to 3999 bytes and put into a new XML format, which is then saved as a file.
While both characters are displayed correctly by Notepad++ in the input file, the output file contains invalid UTF-8 sequences that not even Notepad++ is able to display.
The opening double quotes are written correctly, but the closing ones are corrupted.
Using a hex editor, I found out that the code units are somehow changed from
E2 80 9D
in the input file to
E2 80 3F
in the output file.
I am using the SAX parser to parse the XML.
Are there any known bugs that could cause such a behaviour?
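Not a known bug, but a common mistake: the encoding is left out when reading or writing files, so the platform default encoding is used, which is Windows-1252 in this case. The byte 9D has no character assigned in Windows-1252, so it cannot survive the round trip and ends up as 3F (`?`), whereas the 9C of the opening quote maps to a real character and passes through unchanged.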
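To illustrate the mechanism, here is a minimal sketch, assuming the code is Java (the question does not say so) and that the JVM provides the `windows-1252` charset. The class and method names are made up for the illustration. It decodes the UTF-8 bytes of both quotes with the wrong charset and encodes them again:

```java
import java.nio.charset.Charset;

public class EncodingRoundTrip {
    // Formats bytes as space-separated upper-case hex, e.g. "E2 80 9D".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Simulates the mistake: UTF-8 bytes are decoded as Windows-1252
    // and then written back out as Windows-1252.
    static byte[] misdecodeAndReencode(byte[] utf8Bytes) {
        Charset cp1252 = Charset.forName("windows-1252");
        String wrong = new String(utf8Bytes, cp1252); // 0x9D becomes U+FFFD
        return wrong.getBytes(cp1252);                // U+FFFD becomes '?' (0x3F)
    }

    public static void main(String[] args) {
        byte[] closing = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D}; // U+201D
        byte[] opening = {(byte) 0xE2, (byte) 0x80, (byte) 0x9C}; // U+201C
        System.out.println(hex(misdecodeAndReencode(closing))); // E2 80 3F
        System.out.println(hex(misdecodeAndReencode(opening))); // E2 80 9C
    }
}
```

The closing quote comes out as `E2 80 3F`, matching the hex dump in the question, while the opening quote is unchanged.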
When you initially read the file, you should specify UTF-8 decoding, and when writing the new file, you should specify UTF-8 encoding. If you post your implementation, I can correct it in place.
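Until then, here is a rough sketch of what that could look like, again assuming Java and the JDK's built-in SAX parser. The class `Utf8XmlCopy`, its methods and the `<text>` output element are invented for the example and are not your actual code; the 3999-byte shortening step is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Utf8XmlCopy {
    // Collects all text nodes; the parser is told explicitly to decode UTF-8.
    static String readText(Path in) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try (InputStream stream = Files.newInputStream(in)) {
            InputSource source = new InputSource(stream); // raw bytes, no default charset involved
            source.setEncoding("UTF-8");
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        }
        return text.toString();
    }

    // Writes the text with an explicit UTF-8 encoder instead of the platform default.
    static void writeText(Path out, String text) throws IOException {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        try (Writer w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<text>");
            w.write(escaped);
            w.write("</text>\n");
        }
    }

    public static void main(String[] args) throws Exception {
        Path in = Files.createTempFile("in", ".xml");
        Path out = Files.createTempFile("out", ".xml");
        String quoted = "\u201Cquoted\u201D"; // opening and closing double quotes
        Files.write(in, ("<a>" + quoted + "</a>").getBytes(StandardCharsets.UTF_8));
        writeText(out, readText(in));
        String written = new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        System.out.println(written.contains(quoted)); // both quotes survive
    }
}
```

The point is that every place where bytes become characters, or characters become bytes, names UTF-8 explicitly. Constructors such as `FileReader(String)` or `FileWriter(String)`, and `String.getBytes()` without an argument, use the platform default on older Java versions and are the usual culprits.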
How this can be reproduced: