My client uses InputStreamReader/BufferedReader to fetch text from the Internet.
However, when I save the text to a *.txt file, it shows extra weird symbols like ‘Â’.
- I’ve tried converting the String to ASCII, but that messes up å, ä, ö, Ø, which I use.
- I’ve tried food = food.replace("Â", ""); and indexOf(), but the String won’t find it, even though it’s there in a hex editor.
In summary: when I use text.setText() on Android, the output looks fine with NO weird symbols, but when I save the text to a *.txt file I get about 4 ‘Â’ characters. I do not want ASCII because I use other non-ASCII characters.
The ‘Â’ is displayed as whitespace on my Android device and in Notepad.
Thanks!
Have A great Weekend!
EDIT:
Solved it by replacing all non-breaking spaces with ordinary spaces:
myString = myString.replaceAll("\\u00a0", " ");
You say that you are fetching like this:
There is a fair chance that the stuff you are fetching is not encoded in UTF-8.
You need to call getContentType() on the HttpURLConnection object, and if it is non-null, extract the encoding and use it when you create the InputStreamReader. Only assume “UTF-8” if the response doesn’t supply a content type with a valid encoding.
On reflection, while you SHOULD pay attention to the content type returned by the server, the real problem is either in the way that you are writing the *.txt file, or in the display tool that is showing strange characters.
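For the content-type advice above, here is a minimal sketch of pulling the charset out of a Content-Type value. The class name ContentTypeCharset and the helper charsetOf are made up for illustration, and the parsing is deliberately simple (it only looks for a charset= parameter), so treat it as a starting point rather than a complete header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or UTF-8 when the value is null, has no charset parameter, or names a
    // charset this JVM does not know.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive check for a leading "charset=".
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/plain"));
        System.out.println(charsetOf(null));
    }
}
```

Assuming conn is your HttpURLConnection, the reader could then be created with new InputStreamReader(conn.getInputStream(), ContentTypeCharset.charsetOf(conn.getContentType())).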
When you display files using a HEX editor, it is most likely using an 8-bit character set to render bytes, and that character set is most likely Latin-1. But apparently, the file is actually encoded differently.
Anyway, the approach of replacing non-breaking spaces is (IMO) a hack, and it won’t deal with other stuff that you might encounter in the future. So I recommend that you take the time to really understand the problem, and fix it properly.
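If the writing side is the suspect, one way to rule it out is to name the charset explicitly when writing the file, and to check what happens when the same file is read back with a different charset. This is only a sketch: the class ExplicitCharsetWrite, its two helpers, and the file name are invented for the example, and I am assuming UTF-8 is the encoding you want on disk.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ExplicitCharsetWrite {
    // Writes text using the given charset instead of the platform default.
    static void writeText(File file, String text, Charset charset) {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole file and decodes its bytes with the given charset.
    static String readText(File file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file.toPath()), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "charset-demo.txt");
        String text = "sm\u00f6rg\u00e5s\u00a0bord"; // contains ö, å and a non-breaking space
        writeText(file, text, StandardCharsets.UTF_8);
        // Decoding with the charset that was used for writing restores the text.
        System.out.println(readText(file, StandardCharsets.UTF_8).equals(text));
        // Decoding the same bytes as Latin-1 does not.
        System.out.println(readText(file, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

If the first check passes, the file itself is fine and the remaining question is which charset the viewing tool assumes.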
Finally, I think I understand why you might be getting Â characters. A Unicode NON-BREAKING-SPACE character is U+00A0 (\u00a0 in Java). When you encode that as UTF-8, you get the bytes C2 A0. But C2 in Latin-1 is CAPITAL-A-CIRCUMFLEX, and A0 in Latin-1 is NON-BREAKING-SPACE. So the “confusion” is most likely that your program is writing the *.txt file in UTF-8 and the tool is reading it as Latin-1.
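The byte-level explanation above can be reproduced in a few lines. This is a small demonstration, not a fix; the class NbspMojibake and the method misreadAsLatin1 are names I made up for it.

```java
import java.nio.charset.StandardCharsets;

class NbspMojibake {
    // Encodes the text as UTF-8, then decodes those bytes as Latin-1,
    // which is what a tool does when it assumes the wrong charset.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Print the UTF-8 bytes of a single non-breaking space in hex.
        for (byte b : "\u00a0".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);
        }
        System.out.println();

        // One character in, two characters out: Â followed by a non-breaking space.
        String seen = misreadAsLatin1("\u00a0");
        System.out.println(seen.length());
        System.out.println(seen.equals("\u00c2\u00a0"));
    }
}
```

This would also explain why replace("Â", "") finds nothing: the String in memory holds a single U+00A0, and the Â only appears when the file's bytes are decoded with the wrong charset.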