I’m trying to internationalise an Android application. I have a set of strings written in English, and I’m using Google Translate to convert them to the target language.
Then I’m copying and pasting the translated text into Eclipse; however, it’s displayed incorrectly there.
e.g.
I start with the English
Bearing, as degrees East of true north
which translates to
De paliers, comme degrés Est du nord vrai
and when I paste it to Eclipse I get
De paliers, comme degrÃ©s Est du nord vrai
I’ve checked that the strings file is encoded as UTF-8. I’ve also pasted the translation into Notepad and I get the correct characters, which leads me to suspect that it has something to do with Eclipse and Windows 7. Does anyone have any ideas or a workaround (e.g. will editing the XML file outside Eclipse, in Notepad for example, work)?
Your string is UTF-8 (the symbol Ã gives it away), but Eclipse is probably interpreting your file as Cp1252. Right-click on the file and check the content encoding Eclipse is using. Generally, if not modified, it is inherited from the container, which usually defaults to Cp1252; the container is the project, the workspace, or the global Eclipse settings, in that order. Some files, however, such as XML, are treated according to their content (XML has a header declaring the encoding used).
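The mismatch is easy to reproduce outside Eclipse. Here is a minimal Python sketch (the function name `misread_as_cp1252` is made up for illustration) that encodes the translated string as UTF-8 and then decodes those same bytes as Cp1252, which is roughly what an editor with the wrong encoding setting does:

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Cp1252."""
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode("cp1252")


# "\u00e9" is é; written as an escape so the example does not depend
# on how this snippet itself is saved.
original = "De paliers, comme degr\u00e9s Est du nord vrai"
garbled = misread_as_cp1252(original)
print(garbled)  # De paliers, comme degrÃ©s Est du nord vrai
```

Plain ASCII text passes through unchanged, which is why only the accented character is affected.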
Update
If you have checked that Eclipse is actually interpreting the file as UTF-8, then this points to a double conversion. In Cp1252, Ã has the byte value 0xC3 and © has the byte value 0xA9. If you look at the UTF-8 encoding table, you will find that the character é has the two-byte encoding 0xC3 0xA9. When data is interpreted, some conversions are made automatically if the source and destination encodings are known (e.g. when writing Java Strings out to another encoding, since internally they are always UTF-16). The problem arises when one of the encodings is unknown (your case) and the converter has to guess, normally using the default system encoding. This is when things start getting messed up.
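The byte values mentioned above can be checked directly. This is a small Python sketch, using Python's built-in `utf-8` and `cp1252` codecs; the variable names are only illustrative:

```python
e_acute = "\u00e9"  # é

# é occupies two bytes in UTF-8.
utf8_bytes = e_acute.encode("utf-8")
print([hex(b) for b in utf8_bytes])  # ['0xc3', '0xa9']

# Each of those two bytes is a complete character on its own in Cp1252.
first = bytes([0xC3]).decode("cp1252")   # Ã
second = bytes([0xA9]).decode("cp1252")  # ©
print(first, second)
```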
You may end up with Ã© in UTF-8 if the original source was indeed UTF-8 but was interpreted as Cp1252. The original sequence 0xC3 0xA9 (Ã© in Cp1252, or é in UTF-8) is translated to 0xC3 0x83 (Ã in UTF-8) and 0xC2 0xA9 (© in UTF-8).
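The double conversion, and one way to undo it, can be sketched in Python as follows. The helper names `double_encode` and `repair` are hypothetical, and `repair` assumes the text went through exactly one round of the UTF-8-read-as-Cp1252 mistake:

```python
def double_encode(text: str) -> bytes:
    """UTF-8 bytes misread as Cp1252, then saved as UTF-8 a second time."""
    misread = text.encode("utf-8").decode("cp1252")
    return misread.encode("utf-8")


def repair(garbled: str) -> str:
    """Undo one round of the misreading by reversing the two steps."""
    return garbled.encode("cp1252").decode("utf-8")


twice = double_encode("\u00e9")       # start from é
print(twice.hex())                    # c383c2a9
print(repair(twice.decode("utf-8")))  # é
```

`repair` raises a `UnicodeError` if the input contains characters that Cp1252 cannot represent or if the recovered bytes are not valid UTF-8, so it is only safe on text known to be damaged in this particular way.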
How can the source encoding be detected if it is not specified? Normally it can’t. That’s why most UTF-8 encoders perform this double conversion if you feed their output back to them (from Cp1252 to UTF-8, and then to UTF-8 again when the previous output is fed in but interpreted as Cp1252), unless the document carries some mark that tells the encoder which encoding is used (such as a BOM, which, by the way, is not supported by Eclipse).
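To illustrate what such a mark looks like, here is a rough Python sketch that checks only for the UTF-8 BOM (the bytes 0xEF 0xBB 0xBF). The function name `sniff_utf8_bom` is an assumption for this example, and a missing BOM tells you nothing either way:

```python
import codecs
from typing import Optional


def sniff_utf8_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with the UTF-8 BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops the leading BOM.
        return "utf-8-sig"
    return None  # no mark: the encoding still has to be guessed


payload = "degr\u00e9s".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload

print(sniff_utf8_bom(with_bom))  # utf-8-sig
print(sniff_utf8_bom(payload))   # None
```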