When encoding a Java String to Latin-1 (i.e. charset ISO-8859-1), I currently convert the German symbol β (‘\u03B2’) to ß (‘\u00DF’) before performing the encoding. I’m trying to avoid a question mark in the encoded output where possible.
Can anyone suggest other un-encodable characters which can be replaced with an encodable character? Or better yet, a Java library that does it for me?
Update: A bit of background: I have a Java program which exports its data to CSV files so they can be read into a third-party application. A customer has complained that some characters are not converted; he gave me the example of ‘straβe’. Although technically β is the Greek symbol for beta, a quick Google search shows quite a few people use it to mean ß.
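For reference, the workaround described above might look roughly like the following. This is only a sketch: the class and method names (`EszettWorkaround`, `toLatin1`) are made up for illustration. It relies on the documented behaviour of `String.getBytes(Charset)`, which replaces unmappable characters with the charset's replacement byte (a question mark for ISO-8859-1).

```java
import java.nio.charset.StandardCharsets;

public class EszettWorkaround {
    // Hypothetical helper: swap Greek beta (U+03B2) for eszett (U+00DF),
    // then encode. Any other unmappable character still becomes '?'.
    public static byte[] toLatin1(String text) {
        String patched = text.replace('\u03B2', '\u00DF');
        return patched.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = toLatin1("stra\u03B2e");
        // Each Latin-1 character is one byte, so this prints 6.
        System.out.println(bytes.length);
    }
}
```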
First, are you sure your input text is correctly entered or encoded?
U+03B2 is ‘GREEK SMALL LETTER BETA’, not the German eszett.
U+00DF is the eszett, or ‘LATIN SMALL LETTER SHARP S’.
Java can map the latter to ISO-8859-1 because it is defined in the mapping table at http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT.
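You can check this from Java by asking the charset's encoder directly. A minimal sketch using the standard `CharsetEncoder.canEncode` method; the class name `Latin1Check` and the helper `isLatin1` are just illustrative:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    // Returns true if the given char has a mapping in ISO-8859-1.
    public static boolean isLatin1(char c) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1('\u00DF')); // eszett: prints true
        System.out.println(isLatin1('\u03B2')); // Greek beta: prints false
    }
}
```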
There is no way to solve this problem generally: the whole point of Unicode is that it contains lots of characters that simply cannot be represented in ISO-8859-*.
I suggest producing a list of all Unicode characters in your data that are not listed in the http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT document. Then, for each unmapped character, you will have to choose an appropriate substitution from the ISO-8859-1 range by hand/eye.
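One way to do both steps in Java is sketched below. It is not a complete solution: the substitution table is a hand-picked assumption with only a few example entries, and the names `Latin1Substitution`, `findUnmappable` and `substitute` are hypothetical. The check works per UTF-16 `char`, so characters outside the Basic Multilingual Plane show up as their two surrogate halves.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class Latin1Substitution {
    // Hand-built table (illustrative assumption); extend it after
    // reviewing the characters reported by findUnmappable().
    static final Map<Character, String> SUBSTITUTIONS = new HashMap<>();
    static {
        SUBSTITUTIONS.put('\u03B2', "\u00DF"); // Greek beta -> eszett
        SUBSTITUTIONS.put('\u20AC', "EUR");    // euro sign
        SUBSTITUTIONS.put('\u2013', "-");      // en dash
    }

    // Step 1: collect every char in the text that ISO-8859-1 cannot encode.
    static Set<Character> findUnmappable(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        Set<Character> unmappable = new TreeSet<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!encoder.canEncode(c)) {
                unmappable.add(c);
            }
        }
        return unmappable;
    }

    // Step 2: apply the table; anything still unmapped becomes '?'.
    static String substitute(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (encoder.canEncode(c)) {
                out.append(c);
            } else {
                String replacement = SUBSTITUTIONS.get(c);
                out.append(replacement != null ? replacement : "?");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "stra\u03B2e costs 5 \u20AC";
        System.out.println(findUnmappable(input).size()); // beta and euro: prints 2
        byte[] bytes = substitute(input).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(bytes.length);
    }
}
```

Running `findUnmappable` over a sample of real export data first gives you the list to review before deciding on substitutions.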