I read this blog entry about Perl and how it handles Unicode and Unicode normalization.
The short version, as I understand it, is that there are several ways to write the identifier “é” in Unicode: either as one Unicode character or as a combination of two characters. A Perl program may not be able to distinguish between them, which causes strange errors.
So that got me thinking: how does the Java editor in Eclipse handle Unicode? Or Java in general, since I guess that's the same question.
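To make the two spellings concrete, here is a minimal sketch in Java (the language the rest of this question is about) using plain strings rather than identifiers. The class and method names are made up for illustration; `java.text.Normalizer` is the standard library class.

```java
import java.text.Normalizer;

public class AccentForms {
    // U+00E9, LATIN SMALL LETTER E WITH ACUTE: a single code point.
    static final String PRECOMPOSED = "\u00e9";
    // U+0065 followed by U+0301, COMBINING ACUTE ACCENT: two code points.
    static final String DECOMPOSED = "e\u0301";

    // Converts a string to Normalization Form C (canonical composition).
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(PRECOMPOSED.length());                     // 1
        System.out.println(DECOMPOSED.length());                      // 2
        // Same rendering, different character sequences.
        System.out.println(PRECOMPOSED.equals(DECOMPOSED));           // false
        // After normalizing both to NFC they compare equal.
        System.out.println(nfc(PRECOMPOSED).equals(nfc(DECOMPOSED))); // true
    }
}
```

Both strings render as “é”, but they only compare equal once both have been normalized to the same form.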
On the one hand, the specification says:
Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.
But on the other, Unicode escapes are translated:
This translation step allows any program to be expressed using only ASCII characters.
These two statements seem to contradict each other?
The translation step refers to the first step of the lexical translation process:
The lexical translation process allows Unicode characters to be written in your source code as escape sequences made up of ASCII characters alone. It is therefore possible to name an identifier with valid Unicode characters while representing it in ASCII through a Unicode escape sequence.
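As a small sketch of what that means in practice (the class and method names here are invented for illustration), the escape for U+0069 is the letter `i`, so an identifier written with the escape refers to the same variable as one written with the plain letter:

```java
public class EscapedIdentifier {
    // Declared with the plain ASCII letter i.
    static int i = 1;

    static int readThroughEscape() {
        // The escape below is translated to the letter i before
        // identifiers are recognized, so this assigns to the field above.
        \u0069 = 2;
        return i;
    }

    public static void main(String[] args) {
        System.out.println(readThroughEscape()); // prints 2
    }
}
```

This compiles because, once the escape has been translated, the compiler only ever sees the identifier `i`.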
The translation of escape sequences happens before the source is split into tokens, so by the time the compiler checks whether two identifiers are alike, it no longer matters how they were written in the source. The rules for naming identifiers are applied to the translated sequence of input characters and line terminators. Note that this sequence is not Unicode-normalized: only the escape sequences are replaced. Therefore, the following code will not compile, and will produce an error, because the identifiers have the same name even though one is written differently: