I read this blogentry regarding perl and how they handle unicode and normalization of

Question


Asked: May 25, 2026 (2026-05-25T17:26:11+00:00)


I read this blog entry about Perl and how it handles Unicode and Unicode normalization.
The short version, as I understand it, is that there are several ways to write the identifier “é” in Unicode: either as a single Unicode character or as a combination of two characters. A Perl program may not be able to distinguish between them, causing strange errors.

So that got me thinking: how does the Java editor in Eclipse handle Unicode? Or Java in general, since I guess that's the same question.

On one hand the specification says:

Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.

But on the other hand, Unicode escapes are translated:

This translation step allows any program to be expressed using only ASCII characters.

Don't these two statements contradict each other?


1 Answer

Editorial Team · Answer 1 · 2026-05-25T17:26:12+00:00

The translation step refers to the first step of the lexical translation process:

A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the Unicode character whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

The lexical translation process allows Unicode characters to be written in your source code as escape sequences made up of ASCII characters alone. It is therefore possible to give an identifier a name containing any valid Unicode character while still writing it in ASCII, using a Unicode escape sequence.
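As a minimal sketch of this (the class name `EscapedIdentifier` and the field name are made up for illustration), the field below is declared with an escape and then read through its plain ASCII spelling. Both spellings denote the same identifier once the escape has been translated:

```java
final class EscapedIdentifier {
    // The first letter is written as the escape for U+0061 ('a'),
    // so after translation the field is simply named "abc".
    static int \u0061bc = 41;

    static int read() {
        // Refers to the same field using the plain ASCII spelling.
        return abc + 1;
    }

    public static void main(String[] args) {
        System.out.println(read()); // prints 42
    }
}
```

The escape is purely a source-level spelling; nothing in the compiled class records which form was used.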

Unicode escapes are translated at the very start of compilation, before the source is split into tokens, so the rules for identifiers are applied to the translated character sequence. Two identifiers are therefore compared by the characters they denote, irrespective of how those characters are written in the source, and the two statements you quote do not contradict each other. Note that this step only replaces escapes; it does not perform Unicode normalization, so a precomposed “é” (U+00E9) and an “e” followed by a combining acute accent (U+0301) remain two different identifiers. The following code will not compile, because both declarations name the same identifier even though one of them is written with an escape:

package info.example.i18n;

public class UnicodeEscape
{
    int a;
    int \u0061; // 0x61 = 97 = 'a' in ASCII, so this declares the field a a second time
}
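
For the normalization half of the question, the comparison is on exact Unicode characters, so the two ways of writing “é” are not unified. The sketch below is only an illustration (the class and method names are hypothetical); it uses `java.text.Normalizer` and the `Character` identifier checks to show that the two forms differ as written, become equal only after explicit NFC normalization, and are both acceptable as identifier text:

```java
import java.text.Normalizer;

final class IdentifierForms {
    // Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    static final String COMPOSED = "\u00e9";
    // Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    static final String DECOMPOSED = "e\u0301";

    // Plain comparison, character by character, with no normalization.
    static boolean sameAsWritten() {
        return COMPOSED.equals(DECOMPOSED);
    }

    // Comparison after explicitly normalizing both strings to NFC.
    static boolean sameAfterNfc() {
        String a = Normalizer.normalize(COMPOSED, Normalizer.Form.NFC);
        String b = Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC);
        return a.equals(b);
    }

    // True if every code point is allowed at its position in a Java identifier.
    static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.codePointAt(0))) {
            return false;
        }
        return s.codePoints().skip(1).allMatch(Character::isJavaIdentifierPart);
    }

    static boolean bothValidIdentifiers() {
        return isIdentifier(COMPOSED) && isIdentifier(DECOMPOSED);
    }

    public static void main(String[] args) {
        System.out.println(sameAsWritten());        // false
        System.out.println(sameAfterNfc());         // true
        System.out.println(bothValidIdentifiers()); // true
    }
}
```

So a class could, in principle, declare two fields that look identical on screen but use the two different forms, and they would be treated as distinct identifiers. Whether an editor such as Eclipse shows them differently is a separate matter that this sketch does not address.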
