Looking at Tom Christiansen’s talk Unicode Support Shootout The Good, the Bad, & the

Asked: May 24, 2026

Looking at Tom Christiansen’s talk

    Unicode Support Shootout

        The Good, the Bad, & the (mostly) Ugly

working with text seems to be so incredibly hard, that there is no programming language (except Perl 6) which gets it even remotely correct.

What are the key design decisions to make to have a chance of implementing Unicode support correctly from a clean slate (i.e. with no backward-compatibility requirements)?

What about default file encodings, and which transfer format and normalization form to use internally and for strings? What about case-mapping and case-folding? What about locale and RTL support? What about regex engines as defined by UTS#18? What should common APIs look like?


1 Answer

Editorial Team · Answer 1 · 2026-05-24T23:43:12+00:00

EDIT: I’ll add more as I think of them.

You need no existing code that you have to support. A legacy of code that requires everything to be in 8- or 16-bit code units is a royal pain. It makes even libraries awkward when you have to support pre-existing models that don’t consider this.
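To make the code-unit versus code-point distinction concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name `unit_counts` is made up for this example). It counts the same string three ways.

```python
def unit_counts(text: str) -> dict:
    """Count one string as code points, UTF-16 code units, and UTF-8 code units."""
    return {
        # Python's str is indexed by code point.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte-order mark.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Each UTF-8 code unit is 1 byte.
        "utf8_units": len(text.encode("utf-8")),
    }


# LATIN SMALL LETTER A followed by GRINNING FACE, which lies outside the BMP.
counts = unit_counts("a\U0001F600")
print(counts)
```

Any API that reports or limits length in code units will disagree with the code-point count as soon as a character like this appears.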

You have to work with blind people only so fonts are no issue. 🙂

You have to follow the Unicode rules for identifier characters, and pattern syntax characters. You should normalize your identifiers internally. If your language is itself LTR, you may not wish to allow RTL idents; unclear here.
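As a rough sketch of what normalizing identifiers internally could look like, the Python below folds every identifier to one normalization form before it is checked and stored. The function name `canonical_identifier` is hypothetical, and NFKC is an assumption borrowed from what Python does for its own identifiers; a new language might reasonably pick NFC instead.

```python
import unicodedata


def canonical_identifier(name: str) -> str:
    """Return the NFKC form of an identifier, or raise if it is not valid."""
    normalized = unicodedata.normalize("NFKC", name)
    # str.isidentifier() applies Python's own identifier rules,
    # which are based on the Unicode XID_Start / XID_Continue properties.
    if not normalized.isidentifier():
        raise ValueError(f"not a valid identifier: {name!r}")
    return normalized


composed = "caf\u00e9"      # 'e with acute' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Both spellings now name the same identifier.
print(canonical_identifier(composed) == canonical_identifier(decomposed))
```

Without this step, two identifiers that look identical on screen could silently refer to different variables.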

You need to provide primitives in your language that map to Unicode concepts. For example, instead of just uppercase and lowercase, you need uppercase, titlecase, lowercase, and foldcase (or uc, tc, lc, and fc).
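To show why four case operations are needed rather than two, here is a small Python illustration. Python's `str.upper`, `str.title`, `str.lower`, and `str.casefold` stand in for uc, tc, lc, and fc; the helper `caseless_equal` is a made-up name, and it deliberately skips the normalization step that full caseless matching would also need.

```python
sharp_s = "\u00df"   # LATIN SMALL LETTER SHARP S
dz = "\u01c6"        # LATIN SMALL LETTER DZ WITH CARON, a digraph


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using case folding rather than lowercasing."""
    return a.casefold() == b.casefold()


# Uppercasing can change the length of a string.
print(sharp_s.upper())
# Lowercase and foldcase differ for sharp s.
print(sharp_s.lower(), sharp_s.casefold())
# The digraph has three distinct forms: lower, title, and upper.
print(dz, dz.title(), dz.upper())
# Folding makes these two spellings match; lowercasing alone would not.
print(caseless_equal("STRASSE", "stra\u00dfe"))
```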

You need to give full access to the Unicode Character Database, including all character properties, so that the various tech reports’ algorithms can be easily built up using them.
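For a sense of what property access looks like in practice, this sketch uses Python's standard `unicodedata` module. That module exposes only a subset of the Unicode Character Database, so treat it as an illustration of the idea rather than the full access argued for above; the function name `describe` is hypothetical.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode Character Database properties for one code point."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "bidi_class": unicodedata.bidirectional(ch),
    }


for ch in ("\u0301", "\u05d0", "\u4e2d"):
    print(describe(ch))
```

Properties like these are the raw material for the algorithms in the technical reports: the combining class drives normalization, the bidi class drives RTL layout, and East Asian Width feeds column calculations.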

You need a clear logical model that is easily extensible to graphemes as needed. Just as people have come to realize a code point interface is vastly more important than a code unit one, you have to be able to deal with graphemes, etc. For example, nobody in their right mind should be forced to rewrite:

printf "%-10.10s", $string;

as this every time:

# this library treats strings as sequences of
# extended grapheme clusters for indexing purposes etc.
use Unicode::GCString;

my $gcstring = Unicode::GCString->new($string);
my $colwidth = $gcstring->columns();
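if ($colwidth > 10) {
    # NOTE: substr() counts grapheme clusters, not print columns
    print $gcstring->substr(0,10);
} else {
    # "%-10s" left-justifies, so the padding follows the string
    print $gcstring;
    print " " x (10 - $colwidth);
}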

You have to do it that way, BTW, because you have to have a notion of print columns, which can be 0 for combining and control characters, or 2 for characters with certain East Asian Width properties. Etc. It would be much better if there was no existing printf code so you could start from scratch and do it right. I have no idea what to do about RTL scripts’ widths.
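The column rule described above can be sketched in a few lines of Python. This is a simplification under stated assumptions, not a full implementation: it works per code point rather than per extended grapheme cluster, it treats only the general categories Mn, Me, Cc, and Cf as zero-width, and it ignores RTL entirely. The names `char_columns` and `fit_columns` are made up for this example.

```python
import unicodedata


def char_columns(ch: str) -> int:
    """Approximate the number of print columns one code point occupies."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cc", "Cf"):
        return 0   # combining marks and control/format characters
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2   # wide and fullwidth East Asian characters
    return 1


def fit_columns(text: str, width: int = 10) -> str:
    """Left-justify text in exactly `width` columns, truncating or padding."""
    kept, used = [], 0
    for ch in text:
        cols = char_columns(ch)
        if used + cols > width:
            break              # the next character would overflow the field
        kept.append(ch)
        used += cols
    return "".join(kept) + " " * (width - used)


print("[" + fit_columns("e\u0301") + "]")          # one column of text, nine of padding
print("[" + fit_columns("\u4e2d\u6587") + "]")     # two wide characters, four columns
```

Because zero-width marks never overflow the field, a combining mark stays attached to the base character it follows even when that base character fills the last column.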

The operating system is a pre-existing code-unit library.

You need not to interact with the filesystem name space, as you have no control over whether filesystem A runs things through NFC (Linux, I believe), filesystem B runs things through NFD (HFS+, nearly), or filesystem C (traditional Unix) doesn’t do any normalization at all. Alternatively, you might be able to provide an abstraction layer here with local filters to hide some of that from the user. Operating systems always have code-unit limits, not code-point ones, which is going to annoy you.
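One possible shape for such a filter, sketched in Python: compare filenames only after bringing both to the same normalization form. The function name `same_filename` and the choice of NFC as the comparison form are assumptions made for this example.

```python
import unicodedata


def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Treat two names as equal if they match after normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The same visible name as two different filesystems might hand it back.
nfc_name = unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt")
nfd_name = unicodedata.normalize("NFD", "r\u00e9sum\u00e9.txt")

print(nfc_name == nfd_name)               # raw comparison fails
print(same_filename(nfc_name, nfd_name))  # normalized comparison succeeds
print(len(nfc_name), len(nfd_name))       # even the code point counts differ
```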

Other things with code-unit stipulations include databases that allocate fixed-size records. Fixed size just doesn’t work: it’s grapheme-hostile and normalization-form-hostile.
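A short Python sketch shows why a fixed-size field is hostile to both graphemes and normalization forms. The 4-byte field size and the helper name `truncate_utf8` are invented for the illustration.

```python
import unicodedata


def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Naively cut the UTF-8 encoding at a byte limit, as a fixed-size field would."""
    return text.encode("utf-8")[:max_bytes]


name = "Zo\u00eb"                          # 'Zoe' with a diaeresis on the e
nfc = unicodedata.normalize("NFC", name)   # e with diaeresis as one code point
nfd = unicodedata.normalize("NFD", name)   # 'e' plus COMBINING DIAERESIS

# The same name needs a different number of bytes in each form.
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))

# A 4-byte field holds the NFC form but splits the combining mark in the NFD form.
print(truncate_utf8(nfc, 4))
print(truncate_utf8(nfd, 4))
```

The NFD bytes end in half of an encoded character, so the stored value is no longer valid UTF-8, and a lenient decoder would quietly hand back the name without its diaeresis.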
