The Archive Base

Editorial Team
Asked: May 24, 2026

Looking at Tom Christiansen’s talk Unicode Support Shootout The Good, the Bad, & the


Looking at Tom Christiansen’s talk “Unicode Support Shootout: The Good, the Bad, & the (mostly) Ugly”, working with text seems to be so incredibly hard that there is no programming language (except Perl 6) which gets it even remotely correct.

What are the key design decisions to make in order to have a chance of implementing Unicode support correctly on a clean slate (i.e. with no backward-compatibility requirements)?

What about default file encodings, and which transformation format and normalization form to use internally for strings? What about case-mapping and case-folding? What about locale and RTL support? What about regex engines as defined by UTS #18? What should common APIs look like?


1 Answer

    Editorial Team
    Added an answer on May 24, 2026 at 11:43 pm

    EDIT: I’ll add more as I think of them.

    You need to have no existing code that you have to support. A legacy of code that requires everything to be in 8- or 16-bit code units is a royal pain. It makes even libraries awkward when you have to support pre-existing models that don’t consider this.

    You have to work with blind people only so fonts are no issue. 🙂

    You have to follow the Unicode rules for identifier characters and pattern syntax characters (UAX #31). You should normalize your identifiers internally. If your language is itself LTR, you may not wish to allow RTL identifiers; it’s unclear to me what is right here.
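
    As a rough sketch of what identifier handling could look like, using only core Perl: the helper name canonical_identifier is made up for illustration, and picking NFC as the internal form is my assumption rather than a rule. The XID_Start and XID_Continue properties are the ones UAX #31 defines for default identifiers.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: return the NFC form of $name if it is a valid
# identifier under the default UAX #31 syntax, or undef otherwise.
# (NFC as the internal form is an assumption made for this sketch.)
sub canonical_identifier {
    my ($name) = @_;
    my $nfc = NFC($name);
    return $nfc =~ /\A\p{XID_Start}\p{XID_Continue}*\z/ ? $nfc : undef;
}

my $decomposed  = "e\x{301}cole";   # "e" + COMBINING ACUTE ACCENT
my $precomposed = "\x{E9}cole";     # LATIN SMALL LETTER E WITH ACUTE

# Both spellings name the same identifier once normalized.
print "same identifier\n"
    if canonical_identifier($decomposed) eq canonical_identifier($precomposed);
```

    The point of normalizing at the door is that the symbol table only ever sees one spelling, however the source file happened to be typed.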

    You need to provide primitives in your language that map to Unicode concepts: instead of just uppercase and lowercase, you need uppercase, titlecase, lowercase, and foldcase (or uc, tc, lc, and fc).
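
    To show why four primitives and not two, here is a small sketch in Perl 5.16 or later, where ucfirst applies the titlecase mapping and fc does case folding. The helper names case_forms and equal_caseless are mine, purely for illustration.

```perl
use v5.16;      # enables strict plus the 'fc' and 'unicode_strings' features
use warnings;

# Return the four case forms of a string. ucfirst applies the titlecase
# mapping to the first character only.
sub case_forms {
    my ($s) = @_;
    return { upper => uc($s), title => ucfirst($s), lower => lc($s), fold => fc($s) };
}

# Caseless comparison should use fold case, not lowercase.
sub equal_caseless {
    my ($x, $y) = @_;
    return fc($x) eq fc($y);
}

# U+01C6 is a digraph with three distinct case forms.
my $dz = case_forms("\x{1C6}");
printf "upper U+%04X, title U+%04X\n", ord($dz->{upper}), ord($dz->{title});

# U+00DF (sharp s) changes length under the full case mappings.
my $eszett = case_forms("\x{DF}");
say "upper: $eszett->{upper}, title: $eszett->{title}, fold: $eszett->{fold}";

# Capital sigma and final sigma differ after lc but agree after fc.
my $sigma_equal = equal_caseless("\x{3A3}", "\x{3C2}");
say $sigma_equal ? "sigmas equal under fc" : "sigmas differ under fc";
```

    Note that the mappings are string-to-string, not character-to-character, which is exactly what a one-code-point-in, one-code-point-out API gets wrong.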

    You need to give full access to the Unicode Character Database, including all character properties, so that the various tech reports’ algorithms can be easily built up using them.
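
    For a feel of what “full access” means in practice, this is roughly how it looks with Perl’s core Unicode::UCD module and regex properties. The describe helper is a made-up name, and the four fields it prints are just the ones I picked for the sketch.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: summarize a few UnicodeData.txt fields for one
# code point (name, general category, combining class, bidi class).
sub describe {
    my ($cp) = @_;
    my $info = charinfo($cp) or return;
    return sprintf "U+%04X %s gc=%s ccc=%s bidi=%s",
        $cp, @{$info}{qw(name category combining bidi)};
}

print describe(0x0301), "\n";   # U+0301 COMBINING ACUTE ACCENT gc=Mn ccc=230 bidi=NSM
print describe(0x05D0), "\n";   # U+05D0 HEBREW LETTER ALEF gc=Lo ccc=0 bidi=R

# The same database is reachable from patterns via \p{...}.
print "Hebrew, right-to-left\n"
    if "\x{5D0}" =~ /\A\p{Script=Hebrew}\z/ && "\x{5D0}" =~ /\A\p{Bidi_Class=R}\z/;
```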

    You need a clear logical model that is easily extensible to graphemes as needed. Just as people have come to realize a code point interface is vastly more important than a code unit one, you have to be able to deal with graphemes, etc. For example, nobody in their right mind should be forced to rewrite:

    printf "%-10.10s", $string;
    

    as this every time:

    # this library treats strings as sequences of
    # extended grapheme clusters for indexing purposes etc.
    use Unicode::GCString;
    
    my $gcstring = Unicode::GCString->new($string);
    my $colwidth = $gcstring->columns();
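    if ($colwidth > 10) {
        # NB: substr counts grapheme clusters, not print columns
        print $gcstring->substr(0,10);
    } else {
        # "%-10s" left-justifies: the text comes first, then the padding
        print $gcstring;
        print " " x (10 - $colwidth);
    }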
    

    You have to do it that way, BTW, because you have to have a notion of print columns, which can be 0 for combining and control characters, or 2 for characters with certain East Asian Width properties. Etc. It would be much better if there were no existing printf code so you could start from scratch and do it right. I have no idea what to do about RTL scripts’ widths.
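
    To make the print-column idea concrete, here is a deliberately simplified sketch in core Perl, without Unicode::GCString. The function names are made up, and judging a cluster’s width by its first code point alone is my simplifying assumption; a real implementation has more cases (ambiguous widths, emoji sequences, and the RTL question above).

```perl
use strict;
use warnings;

# Approximate column width of one extended grapheme cluster, judged by its
# first code point: 0 for controls, format characters and lone marks,
# 2 for East Asian Wide/Fullwidth, 1 otherwise. (A simplifying assumption.)
sub cluster_columns {
    my ($cluster) = @_;
    my $first = substr($cluster, 0, 1);
    return 0 if $first =~ /[\p{Cc}\p{Cf}\p{M}]/;
    return 2 if $first =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/;
    return 1;
}

# Total columns of a string, summed over grapheme clusters (\X).
sub columns {
    my ($string) = @_;
    my $total = 0;
    $total += cluster_columns($_) for $string =~ /\X/g;
    return $total;
}

# Hypothetical stand-in for a column-aware "%-N.Ns": keep whole clusters
# while they fit in $width columns, then pad on the right with spaces.
sub fit_left {
    my ($string, $width) = @_;
    my ($out, $used) = ('', 0);
    for my $cluster ($string =~ /\X/g) {
        my $w = cluster_columns($cluster);
        last if $used + $w > $width;
        $out  .= $cluster;
        $used += $w;
    }
    return $out . ' ' x ($width - $used);
}

binmode STDOUT, ':encoding(UTF-8)';
print '[', fit_left("e\x{301}t\x{E9}", 10), "]\n";       # 3 columns of text, 7 of padding
print '[', fit_left("\x{4E2D}\x{6587}abc", 4), "]\n";    # two wide characters fill all 4 columns
```

    Truncating by columns rather than by clusters means a wide character that would straddle the limit is dropped and replaced by padding, which is the behavior you want for aligned output.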

    The operating system is a pre-existing code-unit library.

    You need to avoid interacting with the filesystem name space, because you have no control over whether filesystem A runs names through NFC, filesystem B runs them through NFD (HFS+, nearly), or filesystem C (traditional Unix, and Linux filesystems as far as I know) does no normalization at all. Alternatively, you might be able to provide an abstraction layer with local filters that hides some of that from the user. Operating systems always have code-unit limits, not code-point ones, which is going to annoy you.
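
    One shape such a local filter could take, sketched in core Perl: compare names under a single normalization form instead of byte for byte. The function name find_equivalent_name is hypothetical, the listing is hard-coded rather than read from a real directory, and the names are assumed to be already decoded to characters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical filter: given the name the user asked for and the names a
# directory listing returned (already decoded to characters), return the
# stored spelling that is canonically equivalent, or nothing if none is.
sub find_equivalent_name {
    my ($wanted, @stored) = @_;
    my $key = NFC($wanted);
    for my $name (@stored) {
        return $name if NFC($name) eq $key;
    }
    return;
}

# A decomposed spelling, as a filesystem that decomposes names might store it.
my @listing = ("re\x{301}sume\x{301}.txt", "notes.txt");

# The user typed the precomposed spelling; a plain eq would miss the file.
my $hit = find_equivalent_name("r\x{E9}sum\x{E9}.txt", @listing);
print defined $hit ? "found\n" : "missing\n";
```

    The filter returns the stored spelling, not the requested one, so whatever is passed back to the operating system is byte-identical to what the filesystem reported.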

    Other things with code-unit stipulations include databases that allocate fixed-size records. Fixed size just doesn’t work: it’s grapheme-hostile and normalization-form-hostile.
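
    A quick way to see the hostility is to measure the same word in all three units under two normalization forms. This is only a sketch with core modules; the sizes helper is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFD);

# Size of a string in three units: grapheme clusters, code points,
# and UTF-8 code units (bytes).
sub sizes {
    my ($string) = @_;
    my $graphemes = () = $string =~ /\X/g;
    return ($graphemes, length($string), length(encode('UTF-8', $string)));
}

my $word = "r\x{E9}sum\x{E9}";
printf "NFC: %d graphemes, %d code points, %d bytes\n", sizes(NFC($word));
printf "NFD: %d graphemes, %d code points, %d bytes\n", sizes(NFD($word));
# NFC: 6 graphemes, 6 code points, 8 bytes
# NFD: 6 graphemes, 8 code points, 10 bytes
```

    The user sees six characters either way, but a field declared as eight bytes holds the word in one form and not in the other.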
