The Archive Base

Editorial Team
Asked: May 12, 2026

This is really a double question. My two end goals are answers to:

  • What is the standard string comparison order, in terms of the mechanics?
  • What’s a better name for that so I can update the docs?

Perl’s documentation for sort says that without a block, sort uses “standard string comparison order”. But what is that order? There should be a better name for it. For this question, I specifically mean the situation where locale is not in effect, since that defines its own order.

In years past, we normally called the standard sort order “ASCIIbetically”. It’s in Learning Perl and many other books. However, that term is dated. Perl has been Unicode-aware since 5.6, so talking about ASCII is old school. Being Unicode-aware, Perl knows about character strings. In sv.c, Perl_sv_cmp knows about locale, bytes, and UTF-8. The first two are easy. But I’m not confident about the third.

/*
=for apidoc sv_cmp

Compares the strings in two SVs.  Returns -1, 0, or 1 indicating whether the
string in C<sv1> is less than, equal to, or greater than the string in
C<sv2>. Is UTF-8 and 'use bytes' aware, handles get magic, and will
coerce its args to strings if necessary.  See also C<sv_cmp_locale>.

=cut
*/

When Perl sorts using UTF-8, what is it really sorting? The bytes the string encodes to, the characters it represents (including marks maybe?), or something else? I think this is the relevant line in sv.c (line 6698 for commit 7844ec1):

 pv1 = tpv = (char*)bytes_to_utf8((const U8*)pv1, &cur1);

If I’m reading that right (using my rusty C), pv1 is coerced to octets, turned into UTF-8, then coerced to characters (in the C sense). I think that means it’s sorting by the UTF-8 encoding (i.e. the actual bytes that UTF-8 uses to represent a code point). Another way to say that is that it doesn’t sort on graphemes. I think I’ve almost convinced myself I’m reading this right, but some of you know way more about this than I do.

From that, the next interesting line is 6708:

 const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

To me it looks like pv1 and pv2, which were coerced to char *, are now just compared byte-by-byte because they are coerced to void *. Is that what happens with memcmp? From the various docs I’ve read so far, it looks like it just compares bits. Again, I’m wondering what I’m missing in the trip from bytes->utf8->char->bytes, like maybe a Unicode normalization step. Checking out Perl_bytes_to_utf8 in utf8.c didn’t help me answer that question.
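To make the memcmp step concrete, here is a minimal sketch of a byte-wise comparison over the shared prefix of two buffers. It is not Perl's actual code: `bytes_cmp` is a hypothetical name, and the rule that a strict prefix sorts first on a tie is an assumption, since the quoted line shows only the memcmp call. The C standard specifies that memcmp compares bytes as unsigned char values and does nothing else, so no decoding or normalization happens in this step.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from sv.c: compare two byte buffers
 * over their shared prefix, then break a tie by length (assumed rule).
 * Returns -1, 0, or 1, matching the values the apidoc describes. */
static int bytes_cmp(const char *p1, size_t len1,
                     const char *p2, size_t len2)
{
    assert(p1 != NULL && p2 != NULL);

    size_t shared = len1 < len2 ? len1 : len2;

    /* memcmp looks only at raw byte values (as unsigned char). */
    int r = memcmp(p1, p2, shared);
    if (r != 0)
        return r < 0 ? -1 : 1;

    if (len1 == len2)
        return 0;

    /* Shared prefix is identical: the shorter buffer sorts first. */
    return len1 < len2 ? -1 : 1;
}
```

For instance, `bytes_cmp("\xC3\xA9", 2, "z", 1)` is positive, because the lead byte 0xC3 of the UTF-8 form of U+00E9 is greater than 0x7A ('z').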

As a side note, I’m wondering if this is the same thing as the Unicode Collation Algorithm? If it is, why does Unicode::Collate exist? From the looks of it, I don’t think Perl’s sort handles canonical equivalence.

1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 9:14 pm

    UTF-8 has the property that comparing two UTF-8 strings byte-by-byte by byte value gives the same ordering as comparing them code point by code point by code point number. For example, I know without looking that the UTF-8 representation of U+2345 sorts lexicographically after the UTF-8 representation of U+1234. So once both strings are in UTF-8, the memcmp in sv_cmp amounts to sorting by code point.
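The ordering property can be checked with a small sketch. The names `utf8_encode` and `utf8_order` are hypothetical, and the encoder is a plain illustration of the UTF-8 bit layout, not Perl's own routine. U+1234 encodes as E1 88 B4 and U+2345 as E2 8D 85, so the first byte already decides the comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative encoder (hypothetical name): writes the UTF-8 form of one
 * code point into out and returns the byte count, or 0 if cp is a
 * surrogate or lies above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Order two code points purely by the bytes of their UTF-8 encodings.
 * Returns -1, 0, or 1. */
static int utf8_order(uint32_t a, uint32_t b)
{
    unsigned char ea[4], eb[4];
    size_t la = utf8_encode(a, ea);
    size_t lb = utf8_encode(b, eb);
    assert(la != 0 && lb != 0);

    size_t shared = la < lb ? la : lb;
    int r = memcmp(ea, eb, shared);
    if (r != 0)
        return r < 0 ? -1 : 1;
    return la == lb ? 0 : (la < lb ? -1 : 1);
}
```

Calling `utf8_order(0x1234, 0x2345)` gives -1, the same answer as comparing the two numbers directly. The boundaries between encoding lengths behave the same way, for example U+007F against U+0080 and U+FFFF against U+10000.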

    As for normalization, the Perl core doesn’t know anything about it; to get accurate sorting and comparison among the different forms you would want to run all of your strings through Unicode::Normalize and convert them all to the same normalization form. I can’t comment on which is best for any given purpose, mostly because I have no clue.
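A short sketch of why normalization matters for a byte-wise comparison: the two literals below both render as "é", but one is the precomposed code point U+00E9 and the other is "e" followed by the combining acute accent U+0301. The names are hypothetical and the comparison is plain strcmp, standing in for any comparison that looks only at bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two canonically equivalent spellings of "é" as UTF-8 bytes. */
static const char E_ACUTE_NFC[] = "\xC3\xA9";   /* U+00E9 */
static const char E_ACUTE_NFD[] = "e\xCC\x81";  /* U+0065 U+0301 */

/* Hypothetical helper: byte-wise comparison reduced to -1, 0, or 1.
 * strcmp is sufficient here because neither string contains a NUL byte,
 * and it compares bytes as unsigned char values. */
static int raw_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}
```

Here `raw_cmp(E_ACUTE_NFC, E_ACUTE_NFD)` is nonzero, so the two spellings are not treated as equal. They also land in different places: the decomposed form sorts before "f", while the precomposed form sorts after "z". Converting every string to one form first, which in Perl would be a call such as NFC or NFD from Unicode::Normalize, removes that discrepancy.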

    Also, sort and cmp are affected by the locale pragma if it’s in use; they then follow the collation order of the current locale (the LC_COLLATE category). Combining use locale, an 8-bit locale, and Unicode is a recipe for disaster, but use locale with a UTF-8 locale and Unicode should work usefully. I can’t say I’ve tried it. There’s an awful lot of info in perllocale and perlunicode anyway.
