The following code
#!/usr/bin/perl
use strict;
use warnings;
my $s1 = 'aaa2000@yahoo.com';
my $s2 = 'aaa_2000@yahoo.com';
my $s3 = 'aaa2000';
my $s4 = 'aaa_2000';
no locale;
print "\nNO Locale:\n\n";
if ($s1 gt $s2) {print "$s1 is > $s2\n";}
if ($s1 lt $s2) {print "$s1 is < $s2\n";}
if ($s1 eq $s2) {print "$s1 is = $s2\n";}
if ($s3 gt $s4) {print "$s3 is > $s4\n";}
if ($s3 lt $s4) {print "$s3 is < $s4\n";}
if ($s3 eq $s4) {print "$s3 is = $s4\n";}
use locale;
print "\nWith 'use locale;':\n\n";
if ($s1 gt $s2) {print "$s1 is > $s2\n";}
if ($s1 lt $s2) {print "$s1 is < $s2\n";}
if ($s1 eq $s2) {print "$s1 is = $s2\n";}
if ($s3 gt $s4) {print "$s3 is > $s4\n";}
if ($s3 lt $s4) {print "$s3 is < $s4\n";}
if ($s3 eq $s4) {print "$s3 is = $s4\n";}
prints out
NO Locale:
aaa2000@yahoo.com is < aaa_2000@yahoo.com
aaa2000 is < aaa_2000
With 'use locale;':
aaa2000@yahoo.com is > aaa_2000@yahoo.com
aaa2000 is < aaa_2000
which I cannot really follow: at the same time, under use locale, a < b AND a@yahoo.com > b@yahoo.com?!
Am I missing something more or less obvious, or is this a bug? Can others confirm that they see the same behavior?
The locale is:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Thanks in advance.
With locales enabled, collation is done in multiple passes. Every character has four weights, which are compared in successive passes. The @ and _ signs, like most punctuation, have no primary, secondary, or tertiary weight, so they only come into play in the fourth pass. So, for your example

aaa2000@yahoo.com vs. aaa_2000@yahoo.com

in the first pass, it’s really comparing

aaa2000yahoocom vs. aaa2000yahoocom

and then in the fourth pass (there are no differentiating factors in the second and third passes)

@. vs. _@.

and the first string sorts after the second because @ happens to be greater than _ in this locale. (This is just a choice that the locale definition makes, presumably based on some ISO standard or other.)

You can peek into the implementation details of this. A locale-enabled comparison ends up being implemented in the C library as strxfrm(A) cmp strxfrm(B). Run this program:

I get:
The way these numbers are derived is an implementation detail; they just have to come out such that a byte comparison yields the desired end result. But the concept is the same across all modern programming environments with locale-enabled sorting.
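A minimal sketch of how the strxfrm(A) cmp strxfrm(B) idea mentioned above could be tried from Perl. This is not the original program from this answer. It assumes the en_US.UTF-8 locale from the question is installed, and the helper names xfrm_hex and xfrm_cmp are made up for illustration. POSIX::strxfrm and POSIX::setlocale are the standard POSIX module functions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE strxfrm);

# Assumption: en_US.UTF-8 (the locale shown in the question) is installed.
# setlocale returns undef if it is not; the process then keeps its current
# collation locale and the keys printed below will look different.
my $active = setlocale(LC_COLLATE, 'en_US.UTF-8');
warn "en_US.UTF-8 not available, keeping the current collation locale\n"
    unless defined $active;

# Hypothetical helper: hex-dump the collation key strxfrm builds for a string.
sub xfrm_hex {
    my ($str) = @_;
    return unpack('H*', strxfrm($str));
}

# Hypothetical helper: compare two strings by their transformed keys.
# There is no "use locale" in scope, so cmp is a plain byte comparison here.
sub xfrm_cmp {
    my ($x, $y) = @_;
    return strxfrm($x) cmp strxfrm($y);
}

for my $str ('aaa2000@yahoo.com', 'aaa_2000@yahoo.com', 'aaa2000', 'aaa_2000') {
    printf "%-20s %s\n", $str, xfrm_hex($str);
}
```

The exact bytes printed depend on the C library and the locale data, so they may differ from one system to another. What should hold is that xfrm_cmp orders two strings the same way the gt/lt operators do under use locale.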