When I do:
use strict; use warnings;
my $regex = qr/[[:upper:]]/;
my $line = MyModule::get_my_line_from_external_source(); #file, db, etc...
print "upper here\n" if( $line =~ $regex );
How does Perl know when it must match only ASCII uppercase and when it must match UTF-8 uppercase?
It is a precompiled regex, so somehow Perl must know what counts as uppercase. Does it depend on the locale settings? If so, how can I match UTF-8 uppercase in the “C” locale with a precompiled regex?
Updated based on tchrist’s comments:
use strict; use warnings; use Encode;
my $regex = qr/[[:upper:]]/;
my $line = XXX::line();
print "$line: upper1 ", ($line =~ $regex) ? "YES" : "NO", "\n";
my $uline = Encode::decode_utf8($line);
print "$uline: upper2 ", ($uline =~ $regex) ? "YES" : "NO", "\n";
package XXX;
sub line { return "alpha-Ω"; } #returning octets - not utf8 chars
The output is:
alpha-Ω: upper1 NO
alpha-Ω: upper2 YES
Does this mean that the precompiled regex is not ‘hard-precompiled’ but ‘soft-precompiled’, so that Perl interprets ‘[[:upper:]]’ according to the UTF8 flag of the matched $line?
Before Perl 5.14, this was not very well defined.
With 5.14, the pattern knows how it was compiled, and you have the
/u, /l, /d, /a, or /aa pattern modifiers. You can also say use re '/u'; (the re pragma accepts other flag strings as well)
to turn all those flags on in the lexical scope.
For example, under 5.14:
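A minimal sketch of how the modifiers change what [[:upper:]] matches, assuming Perl 5.14 or later. The sample string, the %upper hash, and the variable names are illustrative assumptions, not taken from the question's real data source:

```perl
use strict;
use warnings;
use 5.014;                  # the /u, /a and /d modifiers need Perl 5.14 or later
use Encode qw(decode);

# Hypothetical input: the UTF-8 octets of "alpha-Ω" (U+03A9), still undecoded.
my $octets = "alpha-\xCE\xA9";
# The same text as a decoded character string (UTF8 flag on).
my $chars  = decode('UTF-8', $octets);

# One precompiled pattern per character-set modifier.
my %upper = (
    u => qr/[[:upper:]]/u,  # Unicode rules, whatever the string's internal flag
    a => qr/[[:upper:]]/a,  # POSIX classes restricted to ASCII
    d => qr/[[:upper:]]/d,  # pre-5.14 behaviour: depends on the UTF8 flag
);

for my $mod (sort keys %upper) {
    printf "/%s  octets: %-3s  chars: %s\n", $mod,
        ($octets =~ $upper{$mod} ? 'YES' : 'NO'),
        ($chars  =~ $upper{$mod} ? 'YES' : 'NO');
}
```

With /d the result flips between the octet string and the decoded string, which is the behaviour seen in the question. With /a neither matches, because the string has no ASCII capital. With /u the decoded string matches on Ω; the undecoded octets can also match, but only because the byte 0xCE is read as U+00CE (Î), which is one more reason to decode first.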
I would steer clear of locales; just use all-Unicode.
BTW, I would make darned sure that that “external source” gave you back a string that was properly decoded; that is, has its UTF8 flag turned on. Character functions work poorly on encoded strings, because they really want decoded strings instead.
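One way to do that decoding at the boundary is sketched below. The helper name decode_line and the sample octets are hypothetical, and the sketch assumes the external source hands back UTF-8 encoded bytes:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw UTF-8 octets from an external source
# into a decoded character string.
sub decode_line {
    my ($octets) = @_;
    # FB_CROAK dies on malformed input instead of silently substituting
    # U+FFFD; LEAVE_SRC keeps the caller's copy of the octets untouched.
    return decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $line = decode_line("alpha-\xCE\xA9");   # octets for "alpha-Ω"

print "UTF8 flag: ", (utf8::is_utf8($line) ? "on" : "off"), "\n";
print "length:    ", length($line), "\n";  # counts characters, not octets
print "upper:     ", ($line =~ /[[:upper:]]/u ? "YES" : "NO"), "\n";
```

When the data comes from a file, an I/O layer such as :encoding(UTF-8) on the filehandle does the same job at read time, so every line arrives already decoded.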