Below is my Perl script. I am testing the number 1234 against one equivalent in Japanese (copied from Wikipedia, so it may not be 100% correct).
Using
\p{decimal number}+
\p{Number}+
\d+
the code works fine for the ASCII version, but for Japanese the only pattern I have found is this one:
[0-9\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]
What am I doing wrong in this case?
use 5.016;
use utf8;
use charnames qw< :full >;
use feature qw< unicode_strings >;
use Test::More tests => 2;
sub is_valid {
    my $string = shift;
    $string ~~ /^[0-9\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]+$/u
    #/\p{decimal number}+/msx
}
ok(is_valid("1234"), "ascii");
ok(is_valid("壱弐参四"), "japanese");
Your code passes for me on v5.14.
The /u doesn’t do what you think it does there, since you have just ASCII in the pattern. You require v5.16, but /u showed up in v5.14. No big whoop unless there’s some v5.16 enhancement you’re trying to use.
As many people have noted, there’s a semantic difference between numbers and digits. I think you just want to match a run of digits. The problem is that the UCS doesn’t label the characters you want to match as digits.
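To see that mismatch concretely, here is a small check (a sketch of my own, with a hypothetical helper named classify): the fullwidth digit U+FF11 carries the Decimal_Number category, while the kanji 四 (U+56DB) is classified as a letter (Lo) in the Han script, so \p{Nd} never matches it.

```perl
use 5.014;
use strict;
use warnings;

my $fullwidth_one = "\x{FF11}";   # FULLWIDTH DIGIT ONE, general category Nd
my $kanji_four    = "\x{56DB}";   # CJK ideograph meaning "four", general category Lo

# Hypothetical helper: returns a short label describing how Unicode
# classifies a single character.
sub classify {
    my ($char) = @_;
    return 'decimal digit (Nd)' if $char =~ /\A\p{Nd}\z/;
    return 'other number (N)'   if $char =~ /\A\p{N}\z/;
    return 'Han letter (Lo)'    if $char =~ /\A\p{Han}\z/ && $char =~ /\A\p{Lo}\z/;
    return 'something else';
}

say classify($fullwidth_one);
say classify($kanji_four);
```

The kanji lands in the letter branch, which is why none of \d, \p{Decimal_Number}, or \p{Number} can pick it up.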
As such, you created a very expansive character class to do that. I think you’re stuck with that. You probably don’t want to keep repeating that class everywhere, though. You could hide it all in a subroutine, but you can also define additional properties. You create a specially named subroutine that returns a string with lines of character ranges as hex values. Here’s an example from perlunicode:
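In the same style as the perlunicode examples, a minimal sketch of such a property might look like the following. The name InJapaneseDigits is hypothetical (perlunicode only requires that it begin with In or Is), and the ranges are simply copied from the character class in the question, so they are just as broad as that class.

```perl
use 5.014;
use strict;
use warnings;
use utf8;

# Hypothetical user-defined property. Each line is a range given as two
# hex code points separated by a tab; the ranges mirror the question's
# character class and are NOT a vetted list of Japanese numerals.
sub InJapaneseDigits {
    return <<"END";
0030\t0039
3041\t3096
30A1\t30FC
4E00\t9FAF
END
}

# True if the whole string consists of characters from the property.
sub is_valid {
    my ($string) = @_;
    return $string =~ /\A\p{InJapaneseDigits}+\z/ ? 1 : 0;
}

say is_valid("1234")     ? "match" : "no match";
say is_valid("壱弐参四") ? "match" : "no match";
```

The pattern now names the intent instead of repeating the ranges, and tightening the set later means editing only the subroutine.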
You might use the Unicode::Unihan module to figure out which points you want. You can do it with code, but all this is doing is looking in the Unihan database file that has the same name as the method. Someone who actually knows Japanese will have to tweak this to select the right characters:
That program outputs: