Why doesn’t “\w” match Unicode word characters (for example, “ğ,İ,ş,ç,ö,ü”) in a Perl regular expression?
I tried to match these characters with the regular expression m{\w+}g, but it does not match “ğ,İ,ş,ç,ö,ü”.
How can I make this work?
use strict;
use warnings;
use v5.12;
use utf8;
open(MYINPUTFILE, "< $ARGV[0]");
my @strings;
my $delimiter;
my $extensions;
my $id;
while(<MYINPUTFILE>)
{
    my($line) = $_;
    chomp($line);
    print $line."\n";
    unshift(@strings,$line =~ /\w+/g);
    $delimiter = /[._\s]/;
    $extensions = /pdf$|doc$|docx$/;
    $id = /^200|^201/;
}
foreach(@strings){
    print $_."\n";
}
The input file is like:
Çidem_Şener
Hüsnü Tağlip
…
The output goes like:
H�
sn�
Ta�
lip
�
idem_�
ener
In the code, I try to read the file and store each string in the array (the delimiter can be _, ., or \s).
Unicode can be a challenge, and Perl has its own peculiarities.
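Basically, Perl puts up a firewall around every avenue of input and output where Unicode is concerned. You have to tell Perl whether an I/O path has an encoding. If it does, the rule is: DECODE on input, ENCODE on output.
Decoding on the way in converts the data from that encoding to the internal representation Perl uses, which is a string of characters (Unicode code points) rather than raw bytes. Without that step, Perl treats each UTF-8 byte as a separate character, so \w+ stops in the middle of a multi-byte letter such as ü. That is where the � fragments in your output come from.
Encoding on the way out does just the opposite: it turns the character string back into bytes in the encoding you name.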
So it is actually possible to “decode in” from one encoding and “encode out” to a different one. You just have to tell Perl what each one is. Encoding and decoding are usually done via the file I/O layer, but you can use the Encode module (part of the core distribution) to convert manually between encodings.
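To make the manual route concrete, here is a minimal sketch using the Encode module. The byte string is a hand-written stand-in for what an undecoded filehandle would deliver for "Hüsnü", and ISO-8859-9 as the output encoding is only an assumption chosen to show that the two sides can differ.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "Hüsnü" (each "ü" is the two bytes C3 BC).
my $bytes = "H\xC3\xBCsn\xC3\xBC";

# Decode in: UTF-8 bytes -> Perl character string.
my $chars = decode('UTF-8', $bytes);

# On the decoded string, \w treats "ü" as a single word character.
my @words = $chars =~ /\w+/g;

# Encode out: character string -> bytes, here in a different encoding
# (ISO-8859-9 is an assumption for illustration).
my $latin5 = encode('ISO-8859-9', $chars);

printf "%d word(s), %d characters, %d UTF-8 bytes, %d ISO-8859-9 bytes\n",
    scalar(@words), length($chars), length($bytes), length($latin5);
```

The character count and the UTF-8 byte count differ because each "ü" occupies two bytes before decoding but is one character afterwards.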
The perldocs on Unicode (perlunitut, perluniintro, perlunicode) are not a light read, though.
Here is a sample that might help visualize it (there are many other ways too).
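A minimal sketch of the I/O-layer approach, assuming the input file and the terminal both use UTF-8 (the garbled output in the question is consistent with UTF-8 input, but that remains an assumption). The helper name words_in_file is hypothetical and only there to keep the example short.

```perl
use strict;
use warnings;
use v5.12;
use utf8;    # only affects literals in this source file, not file I/O

# Encode out: characters printed to STDOUT become UTF-8 bytes.
# Assumption: the terminal expects UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: read a file through a decoding layer and
# return every \w+ run found in it, in file order.
sub words_in_file {
    my ($path) = @_;

    # Decode in: the layer turns UTF-8 bytes into characters as lines are read.
    open(my $in, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";

    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line =~ /\w+/g;
    }
    close $in;
    return @words;
}

# Print one word per line when a file name is given on the command line.
if (@ARGV) {
    say for words_in_file($ARGV[0]);
}
```

Note that \w also matches the underscore, so a line such as Çidem_Şener stays a single token here. To treat _ as a delimiter as described in the question, something like split /[._\s]+/, $line on the decoded line would separate the parts.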