The Archive Base

Editorial Team
Asked: May 31, 2026

Why doesn’t “\w” match Unicode word characters (for example, “ğ,İ,ş,ç,ö,ü”) in a Perl regular expression?

I tried to match these characters with the regular expression m{\w+}g, but it does not match “ğ,İ,ş,ç,ö,ü”.

How can I make this work?

use strict;
use warnings;
use v5.12;
use utf8;

open(MYINPUTFILE, "< $ARGV[0]");

my @strings;
my $delimiter;
my $extensions;
my $id;

while(<MYINPUTFILE>)
{
    my($line) = $_;
    chomp($line);
    print $line."\n";
    unshift(@strings,$line =~ /\w+/g);
    $delimiter = /[._\s]/;
    $extensions = /pdf$|doc$|docx$/;
    $id = /^200|^201/;
}

foreach(@strings){
    print $_."\n";
}

The input file is like:

Çidem_Şener
Hüsnü Tağlip
…

The output goes like:

H�
sn�
Ta�
lip
�
idem_�
ener

In the code, I try to read the file and store each string in the array. (The delimiter can be _, ., or \s.)


1 Answer

    Editorial Team
    Added an answer on May 31, 2026 at 4:15 pm

    Unicode can be a challenge, and Perl has its own peculiarities.
    Basically, Perl treats every input/output channel as a boundary where Unicode is concerned. You have to tell Perl which encoding the data on each I/O path uses. Once you have, the rule is: DECODE all input and ENCODE all output.

    Decoding in converts the octets from the external {encoding} into Perl's internal character strings, where each element is a Unicode code point rather than a byte of the encoding.

    Encoding out does just the opposite.
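To see why the decode step matters for \w, here is a minimal sketch (not the asker's exact script). It reads the same UTF-8 octets twice from an in-memory filehandle, once raw and once through a decoding layer. The variable names are hypothetical and the data is assumed to be UTF-8.

```perl
use strict;
use warnings;

# UTF-8 octets for "Tağlip" (ğ is 0xC4 0x9F), as they would sit in a file.
my $octets = "Ta\xC4\x9Flip";

# No layer: Perl sees 7 separate bytes, and \w+ breaks inside ğ.
open my $raw, '<', \$octets or die "Cannot open in-memory handle: $!";
my $undecoded = <$raw>;
close $raw;
my @raw_words = $undecoded =~ /\w+/g;

# Decoding layer: Perl sees 6 characters, and \w+ matches the whole word.
open my $dec, '<:encoding(UTF-8)', \$octets or die "Cannot open in-memory handle: $!";
my $decoded = <$dec>;
close $dec;
my @words = $decoded =~ /\w+/g;

printf "raw:     %d chars, %d match(es)\n", length($undecoded), scalar @raw_words;
printf "decoded: %d chars, %d match(es)\n", length($decoded),   scalar @words;
```

Without the layer, the two bytes of ğ are treated as two separate characters and the second one (0x9F) is never a word character, which is why the output in the question is chopped at every accented letter.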

    So it is perfectly possible to “decode in” from one encoding and “encode out” to a different one; you just have to tell Perl what each one is. Encoding and decoding are usually done via the file I/O layer, but you can also use the Encode module (part of the core distribution) to convert manually between encodings.
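As a small illustration of decoding from one encoding and encoding to another, the sketch below uses the Encode module directly. It assumes, purely for the example, that the input arrives as ISO-8859-9 (the Latin alphabet commonly used for Turkish) and that the output should be UTF-8; the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: "Şener" as ISO-8859-9 octets (Ş is the single byte 0xDE).
my $latin5_octets = "\xDE\x65\x6E\x65\x72";

# DECODE in: external octets -> Perl characters.
my $chars = decode('ISO-8859-9', $latin5_octets);

# ENCODE out: Perl characters -> UTF-8 octets (Ş becomes 0xC5 0x9E).
my $utf8_octets = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 octets: %d\n", length($chars), length($utf8_octets);
printf "first code point: U+%04X\n", ord($chars);
```

The character string in the middle is the same whatever the external encodings are, so regular expressions such as \w+ only ever need to deal with characters.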

    The perldoc pages on Unicode (perlunitut, perluniintro, perlunicode) are not a light read, though.

    Here is a sample that might help visualize it (there are many other ways too).

    use strict;
    use warnings;
    use Encode;
    
    
    # This is an internal (character) string built from these Unicode code points
    # ----------------------------------------------
    my $internal_string_1 = "\x{C7}\x{69}\x{64}\x{65}\x{6D}\x{5F}\x{15E}\x{65}\x{6E}\x{65}\x{72}\x{20}\x{48}\x{FC}\x{73}\x{6E}\x{FC}\x{20}\x{54}\x{61}\x{11F}\x{6C}\x{69}\x{70}";
    
    
    # Open a temp file for writing as UTF-8.
    # Output to this file will be automatically encoded from Perl internal to UTF-8 octets.
    # Write the internal string.
    # Check the file with a UTF-8 editor.
    # ----------------------------------------------
    open (my $out, '>:utf8', 'temp.txt') or die "can't open temp.txt for writing $!";
    print $out $internal_string_1;
    close $out;
    
    
    # Open the temp file for reading as UTF-8.
    # All input from this file will be automatically decoded from UTF-8 octets to Perl internal.
    # Slurp the whole file into a different internal string.
    # ----------------------------------------------
    open (my $in, '<:utf8', 'temp.txt') or die "can't open temp.txt for reading $!";
    my $internal_string_2 = do { local $/; <$in> };   # local $/ keeps slurp mode from leaking
    close $in;
    
    
    # Change the binmode of STDOUT to UTF-8.
    # Output to STDOUT will now be automatically encoded from Perl internal to UTF-8 octets.
    # Capture STDOUT to a file then check with a UTF-8 editor.
    # ----------------------------------------------
    binmode STDOUT, ':utf8';
    print $internal_string_2, "\n\n";
    
    
    # Use encode() to convert an internal string to UTF-8 octets
    # Format the UTF-8 octets to hex values
    # Print to STDOUT
    # ----------------------------------------------
    my $octets = encode ("utf8", $internal_string_2);
    print "Encoded (out) string -> UTF-8 (octets):\n";
    print "   length  =  ".length($octets)."\n";
    print "   octets  =  $octets\n";
    print "   HEX val =  ";
    for (split //, $octets) {
        printf ("0x%X ", ord($_));
    }
    print "\n\n";
    
    
    # Use decode() to convert external UTF-8 octets to an internal string.
    # Format the internal string to codepoints (hex values).
    # Print to STDOUT.
    # ----------------------------------------------
    my $internal_string_3 = decode ("utf8", $octets);
    print "Decoded (in) string <- UTF-8 (octets):\n";
    print "   length      =  ".length($internal_string_3)."\n";
    print "   string      =  $internal_string_3\n";
    print "   code points =  ";
    for (split //, $internal_string_3) {
        printf ("\\x{%X} ", ord($_));
    }
    

    Output

    Çidem_Şener Hüsnü Tağlip
    
    Encoded (out) string -> UTF-8 (octets):
       length  =  29
       octets  =  Ãidem_Åener Hüsnü TaÄlip
       HEX val =  0xC3 0x87 0x69 0x64 0x65 0x6D 0x5F 0xC5 0x9E 0x65 0x6E 0x65 0x72 0x20 0x48 0xC3 0xBC 0x73 0x6E 0xC3 0xBC 0x20 0x54 0x61 0xC4 0x9F 0x6C 0x69 0x70
    
    Decoded (in) string <- UTF-8 (octets):
       length      =  24
       string      =  Çidem_Şener Hüsnü Tağlip
       code points =  \x{C7} \x{69} \x{64} \x{65} \x{6D} \x{5F} \x{15E} \x{65} \x{6E} \x{65} \x{72} \x{20} \x{48} \x{FC} \x{73} \x{6E} \x{FC} \x{20} \x{54} \x{61} \x{11F} \x{6C} \x{69} \x{70}
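Applied to the loop in the question, the same rule might look like the sketch below. It assumes the input file is UTF-8; a temporary file stands in for the file named in $ARGV[0] so that the example is self-contained. It also uses push instead of unshift, so the words stay in file order.

```perl
use strict;
use warnings;
use utf8;                               # the string literal below is UTF-8 source text
use File::Temp qw(tempfile);

binmode STDOUT, ':encoding(UTF-8)';     # ENCODE everything printed to STDOUT

# Hypothetical stand-in for the input file named in $ARGV[0].
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "Çidem_Şener\nHüsnü Tağlip\n";
close $tmp or die "Cannot close $path: $!";

# DECODE on input: the layer turns UTF-8 octets into characters.
open my $in, '<:encoding(UTF-8)', $path
    or die "Cannot open $path: $!";

my @strings;
while (my $line = <$in>) {
    chomp $line;
    # [^\W_] is "\w minus underscore", so "_" acts as a delimiter too.
    push @strings, $line =~ /[^\W_]+/g;
}
close $in;

print "$_\n" for @strings;
```

Note that "_" is itself a word character, so a plain \w+ would keep Çidem_Şener as a single match. The character class [^\W_] is one way to treat the underscore as a delimiter while still matching the Turkish letters.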
    
© 2021 The Archive Base. All Rights Reserved