For example, match Nation in Îñţérñåţîöñåļîžåţîöñ without extra modules. Is it possible in new


Asked: May 25, 2026


For example, match “Nation” in “Îñţérñåţîöñåļîžåţîöñ” without extra modules. Is it possible in new Perl versions (5.14, 5.15, etc.)?

I found an answer, thanks to tchrist!

Right solution, using a UCA (Unicode Collation Algorithm) match (thanks to https://stackoverflow.com/users/471272/tchrist):

# find start/end offsets of the matched UTF-8 substring (non-overlapping matches)
use 5.014;
use strict; 
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
if (@match) {
    my $found = $match[0];
    my $f_len  = length($found);
    say "match result: $found (length is $f_len)"; 
    my $offset = 0;
    while ((my $start = index($str, $found, $offset)) != -1) {
        my $end = $start + $f_len;
        say "found at: $start,$end";
        # resume right after this match so an adjacent match is not skipped
        $offset = $end;
    }
}
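One caveat with the loop above: it searches for exact copies of the first matched substring, so a later occurrence spelled with different accents or case would be missed. A minimal sketch of an alternative, assuming the `index` method of Unicode::Collate (which in list context returns a position and a length), asks the collator itself for every offset. The helper name `uca_offsets` and the sample string are made up for illustration.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not from the original post): returns [start, end)
# character offsets of every non-overlapping substring of $str that is
# equal to $look at collation level 1.
sub uca_offsets {
    my ($str, $look) = @_;
    # normalization => undef is required for index(); level => 1 compares
    # primary weights only, so accent and case differences are ignored.
    my $collator = Unicode::Collate->new(normalization => undef, level => 1);
    my @spans;
    my $pos = 0;
    # In list context index() returns (position, length), or an empty list
    # when there is no further match.
    while (my ($start, $len) = $collator->index($str, $look, $pos)) {
        push @spans, [$start, $start + $len];
        $pos = $start + ($len || 1);    # resume after this match
    }
    return @spans;
}

# Sample input is an assumption: one accented and one plain occurrence.
my $str = "Îñţérñåţîöñåļîžåţîöñ, international";
for my $span (uca_offsets($str, "Nation")) {
    my ($start, $end) = @$span;
    say "found at: $start,$end (", substr($str, $start, $end - $start), ")";
}
```

Because each match is located by the collator rather than by a literal `index` on the first hit, the accented and the unaccented spelling are both reported.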

Wrong (but working) solution from http://www.perlmonks.org/?node_id=485681

The magic piece of code is:

    $str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;

Code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g; # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
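Why this counts as "wrong" even though it works on the sample: stripping marks only helps for letters that have a canonical decomposition into a base letter plus combining marks. A small sketch, with a hypothetical helper `strip_marks` wrapping the two-liner above and sample words chosen for illustration, shows letters that NFD leaves untouched.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper wrapping the "magic" two-liner from the post.
sub strip_marks {
    my ($str) = @_;
    $str = NFD($str);     # split precomposed letters into base + combining marks
    $str =~ s/\pM//g;     # drop the combining marks
    return $str;
}

# "é" decomposes to "e" + U+0301, so the accent can be stripped.
say strip_marks("café");

# "ø" (U+00F8) and "Ł" (U+0141) have no canonical decomposition, so NFD
# leaves them intact and there is no mark to strip from them.
say strip_marks("Søren"), " ", strip_marks("Łódź");
```

So a plain regex for "Soren" would still fail after stripping, which is one reason a collation-based comparison is the more robust route.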


1 Answer

Editorial Team · Answer 1 · 2026-05-25T17:04:46+00:00

Right solution with UCA (thanks to tchrist):

# check whether $look matches a substring of $str at collation level 1
use 5.014;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
say "match ok!" if @match;

P.S.
“Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.”
© tchrist, “Why does modern Perl avoid UTF-8 by default?”
