I need to normalize a string such as “quée”, and I can’t seem to convert the extended ASCII characters such as é, á, í, etc. into their Roman/English equivalents. I’ve tried several different methods, but nothing has worked so far. There is a fair amount of material on this general subject, but I can’t find a working answer to this problem.
Here’s my code:
# transliteration solution (works great with standard chars but doesn't find the
# special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\\x{130}/e/;

# converting into array, then iterating through and replacing the specific char
# ( same result as the above solution )
my @breakdown = split( "", $mystring );
foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "", @breakdown );
}
1) This article should provide a fairly good (if complicated) way.
It provides a solution to converting all accented Unicode characters into the base character + accent; once that is done you can simply remove the accent characters separately.
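The decompose-then-strip idea can be sketched with the core Unicode::Normalize module. This is a minimal sketch, not necessarily what the linked article does: it assumes the input is already a decoded character string (not raw UTF-8 bytes), and `strip_accents` is a hypothetical helper name chosen for illustration.

```perl
use strict;
use warnings;
use utf8;                        # the string literals in this file are UTF-8
use Unicode::Normalize qw(NFD);  # core module; NFD = canonical decomposition

# Hypothetical helper: decompose each accented character into
# base letter + combining mark, then delete the combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}//g;    # remove nonspacing marks (the accents)
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("quée áí"), "\n";   # prints "quee ai"
```

If the string comes from a file or STDIN, it would need to be decoded first (for example with an `:encoding(UTF-8)` layer), otherwise NFD sees bytes rather than characters. As a side note, é is U+00E9, so `"\x{E9}"` rather than `"\x{130}"` is the escape that matches it.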
2) Another option is CPAN:
Text::Unaccent::PurePerl (an improved pure-Perl version of Text::Unaccent)
3) Also, this SO answer proposes
Text::Unidecode:
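A minimal usage sketch, assuming Text::Unidecode is installed from CPAN and that `$mystring` holds a decoded character string (the variable name is borrowed from the question):

```perl
use strict;
use warnings;
use utf8;              # the string literal below is UTF-8
use Text::Unidecode;   # CPAN module; exports unidecode() by default

my $mystring = "quée";
my $ascii    = unidecode($mystring);   # ASCII transliteration of the input

print "$ascii\n";   # prints "quee"
```

Unlike the decomposition approach, unidecode() transliterates any Unicode text to ASCII, so it also changes characters that have no accent to strip.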