I have a file with one phrase/term on each line, which I read into Perl

Asked: May 21, 2026 (2026-05-21T22:04:28+00:00)

I have a file with one phrase or term per line, which I read into Perl from STDIN. I also have a list of stopwords (like “á”, “são”, “é”), and I want to compare each of them with each term and remove the term if they are equal. The problem is that I’m not certain of the file’s encoding.

I get this from the file command:

words.txt: Non-ISO extended-ASCII English text

My Linux terminal is set to UTF-8, and it shows some words correctly but not others. Here is the output for some of them:

condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos

You can see that the 3rd and 5th lines display accented and special characters correctly while the others don’t. The correct output for the other lines would be: condiã, conteúdos and moçambique.

If I use binmode(STDOUT, ':utf8'), the “incorrect” lines now print correctly while the others don’t. For example, the 3rd line becomes:

ajuda, mas nÃ£o resolve

What should I do?

1 Answer

Editorial Team · Answer 1 · 2026-05-21T22:04:29+00:00

It works like this:

C:\Dev\Perl :: chcp
Aktive Codepage: 1252.

C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei KÃ¤se vier fÃ¼nf Wurst
eins zwei drei Käse vier fünf Wurst

C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf

Where mixed-encoding.pl goes like this:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}

Note that the script itself doesn’t have to be encoded in UTF-8. What matters is that any non-ASCII literals in the script (here, the stopwords) are interpreted in the encoding the script file is actually saved in: put use utf8 in the script if it is saved as UTF-8, and leave it out if it isn’t.
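To make the point about matching encodings concrete, here is a small sketch (not part of the original script). It uses hex escapes so the file stays pure ASCII; the variable names are only illustrative. It compares a decoded token against the stopword literal as Perl would see it with and without use utf8 in a UTF-8 encoded source file.

```perl
use strict;
use warnings;
use Encode 'decode';

# "Käse" as Perl would see the literal in two situations
# (escapes stand in for the literal so this file stays ASCII):
my $chars       = "K\x{E4}se";    # UTF-8 source WITH "use utf8": 4 characters
my $utf8_octets = "K\xC3\xA4se";  # UTF-8 source WITHOUT "use utf8": 5 raw octets

# A token as it arrives from the input file, decoded into characters.
my $token = decode('UTF-8', "K\xC3\xA4se");

print "character literal matches: ", ($token eq $chars       ? "yes" : "no"), "\n";
print "octet literal matches:     ", ($token eq $utf8_octets ? "yes" : "no"), "\n";
```

Only the character form compares equal to the decoded token, which is why the stopword list and the pragma have to agree with how the script file is saved.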

Update based on tchrist’s sound advice:

use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume 8-bit encoding.
        my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
        $token    = decode Windows1252 => $_, Encode::FB_CROAK if $@;
        push @tokens, uc $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}
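For the data in the question (one phrase per line, Portuguese stopwords), the same idea could be sketched roughly as follows. This is an illustration under assumptions, not a drop-in solution: decode_mixed is a hypothetical helper name, the sample octets stand in for lines read from STDIN, the fallback encoding is assumed to be Windows-1252, and a whole line is treated as one term.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, fall back to Windows-1252.
sub decode_mixed {
    my ($octets) = @_;
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('cp1252', $octets);
}

# Stopwords from the question (á, são, é), written as escapes
# so this file stays ASCII regardless of how it is saved.
my %is_stopword = map { $_ => 1 } ("\x{E1}", "s\x{E3}o", "\x{E9}");

# Sample octets standing in for lines read from STDIN (assumed data).
my @lines = (
    "condi\xE3",            # Latin-1 bytes
    "s\xE3o",               # Latin-1 bytes, a stopword
    "s\xC3\xA3o",           # UTF-8 bytes, the same stopword
    "mo\xE7ambique",        # Latin-1 bytes
    "n\xC3\xA3o resolve",   # UTF-8 bytes
);

# Decode every line to characters, then drop lines equal to a stopword.
my @kept = grep { !$is_stopword{$_} } map { decode_mixed($_) } @lines;

# Encode once on output so a UTF-8 terminal shows every line correctly.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @kept;
```

Decoding everything to characters on input and encoding once on output avoids the situation from the question, where either the Latin-1 lines or the UTF-8 lines came out garbled depending on the binmode setting.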
