The Archive Base

Editorial Team
Asked: May 21, 2026

I have a file with one phrase/term on each line which I read into Perl

I have a file with one phrase/term per line, which I read into Perl from STDIN. I have a list of stopwords (like “á”, “são”, “é”) and I want to compare each of them with each term and remove the term if they are equal. The problem is that I’m not certain of the file’s encoding.

I get this from the file command:

words.txt: Non-ISO extended-ASCII English text

My Linux terminal is set to UTF-8. It shows the right content for some words but not for others. Here is the output for some of them:

condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos

You can see that the 3rd and 5th lines show accented and special characters correctly, while the others don’t. The correct output for the other lines should be: condiã, conteúdos and moçambique.

If I use binmode(STDOUT, ':utf8'), the “incorrect” lines are output correctly, but the ones that were fine before are now broken. For example, the 3rd line becomes:

ajuda, mas nÃ£o resolve

What should I do?

1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 10:04 pm

    It works like this:

    C:\Dev\Perl :: chcp
    Aktive Codepage: 1252.
    
    C:\Dev\Perl :: type mixed-encoding.txt
    eins zwei drei Käse vier fünf Wurst
    eins zwei drei KÃ¤se vier fÃ¼nf Wurst
    
    C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
    eins zwei drei vier fünf
    eins zwei drei vier fünf
    

    Where mixed-encoding.pl goes like this:

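    use strict;
    use warnings;
    use utf8; # source in UTF-8
    use Encode 'decode_utf8';
    use List::MoreUtils 'any';
    
    my @stopwords = qw( Käse Wurst );
    
    while ( <> ) { # read octets
        chomp;
        my @tokens;
        for ( split /\s+/ ) {
            # Try UTF-8 first. If that fails, assume legacy Latin-1.
            # LEAVE_SRC keeps decode_utf8 from modifying $_ in place.
            my $token = eval { decode_utf8 $_, Encode::FB_CROAK | Encode::LEAVE_SRC };
            $token = $_ if $@; # keep the octets as they are, i.e. read them as Latin-1
            push @tokens, $token unless any { $token eq $_ } @stopwords;
        }
        # No encoding layer on STDOUT, so characters below U+0100
        # are written as Latin-1 octets.
        print "@tokens\n";
    }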
    

    Note that the script itself doesn’t have to be encoded in UTF-8. What matters is that, if the source contains non-ASCII literals such as the stopwords, the pragma matches the encoding the file is actually saved in: add use utf8 if the script is saved as UTF-8, and leave it out if it is saved as Latin-1.
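To make the note about use utf8 concrete, here is a minimal sketch. It assumes the snippet itself is saved as UTF-8; the variable names are only illustrative. With the pragma in place, the literal is a string of four characters, and encoding it shows the five octets that a UTF-8 file holds for the same word.

```perl
use strict;
use warnings;
use utf8;               # assumption: this snippet is saved as UTF-8
use Encode 'encode';

# With "use utf8" the literal is read as characters, so ä counts once.
my $word = 'Käse';

# The same word as UTF-8 octets: ä takes two bytes (C3 A4).
my $bytes = encode( 'UTF-8', $word );

printf "%d characters, %d octets\n", length $word, length $bytes;
```

Without the pragma, Perl would read the same UTF-8 source literal as five separate octets, and a comparison against a properly decoded token would never match.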

    Update based on tchrist’s sound advice:

    use strict;
    use warnings;
    # source in Latin1
    use Encode 'decode';
    use List::MoreUtils 'any';
    
    my @stopwords = qw( Käse Wurst );
    
    while ( <> ) { # read octets
        chomp;
        my @tokens;
        for ( split /\s+/ ) {
            # Try UTF-8 first. If that fails, assume 8-bit encoding.
            # LEAVE_SRC keeps decode from modifying $_ in place.
            my $token = eval { decode utf8 => $_, Encode::FB_CROAK | Encode::LEAVE_SRC };
            $token    = decode Windows1252 => $_, Encode::FB_CROAK | Encode::LEAVE_SRC if $@;
            # uc upper-cases the kept tokens; on decoded text this covers ä and ü too.
            push @tokens, uc $token unless any { $token eq $_ } @stopwords;
        }
        print "@tokens\n";
    }
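For the situation in the question (one term per line, Portuguese stopwords, a UTF-8 terminal), the same idea can be sketched as follows. This is an adaptation under assumptions, not part of the scripts above: the helper names decode_mixed and keep_term are hypothetical, the legacy lines are guessed to be Windows-1252, and the sample octets stand in for lines read from STDIN. The one real difference is the output layer, which encodes every term as UTF-8 exactly once.

```perl
use strict;
use warnings;
use utf8;                                # assumption: this sketch is saved as UTF-8
use Encode 'decode';

my %stopword = map { $_ => 1 } qw( á são é );

# Hypothetical helper: strict UTF-8 first, Windows-1252 as the fallback guess.
sub decode_mixed {
    my ($octets) = @_;
    my $text = eval { decode( 'UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC ) };
    return defined $text ? $text : decode( 'cp1252', $octets );
}

# Hypothetical helper: returns the decoded term, or undef if it is a stopword.
sub keep_term {
    my ($octets) = @_;
    my $term = decode_mixed($octets);
    return $stopword{$term} ? undef : $term;
}

# Encode characters once, on the way out, for a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';

# Sample octets in place of STDIN: three legacy lines, one UTF-8 line,
# and two stopwords (one in each encoding).
my @raw_lines = (
    "condi\xE3", "conte\xFAdos", "ajuda, mas n\xC3\xA3o resolve",
    "s\xE3o", "\xC3\xA9", "mo\xE7ambique",
);

for my $octets (@raw_lines) {
    my $term = keep_term($octets);
    print "$term\n" if defined $term;
}
```

With the sample data this prints condiã, conteúdos, ajuda, mas não resolve and moçambique, one per line, and drops the two stopwords. To process a real file, the loop over @raw_lines would be replaced by a chomped read from STDIN. The heuristic is not foolproof: a legacy line whose bytes happen to form valid UTF-8 would be decoded as UTF-8.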
    
© 2021 The Archive Base. All Rights Reserved