Editorial Team
Asked: June 6, 2026

I have an encoding issue in perl when trying to pull back global addresses

I have an encoding issue in Perl when trying to pull back global addresses from webpages, using LWP::UserAgent to fetch the pages and Encode for character encoding. I’ve tried googling solutions but nothing seems to work. I’m using Strawberry Perl 5.12.3.

As an example, take the address page of the US embassy in the Czech Republic (http://prague.usembassy.gov/contact.html). All I want is to pull back the address:

Address: Tržiště 15 118 01 Praha 1 – Malá Strana Czech Republic

Firefox displays this correctly using the UTF-8 character encoding, which matches the charset in the webpage header. But when I use Perl to pull the address back and write it to a file, the encoding looks messed up, despite using decoded_content from LWP::UserAgent or Encode::decode.

I’ve tried running a regex over the data to check that the error isn’t introduced when the data is printed (i.e. that the string is correct internally in Perl), but the error seems to be in how Perl handles the encoding.

Here’s my code:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode ();
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open(my $out, '>', $output_file) or die "Could not open output file $output_file: $!";
binmode $out, ":utf8";
binmode STDOUT, ":utf8";

# write the same line to the terminal and to the output file
sub report {
    my ($line) = @_;
    print $line;
    print {$out} $line;
}

# US embassy in Czech Republic webpage
my $url = "http://prague.usembassy.gov/contact.html";

my $ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url"; }

report('CONTENT TYPE: '.$ua_response->content_charset."\n");

my $address_re = qr/<p><b>Address(.*?)<\/p>/;

# get the content without decoding (raw bytes)
my ($content_not_decoded) = $ua_response->content =~ $address_re;
# let LWP decode the content
my ($content_ua_decoded) = $ua_response->decoded_content =~ $address_re;
# use Encode to decode the raw content
my ($content_Encode_decoded) = Encode::decode_utf8($ua_response->content) =~ $address_re;
# try both! (note: as written this repeats the Encode step on the raw content)
my ($content_double_decoded) = Encode::decode_utf8($ua_response->content) =~ $address_re;

report('UNDECODED CONTENT:'.$content_not_decoded."\n");
report('DECODED CONTENT:'.$content_ua_decoded."\n");
report('ENCODE::DECODED CONTENT:'.$content_Encode_decoded."\n");
report('DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n");

# check for an ampersand (i.e. a leftover &#NNN; sequence) in the strings
# (to guard against the error coming in the print statement)
if ($content_not_decoded =~ /&/) {
    report("AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n");
}
if ($content_ua_decoded =~ /&/) {
    report("AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n");
}
if ($content_Encode_decoded =~ /&/) {
    report("AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n");
}
if ($content_double_decoded =~ /&/) {
    report("AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n");
}

close($out);
exit;

And here’s the output to terminal:

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
Tr├à┬╛išt├ä┬¢
15
118 01 Praha 1 – Malá Strana
Czech Republic
DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 –
Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 –
Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::Tr┼╛išt─¢ 15
118 01 Praha 1 – Malá StranaCzech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

And here’s the output to the file (note that it differs slightly from the terminal output but is still not correct). OK, wow: this displays correctly on Stack Overflow but not in Bluefish, LibreOffice, Excel, Word or anything else on my computer. So the data is there, just incorrectly encoded. I really don’t get what’s going on.

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
TržištÄ
15
118 01 Praha 1 – Malá Strana
Czech Republic
DECODED CONTENT::
Tržiště 15
118 01 Praha 1 –
Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tržiště 15
118 01 Praha 1 – Malá
Strana
Czech Republic
DOUBLE-DECODED CONTENT::Tržiště 15
118 01 Praha 1 – Malá StranaCzech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

Any pointers on how to make this work would be really appreciated.

Thanks,
Ian/Montecristo

1 Answer

  1. Editorial Team
    Added an answer on June 6, 2026 at 12:27 pm

    The mistake is using a regex to parse HTML. At the very least, you are not decoding the HTML entities. You can do that manually, or leave it to a robust parser:

    use strictures;
    use Web::Query 'wq';
    use autodie qw(:all);
    
    open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
    print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
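
For the manual route, a minimal sketch using only core modules might look like the following. It is an illustration under stated assumptions, not the page's actual markup: the sample byte string, the `%named_entity` table and the `decode_some_entities` helper are all made up for this example, and the table covers only a handful of entities. In practice, `decode_entities` from HTML::Entities handles the full set. The sketch decodes the raw bytes to characters first, then decodes the entities, and finally prints the code points so the result can be checked without relying on how the terminal renders it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample standing in for the fetched page body: raw UTF-8
# bytes for two letters, HTML entities for the rest (an assumption).
my $raw_bytes = "Tr\xC5\xBEi&scaron;t\xC4\x9B 15, Mal&aacute; Strana &#8211; Praha 1";

# Illustrative subset only; real pages can use many more named entities.
my %named_entity = (
    scaron => "\x{0161}",    # s with caron
    aacute => "\x{00E1}",    # a with acute
    ndash  => "\x{2013}",    # en dash
    amp    => '&',
);

# Decode numeric (decimal and hex) and a few named entities in one pass,
# so that text produced by one entity is never decoded a second time.
sub decode_some_entities {
    my ($text) = @_;
    $text =~ s{&(?:#([0-9]+)|#[xX]([0-9A-Fa-f]+)|(\w+));}{
        defined $1 ? chr($1)
      : defined $2 ? chr(hex $2)
      : exists $named_entity{$3} ? $named_entity{$3}
      : "&$3;"    # unknown name: leave untouched
    }ge;
    return $text;
}

# 1. Bytes -> characters.
my $chars = decode('UTF-8', $raw_bytes);

# 2. Entities -> characters.
my $address = decode_some_entities($chars);

# 3. List the code points of the first word to verify the internal string.
my $code_points = join ' ',
    map { sprintf 'U+%04X', ord($_) } split //, substr($address, 0, 7);

# Encode explicitly on the way out; the code point list is plain ASCII.
print encode('UTF-8', "$address\n");
print "$code_points\n";
```

If the code point line shows U+017E, U+0161 and U+011B, the string is correct inside Perl, and any remaining garbling comes from how the terminal or editor interprets the output bytes.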
    