Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 641097
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 13, 20262026-05-13T21:01:33+00:00 2026-05-13T21:01:33+00:00

My code just scrapes a web page, then converts it to Unicode. html =

  • 0

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:


Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-13T21:01:33+00:00Added an answer on May 13, 2026 at 9:01 pm

    2018 Update:

    As of February 2018, using compressions like gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and Stack Exchange Network sites).
    If you do a simple decode like in the original answer with a gzipped response, you’ll get an error like or similar to this:

    UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x8b in position 1: unexpected code byte

    In order to decode a gzpipped response you need to add the following modules (in Python 3):

    import gzip
    import io
    

    Note: In Python 2 you’d use StringIO instead of io

    Then you can parse the content out like this:

    response = urlopen("https://example.com/gzipped-ressource")
    buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
    gzipped_file = gzip.GzipFile(fileobj=buffer)
    decoded = gzipped_file.read()
    content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource
    

    This code reads the response, and places the bytes in a buffer. The gzip module then reads the buffer using the GZipFile function. After that, the gzipped file can be read into bytes again and decoded to normally readable text in the end.

    Original Answer from 2010:

    Can we get the actual value used for link?

    In addition, we usually encounter this problem here when we are trying to .encode() an already encoded byte string. So you might try to decode it first as in

    html = urllib.urlopen(link).read()
    unicode_str = html.decode(<source encoding>)
    encoded_str = unicode_str.encode("utf8")
    

    As an example:

    html = '\xa0'
    encoded_str = html.encode("utf8")
    

    Fails with

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
    

    While:

    html = '\xa0'
    decoded_str = html.decode("windows-1252")
    encoded_str = decoded_str.encode("utf8")
    

    Succeeds without error. Do note that “windows-1252” is something I used as an example. I got this from chardet and it had 0.5 confidence that it is right! (well, as given with a 1-character-length string, what do you expect) You should change that to the encoding of the byte string returned from .urlopen().read() to what applies to the content you retrieved.

    Another problem I see there is that the .encode() string method returns the modified string and does not modify the source in place. So it’s kind of useless to have self.response.out.write(html) as html is not the encoded string from html.encode (if that is what you were originally aiming for).

    As Ignacio suggested, check the source webpage for the actual encoding of the returned string from read(). It’s either in one of the Meta tags or in the ContentType header in the response. Use that then as the parameter for .decode().

    Do note however that it should not be assumed that other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Ask A Question

Stats

  • Questions 420k
  • Answers 420k
  • Best Answers 0
  • User 1
  • Popular
  • Answers
  • Editorial Team

    How to approach applying for a job at a company ...

    • 7 Answers
  • Editorial Team

    What is a programmer’s life like?

    • 5 Answers
  • Editorial Team

    How to handle personal stress caused by utterly incompetent and ...

    • 5 Answers
  • Editorial Team
    Editorial Team added an answer In Access you need parentheses if you have more than… May 15, 2026 at 10:41 am
  • Editorial Team
    Editorial Team added an answer Redefining std as an alias is okay, as long as… May 15, 2026 at 10:41 am
  • Editorial Team
    Editorial Team added an answer Your data access code won't compile without an existing database… May 15, 2026 at 10:40 am

Trending Tags

analytics british company computer developers django employee employer english facebook french google interview javascript language life php programmer programs salary

Top Members

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.