Update: this error can be reproduced simply by running this from the command line:

Question

Asked: June 4, 2026 (2026-06-04T17:15:49+00:00)

Update: this error can be reproduced simply by running this from the command line:

scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future

I’m using Scrapy to crawl a website. Every page I scrape declares itself as UTF-8 encoded:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

But occasionally a page contains byte sequences that are not valid UTF-8, and Scrapy raises errors like:

exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
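The same kind of failure can be reproduced outside Scrapy with plain Python. The snippet below is a minimal sketch using Python 3 syntax; the sample bytes are made up for illustration and are not taken from the page in question. It assumes the stray 0xe8 byte came from a single-byte encoding such as Windows-1252, where it means "è".

```python
# Hypothetical sample: 0xE8 is "è" in Latin-1 / Windows-1252, but in UTF-8
# it announces a three-byte sequence, so the following space is invalid.
page = b"<p>caf\xe8 au lait</p>"

try:
    page.decode("utf-8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason   # why the decoder gave up
    bad_offset = exc.start      # index of the offending byte

# Decoding with the encoding the bytes were (presumably) written in works.
as_cp1252 = page.decode("cp1252")

# Keeping UTF-8 but substituting U+FFFD for undecodable bytes also works,
# at the cost of losing the original character.
lossy = page.decode("utf-8", errors="replace")

print(error_reason, bad_offset)
print(as_cp1252)
print(lossy)
```

Whether Windows-1252 is the right guess for a given page is an assumption that has to be checked against the actual content.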

I still need to scrape these pages, even though they contain bytes that cannot be decoded as UTF-8. Is there a way to tell Scrapy to override the page’s declared encoding and use another one (say, UTF-16) instead?

Here’s where the exception is being caught:

2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
        result = method(response=response, result=result, spider=spider)

1 Answer

Editorial Team · Answer 1 · 2026-06-04T17:15:52+00:00

There has been some work on encoding handling in the latest development version of Scrapy (0.15), so it could be worth trying that version.

Scrapy lets you access the decoded text via response.body_as_unicode(). This handles encoding detection in a similar way to browsers, and you should nearly always use it instead of the raw body. As of Scrapy 0.15, it relies on w3lib.encoding.html_to_unicode, with a little customization.
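To give a feel for what browser-style detection means, here is a simplified sketch that checks, in order, a byte order mark, the charset in the HTTP Content-Type header, and a meta tag in the markup. It is not w3lib's implementation; the function name `guess_encoding` and its parameters are hypothetical, and real detection handles many more cases.

```python
import codecs
import re

# Byte order marks take priority over anything the page declares.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_HEADER_CHARSET = re.compile(r'charset=["\']?([\w-]+)', re.I)
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.I)


def guess_encoding(body, content_type=None, default="utf-8"):
    """Guess an encoding name for raw HTML bytes (illustrative only)."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # Only scan the start of the document, where a meta tag would appear.
    match = _META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


meta = b'<meta content="text/html; charset=utf-8" http-equiv="Content-Type">'
print(guess_encoding(meta))
```

Note that a declared encoding is only a claim: a page can announce UTF-8 in its meta tag and still contain bytes that are not valid UTF-8, which is the situation described in the question.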

Decoding happens lazily, the first time something asks for the unicode body. You can also create a new response from the one your spider receives and specify the encoding yourself, although this shouldn’t normally be necessary.
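If you do want to override the encoding by hand, one possible approach is sketched below. The helpers `pick_encoding` and `redecode` are hypothetical names, the candidate list is an assumption, and the code assumes the response object exposes `body` (raw bytes) and a `replace(encoding=...)` method that returns a copy, as Scrapy's text responses do. Check this against the Scrapy version you are running.

```python
def pick_encoding(body, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes body cleanly, or None."""
    for name in candidates:
        try:
            body.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def redecode(response, candidates=("utf-8", "cp1252")):
    """Return a copy of response labelled with an encoding that decodes its body.

    Assumes response has .body and .replace(encoding=...); if no candidate
    fits, the response is returned unchanged.
    """
    encoding = pick_encoding(response.body, candidates)
    if encoding is None:
        return response
    return response.replace(encoding=encoding)


print(pick_encoding(b"caf\xe8 au lait"))
```

In a spider callback this would be called as `response = redecode(response)` before any selectors are built. Trying UTF-8 first keeps well-formed pages untouched; the fallback order is a judgment call that depends on which encodings the site is likely to mix in.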

It’s not clear from the traceback which bit of code is actually causing the error to happen. Was there any more detail? Another possibility could be that the body is getting truncated somehow.

If these pages are handled correctly by a browser but not by Scrapy, it would be appreciated if you could put together a simple test case and report a bug.
