I’m writing a spider that needs a load_url function that performs the following for


Editorial Team

Asked: May 13, 2026


I’m writing a spider that needs a load_url function that performs the following for me:

Retry the URL if there is a temporary error, without leaking exceptions.
Not leak memory or file handles.
Use HTTP keep-alive for speed (optional).

URLGrabber looks great on the surface, but it has problems. First I hit a "too many open files" error, which I was able to work around by turning keep-alive off. Then the function started raising socket.error: [Errno 104] Connection reset by peer. That error should have been caught, and possibly a URLGrabberError raised in its place.

I’m running Python 2.6.4.

Does anyone know of a way to fix these issues with URLGrabber or know of another way to accomplish what I need with a different library?


1 Answer

Editorial Team · Answer 1 · 2026-05-13T11:11:20+00:00

If you are writing a web crawler or screen scraper, you may be interested in a dedicated framework such as Scrapy.

You can write quite sophisticated web crawlers with very little code: it takes care of the gory details of scheduling the requests and calls you back with the results, for you to process in whatever way you need. (It’s based on Twisted, but it hides the implementation details from you nicely.)
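
If a full framework is more than the spider needs, the first two requirements from the question (retry temporary errors without leaking exceptions, and don't leak file handles) can be sketched with the standard library alone. This is only a rough sketch under some assumptions: it targets Python 3 (`urllib.request`), not the Python 2.6.4 mentioned in the question; the names `LoadError`, `retries`, `delay`, and `opener` are made up for illustration; and it treats any socket-level error or 5xx response as temporary. It does not attempt keep-alive, since `urllib.request` closes the connection after each request.

```python
import http.client
import time
import urllib.error
import urllib.request


class LoadError(Exception):
    """Hypothetical single exception type that load_url raises on failure."""


def load_url(url, retries=3, delay=1.0, timeout=10.0,
             opener=urllib.request.urlopen):
    """Return the body of `url` as bytes, retrying temporary failures.

    `opener` is a hypothetical hook so the fetch can be swapped out in tests.
    """
    last_error = None
    for attempt in range(retries):
        try:
            # The with-block closes the response (and its socket) even if
            # read() fails, so no file handle is left open.
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as exc:
            exc.close()  # an HTTPError also holds an open response
            if exc.code < 500:
                # Assumption: 4xx responses are permanent, so don't retry.
                raise LoadError("%s: HTTP %d" % (url, exc.code))
            last_error = exc
        except (OSError, http.client.HTTPException) as exc:
            # OSError covers URLError, timeouts, and socket errors such as
            # "[Errno 104] Connection reset by peer".
            last_error = exc
        if attempt < retries - 1:
            time.sleep(delay)
    raise LoadError("%s: giving up after %d attempts (%r)"
                    % (url, retries, last_error))
```

A caller then only has to handle one exception type, for example `try: body = load_url(url)` followed by `except LoadError: ...` to skip the page.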
