I’m writing a spider that needs a load_url function that performs the following for me:
- Retry the URL if there is a temporary error, without leaking exceptions.
- Not leak memory or file handles.
- Use HTTP keep-alive for speed (optional).
URLGrabber looks great on the surface, but I ran into trouble with it. First, I hit a problem with too many open files, which I was able to work around by turning keep-alive off. Then the function started raising socket.error: [Errno 104] Connection reset by peer. That error should have been caught, and possibly a URLGrabberError raised in its place.
I’m running Python 2.6.4.
Does anyone know of a way to fix these issues with URLGrabber or know of another way to accomplish what I need with a different library?
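For the "different library" route, the first two requirements can be covered with the standard library alone. The following is a minimal sketch, not a drop-in replacement for URLGrabber. It is written for Python 3 (on Python 2.6 the corresponding modules would be urllib2 and httplib). The names LoadError, retries, delay, opener and sleep are made up for illustration. It treats HTTP 5xx responses and socket-level errors as temporary and HTTP 4xx responses as permanent, which is an assumption you may want to change. It does not do keep-alive, since urlopen opens a new connection per request, and it does not catch the ValueError that urlopen raises for a malformed URL.

```python
import http.client
import time
import urllib.error
import urllib.request


class LoadError(Exception):
    """Hypothetical wrapper raised instead of low-level network errors."""


def load_url(url, retries=3, delay=1.0, timeout=10.0,
             opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the body of `url` as bytes, retrying temporary failures.

    `opener` and `sleep` are injectable only so the function can be
    exercised without touching the network.
    """
    last_error = None
    for attempt in range(retries):
        try:
            response = opener(url, timeout=timeout)
            try:
                return response.read()
            finally:
                # Always release the socket / file handle.
                response.close()
        except urllib.error.HTTPError as exc:
            exc.close()  # an HTTPError also holds an open response
            if exc.code < 500:
                # Assumed permanent (e.g. 404): do not retry.
                raise LoadError("%s: HTTP %d" % (url, exc.code))
            last_error = exc
        except (OSError, http.client.HTTPException) as exc:
            # Covers URLError, timeouts and "Connection reset by peer".
            last_error = exc
        if attempt + 1 < retries:
            sleep(delay * (attempt + 1))  # simple linear back-off
    raise LoadError("%s: giving up after %d attempts (%r)"
                    % (url, retries, last_error))
```

A caller would then write something like `body = load_url(some_url)` and only ever need to handle LoadError for network failures.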
If you are writing a web crawler or screen scraper, you may want to look at a dedicated framework such as Scrapy.
You can write quite sophisticated web crawlers with very little code: it takes care of all the gory details of scheduling the requests and calls you back with the results, for you to process in whatever way you need (it’s based on Twisted, but it hides the implementation details from you nicely).