When doing a scrape of a site, which would be preferable: using curl, or using Python’s requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I do this I often get an “HTTP 429 Too Many Requests” error, whereas curl seems to avoid it.
I need to update metadata information on 10,000 titles, and I need a way to pull down the information for each of the titles in a parallelized fashion.
What are the pros and cons of using each for pulling down information?
Since you want to parallelize the requests, you should use requests with grequests (if you’re using gevent), or erequests (if you’re using eventlet). You may have to throttle how quickly you hit the website, though, since they may be rate limiting you and refusing requests because you are sending too many in too short a period of time.
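As a rough sketch of the throttling idea, here is one way to fetch many URLs in parallel while spacing the requests out and backing off on 429 responses. Note that it swaps in the standard library's ThreadPoolExecutor instead of grequests or erequests, purely to keep the example small. The names Throttle, get_with_backoff and fetch_all, the user agent string, and the worker count and interval defaults are all made up for illustration; the right values depend on what the site tolerates.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space out calls so that, across all threads, they start at least
    `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        time.sleep(slot - now)


def get_with_backoff(url, max_retries=3):
    """Fetch one URL with requests, backing off when the server returns 429."""
    # Third-party library; imported here so the throttling code runs without it.
    import requests

    headers = {"User-Agent": "metadata-updater/0.1"}  # hypothetical user agent
    for attempt in range(max_retries + 1):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()  # raise on other 4xx/5xx errors
            return response.text
        # Honor Retry-After when it is a number of seconds;
        # otherwise fall back to exponential backoff (1s, 2s, 4s, ...).
        retry_after = response.headers.get("Retry-After", "")
        time.sleep(int(retry_after) if retry_after.isdigit() else 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")


def fetch_all(urls, fetch=get_with_backoff, workers=4, min_interval=0.5):
    """Fetch every URL in parallel; results come back in the same order as `urls`."""
    throttle = Throttle(min_interval)

    def task(url):
        throttle.wait()  # blocks until this request's time slot arrives
        return fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```

Calling fetch_all(list_of_title_urls) would then return the response bodies in order. With min_interval=0.5, 10,000 titles take at least about 83 minutes no matter how many workers you use, so the interval is the knob to tune against the site's limits.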