I’ve set up a program to scrape a real estate site in order to gather some statistical data about the market.
My program will probably call the website about 150 times, and I want to do this once per day. I imagine the site is big enough that it gets roughly 10,000 – 20,000 hits per day (my estimate).
But if I sent these requests all together, would they think they’re being flooded? Would they notice that I’m web scraping and block my IP?
If so, is it important to put a timer in? At the moment, I have a timer that waits 3 to 5 seconds before each call; I’m just wondering if that’s necessary.
I agree with Niklas. However, if you need the data ‘faster’, I would go with a delay of 60 (up to 120) seconds between requests. That should be fine for most servers today at the traffic volume you describe.
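To illustrate the kind of spacing I mean, here is a minimal Python sketch, not a definitive implementation. The function name `fetch_politely` and its parameters are made up for this example, and `fetch` stands in for whatever download function your program already uses. The wait is randomized between the two bounds so the requests do not arrive on a fixed beat.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 60.0,
    max_delay: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing a random interval between calls.

    All names here are hypothetical; `fetch` is your own download function.
    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With 150 pages and a 60 to 120 second wait, a full run takes roughly 2.5 to 5 hours, which still fits comfortably into a once-a-day schedule.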
Also, to stay on the good side of things, please make sure you follow the site’s robots.txt, and check whether it sets any limits there (in terms of delays between requests and which routes you may fetch).
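If your scraper is in Python, the standard library's `urllib.robotparser` can do that check for you. The sketch below parses a made-up robots.txt from a string so it runs without network access; the rules, the `my-stats-bot` user agent and the example.com URLs are all assumptions for illustration, not the real site's values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real site's file will differ.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS)

# "my-stats-bot" is a placeholder user agent name.
allowed = rules.can_fetch("my-stats-bot", "https://example.com/listings/1")
blocked = rules.can_fetch("my-stats-bot", "https://example.com/private/x")
delay = rules.crawl_delay("my-stats-bot")  # None if no Crawl-delay is given

print(allowed, blocked, delay)
```

Against a live site you would instead call `set_url()` with the address of its robots.txt and then `read()`, and skip any URL for which `can_fetch()` returns `False`. Note that `Crawl-delay` is a non-standard directive, so many sites do not set it and `crawl_delay()` will return `None`.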