I’m trying to index many hundreds of web pages.
In Short
- Calling a PHP script from a cron job
- Getting a few (only around 15) of the least recently updated URLs
- Fetching these URLs using cURL
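The selection step can be sketched roughly like this. It is written in Python for brevity, and it assumes the crawler keeps a mapping from each URL to the timestamp of its last update; the function name and the mapping are hypothetical, not taken from the original script.

```python
import heapq


def least_recently_updated(last_updated, limit=15):
    """Return up to `limit` URLs with the oldest update timestamps.

    `last_updated` is an assumed dict of URL -> timestamp (any comparable value).
    """
    # Iterating a dict yields its keys; they are ranked by their stored timestamp.
    return heapq.nsmallest(limit, last_updated, key=last_updated.get)
```

In a database-backed setup the same idea is usually an `ORDER BY` on the last-updated column with a `LIMIT` of 15.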
The Problem
In development everything went fine. But when I started to index many more than a few test pages, cURL stopped working after a few runs. It no longer receives any data from the remote server.
Error messages
These are the errors cURL printed (not all at once):
- couldn’t connect to host
- Operation timed out after 60000 milliseconds with 0 bytes received
I’m working on a virtual server (vServer) and also tried to connect to the remote server from it using Firefox and wget. Those attempts fail as well. But when I connect to that remote server from my local machine, everything works fine.
After waiting a few hours, it works again for a few runs.
To me it looks like a problem on the remote server, or DDoS protection or something like that. What do you guys think?
You should use proxies when you send out a lot of requests, because the site's DDoS protection or a similar setup can block your IP.
Here are some things to note (this is what I used when scraping data from websites):
1. Use proxies.
2. Use random user agents.
3. Use random referers.
4. Add a random delay to the cron runs.
5. Add a random delay between requests.
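A minimal sketch of points 1 to 3 and 5, written in Python with only the standard library. The user agent strings, referers, proxy list and function names are placeholders chosen for illustration, not values from the original setup. In PHP the same settings map to the cURL options `CURLOPT_PROXY`, `CURLOPT_USERAGENT` and `CURLOPT_REFERER`.

```python
import random
import urllib.request

# Hypothetical pools; replace them with values that suit your own crawler.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
]
REFERERS = ["https://www.example.com/", "https://www.example.org/"]
PROXIES = []  # e.g. ["http://proxy1.example.net:8080"]; empty means a direct connection


def build_request(url, rng=random):
    """Return a Request carrying a randomly chosen User-Agent and Referer."""
    return urllib.request.Request(url, headers={
        "User-Agent": rng.choice(USER_AGENTS),
        "Referer": rng.choice(REFERERS),
    })


def make_opener(rng=random):
    """Return an opener that goes through a random proxy if any are configured."""
    if not PROXIES:
        return urllib.request.build_opener()
    proxy = rng.choice(PROXIES)
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}))


def random_delay(min_s=2.0, max_s=10.0, rng=random):
    """Pick a random pause in seconds to sleep between two requests."""
    return rng.uniform(min_s, max_s)
```

A fetch would then look like `make_opener().open(build_request(url), timeout=60)`, followed by `time.sleep(random_delay())` before the next URL. The delay bounds are arbitrary examples and should be tuned to what the target site tolerates.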
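What I would do is make the script run forever and add a sleep between requests.
Just trigger it by visiting its URL once and it will keep running. In PHP this typically requires calling ignore_user_abort(true) and set_time_limit(0), otherwise the script stops when the browser disconnects or the execution time limit is reached.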
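The long-running variant could look roughly like the following Python sketch. `get_batch` and `fetch` are hypothetical callables that the caller supplies (for example, "return the 15 least recently updated URLs" and "download one URL"), and the pause ranges are arbitrary examples. The `max_rounds` and `sleep` parameters exist only so the loop can be tried out without actually waiting.

```python
import random
import time


def run_forever(get_batch, fetch, sleep=time.sleep, max_rounds=None,
                request_pause=(2.0, 10.0), round_pause=(60.0, 300.0)):
    """Fetch batches in a loop, sleeping a random time between requests and rounds.

    max_rounds=None means the loop only ends when the process is stopped.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for url in get_batch():
            try:
                fetch(url)
            except OSError as exc:  # covers refused connections and timeouts
                print(f"skipping {url}: {exc}")
            # Random pause after every request so the traffic is not evenly spaced.
            sleep(random.uniform(*request_pause))
        # Longer random pause before the next batch.
        sleep(random.uniform(*round_pause))
        rounds += 1
    return rounds
```

Catching the error and moving on keeps a single blocked or unreachable host from ending the whole run.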