Let’s say I’ve created a web scraping PHP page (getdata.php) that fetches the content of specific website pages with cURL, then saves some useful info to a text file or database.
Pseudo code of getdata.php:
min = get latest search id from database
max = 1000000 (yes, one million different pages)
while (min < max) {
url = "http://www.website.com/page.php?id=".$min
content = getContentFromURL(url)
saveUsefulInfoToDb(content)
min++
set latest search id as min in database
}
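For reference, the fetching step in the pseudo code above could look roughly like this in PHP. This is only a sketch: `buildPageUrl` is a hypothetical helper, the timeout values are assumptions, and `getContentFromURL` is just one possible cURL-based implementation of the function named in the pseudo code.

```php
<?php
// Build the URL for one page id (base URL taken from the pseudo code).
function buildPageUrl(int $id): string
{
    return 'http://www.website.com/page.php?id=' . $id;
}

// Fetch one page with cURL. Returns the response body, or null on failure.
// The timeout values are assumptions; tune them for the target site.
function getContentFromURL(string $url, int $timeoutSeconds = 30): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

A per-request timeout matters here: with a million pages, one hanging request should not stall the whole run.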
That works, but the process is:
- Open getdata.php in a browser
- Wait
- Still wait, because about one million pages will be scraped
- Wait
- And finally the request times out
- Fail
So the problem is that I don’t know how to make this process reasonable. Opening the page in a browser and waiting for it to finish scraping the URLs is, I think, really bad practice.
How can I make getdata.php run in the background, like a cron job?
What is the best way to do it?
Thanks.
Use the following at the top of the code
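The two lines themselves are missing from this post. A pair commonly used for exactly this purpose, and which matches the description that follows, is `set_time_limit(0)` and `ignore_user_abort(true)`. Treat this as an assumption about what was meant, not as the original snippet.

```php
<?php
// Assumed reconstruction: both are standard PHP functions, but the original
// lines are not shown in the post.
set_time_limit(0);        // 0 = no limit on PHP's script execution time
ignore_user_abort(true);  // keep running even if the browser/client disconnects
```

Note that these only affect PHP's own limits; a web server or proxy in front of PHP may still enforce its own request timeout.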
Then use cron to start it each day, or whenever it needs to run. You definitely want this to be a background process and not a web page. Those two lines allow it to run indefinitely, either as a web page or as a command-line script. If you want to keep it as a web page, you can still use cron to ‘fire’ it off with a line like
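The cron line is also missing from this post. As a rough illustration only, the comments below show what such crontab entries might look like; the schedule, the paths and the example.com URL are all assumptions. The small function lets the script tell which of the two ways it was started.

```php
<?php
// Hypothetical crontab entries (schedule, paths and URL are assumptions).
// Run as a command-line script every day at 02:00:
//   0 2 * * * /usr/bin/php /path/to/getdata.php >> /path/to/cron.log 2>&1
// Or 'fire' it as a web page, for example with curl:
//   0 2 * * * curl -s http://www.example.com/getdata.php > /dev/null 2>&1

// True when the script was started by the PHP command-line binary,
// as in the first crontab entry; false when it was requested over HTTP.
function startedFromCli(): bool
{
    return PHP_SAPI === 'cli';
}

echo startedFromCli() ? "started from the command line\n" : "started as a web page\n";
```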
A bit of advice, since I have done this many times: definitely write a logging function that prints to a file so that you can see what the script is doing as it runs; otherwise you will have no visibility. Also build a kill switch into the PHP file so you can tell it to stop running without having to use Unix top or restart Apache. It is probably also a good idea to hard-code a kill time, so that the script stops after a certain hour; otherwise it may run longer than a day, a second instance starts up, and you end up with several running at once.
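A minimal sketch of those three pieces (file logging, a kill switch, and a hard-coded kill time) might look like this. The file names, the cutoff hour, and the function names `logLine`, `shouldStop` and `runScrape` are all hypothetical; the kill switch is implemented here as a "stop file" whose existence tells the loop to exit, which is one simple option among several.

```php
<?php
// All names and values below are assumptions chosen for illustration.
const LOG_FILE  = __DIR__ . '/getdata.log';   // where progress is written
const STOP_FILE = __DIR__ . '/getdata.stop';  // create this file to stop the run
const STOP_HOUR = 23;                         // stop once the hour (0-23) reaches this

// Append one timestamped line to the log file.
function logLine(string $message, string $file = LOG_FILE): void
{
    $line = date('Y-m-d H:i:s') . ' ' . $message . PHP_EOL;
    file_put_contents($file, $line, FILE_APPEND | LOCK_EX);
}

// True if the stop file exists or the kill time has been reached.
function shouldStop(int $currentHour, string $stopFile = STOP_FILE, int $stopHour = STOP_HOUR): bool
{
    clearstatcache(true, $stopFile); // avoid a stale file_exists() result
    return file_exists($stopFile) || $currentHour >= $stopHour;
}

// Process ids from $min up to (not including) $max, checking the kill
// switch before every page. Returns the next id that still needs processing.
function runScrape(
    int $min,
    int $max,
    callable $processPage,
    string $logFile = LOG_FILE,
    string $stopFile = STOP_FILE,
    int $stopHour = STOP_HOUR
): int {
    while ($min < $max) {
        if (shouldStop((int) date('G'), $stopFile, $stopHour)) {
            logLine("stop requested, halting before id $min", $logFile);
            break;
        }
        $processPage($min);
        logLine("processed id $min", $logFile);
        $min++;
    }
    return $min;
}
```

To stop a running scrape, create the stop file (for example `touch getdata.stop` next to the script); the loop exits before the next page, and the returned id can be saved as the latest search id so the next run resumes from there.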