I’ve made a basic web crawler to scrape info from a website. I estimated that it should take around 6 hours (the number of pages multiplied by how long it takes to grab the info from each one), but after around 30-40 minutes of looping through my function it stops working, and I only have a fraction of the info I wanted. While it is working, the page looks like it’s loading and it outputs where it’s up to on the screen, but when it stops, the page stops loading and the output stops showing.
Is there any way that I can keep the page loading so I don’t have to start it again every 30 minutes?
EDIT: Here’s my code:
function scrape_ingredients($recipe_url, $recipe_title, $recipe_number, $this_count) {
    $page = file_get_contents($recipe_url);
    $edited = str_replace("<h2 class=\"ingredients\">", "<h2 class=\"ingredients\"><h2>", $page);
    $split = explode("<h2 class=\"ingredients\">", $edited);
    preg_match("/<div[^>]*class=\"module-content\">(.*?)<\\/div>/si", $split[1], $ingredients);
    $ingred = str_replace("<ul>", "", $ingredients[1]);
    $ingred = str_replace("</ul>", "", $ingred);
    $ingred = str_replace("<li>", "", $ingred);
    $ingred = str_replace("</li>", ", ", $ingred);
    echo $ingred;
    mysql_query("INSERT INTO food_tags (title, link, ingredients) VALUES ('$recipe_title', '$recipe_url', '$ingred')");
    echo "<br><br>Recipes indexed: $recipe_number<hr><br><br>";
}
$get_urls = mysql_query("SELECT * FROM food_recipes WHERE id>3091");
while($row = mysql_fetch_array($get_urls)) {
    $count++;
    $thiscount++;
    scrape_ingredients($row['link'], $row['title'], $count, $thiscount);
    sleep(1);
}
What’s the max_execution_time value in your php.ini? It must be set to 0 (or the script must call set_time_limit(0) itself) for the script to be able to run indefinitely.
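If you can’t or don’t want to edit php.ini, a minimal sketch of the same idea is to lift the limit from inside the script, at the very top, before the loop starts. The ignore_user_abort() call is my own addition and an assumption about your setup: it is only useful if the browser connection dropping is part of the problem. The web server may also enforce its own timeout, which this does not change.

```php
<?php
// Remove the execution time limit for this script only (0 = no limit).
// Equivalent to max_execution_time = 0 in php.ini, but scoped to this run.
set_time_limit(0);

// Assumption: keep the script running even if the browser disconnects.
ignore_user_abort(true);

// ... the scraping loop from the question would follow here ...
```

Running the script from the command line (php yourscript.php, where yourscript.php stands for whatever your file is called) is another option, since the CLI has no execution time limit by default.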