Using some basic web scraping, I am trying to build a price-comparison database that will make searching easier for users. I have several questions:
Should I use file_get_contents() or curl to get the contents of the required web page?
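$link = "http://xyz.com";
$ch = curl_init($link);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
$result = curl_exec($ch);                        // page body, or false on failure
if ($result === false) {
    echo "cURL error: " . curl_error($ch);
}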
Further, every time I crawl a web page, I collect a lot of links to visit next. Working through them may take a long time (days, if you crawl big websites like eBay), so my PHP code will time out. What is the best way to automate this? Can I prevent PHP from timing out by changing the server configuration, or is there another solution?
Are you doing this in the code that’s driving your web page? That is, when someone makes a request, are you crawling right then and there to build the response? If so, then yes, there is definitely a better way.
If you have a list of the sites you need to crawl, you can set up a scheduled job (using cron for example) to run a command-line application (not a web page) to crawl the sites. At that point you should parse out the data you’re looking for and store it in a database. Your site would then just need to point to that database.
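As a rough illustration of that setup, here is a minimal sketch of a command-line crawler, not a definitive implementation. The file name `crawl.php`, the `prices` table, the SQLite file, and the `class="price"` markup that `extractPrice()` looks for are all assumptions made for the example; a real site will need its own parsing logic. Because PHP's CLI defaults `max_execution_time` to 0, a script run this way is not cut off by the time limit that applies to web requests.

```php
<?php
// crawl.php: hypothetical command-line crawler, meant to be started by cron, e.g.
//   0 3 * * * /usr/bin/php /path/to/crawl.php http://xyz.com
// URLs are passed as arguments. Table name and price markup are assumptions.

// Download one page; returns null if the request fails.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // give up on a single page after 30 s
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Pull a price out of assumed markup such as <span class="price">$12.99</span>.
function extractPrice(string $html): ?float
{
    if (preg_match('/class="price"[^>]*>\s*\$?([0-9]+(?:\.[0-9]+)?)/', $html, $m)) {
        return (float) $m[1];
    }
    return null;
}

// Save one scraped price with the time it was fetched.
function storePrice(PDO $db, string $url, float $price): void
{
    $stmt = $db->prepare('INSERT INTO prices (url, price, fetched_at) VALUES (?, ?, ?)');
    $stmt->execute([$url, $price, date('c')]);
}

// Crawl every URL with the given fetcher; returns how many prices were stored.
function runCrawl(PDO $db, array $urls, callable $fetch): int
{
    $db->exec('CREATE TABLE IF NOT EXISTS prices (url TEXT, price REAL, fetched_at TEXT)');
    $stored = 0;
    foreach ($urls as $url) {
        $html = $fetch($url);
        $price = $html === null ? null : extractPrice($html);
        if ($price !== null) {
            storePrice($db, $url, $price);
            $stored++;
        }
    }
    return $stored;
}

// Only crawl when started from the command line with at least one URL.
if (PHP_SAPI === 'cli' && ($argc ?? 0) > 1) {
    $db = new PDO('sqlite:' . __DIR__ . '/prices.sqlite');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    echo runCrawl($db, array_slice($argv, 1), 'fetchPage'), " prices stored\n";
}
```

Passing the fetcher in as a callable keeps `runCrawl()` easy to try out with canned HTML before pointing it at a live site.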
This is an improvement for two reasons:
Performance: In a request/response system like a website, you want to minimize I/O bottlenecks. The response should take as little time as possible, so you want to avoid in-line work wherever you can. By offloading this process to something outside the context of the website and using a local database, you turn a series of external service calls (slow) into a single local database call (much faster).
Code Design: Separation of concerns. This setup modularizes your code a little bit more. You have one module which is in charge of fetching the data and another which is in charge of displaying the data. Neither of them should ever need to know or care about how the other accomplishes its tasks. So if you ever need to replace one (such as finding a better scraping method) you won’t also need to change the other.
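To make the display side of that split concrete, here is a minimal sketch of what the web page could do. It assumes a hypothetical `prices` table with `url`, `price`, and `fetched_at` columns that the crawler fills; the function names are made up for the example. The page code only reads from the local database and never makes an outgoing HTTP request.

```php
<?php
// Hypothetical read side of the site: it knows the table layout,
// but nothing about how the rows were scraped.

// Return the cheapest stored prices, lowest first.
function cheapestPrices(PDO $db, int $limit = 10): array
{
    $stmt = $db->prepare(
        'SELECT url, price, fetched_at FROM prices ORDER BY price ASC LIMIT :n'
    );
    $stmt->bindValue(':n', $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Turn rows into an HTML list, escaping the URL for safe output.
function renderPriceList(array $rows): string
{
    $items = '';
    foreach ($rows as $row) {
        $items .= sprintf("<li>%s: %.2f</li>\n", htmlspecialchars($row['url']), $row['price']);
    }
    return "<ul>\n" . $items . "</ul>\n";
}
```

A page would then call something like `echo renderPriceList(cheapestPrices($db));`. If the scraping method changes later, these two functions stay the same as long as the table layout does.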