I am trying to build a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent sites.
I use Simple HTML DOM for parsing, while the core crawling logic is my own.
The code is posted below, followed by an explanation of the problem.
    private function initiateChildCrawler($parent_Url_Html) {
        global $CFG;
        static $foundLink;
        static $parentID;
        static $urlToCrawl_InstanceOfChildren;
        $forEachCount = 0;
        foreach($parent_Url_Html->getHTML()->find('a') as $foundLink)
        {
            $forEachCount++;
            if($forEachCount<500) {
                $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);
                if($this->validateEduDomain($foundLink->href))
                {
                    //Implement else condition later on
                    $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                    if($parentID != FALSE)
                    {
                        if($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE)
                        {
                            $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                            if($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext)!= FALSE)
                            {
                                $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                                $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);
                                /*if($recursiveCount<1)
                                {
                                    $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                                }*/
                            }
                        }
                    }
                }
            }
        }
    }
As you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes the parent link to the child crawler. An example of a parent link is http://www.berkeley.edu, for which the crawler will find all the links on its main page and return their HTML content. This repeats until the seed URLs are exhausted.
For example:
1-harvard.edu ->>>>> Will find all the links and return their HTML content (by calling childCrawler).
Moves to the next parent in parentCrawler.
2-berkeley.edu ->>>>> Will find all the links and return their HTML content (by calling childCrawler).
The other functions are self-explanatory.
Now the problem:
After the childCrawler completes the foreach loop for each link, the function is unable to exit properly. If I run the script from the CLI, PHP crashes; if I run it in the browser, the script simply terminates.
But if I limit the number of child links crawled to 10 or fewer (by lowering the limit that $forEachCount is compared against), the crawler works fine.
Please help me in this regard.
Message from CLI:
Problem signature:
Problem Event Name: APPCRASH
Application Name: php-cgi.exe
Application Version: 5.3.8.0
Application Timestamp: 4e537939
Fault Module Name: php5ts.dll
Fault Module Version: 5.3.8.0
Fault Module Timestamp: 4e537a04
Exception Code: c0000005
Exception Offset: 0000c793
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1033
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789
Flat Loop Example:
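A minimal sketch of what such a flat (non-recursive) loop could look like. The function name crawlFlat, the $fetchLinks callable and the $URLStack variable are assumptions made for illustration and are not part of the code in the question; $fetchLinks stands in for whatever fetches a page and returns the absolute URLs found on it.

```php
<?php
// Hypothetical sketch: crawlFlat() and $fetchLinks are illustrative names.
// $fetchLinks is assumed to return an array of absolute URLs for a given URL.
function crawlFlat(array $seedUrls, $fetchLinks, $maxUrls = 500)
{
    $URLStack = $seedUrls;      // URLs still waiting to be processed
    $processed = array();       // URLs handled so far, in processing order
    $URLProcessedCount = 0;

    // The counter caps the total work, so the loop cannot run forever.
    while ($URLProcessedCount < $maxUrls) {
        $url = array_shift($URLStack);
        if ($url === null) {
            break;              // stack is empty, nothing left to crawl
        }
        $URLProcessedCount++;
        $processed[] = $url;

        // Newly found links are appended to the stack instead of
        // triggering a nested (recursive) crawl.
        foreach (call_user_func($fetchLinks, $url) as $newUrl) {
            $URLStack[] = $newUrl;
        }
    }

    return $processed;
}
```

Because every page is handled by the same single loop, nesting depth stays constant no matter how many links are discovered.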
This will run until all URLs from the stack are processed, so add a counter (as you already have, in a way, for the foreach) to prevent it from running for too long.
You can make it even more intelligent by not adding URLs to the stack that already exist in it; for that, however, you need to insert only absolute URLs into the stack. I highly suggest that you do this, because there is no need to process a page you've already obtained again (e.g. each page probably contains a link to the homepage). If you want to do this, just increment $URLProcessedCount inside the loop so you keep previous entries as well.
Additionally, I suggest you use the PHP DOMDocument extension instead of simple dom, as it's a much more versatile tool.
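The two suggestions above might be combined roughly as follows. This is only a sketch under stated assumptions: extractLinks, crawlUnique, $fetchHtml and $URLStack are hypothetical names, $fetchHtml is assumed to return the HTML of a URL (or false on failure), and links are assumed to be absolute already (in the question's code, url_to_absolute() would be applied first). One way to read the advice about $URLProcessedCount is to use it as an index into the stack rather than removing entries, so that already processed URLs remain available for the duplicate check.

```php
<?php
// Hypothetical helper: collect href values from an HTML string via DOMDocument.
function extractLinks($html)
{
    if ($html === '') {
        return array();         // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser errors silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Hypothetical sketch: $fetchHtml returns the HTML for a URL, or false.
function crawlUnique(array $seedUrls, $fetchHtml, $maxUrls = 500)
{
    $URLStack = array_values(array_unique($seedUrls));
    $URLProcessedCount = 0;

    // Entries are never removed; the counter doubles as the read position,
    // so in_array() also sees URLs that were already processed.
    while ($URLProcessedCount < $maxUrls && isset($URLStack[$URLProcessedCount])) {
        $url = $URLStack[$URLProcessedCount++];
        $html = call_user_func($fetchHtml, $url);
        if ($html === false) {
            continue;           // fetch failed, move on to the next URL
        }
        foreach (extractLinks($html) as $newUrl) {
            if (!in_array($newUrl, $URLStack, true)) {
                $URLStack[] = $newUrl;
            }
        }
    }

    // Only the URLs that were actually processed.
    return array_slice($URLStack, 0, $URLProcessedCount);
}
```

For large crawls, in_array() on a growing list gets slow; keeping the seen URLs as array keys and checking with isset() would be a cheaper alternative.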