I am trying to create a simple web crawler using PHP that is capable

Question


Asked: May 27, 2026 (2026-05-27T22:14:49+00:00)


I am trying to create a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent sites.

I have used Simple HTML DOM to implement the crawler, while some of the core logic is my own.

I am posting the code below and will try to explain the problems.

private function initiateChildCrawler($parent_Url_Html) {

    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach($parent_Url_Html->getHTML()->find('a') as $foundLink) 
    {
        $forEachCount++;
        if($forEachCount<500) {
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if($this->validateEduDomain($foundLink->href))
            {
                //Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if($parentID != FALSE)
                {
                    if($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE)
                    {
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext)!= FALSE)
                        {
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);

                            /*if($recursiveCount<1)
                            {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }*/
                        }
                    }
                }
            }
        }
    }
}

initiateChildCrawler is called by the initiateParentCrawler function, which passes each parent link to the child crawler. An example of a parent link is http://www.berkeley.edu: the crawler finds all the links on its main page and returns their HTML content. This repeats until the seed URLs are exhausted.

For example:
1. harvard.edu: the crawler finds all the links and returns their HTML content (by calling childCrawler), then moves to the next parent in parentCrawler.
2. berkeley.edu: the crawler finds all the links and returns their HTML content (by calling childCrawler).

The other functions are self-explanatory.

Now the problem:
After the childCrawler completes the foreach loop over the links, the function is unable to exit properly. When I run the script from the CLI, the PHP process crashes. When I run it in the browser, the script terminates.

If I lower the limit on child links to 10 or fewer (by changing the value that $forEachCount is compared against), the crawler works fine.

Please help me in this regard.

Message from CLI:

Problem signature:
Problem Event Name: APPCRASH
Application Name: php-cgi.exe
Application Version: 5.3.8.0
Application Timestamp: 4e537939
Fault Module Name: php5ts.dll
Fault Module Version: 5.3.8.0
Fault Module Timestamp: 4e537a04
Exception Code: c0000005
Exception Offset: 0000c793
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1033
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789


1 Answer

Editorial Team · Answer 1 · 2026-05-27T22:14:49+00:00

Flat Loop Example:

You start the loop with a stack containing all the URLs you would like to process first. (Because URLs are taken from the front and added at the back, it behaves as a first-in, first-out queue.)
Inside the loop:
1. You shift the first URL off the stack (you obtain it and it is removed).
2. If you find new URLs, you push them onto the end of the stack.

This runs until every URL in the stack has been processed, so add a counter (as you already do in your foreach) to prevent it from running for too long:

$URLStack = (array) $parent_Url_Html->getHTML()->find('a');
$URLProcessedCount = 0;
while ($URLProcessedCount++ < 500) # the stack can keep growing, so this cap stops us from processing too many URLs
{
    $url = array_shift($URLStack);
    if (!$url) break; # exit if the stack is empty

    # process URL

    # for each new URL found while processing ($newURL is a placeholder):
    $URLStack[] = $newURL;
}

You can make this smarter by not adding URLs that are already in the stack. For that to work, you must insert only absolute URLs. I strongly suggest doing this, because there is no need to process a page you have already fetched (for example, nearly every page links back to the homepage). To do so, increment $URLProcessedCount inside the loop and index into the stack instead of shifting, so that previous entries are kept:

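while ($URLProcessedCount < 500) # the stack can keep growing, so this cap stops us from processing too many URLs
{
    if (!isset($URLStack[$URLProcessedCount])) break; # exit once every URL has been processed
    $url = $URLStack[$URLProcessedCount++];

    # process URL; push only URLs that are not already in $URLStack
}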
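
To make the indexed loop with duplicate checking concrete, here is a minimal self-contained sketch. It makes several assumptions that are not in the original code: the crawlFlat() function and the $linkGraph array are hypothetical, the array stands in for fetching a page and extracting its absolute links, and the syntax requires PHP 7 or later. It shows only the control flow, not a complete crawler.

```php
<?php
// Hypothetical stand-in for "fetch a page and extract its absolute links".
// A real crawler would download and parse each page instead.
$linkGraph = [
    'http://a.edu/'  => ['http://a.edu/x', 'http://a.edu/y', 'http://a.edu/'],
    'http://a.edu/x' => ['http://a.edu/y', 'http://a.edu/z'],
    'http://a.edu/y' => ['http://a.edu/'],
    'http://a.edu/z' => [],
];

/**
 * Crawl from the seed URLs with a flat loop (no recursion).
 * Returns the URLs in the order they were processed.
 */
function crawlFlat(array $seeds, array $linkGraph, int $limit = 500): array
{
    $urlStack = array_values(array_unique($seeds)); // URLs to process, in order
    $seen = array_fill_keys($urlStack, true);       // keyed lookup for duplicates
    $processedCount = 0;

    // Index into the list instead of shifting, so earlier entries are kept.
    while ($processedCount < $limit && isset($urlStack[$processedCount])) {
        $url = $urlStack[$processedCount++];

        // "Process" the URL: here we only look up its outgoing links.
        foreach ($linkGraph[$url] ?? [] as $newUrl) {
            if (!isset($seen[$newUrl])) {
                $seen[$newUrl] = true;
                $urlStack[] = $newUrl; // push only URLs not seen before
            }
        }
    }

    return array_slice($urlStack, 0, $processedCount);
}

$order = crawlFlat(['http://a.edu/'], $linkGraph);
print_r($order);
```

Keeping a separate $seen array keyed by URL avoids scanning the whole stack with in_array() for every link, which matters once the stack holds hundreds of entries.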

Additionally, I suggest you use PHP's DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
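
As a rough illustration of the DOMDocument suggestion, the sketch below pulls the href values out of an HTML string. The extractLinks() helper and the sample markup are hypothetical, and PHP 7 or later with the DOM extension enabled is assumed. The returned links can still be relative, so they would need to be made absolute (for example with the url_to_absolute() call already used in the question) before being added to the stack.

```php
<?php
/**
 * Return the href values of all <a> elements in an HTML string.
 * extractLinks() is a hypothetical helper, not part of the question's code.
 */
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when the attribute is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return $links;
}

$html = '<html><body><a href="/admissions">Admissions</a>'
      . '<a name="top">No href</a>'
      . '<a href="http://www.berkeley.edu/">Home</a></body></html>';

$links = extractLinks($html);
print_r($links);
```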
