The Archive Base

Editorial Team
Asked: May 27, 2026 (2026-05-27T22:14:49+00:00)

I am trying to create a simple web crawler in PHP that can crawl .edu domains, given the seed URLs of the parent sites.

I used Simple HTML DOM to implement the crawler; some of the core logic is my own.

I am posting the code below and will try to explain the problem.

private function initiateChildCrawler($parent_Url_Html) {

    global $CFG;
    static $foundLink;
    static $parentID;
    static $urlToCrawl_InstanceOfChildren;

    $forEachCount = 0;
    foreach ($parent_Url_Html->getHTML()->find('a') as $foundLink)
    {
        $forEachCount++;
        // Cap the number of child links examined per parent page
        if ($forEachCount < 500) {
            // Resolve relative links against the parent page URL
            $foundLink->href = url_to_absolute($parent_Url_Html->getURL(), $foundLink->href);

            if ($this->validateEduDomain($foundLink->href))
            {
                //Implement else condition later on
                $parentID = $this->loadSaveInstance->parentExists_In_URL_DB_CRAWL($this->returnParentDomain($foundLink->href));
                if ($parentID != FALSE)
                {
                    // Skip links that have already been crawled
                    if ($this->loadSaveInstance->checkUrlDuplication_In_URL_DB_CRAWL($foundLink->href) == FALSE)
                    {
                        $urlToCrawl_InstanceOfChildren = new urlToCrawl($foundLink->href);
                        if ($urlToCrawl_InstanceOfChildren->getSimpleDomSource($CFG->finalContext) != FALSE)
                        {
                            // Store the fetched HTML and mark the link as crawled
                            $this->loadSaveInstance->url_db_html($urlToCrawl_InstanceOfChildren->getURL(), $urlToCrawl_InstanceOfChildren->getHTML());
                            $this->loadSaveInstance->saveCrawled_To_URL_DB_CRAWL(NULL, $foundLink->href, "crawled", $parentID);

                            /*if($recursiveCount<1)
                            {
                                $this->initiateChildCrawler($urlToCrawl_InstanceOfChildren);
                            }*/
                        }
                    }
                }
            }
        }
    }
}

As you can see, initiateChildCrawler is called by the initiateParentCrawler function, which passes the parent link to the child crawler. An example of a parent link is http://www.berkeley.edu: the crawler finds all the links on its main page and returns their HTML content. This repeats until the seed URLs are exhausted.

For example:
1. harvard.edu: finds all the links and returns their HTML content (by calling childCrawler), then moves on to the next parent in parentCrawler.
2. berkeley.edu: finds all the links and returns their HTML content (by calling childCrawler).

The other functions are self-explanatory.

Now the problem:
After childCrawler completes the foreach loop over the links, the function is unable to exit properly. If I run the script from the CLI, PHP crashes; if I run it in the browser, the script terminates.

But if I limit the number of child links crawled to 10 or fewer (by changing the $forEachCount limit), the crawler works fine.

Please help me in this regard.

Message from CLI:

Problem signature:
Problem Event Name: APPCRASH
Application Name: php-cgi.exe
Application Version: 5.3.8.0
Application Timestamp: 4e537939
Fault Module Name: php5ts.dll
Fault Module Version: 5.3.8.0
Fault Module Timestamp: 4e537a04
Exception Code: c0000005
Exception Offset: 0000c793
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1033
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789


1 Answer

    Editorial Team
    Added an answer on May 27, 2026 at 10:14 pm (2026-05-27T22:14:49+00:00)

    Flat Loop Example:

    1. You start the loop with a stack (used here as a first-in, first-out queue) that contains all the URLs you'd like to process first.
    2. Inside the loop:
      1. You shift the first URL off the stack (you obtain it and it is removed).
      2. If you find new URLs, you push them onto the end of the stack.

    This runs until every URL in the stack has been processed. Because newly found URLs keep being added, add a counter (as you already do in your foreach) to stop it from running for too long:

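    $URLStack = array();
    foreach ($parent_Url_Html->getHTML()->find('a') as $link) {
        $URLStack[] = $link->href; # store the href strings, not the element objects
    }
    $URLProcessedCount = 0;
    while ($URLProcessedCount++ < 500) # the stack can keep growing, so cap the number of URLs processed
    {
        $url = array_shift($URLStack);
        if ($url === null) break; # exit if the stack is empty

        # process $url here and collect any newly found URLs in $newURLs
        $newURLs = array();

        foreach ($newURLs as $newURL) {
            $URLStack[] = $newURL;
        }
    }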
    

    You can make this smarter by not adding URLs that are already in the stack; for that to work, you must insert only absolute URLs. I strongly suggest doing so, because there is no need to process a page you have already fetched (for example, almost every page probably links back to the homepage). To do this, do not shift entries off the stack; instead, use $URLProcessedCount as an index and increment it inside the loop, so that previous entries are kept for the duplicate check:

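    while ($URLProcessedCount < 500) # the stack can keep growing, so cap the number of URLs processed
    {
        if (!isset($URLStack[$URLProcessedCount])) break; # exit when no unprocessed URLs remain
        $url = $URLStack[$URLProcessedCount++];
        # process $url here and append only URLs that are not yet in $URLStack
    }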
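
To make the index-based variant concrete, here is a minimal, self-contained sketch. It is not your crawler: the $pages array is a hypothetical in-memory stand-in for fetching and parsing real pages, and the function name crawlFlat and its $limit parameter are made up for illustration. It shows only the flat loop with a duplicate check and a processing cap.

```php
<?php
// Hypothetical in-memory "web": each absolute URL maps to the links found on that page.
// In a real crawler this lookup would be replaced by fetching and parsing the page.
$pages = array(
    'http://example.edu/'  => array('http://example.edu/a', 'http://example.edu/b'),
    'http://example.edu/a' => array('http://example.edu/', 'http://example.edu/b'),
    'http://example.edu/b' => array('http://example.edu/c', 'http://example.edu/a'),
    'http://example.edu/c' => array(),
);

function crawlFlat(array $pages, array $seeds, $limit = 500)
{
    $queue = array_values(array_unique($seeds)); // URLs in discovery order; entries are never removed
    $seen  = array_fill_keys($queue, true);      // URL => true, for fast duplicate checks
    $processed = 0;                              // doubles as the index of the next URL to process

    while ($processed < $limit && isset($queue[$processed])) {
        $url = $queue[$processed++];

        // "Process" the page: here we only look up its outgoing links.
        $links = isset($pages[$url]) ? $pages[$url] : array();

        foreach ($links as $link) {
            if (!isset($seen[$link])) { // skip URLs that are already queued or processed
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }

    // Only the first $processed entries were actually visited.
    return array_slice($queue, 0, $processed);
}

$visited = crawlFlat($pages, array('http://example.edu/'));
print_r($visited);
```

With the sample data, each of the four pages is visited exactly once, in the order /, /a, /b, /c, even though the pages link back to one another. Keeping a separate $seen array keyed by URL avoids scanning the whole queue with in_array() for every link, which matters once the queue grows to hundreds of entries.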
    

    Additionally, I suggest you use PHP's DOMDocument extension instead of Simple HTML DOM, as it is a much more versatile tool.
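
As a rough sketch of what link extraction could look like with DOMDocument: the $html string below is made-up sample markup standing in for a fetched page, and the helper name extractLinks is hypothetical. The hrefs are returned exactly as written in the markup, so relative ones would still need resolving (for instance with the url_to_absolute function your code already calls) before they go onto the stack.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$html = '<html><body>'
      . '<a href="http://example.edu/about">About</a>'
      . '<a href="/courses">Courses</a>'
      . '<a name="top">No href here</a>'
      . '</body></html>';

function extractLinks($html)
{
    // Real-world HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Anchors without an href (e.g. named anchors) are skipped.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

print_r(extractLinks($html));
```

For the sample markup this yields the two href values and skips the anchor that has none.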

© 2021 The Archive Base. All Rights Reserved