Editorial Team
Asked: May 29, 2026

I have to scrape a website where I need to fetch multiple URLs

I have to scrape a website where I need to fetch multiple URLs and then process them one by one. The current process goes roughly like this.

I fetch a base URL and collect all secondary URLs from that page. Then, for each secondary URL, I fetch it, process the page, download some photos (which takes quite a long time), and store the data in a database. Then I fetch the next URL and repeat the process.

I think I am wasting time by fetching each secondary URL at the start of its iteration, so I am trying to fetch the next URLs in parallel while the first iteration is being processed.

The solution I have in mind is to call a PHP script from the main process, say a downloader, which downloads all the URLs (with curl_multi or wget) and stores them in some database.

My questions are:

  • How can I call such a downloader asynchronously? I don’t want my main script to wait until the downloader completes.
  • Is there any place to store the downloaded data other than a database, such as shared memory?
  • Is there any chance that the data gets corrupted while being stored and retrieved, and how can I avoid this?
  • Also, please let me know if anyone has a better plan.
1 Answer

  1. Editorial Team
    Added an answer on May 29, 2026 at 8:41 pm

    When I hear that someone uses curl_multi_exec, it usually turns out they just load it with, say, 100 URLs, wait until all of them complete, process them all, and then start over with the next 100 URLs. I was doing that too, but then I found out that it is possible to remove handles from and add handles to curl_multi while other transfers are still in progress. That saves a lot of time, especially if you reuse already open connections. I wrote a small library to handle a queue of requests with callbacks. I’m not posting the full version here, of course (“small” is still quite a bit of code), but here is a simplified version of the main part to give you the general idea:

    public function launch() {
        // $this->master is assumed to be a handle from curl_multi_init().
        // One slot per connection; easy handles are created lazily and reused.
        $channels = $freeChannels = array_fill(0, $this->maxConnections, NULL);
        $activeJobs = array();
        $running = 0;
        $mrc = CURLM_OK;
        do {
            // pick jobs for free channels:
            while ( !(empty($freeChannels) || empty($this->jobQueue)) ) {
                // take a free channel, (re)init its curl handle and let
                // the queued object set options
                reset($freeChannels);
                $chId = key($freeChannels);
                if (empty($channels[$chId])) {
                    $channels[$chId] = curl_init();
                }
                // array_pop() takes the most recently queued job;
                // use array_shift() instead for first-in, first-out order
                $job = array_pop($this->jobQueue);
                $job->init($channels[$chId]);
                curl_multi_add_handle($this->master, $channels[$chId]);
                $activeJobs[$chId] = $job;
                unset($freeChannels[$chId]);
            }
            $pending = count($activeJobs);

            // launch them:
            if ($pending > 0) {
                // poke it while it wants
                while (($mrc = curl_multi_exec($this->master, $running)) == CURLM_CALL_MULTI_PERFORM);
                // wait for some activity, don't eat CPU
                curl_multi_select($this->master);
                while ($running < $pending && ($info = curl_multi_info_read($this->master))) {
                    // some connection(s) finished, locate that job and run response handler:
                    $pending--;
                    // strict search, so the handle is matched by identity
                    $chId = array_search($info['handle'], $channels, true);
                    $content = curl_multi_getcontent($channels[$chId]);
                    curl_multi_remove_handle($this->master, $channels[$chId]);
                    // free up this channel
                    $freeChannels[$chId] = NULL;
                    if ( !array_key_exists($chId, $activeJobs) ) {
                        // impossible, but...
                        continue;
                    }
                    $activeJobs[$chId]->onComplete($content);
                    unset($activeJobs[$chId]);
                }
            }
        } while ( ($running > 0 && $mrc == CURLM_OK) || !empty($this->jobQueue) );
    }
    

    In my version, the jobs are instances of a separate class, not controllers or models. They only set the cURL options, parse the response, and call a given callback from onComplete.
    With this structure, a new request starts as soon as any request in the pool finishes.
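
    To make the job objects more concrete, here is a minimal sketch of what such a class might look like. The class name UrlJob, its constructor arguments, and the chosen cURL options are assumptions for illustration only; the library described above is not shown in full, so its real classes may differ. Only the init() and onComplete() method names come from the launch() code.

```php
<?php
// Hypothetical job class. UrlJob, $url and $callback are illustrative
// names, not part of the library described in the answer.
final class UrlJob
{
    private $url;
    private $callback;

    public function __construct(string $url, callable $callback)
    {
        $this->url = $url;
        $this->callback = $callback;
    }

    // Called by launch() with a new or reused cURL easy handle.
    public function init($ch): void
    {
        // Clear options left over from the previous job on this channel.
        // curl_reset() keeps the handle's live connections, so reuse still works.
        curl_reset($ch);
        curl_setopt($ch, CURLOPT_URL, $this->url);
        // Needed so that curl_multi_getcontent() returns the response body.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    }

    // Called by launch() with the response body once the transfer finishes.
    public function onComplete($content): void
    {
        ($this->callback)($this->url, $content);
    }
}
```

    A job would then be queued with something like new UrlJob($url, $handler), where $handler receives the URL and the downloaded body. Any response parsing could go into onComplete() before the callback is invoked.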

    Of course, this doesn’t help much if processing, and not just retrieval, takes time. And it isn’t truly parallel handling. But I still hope it helps. 🙂

    P.S. It did the trick for me. 🙂 A job that once took 8 hours now completes in 3-4 minutes using a pool of 50 connections. I can’t describe that feeling. 🙂 I didn’t really expect it to work as planned, because with PHP things rarely work exactly as they are supposed to. It was like “OK, I hope it finishes in at least an hour… Wha… Wait… Already?! 8-O”
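
    If processing is the slow part, one possible way to decouple it from downloading is to have onComplete only store the page and let a separate worker script process stored pages later. The sketch below is an assumption, not part of the library above: spoolWrite(), spoolRead(), and the one-file-per-URL layout are hypothetical. It writes each page to a temporary file and then renames it, so a reader sees either the whole file or no file, which also addresses the question about corrupted data.

```php
<?php
// Hypothetical helpers: the function names and the directory layout
// are assumptions for this sketch.

// Store $content for $url in $dir without ever exposing a partial file.
function spoolWrite(string $dir, string $url, string $content): string
{
    $final = $dir . '/' . sha1($url) . '.html';
    // Temp file in the same directory, so rename() stays on one filesystem.
    $tmp = tempnam($dir, 'tmp');
    if ($tmp === false || file_put_contents($tmp, $content) === false) {
        throw new RuntimeException("cannot write spool file for $url");
    }
    // On POSIX systems, rename() within one filesystem replaces the
    // target atomically.
    if (!rename($tmp, $final)) {
        @unlink($tmp);
        throw new RuntimeException("cannot publish spool file for $url");
    }
    return $final;
}

// Return the stored page for $url, or null if it is not available yet.
function spoolRead(string $dir, string $url): ?string
{
    $path = $dir . '/' . sha1($url) . '.html';
    if (!is_file($path)) {
        return null;
    }
    $data = file_get_contents($path);
    return $data === false ? null : $data;
}
```

    A worker that only looks at *.html files in the spool directory never picks up the temporary files, because those have no .html extension until the rename.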
