Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 9021833
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 16, 20262026-06-16T05:21:03+00:00 2026-06-16T05:21:03+00:00

I’m using cURL to pull a webpage from a server. I pass it to

  • 0

I’m using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.

The webpage contains about three thousand (yikes) table tags, and I’m scraping data from them. There are two kinds of tables, where one or more type B follow a type A.

I’ve profiled my script using microtome(true) calls. I’ve placed calls before and after each stage of my script and subtracted the times from each other. So, if you’ll follow me through my code, I’ll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:

First, I include two files. One handles some parsing, and the other defines two “data structure” classes.

// Imports
include('./course.php');
include('./utils.php');

Includes are inconsequential as far as I know, and so let’s proceed to the cURL import.

//  Execute cURL
$response = curl_exec($curl_handle);

I’ve configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.

// Run about 25 str_replace calls here, to clean up
// then run tidy.



$html = $response; 

//  
//      Prepare some config for tidy
//  
       $config = array(
                  'indent'         => true,
                  'output-xhtml'   => true,
                   'wrap'           => 200);

    //  
    // Tidy up the HTML
    //  

    $tidy = new tidy;
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();

    $html = $tidy;

Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I’m fine with that. However, the next part of the code really barfs. Here’s where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)

//  Get all of the tables in the page

$tables = $dom->getElementsByTagName('table');

//  Create a buffer for the courses

$courses = array();

//  Iterate

$numberOfTables = $tables->length;

for ($i=1; $i <$numberOfTables ; $i++) { 

    $sectionTable = $tables->item($i);
    $courseTable = $tables->item($i-1);

    //  We've found a course table, parse it.

    if (elementIsACourseSectionTable($sectionTable)) {

        $course = courseFromTable($courseTable);
        $course = addSectionsToCourseUsingTable($course, $sectionTable);            

        $courses[] = $course;
    }
}   

For reference, here’s the utility functions that I call:

//  
//  Tell us if a given element is
//  a course section table.
//

function elementIsACourseSectionTable(DOMElement $element){

        $tableHasClass = $element->hasAttribute('class');
        $tableIsCourseTable = $element->getAttribute("class") == "coursetable"; 

        return $tableHasClass && $tableIsCourseTable;
}

//
//  Takes a table and parses it into an 
//  instance of the Course class.
//

function courseFromTable(DOMElement $table){

    $secondRow = $table->getElementsByTagName('tr')->item(1);   
    $cells = $secondRow->getElementsByTagName('td');

    $course = new Course;

    $course->startDate = valueForElementInList(0, $cells);
    $course->endDate = valueForElementInList(1, $cells);        
    $course->name = valueForElementInList(2, $cells);
    $course->description = valueForElementInList(3, $cells);
    $course->credits = valueForElementInList(4, $cells);
    $course->hours = valueForElementInList(5, $cells);
    $course->division = valueForElementInList(6, $cells);
    $course->subject = valueForElementInList(7, $cells);

    return $course;

}


//
//  Takes a table and parses it into an 
//  instance of the Section class.
//

function sectionFromRow(DOMElement $row){

    $cells = $row->getElementsByTagName('td');

    //
    //  Skip any row with a single cell
    //

    if ($cells->length == 1) {
        # code...
        return NULL;
    }

    //
    //  Skip header rows
    //

    if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
        return NULL;
    }


    $section = new Section;

    $section->section = valueForElementInList(0, $cells);
    $section->code = valueForElementInList(1, $cells);
    $section->openSeats = valueForElementInList(2, $cells);     
    $section->dayAndTime = valueForElementInList(3, $cells);        
    $section->instructor = valueForElementInList(4, $cells);        
    $section->buildingAndRoom = valueForElementInList(5, $cells);
    $section->isOnline = valueForElementInList(6, $cells);  

    return $section;

}

//
//  Take a table containing course sections
//  and parse it put the results into a
//  give course object.
//

function addSectionsToCourseUsingTable(Course $course, DOMElement $table){

    $rows = $table->getElementsByTagName('tr');
    $numRows = $rows->length;

    for ($i=0; $i < $numRows; $i++) { 

        $section = sectionFromRow($rows->item($i));

        //  Make sure we have an array to put sections into 

        if (is_null($course->sections)) {
            $course->sections = array();
        }

        //  Skip "meta" rows, since they're not really sections

        if (is_null($section)) {
            continue;
        }

        $course->addSection($section);
    }

    return $course;
}

//
//  Returns the text from a cell
//  with a 
//

function valueForElementInList($index, $list){
    $value =  $list->item($index)->nodeValue;
    $value = trim($value);
    return $value;
}

This code takes 63 seconds. That’s over a minute for a PHP script to pull data from a webpage. Sheesh!

I’ve been advised to split up the workload of my main work loop, but considering the homogenous nature of my data, I’m not entirely sure how. Any suggestions on improving this code are greatly appreciated.

What can I do to improve my code execution time?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-16T05:21:04+00:00Added an answer on June 16, 2026 at 5:21 am

    It turns out that my loop is terribly inefficient.

    Using a foreach cut time in half to about 31 seconds. But that wasn’t fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here’s what we found:

    Using DOMNodeList’s item() accessor is linear, producing exponentially slow processing times in loops. So, removing the first element after each iteration makes the loop faster. Now, we always access the first element of the list. This brought me down to 8 seconds.

    After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:

        $table = $tables->item(0);
    
    while ($table != NULL) {
    
        $table = $tables->item(0);
    
        if ($table === NULL) {
            break;
        }
    
        //
        //  We've found a section table, parse it.
        //
    
        if (elementIsACourseSectionTable($table)) {
    
            $course = addSectionsToCourseUsingTable($course, $table);           
        }
    
        //
        //  Skip the last table if it's not a course section
        //
    
        else if(elementIsCourseHeaderTable($table)){
            $course = courseFromTable($table);
            $courses[] = $course;
        }
    
        //
        //  Remove the first item from the list
        //
    
        $first = $tables->item(0);
        $first->parentNode->removeChild($first);
    
        //
        //  Get the next table to parse
        //
    
        $table = $tables->item(0);
    }
    

    Note that I’ve done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

link Im having trouble converting the html entites into html characters, (&# 8217;) i
I am using jsonparser to parse data and images obtained from json response. When
I am currently running into a problem where an element is coming back from
I'm new to using the Perl treebuilder module for HTML parsing and can't figure
That's pretty much it. I'm using Nokogiri to scrape a web page what has
For some reason, after submitting a string like this Jack’s Spindle from a text
this is what i have right now Drawing an RSS feed into the php,
I am using the SimpleRSS gem to parse a WordPress RSS feed. The only
I'm using v2.0 of ClassTextile.php, with the following call: $testimonial_text = $textile->TextileRestricted($_POST['testimonial']); ... and
I have a French site that I want to parse, but am running into

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.