The Archive Base


Editorial Team
Asked: May 16, 2026 (2026-05-16T06:34:58+00:00)


I am “attempting” to scrape a web page that has the following structures within the page:

<p class="row">
    <span>stuff here</span>
    <a href="http://www.host.tld/file.html">Descriptive Link Text</a>
    <div>Link Description Here</div>
</p>

I am scraping the webpage using curl:

<?php
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, "http://www.host.tld/");
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($handle);
    curl_close($handle);
?>

After some research, I found that I should not use a regex to parse the HTML returned by curl, and that I should use PHP DOM instead. This is what I have so far:

$newDom = new domDocument;
$newDom->loadHTML($html);
$newDom->preserveWhiteSpace = false;
$sections = $newDom->getElementsByTagName('p');
$nodeNo = $sections->length;
for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    echo $printString . "<br>";
}

I am not pretending that I completely understand this, but I get the gist, and I do get the sections I want. The only issue is that I get only the text of the HTML page, as if I had copied it out of my browser window. What I want is the actual HTML, because I want to extract the links and use them too, like so:

for($i=0; $i<$nodeNo; $i++){
    $printString = $sections->item($i)->nodeValue;
    echo "<a href=\"<extracted link>\">LINK</a> " . $printString . "<br>";
}

As you can see, I cannot get the link because I am only getting the text of the webpage and not the source. I know that curl_exec is returning the HTML because I have tried echoing just that, so I believe the DOM is somehow stripping the HTML that I want.


1 Answer

  1. Editorial Team
    Added an answer on May 16, 2026 at 6:34 am

    According to comments on the PHP manual on DOM, you should use the following inside your loop:

        // Copy the node (and its children) into a scratch document, then serialize it.
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($sections->item($i), true));
        $innerHTML = trim($tmp_dom->saveHTML());
    

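    This sets $innerHTML to the HTML markup of the node. Despite the variable name, the result includes the node's own opening and closing tags, not just its contents.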
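As a minimal sketch of the difference between the text and the markup of a node: since PHP 5.3.6, saveHTML() also accepts a node argument, which avoids the scratch document. The helper name paragraphMarkup() and the sample input are made up for illustration and are not part of the original question or answer.

```php
<?php
// Hypothetical helper (not from the original answer): returns text and markup of every <p>.
function paragraphMarkup(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    $result = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        $result[] = [
            'text'   => $p->nodeValue,       // text only, tags stripped
            'markup' => $dom->saveHTML($p),  // the node serialized with its tags
        ];
    }
    return $result;
}

// Made-up sample input standing in for the page fetched with curl.
$sample = '<p class="row"><span>stuff here</span> '
        . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>';

foreach (paragraphMarkup($sample) as $row) {
    // Escape the markup so the browser shows the source instead of rendering it.
    echo htmlspecialchars($row['markup']), "<br>\n";
}
```

The 'text' entry is what nodeValue gives in the question's loop, while 'markup' keeps the tags and attributes.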

    But I think what you really want is to get the ‘a’ nodes under the ‘p’ node, so do this:

    // Find every <p>, then every <a> nested inside it.
    $sections = $newDom->getElementsByTagName('p');
    $nodeNo = $sections->length;
    for($i=0; $i<$nodeNo; $i++) {
        $sec = $sections->item($i);
        $links = $sec->getElementsByTagName('a');
        $linkNo = $links->length;
        for ($j=0; $j<$linkNo; $j++) {
            // nodeValue is the link text, not its URL.
            $printString = $links->item($j)->nodeValue;
            echo $printString . "<br>";
        }
    }
    

    This prints only the text of each link. To read the URL, call getAttribute('href') on the same 'a' node.
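To get the output the question asks for, the URL has to be read from the 'a' node as well. A rough sketch, assuming the page looks like the sample in the question: extractLinks() is a hypothetical helper name and the input string is made up, while getAttribute() is the standard DOMElement method.

```php
<?php
// Hypothetical helper (not from the original answer):
// collects the href and text of every <a> found inside a <p>.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $found = [];
    foreach ($dom->getElementsByTagName('p') as $p) {
        foreach ($p->getElementsByTagName('a') as $a) {
            $found[] = [
                'href' => $a->getAttribute('href'), // the link target
                'text' => trim($a->nodeValue),      // the visible link text
            ];
        }
    }
    return $found;
}

// Made-up sample input standing in for the page fetched with curl.
$sample = '<p class="row"><span>stuff here</span> '
        . '<a href="http://www.host.tld/file.html">Descriptive Link Text</a></p>';

foreach (extractLinks($sample) as $link) {
    echo '<a href="' . htmlspecialchars($link['href']) . '">LINK</a> '
       . htmlspecialchars($link['text']) . "<br>\n";
}
```

Returning an array rather than echoing inside the loops keeps the parsing separate from the output, so the links can be reused elsewhere.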


© 2021 The Archive Base. All Rights Reserved