I have an external HTML source that I want to scrape and either transform

Question

Asked: May 13, 2026 (2026-05-13T13:29:12+00:00)

I have an external HTML source that I want to scrape and either transform into a local XML file or add to a MySQL DB.

The external source is mostly normalized and (somewhat) semantic, so all I need to do is use XPath to get all td content or all li content, etc. The problem is that these items occasionally use <strong>, <b>, or <i> tags to style the text I need.

This is technically semantic, since the point is to add emphasis to the specific text, and the developer might want to use CSS that isn’t the browser default.

The problem is that the content I am trying to grab is a child of this inline element, and PHP extensions like SimpleXML or DOM (DOMDocument, DOMNode) treat it as such. For example:

<table>
<tr><td>Thing 1</td><td>Thing 2</td></tr>
<tr><td>Thing 3</td><td>Thing 4</td></tr>
<tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
</table>

Will result in:

 [table] =>
    [tr] =>
        [td] => Thing 1
        [td] => Thing 2
    [tr] =>
        [td] => Thing 3
        [td] => Thing 4
    [tr] =>
        [td] => 
            [strong] => Thing 5
        [td] => 
            [strong] => Thing 6

Obviously this is not exactly what SimpleXML returns, but it reflects the general problem.
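
To make the problem concrete, here is a minimal sketch, assuming the sample table above is loaded with simplexml_load_string(). The variable names $direct and $nested are only illustrative. Casting a SimpleXML element to a string yields only the text that sits directly inside that element, so the cells wrapped in <strong> come back empty:

```php
<?php
// Sample table from the question.
$table = simplexml_load_string(
    '<table>
        <tr><td>Thing 1</td><td>Thing 2</td></tr>
        <tr><td>Thing 3</td><td>Thing 4</td></tr>
        <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
    </table>'
);

// Text directly inside each td; text of child elements is not included.
$direct = [];
foreach ($table->xpath('//td') as $td) {
    $direct[] = (string) $td;
}

// The "missing" text lives one level down, inside the strong children.
$nested = [];
foreach ($table->xpath('//td/strong') as $strong) {
    $nested[] = (string) $strong;
}

print_r($direct);
print_r($nested);
```

The last two entries of $direct are empty strings, while $nested holds "Thing 5" and "Thing 6".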

So is there a way, using either an option already built into DOMDocument or a more sophisticated XPath query, to get the contents of the td element with any child elements flattened, so that all of their text is treated as the text of the queried element?

Right now, the only solutions I have are to either:

a) have a foreach loop that checks each result, like:

$result_text = isset($result->strong) ? (string) $result->strong : (string) $result;

b) use a regex to strip any <strong> tags out of the HTML string before importing it into a pre-built class like SimpleXML or DOMDocument.

1 Answer

Editorial Team · Answer 1 · 2026-05-13T13:29:12+00:00

Can’t you just use strip_tags() to remove the extra markup?

// Sample table from the question.
$table = simplexml_load_string(
    '<table>
        <tr><td>Thing 1</td><td>Thing 2</td></tr>
        <tr><td>Thing 3</td><td>Thing 4</td></tr>
        <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
    </table>'
);

foreach ($table->xpath('//td') as $td)
{
    // asXML() returns the cell's markup; strip_tags() removes the tags
    // but leaves entities such as &amp; escaped.
    $content = strip_tags($td->asXML());
    echo $content, "\n";
}
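
If DOMDocument is an option, a possible alternative is to read each cell's textContent property, which returns the text of the node together with all of its descendants. The following is a minimal sketch rather than a definitive solution: it assumes the sample table from the question, and the $html and $cells names are only illustrative.

```php
<?php
// Sample table from the question.
$html = '<table>
    <tr><td>Thing 1</td><td>Thing 2</td></tr>
    <tr><td>Thing 3</td><td>Thing 4</td></tr>
    <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
</table>';

// Collect parser warnings internally instead of printing them;
// scraped real-world HTML often triggers them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$cells = [];
foreach ($xpath->query('//td') as $td) {
    // textContent concatenates the text of the node and all descendants,
    // so <strong>, <b>, or <i> wrappers do not matter.
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

When staying with SimpleXML, dom_import_simplexml($td)->textContent should give the same text for a single cell. Unlike strip_tags() on asXML() output, textContent returns decoded text, so entities such as &amp; come back as plain characters.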
