I have been trying to parse webpages by use of the HTML DOMObject in

Question

Asked: May 28, 2026

I have been trying to parse web pages with PHP's DOMDocument so that an application can scan them for SEO quality.

However, I have run into a problem. For testing purposes I've written a small HTML page containing the following incorrect HTML:

<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>

As you can see, the title is outside the head tag, which is the error I am trying to detect.

Now comes the problem: when I use cURL to fetch the response from this page and then pass it to the DOMDocument to load as HTML, it fixes the markup by adding another pair of <head> and </head> tags around the title.

<head>
<meta name="description" content="randomdesciption">
</head>
<head><title>sometitle</title></head>

I have checked the cURL response data and that is not the problem; the PHP DOMDocument itself repairs the HTML syntax while the loadHTML() method runs.

I have also tried turning off the DOMDocument recover, substituteEntities and validateOnParse properties by setting them to false, without success.

I have been searching Google but have not found any answers so far. I guess it is rare for someone to actually want broken HTML left unfixed.

Does anyone know how to prevent DOMDocument from fixing my broken HTML?

1 Answer

Editorial Team · Answer 1 · 2026-05-28T05:39:36+00:00

UPDATE: as of PHP 5.4 (with libxml 2.7.7 or later) you can use the LIBXML_HTML_NOIMPLIED option, which maps to libxml's HTML_PARSE_NOIMPLIED flag:

$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
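
A slightly fuller sketch of that call, applied to the markup from the question. The helper name loadWithoutImplied() is made up for illustration, and adding LIBXML_HTML_NODEFDTD (which suppresses the default doctype) is an assumption on my part, not something the original answer used:

```php
<?php
// Hypothetical helper (not from the original answer): parse markup without
// letting libxml add implied <html>/<head>/<body> elements.
function loadWithoutImplied(string $html): DOMDocument
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // LIBXML_HTML_NODEFDTD is an extra assumption: it drops the default doctype.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Discard the collected warnings and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$html = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

$dom = loadWithoutImplied($html);
// Print the serialized tree together with the libxml version in use.
echo $dom->saveHTML(), LIBXML_VERSION, PHP_EOL;
```

Where the stray <title> ends up in the tree without implied markup may still vary with the libxml version, so it is worth looking at the printed result before building a check on top of it.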

Original answer below

You can't. In theory, libxml has a flag, HTML_PARSE_NOIMPLIED, that prevents it from adding implied markup, but it's not accessible from PHP.

On a side note, this particular behavior seems to depend on the LIBXML_VERSION in use.

Running this snippet:

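<?php
// Same broken markup as in the question: <title> sits outside <head>.
$html = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->formatOutput = true;
// Print the serialized document followed by the libxml version number.
echo $dom->saveHTML(), LIBXML_VERSION;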

on my machine gives:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta name="description" content="randomdesciption"></head>
<title>sometitle</title>
</html>
20707
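
Since the parsed tree depends on the libxml version, one possible workaround is to check the raw markup before DOMDocument ever sees it. This is only a rough sketch: titleIsOutsideHead() is a made-up helper, and the regular expressions do not account for comments, scripts or other places where a literal <title could appear.

```php
<?php
// Hypothetical helper: inspects the raw markup instead of the parsed DOM,
// so libxml never gets a chance to repair anything.
function titleIsOutsideHead(string $html): bool
{
    // Locate the first <title ...> tag; without a title there is nothing to flag here.
    if (!preg_match('/<title\b/i', $html, $title, PREG_OFFSET_CAPTURE)) {
        return false;
    }
    $titlePos = $title[0][1];

    // Locate the first opening <head ...> tag and the first closing </head> tag.
    $hasOpen  = preg_match('/<head\b/i', $html, $open, PREG_OFFSET_CAPTURE);
    $hasClose = preg_match('/<\/head\s*>/i', $html, $close, PREG_OFFSET_CAPTURE);
    if (!$hasOpen || !$hasClose) {
        // No complete head element exists, so the title cannot be inside one.
        return true;
    }

    // The title is misplaced if it starts before <head> or after </head>.
    return $titlePos < $open[0][1] || $titlePos > $close[0][1];
}

$html = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

var_dump(titleIsOutsideHead($html)); // bool(true)
```

Working on byte offsets keeps the check independent of any parser repair, at the cost of being easy to fool with unusual markup.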
