I’m interested in using pugixml to parse HTML documents, but HTML has some optional


Asked: June 2, 2026 (2026-06-02T07:03:13+00:00)

I’m interested in using pugixml to parse HTML documents, but HTML has some optional closing tags. Here is an example: <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

Pugixml stops reading the HTML as soon as it encounters a tag that is not closed, but in HTML a missing closing tag does not necessarily mean that the start and end tags are mismatched.

A simple test of parsing the HTML documentation of pugixml fails because the meta tag is the second line of the HTML document: http://pugixml.googlecode.com/svn/tags/latest/docs/quickstart.html

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<title>pugixml 1.0</title>
<link rel="stylesheet" href="pugixml.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
<link rel="home" href="quickstart.html" title="pugixml 1.0">
</head>
<!--- etc... -->

A lot of HTML documents in the wild would fail if I tried to parse them with pugixml. Is there a way to avoid that? If there is no way to “fix” that, is there another HTML parsing tool that is about as fast as pugixml?

Update

It would also be great if the HTML parser supported XPath.


1 Answer

Editorial Team · Answer 1 · 2026-06-02T07:03:14+00:00


I ended up converting pugixml into an HTML parser and created a GitHub project for it: https://github.com/rofldev/pugihtml

For now it’s not fully compliant with the HTML specification, but it parses HTML well enough that I can use it. I’m working on making it compliant.
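To illustrate the kind of problem involved, here is a minimal sketch of one possible workaround. It is not taken from pugihtml or from pugixml. It assumes the only obstacle is HTML void elements such as meta and link, and it rewrites them as self-closed tags before the text is handed to a strict XML parser. The names is_void_element and self_close_void_tags are hypothetical helpers made up for this example. The sketch does not treat comments, script, or style content specially, and it does nothing about other HTML-only constructs such as unquoted attribute values or omitted </p> and </li> tags.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper (not part of pugixml or pugihtml): returns true if
// `name` is an HTML void element, i.e. one that never has a closing tag.
bool is_void_element(std::string name) {
    static const std::set<std::string> kVoid = {
        "area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"};
    std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return kVoid.count(name) != 0;
}

// Hypothetical pre-processing pass: rewrites <meta ...> as <meta ... />
// so that a strict XML parser sees a self-closed element.
std::string self_close_void_tags(const std::string& html) {
    std::string out;
    out.reserve(html.size() + 16);
    std::size_t i = 0;
    while (i < html.size()) {
        // Anything that is not the start of an opening tag is copied as-is.
        if (html[i] != '<' || i + 1 >= html.size() ||
            !std::isalpha(static_cast<unsigned char>(html[i + 1]))) {
            out += html[i++];
            continue;
        }
        // Read the tag name that follows '<'.
        std::size_t j = i + 1;
        while (j < html.size() &&
               std::isalnum(static_cast<unsigned char>(html[j]))) {
            ++j;
        }
        const std::string name = html.substr(i + 1, j - i - 1);
        // Find the closing '>', skipping over quoted attribute values.
        char quote = '\0';
        while (j < html.size()) {
            const char c = html[j];
            if (quote != '\0') {
                if (c == quote) quote = '\0';
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                break;
            }
            ++j;
        }
        if (j == html.size()) {
            // Unterminated tag: copy the remainder unchanged and stop.
            out.append(html, i, std::string::npos);
            break;
        }
        out.append(html, i, j - i);  // the tag text, without the final '>'
        if (is_void_element(name) && html[j - 1] != '/') out += " /";
        out += '>';
        i = j + 1;
    }
    return out;
}
```

The result of self_close_void_tags could then be passed to whichever XML parser is in use. Tags that are already self-closed, closing tags, and non-void elements are left untouched, so the pass is safe to run more than once.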
