I’m trying to parse some HTML content with html5lib using the lxml treebuilder. Note:

Asked: June 10, 2026 (2026-06-10T18:15:20+00:00)

I’m trying to parse some HTML content with html5lib using the lxml treebuilder. Note: I’m using the requests library to grab the content and the content is HTML5 (tried with XHTML – same result).

When I simply output the HTML source, it looks alright:

response = requests.get(url)
return response.text

returns

<html xmlns:foo="http://www.example.com/ns/foo">

But when I’m actually parsing it with the html5lib, something odd happens:

tree = html5lib.parse(response.text, treebuilder = 'lxml', namespaceHTMLElements = True)
html = tree.getroot()
return lxml.etree.tostring(html, pretty_print = False)

returns

<html:html xmlns:html="http://www.w3.org/1999/xhtml" xmlnsU0003Afoo="http://www.example.com/ns/foo">

Note the xmlnsU0003Afoo thing.

Also, the html.nsmap dict does not contain the foo namespace, only html.

Does anyone have an idea about what’s going on and how I could fix this?

Later edit:

It seems that this is expected behavior:

If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names […] to a set of names that are allowed, by replacing any character that isn’t supported with the uppercase letter U and the six digits of the character’s Unicode code […]
– Coercing an HTML DOM into an infoset
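To make the quoted rule concrete, here is a minimal sketch that reverses the escaping on a single name. It assumes the escape form seen in the output above, the letter U followed by five hexadecimal digits (as in xmlnsU0003Afoo). The helper name decode_coerced_name is hypothetical and is not part of html5lib or lxml.

```python
import re

# Assumption: escapes look like "U" + five uppercase hex digits, as in
# "xmlnsU0003Afoo" from the output above. Adjust the pattern if yours differ.
_ESCAPE = re.compile(r"U([0-9A-F]{5})")


def decode_coerced_name(name):
    """Replace each escape such as 'U0003A' with the character it encodes."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), name)


print(decode_coerced_name("xmlnsU0003Afoo"))  # xmlns:foo
```

A name that legitimately contains U followed by five uppercase hex digits would be decoded by mistake, so this is only suitable for inspecting output, not as a general round-trip.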


1 Answer

Editorial Team · Answer 1 · 2026-06-10T18:15:22+00:00

A few observations:

HTML5 doesn’t seem to support xmlns attributes. Quoting section 1.6 of the latest HTML5 specification: “…namespaces cannot be represented using the HTML syntax, but they are supported in the DOM and in the XHTML syntax.” I see you tried with XHTML as well, but you’re currently using HTML5, so there could be an issue there. U+003A is the Unicode code point for the colon, so the xmlns:foo attribute is being read, but its colon is escaped instead of being treated as part of a namespace declaration.
There is an open issue about custom-namespace elements, at least for the PHP version of html5lib.
I don’t understand the role of html5lib here. Why not just use lxml directly:

from lxml import etree

# resp_text is the response body as a string, e.g. response.text
tree = etree.fromstring(resp_text)
print(etree.tostring(tree, pretty_print=True).decode())

That seems to do what you want, without html5lib and without the goofy xmlnsU0003Afoo error. With the test HTML I used, I got the right output (follows), and tree.nsmap contained an entry for 'foo'.

<html xmlns:foo="http://www.example.com/ns/foo">
    <head>
        <title>yo</title>
    </head>
    <body>
        <p>test</p>
    </body>
</html>
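As a rough illustration of why an XML parser keeps the foo namespace, the sketch below collects every namespace declaration from a document. It swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra packages. The function name namespace_map and the sample string are made up for the example, and, like etree.fromstring, it only works on well-formed XML.

```python
import io
import xml.etree.ElementTree as ET


def namespace_map(xml_text):
    """Return {prefix: uri} for every xmlns declaration in well-formed XML."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield (prefix, uri) pairs as declarations are parsed.
    events = ET.iterparse(source, events=("start-ns",))
    return {prefix: uri for _event, (prefix, uri) in events}


# Hypothetical sample modeled on the test document above.
sample = (
    '<html xmlns:foo="http://www.example.com/ns/foo">'
    "<head><title>yo</title></head><body><p>test</p></body></html>"
)
print(namespace_map(sample))  # {'foo': 'http://www.example.com/ns/foo'}
```

With lxml itself, the same information is available directly as tree.nsmap on the root element, as noted above.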

Alternatively, if you wish to use pure html5lib, you can just use the included simpletree:

tree = html5lib.parse(resp_text, namespaceHTMLElements=True)
print(tree.toxml())

While this doesn’t muck up the xmlns attribute, simpletree unfortunately lacks the more powerful ElementTree functions like xpath().
