I’m trying to parse some HTML content with html5lib using the lxml treebuilder. Note: I’m using the requests library to grab the content and the content is HTML5 (tried with XHTML – same result).
When I simply output the HTML source, it looks alright:
response = requests.get(url)
return response.text
returns
<html xmlns:foo="http://www.example.com/ns/foo">
But when I’m actually parsing it with the html5lib, something odd happens:
tree = html5lib.parse(response.text, treebuilder='lxml', namespaceHTMLElements=True)
html = tree.getroot()
return lxml.etree.tostring(html, pretty_print=False)
returns
<html:html xmlns:html="http://www.w3.org/1999/xhtml" xmlnsU0003Afoo="http://www.example.com/ns/foo">
Note the xmlnsU0003Afoo thing.
Also, the html.nsmap dict does not contain the foo namespace, only html.
Does anyone have an idea about what’s going on and how I could fix this?
Later edit:
It seems that this is expected behavior:
If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names […] to a set of names that are allowed, by replacing any character that isn’t supported with the uppercase letter U and the six digits of the character’s Unicode code […]
– Coercing an HTML DOM into an infoset
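To make the quoted rule concrete, here is a minimal sketch of that kind of name coercion. It is not html5lib's actual code: the function name `coerce_name` and the set of "allowed" characters are simplifying assumptions, and the five-digit hex width was chosen only because it matches the `xmlnsU0003Afoo` output shown above.

```python
import re

# Assumption: a deliberately simplified set of characters allowed in an
# XML local name. The real rules are broader than this.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_.\-]")


def coerce_name(name: str) -> str:
    """Replace each disallowed character with 'U' plus its hex code point."""
    return _DISALLOWED.sub(lambda m: "U%05X" % ord(m.group(0)), name)


# The colon (U+003A) is not allowed in a local name, so it gets escaped.
coerced = coerce_name("xmlns:foo")
print(coerced)  # xmlnsU0003Afoo
```

Under these assumptions, the colon in `xmlns:foo` is treated as an illegal character in a plain attribute name rather than as a namespace separator, which is why the attribute survives but never reaches `nsmap`.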
A few observations:
HTML5 doesn’t seem to support xmlns attributes. Quoting section 1.6 of the latest HTML5 specification: “…namespaces cannot be represented using the HTML syntax, but they are supported in the DOM and in the XHTML syntax.” I see you tried with XHTML as well, but you’re currently using HTML5, so there could be an issue there.
U+003A is the Unicode code point for the colon, so somehow the xmlns is being noted but flubbed.
There is an open issue with custom namespace elements for at least the PHP version.
I don’t understand the role of html5lib here. Why not just use lxml directly? That seems to do what you want, without html5lib and without the goofy xmlnsU0003Afoo error. With the test HTML I used, I got the right output, and tree.nsmap contained an entry for 'foo'.
Alternatively, if you wish to use pure html5lib, you can just use the included simpletree. While this doesn’t muck up the xmlns attribute, simpletree unfortunately lacks the more powerful ElementTree functions like xpath().
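For reference, the "use lxml directly" suggestion could look roughly like the sketch below. The sample markup, the `foo:bar` element, and the variable names are made up for illustration. It also assumes the document is well-formed XML (that is, XHTML), since `etree.fromstring` is a strict XML parser and will reject tag soup.

```python
from lxml import etree

FOO_NS = "http://www.example.com/ns/foo"

# Hypothetical sample input; in the question this would be response.text.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:foo="http://www.example.com/ns/foo">'
    "<body><foo:bar>hi</foo:bar></body></html>"
)

# Parse as XML so namespace declarations are handled by the parser itself.
root = etree.fromstring(SAMPLE)

# The prefix mapping is available on the root element.
print(root.nsmap)

# Serialising keeps the xmlns:foo declaration intact.
serialized = etree.tostring(root, pretty_print=False).decode()
print(serialized)

# XPath with an explicit prefix mapping works as usual.
texts = root.xpath("//foo:bar/text()", namespaces={"foo": FOO_NS})
print(texts)
```

If the input is not well-formed XML, this will raise a parse error, and that case is where an HTML5 parser such as html5lib becomes useful in the first place.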