
Question


Asked: May 10, 2026


I am testing against the following test document:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN'
                      'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
    <head>
        <title>hi there</title>
    </head>
    <body>
        <img class='foo' src='bar.png'/>
    </body>
</html>

If I parse the document using lxml.html, I can get the IMG with an xpath just fine:

>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//img')
[<Element img at 1879e30>]

However, if I parse the document as XML and try to get the IMG tag, I get an empty result:

>>> tree = etree.parse(StringIO(doc))
>>> tree.getroot().xpath('//img')
[]

I can navigate to the element directly:

>>> tree.getroot().getchildren()[1].getchildren()[0]
<Element {http://www.w3.org/1999/xhtml}img at f56810>

But of course that doesn’t help me process arbitrary documents. I would also expect to be able to ask etree for an xpath expression that directly identifies this element, which, technically, I can do:

>>> tree.getpath(tree.getroot().getchildren()[1].getchildren()[0])
'/*/*[2]/*'
>>> tree.getroot().xpath('/*/*[2]/*')
[<Element {http://www.w3.org/1999/xhtml}img at fa1750>]

But that xpath is, again, obviously not useful for parsing arbitrary documents.

Obviously I am missing some key issue here, but I don’t know what it is. My best guess is that it has something to do with namespaces, but the only namespace defined is the default one, and I don’t know what else I would need to consider with regard to namespaces.

So, what am I missing?


1 Answer

score 0 · Answer 1 · 2026-05-10T21:23:01+00:00

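The problem is the namespaces. When the document is parsed as XML, the img element is in the http://www.w3.org/1999/xhtml namespace, because the root html element declares that as the default namespace and its descendants inherit it. XPath 1.0 has no notion of a default namespace, so an unprefixed name such as img only matches elements in no namespace, and '//img' therefore finds nothing. (lxml.html parses the document as HTML, where the elements end up without a namespace, which is why the same query works there.)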
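To see this for yourself, here is a minimal sketch that inspects the element the question reached by position. It assumes Python 3 with lxml installed; the name DOC is just an illustrative variable, and the document is passed as bytes because lxml refuses str input that carries an encoding declaration.

```python
from lxml import etree

# The test document from the question, as bytes.
DOC = b"""<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN'
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
    <head>
        <title>hi there</title>
    </head>
    <body>
        <img class='foo' src='bar.png'/>
    </body>
</html>"""

root = etree.fromstring(DOC)

# html -> body -> img, reached by position as in the question.
img = root[1][0]
qname = etree.QName(img)

# The default namespace is stored under the key None.
print(root.nsmap)

# The element's real name is the pair (namespace, local name).
print(qname.namespace, qname.localname)

# An unprefixed XPath name test only matches elements in no namespace.
print(root.xpath('//img'))
```

The last line prints an empty list: the element's full name is {http://www.w3.org/1999/xhtml}img, not plain img.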

Try this:

>>> tree.getroot().xpath(
...     '//xhtml:img',
...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
...     )
[<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
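For arbitrary documents, it may help to wrap the lookup in a small helper. The sketch below shows three possible variants, again assuming Python 3 with lxml installed. The helper names (find_with_prefix, find_any_namespace, find_with_clark), the constant XHTML_NS and the cut-down DOC are made up for illustration and are not part of lxml. The prefix you pass in the namespaces mapping is arbitrary; only the URI has to match the one in the document.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Cut-down version of the question's document (DOCTYPE omitted for brevity).
DOC = b"""<html xmlns='http://www.w3.org/1999/xhtml'>
    <head><title>hi there</title></head>
    <body><img class='foo' src='bar.png'/></body>
</html>"""


def find_with_prefix(root, local_name):
    """Match elements in the XHTML namespace via an arbitrary prefix."""
    return root.xpath("//x:" + local_name, namespaces={"x": XHTML_NS})


def find_any_namespace(root, local_name):
    """Match on the local name only, whatever namespace the element is in."""
    # $name is an XPath variable, bound through the keyword argument.
    return root.xpath("//*[local-name() = $name]", name=local_name)


def find_with_clark(root, local_name):
    """Use findall() with the {namespace}tag notation instead of XPath."""
    return root.findall(".//{%s}%s" % (XHTML_NS, local_name))


root = etree.fromstring(DOC)
for finder in (find_with_prefix, find_any_namespace, find_with_clark):
    print(finder.__name__, [el.get("src") for el in finder(root, "img")])
```

All three should report the single bar.png image here. The local-name() variant is the loosest: it ignores namespaces entirely, which is convenient for documents of unknown origin but will also match same-named elements from unrelated vocabularies.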
