There’s probably a better way to do this than what I’m doing, because
I’m stuck in a metaphorical pothole.
I want to get some of the nodes beneath a particular node. I came up
with this XPath expression:
>>> content_tags = 'h1 h2 h3 h4 h5 h6 p ol ul dl table'.split()
>>> content_xpath = './/*[%s]' % ' or '.join('self::%s' % i for i in content_tags)
>>> content_xpath
'.//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or
self::h6 or self::p or self::ol or self::ul or self::dl or
self::table]'
Any of the listed content_tags can be the top of the hierarchy I’m
after, and I want to ignore other elements that may be at the same
or higher levels. Unfortunately, sometimes there’s a <p> inside a
<ul> or a <table>, or a <table> inside an <ol>, etc., and I get the
inner element as a separate result along with the outer. Is there a good way to
perform a “cut” to ignore nodes that may be nested inside one that
I’ve found? Or is there some better way of doing this that I’m
somehow missing?
Here’s an example of what I’m trying to parse.
<div class="interesting">
<img src="ignore-this.jpg"/>
<h1>I want this.</h1>
<p>I want this, too.</p>
<div class="sidebar">
<ul>
<li><p>I only want one copy of this, inside the UL.</p></li>
<li><p>Ditto.</p></li>
</ul>
</div>
</div>
Thanks!
BTW, I found a few posts on a w3.org mailing list that advocated a
“dont-include-any-descendant-or-self” filter, which I think would do
exactly what I want, but it doesn’t seem to have made it into the
final spec. 🙁
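For what it’s worth, a “cut” like that can be approximated outside XPath by walking the tree and refusing to descend into anything that already matched. The following is only a minimal sketch, assuming the standard library’s xml.etree.ElementTree and the sample markup above; the names CONTENT_TAGS, SAMPLE, and outermost_content are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same tag list as in the question; the set is only for fast membership tests.
CONTENT_TAGS = set('h1 h2 h3 h4 h5 h6 p ol ul dl table'.split())

# The sample markup from the question, unchanged.
SAMPLE = """
<div class="interesting">
<img src="ignore-this.jpg"/>
<h1>I want this.</h1>
<p>I want this, too.</p>
<div class="sidebar">
<ul>
<li><p>I only want one copy of this, inside the UL.</p></li>
<li><p>Ditto.</p></li>
</ul>
</div>
</div>
""".strip()


def outermost_content(node):
    """Yield content elements beneath node, skipping anything nested in a match."""
    for child in node:
        if child.tag in CONTENT_TAGS:
            # The "cut": report the match and do not look inside it.
            yield child
        else:
            # Not a content element (img, div, ...), so keep descending.
            yield from outermost_content(child)


root = ET.fromstring(SAMPLE)
found = list(outermost_content(root))
print([el.tag for el in found])  # ['h1', 'p', 'ul']
```

The nested <p> elements are still reachable through the returned <ul>, so nothing is lost; they just are not reported a second time. With a fuller XPath 1.0 engine, a similar effect is usually approximated by adding a not(ancestor::...) predicate to the original expression, with the caveat that it also considers ancestors above the node the search starts from.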
Searching as with //p is explicitly recursive; if that’s not what you want, don’t do that! 🙂
If you only want a p that’s directly under an interesting div, but that div can be anywhere in your hierarchy, this would be expressed as such:
…if you only want a p that’s directly under the location in your tree the search is relative to, that’s even simpler:
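As a rough illustration of the two cases just described, here is a small sketch using the standard library’s xml.etree.ElementTree. The path strings are my assumption of the kind of expression meant, not a quotation, and the <body> wrapper and the variable names are hypothetical, added only so the div has a parent to search from.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample markup, wrapped in a made-up <body> element.
doc = ET.fromstring("""<body>
<div class="interesting">
<h1>I want this.</h1>
<p>I want this, too.</p>
<div class="sidebar">
<ul>
<li><p>I only want one copy of this, inside the UL.</p></li>
</ul>
</div>
</div>
</body>""")

# Case 1: a p directly under an "interesting" div, wherever that div sits.
# Only the search for the div is recursive; the final /p step is child-only.
anywhere = doc.findall(".//div[@class='interesting']/p")

# Case 2: a p directly under the node the search is relative to.
interesting = doc.find("div")
direct = interesting.findall("p")

# For contrast, the recursive form also picks up the p nested inside the ul.
recursive = interesting.findall(".//p")

print([p.text for p in anywhere])
print([p.text for p in direct])
print(len(recursive))
```

Both non-recursive searches return only the top-level paragraph, while the recursive one also returns the paragraph inside the list item, which is the duplication described in the question.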