Goal: Extract text from a particular element (e.g. li), while ignoring the various mixed

Question

Asked: June 4, 2026

Goal: Extract text from a particular element (e.g. li) while ignoring the various mixed-in tags, i.e. flatten each first-level child and return the concatenated text of each flattened child separately.

Example:

<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
    <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
    </ol>
</div>

Desired text:

Central Intelligence Agency
Culinary Institute of America

However, the anchor tags mixed into the text prevent a simple retrieval.

To return each li element separately, the straightforward expression is:

//div[contains(@id,"mw-content-text")]/ol/li

but that also includes the nested anchor tags, etc. And

//div[contains(@id,"mw-content-text")]/ol/li/text()

returns only the text nodes that are direct children of li, i.e. 'Central ', '.', …

It seemed logical then to look for text nodes of self and descendants:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

but that returns nothing at all!

Any suggestions? I’m using Python, so I’m open to using other modules for post-processing.

(I am using the Scrapy HtmlXPathSelector, which seems to be XPath 1.0 compliant.)

1 Answer

Editorial Team · Answer 1 · 2026-06-04T03:16:07+00:00

You were almost there. There is a small problem in:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

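Without parentheses, text is a name test that looks for elements named text, not for text nodes, so the predicate never matches. The corrected expression uses the text() node test: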

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]
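The difference between the two predicates can be checked with a minimal sketch. It assumes the lxml library (not mentioned in the question) and a cleaned-up copy of the sample markup in a hypothetical SAMPLE string; the variable names are illustrative only.

```python
from lxml import html

# Cleaned-up copy of the sample markup from the question (assumed input).
SAMPLE = """
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>
"""

tree = html.document_fromstring(SAMPLE)
base = '//div[contains(@id,"mw-content-text")]/ol/li'

# "text" without parentheses is a name test: it looks for <text> elements.
by_name = tree.xpath(base + "[descendant-or-self::text]")

# "text()" is a node test: it matches text nodes, so both li elements qualify.
by_node = tree.xpath(base + "[descendant-or-self::text()]")

print(len(by_name), len(by_node))
print([el.tag for el in by_node])
```

Note that the corrected predicate only filters: the result is still a list of li elements, not their text.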

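However, that expression still selects the li elements themselves. A simpler expression produces exactly the wanted concatenation of all text nodes under a given li:

string(//div[contains(@id,"mw-content-text")]/ol/li)

Note that in XPath 1.0, string() applied to a node-set returns the string value of only the first node in document order. To get each li separately, select the li elements first and then evaluate string(.) relative to each one.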
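A minimal sketch of evaluating string() once per list item, assuming the lxml library (not mentioned in the question) and a cleaned-up copy of the sample markup in a hypothetical SAMPLE string. It also shows what string() yields when handed the whole node-set at once.

```python
from lxml import html

# Cleaned-up copy of the sample markup from the question (assumed input).
SAMPLE = """
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>
"""

tree = html.document_fromstring(SAMPLE)
li_path = '//div[contains(@id,"mw-content-text")]/ol/li'

# string() over the whole node-set: only the first li contributes.
first_only = tree.xpath("string(" + li_path + ")")

# string(.) relative to each li: one flattened string per list item.
per_item = [li.xpath("string(.)") for li in tree.xpath(li_path)]

print(first_only)
for text in per_item:
    print(text)
```

The trailing period belongs to each li's text, so it appears in the result; something like text.rstrip(".") in post-processing would drop it if the desired output should not include it.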
