Suppose I have an HTML table with the following rows:
...
<tr>
<th title="Library of Quintessential Memes">LQM:</th>
<td>
<a href="docs/lqm.html"><b>Intro</b></a>
<a href="P/P79/">79</a>
<a href="P/P80/">80</a>
<a href="P/P81/">81</a>
<a href="P/P82/">82</a>
</td>
</tr>
<tr>
<th title="Library of Boring Books">LBB:</th>
<td>
<a href="docs/lbb.html"><b>Intro</b></a>
<a href="R/R80/">80</a>
<a href="R/R81/">81</a>
<a href="R/R82/">82</a>
<a href="R/R83/">83</a>
<a href="R/R84/">84</a>
</td>
</tr>
...
I would like to select all <a> elements in a <td> element whose associated <th>'s text is one of a small set of fixed titles (e.g. LQM, LBR, and RTT). How can I formulate this as an XPath query?
EDIT: I am using Scrapy, a Python scraping toolkit, so if it is easier to phrase this query as a set of smaller queries, I would be more than happy to use that. For example, if I could select all <tr> elements whose first <th> child matches a regex, then select all <a> descendants of the remaining <tr> elements, that would be splendid.
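The two-step idea described in the EDIT can be sketched roughly as follows. This is only an illustration using Python's standard library rather than Scrapy itself; it assumes the rows are well-formed XML (real-world HTML may not be), and the names SAMPLE, WANTED, and links_for_wanted_rows are hypothetical. The sample is shortened from the rows above and wrapped in a <table> element.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the sample rows, wrapped in <table> so it parses as XML.
SAMPLE = """
<table>
<tr>
<th title="Library of Quintessential Memes">LQM:</th>
<td>
<a href="docs/lqm.html"><b>Intro</b></a>
<a href="P/P79/">79</a>
<a href="P/P80/">80</a>
</td>
</tr>
<tr>
<th title="Library of Boring Books">LBB:</th>
<td>
<a href="docs/lbb.html"><b>Intro</b></a>
<a href="R/R80/">80</a>
</td>
</tr>
</table>
"""

# Assumption: labels carry a trailing colon, as in the sample rows.
WANTED = re.compile(r"^(LQM|LBR|RTT):$")


def links_for_wanted_rows(markup):
    """Return the href of every <a> in rows whose first <th> matches WANTED."""
    root = ET.fromstring(markup)
    hrefs = []
    for tr in root.iter("tr"):
        th = tr.find("th")  # step 1: first <th> child of the row
        if th is None or not WANTED.match((th.text or "").strip()):
            continue
        for a in tr.iter("a"):  # step 2: all <a> descendants of the row
            hrefs.append(a.get("href"))
    return hrefs


print(links_for_wanted_rows(SAMPLE))
```

Only the LQM row matches here, since LBB is not in the wanted set. In Scrapy, the same filter-then-descend structure would presumably be written with its own selector objects instead of ElementTree.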
The following XPath will work:
In theory, this can produce some false positives (if your codes contain commas).
A stricter way to say it would be:
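As a rough illustration of strict, exact matching (not necessarily the expression referred to above), the sketch below compares each <th> against the full label with an equality predicate. It uses the limited XPath subset of Python's standard library, which has no `or` operator, so it runs one query per title. The ROWS data, including the comma-containing label, and the names TITLES and strict_links are hypothetical, and the labels are assumed to include the trailing colon.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows; the second label contains a comma on purpose.
ROWS = """
<table>
<tr><th>LQM:</th><td><a href="P/P79/">79</a></td></tr>
<tr><th>LQM:,LBR:</th><td><a href="X/X1/">1</a></td></tr>
<tr><th>LBB:</th><td><a href="R/R80/">80</a></td></tr>
</table>
"""

# Assumption: the wanted labels, written exactly as they appear in <th>.
TITLES = ["LQM:", "LBR:", "RTT:"]


def strict_links(markup, titles):
    """Return hrefs of <a> elements in rows whose <th> text equals a title."""
    root = ET.fromstring(markup)
    hrefs = []
    for title in titles:
        # [th='...'] keeps rows with a <th> child whose complete text
        # equals the title, so a partial or comma-joined label is rejected.
        for a in root.findall(f".//tr[th='{title}']/td/a"):
            hrefs.append(a.get("href"))
    return hrefs


print(strict_links(ROWS, TITLES))
```

Because the comparison is on the whole text, the row labelled "LQM:,LBR:" is skipped. Note that results come back grouped by title rather than in document order.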
I tested this by adding a <table> tag around your input and applying the following XSL transform:
It produces the following output:
Of course, if you are using XSL, then you might find this construction more readable: