Using Scrapy I’d like to parse a webpage containing a very unsemantic table. What I’m looking for is a “print every following-sibling until you meet the following element”-XPath-query.
<table>
<tr>
<th>Title</th>
<th>Name</th>
<th>Comment</th>
<th>Note</th>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER1</b></td>
</tr>
<tr>
<td>Title1.1</td>
<td>-</td>
<td>Info1.1</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER2</b></td>
</tr>
<tr>
<td>Title2.1</td>
<td>Name2.1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Title2.2</td>
<td>Name2.2</td>
<td>Info2.2</td>
<td></td>
</tr>
<tr style="background-color:#CCDDEF;">
<td colspan="4"> <b>HEADER3</b></td>
</tr>
<tr>
<td>Title3.1</td>
<td>Name3.1</td>
<td></td>
<td></td>
</tr>
</table>
I’d like to group every Title, Name, Comment and Note under each header. I have tried with various XPaths (with variations of following-sibling, preceding-sibling and count) but I either get nothing, everything or every tr which is not a header.
I’m currently getting the headers with //tr[@style] or //tr[td[@colspan="4"]].
The following is the parse function in my Scrapy spider (which prints each header and all of the tr's that are not headers):
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="content-text"]//tr[td[@colspan="4"]]')
    for site in sites:
        print site.select('./td/b/text()').extract()
        print site.select('./following-sibling::tr[not(td[@colspan])]')
This XPath expression:
selects all tr elements that are between the 1st and 2nd headers.
To select all tr elements that are between the Kth and (K+1)th headers, simply replace 1 with K and 2 with K+1 in the above expression.
XSLT-based verification:
When this transformation is applied to the provided XML document:
the XPath expression is evaluated and the selected nodes are copied to the output:
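The "replace 1 with K and 2 with K+1" step can be sketched as a small Python helper that builds the expression for any K. This is a reconstruction under assumptions, not the expression as originally posted: it assumes the table is the document root (hence /*/tr), that header rows are exactly the tr elements containing a td with a colspan attribute, and the helper name rows_between_headers_xpath is made up for illustration.

```python
def rows_between_headers_xpath(k):
    """Build an XPath selecting the tr rows between the k-th and (k+1)-th headers.

    Assumptions (not from the original post): the table is the root element,
    and a header row is any tr that has a td carrying a colspan attribute.
    """
    if k < 1:
        raise ValueError("k is 1-based, as in XPath positions")
    header = "/*/tr[td[@colspan]]"
    # Rows that come after the k-th header ...
    after_k = "%s[%d]/following-sibling::tr" % (header, k)
    # ... and rows that come before the (k+1)-th header.
    before_next = "%s[%d]/preceding-sibling::tr" % (header, k + 1)
    # Keep only the rows that appear in both node-sets.
    return "%s[count(. | %s) = count(%s)]" % (after_k, before_next, before_next)


print(rows_between_headers_xpath(1))
```

In a spider the resulting string would be passed to the selector's XPath method, with the /*/tr prefix adapted to wherever the table actually sits in the page. Note that for the last header there is no (K+1)th header, so the expression selects nothing; the rows after the last header need a plain following-sibling::tr query instead.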
Explanation:
This is a simple application of the Kayessian formula (named after Dr. Michael Kay) for node-set intersection: $ns1[count(.|$ns2) = count($ns2)]
In this particular case we substitute $ns1 with:
and we substitute $ns2 with:
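To see why the intersection works, here is a rough Python analogy using plain sets in place of XPath node-sets. A node is a member of $ns2 exactly when adding it to $ns2 does not change the count, which is what count(.|$ns2) = count($ns2) tests. The row labels stand in for the tr elements of the sample table, and the names intersect and rows_between are hypothetical. It also assumes $ns1 is "rows following the Kth header" and $ns2 is "rows preceding the (K+1)th header", which is one reading of the substitution described above.

```python
def intersect(ns1, ns2):
    """Mimic $ns1[count(.|$ns2) = count($ns2)] with Python sets."""
    ns2 = set(ns2)
    # Adding a node to ns2 leaves its size unchanged only if the node is already in ns2.
    return [node for node in ns1 if len({node} | ns2) == len(ns2)]


# Stand-ins for the tr elements of the sample table, in document order.
ROWS = ["HEADER1", "Title1.1", "HEADER2", "Title2.1", "Title2.2", "HEADER3", "Title3.1"]


def rows_between(rows, k):
    """Rows between the k-th and (k+1)-th header (k is 1-based)."""
    headers = [i for i, row in enumerate(rows) if row.startswith("HEADER")]
    following = rows[headers[k - 1] + 1:]                      # plays the role of $ns1
    preceding = rows[:headers[k]] if k < len(headers) else []  # plays the role of $ns2
    return intersect(following, preceding)


print(rows_between(ROWS, 2))
```

As with the XPath version, asking for the rows after the last header yields an empty result, because the "preceding the next header" set is empty and nothing can intersect with it.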