Using Scrapy I’d like to parse a webpage containing a very unsemantic table. What

Asked: June 16, 2026, 11:48:01 UTC

Using Scrapy I’d like to parse a webpage containing a very unsemantic table. What I’m looking for is a “print every following-sibling until you meet the following element”-XPath-query.

<table>
    <tr>
        <th>Title</th>
        <th>Name</th>
        <th>Comment</th>
        <th>Note</th>
    </tr>
    <tr style="background-color:#CCDDEF;">
        <td colspan="4"> <b>HEADER1</b></td>
    </tr>
    <tr>
        <td>Title1.1</td>
        <td>-</td>
        <td>Info1.1</td>
        <td></td>
    </tr>
    <tr style="background-color:#CCDDEF;">
        <td colspan="4"> <b>HEADER2</b></td>
    </tr>
    <tr>
        <td>Title2.1</td>
        <td>Name2.1</td>
        <td></td>
        <td></td>
    </tr>
    <tr>
        <td>Title2.2</td>
        <td>Name2.2</td>
        <td>Info2.2</td>
        <td></td>
    </tr>
    <tr style="background-color:#CCDDEF;">
        <td colspan="4"> <b>HEADER3</b></td>
    </tr>
    <tr>
        <td>Title3.1</td>
        <td>Name3.1</td>
        <td></td>
        <td></td>
    </tr>
</table>

I’d like to group every Title, Name, Comment and Note under its header. I have tried various XPaths (variations of following-sibling, preceding-sibling and count), but I get either nothing, everything, or every tr that is not a header.

I’m currently getting the headers with //tr[@style] or //tr[td[@colspan="4"]].

The following is the parse function in my Scrapy spider (for each header, it prints the header text and then all of the tr’s that are not headers, not just the ones belonging to that header):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//*[@id="content-text"]//tr[td[@colspan="4"]]')
    for site in sites:
        print site.select('./td/b/text()').extract()
        print site.select('./following-sibling::tr[not(td[@colspan])]')

1 Answer

Editorial Team · Answer 1 · 2026-06-16T11:48:02+00:00

This XPath expression:

/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
       [count(. | /*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
       =
        count(/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
       ]

selects all tr elements that are between the 1st and 2nd headers:

<tr>
   <td>Title1.1</td>
   <td>-</td>
   <td>Info1.1</td>
   <td/>
</tr>

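To select all tr elements between the Kth and (K+1)th headers, replace 1 with K and 2 with K+1 in the expression above. For the last header there is no (K+1)th header, so the preceding-sibling node-set is empty and the expression selects nothing; the rows belonging to the last header are simply all of its following-sibling::tr elements.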
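The K substitution can be sketched in Python. This is a minimal sketch, assuming lxml is available (Scrapy selectors are built on it) and that the table is the document root, as in the XML sample; the names HEADER, rows_between and group_rows are hypothetical helpers introduced for illustration, not part of Scrapy or lxml.

```python
from lxml import etree

# Header rows, exactly as selected in the expression above.
HEADER = "/*/tr[@style or td[@colspan='4']]"

def rows_between(k):
    """Build the XPath for the rows between the k-th and (k+1)-th header."""
    nxt = f"{HEADER}[{k + 1}]/preceding-sibling::tr"
    return (f"{HEADER}[{k}]/following-sibling::tr"
            f"[count(. | {nxt}) = count({nxt})]")

def group_rows(root):
    """Map each header text to the cell texts of the rows under it."""
    n = int(root.xpath(f"count({HEADER})"))
    groups = {}
    for k in range(1, n + 1):
        title = root.xpath(f"string({HEADER}[{k}]/td/b)").strip()
        if k < n:
            rows = root.xpath(rows_between(k))
        else:
            # No (k+1)-th header exists: take every row after the last header.
            rows = root.xpath(f"{HEADER}[{k}]/following-sibling::tr")
        groups[title] = [[(td.text or "").strip() for td in row.findall("td")]
                         for row in rows]
    return groups

SAMPLE = """<table>
  <tr><th>Title</th><th>Name</th><th>Comment</th><th>Note</th></tr>
  <tr style="background-color:#CCDDEF;"><td colspan="4"> <b>HEADER1</b></td></tr>
  <tr><td>Title1.1</td><td>-</td><td>Info1.1</td><td></td></tr>
  <tr style="background-color:#CCDDEF;"><td colspan="4"> <b>HEADER2</b></td></tr>
  <tr><td>Title2.1</td><td>Name2.1</td><td></td><td></td></tr>
  <tr><td>Title2.2</td><td>Name2.2</td><td>Info2.2</td><td></td></tr>
  <tr style="background-color:#CCDDEF;"><td colspan="4"> <b>HEADER3</b></td></tr>
  <tr><td>Title3.1</td><td>Name3.1</td><td></td><td></td></tr>
</table>"""

root = etree.fromstring(SAMPLE)
groups = group_rows(root)
for title, rows in groups.items():
    print(title, rows)
```

On the sample table this should give one row under HEADER1, two under HEADER2 and one under HEADER3. On a real page the table is usually not the document root, so the leading /* would need to be replaced with a path to the table, such as the //*[@id="content-text"] prefix from the question.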

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
     <xsl:copy-of select=
     "/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr
             [count(. | /*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
             =
              count(/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr)
             ]
     "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<table>
    <tr>
        <th>Title</th>
        <th>Name</th>
        <th>Comment</th>
        <th>Note</th>
    </tr>
    <tr style="background-color:#CCDDEF;">
        <td colspan="4">
            <b>HEADER1</b>
        </td>
    </tr>
    <tr>
        <td>Title1.1</td>
        <td>-</td>
        <td>Info1.1</td>
        <td></td>
    </tr>
    <tr style="background-color:#CCDDEF;">
        <td colspan="4">
            <b>HEADER2</b>
        </td>
    </tr>
    <tr>
        <td>Title2.1</td>
        <td>Name2.1</td>
        <td></td>
        <td></td>
    </tr>
    <tr>
        <td>Title2.2</td>
        <td>Name2.2</td>
        <td>Info2.2</td>
        <td></td>
    </tr>
    <tr style="background-color:#CCDDEF;">
        <td colspan="4">
            <b>HEADER3</b>
        </td>
    </tr>
    <tr>
        <td>Title3.1</td>
        <td>Name3.1</td>
        <td></td>
        <td></td>
    </tr>
</table>

the XPath expression is evaluated and the selected nodes are copied to the output:

<tr>
   <td>Title1.1</td>
   <td>-</td>
   <td>Info1.1</td>
   <td/>
</tr>

Explanation:

This is a simple application of the Kayessian (after Dr. Michael Kay) formula for node-set intersection:

$ns1[count(.|$ns2) = count($ns2)]

In this particular case we substitute $ns1 with:

/*/tr[@style or td[@colspan='4']][1]/following-sibling::tr

and we substitute $ns2 with:

/*/tr[@style or td[@colspan='4']][2]/preceding-sibling::tr
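Why the formula yields an intersection can be illustrated outside XPath. The following is only an analogy, assuming Python sets stand in for XPath node-sets and row numbers stand in for tr nodes; the function name intersect is hypothetical. A node belongs to $ns2 exactly when forming its union with $ns2 does not increase the count.

```python
def intersect(ns1, ns2):
    """Python analogue of $ns1[count(.|$ns2) = count($ns2)]."""
    ns2 = set(ns2)
    # A node is in ns2 exactly when adding it to ns2 does not grow the set.
    return [node for node in ns1 if len({node} | ns2) == len(ns2)]

# Row positions in the sample table: 1 is the th row, 2 is HEADER1,
# 3 is Title1.1, 4 is HEADER2, and so on up to 8 (Title3.1).
after_header1 = [3, 4, 5, 6, 7, 8]   # following siblings of HEADER1
before_header2 = [1, 2, 3]           # preceding siblings of HEADER2

result = intersect(after_header1, before_header2)
print(result)  # [3]
```

Only row 3 (Title1.1) lies both after the first header and before the second, which matches the single tr selected by the XPath expression.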
