I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')


Asked: June 4, 2026 (2026-06-04T07:22:23+00:00)


I am trying to parse an HTML table using lxml. rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches every value, but I want to extract a column's contents only when the column starts with one of the variables in my config file. For instance, if a <td> starts with 'Street 1', I want to grab the <span> contents of that <td> tag. That way I can build a tuple of tuples (which also takes care of the None values) that I can then store in the database.

lxml_parse.py

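import lxml.html as lh

# Binary mode lets lxml detect the document encoding itself.
with open('test.htm', 'rb') as doc:
    outhtml = lh.parse(doc)

# Collect the text of every <span class="boldred"> inside a table cell.
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)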

test.htm

<tr>

    <td></td>

    <td colspan="2">

        Street 1:<span class="required"> *</span><br />

        <span class="boldred">2100 5th Ave</span>

    </td>

    <td colspan="2">

        Street 2:<br />

        <span class="boldred">Ste 202</span>

    </td>

</tr>

<tr>

    <td></td>

    <td>

        City:<span class="required"> *</span><br />

        <span class="boldred">NYC</span>

    </td>

    <td>

        State:<br />

        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>

    </td>

    <td>

        Country:<span class="required"> *</span><br />

        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>

    </td>

    <td>

        Zip:<br />

        <span class="boldred">10022</span>

    </td>

</tr>

Output :

$ python lxml_parse.py 
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']

Parsing against a list of variables is where I am having problems:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

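# Binary mode lets lxml detect the document encoding itself.
with open('test.htm', 'rb') as doc:
    outhtml = lh.parse(doc)

# Attempt: pair each label with the boldred text from its cell (does not work as written).
myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print(myresultset)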


1 Answer

Editorial Team · Answer 1 · 2026-06-04T07:22:24+00:00

lxml_tempsofsol.py :

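import lxml.html as lh

desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']

# Binary mode lets lxml detect the document encoding itself.
with open('test.htm', 'rb') as doc:
    outhtml = lh.parse(doc)

# For each label, select the <td> whose first text node contains it and take
# the first boldred <span> text. [0] raises IndexError if a label is missing.
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)[0]) for var in desiredvars)

for each in myresultset:
    print(each)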

Output :

$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')
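
A note on why the attempt in the question returns nothing: var sits inside the string literal, so XPath reads it as an element name rather than the Python value, and child::* only matches element children while the labels are text nodes. Printing the generator also shows the generator object instead of its items. The sketch below is one possible variant, not the only way to do it. It passes the label as an XPath variable ($label, filled from a keyword argument to xpath()), matches on "starts with" as the question asked, and yields None when a label is absent. The names SAMPLE, FIELD_XPATH and extract_fields are made up for illustration, and the inline HTML is a trimmed stand-in for test.htm.

```python
import lxml.html as lh

# Trimmed stand-in for test.htm (assumption: same cell layout as the question).
SAMPLE = """
<table>
<tr>
  <td>Street 1:<span class="required"> *</span><br />
      <span class="boldred">2100 5th Ave</span></td>
  <td>State:<br />
      <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN></td>
</tr>
</table>
"""

# $label is an XPath variable; lxml fills it from the keyword argument,
# which avoids quoting problems when building the expression by hand.
FIELD_XPATH = ('.//td[text()[starts-with(normalize-space(.), $label)]]'
               '/span[@class="boldred"]/text()')


def extract_fields(root, labels):
    """Return a tuple of (label, value) pairs; value is None if not found."""
    results = []
    for label in labels:
        matches = root.xpath(FIELD_XPATH, label=label)
        results.append((label, matches[0] if matches else None))
    return tuple(results)


root = lh.fromstring(SAMPLE)
fields = extract_fields(root, ['Street 1', 'State', 'Zip'])
print(fields)
```

Because the result is a real tuple of tuples rather than a generator, it can be printed, iterated more than once, or handed to a database call such as executemany. Here 'Zip' is not in the sample, so its pair carries None.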
