I need help parsing out some text from a page with lxml. I tried

Question


Asked: May 16, 2026


I need help parsing out some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn’t work. So I have moved on to lxml, but the docs are a little confusing, and I was hoping someone here could help me.

Here is the page I am trying to parse; I need to get the text under the “Additional Info” section. Note that I have a lot of pages like this on the site to parse, and each page’s HTML is not always exactly the same (it might contain some extra empty “td” tags). Any suggestions as to how to get at that text would be very much appreciated.

Thanks for the help.


1 Answer

Editorial Team · Answer 1 · 2026-05-16T18:20:16+00:00

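import lxml.html as lh
from urllib.request import urlopen  # Python 3; the original used urllib2 (Python 2)

def text_tail(node):
    # Yield the text inside the node, then the text that follows its closing tag.
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urlopen(url))
blurb = []  # stays empty if no 'Additional  Info' cell is found
for elt in doc.iter('td'):
    label = elt.text_content()
    if label.startswith('Additional  Info'):
        # Collect every non-empty text fragment from the td siblings that follow.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != '\xa0']
        break
print('\n'.join(blurb))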

yields

For over 65 years, Carl Stirn’s Marine
has been setting new standards of
excellence and service for boating
enjoyment. Because we offer quality
merchandise, caring, conscientious,
sales and service, we have been able
to make our customers our good
friends.

Our 26,000 sq. ft. facility includes a
complete parts and accessories
department, full service department
(Merc. Premier dealer with 2 full time
Mercruiser Master Tech’s), and new,
used, and brokerage sales.
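
The text_tail helper yields both node.text and node.tail because lxml stores text that follows a child element's closing tag on that child as its .tail. Lines separated by <br> tags therefore show up as tails of the <br> elements, not as the cell's own text. Here is a minimal sketch on a made-up fragment (the HTML below is an assumption for illustration, not the page from the question):

```python
import lxml.html as lh

# Hypothetical fragment, invented for illustration only.
snippet = ('<table><tr><td>first line<br>second line<br>third line</td>'
           '</tr></table>')
root = lh.fromstring(snippet)
td = next(root.iter('td'))

# Only the text before the first child element lives in td.text.
print(repr(td.text))

# Text after each <br> is stored on that <br> as its tail.
for child in td:
    print(child.tag, repr(child.tail))

# Walking text and tail of every node recovers the lines separately.
fragments = [t for node in td.iter()
             for t in (node.text, node.tail) if t]
print(fragments)

# text_content() returns the same text joined into a single string.
joined = td.text_content()
print(repr(joined))
```

Collecting text and tail per node keeps the line boundaries, which is why the output above prints one fragment per line.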

Edit: Here is an alternate solution based on Steven D. Majewski’s XPath, which addresses the OP’s comment that the number of tags separating ‘Additional Info’ from the blurb may not be known in advance:

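import lxml.html as lh
from urllib.request import urlopen  # Python 3; the original used urllib2 (Python 2)

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urlopen(url))

# Find the td whose child element's text is 'Additional  Info', then take the
# text nodes of every td sibling that follows it.
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Drop cells that hold only a non-breaking space.
blurb = [text for text in blurb if text != '\xa0']
print('\n'.join(blurb))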
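
Since the question mentions many pages with a varying number of empty “td” tags, the XPath can be tried offline on a small sample before running it against the site. The sketch below is only an illustration: the function name additional_info and the sample row are made up, and the label text (with its two spaces) is copied from the code above, not checked against the real page.

```python
import lxml.html as lh

# Same XPath as in the answer; the label text is copied from that code.
XPATH = ('//td[child::*[text()="Additional  Info"]]'
         '/following-sibling::td/text()')

def additional_info(root):
    """Return the non-blank text fragments that follow the label cell."""
    # In Python 3, str.strip() also removes non-breaking spaces,
    # so cells holding only &nbsp; are dropped here.
    return [t.strip() for t in root.xpath(XPATH) if t.strip()]

# Made-up row with two extra cells: one empty, one holding only &nbsp;.
sample = ('<table><tr>'
          '<td><b>Additional  Info</b></td>'
          '<td></td>'
          '<td>&nbsp;</td>'
          '<td>Line one<br>Line two</td>'
          '</tr></table>')

lines = additional_info(lh.fromstring(sample))
print('\n'.join(lines))
```

Filtering with strip() instead of comparing against a single non-breaking space is a design choice of this sketch; it also discards cells that contain ordinary whitespace only.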
