I am trying to screen scrape values from a website.

Question

Asked: May 29, 2026 (2026-05-29T15:12:30+00:00)

I am trying to screen scrape values from a website.

# get the raw HTML
fruitsWebsite = lxml.html.parse( "http://pagetoscrape.com/data.html" )

# get all divs with class fruit 
fruits = fruitsWebsite.xpath( '//div[@class="fruit"]' )

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print fruit.xpath('//li[@class="fruit"]/em')[0].text

However, the Python interpreter complains that index 0 is out of range. That’s surprising, because I am sure that the element exists. What is the proper way to access the inner <em> element with lxml?

1 Answer

Editorial Team · Answer 1 · 2026-05-29T15:12:31+00:00

The following code works for me with my test file.

#test.py
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse('test.html')

# get all divs with class fruit 
fruits = fruitsWebsite.xpath('//div[@class="fruit"]')

# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    # Use a relative path (note the leading .//) so each fruit div is searched on its own, instead of matching every li/em in the document once per div.
    for item in fruit.xpath('.//li[@class="fruit"]/em'):
        print(item.text)


# Alternatively, query the whole document once with a single absolute path
for item in fruitsWebsite.xpath('//div[@class="fruit"]//li[@class="fruit"]/em'):
    print(item.text)

Here is the HTML file I used to test against. If this doesn’t work for the HTML you’re testing against, you’ll need to post a sample file that fails, as I requested in the comments above.

<html>
<body>
Blah blah
<div>Ignore me</div>
<div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
    <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
    </ol>
    </div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body>
</html>
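
As a rough sketch of how the two path styles differ, the snippet below inlines a trimmed copy of this test file (so it runs without test.html) and collects the <em> texts per fruit div with an absolute and a relative path. The names PAGE, names_absolute and names_relative are made up for illustration, and it uses lxml.html.fromstring instead of parse only because the HTML is a string here.

```python
import lxml.html

# Hypothetical inline copy of the test page, trimmed slightly.
PAGE = """
<html><body>
<div>Outer stuff
    <div class='fruit'>Some <em>FRUITY</em> stuff.
    <ol>
        <li class='fruit'><em>This</em> should show</li>
        <li><em>Super</em> Ignored LI</li>
        <li class='fruit'><em>Rawr</em> Hear it roar.</li>
    </ol>
    </div>
</div>
<div class='fruit'><em>Super</em> fruity website of awesome</div>
</body></html>
"""

root = lxml.html.fromstring(PAGE)
fruits = root.xpath('//div[@class="fruit"]')


def names_absolute(fruit):
    # '//' starts at the document root, so every fruit div gets the same list.
    return [em.text for em in fruit.xpath('//li[@class="fruit"]/em')]


def names_relative(fruit):
    # './/' starts at this fruit div and only searches its own subtree.
    return [em.text for em in fruit.xpath('.//li[@class="fruit"]/em')]


for fruit in fruits:
    print(names_absolute(fruit), names_relative(fruit))
```

With this input, the absolute path gives the same two names for both divs, while the relative path gives them only for the div that actually contains the list.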

The code you originally posted will definitely return too many results: the XPath in the loop starts with //, so it searches the entire tree rather than the subtree of each “fruit” div. The error you’re describing doesn’t make much sense unless your input is different from what I understood; an out-of-range index on [0] would mean the XPath matched nothing at all.
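
If some fruit divs on the real page have no matching <li>, one way to avoid the out-of-range error is to check the result list before indexing it. This is only a sketch under that assumption; SNIPPET and first_fruit_name are hypothetical names, and the markup is invented rather than taken from the page in the question.

```python
import lxml.html

# Hypothetical snippet: one fruit div with a matching <em>, one without.
SNIPPET = """
<html><body>
<div class="fruit"><ul><li class="fruit"><em>Apple</em> pie</li></ul></div>
<div class="fruit">No list here</div>
</body></html>
"""


def first_fruit_name(fruit):
    """Return the text of the first matching <em> under this div, or None."""
    matches = fruit.xpath('.//li[@class="fruit"]/em')
    # xpath() returns a plain list here; [0] on an empty list raises IndexError.
    return matches[0].text if matches else None


root = lxml.html.fromstring(SNIPPET)
names = [first_fruit_name(div) for div in root.xpath('//div[@class="fruit"]')]
print(names)
```

Returning None for an empty match is just one choice; skipping the div or logging it would work equally well, depending on what the scraper needs.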
