I am trying to screen scrape values from a website.
import lxml.html

# get the raw HTML
fruitsWebsite = lxml.html.parse( "http://pagetoscrape.com/data.html" )
# get all divs with class fruit
fruits = fruitsWebsite.xpath( '//div[@class="fruit"]' )
# Print the name of this fruit (obtained from an <em> in the fruit div)
for fruit in fruits:
    print(fruit.xpath('//li[@class="fruit"]/em')[0].text)
However, the Python interpreter complains that index 0 is out of range. That’s surprising, because I am sure the element exists. What is the proper way to access the inner <em> element with lxml?
The following code works for me with my test file.
Here is the HTML file I used to test against. If this doesn’t work for the HTML you’re testing against, you’ll need to post a sample file that fails, as I requested in the comments above.
You will definitely get too many results with the code you originally posted: the XPath inside the loop starts with //, so it searches the entire tree rather than the subtree for each “fruit”. The error you’re describing doesn’t make much sense unless your input is different from what I understood.
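To illustrate the whole-tree versus subtree point, here is a minimal sketch. The markup in SAMPLE_HTML is a hypothetical stand-in, not the real page, and it assumes each fruit is a <div class="fruit"> containing one <em>. An XPath starting with // is evaluated from the document root even when called on a child element, while one starting with .// is evaluated relative to that element.

```python
import lxml.html

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<html><body>
  <div class="fruit"><em>apple</em></div>
  <div class="fruit"><em>banana</em></div>
  <div class="fruit"><em>cherry</em></div>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE_HTML)
fruits = root.xpath('//div[@class="fruit"]')

# Absolute path: '//' starts at the document root, so every iteration
# matches all <em> elements on the page, not just this fruit's.
absolute_counts = [len(fruit.xpath('//div[@class="fruit"]/em')) for fruit in fruits]

# Relative path: './/' starts at the current element, so each iteration
# sees only the <em> inside its own <div>.
names = [fruit.xpath('.//em')[0].text for fruit in fruits]

print(absolute_counts)
print(names)
```

With this sample, the absolute query finds three <em> elements on every iteration, while the relative query finds exactly one per fruit. As a side note, the inner query in the question looks for li elements even though the outer query selects div elements. If the real page has no <li class="fruit">, that query returns an empty list and [0] raises an IndexError, which might be one source of the reported error.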