I’m using lxml to scrape some HTML that looks like this:
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
How can I end up with data in the form
[ {'category': 'Football', 'title': 'Team A'},
{'category': 'Football', 'title': 'Team B'},
{'category': 'Baseball', 'title': 'Team C'},
{'category': 'Baseball', 'title': 'Team D'}]
So far I’ve got:
results = []
for a in content[0].xpath('./a'):
    data = {'text': a.text}  # only the link text so far; the category is still missing
    results.append(data)
But I don’t know how to get the category name by splitting at font-size and retaining sibling tags – any advice?
Thanks!
I had success with the following code:
It will print:
Scraping is fragile. Here, for example, the code depends explicitly on the order of the elements as well as on their nesting. Sometimes, though, such a hardwired approach is good enough.
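An order-dependent approach of this kind might look like the sketch below. It is only an illustration: the wrapping `<div>` around the question's markup, the `HTML` constant, and the function name `scrape_in_order` are assumptions made so that the example runs on its own. The idea is to walk the children in document order, remember the most recent heading `div` as the current category, and emit one record for each `a` that follows it.

```python
import lxml.html

# Markup from the question, wrapped in an outer <div> (an assumption)
# so that the fragment has a single root element.
HTML = '''<div>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</div>'''


def scrape_in_order(html):
    """Pair each link with the heading div that most recently preceded it."""
    root = lxml.html.fromstring(html)
    results = []
    category = None  # stays None for links that appear before any heading
    for child in root:  # direct children, in document order
        if child.tag == 'div':
            # A heading div: its text becomes the current category.
            category = child.text_content().strip()
        elif child.tag == 'a':
            results.append({'category': category, 'title': child.text})
    return results


results = scrape_in_order(HTML)
for row in results:
    print(row)
```

This relies on every heading being a `div` and every team being a direct-child `a`; if the page nests things differently, the tag checks would need to change.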
Here is another, more XPath-oriented approach using the preceding-sibling axis:
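A minimal sketch of the preceding-sibling idea, again assuming the question's markup is wrapped in a single outer `<div>` (the `HTML` constant and the name `scrape_with_xpath` are made up for this example). For each team link, `preceding-sibling::div[1]` selects the nearest `div` that comes before it among its siblings, and `string(...)` returns that heading's text.

```python
import lxml.html

# Markup from the question, wrapped in an outer <div> (an assumption)
# so that the fragment has a single root element.
HTML = '''<div>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</div>'''


def scrape_with_xpath(html):
    """Look up each link's category via the preceding-sibling axis."""
    root = lxml.html.fromstring(html)
    results = []
    # './a' matches only direct-child links, so the links nested inside
    # the heading divs are not treated as teams.
    for a in root.xpath('./a'):
        # On a reverse axis, [1] is the nearest node, i.e. the closest
        # heading div before this link.
        category = a.xpath('string(preceding-sibling::div[1])').strip()
        results.append({'category': category, 'title': a.text})
    return results


results = scrape_with_xpath(HTML)
for row in results:
    print(row)
```

Compared with walking the children in order, this keeps no state between iterations, at the cost of one extra XPath evaluation per link.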