I’m using lxml to scrape some HTML that looks like this: <div align=center><a style=font-size:

Question

0

Asked: May 23, 20262026-05-23T05:26:57+00:00 2026-05-23T05:26:57+00:00

I’m using lxml to scrape some HTML that looks like this: <div align=center><a style=font-size:

0

I’m using lxml to scrape some HTML that looks like this:

<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>

How can I end up with data in the form

[ {'category': 'Football', 'title': 'Team A'},
{'category': 'Football', 'title': 'Team B'},
{'category': 'Baseball', 'title': 'Team C'},
{'category': 'Baseball', 'title': 'Team D'}]

So far I’ve got:

results = []
for (i,a) in enumerate(content[0].xpath('./a')):
     data['text'] = a.text
     results.append(data)

But I don’t know how to get the category name by splitting at font-size and retaining sibling tags – any advice?

Thanks!

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-23T05:26:57+00:00

I had success with the following code:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []
current_category = None

for element in body.xpath('./*'):
    if element.tag == 'div':
        current_category = element.xpath('./a')[0].text
    elif element.tag == 'a':
        results.append({ 'category' : current_category, 
            'title' : element.text })

print results

It will print:

[{'category': 'Football', 'title': 'Team A'}, 
 {'category': 'Football', 'title': 'Team B'}, 
 {'category': 'Baseball', 'title': 'Team C'}, 
 {'category': 'Baseball', 'title': 'Team D'}]

Scraping is fragile. Here for example, we depend explicitly on the ordering of the elements as well as the nesting. However, sometimes such a hardwired approach might be good enough.

Here is another (more xpath-oriented approach) using the preceding-sibling axis:

#!/usr/bin/env python

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

import lxml.html

html = lxml.html.fromstring(snippet)
body = html[1]

results = []

for e in body.xpath('./a'):
    results.append(dict(
        category=e.xpath('preceding-sibling::div/a')[-1].text,
        title=e.text))

print results

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m using lxml to scrape some HTML that looks like this: <div align=center><a style=font-size:

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply