I’ve scraped some HTML into a large txt file (~50k lines), and would like

Asked: June 14, 2026 (2026-06-14T05:44:27+00:00)

I’ve scraped some HTML into a large txt file (~50k lines), and would like to extract a specific set of URLs. The URL that I’m after is in one of two patterns:

First

<div class="pic">
  <a href="https://www.site.com/joesmith"><img alt="Joe Smith" class="person_image" src="https://s3.amazonaws.com/photos.site.com/medium_jpg?12345678"></a>
</div>

Second

<div class="name">
  <a href="https://www.site.com/joesmith">Joe Smith</a>
</div>

The text that I need is https://www.site.com/joesmith. I’m working with lxml for the first time, and I’m having a hard time putting this together.

Here’s my code:

from lxml import etree
from io import StringIO

def read(filename):
  file = open(filename, 'r')
  text = file.read()
  file.close()
  out = unicode(text, errors='ignore')
  return out

def parse(filename):
  data = read(filename)
  parser = etree.HTMLParser()
  tree = etree.parse(StringIO(data), parser)
  result = etree.tostring(tree.getroot(), pretty_print=True, method='HTML')
  urls = result.findall('<div class="name">')
  return urls

I’ve tried this code with both findall and findtext, and either way the result is the same: “AttributeError: 'str' object has no attribute 'findall'”. I have confirmed with type() that ‘result’ is a string.

Am I headed on the right path to extract the URL? How should I address this attribute error?

1 Answer

Editorial Team · Answer 1 · 2026-06-14T05:44:28+00:00

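lxml’s HTML trees do support XPath. The AttributeError comes from etree.tostring(), which serializes the tree to a string, and a string has no findall method. Skip that step and query the parsed tree directly: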

urls = (tree.xpath('//div[@class="pic"]/a/@href') +
        tree.xpath('//div[@class="name"]/a/@href'))
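
To make this concrete, here is a minimal, self-contained sketch that runs the XPath query against the two snippets from the question. The names extract_urls and SAMPLE_HTML are hypothetical, introduced only for illustration, and de-duplicating the results is an assumption about what is wanted, since both patterns point at the same profile URL.

```python
from lxml import etree

# Sample markup copied from the question; a real run would use the scraped file.
SAMPLE_HTML = """
<div class="pic">
  <a href="https://www.site.com/joesmith"><img alt="Joe Smith" class="person_image" src="https://s3.amazonaws.com/photos.site.com/medium_jpg?12345678"></a>
</div>
<div class="name">
  <a href="https://www.site.com/joesmith">Joe Smith</a>
</div>
"""


def extract_urls(html_text):
    """Return the unique hrefs of links directly inside div.pic or div.name."""
    parser = etree.HTMLParser()
    # fromstring returns the root element; query it directly, do not serialize it.
    root = etree.fromstring(html_text, parser)
    hrefs = root.xpath('//div[@class="pic" or @class="name"]/a/@href')
    # Assumption: duplicates are unwanted, so drop them while keeping order.
    return list(dict.fromkeys(str(h) for h in hrefs))


urls = extract_urls(SAMPLE_HTML)
print(urls)
```

For the scraped file itself, etree.parse(filename, parser) accepts a path and returns a tree that offers the same xpath method, so the Python 2 style read() helper with unicode() should not be needed.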
