I’ve scraped some HTML into a large text file (~50k lines), and would like to extract a specific set of URLs. The URL that I’m after appears in one of two patterns:
First
<div class="pic">
<a href="https://www.site.com/joesmith"><img alt="Joe Smith" class="person_image" src="https://s3.amazonaws.com/photos.site.com/medium_jpg?12345678"></a>
</div>
Second
<div class="name">
<a href="https://www.site.com/joesmith">Joe Smith</a>
</div>
The text that I need is https://www.site.com/joesmith. I’m working with lxml for the first time, and I’m having a hard time putting this together.
Here’s my code:
from lxml import etree
from io import StringIO

def read(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    out = unicode(text, errors='ignore')
    return out

def parse(filename):
    data = read(filename)
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(data), parser)
    result = etree.tostring(tree.getroot(), pretty_print=True, method='HTML')
    urls = result.findall('<div class="name">')
    return urls
I’ve tried this code with both findall and findtext, and either way the result is the same: “AttributeError: 'str' object has no attribute 'findall'”. I have confirmed with type() that ‘result’ is a string.
Am I on the right path to extract the URL? How should I address this attribute error?
I’m not sure whether the HTML-based trees support XPath (I suspect they do). In that case you could simply do
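For what it's worth, lxml's HTML trees do expose an xpath() method, and the AttributeError in the question arises because etree.tostring() turns the tree back into a string, which has no findall. A minimal sketch of the XPath approach follows, assuming Python 3 and assuming the two div classes shown in the question ("pic" and "name") are the only ones of interest. The names SAMPLE and extract_urls are hypothetical, introduced only for illustration.

```python
from lxml import etree

# Sample markup copied from the two patterns in the question.
SAMPLE = """
<div class="pic">
<a href="https://www.site.com/joesmith"><img alt="Joe Smith" class="person_image" src="https://s3.amazonaws.com/photos.site.com/medium_jpg?12345678"></a>
</div>
<div class="name">
<a href="https://www.site.com/joesmith">Joe Smith</a>
</div>
"""


def extract_urls(markup):
    """Return the unique href values found under div.pic and div.name."""
    # Parse with the lenient HTML parser and keep the element tree;
    # do not serialise it back to a string before querying it.
    root = etree.HTML(markup)
    # Select the href attribute of <a> elements that are direct children
    # of a <div> whose class is exactly "pic" or "name".
    hrefs = root.xpath('//div[@class="pic" or @class="name"]/a/@href')
    # Both patterns can point at the same profile, so drop duplicates
    # while preserving the order of first appearance.
    seen = set()
    unique = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            unique.append(href)
    return unique


print(extract_urls(SAMPLE))
```

For the real file, the same function could be called on the text read from disk. The exact-match test on @class is an assumption: it would miss divs that carry additional class names.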