I have this small class:
class HTMLTagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, data):
        self.fed.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            return attrs[0][1]

    def get_data(self):
        return ''.join(self.fed)
parsing this HTML code:
<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>
This is the result I get: long text click here
but I want to get: long text click somelink.com
Is there a way to do this?
I was actually checking out this new HTML parser library and came up with this solution:
You can download the library from SourceForge or from the Python Package Index: HtmlDom. It works on Python 3.x. The documentation of the library is not that good, but it is understandable. Hope you like the answer :)
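For comparison, here is a minimal sketch that stays with the standard library's HTMLParser from the question, assuming Python 3 (`html.parser`). The value returned from `handle_starttag` is ignored by the parser, so the sketch appends the `href` to `self.fed` instead and skips the link text. The `in_link` flag and the whitespace collapsing in `get_data` are my own additions, not part of the original class.

```python
from html.parser import HTMLParser


class HTMLTagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fed = []
        self.in_link = False  # hypothetical flag: True while inside <a href=...>

    def handle_starttag(self, tag, attrs):
        # HTMLParser discards whatever this method returns,
        # so the href has to be stored explicitly.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.fed.append(href)
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Skip the link text; the href has already been recorded in its place.
        if not self.in_link:
            self.fed.append(data)

    def get_data(self):
        # Collapse the whitespace and newlines left over from the markup.
        return ' '.join(''.join(self.fed).split())


html = '''<div id="footer">
<p>long text.</p>
<p>click <a href="somelink.com">here</a>
</div>'''

stripper = HTMLTagStripper()
stripper.feed(html)
stripper.close()
print(stripper.get_data())  # prints: long text. click somelink.com
```

Looking the attribute up by name with `dict(attrs).get('href')` rather than `attrs[0][1]` avoids picking up the wrong value when `href` is not the first attribute of the tag.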