Say I have HTML code similar to this:
<a href="http://example.org/">Stuff I do want</a>
<p>Stuff I don't want</p>
Using HTMLParser’s handle_data doesn’t differentiate between the link text (the stuff I do want; is this even the right term?) and the stuff I don’t want. Does HTMLParser have a built-in way to have handle_data return only the link text and nothing else?
Basically you have to write a handle_starttag() method as well. Just save off every tag you see as self.lasttag or something. Then, in your handle_data() method, just check self.lasttag and see if it’s 'a' (indicating that the last tag you saw was an HTML anchor tag and therefore you’re in a link). Something along these lines should work, though it is untested.
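Here is a minimal sketch of that idea, assuming Python 3's html.parser module. The class name LinkTextParser and the attributes last_tag and link_text are illustrative choices, not anything HTMLParser provides. Note that this simple version only looks at the most recent start tag, so it also sees whitespace that follows a closing </a>; the sketch skips whitespace-only data to work around that.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects text whose most recent start tag was <a> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.last_tag = None   # name of the last start tag seen
        self.link_text = []    # text chunks found right after an <a> tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.last_tag = tag

    def handle_data(self, data):
        # Keep non-blank text only when the last start tag was an anchor.
        if self.last_tag == "a" and data.strip():
            self.link_text.append(data)


parser = LinkTextParser()
parser.feed(
    '<a href="http://example.org/">Stuff I do want</a>\n'
    "<p>Stuff I don't want</p>"
)
parser.close()
print(parser.link_text)
```

HTMLParser never returns anything from handle_data, so the usual pattern is to accumulate results on the parser object and read them after feed().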
In fact it’s permissible in HTML to have other tags inside an <a ...> ... </a> container. And there can also be anchors that contain text but aren’t links (no href= attribute). These cases can both be handled if desired (again, untested).
HTMLParser is what you’d call a SAX-style parser, which notifies you of the tags going by but makes you keep track of the tag hierarchy yourself. You can see how complicated this can get just by the differences between the first and second versions here.
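To handle nested tags inside an anchor and anchors without an href, one option is to track whether the parser is currently inside a link instead of remembering the last tag. The following is a sketch under the same assumptions as before (Python 3's html.parser); HrefTextParser, in_link, chunks, and links are made-up names, and nested anchors are not handled.

```python
from html.parser import HTMLParser


class HrefTextParser(HTMLParser):
    """Collects the full text of each <a href=...> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_link = False   # True between <a href=...> and its </a>
        self.chunks = []       # text pieces of the link currently open
        self.links = []        # completed link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; only anchors with href count.
        if tag == "a" and "href" in dict(attrs):
            self.in_link = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link = False
            self.links.append("".join(self.chunks))

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still inside the link.
        if self.in_link:
            self.chunks.append(data)


parser = HrefTextParser()
parser.feed(
    '<a href="http://example.org/">Stuff I <b>do</b> want</a>'
    '<a name="top">Not a link</a>'
    "<p>Stuff I don't want</p>"
)
parser.close()
print(parser.links)
```

Compared with the last-tag approach, this version needs a handle_endtag() method and extra state, which is the bookkeeping a SAX-style parser leaves to the caller.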
DOM-style parsers are easier to work with for these kinds of tasks because they read the whole document into memory and produce a tree that is easily navigated and searched. DOM-style parsers tend to use more memory and be slower than SAX-style parsers, but this is much less important now than it was ten years ago.
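For comparison, here is a rough sketch of the tree-based style using xml.etree.ElementTree from the Python standard library. This is an XML parser, so the example assumes the markup is well-formed and wrapped in a single root element (the <div> wrapper is added purely for illustration); real-world HTML often needs a more lenient tree-building parser.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup; the outer <div> is an assumption made so that
# the fragment has a single root element, as XML requires.
markup = (
    "<div>"
    '<a href="http://example.org/">Stuff I <b>do</b> want</a>'
    "<p>Stuff I don't want</p>"
    "</div>"
)

root = ET.fromstring(markup)

# Walk the tree, keep anchors that have an href, and join all text inside them.
link_texts = [
    "".join(a.itertext())
    for a in root.iter("a")
    if "href" in a.attrib
]
print(link_texts)
```

With the whole tree in memory, the nesting and href checks reduce to a single expression, at the cost of parsing the entire document up front.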