Say I have html code similar to this: <a href=http://example.org/>Stuff I do want</a> <p>Stuff


Asked: May 30, 2026 (2026-05-30T04:43:51+00:00)


Say I have HTML code similar to this:

<a href="http://example.org/">Stuff I do want</a>
<p>Stuff I don't want</p>

HTMLParser’s handle_data doesn’t differentiate between the link text (the stuff I do want; is this even the right term?) and the stuff I don’t want. Does HTMLParser have a built-in way to have handle_data return only the link text and nothing else?


1 Answer

Editorial Team · Answer 1 · 2026-05-30T04:43:52+00:00

Basically you have to write a handle_starttag() method as well. Save every tag you see as self.lasttag or something similar. Then, in your handle_data() method, check self.lasttag and see if it’s 'a' (indicating that the last tag you saw was an HTML anchor tag and therefore you’re in a link).

Something like this (untested) should work:

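from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    lasttag = None  # name of the most recently opened tag

    def handle_starttag(self, tag, attr):
        self.lasttag = tag.lower()

    def handle_endtag(self, tag):
        # Text after a closing tag no longer belongs to that tag.
        self.lasttag = None

    def handle_data(self, data):
        # Print non-blank text that directly follows an <a> start tag.
        if self.lasttag == "a" and data.strip():
            print(data)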

In fact, HTML permits other tags inside an <a ...> ... </a> element, and an anchor can contain text without being a link (no href attribute). Both cases can be handled if desired. Again, this code is untested:

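from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    inlink = False  # True while inside an <a href=...> element
    data   = []     # text fragments of the link currently being read

    def handle_starttag(self, tag, attr):
        # Only anchors that carry an href attribute count as links.
        if tag.lower() == "a" and "href" in (k.lower() for k, v in attr):
            self.inlink = True
            self.data   = []

    def handle_endtag(self, tag):
        # Print once per link; skip the </a> of an anchor without href.
        if tag.lower() == "a" and self.inlink:
            self.inlink = False
            print("".join(self.data))

    def handle_data(self, data):
        if self.inlink:
            self.data.append(data)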
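
To try this approach on markup like the question's, here is a minimal sketch. It assumes Python 3 (html.parser), and the names LinkTextCollector, links and _parts are made up for illustration. It collects the link text in a list instead of printing it, so the result can be inspected afterwards.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of every <a href=...> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished link texts, in document order
        self._parts = None   # fragments of the link being read, or None

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._parts = []

    def handle_endtag(self, tag):
        # Only close a link that was actually opened with an href.
        if tag == "a" and self._parts is not None:
            self.links.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        # Text inside nested tags such as <b> is kept as well.
        if self._parts is not None:
            self._parts.append(data)


html = (
    '<a href="http://example.org/">Stuff I <b>do</b> want</a>\n'
    "<p>Stuff I don't want</p>\n"
    '<a name="top">Anchor without href</a>'
)
parser = LinkTextCollector()
parser.feed(html)
parser.close()
print(parser.links)  # ['Stuff I do want']
```

The text inside the nested <b> tag is kept, while the paragraph and the anchor without an href are skipped.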

HTMLParser is what you’d call a SAX-style parser, which notifies you of the tags going by but makes you keep track of the tag hierarchy yourself. You can see how complicated this can get just by the differences between the first and second versions here.

DOM-style parsers are easier to work with for these kinds of tasks because they read the whole document into memory and produce a tree that is easily navigated and searched. DOM-style parsers tend to use more memory and be slower than SAX-style parsers, but this is much less important now than it was ten years ago.
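
For comparison, a tree-based version of the same task might look like the sketch below. It uses the standard library's xml.etree.ElementTree, which is an XML parser rather than an HTML one, so it assumes the markup is well-formed and wrapped in a single root element (the <div> here is added only for that purpose). Real-world HTML usually needs a dedicated HTML tree parser. The function name link_texts is made up for illustration.

```python
import xml.etree.ElementTree as ET


def link_texts(markup):
    """Return the text of each <a> element that has an href attribute."""
    root = ET.fromstring(markup)  # builds the whole tree in memory
    return [
        "".join(a.itertext())     # includes the text of nested elements
        for a in root.iter("a")
        if "href" in a.attrib
    ]


# Assumption: well-formed markup with one root element.
markup = (
    "<div>"
    '<a href="http://example.org/">Stuff I <b>do</b> want</a>'
    "<p>Stuff I don't want</p>"
    '<a name="top">Anchor without href</a>'
    "</div>"
)
print(link_texts(markup))  # ['Stuff I do want']
```

Once the tree exists, no state flags are needed: the search is a single expression over the tree.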
