I’ve been working on a basic web crawler in Python using the HTMLParser Class.

Question

Asked: May 31, 2026

I’ve been working on a basic web crawler in Python using the HTMLParser class. I collect links with an overridden handle_starttag method that looks like this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]

This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links.

How would I go about only fetching links that are between the <td class="title"> and </td> tags, like this:

<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>

1 Answer

Editorial Team · Answer 1 · 2026-05-31T12:41:52+00:00

HTMLParser is a SAX-style or streaming parser, which means that you get pieces of the document as they are parsed, but not the whole document at once. The parser calls methods you provide to handle tags and other types of data. Any context you may be interested in, such as which tags are inside other tags, you must glean from the tags you see passing by.

For example, if you see a <td> tag, then you know you are in a table cell, and can set a flag to that effect. When you see </td>, you know you have left a table cell and can clear that flag. (Actually, you want to use a counter rather than a simple flag to allow nesting of tables, but we’ll let that slide).

To get the links inside a table cell, then, if you see <a> and you know that you are in a table cell (because of that flag you set), you grab the value of the tag’s href attribute if it has one.

from html.parser import HTMLParser  # Python 3 module; Python 2 used "from HTMLParser import HTMLParser"

class LinkExtractor(HTMLParser):

    def reset(self):
        HTMLParser.reset(self)
        self.extracting = False
        self.links      = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)   # save us from iterating over the attrs
        if tag == "td" and attrs.get("class", "") == "title":
            self.extracting = True
        elif tag == "a" and "href" in attrs and self.extracting:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False
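
The parenthetical about using a counter instead of a flag can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name NestedLinkExtractor, the depth attribute, and the sample markup are assumptions made for the example, and it targets Python 3 (html.parser). The counter tracks how many <td> elements are open since the title cell was entered, so a nested table inside that cell does not end extraction early.

```python
from html.parser import HTMLParser


class NestedLinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags inside <td class="title">, allowing nested cells."""

    def reset(self):
        super().reset()
        self.depth = 0    # open <td> tags counted from the title cell inward
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Start counting at a title cell; keep counting any cell nested in it.
            if self.depth > 0 or attrs.get("class") == "title":
                self.depth += 1
        elif tag == "a" and "href" in attrs and self.depth > 0:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1


# Hypothetical sample: the title cell contains a nested table before its link.
html = (
    '<table><tr>'
    '<td class="title"><table><tr><td>nested cell</td></tr></table>'
    '<a href="http://www.stackoverflow.com">StackOverflow</a></td>'
    '<td><a href="/skipped">not wanted</a></td>'
    '</tr></table>'
)
parser = NestedLinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://www.stackoverflow.com']
```

With the simple flag, the inner </td> would clear extracting and the link that follows the nested table would be missed. Note that the class comparison is an exact string match, so a cell with several classes (for example class="title first") would not be picked up.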

This quickly becomes painful as you need more and more context to pull what you want out of the document, which is why lxml and BeautifulSoup are so often recommended. These are DOM-style parsers that keep track of the document hierarchy for you and provide friendlier ways to navigate it, such as a DOM API, XPath, and/or CSS selectors.
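
To give a feel for the tree-based style without installing anything, here is a rough sketch using the standard library's xml.etree.ElementTree as a stand-in. It is only an approximation of what lxml or BeautifulSoup offer: ElementTree accepts well-formed XML only and supports a limited XPath subset, so real-world HTML would still call for one of those libraries. The wrapping <table><tr> markup and the second cell are assumptions added so the snippet from the question parses as a single document.

```python
import xml.etree.ElementTree as ET

# The cell from the question, wrapped in a table row (assumed) plus one
# non-title cell (assumed) to show that it is filtered out.
snippet = (
    '<table><tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td><a href="http://example.com/other">other</a></td>'
    '</tr></table>'
)

root = ET.fromstring(snippet)

# One path expression replaces the flag bookkeeping: every <a> with an href
# anywhere below a <td> whose class attribute is exactly "title".
title_links = [a.get("href") for a in root.findall(".//td[@class='title']//a[@href]")]
print(title_links)  # ['http://www.stackoverflow.com']
```

The parser builds the whole tree first, so the nesting context is part of the query rather than state you maintain by hand.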

BTW, I answered a similar question recently here.
