I am trying to parse a website. I am using the HTMLParser module. The

Question

Asked: May 26, 2026 (2026-05-26T15:19:17+00:00)

I am trying to parse a website using the HTMLParser module. I want to extract the first <a href=""> that follows the comment <!-- /topOfPage -->, but I don't know how to do it. The documentation describes a method called handle_comment, but I haven't worked out how to use it correctly. This is what I have so far:

import HTMLParser

class LinkFinder(HTMLParser.HTMLParser):
    def __init__(self, *args, **kwargs):
        # Can't use super() - HTMLParser is an old-style class
        HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
        self.in_linktag = False
        self.url_cache = []

    def handle_comment(self, data):
        if data == "topOfPage":
            print data

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any("href" == t[0] for t in attrs): # found link
            self.in_linktag = True
            self.url_cache.append([dict(attrs)['href']])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag: # ignore '<a name=""...'
            self.in_linktag = False

    def handle_data(self, data):
        if self.in_linktag:
            self.url_cache[-1].append(data)

TESTDATA = """
<html>
<body>
<div>
 <ul>
    <!-- /topOfPage -->
<tr>
    <td class="empty-cell-left">&nbsp;</td>
    <td class="image">


    <a href="http://test" rel="nofollow">
 <ul>
</div>
</body>
</html>
"""

def main():
    lf = LinkFinder()
    lf.feed(TESTDATA)
    lf.close()
    print lf.url_cache

if __name__ == "__main__":
    main()

How to do it?

1 Answer

Editorial Team · Answer 1 · 2026-05-26T15:19:17+00:00

You need an additional flag to record that the parser has just passed the comment, so that you can save the href of the first link that follows it.

def __init__(self, *args, **kwargs):
    # ...
    self.first_link_after_comment = False

Then, when you encounter the comment, set the flag. The comment text is passed in with its surrounding whitespace, so strip it before comparing.

def handle_comment(self, data):
    if data.strip() == '/topOfPage':
        self.first_link_after_comment = True
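
To see why the comparison in the question never matched, it can help to look at what handle_comment actually receives. The following is a minimal sketch, assuming Python 3 (where the module is named html.parser rather than HTMLParser); CommentDumper is a hypothetical helper used only for illustration.

```python
from html.parser import HTMLParser  # Python 3 name; the question uses Python 2's HTMLParser


class CommentDumper(HTMLParser):
    """Hypothetical helper: records the raw text of every comment it sees."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is everything between "<!--" and "-->", spaces included
        self.comments.append(data)


dumper = CommentDumper()
dumper.feed("<ul><!-- /topOfPage --></ul>")
dumper.close()

print(repr(dumper.comments[0]))                    # ' /topOfPage '
print(dumper.comments[0] == "topOfPage")           # False: no slash, no spaces
print(dumper.comments[0].strip() == "/topOfPage")  # True
```

The text includes the leading slash and both spaces, which is why the answer compares data.strip() against '/topOfPage'.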

When you handle an opening tag, return immediately if the parser has not yet passed the comment.

def handle_starttag(self, tag, attrs):
    if not self.first_link_after_comment:
        return
    # ...

Conversely, when you handle the closing tag, reset the flag so that links further down the page are ignored.

def handle_endtag(self, tag):
    if tag == 'a' and self.in_linktag: # ignore '<a name=""...'
        self.in_linktag = False
        self.first_link_after_comment = False

Finally, when you append data, make sure it is not an empty or whitespace-only string.

def handle_data(self, data):
    if self.in_linktag and data.strip():
        self.url_cache[-1].append(data)

And here you are.

$ your_script.py
[['http://test']]
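
For reference, the pieces above can be assembled into one script. This is a sketch rather than a definitive version: it assumes Python 3 (html.parser, super(), print as a function), whereas the question targets Python 2, and it uses the test data from the question with the tags written without stray spaces.

```python
from html.parser import HTMLParser  # Python 3; in Python 2 this is the HTMLParser module


class LinkFinder(HTMLParser):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.in_linktag = False
        self.first_link_after_comment = False  # set once the comment is seen
        self.url_cache = []

    def handle_comment(self, data):
        # The comment text arrives with its surrounding whitespace
        if data.strip() == "/topOfPage":
            self.first_link_after_comment = True

    def handle_starttag(self, tag, attrs):
        if not self.first_link_after_comment:
            return  # still above the comment: ignore every tag
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:  # found link
            self.in_linktag = True
            self.url_cache.append([attrs["href"]])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:  # ignore '<a name=""...'
            self.in_linktag = False
            self.first_link_after_comment = False  # stop collecting links

    def handle_data(self, data):
        # Skip empty or whitespace-only text inside the link
        if self.in_linktag and data.strip():
            self.url_cache[-1].append(data)


TESTDATA = """
<html>
<body>
<div>
 <ul>
    <!-- /topOfPage -->
<tr>
    <td class="empty-cell-left">&nbsp;</td>
    <td class="image">
    <a href="http://test" rel="nofollow">
 <ul>
</div>
</body>
</html>
"""

lf = LinkFinder()
lf.feed(TESTDATA)
lf.close()
print(lf.url_cache)  # [['http://test']]
```

One caveat: the flag is only reset in handle_endtag, so if the first link has no closing </a> (as in this test data), any later <a href=""> would be collected as well. If that matters, the flag could be reset in handle_starttag instead.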
