I am trying to parse a website using the HTMLParser module. I want to extract the first <a href=""> that follows the comment <!-- /topOfPage -->, but I don’t know how to do it. The documentation mentions a method called handle_comment, but I haven’t worked out how to use it correctly. This is what I have so far:
import HTMLParser


class LinkFinder(HTMLParser.HTMLParser):
    def __init__(self, *args, **kwargs):
        # Can't use super() - HTMLParser is an old-style class
        HTMLParser.HTMLParser.__init__(self, *args, **kwargs)
        self.in_linktag = False
        self.url_cache = []

    def handle_comment(self, data):
        if data == "topOfPage":
            print data

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any("href" == t[0] for t in attrs):  # found link
            self.in_linktag = True
            self.url_cache.append([dict(attrs)['href']])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:  # ignore '<a name=""...'
            self.in_linktag = False

    def handle_data(self, data):
        if self.in_linktag:
            self.url_cache[-1].append(data)


TESTDATA = """
<html>
<body>
<div>
<ul>
<!-- /topOfPage -->
<tr>
<td class="empty-cell-left"> </td>
<td class="image">
<a href="http://test" rel="nofollow">
<ul>
</div>
</body>
</html>
"""


def main():
    lf = LinkFinder()
    lf.feed(TESTDATA)
    lf.close()
    print lf.url_cache


if __name__ == "__main__":
    main()
How can I do this?
You need an additional variable that indicates whether the parser has already passed the comment, so that you can save the reference from the first link after it.
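Then, when you encounter the comment, switch the flag on. Note that handle_comment receives the text between <!-- and --> verbatim, including the surrounding spaces, so comparing it with "topOfPage" never matches.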
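A minimal sketch of the flag and the comment handler might look like this. The class name `CommentFlagParser` and the attribute name `passed_comment` are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 (as in the question) and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question


class CommentFlagParser(HTMLParser):
    """Records whether the marker comment has been seen (hypothetical name)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.passed_comment = False  # assumed flag name, off until the comment

    def handle_comment(self, data):
        # data is the raw text between "<!--" and "-->", here " /topOfPage ",
        # so strip the surrounding spaces before comparing.
        if data.strip() == "/topOfPage":
            self.passed_comment = True


fresh_flag = CommentFlagParser().passed_comment

parser = CommentFlagParser()
parser.feed("<ul><!-- /topOfPage --><tr>")
parser.close()
flag_after_comment = parser.passed_comment
```

Stripping the comment text is a design choice here; comparing against the exact string " /topOfPage " would work as well, but breaks as soon as the page changes its spacing.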
When you handle an opening tag, let it pass by if the parser has not yet reached the comment.
Conversely, when you handle the closing tag of that link, record that the mission has been accomplished, so that later links are ignored.
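The tag handling described above could be sketched roughly as follows. The names `FirstLinkAfterComment`, `passed_comment`, `done` and `url` are assumptions chosen for this example, and the comment check is included so that the snippet runs on its own.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question


class FirstLinkAfterComment(HTMLParser):
    """Keeps only the first <a href> that follows the marker comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.passed_comment = False  # assumed name: marker comment seen
        self.done = False            # assumed name: first link fully handled
        self.in_linktag = False
        self.url = None

    def handle_comment(self, data):
        if data.strip() == "/topOfPage":
            self.passed_comment = True

    def handle_starttag(self, tag, attrs):
        # Let every tag pass by until the comment is behind us, and again
        # once the first link has been dealt with.
        if not self.passed_comment or self.done:
            return
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.in_linktag = True
            self.url = href

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:
            self.in_linktag = False
            self.done = True  # mission accomplished: ignore later links


parser = FirstLinkAfterComment()
parser.feed('<a href="http://before">x</a><!-- /topOfPage -->'
            '<a href="http://test">y</a><a href="http://later">z</a>')
parser.close()
first_url = parser.url
```

Because `done` is only set on `</a>`, a link that is never closed (as in the question's test data) leaves the parser open to a second link overwriting the first. If that matters, an extra `self.url is None` check in `handle_starttag` would guard against it.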
Finally, when you append data, make sure that it’s not an empty string or one that contains only whitespace.
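Putting the pieces together, a complete parser might look like the sketch below. It keeps the question's `LinkFinder`, `in_linktag` and `url_cache` names; the flags `passed_comment` and `done` and the `SAMPLE` markup are assumptions made for this example, and the import fallback lets it run on Python 2 or 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question


class LinkFinder(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.passed_comment = False  # assumed name: marker comment seen
        self.done = False            # assumed name: first link fully handled
        self.in_linktag = False
        self.url_cache = []

    def handle_comment(self, data):
        if data.strip() == "/topOfPage":
            self.passed_comment = True

    def handle_starttag(self, tag, attrs):
        if not self.passed_comment or self.done:
            return
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.in_linktag = True
            self.url_cache.append([attrs["href"]])

    def handle_endtag(self, tag):
        if tag == "a" and self.in_linktag:
            self.in_linktag = False
            self.done = True

    def handle_data(self, data):
        # Only keep text inside the link, and skip chunks that are empty
        # or consist of whitespace only.
        if self.in_linktag and data.strip():
            self.url_cache[-1].append(data)


# Hypothetical sample page, not the question's exact test data.
SAMPLE = """
<html><body>
<a href="http://skipped">before the comment</a>
<!-- /topOfPage -->
<table><tr><td class="image">
<a href="http://test" rel="nofollow"> <b>Test link</b> </a>
</td></tr></table>
<a href="http://later">after the first link</a>
</body></html>
"""

lf = LinkFinder()
lf.feed(SAMPLE)
lf.close()
result = lf.url_cache
```

With this sample, `result` should hold a single entry containing the URL and the link text; the whitespace-only chunks on either side of `<b>` are dropped by the `data.strip()` check.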
And here you are.