I’m reading Dive into Python, especially trying to understand the examples, and I had some questions about list-urls.py.
The last line iterates over the list of URLs in “parser.urls”. Where is this data coming from? I don’t see a urls method in URLLister or SGMLParser.
Also, a method start_a is defined but never called. What is it for?
The full code is at http://pastebin.com/EbB4micK; a condensed version is below.
#!/usr/bin/python
"""Extract list of URLs in a web page"""
from sgmllib import SGMLParser
import sys
import urllib  # needed for urllib.urlopen below

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    link = sys.argv[1]
    # (the full version wraps the following in a try block)
    usock = urllib.urlopen(link)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url
urls is an attribute, not a method: it is bound in reset() and mutated in start_a().
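To make that concrete, here is a minimal sketch of an attribute being bound and then mutated inside methods. BaseParser and Lister are hypothetical stand-ins, not sgmllib classes, and add() is an invented helper; the only behaviour borrowed from SGMLParser is that its __init__ calls reset(), which is why the attribute already exists once the object is constructed.

```python
class BaseParser(object):
    """Hypothetical stand-in for SGMLParser: __init__ calls reset()."""
    def __init__(self):
        self.reset()

    def reset(self):
        pass


class Lister(BaseParser):
    def reset(self):
        BaseParser.reset(self)
        self.urls = []           # binds the attribute on the instance

    def add(self, url):          # hypothetical helper, for illustration only
        self.urls.append(url)    # mutates the same list


lister = Lister()                # reset() has already run here
lister.add("http://example.com/")
print(lister.urls)
```

So nothing named urls has to be declared on the class itself; assigning to self.urls in any method is enough to create it.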
…
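start_a() is part of the protocol of SGMLParser, of which URLLister is a descendant. SGMLParser calls it itself whenever it encounters an opening <a> tag, so your own code never calls it directly.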
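Roughly, the parser looks up a method named after the tag it has just seen and calls it if one exists. The sketch below is a heavily simplified illustration of that idea, not sgmllib's actual code: TinyParser and handle_start_tag are hypothetical names, and the real parser also does the HTML tokenizing that is skipped here.

```python
class TinyParser(object):
    """Hypothetical, simplified stand-in for SGMLParser's tag dispatch."""
    def __init__(self):
        self.reset()

    def reset(self):
        pass

    def handle_start_tag(self, tag, attrs):
        # Look for a method named start_<tag>; ignore the tag if none exists.
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)


class URLLister(TinyParser):
    def reset(self):
        TinyParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called by the parser, not by user code, for each <a> tag.
        self.urls.extend(v for k, v in attrs if k == "href")


parser = URLLister()
parser.handle_start_tag("a", [("href", "index.html"), ("class", "nav")])
parser.handle_start_tag("b", [])   # no start_b method, so nothing happens
print(parser.urls)
```

This is why defining start_a is enough: the method name itself is what hooks it into the parser.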