I’m researching web crawlers made in Python, and I’ve stumbled across a pretty simple one. But, I don’t understand the last few lines, highlighted in the following code:
import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)

    a = (links.pop(0) for _ in range(len(links)))  # What does this do?

    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link

        if link not in crawled:
            tocrawl.append(link)
That line looks like some kind of a list comprehension to me, but I’m not sure and I need an explanation.
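It’s a generator expression, and it simply empties the list `links` as you iterate over it.

They could have replaced the assignment to `a` and the `for link in a:` header with a `while links:` loop whose first statement is `link = links.pop(0)`, and it would work the same. But since popping from the end of a list is more efficient than popping from the front, it would be better than either version to call `links.reverse()` once and then use `link = links.pop()` inside the `while links:` loop.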
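As a minimal sketch of the three variants discussed here, the functions below drain a list of links in each style and return the order in which items were visited. The function names and the `seen` list are made up for illustration and are not part of the original crawler; the loop bodies just record each link instead of rewriting it.

```python
def drain_with_generator(links):
    """Mimic the crawler: a generator expression that pops from the front."""
    seen = []
    # Creating the generator pops nothing; items are removed lazily,
    # one per iteration of the for loop below.
    a = (links.pop(0) for _ in range(len(links)))
    for link in a:
        seen.append(link)
    return seen


def drain_with_while(links):
    """Equivalent loop without the generator expression."""
    seen = []
    while links:
        link = links.pop(0)
        seen.append(link)
    return seen


def drain_reversed(links):
    """Same visiting order, but reverse once and pop from the end."""
    seen = []
    links.reverse()
    while links:
        link = links.pop()
        seen.append(link)
    return seen
```

All three visit the links in their original order and leave the list they were given empty, which is the only side effect the generator expression has in the crawler.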
Of course, if you’re fine with following the links in reverse order (I don’t see why they need to be processed in order), it would be even more efficient to not reverse the `links` list and just pop off the end.
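To see the difference for yourself, here is a rough sketch that drains a list from the front and from the back and times both with `timeit`. The helper names, the list size `n`, and the use of integers as stand-in links are assumptions for illustration only; actual timings depend on your machine and on how many links a page contains.

```python
import timeit


def drain_front(links):
    """Pop from the front: each pop(0) shifts every remaining item."""
    order = []
    while links:
        order.append(links.pop(0))
    return order


def drain_back(links):
    """Pop from the end: no shifting, but items come out in reverse order."""
    order = []
    while links:
        order.append(links.pop())
    return order


def compare(n=20000, repeat=3):
    """Return (front_seconds, back_seconds) for draining n dummy links."""
    front = min(timeit.repeat(lambda: drain_front(list(range(n))),
                              number=1, repeat=repeat))
    back = min(timeit.repeat(lambda: drain_back(list(range(n))),
                             number=1, repeat=repeat))
    return front, back
```

Calling `compare()` returns the best time for each approach. For the handful of links on a typical page the gap is unlikely to matter, so this is mostly about not doing needless work.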