I’m researching web crawlers made in Python, and I’ve stumbled across a pretty simple

Asked: June 9, 2026

I’m researching web crawlers made in Python, and I’ve stumbled across a pretty simple one. But, I don’t understand the last few lines, highlighted in the following code:

import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []

keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)

    a = (links.pop(0) for _ in range(len(links)))  # What does this do?

    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link

        if link not in crawled:
            tocrawl.append(link)

That line looks like some kind of a list comprehension to me, but I’m not sure and I need an explanation.

1 Answer

Editorial Team · Answer 1 · 2026-06-09T03:02:54+00:00

It’s a generator expression, not a list comprehension (note the parentheses instead of square brackets), and it simply empties the list links as you iterate over it.
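To see the emptying happen, here is a small sketch that isolates that one line. The helper name drain_with_generator and the sample values are made up for illustration and are not part of the crawler:

```python
def drain_with_generator(links):
    """Iterate over the same kind of generator the crawler builds.

    Returns the items in the order they were yielded, plus whatever is
    left in the list afterwards.
    """
    # range(len(links)) is evaluated once, when the generator is created.
    # No pop() runs until the for loop starts pulling values.
    a = (links.pop(0) for _ in range(len(links)))

    seen = []
    for link in a:
        seen.append(link)  # each step removes one item from the front
    return seen, links


seen, remaining = drain_with_generator(['/a', '/b', '/c'])
print(seen)       # items come out in their original order
print(remaining)  # the original list has been emptied
```

The list is consumed one item per loop iteration, so by the time the loop finishes there is nothing left in it.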

They could have replaced this part

a = (links.pop(0) for _ in range(len(links)))  # What does this do?

for link in a:

With this:

while links:
    link = links.pop(0)

And it would work the same. But links.pop(0) has to shift every remaining element forward on each call, while popping from the end of a list takes constant time, so this would be better than either:

links.reverse()
while links:
    link = links.pop()

Of course, if you’re fine with following the links in reverse order (I don’t see why they need to be processed in order), it would be even more efficient to not reverse the links list and just pop off the end.
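As a rough check of the three variants discussed above, the sketch below runs each one on a copy of the same sample list. The function names and sample links are hypothetical, chosen only for this comparison:

```python
def drain_front(links):
    """Pop from the front: keeps the original order, but every pop(0)
    shifts the remaining elements."""
    links = list(links)  # work on a copy so the caller's list is untouched
    out = []
    while links:
        out.append(links.pop(0))
    return out


def drain_reversed(links):
    """Reverse once, then pop from the end: same order as drain_front."""
    links = list(links)
    links.reverse()
    out = []
    while links:
        out.append(links.pop())
    return out


def drain_back(links):
    """Pop from the end without reversing: links come out in reverse order."""
    links = list(links)
    out = []
    while links:
        out.append(links.pop())
    return out


sample = ['/a', '/b', '/c']
print(drain_front(sample))
print(drain_reversed(sample))
print(drain_back(sample))
```

The first two produce the links in the same order; the third produces them reversed, which only matters if the crawl order matters to you.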
