Instead of just using urllib does anyone know of the most efficient package for

Asked: May 12, 2026 (2026-05-12T20:31:34+00:00)

Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few, such as Twisted, Scrapy, and libcurl, but I don't know enough about them to make a decision, or even whether they can use proxies. Does anyone know which one is best for my purposes? Thanks!

1 Answer

Editorial Team · Answer 1 · 2026-05-12T20:31:35+00:00

It's simple to implement this in Python.

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.
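
If you would rather not rely on environment variables, the proxy can also be passed explicitly. Below is a minimal Python 3 sketch using the standard library's urllib.request.ProxyHandler. The helper name make_proxy_opener and the proxy address http://proxy.example.com:3128 are placeholders I made up for illustration, not anything from the original answer.

```python
import urllib.request


def make_proxy_opener(proxy_url=None):
    """Build an opener that sends http/https requests through a proxy.

    With no argument, fall back to whatever proxies urllib detects from
    the environment (http_proxy, https_proxy, and so on).
    """
    if proxy_url is None:
        proxies = urllib.request.getproxies()
    else:
        proxies = {"http": proxy_url, "https": proxy_url}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    return opener, proxies


# Placeholder proxy address (an assumption); replace it with a real one.
opener, proxies = make_proxy_opener("http://proxy.example.com:3128")
print(proxies)
```

Calling opener.open(url) on the returned opener should then route the request through the given proxy, while urlopen() itself keeps using the environment settings.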

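# -*- coding: utf-8 -*-
# Python 3 port of the original Python 2 script. Assumes the bs4 package
# (BeautifulSoup 4) is installed in place of the old BeautifulSoup module.

import sys
from queue import Queue
from threading import Lock, Thread
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

visited = set()
visited_lock = Lock()  # guards the check-then-add on `visited`
queue = Queue()

def get_parser(host, root, charset):
    prefix = 'http://%s%s' % (host, root)

    def parse():
        # Block on the queue instead of quitting when it is momentarily
        # empty; otherwise idle workers exit before the first page is parsed.
        while True:
            url = queue.get()
            try:
                try:
                    content = urlopen(url).read().decode(charset)
                except (OSError, UnicodeDecodeError):
                    # Skip pages that fail to download or decode.
                    continue
                for link in BeautifulSoup(content, 'html.parser').find_all('a'):
                    href = link.get('href')
                    if href is None:
                        continue
                    # Resolve relative links against the current page.
                    href = urljoin(url, href)
                    # Stay inside the requested section of the site.
                    if not href.startswith(prefix):
                        continue
                    with visited_lock:
                        if href in visited:
                            continue
                        visited.add(href)
                    queue.put(href)
                    print(href)
            finally:
                # Always mark the URL as done so queue.join() can return.
                queue.task_done()

    return parse

if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    start = 'http://%s%s' % (host, root)
    visited.add(start)
    queue.put(start)
    for i in range(5):
        # Daemon threads let the process exit once the queue is drained.
        Thread(target=parser, daemon=True).start()
    queue.join()  # returns once every queued URL has been processed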
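
If you only need to download a known list of URLs rather than crawl a site, a thread pool from the standard library may be simpler than managing the queue and workers by hand. This is a rough Python 3 sketch using concurrent.futures, which is a different technique from the script above. The names fetch and download_all are hypothetical, and the fetch_one parameter exists only so the download step can be swapped out.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its raw bytes."""
    # urlopen picks up http_proxy/https_proxy from the environment.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=5):
    """Fetch every URL concurrently.

    Returns two dicts keyed by URL: downloaded bodies and raised exceptions.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors
```

Calling download_all(list_of_urls) returns the page bodies and any failures separately, so a single bad URL does not stop the rest of the batch.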
