Here is the code in question (a very simple crawler), the file is a

Asked: June 9, 2026 (2026-06-09T15:40:07+00:00)

Here is the code in question (a very simple crawler). The input file is a list of URLs, usually more than 1000.

import gevent
from gevent import monkey
# Patch the standard library before importing modules that use sockets.
monkey.patch_all(thread=False)

from gevent.pool import Pool
import httplib, socket
from urlparse import urlparse
from time import time

pool = Pool(100)

count = 0
size = 0
failures = 0

global_timeout = 5
socket.setdefaulttimeout(global_timeout)

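def process(ourl, mode = 'GET'):
    global size, failures, global_timeout, count
    conn = None
    try:
        url = urlparse(ourl)
        start = time()
        conn = httplib.HTTPConnection(url.netloc, timeout = global_timeout)
        conn.request(mode, ourl)
        res = conn.getresponse()
        body = res.read()
        end = time()
        nbytes = len(body)  # avoid shadowing the built-in name "bytes"
        took = end - start
        print mode, ourl, nbytes, took
        size = size + nbytes
        count += 1
    except Exception, e:
        # Count every failed fetch (timeouts, DNS errors, bad responses).
        failures += 1
    finally:
        # Always release the socket so open connections do not pile up.
        if conn is not None:
            conn.close()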

start = time()

gevent.core.dns_init()
print "spawning..."
for url in open('domains'):
    pool.spawn(process, url.rstrip())
print "done...joining..."
pool.join()
print "complete"

end = time()
took = end - start
rate = size / took
print "It took %.2f seconds to process %d urls." % (took, count)
print rate, " bytes/sec"
print rate/1024, " KB/sec"
print rate/1048576, " MB/sec"

print "--- summary ---"
print "total:", count, "failures:", failures

I get widely varying speeds when I alter the pool size:

pool = Pool(100)

I’ve been mulling over writing an algorithm to calculate the ideal pool size on the fly, but rather than jumping in I’d like to know if there’s something I’ve overlooked.

1 Answer

Editorial Team · Answer 1 · 2026-06-09T15:40:09+00:00

Any parallel processing will be either CPU-bound or IO-bound. From the nature of your code, it looks like it will be IO-bound at smaller pool sizes. Specifically, it will be bound by the bandwidth of your network interface and perhaps by the number of concurrently open sockets the system can sustain (I'm thinking of some versions of Windows here, where I have managed to run out of available sockets on more than one occasion). As you increase the pool size, the process may start tipping towards being CPU-bound, especially if you do more data processing than is shown here.

To keep the pool size at the optimal value you need to monitor all of these variables (number of open sockets, bandwidth utilization by your process, CPU utilization, etc.). You can either do this manually, by watching the metrics while the crawler runs and adjusting the pool size as needed, or you can try automating it. Whether or not something like that is possible from within Python is a different matter.
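If you do want to try the automated route, here is a rough sketch of one way to go about it: a simple hill-climbing controller that watches only throughput and nudges the pool size up or down after each measurement window. This is not from your code or from gevent; the names PoolSizeTuner and simulated_rate, and all the default numbers, are made up for illustration. simulated_rate is a stand-in for a real measurement and just pretends throughput peaks at 200 concurrent fetches.

```python
class PoolSizeTuner(object):
    """Hypothetical hill-climbing controller for a worker pool size."""

    def __init__(self, size=100, step=20, min_size=10, max_size=1000,
                 tolerance=0.05):
        self.size = size            # pool size currently in use
        self.step = step            # how far to move per adjustment
        self.min_size = min_size    # never go below this many workers
        self.max_size = max_size    # never exceed this many workers
        self.tolerance = tolerance  # gain needed to keep the same direction
        self.direction = 1          # +1 means grow, -1 means shrink
        self.last_rate = None       # throughput seen in the previous window

    def update(self, rate):
        """Take the throughput (bytes/sec) measured at the current size
        and return the next pool size to try."""
        no_gain = (self.last_rate is not None and
                   rate < self.last_rate * (1 + self.tolerance))
        if no_gain:
            # The last move did not pay off, so head the other way.
            self.direction = -self.direction
        self.last_rate = rate
        candidate = self.size + self.direction * self.step
        # Clamp to the allowed range.
        self.size = max(self.min_size, min(self.max_size, candidate))
        return self.size


def simulated_rate(pool_size):
    """Assumed throughput curve for the demo only: rises linearly up to
    200 workers, then falls off as contention sets in."""
    return 1000.0 * min(pool_size, 200) - 500.0 * max(0, pool_size - 200)
```

In the crawler this would probably mean fetching the URL list in batches, timing each batch, feeding size / took into update(), and creating the next batch's Pool with the returned size. That wiring is left out here. With a noisy real network the controller will keep oscillating around the best value rather than settle, so treat the result as a ballpark figure. Watching CPU usage and open sockets alongside throughput would still be needed to tell which limit you are actually hitting.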
