Here is the code in question (a very simple crawler). The input file is a list of URLs, usually more than 1000.
import sys, gevent
from gevent import monkey
# Patch the standard library early, before the socket-using modules are imported.
monkey.patch_all(thread=False)
from gevent.pool import Pool
import httplib, socket
from urlparse import urlparse
from time import time

pool = Pool(100)
count = 0
size = 0
failures = 0
global_timeout = 5
socket.setdefaulttimeout(global_timeout)

def process(ourl, mode='GET'):
    # Fetch one URL and update the shared counters.
    global size, failures, global_timeout, count
    try:
        url = urlparse(ourl)
        start = time()
        conn = httplib.HTTPConnection(url.netloc, timeout=global_timeout)
        conn.request(mode, ourl)
        res = conn.getresponse()
        req = res.read()
        end = time()
        nbytes = len(req)  # renamed from "bytes" to avoid shadowing the builtin
        took = end - start
        print mode, ourl, nbytes, took
        size = size + nbytes
        count += 1
    except Exception, e:
        failures += 1

start = time()
gevent.core.dns_init()
print "spawning..."
for url in open('domains'):
    pool.spawn(process, url.rstrip())
print "done...joining..."
pool.join()
print "complete"
end = time()
took = end - start
rate = size / took
print "It took %.2f seconds to process %d urls." % (took, count)
print rate, " bytes/sec"
print rate/1024, " KB/sec"
print rate/1048576, " MB/sec"
print "--- summary ---"
print "total:", count, "failures:", failures
The speed varies widely when I alter the pool size:
pool = Pool(100)
I’ve been mulling over writing an algorithm to calculate the ideal pool size on the fly, but rather than jumping in, I’d like to know if there’s something I’ve overlooked.
Any parallel processing will be either CPU-bound or IO-bound. From the nature of your code, it looks like it will be IO-bound at smaller pool sizes. Specifically, it will be bound by the bandwidth of your interface and perhaps by the number of concurrently open sockets the system can sustain (I'm thinking of some versions of Windows here, where I have managed to run out of available sockets on more than one occasion). As you increase the pool size, the process may start tipping towards being CPU-bound (especially if you have more data processing that isn't shown here). To keep the pool size at the optimal value, you need to monitor all of these variables (number of open sockets, bandwidth utilization by your process, CPU utilization, etc.). You can either do this manually, by profiling these metrics while the crawler runs and adjusting the pool size as needed, or you can try automating it. Whether or not something like that is possible from within Python is a different matter.
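If you do want to try automating it, one possible starting point is a simple hill-climbing controller driven by the measured throughput alone. This is only a minimal sketch, not a tested tuning strategy: the class name PoolSizeTuner, its parameters and the step size are all hypothetical, and it ignores CPU and socket counts entirely. The idea is to keep changing the pool size in the same direction while the rate improves and to reverse direction when it drops.

```python
class PoolSizeTuner(object):
    """Hypothetical hill-climbing controller for a worker pool size.

    Keeps moving the size in the direction that last improved
    throughput and reverses direction when throughput drops.
    """

    def __init__(self, initial=100, step=10, minimum=1, maximum=1000):
        self.size = initial
        self.step = step
        self.minimum = minimum
        self.maximum = maximum
        self.direction = 1      # +1 means grow the pool, -1 means shrink it
        self.last_rate = None   # throughput seen at the previous size

    def update(self, rate):
        """Record the throughput (e.g. bytes/sec) measured at the current
        size and return the pool size to try next."""
        if self.last_rate is not None and rate < self.last_rate:
            # The last change made things worse, so turn around.
            self.direction = -self.direction
        self.last_rate = rate
        candidate = self.size + self.direction * self.step
        # Clamp to the allowed range.
        self.size = max(self.minimum, min(self.maximum, candidate))
        return self.size


if __name__ == "__main__":
    tuner = PoolSizeTuner(initial=100, step=10)
    # Made-up throughput readings, purely for illustration.
    for rate in (1000.0, 1200.0, 900.0, 1100.0):
        print(tuner.update(rate))
```

With the crawler above, you could process the URL list in batches, create a Pool(tuner.size) for each batch, time the batch, and feed the resulting bytes/sec into update(). Network throughput is noisy, so in practice you would probably want to average several batches before each adjustment.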