We had a web services server running on Python 3.2 (Fedora Core 14, 64-bit) but were forced to back-port to Python 2.6.7 because of a new dependency that did not support 3.2. A section of the code that used concurrent.futures has been rewritten to use multiprocessing.Pool to run a couple of critical sections in parallel. The code now looks like this:
import multiprocessing

def _run_threads(callable_obj, args, threads):
    pool = multiprocessing.Pool(processes=threads)
    process_list = [pool.apply_async(callable_obj, a) for a in args]
    pool.close()
    pool.join()
    return [x.get() for x in process_list]
Apologies for the confusing abuse of the name “threads.” These are processes.
Since implementing this function, we find that it sometimes hangs. We get a garbled traceback when we eventually kill the parent (master) process, but a couple of lines seem critical:
[snip]
Process PoolWorker-445:
[snip]
File "/usr/lib64/python2.6/multiprocessing/pool.py", line 59, in worker
task = get()
File "/usr/lib64/python2.6/multiprocessing/queues.py", line 352, in get
return recv()
racquire()
[snip]
It seems to me, from the available evidence, that a child process in the pool is failing to receive the “close” signal from the parent process, so it sits waiting for work. The parent sits waiting for the child to shut down. The server hangs. This happens nondeterministically, but too frequently for such a critical server (about once a day).
Is there a problem in the coding of the _run_threads() function? Is this a known problem with a known work-around? Obviously, we are using this for time-critical processing, so we prefer not to recode for sequential execution unless absolutely necessary. And one of the reasons for sticking to multiprocessing.Pool is the easy access to return codes for the operations run in parallel.
Thanks
I am not sure where this issue originates. It’s definitely very interesting. However, maybe a little restructuring solves the problem. I think you do not need to terminate the pool processes before having collected your results, right? Maybe sticking to the ‘canonical’ way of using a Pool, as documented, helps: collect the results with get() first, and only afterwards close and join the pool.
In your case, call [x.get() for x in process_list] before closing/joining the pool. If the problem persists and occurs during get(), we at least know it has nothing to do with close().