I have a set of command-line tools that I’d like to run in parallel on a series of files. I’ve written a Python function to wrap them that looks something like this:
import os
import shlex
import subprocess

def process_file(fn):
    print(os.getpid())
    cmd1 = "echo " + fn
    p = subprocess.Popen(shlex.split(cmd1))
    # after cmd1 finishes
    other_python_function_to_do_something_to_file(fn)
    cmd2 = "echo " + fn
    p = subprocess.Popen(shlex.split(cmd2))
    print("finish")

if __name__ == "__main__":
    import multiprocessing
    p = multiprocessing.Pool()
    for fn in files:
        RETURN = p.apply_async(process_file, args=(fn,), kwds={some_kwds})
While this works, it does not seem to be running multiple processes; it seems like it’s just running in serial (I’ve tried using Pool(5) with the same result). What am I missing? Are the calls to Popen “blocking”?
EDIT: Clarified a little. I need cmd1, then some Python command, then cmd2, to execute in sequence on each file.
EDIT2: The output from the above has the pattern:
pid
finish
pid
finish
pid
finish
whereas a similar call, using map in place of apply (but without any provision for passing kwds), looks more like:
pid
pid
pid
finish
finish
finish
However, the map call sometimes (always?) hangs after apparently succeeding.
Are the calls to Popen “blocking”?

No. Just creating a subprocess.Popen returns immediately, giving you an object that you could wait on or otherwise use. If you want to block, that’s simple: call subprocess.check_call instead, or call wait() on the Popen object.

Meanwhile, I’m not sure why you’re putting your args together into a string and then trying to shlex them back into a list. Why not just write the list, e.g. ["echo", fn]?

You also say it seems to be running in serial. What makes you think this? Given that each process just kicks off two processes into the background as fast as possible, it’s going to be pretty hard to tell whether they’re running in parallel.
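A minimal sketch of the blocking and non-blocking calls discussed above, assuming a POSIX system where echo exists as an executable. The names run_blocking and run_nonblocking are hypothetical and used only for illustration. It contrasts check_call, which waits for the command, with Popen, which returns immediately and has to be waited on explicitly:

```python
import subprocess

def run_blocking(fn):
    # Pass the arguments as a list; no string building or shlex.split needed.
    cmd1 = ["echo", fn]
    # check_call waits for the command to finish and raises
    # CalledProcessError if it exits with a non-zero status.
    return subprocess.check_call(cmd1)

def run_nonblocking(fn):
    # Popen returns right away, while the child may still be running.
    p = subprocess.Popen(["echo", fn])
    # wait() blocks until the child exits and returns its exit status.
    return p.wait()
```

Either form guarantees that the command has finished before the next step runs; the original code did neither, so the Python step could start while cmd1 was still running.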
If you want to verify that you’re getting work from multiple processes, you may want to add some prints or logging (and throw something like os.getpid() into the messages).

Meanwhile, it looks like you’re trying to exactly duplicate the effects of multiprocessing.Pool.map_async out of a loop around multiprocessing.Pool.apply_async, except that instead of accumulating the results you’re stashing each one in a variable called RETURN and then throwing it away before you can use it. Why not just use map_async?

Finally, you asked whether multiprocessing is the right tool for the job. Well, you clearly need something asynchronous: check_call(args(file1)) has to block other_python_function_to_do_something_to_file(file1), but at the same time not block check_call(args(file2)).

I would probably have used threading, but really, it doesn’t make much difference. Even if you’re on a platform where process startup is expensive, you’re already paying that cost because the whole point is running a bunch of N * M child processes, so another pool of 8 isn’t going to hurt anything. And there’s little risk of either accidentally creating races by sharing data between threads, or accidentally writing code that looks like it shares data between processes but doesn’t, since there’s nothing to share. So, whichever one you like more, go for it.

The other alternative would be to write an event loop. Which I might actually start doing myself for this problem, but I’d regret it, and you shouldn’t do it…
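Putting the pieces together, here is a hedged sketch of the thread-based variant, not a definitive implementation. It assumes a POSIX echo, uses multiprocessing.pool.ThreadPool (which offers the same map_async call as multiprocessing.Pool), and uses functools.partial as one way to pass keyword arguments through map_async. The names touch_up, label, and run_all are hypothetical stand-ins:

```python
import subprocess
import threading
from functools import partial
from multiprocessing.pool import ThreadPool

def touch_up(fn):
    # Hypothetical stand-in for other_python_function_to_do_something_to_file.
    return fn.upper()

def process_file(fn, label="job"):
    # label is a hypothetical keyword argument, here only to show partial().
    # With threads, log the thread name; with multiprocessing.Pool,
    # os.getpid() would be the thing to log instead.
    who = threading.current_thread().name
    print("%s start %s on %s" % (label, fn, who))
    subprocess.check_call(["echo", fn])  # cmd1: blocks until it finishes
    result = touch_up(fn)                # Python step runs only after cmd1
    subprocess.check_call(["echo", fn])  # cmd2: blocks until it finishes
    print("%s finish %s on %s" % (label, fn, who))
    return result

def run_all(files, workers=4):
    # Each worker runs one file's cmd1 -> Python -> cmd2 chain in order,
    # while different files proceed concurrently.
    worker = partial(process_file, label="demo")
    with ThreadPool(workers) as pool:
        async_result = pool.map_async(worker, files)
        # Collect the results instead of discarding them.
        return async_result.get()
```

Calling run_all(["a.txt", "b.txt", "c.txt"]) returns the per-file results in input order. Swapping ThreadPool for multiprocessing.Pool should work the same way, provided the call sits under an if __name__ == "__main__" guard.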