I have a large object on which I want to run a function that lends itself well to parallel processing. In this example, I have a large matrix and I want to compute all pairwise inner products between its column vectors.
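For reference, a minimal serial sketch of the computation in question. It assumes a NumPy array named mat and a hypothetical helper pairwise_inner; the matrix size is illustrative only and is not taken from the notebook.

```python
import itertools

import numpy as np


def pairwise_inner(mat):
    """Inner products of all pairs of distinct columns of mat.

    Hypothetical helper: returns a dict mapping (i, j), with i < j,
    to the inner product of columns i and j.
    """
    n_cols = mat.shape[1]
    return {
        (i, j): float(np.dot(mat[:, i], mat[:, j]))
        for i, j in itertools.combinations(range(n_cols), 2)
    }


rng = np.random.default_rng(0)
# Fortran (column-major) order keeps each column contiguous in memory,
# which is what makes the column slices above cheap to read.
mat = np.asfortranarray(rng.standard_normal((1000, 8)))
products = pairwise_inner(mat)
```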
Please take a look at the following IPython Notebook.
I realise that the @interactive decorator is not necessary in this context, and I tried removing the @require decorator, but the impact was negligible.
My question is: Is there any way available to improve the performance of the parallel machinery?
I don’t know the implementation details of the map methods. Could I avoid overhead by pushing the function that is executed in parallel to the engines in the view? I can’t imagine that it is sent with every argument, though.
Chunking the argument list myself and writing a function for remote use that works on that seems silly as well.
I also tried the notebook on a four-core machine; the results shown in the notebook are from a two-core machine.
The main performance issue here is that the Fortran-contiguous optimization you applied does not survive the network transfer, so mat on the engines is C-contiguous, not F-contiguous, after the push. You can see this with:
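A local sketch of such a check, using only NumPy: an array's flags attribute reports its memory layout. The helper name layout is hypothetical, and the byte round trip below is only a stand-in for a transfer that ships raw data plus shape; it is an assumption for illustration, not IPython's actual serialization code. On the engines, the same inspection would have to be run remotely against the pushed mat.

```python
import numpy as np


def layout(arr):
    """Describe arr's contiguity as 'C', 'F', 'both', or 'neither'."""
    c_contig = bool(arr.flags['C_CONTIGUOUS'])
    f_contig = bool(arr.flags['F_CONTIGUOUS'])
    if c_contig and f_contig:
        return 'both'
    if c_contig:
        return 'C'
    if f_contig:
        return 'F'
    return 'neither'


# Array as prepared on the client: column-major.
mat = np.asfortranarray(np.arange(12.0).reshape(3, 4))

# Stand-in for a transfer that keeps only the bytes, dtype and shape:
# the values survive, the column-major layout does not.
received = np.frombuffer(mat.tobytes(), dtype=mat.dtype).reshape(mat.shape)

before = layout(mat)
after = layout(received)
```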
Adding:
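One plausible form of that conversion, assuming the pushed array is named mat as above: numpy.asfortranarray returns a column-major copy when its input is not already F-contiguous. The sketch runs locally; on a cluster the same statement would need to be executed on each engine after the push.

```python
import numpy as np

# Stand-in for the array as it arrives on an engine: C-contiguous.
mat = np.ascontiguousarray(np.arange(12.0).reshape(3, 4))
assert mat.flags['C_CONTIGUOUS'] and not mat.flags['F_CONTIGUOUS']

# Restore the column-major layout; the values are unchanged,
# only the order in which they are stored in memory differs.
mat = np.asfortranarray(mat)
```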
Should get your performance back (as illustrated in my tweaked version of your notebook).
For diagnosing this issue, I did my best to isolate where the bottlenecks were. Useful for this were AsyncResult.serial_time and AsyncResult.wall_time. When the serial_time is long, that means the task is actually taking a long time on the engines, rather than spending lots of time in the IPython pipes. That led me to think that the task itself was slow on the engines, so I did the task remotely on one engine, and it was still slow (nothing parallel involved). Here’s a notebook tracking down the issue.
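The comparison described above can be sketched as a small helper. It assumes an object exposing serial_time and wall_time in seconds, as a completed AsyncResult does; the function name summarize_timing and the stub values are hypothetical and are not measurements from the notebook.

```python
from types import SimpleNamespace


def summarize_timing(ar):
    """Compare time spent computing on the engines with elapsed wall time."""
    serial, wall = ar.serial_time, ar.wall_time
    return {
        "serial_time": serial,  # summed task run time across engines
        "wall_time": wall,      # elapsed time from submission to last result
        # Engine compute time packed into each second of wall time.
        "ratio": serial / wall if wall > 0 else float("nan"),
    }


# Stub standing in for a finished AsyncResult (made-up numbers).
fake_result = SimpleNamespace(serial_time=8.0, wall_time=4.5)
summary = summarize_timing(fake_result)
```

A large serial_time relative to what the same work costs locally points at the task itself, as in this case, rather than at transport overhead.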
Side note:
The @interactive decorator is only necessary for functions that are not interactively defined (i.e. module functions, not functions defined in the notebook), so it’s redundant in your notebook.