These are some general questions I am facing while designing the error handling for an algorithm that is supposed to run in parallel using MPI (in C++):
- Do Exceptions work inside code that is executed in parallel? Is the behaviour defined?
- How do they work? Does that differ for different implementations?
- Is it good practice – or should I use return codes?
Exceptions work the same in MPI code as in serial code, but you have to be extremely careful if an exception might not be raised on all processes in a communicator. Otherwise you can easily end up with deadlock: the ranks that did not throw wait in a collective call that the throwing rank never reaches.
All error handling methods have this problem: it is difficult to recover from errors that do not occur consistently across a communicator. In the case above, you could perform an MPI_Allreduce so that all ranks choose the same branch.
My preference is for calling error handlers and propagating them up the stack, since this tends to give me the most useful/verbose error message and it's easy to catch with a breakpoint (or the error handler can attach a debugger to itself and send it to your workstation in an xterm).
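
As a rough illustration of the MPI_Allreduce idea, here is a minimal sketch, not a definitive implementation. The names run_collectively, OrReduce and any_rank_failed_mpi are hypothetical helpers made up for this example. The reduction is passed in as a callable so the agreement logic can be tried without MPI; the MPI-backed version is only compiled when mpi.h is available.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical callable: takes this rank's failure flag (0 or 1) and returns
// the logical OR of the flags of all ranks in the communicator.
using OrReduce = std::function<int(int)>;

// Runs local_step on this rank and catches any exception locally. All ranks
// then agree on whether any rank failed; if so, every rank throws, so all
// ranks take the same branch and no rank is left waiting in a later collective.
void run_collectively(const std::function<void()>& local_step,
                      const OrReduce& any_rank_failed) {
    int local_failed = 0;
    std::string what;
    try {
        local_step();
    } catch (const std::exception& e) {
        local_failed = 1;
        what = e.what();
    }
    // Collective step: every rank reaches this line, whether it threw or not.
    const int global_failed = any_rank_failed(local_failed);
    if (global_failed) {
        throw std::runtime_error(local_failed
                                     ? "local failure: " + what
                                     : std::string("failure on another rank"));
    }
}

#if __has_include(<mpi.h>)
#include <mpi.h>
// MPI-backed reduction: logical OR of the failure flags over comm.
inline int any_rank_failed_mpi(int local_failed, MPI_Comm comm) {
    int global_failed = 0;
    MPI_Allreduce(&local_failed, &global_failed, 1, MPI_INT, MPI_LOR, comm);
    return global_failed;
}
#endif
```

With MPI, the second argument could be a lambda such as [](int f) { return any_rank_failed_mpi(f, MPI_COMM_WORLD); }. The cost of this design is one extra collective per guarded step, so it probably makes sense only around coarse-grained phases rather than inner loops.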