I write MPI programs with Python using the mpi4py library. The Python generates and calls low-level C code.
I just got a seg-fault and need to track it down.
If I were writing in straight C I would compile and run my program in gdb, find the offending line of code, and start debugging from there.
If I were writing in Python I would still use gdb (you can run gdb --args python filename.py and gdb will point you to the offending line of C).
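As a minimal sketch of that single-process workflow, the helper below only builds the gdb command line; it does not run it. The name `gdb_command` is made up for illustration, and `filename.py` is the placeholder script from above. The gdb options used (`-batch`, `-ex`, `--args`) are standard ones.

```python
import sys


def gdb_command(script, script_args=(), python=sys.executable):
    """Build an argv list that runs `python script` under gdb in batch mode.

    `gdb_command` is a hypothetical helper name, not part of any library.
    """
    return [
        "gdb",
        "-batch",      # leave gdb once the commands below have finished
        "-ex", "run",  # start the program immediately
        "-ex", "bt",   # print the C-level backtrace after the program stops
        "--args", python, script,
    ] + list(script_args)


if __name__ == "__main__":
    # Print the command rather than executing it, so the sketch has no side effects.
    print(" ".join(gdb_command("filename.py")))
```

Passing the returned list to `subprocess.call` would run the script under gdb and print a backtrace if it crashes.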
But alas, I’m using MPI, which is using Python, which is using C. I get a stack trace like this:
mrocklin@baconost:~/workspace/ape$ mpiexec -np 3 -hostfile tmp/hostfile -rankfile tmp/rankfile python ape/codegen/run.py tmp/
[baconost:27370] *** Process received signal ***
[baconost:27370] Signal: Segmentation fault (11)
[baconost:27370] Signal code: Address not mapped (1)
[baconost:27370] Failing at address: 0x8
[baconost:27370] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfc60) [0x7fcd529c5c60]
[baconost:27370] [ 1] /home/mrocklin/Software/openmpi/lib/libmpi.so.0(MPI_Comm_get_errhandler+0xa0) [0x7fcd46fae240]
[baconost:27370] [ 2] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/python2.7/site-packages/mpi4py/MPI.so(+0x2ddbc) [0x7fcd47234dbc]
[baconost:27370] [ 3] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x1f2e) [0x7fcd4724f95e]
[baconost:27370] [ 4] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0xc2) [0x7fcd52cc2e02]
[baconost:27370] [ 5] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xecd90) [0x7fcd52cc0d90]
[baconost:27370] [ 6] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed031) [0x7fcd52cc1031]
[baconost:27370] [ 7] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x2be) [0x7fcd52cc206e]
[baconost:27370] [ 8] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xd3fed) [0x7fcd52ca7fed]
[baconost:27370] [ 9] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyObject_Call+0x68) [0x7fcd52c19d18]
[baconost:27370] [10] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x56) [0x7fcd52ca8516]
[baconost:27370] [11] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x28b8) [0x7fcd52caba78]
[baconost:27370] [12] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8d2) [0x7fcd52cb0722]
[baconost:27370] [13] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7fcd52cb0772]
[baconost:27370] [14] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xc2) [0x7fcd52cbf6e2]
[baconost:27370] [15] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xebcae) [0x7fcd52cbfcae]
[baconost:27370] [16] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xecd90) [0x7fcd52cc0d90]
[baconost:27370] [17] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed264) [0x7fcd52cc1264]
[baconost:27370] [18] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ImportModuleLevel+0x118) [0x7fcd52cc1ec8]
[baconost:27370] [19] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xd3fed) [0x7fcd52ca7fed]
[baconost:27370] [20] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyObject_Call+0x68) [0x7fcd52c19d18]
[baconost:27370] [21] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x56) [0x7fcd52ca8516]
[baconost:27370] [22] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x28b8) [0x7fcd52caba78]
[baconost:27370] [23] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8d2) [0x7fcd52cb0722]
[baconost:27370] [24] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32) [0x7fcd52cb0772]
[baconost:27370] [25] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyImport_ExecCodeModuleEx+0xc2) [0x7fcd52cbf6e2]
[baconost:27370] [26] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xebcae) [0x7fcd52cbfcae]
[baconost:27370] [27] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed58d) [0x7fcd52cc158d]
[baconost:27370] [28] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xecd90) [0x7fcd52cc0d90]
[baconost:27370] [29] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xed264) [0x7fcd52cc1264]
[baconost:27370] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 27370 on node baconost.cs.uchicago.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
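A trace like the one above is easier to read once each frame is reduced to its library and symbol. The following is a rough sketch, assuming frames keep the `[host:pid] [ n] path(symbol+offset) [address]` layout shown here; the names `FRAME_RE` and `parse_frames` are invented for this example, and the sample is the first five frames copied from the trace.

```python
import os
import re

# Matches frames such as:
# [baconost:27370] [ 1] /path/libmpi.so.0(MPI_Comm_get_errhandler+0xa0) [0x7fcd46fae240]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                   # frame number, e.g. "[ 1]"
    r"(?P<path>\S+?)"                             # path of the shared library
    r"\((?P<symbol>[^+)]*)\+0x[0-9a-fA-F]+\)"     # symbol (may be empty) plus offset
)


def parse_frames(trace):
    """Return (frame index, library file name, symbol or None) for each frame line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((
                int(match.group("index")),
                os.path.basename(match.group("path")),
                match.group("symbol") or None,  # stripped symbols show up as None
            ))
    return frames


SAMPLE = """\
[baconost:27370] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfc60) [0x7fcd529c5c60]
[baconost:27370] [ 1] /home/mrocklin/Software/openmpi/lib/libmpi.so.0(MPI_Comm_get_errhandler+0xa0) [0x7fcd46fae240]
[baconost:27370] [ 2] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/python2.7/site-packages/mpi4py/MPI.so(+0x2ddbc) [0x7fcd47234dbc]
[baconost:27370] [ 3] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/python2.7/site-packages/mpi4py/MPI.so(initMPI+0x1f2e) [0x7fcd4724f95e]
[baconost:27370] [ 4] /home/mrocklin/Software/epd-7.2-1-rh5-x86_64/lib/libpython2.7.so.1.0(_PyImport_LoadDynamicModule+0xc2) [0x7fcd52cc2e02]
"""

if __name__ == "__main__":
    for index, library, symbol in parse_frames(SAMPLE):
        print(index, library, symbol)
```

Read this way, the innermost frames sit in `libmpi.so.0` (`MPI_Comm_get_errhandler`) and in mpi4py's `MPI.so` (`initMPI`), reached through `_PyImport_LoadDynamicModule`, which suggests the crash happens while `mpi4py.MPI` is being imported.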
Two questions:
- In the general case, what is the best way to debug programs with this software stack after a bug has been detected? (I’d like to avoid the “you should prevent bugs with testing instead of diagnosing them” comments. I agree with this, but sometimes segfaults happen despite our best efforts.)
- In this specific case, is this pointing me to an MPI/mpi4py issue? The last few lines suggest that maybe this is the case.
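For the general case in the first question, one common approach is to make a single rank announce its host and PID and then pause, so that gdb can be attached to it with `gdb -p <pid>`. The sketch below is only an illustration under several assumptions: it reads the rank from the `OMPI_COMM_WORLD_RANK` environment variable that Open MPI sets for launched processes, it uses a flag file (a hypothetical convention chosen here) to release the paused rank, and it relies on `faulthandler`, which is in the standard library only from Python 3.3 onward. All function names are invented for this example.

```python
import os
import socket
import sys
import time


def current_rank(environ=os.environ):
    """Return this process's rank from Open MPI's environment, or None if unset."""
    value = environ.get("OMPI_COMM_WORLD_RANK")
    return int(value) if value is not None else None


def enable_python_traceback():
    """Dump the Python-level traceback on SIGSEGV when faulthandler is available."""
    try:
        import faulthandler  # standard library on Python 3.3+
    except ImportError:
        return False
    faulthandler.enable()
    return True


def wait_for_debugger(target_rank, flag_path, poll=1.0, timeout=None,
                      environ=os.environ):
    """Pause `target_rank` until `flag_path` exists or `timeout` seconds have passed.

    Returns False immediately on every other rank.
    """
    if current_rank(environ) != target_rank:
        return False
    sys.stderr.write("rank %d on %s with pid %d is waiting for a debugger\n"
                     % (target_rank, socket.gethostname(), os.getpid()))
    sys.stderr.flush()
    waited = 0.0
    while not os.path.exists(flag_path):
        if timeout is not None and waited >= timeout:
            break
        time.sleep(poll)
        waited += poll
    return True
```

Both calls would go at the very top of the script, before `from mpi4py import MPI`, since the trace above shows the crash during that import. After attaching gdb to the printed PID and typing `continue`, creating the flag file lets the rank proceed into the crash with the debugger watching.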
The cause of this specific error was that older versions of OpenMPI don’t handle rankfiles very well. This bug was reported and appears to be fixed as of 1.6.2. Apparently this part of the code is rarely used and untested.
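To check whether an installation predates that fix, something like the following might help. It is a sketch, assuming the 1.6.2 threshold quoted above is accurate (it is taken from this post, not verified independently) and that mpi4py's `MPI.get_vendor()` reports the implementation as `("Open MPI", (major, minor, micro))`. The names `RANKFILE_FIX`, `has_rankfile_fix` and `check_installed_mpi` are made up for illustration.

```python
# Threshold taken from the text above; treat it as an assumption.
RANKFILE_FIX = (1, 6, 2)


def has_rankfile_fix(vendor):
    """Return True/False for Open MPI versions, or None for other implementations.

    `vendor` is a (name, (major, minor, micro)) pair, the shape returned by
    mpi4py's MPI.get_vendor().
    """
    name, version = vendor
    if name != "Open MPI":
        return None  # the reported bug concerns Open MPI only
    return tuple(version) >= RANKFILE_FIX


def check_installed_mpi():
    """Query the MPI library that mpi4py was built against."""
    # Imported lazily: importing mpi4py.MPI initializes MPI, which is exactly
    # where the trace above crashed when a rankfile was in use.
    from mpi4py import MPI
    return has_rankfile_fix(MPI.get_vendor())
```

Because the import itself was the crash site, `check_installed_mpi()` is best run in a plain launch without `-rankfile`.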