I’m trying to find some memory errors in a program of mine using Electric Fence. My program uses Open MPI, and when I run it under Electric Fence it segfaults with the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
2001 ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: No such file or directory.
__memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2001
(gdb) bt
#0 __memcpy_ssse3_back ()
at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:2001
#1 0x00007ffff72d6b7f in ompi_ddt_copy_content_same_ddt ()
from /usr/lib/libmpi.so.0
#2 0x00007ffff72d4d0d in ompi_ddt_sndrcv () from /usr/lib/libmpi.so.0
#3 0x00007ffff72dd5b3 in PMPI_Allgather () from /usr/lib/libmpi.so.0
#4 0x00000000004394f1 in ppl::gvec<unsigned int>::gvec (this=0x7fffffffdd60,
length=1) at qppl/gvec.h:32
#5 0x0000000000434a35 in TreeBuilder::TreeBuilder (this=0x7fffffffdc60,
octree=..., mygbodytab=..., mylbodytab=..., cellpool=0x7fffef705fc8,
leafpool=0x7fffef707fc8, bodypool=0x7fffef6bdfc0) at treebuild.cxx:93
#6 0x000000000042fb6b in BarnesHut::BuildOctree (this=0x7fffffffde50)
at barnes.cxx:155
#7 0x000000000042af52 in BarnesHut::Run (this=0x7fffffffde50)
at barnes.cxx:386
#8 0x000000000042b164 in main (argc=1, argv=0x7fffffffe118) at barnes.cxx:435
The relevant portion of my code is:
me = spr_locale_id();
world_size = spr_num_locales();
my_elements = std::shared_ptr<T>(new T[1]);
world_element_pointers = std::shared_ptr<T*>(new T*[world_size]);
MPI_Allgather(my_elements.get(), sizeof(T*), MPI_BYTE,
world_element_pointers.get(), sizeof(T*), MPI_BYTE,
MPI_COMM_WORLD);
I’m not sure why __memcpy_ssse3_back is causing a segfault. This part of the program doesn’t segfault when I run it without Electric Fence. Does anyone know what’s going on? I’m using Open MPI version 1.4.3.
There are two possible reasons for the error:
There is a bug in the data copy routines, present in older Open MPI versions, that appears to have been fixed in version 1.4.4. If this is the case, an upgrade of the Open MPI library to a newer version would solve the problem.
Another possible reason is that my_elements is an array with a single item of type T. In the MPI_Allgather call you pass a pointer to this element, but you specify sizeof(T*) as the number of bytes to be sent. By default, Electric Fence places newly allocated memory at the end of a memory page and puts an inaccessible memory page immediately after it. If T happens to be shorter than a pointer type (e.g. T is int and you are running on a 64-bit LP64 platform; in your backtrace T is unsigned int on x86_64), the copy reads past the end of the allocation into the inaccessible page, hence the segfault. Without Electric Fence the same over-read usually lands in mapped memory and goes unnoticed. Since your intention is actually to send the pointer itself, you should store the value returned by my_elements.get() in a variable and pass MPI_Allgather the address of that variable instead.
By the way, passing raw pointers between processes is not a nice thing to do. MPI provides its own portable remote memory access (RMA) mechanism. See the One-sided Communications chapter of the MPI standard. It is a bit cumbersome, but it should at least be portable.
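To make the second reason concrete, here is a minimal sketch in plain C++ that needs no MPI installation. The helper names overread_bytes and gather_pointers are hypothetical, and std::memcpy only stands in for the local copy that MPI_Allgather performs. It shows how many bytes are read past a one-element T buffer, and that copying from the address of the pointer variable transfers the pointer value without touching the pointee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Bytes read past the end of a one-element T buffer when sizeof(T*) bytes
// are requested from it, as in the question's MPI_Allgather call.
template <typename T>
constexpr std::size_t overread_bytes() {
    return sizeof(T*) > sizeof(T) ? sizeof(T*) - sizeof(T) : 0;
}

// Hypothetical stand-in for the gather: every "rank" contributes sizeof(T*)
// bytes taken from the address of its pointer variable, not from the pointee.
// With real MPI the corresponding call would look like:
//   T* mine = my_elements.get();
//   MPI_Allgather(&mine, sizeof(T*), MPI_BYTE,
//                 world_element_pointers.get(), sizeof(T*), MPI_BYTE,
//                 MPI_COMM_WORLD);
template <typename T>
std::vector<T*> gather_pointers(const std::vector<T*>& local_pointers) {
    std::vector<T*> gathered(local_pointers.size());
    for (std::size_t rank = 0; rank < local_pointers.size(); ++rank) {
        T* mine = local_pointers[rank];  // plays the role of my_elements.get()
        assert(mine != nullptr);         // each rank is expected to own an allocation
        // The send buffer is &mine, which is exactly sizeof(T*) bytes long,
        // so nothing is read beyond the end of any allocation.
        std::memcpy(&gathered[rank], &mine, sizeof(T*));
    }
    return gathered;
}
```

On an LP64 platform overread_bytes<unsigned int>() evaluates to 4, which are the bytes that fall into the guard page Electric Fence places after the allocation. Calling gather_pointers with the addresses of two local variables returns those same addresses unchanged.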