I am trying to hunt down a bug in my program. It produces
[vaio:10404] Signal: Segmentation fault (11)
[vaio:10404] Signal code: Address not mapped (1)
[vaio:10404] Failing at address: 0x210000
[vaio:10405] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fa7857ffcb0]
[vaio:10405] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x14fe20) [0x7fa785580e20]
[vaio:10405] [ 2] /usr/lib/libcuda.so.1(+0x1b1f49) [0x7fa78676bf49]
0x210000 is an address living in GPU memory. I have no unified address space in place (due to card limitations: sm_12).
The problem occurs when running the program through mpiexec -n 2. That is, I start 2 CPU processes, both create their own context on the same (!) GPU (just 1 GPU installed). Like this:
cuInit(0);
cuDeviceGet(&cuDevice, 0); // get device number 0
cuCtxCreate(&cuContext, 0, cuDevice); // create context
cuMemAlloc( &mem , size ); // allocate memory pool (hundreds of MB)
(This is simplified here – all cuda calls are checked for error returns)
I have been using this setup successfully for quite some time now (multiple host processes sharing the same GPU). However, that was with the CUDA runtime API. Now I am switching to the driver API.
To hunt this down, I dump the value of mem from the code above. It turns out that the memory address returned by cuMemAlloc is the same for both processes! This surprises me. Here is the dump:
process 0: allocate internal buffer: unaligned ptr = 0x210000 aligned ptr = 0x210000
process 1: allocate internal buffer: unaligned ptr = 0x210000 aligned ptr = 0x210000
I immediately checked my old code (runtime API), which never had such problems. It turns out that it also reports the same memory address on both processes (sharing the same device). I had simply never noticed this before.
I find this hard to believe. Just to make this clear: both processes request device memory with cudaMalloc (runtime API) or cuMemAlloc (driver API) (400MB on a big card), and both processes get the same address returned?
I am sure this is not a case of one process running far ahead, freeing its memory and terminating, so that the other process can then reuse the same memory. No, this memory is used as a pool that serves subsequent memory allocations, and there are a lot of synchronization points in between. The pool is freed at program termination.
Does anybody know how the same pointer value (memory address) can be reported to both processes while referring to different memory on the GPU? (It must, since the calculations carried out are correct, and that wouldn't be the case if the memory pools overlapped.) To my knowledge this is not possible. What am I missing?
You are missing the concept of virtual addressing.
Each of your processes creates its own CUcontext, and each CUcontext has its own GPU virtual address space. The addresses returned by the API are virtual, and the host driver and/or the device translates them into physical addresses in GPU memory. There is plenty of evidence that the GPU has a dedicated TLB on board for precisely this purpose.
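If it helps, the same effect can be seen on the host side without any GPU involved. The following is only an analogy, a minimal sketch assuming a POSIX system: it uses fork() instead of CUDA contexts, and the function name same_address_different_memory is made up for this illustration. Parent and child see the same virtual address for a variable, yet a write in one process does not show up in the other, because each process has its own mapping to physical memory.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* What the child reports back to the parent through a pipe. */
struct report {
    uintptr_t addr;  /* virtual address of `value` as seen by the child */
    int value;       /* contents of `value` in the child */
};

/*
 * Hypothetical helper (host-side analogy only, not CUDA code).
 * Returns 1 if parent and child see the same virtual address for
 * `value` while holding different contents, 0 if not, -1 on error.
 */
int same_address_different_memory(void)
{
    static int value = 0;
    int fds[2];
    struct report child = {0, 0};

    if (pipe(fds) != 0)
        return -1;

    value = 2222;               /* parent's contents before the fork */

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: overwrite `value` in its own address space and report. */
        close(fds[0]);
        value = 1111;
        struct report r = { (uintptr_t)&value, value };
        ssize_t written = write(fds[1], &r, sizeof r);
        close(fds[1]);
        _exit(written == (ssize_t)sizeof r ? 0 : 1);
    }

    /* Parent: collect the child's report, then wait for it to exit. */
    close(fds[1]);
    ssize_t n = read(fds[0], &child, sizeof child);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    if (n != (ssize_t)sizeof child)
        return -1;

    printf("parent: ptr = %p value = %d\n", (void *)&value, value);
    printf("child:  ptr = %p value = %d\n", (void *)child.addr, child.value);

    /* Same pointer value, different contents: different backing memory. */
    return child.addr == (uintptr_t)&value
        && child.value == 1111
        && value == 2222;
}
```

Calling same_address_different_memory() from main should print the same pointer for both processes with different values. Two CUDA contexts both reporting 0x210000 is the GPU-side counterpart of this.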