I was profiling my Cuda 4 program and it turned out that at some

Asked: June 8, 2026 (2026-06-08T15:24:12+00:00)

I was profiling my Cuda 4 program and it turned out that at some stage the running process used over 80 GiB of virtual memory. That was a lot more than I would have expected.
After examining how the memory map evolved over time and correlating it with the line of code being executed, it turned out that the virtual memory usage jumped to over 80 GiB right after these simple instructions:

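  int deviceCount = 0;
  // First CUDA runtime call: this implicitly initializes the runtime.
  cudaError_t err = cudaGetDeviceCount(&deviceCount);
  if (err != cudaSuccess || deviceCount == 0) {
    // perror() would append an unrelated errno message, so print directly.
    fprintf(stderr, "No devices supporting CUDA\n");
  }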

Clearly, this is the first Cuda call, so this is where the runtime gets initialized. After this, the memory map looks like this (truncated):

Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000   89796   14716       0 r-x--  prg
0000000005db1000      12      12       8 rw---  prg
0000000005db4000      80      76      76 rw---    [ anon ]
0000000007343000   39192   37492   37492 rw---    [ anon ]
0000000200000000    4608       0       0 -----    [ anon ]
0000000200480000    1536    1536    1536 rw---    [ anon ]
0000000200600000 83879936       0       0 -----    [ anon ]

So there is now a huge memory area mapped into the virtual address space.
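
As a quick check on the listing above: the three anonymous mappings that start at 0x200000000 are adjacent (each one begins where the previous one ends), and their sizes can be added up. The sketch below does that arithmetic in C. The names sum_kib, cuda_region_kib and print_cuda_region_total are made up for illustration, and attributing these three mappings to the CUDA runtime is an assumption based only on the fact that they appear after initialization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Sum a list of mapping sizes given in KiB (the pmap "Kbytes" column). */
static unsigned long long sum_kib(const unsigned long long *kib, size_t n) {
    assert(kib != NULL);
    unsigned long long total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += kib[i];
    }
    return total;
}

/* Sizes of the three adjacent mappings that start at 0x200000000,
   copied from the listing above. The array name is hypothetical. */
static const unsigned long long cuda_region_kib[3] = { 4608, 1536, 83879936 };

/* Print the combined size of those mappings in KiB and in GiB. */
static void print_cuda_region_total(void) {
    unsigned long long total = sum_kib(cuda_region_kib, 3);
    printf("%llu KiB = %.2f GiB\n", total, (double)total / (1024.0 * 1024.0));
}
```

Calling print_cuda_region_total() prints "83886080 KiB = 80.00 GiB", so the range from 0x200000000 up to 0x1600000000 is exactly 80 GiB, of which only the 1536 KiB read-write part is resident.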

Okay, it's maybe not a big problem, since reserving/allocating memory in Linux doesn't do much unless you actually write to this memory. But it's really annoying since, for example, MPI jobs have to be specified with the maximum amount of vmem usable by the job. And 80 GiB is then just a lower bound for Cuda jobs; one has to add everything else on top of that.

I can imagine that it has to do with the so-called scratch space that Cuda maintains, a kind of memory pool for kernel code that can dynamically grow and shrink. But that's speculation. Also, scratch space is allocated in device memory.

Any insights?

1 Answer

Editorial Team · Answer 1 · 2026-06-08T15:24:13+00:00

This has nothing to do with scratch space. It is a result of the addressing system that enables unified addressing and peer-to-peer access between the host and multiple GPUs. The CUDA driver registers all GPU memory plus host memory in a single virtual address space using the kernel’s virtual memory system. It isn’t actually memory consumption per se; it is just a “trick” to map all the available address spaces into one linear virtual space for unified addressing.
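
To see why such a mapping costs no real memory, here is a minimal Linux-only sketch in plain C (no CUDA involved). It reserves address space with mmap and PROT_NONE, which is the kind of mapping pmap displays with mode "-----", and compares VmSize and VmRSS from /proc/self/status before and after. This only illustrates the general mechanism; how the CUDA driver actually creates its reservation is not stated here, and the helper names status_kib, reserve_address_space and report_reservation are invented for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Return a field such as "VmSize" or "VmRSS" from /proc/self/status in KiB,
   or -1 if it cannot be read (Linux only). */
static long status_kib(const char *key) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    long value = -1;
    size_t klen = strlen(key);
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1) value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}

/* Reserve address space only: PROT_NONE pages cannot be touched, so no
   physical memory is committed for them. Returns NULL on failure. */
static void *reserve_address_space(size_t bytes) {
    assert(bytes > 0);
    void *p = mmap(NULL, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: reserve `bytes`, print VmSize/VmRSS before and after,
   release the range, and return the VmSize growth in KiB (-1 on failure). */
static long report_reservation(size_t bytes) {
    long vm0 = status_kib("VmSize"), rss0 = status_kib("VmRSS");
    void *p = reserve_address_space(bytes);
    if (!p) return -1;
    long vm1 = status_kib("VmSize"), rss1 = status_kib("VmRSS");
    printf("VmSize: %ld -> %ld KiB, VmRSS: %ld -> %ld KiB\n",
           vm0, vm1, rss0, rss1);
    munmap(p, bytes);
    return vm1 - vm0;
}
```

Calling report_reservation((size_t)1 << 30) on a 64-bit Linux system should show VmSize growing by roughly 1 GiB while VmRSS stays essentially unchanged. That is the same pattern as the 83879936 KiB line in the question, which has RSS 0. A batch-system limit on virtual memory counts such a reservation in full even though nothing is resident, which is why the vmem limit for the job has to be set above it.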
