CUDA programming guide states that Memory allocated via malloc() can be copied using the

Question

Asked: June 10, 2026 (2026-06-10T20:57:21+00:00)

The CUDA programming guide states that “Memory allocated via malloc() can be copied using the runtime (i.e., by calling any of the copy memory functions from Device Memory)”, but I’m having trouble reproducing this behavior. Code:

#include <cstdio>
__device__ int* p;

__global__ void allocate_p() {
  p = (int*) malloc(10 * sizeof(int));  // 10 ints (40 bytes), matching the copy size below
  printf("p = %p  (seen by GPU)\n", p);
}

int main() {
  cudaError_t err;
  int* localp = (int*) malloc(10 * sizeof(int));  // host source buffer, 40 bytes

  allocate_p<<<1,1>>>();
  cudaDeviceSynchronize();

  //Getting pointer to device-allocated memory
  int* tmpp = NULL;
  cudaMemcpyFromSymbol(&tmpp, p, sizeof(int*));  // copy the full pointer, not just 4 bytes
  printf("p = %p  (seen by CPU)\n", tmpp);

  //cudaMalloc((void**)&tmpp, 40);
  err = cudaMemcpy(tmpp, localp, 40, cudaMemcpyHostToDevice);
  cudaDeviceSynchronize();
  printf(" err:%i %s", (int)err, cudaGetErrorString(err));

  free(localp);  // allocated with malloc(), so release with free()
  return 0;
}

This fails with the following output:

p = 0x601f920  (seen by GPU)
p = 0x601f920  (seen by CPU)
 err:11 invalid argument

I gather that the host sees the correct device address, but cudaMemcpy() rejects it because it came from malloc().

If I instead allocate on the host with cudaMalloc((void**)&np, 40); and pass the pointer np as an argument to the kernel allocate_p, where it is assigned to p (instead of calling malloc()), the code runs fine.

What am I doing wrong / how do we use malloc() allocated device-memory in host-side functions?

1 Answer

Editorial Team · Answer 1 · 2026-06-10T20:57:22+00:00

As far as I am aware, it isn’t possible to copy runtime heap memory using the host API functions. It certainly was not possible in CUDA 4.x, and the CUDA 5.0 release candidate has not changed this. The only workaround I can offer is to use a kernel to “gather” the final results and write them into a device transfer buffer or zero-copy memory, which can then be accessed via the API or directly from the host. You can see an example of this approach in this answer and another question, where Mark Harris from NVIDIA confirmed that this is a limitation of the (then) current implementation in the CUDA runtime.
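
For illustration, here is a minimal sketch of the “gather” workaround, not a definitive implementation. The kernel names (allocate_and_fill, gather, release_p), the staging buffer, and the element count N are my own assumptions and do not come from the question. One kernel allocates and fills device-heap memory, a second kernel copies it into a buffer obtained from cudaMalloc(), and only that buffer is passed to cudaMemcpy() on the host.

```cuda
#include <cstdio>
#include <cstdlib>

#define N 10  // hypothetical element count for this sketch

__device__ int* p;  // will point into the device runtime heap

// Allocate from the device heap and fill it with sample values.
__global__ void allocate_and_fill() {
  p = (int*) malloc(N * sizeof(int));
  if (p == NULL) return;  // device heap exhausted
  for (int i = 0; i < N; ++i) p[i] = i * i;
}

// Gather: copy device-heap data into a cudaMalloc'd staging buffer,
// which the host API is able to read.
__global__ void gather(int* staging) {
  if (p == NULL) return;
  for (int i = 0; i < N; ++i) staging[i] = p[i];
}

// Device-heap memory must be released with device-side free().
__global__ void release_p() {
  free(p);
  p = NULL;
}

int main() {
  int host[N] = {0};
  int* staging = NULL;

  cudaError_t err = cudaMalloc((void**)&staging, N * sizeof(int));
  if (err != cudaSuccess) {
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  allocate_and_fill<<<1,1>>>();
  gather<<<1,1>>>(staging);
  release_p<<<1,1>>>();
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("kernel error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // The staging buffer came from cudaMalloc, so this copy is valid.
  err = cudaMemcpy(host, staging, N * sizeof(int), cudaMemcpyDeviceToHost);
  printf("err:%i %s\n", (int)err, cudaGetErrorString(err));
  for (int i = 0; i < N; ++i) printf("host[%d] = %d\n", i, host[i]);

  cudaFree(staging);
  return err == cudaSuccess ? 0 : 1;
}
```

The same gather kernel could presumably write into mapped (zero-copy) host memory instead, for example a buffer from cudaHostAlloc() with the cudaHostAllocMapped flag whose device pointer is obtained via cudaHostGetDevicePointer(), which would avoid the final cudaMemcpy().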
