I have something very similar to the code: int k, no_streams = 4; cudaStream_t

Question

Asked: May 22, 2026 (2026-05-22T15:21:26+00:00)

I have code very similar to the following:

int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

‘h_ptr_in’ and ‘h_ptr_out’ are arrays of pointers allocated with cudaMallocHost (with no flags).

The problem is that the streams do not overlap.
In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.

I may not have the resources to run 2 kernels concurrently (I think I do), but at least the kernel execution and the copies should be overlapping, right?
And if I put all 3 (copy H2D, kernel execution, copy D2H) within the same for-loop none of them overlap…

Please HELP, what can be causing this?

I’m running on:

Ubuntu 10.04 x64

Device: “GeForce GTX 460”
(CUDA Driver Version: 3.20,
CUDA Runtime Version: 3.20,
CUDA Capability Major/Minor version number: 2.1,
Concurrent copy and execution: Yes,
Concurrent kernel execution: Yes)

1 Answer

Editorial Team · Answer 1 · 2026-05-22T15:21:27+00:00

According to a post on the NVIDIA forums, the profiler serializes stream execution to get accurate timing data, so the overlap shown in the visual profiler may not reflect what happens in an unprofiled run. If you think your timings are off, make sure you’re using CUDA events…
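As a rough illustration of timing with CUDA events instead of the profiler, here is a minimal sketch. It is not the asker's code: the kernel scaleKernel, the buffer size n, and the CHECK macro are hypothetical stand-ins, and it assumes the default (legacy) behaviour of stream 0, where an event recorded in stream 0 waits for earlier work in the other streams.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s (line %d)\n",            \
                    cudaGetErrorString(err_), __LINE__);             \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical stand-in for mykernel: doubles each element.
__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main(void)
{
    const int nStreams = 4;
    const int n = 1 << 20;                  // elements per stream (assumed)
    const size_t bytes = n * sizeof(float);

    // Pinned host memory is required for cudaMemcpyAsync to overlap.
    float *hIn, *hOut, *dIn, *dOut;
    CHECK(cudaMallocHost((void **)&hIn,  bytes * nStreams));
    CHECK(cudaMallocHost((void **)&hOut, bytes * nStreams));
    CHECK(cudaMalloc((void **)&dIn,  bytes * nStreams));
    CHECK(cudaMalloc((void **)&dOut, bytes * nStreams));
    for (int i = 0; i < n * nStreams; ++i) hIn[i] = 1.0f;

    cudaStream_t stream[nStreams];
    for (int k = 0; k < nStreams; ++k) CHECK(cudaStreamCreate(&stream[k]));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Events in stream 0 bracket all work issued to the other streams
    // (assumes the legacy default-stream semantics).
    CHECK(cudaEventRecord(start, 0));
    for (int k = 0; k < nStreams; ++k)
        CHECK(cudaMemcpyAsync(dIn + k * n, hIn + k * n, bytes,
                              cudaMemcpyHostToDevice, stream[k]));
    for (int k = 0; k < nStreams; ++k)
        scaleKernel<<<grid, block, 0, stream[k]>>>(dIn + k * n, dOut + k * n, n);
    CHECK(cudaGetLastError());
    for (int k = 0; k < nStreams; ++k)
        CHECK(cudaMemcpyAsync(hOut + k * n, dOut + k * n, bytes,
                              cudaMemcpyDeviceToHost, stream[k]));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));      // block the host until 'stop' fires

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("H2D + kernel + D2H over %d streams: %.3f ms\n", nStreams, ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    for (int k = 0; k < nStreams; ++k) CHECK(cudaStreamDestroy(stream[k]));
    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    CHECK(cudaFreeHost(hIn));
    CHECK(cudaFreeHost(hOut));
    return 0;
}
```

One way to use this is to run it outside the profiler and compare the elapsed time against the same work issued to a single stream. If the multi-stream version is clearly faster, some overlap is probably happening even when the profiler timeline shows none.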

I’ve been experimenting with streaming lately, and I found the “simpleMultiCopy” example from the SDK to be really helpful, particularly with the appropriate logic and synchronizations.
