My intention is to use n host threads to create n streams concurrently on

Question

Asked: May 23, 2026

My intention is to use n host threads to create n streams concurrently on an NVIDIA Tesla C2050. The kernel is a simple vector multiplication. I divide the data equally among the n streams, and each stream would have concurrent execution and data transfer going on.

The data is floating point. Sometimes the CPU and GPU sums are equal, and sometimes they are far apart. I guess this could be attributed to missing synchronization constructs in my code, but I don't think any synchronization between streams is necessary: I want every CPU thread to have its own stream to control, and I do not care about asynchronous data copy and kernel execution within a thread.

Following is the code each thread runs:

//every thread would run this method in conjunction

static CUT_THREADPROC solverThread(TGPUplan *plan)
{

    //Allocate memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );

    //Copy input data from CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    //to make cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );

    //launch
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");

    cudaStreamSynchronize(plan->stream);

    //Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    //to make the cudaMemcpyAsync blocking...               
    cudaStreamSynchronize(plan->stream);

    cutilSafeCall( cudaFree(plan->d_Data) );

    CUT_THREADEND;
}

And creation of multiple threads and calling the above function:

    for(i = 0; i < nkernels; i++)
            threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);

    printf("main(): waiting for GPU results...\n");
    cutWaitForThreads(threadID, nkernels);

I took this strategy from one of the CUDA SDK code samples. As I said, this code works sometimes and gives wayward results at other times. I need help fixing it.

1 Answer

Editorial Team · Answer 1 · 2026-05-23T23:23:18+00:00

My bet is that BLOCK_N, THREAD_N don't cover the exact size of the array you are passing. Please provide the code for initializing the streams and the size of those buffers.
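To illustrate the coverage concern: the call to launch_simpleKernel in the question passes BLOCK_N and THREAD_N but not dataN, so the kernel may not know where the buffer ends. The following is only a sketch under assumptions, not the asker's actual kernel. The names scaleKernel and launchScale and the factor parameter are hypothetical. It uses a grid-stride loop so that a fixed grid covers any element count, and the bound check keeps threads from touching memory past the end.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each of the n elements by `factor` in place.
// The grid-stride loop lets a fixed number of blocks and threads cover any n,
// and the condition i < n keeps every thread inside the buffer.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= factor;
}

// Hypothetical host wrapper: launches on the given stream and reports
// launch-time errors (for example an invalid configuration).
cudaError_t launchScale(float *d_data, int n, float factor,
                        int blocks, int threads, cudaStream_t stream)
{
    scaleKernel<<<blocks, threads, 0, stream>>>(d_data, n, factor);
    return cudaGetLastError();
}
```

With this shape, a launch such as launchScale(plan->d_Data, plan->dataN, 2.0f, BLOCK_N, THREAD_N, plan->stream) would process every element even when BLOCK_N * THREAD_N is smaller than dataN, and would not overrun the buffer when it is larger.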

As a side note, streams are useful for overlapping computation with memory transfers. Synchronizing the stream after each async call is not useful at all, since operations queued in the same stream already run in the order they were issued.
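As a rough sketch of that point, the copy in, the kernel, and the copy out can be queued on one stream back to back, with a single synchronization once all the work has been issued. The names doubleKernel and enqueueChunk and the 64 x 256 launch configuration are assumptions for illustration, not taken from the question. The sketch also assumes the host buffer is page-locked (for example allocated with cudaMallocHost), since transfers from pageable memory generally cannot overlap with kernel execution.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each of the n elements in place.
__global__ void doubleKernel(float *data, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= 2.0f;
}

// Hypothetical helper: queues copy-in, kernel, and copy-out on `stream`
// and returns without waiting. Work in one stream runs in issue order,
// so no synchronization is needed between the three steps.
cudaError_t enqueueChunk(float *h_data, float *d_data, int n, cudaStream_t stream)
{
    size_t bytes = (size_t)n * sizeof(float);

    cudaError_t err = cudaMemcpyAsync(d_data, h_data, bytes,
                                      cudaMemcpyHostToDevice, stream);
    if (err != cudaSuccess) return err;

    doubleKernel<<<64, 256, 0, stream>>>(d_data, n);  // assumed launch shape
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // The caller must synchronize the stream before reading h_data.
    return cudaMemcpyAsync(h_data, d_data, bytes,
                           cudaMemcpyDeviceToHost, stream);
}
```

A caller would invoke enqueueChunk once per stream and only then call cudaStreamSynchronize on each stream, which gives the hardware a chance to overlap one chunk's transfer with another chunk's kernel.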
