Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6172409
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 23, 20262026-05-23T23:23:18+00:00 2026-05-23T23:23:18+00:00

My intention is to use n host threads to create n streams concurrently on

  • 0

My intention is to use n host threads to create n streams concurrently on a NVidia Tesla C2050. The kernel is a simple vector multiplication…I am dividing the data equally amongst n streams, and each stream would have concurrent execution/data transfer going on.

The data is floating point, I am sometimes getting CPU/GPU sums as equal, and sometimes they are wide apart…I guess this could be attributed to loss of synchronization constructs on my code, for my case, but also I don’t think any synch constructs between streams is necessary, because I want every CPU to have a unique stream to control, and I do not care about asynchronous data copy and kernel execution within a thread.

Following is the code each thread runs:

//every thread would run this method in conjunction

static CUT_THREADPROC solverThread(TGPUplan *plan)
{

    //Allocate memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );

    //Copy input data from CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    //to make cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );

    //launch
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");

    cudaStreamSynchronize(plan->stream);

    //Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    //to make the cudaMemcpyAsync blocking...               
    cudaStreamSynchronize(plan->stream);

    cutilSafeCall( cudaFree(plan->d_Data) );

    CUT_THREADEND;
}

And creation of multiple threads and calling the above function:

    for(i = 0; i < nkernels; i++)
            threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);

    printf("main(): waiting for GPU results...\n");
    cutWaitForThreads(threadID, nkernels);

I took this strategy from one of the CUDA Code SDK samples. As I’ve said before, this code work sometimes, and other time it gives wayward results. I need help with fixing this code…

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-23T23:23:18+00:00Added an answer on May 23, 2026 at 11:23 pm

    My bet is that

    BLOCK_N, THREAD_N

    , don’t cover the exact size of the array you are passing.
    Please provide the code for initializing the streams and the size of those buffers.

    As a side note, Streams are useful for overlapping computation with memory transfer. Synching the stream after each async call is not useful at all.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

My intention was to use pyGTK's main loop to create a function that blocks
I use the following code with the intention to create an image that's half
Basically my intention is to create web application. I have to use js, ajax,
Should I learn to use Emacs with no intention to learn Lisp, if my
My intention is to use assertArrayEquals(int[], int[]) JUnit method described in the API for
I'm having trouble with a custom control template. My intention is to use a
I'm trying to use WebRequest.RegisterPrefix to register a decorator IWebRequestCreate implementation with the intention
The idea is to use nodejs instead of comet for longpolling. The intention is
My main intention is just use some functions in the ann library in Pylab
Intention In my Django template, I create a link for each letter in the

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.