The Archive Base

Editorial Team
Asked: June 8, 2026

I am just about to embark on converting a program I wrote into CUDA

I am just about to embark on converting a program I wrote into CUDA to hopefully increase processing speed.

Now obviously my old program executes many functions one after the other, and I have separated these functions in my main program and call each one in order.

int main()
{
  // initialization of variables
  function1();
  function2();
  function3();
  // print result
  return 0;
}

These functions are inherently serial, as function2 depends on the results of function1.

Alright, so now I want to convert these functions into kernels, and run the tasks in the functions in parallel.

Is it as simple as rewriting each function in a parallel way, and then, in my main program, calling each kernel one after the other? Is this slower than it needs to be? For example, can I have my GPU directly execute the next parallel operation without going back to the CPU to launch the next kernel?

Obviously I will keep all run time variables on the GPU memory to limit the amount of data transfer going on, so should I even worry about the time it takes between kernel calls?
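
For reference, the pattern being asked about could be sketched roughly as follows. This is only an illustration: kernel1, kernel2 and kernel3 are hypothetical stand-ins for function1 to function3, and the arithmetic inside them is made up. Kernels launched into the same stream run in the order they were issued, so each one sees the previous one's output, and the data stays in GPU memory until the final copy back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for function1..function3 from the question.
__global__ void kernel1(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

__global__ void kernel2(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

__global__ void kernel3(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] - 3.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Two device buffers that the kernels ping-pong between.
    float* a = nullptr;
    float* b = nullptr;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMemset(a, 0, bytes);  // all-zero bytes are 0.0f

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // All three launches go to the default stream, so they execute in
    // issue order; no host-side synchronization is needed between them.
    kernel1<<<blocks, threads>>>(a, b, n);
    kernel2<<<blocks, threads>>>(b, a, n);
    kernel3<<<blocks, threads>>>(a, b, n);

    // This blocking copy waits for the kernels above to finish.
    float first = 0.0f;
    cudaMemcpy(&first, b, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first element: %f\n", first);  // computed as (0 + 1) * 2 - 3

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

The launches return to the host immediately, so the CPU can queue the next kernel while the previous one is still running.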

I hope this question is clear; if not, please ask me to elaborate.
Thanks.

And here is an extra question so that I can check my sanity. Ultimately this program’s input is a video file, and through the different functions, each frame will lead to a result. My plan is to grab multiple frames at a time (say 8 unique frames) and then divide the total number of blocks I have among these 8 frames, and then the multiple threads in the blocks will be doing even more parallel operations on the image data, such as vector addition, Fourier transforms, etc.

Is this the right way to approach the problem?
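
The batching plan described above could be sketched like this, with one grid dimension selecting the frame and the other covering the pixels of that frame. The kernel name scale_frames, the 640x480 frame size and the per-pixel operation are all assumptions made for illustration only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-pixel operation. blockIdx.y selects the frame,
// blockIdx.x and threadIdx.x select the pixel within that frame.
__global__ void scale_frames(float* frames, int pixels_per_frame, float gain) {
    int frame = blockIdx.y;
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel < pixels_per_frame)
        frames[frame * pixels_per_frame + pixel] *= gain;
}

int main() {
    const int num_frames = 8;                // frames processed per batch
    const int pixels_per_frame = 640 * 480;  // assumed frame size
    std::vector<float> host(num_frames * pixels_per_frame, 1.0f);
    const size_t bytes = host.size() * sizeof(float);

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // The grid is (blocks per frame) x (number of frames).
    const int threads = 256;
    dim3 grid((pixels_per_frame + threads - 1) / threads, num_frames);
    scale_frames<<<grid, threads>>>(dev, pixels_per_frame, 2.0f);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    printf("last pixel of last frame: %f\n", host.back());

    cudaFree(dev);
    return 0;
}
```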

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 4:00 am

    There are some cases where you can get programs to run at the full potential speed on the GPU with very little porting work from a plain CPU version, and this might be one of them.

    If it’s possible for you to have a function like this:

    void process_single_video_frame(void* part_of_frame)
    {
      // initialize variables
      ...
      intermediate_result_1 = function1(part_of_frame);
      intermediate_result_2 = function2(intermediate_result_1);
      intermediate_result_3 = function3(intermediate_result_2);
      store_results(intermediate_result_3);
    }
    

    and you can process many part_of_frames at the same time (say, a few thousand),

    and function1(), function2() and function3() go through pretty much the same code paths (that is, the program flow does not depend heavily on the contents of the frame),

    then local memory may do all the work for you. Local memory is per-thread memory that physically resides in global memory. It differs from ordinary global memory in a subtle yet profound way: it is interleaved so that adjacent threads access adjacent 32-bit words, which makes the accesses fully coalesced when all threads read the same index of their own local arrays.
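
    To make the interleaving concrete, here is a small illustrative model. It is not the actual hardware address calculation; the helper interleaved_offset is a hypothetical name, and the formula simply assumes that the arrays of num_threads threads are interleaved word by word.

    ```cuda
    #include <cstdio>

    // Illustrative model only: the 32-bit word offset of element k of
    // thread t's local array, when the arrays of num_threads threads are
    // interleaved word by word.
    __host__ __device__ inline int interleaved_offset(int t, int k, int num_threads) {
        return k * num_threads + t;
    }

    int main() {
        const int num_threads = 32;  // one warp
        // Every thread reads element 5 of its own array. In this model the
        // offsets are consecutive, which is the coalesced access pattern.
        for (int t = 0; t < 4; ++t)
            printf("thread %d, element 5 -> word %d\n",
                   t, interleaved_offset(t, 5, num_threads));
        return 0;
    }
    ```

    If the threads instead read different indices (for example, an index that depends on the frame contents), the offsets are no longer adjacent, which is why similar code paths across threads matter.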

    The flow of your program would be that you start out by copying part_of_frame to a local array and prepare other local arrays for intermediate results. You then pass pointers to the local arrays between the various functions in your code.

    Some pseudocode:

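    const int size_of_one_frame_part = 1000;
    
    // Declared before the kernel so that the call below compiles.
    __device__ void function1(int* dst, int* src);
    
    __global__ void my_kernel(int* all_parts_of_frames) {
        // One thread handles one frame part.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Per-thread arrays of this size are placed in local memory.
        int my_local_array[size_of_one_frame_part];
        // memcpy needs the byte count as its third argument.
        memcpy(my_local_array,
               all_parts_of_frames + i * size_of_one_frame_part,
               size_of_one_frame_part * sizeof(int));
        int local_intermediate_1[100];
        function1(local_intermediate_1, my_local_array);
        ...
    }
    
    __device__ void function1(int* dst, int* src) {
       ...
    }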
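
    For a version that compiles and runs end to end, the pseudocode could be fleshed out roughly as below. Everything not in the pseudocode is an assumption: the body of function1 (it sums runs of 10 values), the results array, the num_parts bound and the launch configuration are placeholders chosen only to make the sketch self-contained.

    ```cuda
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    const int size_of_one_frame_part = 1000;
    const int intermediate_size = 100;

    // Hypothetical stand-in for function1: sums each run of 10 input values.
    __device__ void function1(int* dst, const int* src) {
        for (int j = 0; j < intermediate_size; ++j) {
            int sum = 0;
            for (int k = 0; k < 10; ++k) sum += src[j * 10 + k];
            dst[j] = sum;
        }
    }

    __global__ void my_kernel(const int* all_parts_of_frames, int* results, int num_parts) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_parts) return;  // guard threads past the end of the input

        // Per-thread copy of this thread's frame part.
        int my_local_array[size_of_one_frame_part];
        memcpy(my_local_array,
               all_parts_of_frames + i * size_of_one_frame_part,
               size_of_one_frame_part * sizeof(int));

        int local_intermediate_1[intermediate_size];
        function1(local_intermediate_1, my_local_array);

        // Placeholder for the rest of the chain: reduce to a single value.
        int total = 0;
        for (int j = 0; j < intermediate_size; ++j) total += local_intermediate_1[j];
        results[i] = total;
    }

    int main() {
        const int num_parts = 4096;
        std::vector<int> host_in(num_parts * size_of_one_frame_part, 1);
        std::vector<int> host_out(num_parts, 0);

        int* dev_in = nullptr;
        int* dev_out = nullptr;
        cudaMalloc(&dev_in, host_in.size() * sizeof(int));
        cudaMalloc(&dev_out, host_out.size() * sizeof(int));
        cudaMemcpy(dev_in, host_in.data(), host_in.size() * sizeof(int),
                   cudaMemcpyHostToDevice);

        const int threads = 128;
        const int blocks = (num_parts + threads - 1) / threads;
        my_kernel<<<blocks, threads>>>(dev_in, dev_out, num_parts);

        cudaMemcpy(host_out.data(), dev_out, host_out.size() * sizeof(int),
                   cudaMemcpyDeviceToHost);
        printf("result[0] = %d\n", host_out[0]);  // sum of 1000 ones

        cudaFree(dev_in);
        cudaFree(dev_out);
        return 0;
    }
    ```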
    

    In summary, this approach may let you use your CPU functions pretty much unchanged, because the parallelism does not come from creating parallelized versions of your functions, but from running the entire chain of functions in parallel, once per thread. This in turn is made possible by the hardware support for interleaving the memory of local arrays.

    Notes:

    • The initial copy of the part_of_frame from global to local memory is not coalesced, but hopefully, you will have enough calculations to hide that.

    • On devices of compute capability <= 1.3, there is only 16 KiB of local memory available per thread, which may not be enough for your part_of_frame and the other intermediate data. On compute capability >= 2.0, this has been expanded to 512 KiB, which should be plenty.
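
    To see which of the two cases in the notes applies to a given card, the compute capability can be queried at run time. A minimal sketch, assuming the device of interest is device 0:

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        // Device index 0 is an assumption; adjust on multi-GPU systems.
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            printf("No usable CUDA device found\n");
            return 1;
        }
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

        // The larger per-thread local memory limit applies from 2.0 upward.
        if (prop.major >= 2)
            printf("Per-thread local memory limit: 512 KiB\n");
        else
            printf("Per-thread local memory limit: 16 KiB\n");
        return 0;
    }
    ```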
