
The Archive Base


Editorial Team
Asked: May 21, 2026

What is the advised way of dealing with dynamically-sized datasets in CUDA?


Is it a case of ‘set the block and grid sizes based on the problem set’ or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?

I can see how this probably matters a lot for the block dimensions, but how much does it matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks are assigned to SMs that have a set number of SPs, and so can handle a particular warp size).

I’ve perused Kirk’s ‘Programming Massively Parallel Processors’ but it doesn’t really touch on this area.


1 Answer

    Editorial Team
    Added an answer on May 21, 2026 at 7:48 pm

    It's usually a case of setting the block size for optimal performance, and the grid size according to the total amount of work. Most kernels have a “sweet spot” number of warps per multiprocessor (MP) where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel, because problem sizes are rarely round multiples of block sizes.
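
    As a rough illustration of why the over-spill logic is needed, here is a minimal host-side sketch in plain C++ (it should also compile inside a .cu file). The helper names `blocksFor` and `spillThreads` and the block size of 128 used in the note below are assumptions for illustration, not part of the original answer. It rounds the block count up and counts the threads in the last block that fall past the end of the data, which a kernel would have to mask out with an `if (i < n)` style guard.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n items with
// blockSize threads per block (integer ceiling division).
int blocksFor(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: threads launched beyond the end of the data.
// These are the "over-spill" threads a kernel must skip with a bounds check.
int spillThreads(int n, int blockSize)
{
    return blocksFor(n, blockSize) * blockSize - n;
}
```

    With a block size of 128, a problem of 1000 elements needs 8 blocks and leaves 24 threads with nothing to do, while 129 elements need 2 blocks and leave 127 idle threads.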

    EDIT:
    To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal-type operation, done as part of a Cholesky factorization of packed symmetric band matrices):

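    // Fused square root and dscal operation: every element of a[0..n-1]
    // is multiplied by 1/sqrt(a[0]), so a[0] itself becomes sqrt(a[0]).
    __global__ 
    void cdivkernel(const int n, double *a)
    {
        // Reciprocal square root of the diagonal element, shared by the block
        __shared__ double oneondiagv;
    
        // Grid-stride indexing: first element and step for this thread
        int imin = threadIdx.x + blockDim.x * blockIdx.x;
        int istride = blockDim.x * gridDim.x;
    
        // One thread per block loads the scale factor.
        // Caution: the thread with i == 0 overwrites a[0] in the loop below,
        // so with more than one block this read may race with that write.
        if (threadIdx.x == 0) {
            oneondiagv = rsqrt( a[0] );
        }
        __syncthreads();
    
        // Each thread handles elements imin, imin + istride, ...; the i < n
        // test absorbs the over-spill when n does not fill the grid exactly.
        for(int i=imin; i<n; i+=istride) {
            a[i] *= oneondiagv;
        }
    }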
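
    To see that the indexing pattern used in this kernel covers the data for any grid size, here is a small host-side emulation in plain C++. It is not device code, and the function names `visitCounts` and `coversExactlyOnce` are made up for this sketch. It replays the kernel's start index and stride for every (block, thread) pair and counts how often each element is touched.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation (not device code) of the kernel's indexing:
// each (block, thread) pair starts at thread + blockDim * block and
// advances by blockDim * gridDim until it passes n.
// Returns how many times each element index was visited.
std::vector<int> visitCounts(int n, int gridDim, int blockDim)
{
    assert(n >= 0 && gridDim > 0 && blockDim > 0);
    std::vector<int> visits(n, 0);
    const int stride = blockDim * gridDim;
    for (int block = 0; block < gridDim; ++block) {
        for (int thread = 0; thread < blockDim; ++thread) {
            for (int i = thread + blockDim * block; i < n; i += stride) {
                ++visits[i];
            }
        }
    }
    return visits;
}

// True when every element was handled by exactly one emulated thread.
bool coversExactlyOnce(int n, int gridDim, int blockDim)
{
    for (int count : visitCounts(n, gridDim, blockDim)) {
        if (count != 1) return false;
    }
    return true;
}
```

    Calling `coversExactlyOnce(100000, 112, 128)` or `coversExactlyOnce(10, 1, 32)` returns true in both cases: a grid smaller than the problem simply iterates, and a grid larger than the problem leaves some threads idle.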
    

    To launch this kernel, the execution parameters are calculated as follows:

    1. We allow up to 4 warps per block (so 128 threads). Normally you would fix this at an optimal number, but in this case the kernel is often called on very small vectors, so having a variable block size made some sense.
    2. We then compute the block count according to the total amount of work, up to 112 total blocks, which is the equivalent of 8 blocks per MP on a 14 MP Fermi Tesla. The kernel will iterate if the amount of work exceeds the grid size.

    The resulting wrapper function, containing the execution parameter calculations and the kernel launch, looks like this:

    // Fuse the diagonal element square root and dscal operation into
    // a single "cdiv" operation
    void fusedDscal(const int n, double *a)
    {
        // The semibandwidth (column length) determines
        // how many warps are required per column of the 
        // matrix.
        const int warpSize = 32;
        const int maxGridSize = 112; // this is 8 blocks per MP for a Tesla C2050
    
        // Number of warps needed to cover n elements, rounded up
        int warpCount = (n / warpSize) + (((n % warpSize) == 0) ? 0 : 1);
        int warpPerBlock = max(1, min(4, warpCount));
    
        // For the cdiv kernel, the block size is allowed to grow to
        // four warps per block, and the block count becomes the warp count over four
        // or the GPU "fill", whichever is smaller. The division truncates; the
        // kernel's grid-stride loop picks up any elements left over.
        int threadCount = warpSize * warpPerBlock;
        int blockCount = min( maxGridSize, max(1, warpCount/warpPerBlock) );
        dim3 BlockDim = dim3(threadCount, 1, 1);
        dim3 GridDim  = dim3(blockCount, 1, 1);
    
        cdivkernel<<< GridDim,BlockDim >>>(n,a);
        // errchk is presumably an error-checking helper defined elsewhere (not shown)
        errchk( cudaPeekAtLastError() );
    }
    

    Perhaps this gives some hints about how to design a “universal” scheme for setting execution parameters against input data size.
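
    One possible way to generalise the arithmetic is sketched below in plain host-side C++ (it should also compile inside a .cu file). The `LaunchConfig` struct, the `chooseLaunchConfig` function and its parameters are hypothetical names introduced here, not part of the original answer. The multiprocessor count and the blocks-per-MP target are assumed to be supplied by the caller, for instance from a device query such as `cudaDeviceProp::multiProcessorCount` and from profiling. Unlike the wrapper above, this sketch rounds the block count up instead of truncating.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type for this sketch.
struct LaunchConfig {
    int threadsPerBlock;
    int blockCount;
};

// Hypothetical generalisation of the wrapper's arithmetic.
// mpCount and blocksPerMP are assumptions supplied by the caller;
// 14 and 8 reproduce the 112-block cap used in the answer.
LaunchConfig chooseLaunchConfig(int n, int mpCount, int blocksPerMP,
                                int maxWarpsPerBlock = 4, int warpSize = 32)
{
    assert(n >= 0 && mpCount > 0 && blocksPerMP > 0);
    assert(maxWarpsPerBlock > 0 && warpSize > 0);

    // Warps needed to cover n elements, rounded up.
    const int warpCount = (n + warpSize - 1) / warpSize;

    // Small problems get small blocks; large ones are capped.
    const int warpsPerBlock = std::max(1, std::min(maxWarpsPerBlock, warpCount));

    // Blocks needed to hold all warps, rounded up (the wrapper truncates).
    const int blocksNeeded = (warpCount + warpsPerBlock - 1) / warpsPerBlock;

    // Never launch more blocks than the chosen "fill" of the GPU;
    // a grid-stride kernel iterates over whatever is left.
    const int maxGridSize = mpCount * blocksPerMP;
    const int blockCount = std::min(maxGridSize, std::max(1, blocksNeeded));

    return { warpSize * warpsPerBlock, blockCount };
}
```

    For example, `chooseLaunchConfig(1000, 14, 8)` gives 128 threads per block and 8 blocks, while `chooseLaunchConfig(1000000, 14, 8)` is capped at 112 blocks and relies on the kernel's loop to cover the rest.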
