Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7037073
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 28, 20262026-05-28T01:29:59+00:00 2026-05-28T01:29:59+00:00

I have a kernel that, for each thread in a given block, computes a

  • 0

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.

However, I’m not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:

__global__ void foo( int* nIterBuf )
{
   __shared__ int nIter;

   if( threadIdx.x == 0 )
      nIter = nIterBuf[blockIdx.x];

   __syncthreads();

   for( int i=0; i < nIter; i++ )
      ...
} 

Any other better solutions? My app will use a lot of data, so I want the best performance.

Thanks!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-28T01:30:00+00:00Added an answer on May 28, 2026 at 1:30 am

    Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.

    The advantage of constant memory is that it goes through a dedicated cache, so it doesn’t pollute the L1, and if the amount of data you are accessing per block is relatively small, after the first access within each block, you should always hit in the cache after the initial compulsory miss in each thread block.

    You also won’t need to use any shared memory or transfer from global to shared memory.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a CUDA program that calls the kernel repeatedly within a for loop.
I have a kernel module that allocates a large buffer of memory, this buffer
I have a kernel module that provides data to a userland process through read().
I have a huge array that has to be read by different threads in
I have a number of Ruby files, each of which declares a Class ,
I have the task of large number of threads running, each doing a small
I have a kernel service driver , i want to install it with an
I'm building a custom Ubuntu kernel and have modified one of the source files.
I have a ARM11 based hardware board which runs a Linux kernel ver. 2.6.21.5-cfs-v19.
I have an ARM kit beside me and a Linux kernel source code patched

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.