Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8573997
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 11, 20262026-06-11T19:22:36+00:00 2026-06-11T19:22:36+00:00

I am writing a CUDA kernel that requires maintaining a small associative array per

  • 0

I am writing a CUDA kernel that requires maintaining a small associative array per thread. By small, I mean 8 elements max worst case, and an expected number of entries of two or so; so nothing fancy; just an array of keys and an array of values, and indexing and insertion happens by means of a loop over said arrays.

Now I do this by means of thread local memory; that is identifier[size]; where size is a compile time constant. Now ive heard that under some circumstances this memory is stored off-chip, and under some circumstances it is stored on-chip. Obviously I want the latter, under all circumstances. I understand that I can accomplish such with a block of shared mem, where I let each thread work on its own private block; but really? I dont want to share anything between threads, and it would be a horrible kludge.

What exactly are the rules for where this memory goes? I cant seem to find any word from nvidia. For the record, I am using CUDA5 and targetting Kepler.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-11T19:22:37+00:00Added an answer on June 11, 2026 at 7:22 pm

    Local variables are either stored in registers, or (cached for compute capability >=2.0) off-chip memory.

    Registers are only used for arrays if all array indices are constant and can be determined at compile time, as the architecture has no means for indexed access to registers at runtime.

    In you case the number of keys may be small enough to use registers (and tolerate the increase in register pressure). Unroll all loops over array accesses to allow the compiler to place the keys in registers, and use cuobjdump -sass to check it actually does.

    If you don’t want to spend registers, you can either choose shared memory with a per-thread offset (but check that the additional registers used to hold per-thread indices into shared memory don’t outvalue the number of keys you use) as you mentioned, or do nothing and use off-chip “local” memory (really “global” memory with just a different addressing scheme) hoping for the cache to do it’s work.

    If you hope for the cache to hold the keys and values, and don’t use much shared memory, it may be beneficial to select the 48kB cache / 16kB shared memory setting over the default 16kB/48kB split using cudaDeviceSetCacheConfig().

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am writing a cuda kernel which requires me to allocate an array of
I'm writing a program that requires the following kernel launch: dim3 blocks(16,16,16); //grid dimensions
I am writing a CUDA kernel in which I'm using the string data type
Writing documentation in html requires some code examples. What to do with characters that
I am writing a CUDA kernel for Histogram on a picture, but I had
I am writing a very very long CUDA kernel, and it is pretty awful
I'm writing a CUDA kernel which involves calculating the maximum value on a given
I'm about writing a CUDA kernel to perform a single operation on every single
i have a kernel where several threads will be writing to the same array
How can I write a statement in my CUDA kernel that is executed by

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.