The Archive Base

Editorial Team
Asked: June 13, 2026


I’m trying to understand coalescing global memory.
Say I’d like to load an odd set of floats to global memory. Each thread will process a set of 3 floats. Say these floats are A, B, and C.

A0,  B0,  C0
A1,  B1,  C1
A2,  B2,  C2
..          
A19, B19, C19

So the threads would grab the data like such:

Thread 0:  A0,  B0,  C0  
Thread 1:  A1,  B1,  C1  
Thread 2:  A2,  B2,  C2
..
Thread 19:  A19, B19, C19  

First approach:
I could load 3 arrays: float A[20]; float B[20]; float C[20]; I’d have to call cudaMemcpy() three times to load the data into global memory. This approach would probably not coalesce very well.

Second approach:
A better approach would be something like:

struct dataPt {float A; float B; float C;};
dataPt data[20];

I could load the data with one cudaMemcpy(), but I’m not sure the memory access would coalesce very well.

Third approach:

struct dataPt2 {float A; float B; float C; float padding;};
dataPt2 data2[20];

or

struct __align__(16) dataPt3 {float A; float B; float C;};
dataPt3 data3[20];

I could load the data to global memory with a single cudaMemcpy(), and the thread access to data would be coalesced. (At the cost of wasted global memory.)

1) The 1st approach would not coalesce because each thread will probably need 3 bus cycles to load the input data.
2) The 2nd approach will coalesce for many of the threads, but there will be a few threads that will need two bus cycles to get the input data.
3) The 3rd approach will coalesce for all threads.

Is this accurate? Is there a significant difference between the 2nd and 3rd approaches? Is there an approach that uses the 3 thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z)?

1 Answer

  1. Editorial Team
    Added an answer on June 13, 2026 at 10:04 am

    Just amplifying on @talonmies' answer.
    Let’s assume our kernel looks like this:

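    __global__ void kern(float *a, float *b, float *c){

      float local_a, local_b, local_c;
      // global thread index: consecutive threads get consecutive idx
      int idx = threadIdx.x + (blockDim.x * blockIdx.x);

      // each line below is one warp-wide load from global memory
      local_a = a[idx];
      local_b = b[idx];
      local_c = c[idx];
    }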
    

    ignoring optimizations (which would result in an empty kernel), and assuming we launch 1 block of 32 threads:

      kern<<<1, 32>>>(d_a, d_b, d_c);
    

    Then we have 32 threads (1 warp) executing in lock-step. That means each thread will process the following kernel code line:

      local_a = a[idx];
    

    at exactly the same time. A load from global memory is coalesced when the data items a warp loads all fall within a single 128-byte aligned segment of global memory (for CC 2.0 devices). A perfectly coalesced load with 100% bandwidth utilization means that each thread uses one unique 32-bit quantity within that 128-byte aligned region. If thread 0 loads a[0], thread 1 loads a[1], etc., that is a typical example of a coalesced load.
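To make the definition above concrete, here is a minimal host-side C++ sketch (not part of the original answer) that counts how many 128-byte aligned segments one warp-wide load touches. The name `segments_touched` and its parameters are hypothetical, and the model only assumes the 32-thread warp and 128-byte segment size quoted above; it ignores caches and any other hardware detail.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper: thread `lane` of a 32-thread warp reads
// `access_bytes` bytes starting at byte address base + lane * stride_bytes.
// Returns the number of distinct 128-byte aligned segments those reads touch.
std::size_t segments_touched(std::size_t base, std::size_t stride_bytes,
                             std::size_t access_bytes = 4) {
    assert(access_bytes > 0);
    const std::size_t kWarpSize = 32;       // threads per warp
    const std::size_t kSegmentBytes = 128;  // segment size assumed for CC 2.0
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t first = base + lane * stride_bytes;
        std::size_t last = first + access_bytes - 1;
        // Record every segment index overlapped by this thread's read.
        for (std::size_t s = first / kSegmentBytes; s <= last / kSegmentBytes; ++s)
            segments.insert(s);
    }
    return segments.size();
}
```

Under these assumptions, `segments_touched(0, 4)` models threads reading a[0..31] from an aligned base and returns 1, while the same reads starting at a base offset of 64 bytes span 2 segments.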

    So in your first case, since the a[] array is contiguous and aligned, and a[0..31] fits within a 128-byte aligned region in global memory, we get a coalesced load: thread 0 reads a[0], thread 1 reads a[1], etc.

    In the second case, a[0] (the A member of the first struct) is not contiguous with a[1], and furthermore the elements a[0..31] (which are all loaded at the same code line) do not fit within a single 128-byte aligned sequence in global memory. I’m going to let you parse what happens in your third case, but suffice it to say that, like the second case, the elements a[0..31] are neither contiguous nor contained within a single 128-byte aligned region in global memory. While data items do not have to be contiguous to achieve some level of coalescing, a “perfectly” coalesced load (100% bandwidth utilization) from a 32-thread warp implies that each thread uses a unique 32-bit item, and that all of those items are contiguous and contained within a single 128-byte aligned sequence in global memory.
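The three cases can be compared with a little arithmetic. The sketch below is host-side C++, not code from the original answer; `WarpLoad` and `analyze_float_load` are hypothetical names. It assumes a 128-byte aligned base address, a 32-thread warp, and that each thread loads only the 4-byte A value at a fixed stride. It does not model caches, so later loads of B and C from segments already fetched may be cheaper on real hardware than this count suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Result of one warp-wide load of a 4-byte float per thread.
struct WarpLoad {
    std::size_t segments;  // distinct 128-byte aligned segments touched
    double efficiency;     // bytes the warp uses / bytes in those segments
};

// Hypothetical helper: thread `lane` reads the float at byte offset
// lane * stride_bytes from a 128-byte aligned base address.
WarpLoad analyze_float_load(std::size_t stride_bytes) {
    const std::size_t kWarpSize = 32, kFloatBytes = 4, kSegmentBytes = 128;
    // Strides that are multiples of 4 keep every float 4-byte aligned,
    // so no single float can straddle a segment boundary.
    assert(stride_bytes >= kFloatBytes && stride_bytes % kFloatBytes == 0);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
        segments.insert(lane * stride_bytes / kSegmentBytes);
    double used = static_cast<double>(kWarpSize * kFloatBytes);
    double fetched = static_cast<double>(segments.size() * kSegmentBytes);
    return {segments.size(), used / fetched};
}
```

With these assumptions, a stride of 4 bytes (separate arrays, case 1) touches 1 segment, a stride of 12 bytes (unpadded struct, case 2) touches 3 segments, and a stride of 16 bytes (padded or aligned struct, case 3) touches 4 segments, so only the first uses every byte it fetches.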

    A handy mental model is to contrast an Array of Structures (AoS), which corresponds to your cases 2 and 3, with a Structure of Arrays (SoA), which is essentially your first case. SoA layouts usually present better possibilities for coalescing than AoS layouts. From the NVIDIA webinar page you may find this presentation interesting, especially slides 11-22 or so.
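If the data starts out as an array of structures, it can be rearranged on the host before it is copied to the device. The following is a minimal C++ sketch of that conversion, not code from the original answer; `DataPt`, `DataSoA`, and `to_soa` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS element, as in cases 2 and 3 of the question.
struct DataPt { float A, B, C; };

// SoA layout, as in case 1: one contiguous array per member.
struct DataSoA {
    std::vector<float> A, B, C;
};

// Hypothetical helper: split an array of structures into three
// contiguous arrays so that consecutive threads can read consecutive floats.
DataSoA to_soa(const std::vector<DataPt>& aos) {
    DataSoA soa;
    soa.A.reserve(aos.size());
    soa.B.reserve(aos.size());
    soa.C.reserve(aos.size());
    for (const DataPt& p : aos) {
        soa.A.push_back(p.A);
        soa.B.push_back(p.B);
        soa.C.push_back(p.C);
    }
    assert(soa.A.size() == aos.size());
    return soa;
}
```

Each of `soa.A.data()`, `soa.B.data()`, and `soa.C.data()` could then be copied to the device with its own cudaMemcpy() and passed to the kernel as a, b, and c. Packing the three arrays back to back in one host buffer would be one way to get down to a single copy, at the cost of some pointer arithmetic.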
