The Archive Base


Editorial Team
Asked: June 15, 2026

I would like to read (BS_X+1)*(BS_Y+1) global memory locations with BS_X*BS_Y threads, moving the contents to shared memory, and I have developed the following code.

int i       = threadIdx.x;
int j       = threadIdx.y;
int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1  = j*BLOCK_SIZE_Y+i;

int i1      = (index1)%(BLOCK_SIZE_X+1);
int j1      = (index1)/(BLOCK_SIZE_Y+1);

int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];     

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1))) 
    Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];

In my understanding, coalescing is the parallel equivalent of the consecutive memory reads of sequential processing. How can I check whether the global memory accesses are coalesced? Note that there is an index jump from (i1,j1) to (i2,j2).
Thanks in advance.

1 Answer

    Editorial Team
    Added an answer on June 15, 2026 at 2:54 pm

    I’ve evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows that the code does not fully exploit coalescing. Here is the coalescing analyzer, which you may find useful:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct dim3_t{
        int x;
        int y;
    } dim3;


    // KERNEL LAUNCH PARAMETERS
    #define GRIDDIMX 4
    #define GRIDDIMY 4
    #define BLOCKDIMX 16
    #define BLOCKDIMY 16


    // ARCHITECTURE DEPENDENT
    // number of threads aggregated for coalescing
    #define COALESCINGWIDTH 32
    // number of bytes in one coalesced transaction
    #define CACHEBLOCKSIZE 128
    // byte address of the cache block that holds element `addr` of `size` bytes
    // (fully parenthesized so it is safe inside comparisons)
    #define CACHE_BLOCK_ADDR(addr,size)  (((addr)*(size))&(~(CACHEBLOCKSIZE-1)))


    int main(){
        // fixed dim3 variables
        // grid and block size
        dim3 blockDim,gridDim;
        blockDim.x=BLOCKDIMX;
        blockDim.y=BLOCKDIMY;
        gridDim.x=GRIDDIMX;
        gridDim.y=GRIDDIMY;

        // counters
        int unq_accesses=0;
        // cache block addresses already requested by the current group of threads
        int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
        int total_unq_accesses=0;
        if(unq_addr==NULL) return 1;

        // iterate over the total number of threads
        // and count the number of memory requests (the coalesced requests)
        int I, II, III;
        for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
            dim3 blockIdx;
            blockIdx.x = I%GRIDDIMX;
            blockIdx.y = I/GRIDDIMX;
            for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
                if(II%COALESCINGWIDTH==0){
                    // new coalescing bunch: bank the count of the previous one
                    total_unq_accesses+=unq_accesses;
                    unq_accesses=0;
                }
                dim3 threadIdx;
                threadIdx.x=II%BLOCKDIMX;
                threadIdx.y=II/BLOCKDIMX;

                ////////////////////////////////////////////////////////
                // Change this section to evaluate different accesses //
                ////////////////////////////////////////////////////////
                // do your indexing here
                #define BLOCK_SIZE_X BLOCKDIMX
                #define BLOCK_SIZE_Y BLOCKDIMY
                #define xdim 32
                int i       = threadIdx.x;
                int j       = threadIdx.y;
                int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
                int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

                int index1  = j*BLOCK_SIZE_Y+i;

                int i1      = (index1)%(BLOCK_SIZE_X+1);
                int j1      = (index1)/(BLOCK_SIZE_Y+1);

                int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
                int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
                // calculate the accessed location and offset here
                // change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
                int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
                int size = sizeof(double);
                //////////////////////////
                // End of modifications //
                //////////////////////////

                printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
                // check whether it can be merged with existing requests or not;
                // unq_addr[] already stores cache block addresses, so compare directly
                short merged=0;
                for(III=0; III<unq_accesses; III++){
                    if(CACHE_BLOCK_ADDR(addr,size)==unq_addr[III]){
                        merged=1;
                        break;
                    }
                }
                if(!merged){
                    // new cache block accessed over this coalescing width
                    unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
                    unq_accesses++;
                }
            }
        }
        // bank the count of the very last coalescing bunch
        total_unq_accesses+=unq_accesses;
        printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
        free(unq_addr);
        return 0;
    }
    

    The code iterates over every thread of the grid and counts the memory requests that remain after merging within each group of COALESCINGWIDTH threads. This count serves as a metric of memory access coalescing.

    To use the analyzer, paste the index calculation portion of your code into the marked region and decompose the memory access (the array read) into an element index ‘addr’ and an element ‘size’ in bytes. I’ve already done this for your code, where the index calculations are:

    int i       = threadIdx.x;
    int j       = threadIdx.y;
    int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
    int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;
    
    int index1  = j*BLOCK_SIZE_Y+i;
    
    int i1      = (index1)%(BLOCK_SIZE_X+1);
    int j1      = (index1)/(BLOCK_SIZE_Y+1);
    
    int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
    int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
    

    and the memory access is:

    Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];
    

    With the settings above (a 4x4 grid of 16x16 blocks and xdim = 32), the analyzer reports that 4096 threads make 608 memory transactions. A fully coalesced load of 32 doubles per group of 32 threads would need two 128-byte transactions, or 256 in total, so this access is only partially coalesced. Only the first load, through (i1,j1), is evaluated here. Run the code for your actual grid size, block size, and xdim to analyze the coalescing behavior of your kernel.
