The Archive Base

Asked by Editorial Team on June 7, 2026 (2026-06-07T11:38:45+00:00)

I wrote a small CUDA program to understand global memory to shared memory transfer transactions.

The code is as follows:

#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

// Each thread copies one uchar4 (4 bytes) from global memory to shared
// memory, then from shared memory back to global memory.
__global__ void readUChar4(uchar4* c, uchar4* o){
  extern __shared__ uchar4 gc[];
  int tid = threadIdx.x;
  gc[tid] = c[tid];
  o[tid] = gc[tid];
}

int main(){
  const size_t bytes = 128*sizeof(uchar4);  // 128 elements = 512 bytes
  // Exactly 512 characters, so the cudaMemcpy below never reads past the buffer.
  string a(bytes, 'a');
  uchar4* c;
  cudaError_t e1 = cudaMalloc((void**)&c, bytes);
  if(e1==cudaSuccess){
    uchar4* o;
    cudaError_t e11 = cudaMalloc((void**)&o, bytes);

    if(e11 == cudaSuccess){
      cudaError_t e2 = cudaMemcpy(c, a.c_str(), bytes, cudaMemcpyHostToDevice);
      if(e2 == cudaSuccess){
        // One block of 128 threads with 512 bytes of dynamic shared memory.
        readUChar4<<<1,128, bytes>>>(c, o);
        uchar4* oFromGPU = (uchar4*)malloc(bytes);
        // This blocking copy waits for the kernel to finish first.
        cudaError_t e22 = cudaMemcpy(oFromGPU, o, bytes, cudaMemcpyDeviceToHost);
        if(e22 == cudaSuccess){
          for(int i =0; i < 128; i++){
            cout << oFromGPU[i].x << " ";
            cout << oFromGPU[i].y << " ";
            cout << oFromGPU[i].z << " ";
            cout << oFromGPU[i].w << " " << endl;
          }
        }
        else{
          cout << "Failed to copy from GPU" << endl;
        }
        free(oFromGPU);
      }
      else{
        cout << "Failed to copy" << endl;
      }
      cudaFree(o);
    }
    else{
      cout << "Failed to allocate output memory" << endl;
    }
    cudaFree(c);
  }
  else{
    cout << "Failed to allocate memory" << endl;
  }
  return 0;
}

This code simply copies data from device memory to shared memory and back to device memory. I have the following three questions:

  1. Is the transfer from device memory to shared memory in this case guaranteed to take 4 memory transactions? I believe it depends on how cudaMalloc allocates memory: if the data is scattered across memory, the transfer will take more than 4 memory transactions. However, if cudaMalloc allocates memory in 128-byte chunks or contiguously, it should take no more than 4 memory transactions.
  2. Does the same logic hold for writing data from shared memory to device memory, i.e., will that transfer also complete in 4 memory transactions?
  3. Can this code cause bank conflicts? I believe it will not if threads are assigned IDs sequentially. However, if threads 32 and 64 are scheduled to run in the same warp, this code can cause bank conflicts.
1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 11:38 am (2026-06-07T11:38:46+00:00)

    In the code you provided (repeated here), the compiler will completely remove the shared memory store and load, since they do nothing necessary or beneficial for the code.

    __global__ void readUChar4(uchar4* c, uchar4* o){
      extern __shared__ uchar4 gc[];
      int tid = threadIdx.x;
      gc[tid] = c[tid];
      o[tid] = gc[tid];
    }
    

    Assuming you did something with the shared memory so it was not eliminated, then:

    1. The loads and stores from and to global memory in this code would take ONE transaction per warp (assuming a Fermi or later GPU), since each thread accesses only 32 bits (uchar4 = 4*8 bits), for a total of 128 bytes per warp. With 128 threads, that is four warps and therefore four transactions in total. cudaMalloc allocates memory contiguously.
    2. Yes, the answer to 1 also applies to stores.
    3. There are no bank conflicts in this code. Threads in a warp always have contiguous IDs, and the first thread's ID is a multiple of the warp size, so threads 32 and 64 will never be in the same warp. And since you are loading and storing a 32-bit data type and the banks are 32 bits wide, there are no conflicts.
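
The arithmetic behind the three points above can be checked on the host with a small model. This is only a sketch under stated assumptions, not a description of how any particular GPU is implemented: it assumes 32 threads per warp, 32 shared memory banks that are 4 bytes wide, and one global memory transaction per distinct 128-byte segment touched by a warp. The function names (`warpOf`, `bankOfWord`, `segmentsForWarp`, `segmentsForBlock`, `conflictFree`) are hypothetical helpers introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed hardware parameters (a simplified model, see the text above).
constexpr int kWarpSize = 32;              // threads per warp
constexpr int kNumBanks = 32;              // shared memory banks
constexpr std::size_t kWordSize = 4;       // bytes per uchar4 and per bank word
constexpr std::size_t kSegmentSize = 128;  // bytes per global memory segment

// Warp index of a thread in a 1D block: consecutive IDs, 32 per warp.
int warpOf(int tid) {
  assert(tid >= 0);
  return tid / kWarpSize;
}

// Bank that holds the 4-byte shared memory word with the given index.
int bankOfWord(int wordIndex) {
  assert(wordIndex >= 0);
  return wordIndex % kNumBanks;
}

// Distinct 128-byte segments touched by one warp when thread `tid`
// accesses the 4-byte element `tid` of an array starting at byte `base`.
int segmentsForWarp(int warp, std::size_t base) {
  std::set<std::size_t> segments;
  for (int lane = 0; lane < kWarpSize; ++lane) {
    std::size_t tid = static_cast<std::size_t>(warp) * kWarpSize + lane;
    std::size_t first = base + tid * kWordSize;
    segments.insert(first / kSegmentSize);
    segments.insert((first + kWordSize - 1) / kSegmentSize);
  }
  return static_cast<int>(segments.size());
}

// Total segments for a block whose size is a multiple of the warp size.
int segmentsForBlock(int numThreads, std::size_t base) {
  assert(numThreads > 0 && numThreads % kWarpSize == 0);
  int total = 0;
  for (int warp = 0; warp < numThreads / kWarpSize; ++warp)
    total += segmentsForWarp(warp, base);
  return total;
}

// True if no two lanes of `warp` hit the same bank when thread `tid`
// accesses the 4-byte shared memory word `tid * stride`.
bool conflictFree(int warp, int stride) {
  std::set<int> banks;
  for (int lane = 0; lane < kWarpSize; ++lane)
    banks.insert(bankOfWord((warp * kWarpSize + lane) * stride));
  return banks.size() == static_cast<std::size_t>(kWarpSize);
}
```

In this model, `segmentsForBlock(128, 0)` evaluates to 4 (one segment per warp) when the base address is 128-byte aligned, while `segmentsForWarp(0, 4)` evaluates to 2 because a misaligned warp straddles two segments. `warpOf(32)` and `warpOf(64)` differ, so although words 32 and 64 both fall in bank 0, those two threads never compete within one warp. `conflictFree(0, 1)` is true for the unit-stride access used in the question, whereas a stride of 2 would make pairs of lanes share a bank.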