Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7823987
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 2, 20262026-06-02T08:26:30+00:00 2026-06-02T08:26:30+00:00

I’m trying to exploit shared memory in this kernel function, but the performance are

  • 0

I’m trying to exploit shared memory in this kernel function, but the performance are not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was thinking to exploit shared memory to avoid the memory latency. But something is wrong apparently because my application became really slow since i’m using shared memory.
This is the kernel:

__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
int j = 0;

// index GPU - Transaction-wise
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;

// shared variable
extern __shared__ int shared_memory_data[];
extern __shared__ int shared_b1_data[];
extern __shared__ int shared_b2_data[];

// copy from global memory into shared memory and sync threads
shared_b1_data[tid] = b1_memory[tid];
shared_b2_data[tid] = b2_memory[tid];
__syncthreads();

// AND each int bitwise
for(j = 0; j < b1_size; j++)
    shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);

// write result for this block to global memory
_memory_device[i] = shared_memory_data[i];
}

The shared variables are declared extern because I don’t know the size of b1 and b2 since they depend from the number of customer that I can only know at runtime (but both have the same size all the times).
This is how I call the kernel:

void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{

int* _memory_device;
int* b1_memory;
int* b2_memory;

int b1_size = b1.getIntSize();

// allocate memory on GPU
(cudaMalloc((void **)&b1_memory,  _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&b2_memory,  _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&_memory_device,  _memSizeInt * SIZE_UINT));

// copy values on GPU
(cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));

dim3 dimBlock(1, 1);
dim3 dimGrid(1, 1);

AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);

// return values
(cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost ));

// Free Memory
(cudaFree(b1_memory));
(cudaFree(b2_memory));
(cudaFree(_memory_device));
}

b1 and b2 are bitmaps with 4 bits for each element. The number of elements depend from the number of customers. Also, I have problem with the kernel’s parameters, because if I add some blocks or threads, the AndBitwiseOperation() is not giving me the correct result. With just 1 block and 1 thread per block the result is correct but the kernel is not in parallel.
Every advice is welcomed 🙂
Thank you

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-02T08:26:32+00:00Added an answer on June 2, 2026 at 8:26 am

    I did not really understand what your kernel wants to do.

    You should read more about CUDA and GPU programming.

    I tried to point out some mistakes:

    1. Shared memory (SM) should reduce global memory reads.
      Analyze your global memory (GM) read and write operations per thread.

      a) You read GM two times and write SM two times.

      b) (nonsense loop ignored, no use of index) You read SM two times and write SM once.

      c) You read SM once and write GM once.

      So in total you have gained nothing. You could directly use the GM.

    2. You use all threads to write out one value at the block index i.
      You should only use one thread to write this data out.
      It makes no sense outputing the same data by multiple threads which will get serialized.

    3. You use a loop and don’t use the loop counter at all.

    4. You write at tid and read at i randomly.

    5. This assignment is overhead.

      unsigned int tid = threadIdx.x;
      
    6. The results cannot be correct with more then one block since with one block tid = i!
      All the wrong indexing results in wrong calculation using more then one block

    7. The shared memory at i was never written!

      _memory_device[i] = shared_memory_data[i];
      

    My assumption on what your kernel should do:

    /*
     * Call kernel with x-block usage and up to 3D Grid
     */
    __global__ void bitwiseAnd(int* outData_g, 
        const long long int inSize_s, 
        const int* inData1_g, 
        const int* inData2_g)
    {
        //get unique block index
        const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D
    
        //get unique thread index
        const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x; 
            
        //check global unique thread range
        if(threadId >= inSize_s)
            return;
    
        //output bitwise and
        outData_g[thread] = inData1_g[thread] & inData2_g[thread];
    }
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
I want to count how many characters a certain string has in PHP, but
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I need a function that will clean a strings' special characters. I do NOT
I want to construct a data frame in an Rcpp function, but when I
I am trying to understand how to use SyndicationItem to display feed which is
Basically, what I'm trying to create is a page of div tags, each has
link Im having trouble converting the html entites into html characters, (&# 8217;) i
For some reason, after submitting a string like this Jack’s Spindle from a text
this is what i have right now Drawing an RSS feed into the php,

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.