I am thinking about reworking my GPU OpenCL kernel to speed things up. The

Question

0

Asked: June 1, 20262026-06-01T09:27:01+00:00 2026-06-01T09:27:01+00:00

I am thinking about reworking my GPU OpenCL kernel to speed things up. The

0

I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is there is a lot of global memory that is not coalesced and fetches are really bringing down the performance. So I am planning to copy as much of the global memory into local but I have to pick what to copy.

Now my question is: Do many fetches of small chunks of memory hurt more than fewer fetches of larger chunks?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-01T09:27:02+00:00

You can use clGetDeviceInfo to find out what the cacheline size is for a device. (clGetDeviceInfo, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE) On many devices today, this value is typically 16 bytes.

Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your ‘small chunks’ close together in memory to keep it fast.

I have two functions below to demonstrate two ways to access the memory — vectorAddFoo, and vectorAddBar. The third function copySomeMemory(…) applies to your question specifically. Both vector functions have their work items add a portion of the vectors being added, but use different memory access patterns. vectorAddFoo gets each work item to process a block of vector elements, starting at its calculated position in the arrays, and moving forward through its workload. vectorAddBar has work items start at their gid and skip gSize (= global size) elements before fetching and adding the next elements.

vectorAddBar will execute faster because of the reads and writes falling into the same cacheline in memory. Every 4 float reads will fall on the same cacheline, and take only one action from the memory controller to perform. After reading a[] and b[] in this matter, all four work items will be able to do their addition, and queue their write to c[].

vectorAddFoo will guarantee the reads and writes are not in the same cacheline (except for very short vectors ~totalElements<5). Every read from a work item will require an action from the memory controller. Unless the gpu caches the following 3 floats in every case, this will result in 4x the memory access.

__kernel void  
vectorAddFoo(__global const float * a,  
          __global const float * b,  
          __global       float * c,
          __global const totalElements) 
{ 
  int gid = get_global_id(0); 
  int elementsPerWorkItem = totalElements/get_global_size(0);
  int start = elementsPerWorkItem * gid;

  for(int i=0;i<elementsPerWorkItem;i++){
    c[start+i] = a[start+i] + b[start+i]; 
  }
} 
__kernel void  
vectorAddBar(__global const float * a,  
          __global const float * b,  
          __global       float * c,
          __global const totalElements) 
{ 
  int gid = get_global_id(0); 
  int gSize = get_global_size(0);

  for(int i=gid;i<totalElements;i+=gSize){
    c[i] = a[i] + b[i]; 
  }
} 
__kernel void  
copySomeMemory(__global const int * src,
          __global const count,
          __global const position) 
{ 
  //copy 16kb of integers to local memory, starting at 'position'
  int start = position + get_local_id(0); 
  int lSize = get_local_size(0);
  __local dst[4096];
  for(int i=0;i<4096;i+=lSize ){
    dst[start+i] = src[start+i]; 
  }
  barrier(CLK_GLOBAL_MEM_FENCE);
  //use dst here...
}

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am thinking about reworking my GPU OpenCL kernel to speed things up. The

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply