
The Archive Base


Editorial Team
Asked: June 1, 2026 (2026-06-01T09:27:01+00:00)


I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is that many of its global memory accesses are not coalesced, and the fetches are really bringing down performance. So I am planning to copy as much of the global memory as I can into local memory, but I have to pick what to copy.

Now my question is: Do many fetches of small chunks of memory hurt more than fewer fetches of larger chunks?


1 Answer

  1. Editorial Team
    Added an answer on June 1, 2026 at 9:27 am (2026-06-01T09:27:02+00:00)

    You can query a device's cacheline size by calling clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE. The value varies by device. The examples below assume 16 bytes (four floats), though many current GPUs report larger values such as 64 or 128 bytes, so check your own hardware.

    Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your ‘small chunks’ close together in memory to keep it fast.

    The two functions below, vectorAddFoo and vectorAddBar, demonstrate two ways to access the memory. The third function, copySomeMemory(…), applies to your question specifically. Both vector functions have their work items add a portion of the vectors, but they use different memory access patterns. vectorAddFoo has each work item process a contiguous block of vector elements, starting at its calculated position in the arrays and moving forward through its workload. vectorAddBar has work items start at their gid and skip gSize (= global size) elements before fetching and adding the next elements.

    vectorAddBar will execute faster because the reads and writes of neighboring work items fall into the same cacheline. With a 16-byte cacheline, every four float reads fall on the same cacheline and take only one action from the memory controller. After reading a[] and b[] in this manner, all four work items can do their addition and queue their write to c[].

    vectorAddFoo, by contrast, spreads the simultaneous reads and writes of neighboring work items across different cachelines (except when each work item handles only a few elements, as with very short vectors). Every read from a work item then requires its own action from the memory controller. Unless the GPU caches the following three floats in every case, this results in 4x the memory accesses.

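    // Block pattern: each work item processes one contiguous block, so
    // neighboring work items read addresses elementsPerWorkItem floats apart.
    __kernel void
    vectorAddFoo(__global const float * a,
                 __global const float * b,
                 __global       float * c,
                 const int totalElements)
    {
      int gid = get_global_id(0);
      // Assumes totalElements is a multiple of the global size;
      // any remainder is not processed.
      int elementsPerWorkItem = totalElements / (int)get_global_size(0);
      int start = elementsPerWorkItem * gid;

      for (int i = 0; i < elementsPerWorkItem; i++) {
        c[start+i] = a[start+i] + b[start+i];
      }
    }

    // Strided pattern: on every iteration, neighboring work items read
    // neighboring floats, so their accesses share cachelines.
    __kernel void
    vectorAddBar(__global const float * a,
                 __global const float * b,
                 __global       float * c,
                 const int totalElements)
    {
      int gid = get_global_id(0);
      int gSize = get_global_size(0);

      for (int i = gid; i < totalElements; i += gSize) {
        c[i] = a[i] + b[i];
      }
    }

    __kernel void
    copySomeMemory(__global const int * src,
                   const int count,      // not used in this sketch
                   const int position)
    {
      // Copy 16 KB of integers (4096 ints) to local memory, starting at 'position'.
      // Assumes src holds at least position + 4096 elements.
      __local int dst[4096];
      int lid   = get_local_id(0);
      int lSize = get_local_size(0);
      // Neighboring work items copy neighboring elements (same idea as vectorAddBar).
      for (int i = lid; i < 4096; i += lSize) {
        dst[i] = src[position + i];
      }
      // Local fence: make every work item's writes to dst visible to the whole group.
      barrier(CLK_LOCAL_MEM_FENCE);
      // use dst here...
    }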
    