The Archive Base

Editorial Team
Asked: June 3, 2026

I’m a newbie to using OpenCL (with the OpenCL.NET library) with Visual Studio C#,

I’m a newbie to using OpenCL (with the OpenCL.NET library) with Visual Studio C#, and am currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to yield the final value for that pixel. So, functionally, it is like a 4-D matrix, (161 x 161 x 161) x 192.

Right now I’m calling the kernel from my host code like this:

//C# host code
...
float[] BigMatrix = new float[161*161*161]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] { /* stuff... */ };

//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);

Sample kernel code is posted below.

__kernel void MyKernel(
__global float * BigMatrix,
__global float * otherArray
//various other variables...
)
{
    int N = 161; //Size of matrix edges
    int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
    int array_id = get_global_id(1); //The location within the otherArray


    //Finding the x,y,z values of the pixel_id.
    float3 p;
    p.x = pixel_id % N;    
    p.y = ((pixel_id % (N*N))-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/(N*N);

    float result;

    //...
    //Some long calculation for 'result' involving otherArray and p...
    //...

    BigMatrix[pixel_id] += result;
}

My code currently works, however I’m looking for speed for this application, and I’m not sure if my worker/group setup is the best approach (i.e. 161*161*161 and 192 for dimensions of the worker pool).

I’ve seen other examples of organizing the global worker pool into local worker groups to increase efficiency, but I’m not quite sure how to implement that in OpenCL.NET. I’m also not sure how this is different than just creating another dimension in the worker pool.

So, my question is: Can I use local groups here, and if so how would I organize them? In general, how is using local groups different than just calling an n-dimensional worker pool? (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local workgroup size of 192?)

Thanks for all the help!


1 Answer

  1. Editorial Team
    Added an answer on June 3, 2026 at 4:13 am

    I think a lot of performance is lost waiting on memory access. I have answered a similar SO question. I hope my post helps you out. Please ask any questions you have.

    Optimizations:

    1. The big boost in my version of your kernel comes from reading otherArray into local memory, so each value is fetched from global memory once per work group instead of once per work item.
    2. Each work item computes 4 consecutive values in BigMatrix. This means they can be written at the same time, on the same cache line. There is minimal loss of parallelism because there are still > 1M work items to execute.
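The strided copy loop behind optimization 1 can look suspicious at first: does it really fill every slot of the local array, whatever the work group size? Here is a small sketch in plain C (not OpenCL C) that emulates one work group on the host and counts how often each element gets copied. The names `strided_copy_errors` and `OTHER_SIZE` are made up for this illustration; only the value 192 comes from the question.

```c
#include <assert.h>
#include <string.h>

#define OTHER_SIZE 192  /* length of otherArray, taken from the question */

/* Emulates the strided copy done by one work group: work item `local_id`
   copies elements local_id, local_id + local_size, local_id + 2*local_size, ...
   Returns the number of elements that were NOT copied exactly once
   (0 means the copy is complete and has no duplicated work). */
int strided_copy_errors(int local_size)
{
    int copies[OTHER_SIZE];
    memset(copies, 0, sizeof copies);
    assert(local_size > 0);

    for (int local_id = 0; local_id < local_size; local_id++) {
        for (int i = local_id; i < OTHER_SIZE; i += local_size) {
            copies[i]++;  /* stands in for otherValues[i] = otherArray[i] */
        }
    }

    int errors = 0;
    for (int i = 0; i < OTHER_SIZE; i++) {
        if (copies[i] != 1) {
            errors++;
        }
    }
    return errors;
}
```

With a group smaller than 192 each work item copies several elements, and with a larger group the extra work items simply copy nothing, so the loop does not depend on a particular work group size.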

    …

    #define N 161
    //parenthesized so that expressions like (x % Nsqr) and (x / Nsqr) expand correctly
    #define Nsqr (N*N)
    #define Ncub (N*N*N)
    #define otherSize 192
    
    __kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
    {
        //using 1 quarter of the total size of the matrix
        //this work item will be responsible for computing 4 consecutive values in BigMatrix
        //also reduces global size to (N^3)/4  ~= 1043000 for N=161
    
        int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
        int pixel_id;
        //array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely,
        //so no two work items ever do += on the same element
    
        int local_id = get_local_id(0); //work item id within the group
        int local_size = get_local_size(0); //size of group
    
    
        float result[4]; //result cached for 4 global values
        int i, j;
        float3 p;
    
        //cache the values in otherArray to local memory
        //now each work item in the group will be able to read the values efficiently
        //each element in otherArray will be read a total of N^3 times, so this is important
        //opencl specifies at least 16kb of local memory, so up to 4k floats will work fine
        __local float otherValues[otherSize];
        for(i=local_id; i<otherSize; i+= local_size){
            otherValues[i] = otherArray[i];
        }
        //barrier (not mem_fence): every work item in the group must finish its part
        //of the copy before any work item starts reading otherValues
        barrier(CLK_LOCAL_MEM_FENCE);
    
        //now this work item can compute the complete result for pixel_id 
        for(j=0;j<4;j++){
            result[j] = 0;
            pixel_id = global_id + j;
    
            //Finding the x,y,z values of the pixel_id.
            //TODO: optimize the calculation of p.y and p.z
            //they will be the same most of the time for a given work item
            p.x = pixel_id % N;    
            p.y = ((pixel_id % Nsqr)-p.x)/N;
            p.z = (pixel_id - p.x - p.y*N)/Nsqr;
    
            for(i=0;i<otherSize;i++){
                //...
                //Some long calculation for 'result' involving otherValues[i] and p...
                //...
                //result[j] += ...
            }
        }
        //4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
        BigMatrix[global_id] += result[0];
        BigMatrix[global_id + 1] += result[1];
        BigMatrix[global_id + 2] += result[2];
        BigMatrix[global_id + 3] += result[3];
    }
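Because the kernel above writes four elements per work item without any bounds check, the host has to pick a global size that never reaches past the end of BigMatrix. The following is a rough host-side sketch in plain C of one way to do that, assuming the work group size is set explicitly and (as OpenCL 1.x requires in that case) the global size must be a multiple of it. The helper names `safe_global_size` and `leftover_elements` and the group size of 64 used in the note below are my own assumptions, not part of OpenCL.NET.

```c
#include <assert.h>

#define N 161
#define VALUES_PER_ITEM 4  /* each work item fills 4 consecutive elements */

/* Largest global size that is a multiple of local_size and keeps every
   index global_id*4 + 3 inside the N*N*N element array. */
long safe_global_size(long local_size)
{
    assert(local_size > 0);
    long total = (long)N * N * N;
    long max_items = total / VALUES_PER_ITEM;      /* rounds down */
    return (max_items / local_size) * local_size;  /* rounds down again */
}

/* Number of trailing elements the kernel never touches with that global
   size. The host has to compute these itself. */
long leftover_elements(long local_size)
{
    long total = (long)N * N * N;
    return total - safe_global_size(local_size) * VALUES_PER_ITEM;
}
```

For N = 161 there are 4173281 elements. With a hypothetical work group size of 64, `safe_global_size(64)` gives 1043264 work items, which leaves 225 elements at the end of the array for the CPU.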
    

    Notes:

    1. The kernel does no error checking to see if each pixel_id falls within the range 0..N^3-1, so four times the global work size must not exceed N^3. Ideally, the number of elements processed is a multiple of 4*workgroupsize. Since N^3 = 4173281 is not a multiple of four, some elements at the end will be left unprocessed; these can be crunched by the CPU while you wait for the kernel to execute.
    2. The work group size should be fairly large. The more work items share one cached copy of otherArray, the more heavily the cached values are used and the greater the benefit of caching the data in local memory (LDS).
    3. There is a further optimization to be done with the calculation of p.x/y/z in order to avoid too many costly division and modulo operations. See the code below.

      __kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
      {
          int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
          int pixel_id = global_id;
      
          int local_id = get_local_id(0); //work item id within the group
          int local_size = get_local_size(0); //size of group
      
      
          float result[4]; //result cached for 4 global values
          int i, j;
          float3 p;
          //Finding the initial x,y,z values of the pixel_id.
          p.x = pixel_id % N;    
          p.y = ((pixel_id % Nsqr)-p.x)/N;
          p.z = (pixel_id - p.x - p.y*N)/Nsqr;
      
          //cache the values here. same as above...
      
          //now this work item can compute the complete result for pixel_id 
          for(j=0;j<4;j++){
              result[j] = 0;
      
              for(i=0;i<otherSize;i++){
                  //same i loop as above...
              }
      
              //increment the x,y,and z values instead of computing them all from scratch
              //this is done after the i loop so that the first pixel is not skipped
              p.x += 1;
              if(p.x >= N){
                  p.x = 0;
                  p.y += 1;
                  if(p.y >= N){
                      p.y = 0;
                      p.z += 1;
                  }
              }
          }
      
          //write the 4 results to BigMatrix. same as above...
      }
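The incremental update of x, y and z is easy to get off by one, so it is worth checking against the division and modulo version on the host. Below is a sketch of such a check in plain C with integer coordinates. The `Coord` struct and the functions `coord_from_index`, `step` and `stepping_matches` are names I made up for this illustration, and `(pixel_id / N) % N` is just an equivalent way of writing the p.y formula from the kernels. Note that the coordinates are compared before stepping, so the starting index itself is covered.

```c
#include <assert.h>

#define N 161

typedef struct { int x, y, z; } Coord;

/* Reference: coordinates from the flat index using division and modulo. */
Coord coord_from_index(int pixel_id)
{
    Coord c;
    c.x = pixel_id % N;
    c.y = (pixel_id / N) % N;
    c.z = pixel_id / (N * N);
    return c;
}

/* Advance to the next flat index using only additions and compares. */
Coord step(Coord c)
{
    c.x += 1;
    if (c.x >= N) {
        c.x = 0;
        c.y += 1;
        if (c.y >= N) {
            c.y = 0;
            c.z += 1;
        }
    }
    return c;
}

/* Returns 1 if stepping from `start` reproduces the div/mod coordinates
   for `count` consecutive indices, 0 otherwise. */
int stepping_matches(int start, int count)
{
    assert(start >= 0 && count >= 0);
    Coord c = coord_from_index(start);
    for (int j = 0; j < count; j++) {
        Coord ref = coord_from_index(start + j);
        if (c.x != ref.x || c.y != ref.y || c.z != ref.z) {
            return 0;
        }
        c = step(c);  /* step after use, as in the j loop of the kernel */
    }
    return 1;
}
```

Start values near the end of a row or a plane, such as 160 or 161*161 - 2, are the interesting cases because the four pixels of one work item then wrap in x and possibly in y.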
      