Code taken from a sample. I created a project with it and it works,

Question

Asked: June 14, 2026

Code taken from a sample. I created a project with it and it works, but I don’t understand some parts.

For the sake of the example, say I have a 32×32 matrix. There are 36 work items, so I presume get_global_id(0) goes from 0 to 35, and size = MATRIX_DIM/4 = 8.
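As a check on these numbers, the row/column loop from the kernel can be replayed on the host in plain C. This is only a sketch: `block_of`, `block_pos`, and `work_items_needed` are hypothetical names that are not part of the sample, `gid` stands in for get_global_id(0), and `size` is assumed to be MATRIX_DIM/4.

```c
#include <assert.h>

/* Hypothetical type: position of a 4x4 block in the matrix of blocks. */
typedef struct { int row; int col; } block_pos;

/* Blocks on or above the diagonal: size + (size - 1) + ... + 1. */
int work_items_needed(int size) {
    return size * (size + 1) / 2;
}

/* Replays the kernel's loop; gid stands in for get_global_id(0). */
block_pos block_of(int gid, int size) {
    /* Outside this range the loop below would not terminate cleanly. */
    assert(gid >= 0 && gid < work_items_needed(size));
    int col = gid;
    int row = 0;
    /* Row r of the upper triangle holds (size - r) blocks. */
    while (col >= size) {
        col -= size--;
        row++;
    }
    col += row;  /* shift right so that col >= row */
    block_pos p = { row, col };
    return p;
}
```

With size = 8, `work_items_needed(8)` is 36, ids 0 through 7 land in row 0, id 8 maps to block (1, 1), and id 35 maps to block (7, 7). So each work item handles one block on or above the diagonal, and swaps it with its mirror image when it is off the diagonal.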

__kernel void transpose(__global float4 *g_mat,
   __local float4 *l_mat, uint size) {

   __global float4 *src, *dst;

   /* Determine row and column location */
   int col = get_global_id(0);
   int row = 0;
   while(col >= size) {
      col -= size--;
      row++;
   }
   col += row;
   size += row;

   /* Read source block into local memory */
   src = g_mat + row * size * 4 + col;
   l_mat += get_local_id(0)*8;

In the clEnqueueNDRangeKernel call, the local_work_size argument was set to NULL, which according to the manual means the OpenCL implementation picks the work-group size:

local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.

But I don’t understand the multiply by 8, which gives an address offset into local memory for the work group I suppose. Can someone please explain this?

   l_mat[0] = src[0];
   l_mat[1] = src[size];
   l_mat[2] = src[2*size];
   l_mat[3] = src[3*size];

   /* Process block on diagonal */
   if(row == col) {
      src[0] =
         (float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x);
      src[size] =
         (float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y);
      src[2*size] =
         (float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z);
      src[3*size] =
         (float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w);
   }
   /* Process block off diagonal */
   else {
      /* Read destination block into local memory */
      dst = g_mat + col * size * 4 + row;
      l_mat[4] = dst[0];
      l_mat[5] = dst[size];
      l_mat[6] = dst[2*size];
      l_mat[7] = dst[3*size];

      /* Set elements of destination block */
      dst[0] =
         (float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x);
      dst[size] =
         (float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y);
      dst[2*size] =
         (float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z);
      dst[3*size] =
         (float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w);

      /* Set elements of source block */
      src[0] =
         (float4)(l_mat[4].x, l_mat[5].x, l_mat[6].x, l_mat[7].x);
      src[size] =
         (float4)(l_mat[4].y, l_mat[5].y, l_mat[6].y, l_mat[7].y);
      src[2*size] =
         (float4)(l_mat[4].z, l_mat[5].z, l_mat[6].z, l_mat[7].z);
      src[3*size] =
         (float4)(l_mat[4].w, l_mat[5].w, l_mat[6].w, l_mat[7].w);
   }
}

1 Answer

Editorial Team · Answer 1 · 2026-06-14T09:25:20+00:00

l_mat is being used as a local store for the threads in a work-group. Specifically, it is used because accesses to local memory are typically much faster than accesses to global memory.

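Each thread needs 8 float4s: four for the rows of its source block (l_mat[0] to l_mat[3]) and four for the rows of the destination block (l_mat[4] to l_mat[7]). The following pointer arithmetic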

l_mat += get_local_id(0)*8;

moves the l_mat pointer for each thread so that it doesn’t overlap with other threads’ data.
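The effect of the offset can be sketched in plain C on the host. `slot_index` and `SLOTS_PER_ITEM` are hypothetical names used only for illustration, and `local_id` stands in for get_local_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* 4 float4 rows for the source block + 4 for the destination block. */
#define SLOTS_PER_ITEM 8

/* Index into the original l_mat buffer that a thread touches when it
   writes l_mat[k] after the "l_mat += get_local_id(0)*8" adjustment. */
size_t slot_index(size_t local_id, size_t k) {
    assert(k < SLOTS_PER_ITEM);  /* the kernel only uses l_mat[0..7] */
    return local_id * SLOTS_PER_ITEM + k;
}
```

Thread 0 therefore uses float4 slots 0 to 7, thread 1 uses slots 8 to 15, and so on, so no two threads in the same work-group share a slot.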

This could cause an error: since local_work_size wasn’t specified, the implementation chooses the work-group size, and we cannot be sure that the buffer behind l_mat is large enough to hold 8 float4s for every thread in the group.
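To make the size requirement concrete, here is a rough host-side sketch in plain C. The helper names are hypothetical, and it assumes a 4-byte float, so one float4 is 16 bytes. The sample's host code is not shown here, so how many bytes it actually reserves for l_mat is unknown.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: float is 4 bytes, so one float4 occupies 16 bytes. */
#define FLOAT4_BYTES (4 * sizeof(float))
#define SLOTS_PER_ITEM 8  /* float4 slots each thread uses in l_mat */

/* Bytes of local memory needed for a given work-group size. */
size_t local_bytes_needed(size_t work_group_size) {
    assert(work_group_size > 0);
    return work_group_size * SLOTS_PER_ITEM * FLOAT4_BYTES;
}

/* Largest work-group size that fits in a local buffer of this many bytes. */
size_t max_safe_work_group_size(size_t buffer_bytes) {
    return buffer_bytes / (SLOTS_PER_ITEM * FLOAT4_BYTES);
}
```

For a `__local` argument, the host passes only a byte count to clSetKernelArg (with a NULL value pointer), so that count has to be at least `local_bytes_needed(n)` for whatever work-group size n ends up being used. One way to remove the uncertainty is to pass an explicit local_work_size to clEnqueueNDRangeKernel and size the buffer from it.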
