In my kernel it is necessary to make a large number of random accesses to a small lookup table (only 8 32-bit integers). Each kernel has a unique lookup table. Below is a simplified version of the kernel to illustrate how the lookup table is used.
__kernel void some_kernel(
__global uint* global_table,
__global uint* X,
__global uint* Y) {
size_t gsi = get_global_size(0);
size_t gid = get_global_id(0);
__private uint LUT[8]; // 8 words of of global_table is copied to LUT
// Y is assigned a value from the lookup table based on the current value of X
for (size_t i = 0; i < n; i++) {
Y[i*gsi+gid] = LUT[X[i*gsi+gid]];
}
}
Because of the small size, I am getting the best performance by keeping the table in the __private memory space. However, because of the random nature in which the lookup table is accessed, there is still a large performance hit. With the lookup table code removed (replaced with a simple arithmetic operation, for example), although the kernel would provide the wrong answer, the performance improves by a factor of over 3.
Is there a better way? Have I overlooked some OpenCL feature that provides efficient random access for very small chunks of memory? Could there be an efficient solution using vector types?
[edit] Note, that the maximum value of X is 7, but the maximum value of Y is as large as 2^32-1. In other words, all the bits of the lookup table are being used, so it cannot be packed into a smaller representation.
The fastest solution I can think of is to not use arrays in the first place: use individual variables instead and use some sort of access function to access them as if they were an array. IIRC (at least for the AMD compiler but I’m pretty sure this is true for NVidia as well): generally, arrays are always stored in memory, while scalars may be stored in registers. (But my mind is a little fuzzy on the matter — I might be wrong!)
Even if you need a giant switch statement:
… you might still come out ahead in performance compared to a __private array, since, assuming the arr variables all fit in registers is purely ALU-bound. (Assuming you have enough spare registers for the arr variables, of course.)
Note, some OpenCL targets don’t even have private memory, and anything you declare there just goes to __global. Using register storage is an even bigger win there.
Of course, this LUT approach is likely to be slower to initialize, since you will need at least two separate memory reads to copy the LUT data from global memory.