I have seen solutions like this:
kernel dp_square (const float *a,
float *result)
{
int id = get_global_id(0);
result[id] = a[id] * a[id];
}
and
kernel dp_square (const float *a,
float *result, const unsigned int count)
{
int id = get_global_id(0);
if(id < count)
result[id] = a[id] * a[id];
}
Is the check for id< count important, what happens if a kernel work item tries to process an item not avalible?
Can the reason for it not being there in the first example be that programmer just ensures that the global size is equal the number of elements to be processed ( is this normal) ?
This is often done for two reasons —
To ensure that a developer-error doesn’t kill the code or read bad memory
Because sometimes it is optimal to run more work-items than there are data points. For example, if the optimal work-group size for my device is 32 (not uncommon), and I have an array of 61 pieces of data, I’ll run 64-work items, and the last three will simply “play dead.”
In order to not include this check, you’d have to use a work-group size that divides the total number of work-items. In this case, that would leave you with a work-group size of 1 (as 61 is prime), which would be very slow!