I’m using an HD5770, which has 10 compute units and 32 KB of local memory.
My global size is 256 * 256 and my local size is 256, so there are 256 work-groups.
Each work-group needs 1 KB of local memory, which I am specifying like this:
clSetKernelArg(predicate, param++, 1024, NULL);
First: is this the correct way to allocate local memory? Or do I have to pass the combined size needed by all work-groups when setting the kernel argument, and then index into that buffer depending on the local ID?
Second: will a work-group execute on only one compute unit?
Third: will the memory be freed after a work-group has finished? (32 KB won’t be enough for 256 work-groups if each of them uses 1 KB.)
Or, more generally: will the scheduler make sure that no more than 32 work-groups run in parallel?
Thank you!
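1) Yes, this is how you allocate 1024 bytes of uninitialized local memory. The size you pass is per work-group, not the total: every work-group gets its own private 1 KB, so you do not need one big buffer that you index into yourself. The matching kernel parameter must be declared as a __local pointer. You can also declare the local memory inside the kernel itself as a fixed-size __local array, in which case its size has to be a compile-time constant and no kernel argument is needed.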
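2) Yes. The OpenCL execution model specifies that the work-items of a work-group execute on the processing elements of a single compute unit, so a work-group is never split across compute units. The reverse does not hold: one compute unit can run several work-groups at the same time, and which compute unit a particular work-group lands on is up to the implementation.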
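3) You don’t have to free anything. Local memory only exists for the lifetime of a work-group: when a work-group finishes, its local memory is reclaimed and can be handed to another work-group, whose copy starts out uninitialized (never rely on leftover contents). The scheduler does take care of the limit: a work-group is only dispatched to a compute unit that has enough free local memory (and registers) for it, and the remaining work-groups wait until running ones finish. Also note that on GPUs like this one the 32 KB is the local memory of each compute unit, so it limits how many work-groups can be resident on a single compute unit, not how many run across the whole device. Requesting more local memory per work-group than the device reports as CL_DEVICE_LOCAL_MEM_SIZE, on the other hand, makes the kernel launch fail.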
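A minimal sketch of the two ways to get local memory, assuming hypothetical kernels named reverse_dynamic and reverse_static that work on float data (neither is from the original post). The kernel source is kept as a C string, the way it would be passed to clCreateProgramWithSource; the host-side helper local_arg_bytes (also hypothetical) only computes the byte count you would pass as the size argument of clSetKernelArg together with a NULL value pointer. With 256 work-items per group and one float each, that is 256 * sizeof(float) = 1024 bytes per work-group, not global size times element size.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source as it would be handed to clCreateProgramWithSource.
 * Both kernels reverse the values inside each work-group via local memory. */
const char *const kernel_src =
    /* Dynamic size: the host picks the byte count with
     * clSetKernelArg(kernel, 2, bytes, NULL). */
    "__kernel void reverse_dynamic(__global const float *in,\n"
    "                              __global float *out,\n"
    "                              __local float *scratch)\n"
    "{\n"
    "    size_t lid = get_local_id(0), n = get_local_size(0);\n"
    "    scratch[lid] = in[get_global_id(0)];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    out[get_global_id(0)] = scratch[n - 1 - lid];\n"
    "}\n"
    /* Static size: fixed at compile time, no kernel argument needed.
     * 256 floats = 1024 bytes, so the local size must be <= 256. */
    "__kernel void reverse_static(__global const float *in,\n"
    "                             __global float *out)\n"
    "{\n"
    "    __local float scratch[256];\n"
    "    size_t lid = get_local_id(0), n = get_local_size(0);\n"
    "    scratch[lid] = in[get_global_id(0)];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    out[get_global_id(0)] = scratch[n - 1 - lid];\n"
    "}\n";

/* Byte count for a __local pointer argument. It is per work-group:
 * every work-group receives its own block of this size. */
size_t local_arg_bytes(size_t local_size, size_t elem_size)
{
    assert(local_size > 0 && elem_size > 0);
    return local_size * elem_size;
}
```

For reverse_dynamic the host call would be clSetKernelArg(kernel, 2, local_arg_bytes(256, sizeof(float)), NULL); reverse_static needs no extra argument, but its buffer size cannot change without recompiling the program.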
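To put numbers on question 3, here is a small sketch of the arithmetic. It assumes, as described above for this class of GPU, that the 32 KB belongs to each compute unit, and it counts only the local-memory limit; registers and per-unit hardware caps on resident work-groups usually lower the real figure, so treat the results as upper bounds. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1-D NDRange. OpenCL 1.x requires the
 * global size to be a multiple of the local size. */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Upper bound on work-groups resident on ONE compute unit at a time,
 * counting only local memory. */
size_t max_resident_per_cu(size_t local_mem_per_cu, size_t local_per_group)
{
    assert(local_per_group > 0);
    return local_mem_per_cu / local_per_group;
}

/* Simplified model: how many times all slots would have to be refilled.
 * A real scheduler starts a new work-group as soon as any slot frees up. */
size_t rounds_needed(size_t groups, size_t num_cu, size_t per_cu)
{
    size_t slots = num_cu * per_cu;
    assert(slots > 0);
    return (groups + slots - 1) / slots; /* ceiling division */
}
```

With the question's numbers, num_work_groups(256 * 256, 256) gives 256 work-groups needing 256 KB of local memory in total, while max_resident_per_cu(32 * 1024, 1024) gives at most 32 per compute unit. Across 10 compute units that is up to 320 slots by local memory alone, so local memory is not what forces work-groups to wait here; on a single compute unit the same 256 work-groups would need 8 rounds.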
Hope this helps