I’m writing an algorithm in OpenCL that needs a temporary data structure, used only during execution. It is going to be big enough to exceed most devices’ local or private memory, so I have to use global memory for this data.
I read about the different memory types in OpenCL and I know that accessing global memory randomly is really slow. In my case, every work group accesses different addresses in the global memory, so in other words, I use the global memory as a kind of local memory.
So what I’m asking myself now is: if the device “knows” that I don’t read data written by another work group / item, can memory access be sped up? What exactly does __constant change in terms of the memory access mechanism? Could I misuse this or a similar keyword? Or is there even a keyword / method for my problem that I overlooked?
Another thing is: the memory for this data structure only has to be allocated in the device’s memory; I do not need to access (or even initialize) it on the host. Is there a more efficient way to do this than sending an uninitialized array to the device? I use QtOpenCL, which lets me pass a host-initialized vector of primitives that is converted (host-)internally to a buffer and sent to the device on kernel invocation. So I’m looking for a QtOpenCL way to do this. AFAIK, only local memory can be allocated from within a kernel. (I get errors when defining an array as __global.)
Thanks in advance!
As the constant buffer size is usually rather small (e.g. 64 kB on my GTX 580, and this size is shared across all work-groups, similarly to the 48 kB of local memory), I don’t think that using the constant buffer would be a solution. I would suggest looking at images, which are often used in optimization examples: they are cached for 2D spatial locality, and the access doesn’t need to be coalesced.
Btw, I have seen somewhere that disabling the L1 cache with some compiler option on NVIDIA can result in slightly better performance when global memory is often accessed randomly.
For the second question: I think that if you don’t pass any of the CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR or CL_MEM_COPY_HOST_PTR flags, the memory on the host is not allocated and therefore nothing can be copied to the device.
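To make the sizing and indexing of such a device-only scratch buffer a bit more concrete, here is a rough sketch in plain C. It only mimics on the host the arithmetic a kernel would do. The function names `scratch_total_bytes` and `scratch_slice_offset` are made up for illustration, and it assumes a 1D launch where each work-group gets one equally sized slice. The total it returns is what you would pass as the size to `clCreateBuffer`, with a NULL host pointer and none of the three flags above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to request for the whole scratch buffer: one slice per work-group.
 * Returns 0 for empty input or if the size would overflow size_t.
 * (Hypothetical helper, not part of OpenCL or QtOpenCL.) */
size_t scratch_total_bytes(size_t num_groups, size_t elems_per_group,
                           size_t elem_size)
{
    if (num_groups == 0 || elems_per_group == 0 || elem_size == 0)
        return 0;
    if (elems_per_group > SIZE_MAX / elem_size)
        return 0;
    size_t slice_bytes = elems_per_group * elem_size;
    if (num_groups > SIZE_MAX / slice_bytes)
        return 0;
    return num_groups * slice_bytes;
}

/* Element index where a work-group's slice starts.
 * Inside a kernel the same computation would look roughly like
 *   __global float *mine = scratch + get_group_id(0) * elems_per_group;
 * assuming a 1D NDRange and a float scratch buffer. */
size_t scratch_slice_offset(size_t group_id, size_t num_groups,
                            size_t elems_per_group)
{
    assert(group_id < num_groups); /* a group outside the launch has no slice */
    return group_id * elems_per_group;
}
```

Because every work-group only touches the range starting at its own offset, the slices never overlap, which is the "global memory used as a kind of local memory" pattern from the question. It does not make the accesses any faster by itself, though.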