I have a vector on the host and I want to halve it and send one half to the device. A benchmark shows that CL_MEM_ALLOC_HOST_PTR is faster than CL_MEM_USE_HOST_PTR and much faster than CL_MEM_COPY_HOST_PTR. Memory analysis on the device also shows no difference in the size of the buffer created there. This differs from the documentation of the mentioned flags on the Khronos clCreateBuffer page. Does anyone know what’s going on?
First off, and if I understand you correctly, clCreateSubBuffer is probably not what you want, as it creates a sub-buffer from an existing OpenCL buffer object. The documentation you linked also tells us that:
You said you have a vector on the host and want to send half of it to the device. For this, I would use a regular buffer of half the vector’s size (in bytes) on the device.
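As a rough illustration of the "regular buffer of half the size" idea, here is a minimal sketch in C. The names `host_slice`, `first_half`, `upload_first_half` and the `USE_OPENCL` macro are made up for this example, and it assumes the vector holds `float` elements and that the first half is the one you want. The OpenCL part is guarded so the size arithmetic can be checked without an OpenCL SDK installed; on macOS the header is `<OpenCL/opencl.h>` instead of `<CL/cl.h>`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: a contiguous slice of a host array,
 * described by its start pointer and its length in bytes. */
typedef struct {
    const float *ptr;
    size_t bytes;
} host_slice;

/* Returns the first half of a host vector of `count` floats.
 * For an odd count the half is rounded down. */
static host_slice first_half(const float *data, size_t count) {
    assert(data != NULL);
    host_slice s = { data, (count / 2) * sizeof(float) };
    return s;
}

#ifdef USE_OPENCL /* hypothetical macro: define it when building against OpenCL */
#include <CL/cl.h>

/* Creates a regular (non-sub) buffer sized for half the vector and
 * copies that half from the host at creation time.
 * Note: clCreateBuffer fails with CL_INVALID_BUFFER_SIZE if the size is 0,
 * so `count` should be at least 2. */
static cl_mem upload_first_half(cl_context ctx, const float *data,
                                size_t count, cl_int *err) {
    host_slice s = first_half(data, count);
    return clCreateBuffer(ctx,
                          CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          s.bytes,
                          (void *)s.ptr, /* host_ptr is not const in the API */
                          err);
}
#endif
```

To send the second half instead, the same approach should work with `data + count / 2` as the start pointer and `(count - count / 2) * sizeof(float)` as the size. Swapping `CL_MEM_COPY_HOST_PTR` for another flag changes where the memory lives, which is what the rest of this answer is about.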
Then, with a regular buffer, the performance you see is expected.
CL_MEM_ALLOC_HOST_PTR only allocates memory on the host, which does not incur any transfer at all: it is like doing a malloc and not filling the memory.
CL_MEM_COPY_HOST_PTR will allocate a buffer on the device, most probably in the GPU's RAM, and then copy your whole host buffer over to the device memory.
CL_MEM_USE_HOST_PTR most likely allocates so-called page-locked or pinned memory. This kind of memory is the fastest for host->GPU memory transfer and this is the recommended way to do the copy.
To read how to correctly use pinned memory on NVIDIA devices, refer to chapter 3.1.1 of NVIDIA’s OpenCL best practices guide. Note that if you use too much pinned memory, performance may drop below that of host-copied memory.
The reason why pinned memory is faster than copied device memory is well explained in this SO question as well as this forum thread it points to.