While trying to improve the performance of an OpenCL computation, I used the OpenCL runtime's profiling features on a clEnqueueWriteBuffer call and on the clEnqueueNDRangeKernel call that immediately follows it (and depends on that data transfer):
clEnqueueWriteBuffer(cmdq, cl_buf, CL_FALSE, 0, size, data, 0, NULL, &write_ev);
clEnqueueNDRangeKernel(cmdq, ker_with_cl_buf_as_input_param, 2, NULL,
work_sze, local_sze, 1, &write_ev, &ker_ev);
Here is what clGetEventProfilingInfo returns (I subtracted the initial time and converted to microseconds):
            QUEUED    SUBMIT     START       END   END-START
write_ev         0   113.952   120.448   211.136      90.688
ker_ev     130.016   132.608   217.280   515.200     297.920
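For reference, here is a minimal sketch of how figures like these can be derived from the raw counters. clGetEventProfilingInfo reports each timestamp as a 64-bit unsigned nanosecond value (cl_ulong); the helper name rel_us and the ns_t typedef are hypothetical, and unsigned long long stands in for cl_ulong so that the snippet compiles without the OpenCL headers.

```c
#include <assert.h>

/* Stand-in for cl_ulong: profiling counters are 64-bit unsigned
   nanosecond values. (Assumption: used here only so the sketch
   builds without the OpenCL headers.) */
typedef unsigned long long ns_t;

/* Hypothetical helper: microseconds elapsed from a reference
   timestamp t0 to a later timestamp t, both in nanoseconds. */
static double rel_us(ns_t t, ns_t t0)
{
    /* Unsigned subtraction would wrap around if t were earlier than t0. */
    assert(t >= t0);
    return (double)(t - t0) / 1000.0;
}
```

With the write's QUEUED counter as t0, rel_us(submit, t0) gives the SUBMIT column, and rel_us(end, start) gives the END-START column.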
My questions are:
- Why does clEnqueueWriteBuffer not return before the memory transfer has been submitted or started?
- More importantly, why does it take so long for the transfer to actually be submitted?
It seems to me that about 22% of the total time (113.952 of 515.200 microseconds) could be saved if the memory transfer started immediately.
Does clEnqueueWriteBuffer copy the data to another host memory region before actually doing the transfer?
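The 22% figure is the write's SUBMIT time divided by the kernel's END time, both taken from the table. A small sketch of that arithmetic follows; the helper name submit_delay_share is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: share of the end-to-end time that elapsed
   before the write was submitted. Both arguments are microseconds
   measured from the same origin (the write's QUEUED timestamp). */
static double submit_delay_share(double submit_us, double total_us)
{
    assert(total_us > 0.0);
    return submit_us / total_us;
}
```

Calling submit_delay_share(113.952, 515.200) yields roughly 0.22. This is an upper bound on the possible saving, since it assumes the submission delay could be removed entirely.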
Additional information:
I use the CUDA 4.1 framework on a Tesla M2090 GPU.
The buffer is created previously using:
cl_buf = clCreateBuffer(my_context, CL_MEM_READ_ONLY, size, NULL, NULL);
EDIT: clEnqueueReadBuffer does not exhibit such behavior.
You can try using pinned (page-locked) host memory, as described in Section 3.1.1 of the NVIDIA OpenCL Best Practices Guide.
The guide does not say whether an extra copy is performed when pageable memory is used, but that may be what happens here.