While trying to improve the performance of some OpenCL computation, I used the profiling

Question

Asked: June 12, 2026

While trying to improve the performance of some OpenCL computation, I used the profiling features of the OpenCL runtime on a clEnqueueWriteBuffer call and on the immediately following clEnqueueNDRangeKernel call (which depends on the preceding data transfer):

clEnqueueWriteBuffer(cmdq, cl_buf, CL_FALSE, 0, size, data, 0, NULL, &write_ev);
clEnqueueNDRangeKernel(cmdq, ker_with_cl_buf_as_input_param, 2, NULL,
    work_sze, local_sze, 1, &write_ev, &ker_ev);

Here is what clGetEventProfilingInfo returns (I subtracted the initial time and converted to microseconds):

           QUEUED   SUBMIT    START      END   END-START
write_ev        0  113.952  120.448  211.136      90.688
ker_ev    130.016  132.608  217.280  515.200     297.920

My questions are:

Why does clEnqueueWriteBuffer not return before the memory transfer has been started or submitted?
More importantly, why does it take so long for the transfer to actually be submitted?

It seems to me that about 22% of the total time could be saved if the memory transfer could start immediately.
Does clEnqueueWriteBuffer copy the data to another host memory region before actually doing the transfer?
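
For reference, the 22% estimate can be reproduced from the table with a few lines of C. This is only a sketch: the names event_times_us, submit_delay_us, exec_time_us and delay_fraction are made up for illustration, and the constants are simply the offsets from the table above (microseconds relative to the first QUEUED timestamp).

```c
#include <assert.h>

/* Profiling timestamps of one command, in microseconds,
 * relative to the QUEUED time of the first command. */
typedef struct {
    double queued, submit, start, end;
} event_times_us;

/* Values copied from the table in the question. */
const event_times_us WRITE_EV = {   0.000, 113.952, 120.448, 211.136 };
const event_times_us KER_EV   = { 130.016, 132.608, 217.280, 515.200 };

/* Time a command waits between being queued and being submitted. */
double submit_delay_us(event_times_us e)
{
    return e.submit - e.queued;
}

/* Time the command actually spends executing (END - START). */
double exec_time_us(event_times_us e)
{
    return e.end - e.start;
}

/* Share of the whole sequence (first QUEUED to last END) that is
 * spent waiting for the first command to be submitted. */
double delay_fraction(event_times_us first, event_times_us last)
{
    double total = last.end - first.queued;
    assert(total > 0.0);
    return submit_delay_us(first) / total;
}
```

With the table values, delay_fraction(WRITE_EV, KER_EV) is 113.952 / 515.200, which is where the "about 22%" comes from.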

Additional information:

I use the CUDA 4.1 toolkit on a Tesla M2090 GPU.

The buffer is created previously using:

cl_buf = clCreateBuffer(my_context, CL_MEM_READ_ONLY, size, NULL, NULL);

EDIT: clEnqueueReadBuffer does not exhibit such behavior.

1 Answer

Editorial Team · Answer 1 · 2026-06-12T15:58:37+00:00

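You can try using pinned (page-locked) host memory, as described in Section 3.1.1 of the NVIDIA OpenCL Best Practices Guide.

The guide does not say whether the runtime makes an extra copy when the source is pageable memory, but that may be what happens, and it could account for the delay before the transfer is submitted.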
