I’m working in an algorithm using OpenCL and I need to measure the execution

Asked: June 17, 2026 (2026-06-17T13:34:05+00:00)

I’m working on an algorithm using OpenCL and I need to measure its execution time in both its parallel and sequential versions. To do this, I use an external loop to iterate each version and measure the time, and I obtained:

Sequential: 3.06 s

Parallel: 269 s

The code that I’m using for the parallel version is:

t_start = clock();                 /* Start measuring time */

for (i = 0; i <= N; i++) // N is really big, around a million, but is the same for both versions
{
    fitness = 0;

    /* Launch the kernel once per iteration */
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, NULL, 0, NULL, NULL);

    /* Two blocking reads (CL_TRUE) per iteration: each waits for the kernel to finish */
    ret = clEnqueueReadBuffer(command_queue, vdistance, CL_TRUE, 0, siz_mem_distance_code, distance_code, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vsumatorio, CL_TRUE, 0, siz_mem_sumatorio, sumatorio, 0, NULL, NULL);

    fitness = (1/(*sumatorio)) + (*distance_code/12) + ((pow(*distance_code,2))/4) + ((pow(*distance_code,3))/6);
}

t_finish = clock();                /* End measuring time */

Before this piece of code, I create and initialize everything needed to run an OpenCL program (platform, device, context, queue, buffers, kernel, …), and after it I release everything.
I have checked that the increase in time comes from reading both variables (distance_code and sumatorio) in every iteration, but I have to do that because the fitness value is computed sequentially and can only be calculated once the kernel has finished. So, could you help me? What am I doing wrong?

I hope to have explained myself properly, thanks in advance.

Note: I’m only working with the CPU.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T13:34:06+00:00

The overhead of launching so many kernels exceeds the benefit of parallelizing a for loop over only 64 data items. You need to restructure your problem so that you launch relatively few kernels, each over a large batch of data. In that case, and provided the OpenCL compiler generates suitable vectorized machine code, you should see an improvement over the sequential version.
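As a rough illustration of the host side of such a batched design, here is a minimal sketch in plain C. It assumes the kernel has been rewritten to store one sumatorio value and one distance_code value per candidate into arrays, so that a single clEnqueueNDRangeKernel call and one blocking clEnqueueReadBuffer per buffer cover the whole batch. The float element type and the helper names fitness_of and fitness_batch are assumptions made for this example, not part of the original code.

```c
#include <stddef.h>

/* Fitness formula taken from the question.
   Assumption: both inputs are floats; the real types are not shown there. */
static double fitness_of(float sumatorio, float distance_code)
{
    double d = distance_code;
    return 1.0 / sumatorio + d / 12.0 + (d * d) / 4.0 + (d * d * d) / 6.0;
}

/* Hypothetical helper: runs after ONE kernel launch and ONE blocking read
   per buffer have filled sumatorio[] and distance_code[] for the batch.
   The sequential fitness step then becomes a plain host loop. */
static void fitness_batch(const float *sumatorio, const float *distance_code,
                          double *fitness, size_t batch)
{
    for (size_t i = 0; i < batch; i++)
        fitness[i] = fitness_of(sumatorio[i], distance_code[i]);
}
```

With this layout the launch and read overhead is paid once per batch rather than once per iteration. Whether it applies directly depends on whether the iterations are independent of each other, which the question does not say.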

Additionally, you should check with either AMD’s CodeXL or Intel’s Offline Compiler whether the generated code contains any vector instructions.
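One further point that may be worth checking is the timer itself. On POSIX systems clock() reports processor time consumed by the whole process, so when the OpenCL CPU runtime uses several worker threads the parallel figure can include the time of all of them rather than elapsed time. A minimal sketch of a wall-clock timer using C11 timespec_get follows; the helper name wall_seconds is an assumption made for this example.

```c
#include <time.h>

/* Wall-clock time in seconds, read through C11 timespec_get.
   Unlike clock(), this does not add up CPU time across threads. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

To use it, store wall_seconds() in a double before the loop and again after it, in place of the two clock() calls, and subtract the two values. This does not remove the launch and read overhead described above; it only makes the two measurements comparable.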
