I’m working on an algorithm using OpenCL and I need to measure the execution time of its parallel and sequential versions. To do this, I run each version inside an outer loop and time the whole loop. These are the results I obtained:
Sequential: 3.06 s
Parallel: 269 s
The code that I’m using for the parallel version is:
t_start = clock(); /* Start measuring time */
for (i = 0; i <= N; i++) // N is really big, around a million, but is the same for both versions
{
    fitness = 0;
    ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, NULL, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vdistance, CL_TRUE, 0, siz_mem_distance_code, distance_code, 0, NULL, NULL);
    ret = clEnqueueReadBuffer(command_queue, vsumatorio, CL_TRUE, 0, siz_mem_sumatorio, sumatorio, 0, NULL, NULL);
    fitness = (1/(*sumatorio)) + (*distance_code/12) + ((pow(*distance_code,2))/4) + ((pow(*distance_code,3))/6);
}
t_finish = clock(); /* End measuring time */
Before this piece of code, I create and initialize everything needed to run an OpenCL program (platform, device, context, queue, buffers, kernel, …), and after it I release everything.
I have checked that the increase in time comes from reading both variables (distance_code and sumatorio) in each iteration, but I have to do that because the fitness value is computed sequentially and can only be calculated once the kernel has finished. Could you help me? What am I doing wrong?
I hope I have explained myself properly. Thanks in advance.
Note: I’m only working with the CPU.
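The overhead of launching so many kernels exceeds the benefit of parallelizing a for loop over only 64 data items. With roughly a million iterations, 269 s works out to about 269 µs per iteration, against about 3 µs per iteration for the sequential version, so the cost of each launch plus two blocking reads is likely what dominates, not the arithmetic. You need to restructure the problem so that you launch relatively few kernels, each over a large batch of data. In that case, and if the OpenCL compiler generates appropriate vectorized machine code, you should see an improvement over the sequential version.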
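As a rough sketch of what the host side could look like after batching: assume the kernel is rewritten so that a single launch evaluates many candidates and writes one sumatorio and one distance_code value per candidate into arrays, which are then read back once per batch. The function name fitness_batch, the count parameter, and the float element type are my assumptions for illustration and are not taken from your code.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the original code): computes the fitness
   of every candidate in a batch on the host.
   sumatorio[k] and distance_code[k] are assumed to hold the per-candidate
   results written by ONE kernel launch and fetched by ONE blocking read per
   buffer, instead of one launch and two reads per candidate. */
void fitness_batch(const float *sumatorio, const float *distance_code,
                   float *fitness, size_t count)
{
    for (size_t k = 0; k < count; k++) {
        float d = distance_code[k];
        /* Same formula as in the question, written with floating-point
           constants and plain multiplications instead of pow(). */
        fitness[k] = 1.0f / sumatorio[k]
                   + d / 12.0f
                   + (d * d) / 4.0f
                   + (d * d * d) / 6.0f;
    }
}
```

With this shape the outer loop runs once per batch, not once per candidate, so the fixed cost of clEnqueueNDRangeKernel and the blocking clEnqueueReadBuffer calls is spread over the whole batch. Whether your algorithm allows this depends on whether one iteration needs the fitness of the previous one.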
Additionally, you should check with either AMD’s CodeXL or Intel’s Offline Compiler whether the generated code contains any vector instructions.