I have simple kernel: __kernel vecadd(__global const float *A, __global const float *B, __global

Question

Asked: June 2, 2026 (2026-06-02T06:02:53+00:00)

I have a simple kernel:

/* Element-wise vector addition: one work-item per element. */
__kernel void vecadd(__global const float *A,
                     __global const float *B,
                     __global float *C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}

Why does the kernel run more than 30% slower when I change float to float4?

All the tutorials say that using vector types speeds up computation…

On the host side, the memory allocated for the float4 arguments is 16-byte aligned, and the global_work_size passed to clEnqueueNDRangeKernel is 4 times smaller.
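
The host-side setup described above could be sketched roughly as follows. This is only an illustration in plain C: the helper names (float4_work_items, alloc_float4_aligned, is_16_byte_aligned) are hypothetical and do not come from the original post, and it assumes the element count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of work-items for the float4 kernel: each one handles 4 floats.
   Hypothetical helper; assumes n_floats is a multiple of 4. */
static size_t float4_work_items(size_t n_floats)
{
    assert(n_floats % 4 == 0);
    return n_floats / 4;
}

/* Allocate a host buffer of n_floats floats aligned to 16 bytes,
   the size of one float4. C11 aligned_alloc requires the total size
   to be a multiple of the alignment, which holds when n_floats % 4 == 0. */
static float *alloc_float4_aligned(size_t n_floats)
{
    assert(n_floats % 4 == 0);
    return aligned_alloc(16, n_floats * sizeof(float));
}

/* Returns 1 if the pointer is 16-byte aligned, 0 otherwise. */
static int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

With 1024*1024 floats this gives 262144 work-items for the float4 version; the buffer returned by alloc_float4_aligned is released with free.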

The kernel runs on an AMD HD5770 GPU with AMD-APP-SDK-v2.6.

Querying the device for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT returns 4.

EDIT:
global_work_size = 1024*1024 (and greater)
local_work_size = 256
Time measured using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
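
To compare the two versions on equal terms, the profiling timestamps can be turned into an elapsed time and an effective memory bandwidth. The sketch below is plain C with hypothetical helper names. It assumes start_ns and end_ns are the values read with clGetEventProfilingInfo for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, which OpenCL reports in nanoseconds, and that end_ns is strictly greater than start_ns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Elapsed kernel time in milliseconds from two profiling timestamps (ns). */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Effective bandwidth in GB/s for vecadd over n_floats elements.
   Each element costs two reads (A, B) and one write (C).
   The caller must ensure end_ns > start_ns to avoid dividing by zero. */
static double vecadd_bandwidth_gbs(size_t n_floats,
                                   uint64_t start_ns, uint64_t end_ns)
{
    double bytes = 3.0 * (double)n_floats * sizeof(float);
    double seconds = (double)(end_ns - start_ns) / 1.0e9;
    return bytes / seconds / 1.0e9;
}
```

Since n_floats is the total number of floats processed, the same value applies to both the float and the float4 run, so the two bandwidth figures are directly comparable.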

For a smaller global_work_size (8192 for float / 2048 for float4), the vectorized version is faster, but I would like to know why.

1 Answer

Editorial Team · Answer 1 · 2026-06-02T06:02:54+00:00

I don't know which tutorials you are referring to, but they must be old.
NVIDIA GPUs have used a scalar architecture since G80 (2006), and AMD moved to one with GCN; older AMD parts such as the HD 5770 are still VLIW-based.
On scalar hardware, vector types are only a syntactic convenience and bring no performance benefit over plain scalar code.
Scalar architectures turned out to suit GPUs better than vector ones because they make it easier to keep the hardware resources busy.
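
To make the "syntactic convenience" point concrete, here is a minimal plain C emulation of the two kernels. It is not OpenCL code and says nothing about speed; the function names are hypothetical, and the loops only stand in for the work-items that clEnqueueNDRangeKernel would launch. It shows that the float4 version performs exactly the same additions, with each work-item covering four consecutive elements.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar kernel body: one work-item, one element. */
static void vecadd_item(const float *A, const float *B, float *C, size_t idx)
{
    C[idx] = A[idx] + B[idx];
}

/* float4 kernel body: one work-item, four consecutive elements.
   In OpenCL C this would be the single statement C[idx] = A[idx] + B[idx]
   on float4 pointers. */
static void vecadd4_item(const float *A, const float *B, float *C, size_t idx)
{
    for (size_t lane = 0; lane < 4; ++lane) {
        size_t i = 4 * idx + lane;
        C[i] = A[i] + B[i];
    }
}

/* Emulate the scalar launch: n work-items. */
static void run_scalar(const float *A, const float *B, float *C, size_t n)
{
    for (size_t idx = 0; idx < n; ++idx)
        vecadd_item(A, B, C, idx);
}

/* Emulate the float4 launch: n / 4 work-items (n must be a multiple of 4). */
static void run_float4(const float *A, const float *B, float *C, size_t n)
{
    assert(n % 4 == 0);
    for (size_t idx = 0; idx < n / 4; ++idx)
        vecadd4_item(A, B, C, idx);
}
```

Both run_scalar and run_float4 fill C with identical values for the same inputs; only the number of work-items and the amount of work per item differ.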
