Consider a simple example: vector addition. If I build a program for CL_DEVICE_TYPE_GPU, and

Question

0

Asked: May 29, 20262026-05-29T15:51:27+00:00 2026-05-29T15:51:27+00:00

Consider a simple example: vector addition. If I build a program for CL_DEVICE_TYPE_GPU, and

0

Consider a simple example: vector addition.

If I build a program for CL_DEVICE_TYPE_GPU, and I build the same program for CL_DEVICE_TYPE_CPU, what is the difference between them(except that “CPU program” is running on CPU, and “GPU program” is running on GPU)?

Thanks for your help.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-29T15:51:28+00:00

There are a few differences between the device types. The simple answer to your vector question is: Use a gpu for large vectors, and cpu for smaller workloads.

1) Memory copying. GPUs rely on the data you are working on to be passed into them, and the results are later read back to the host. This is done over PCI-e, which yields about 5GB/s for version 2.0 / 2.1. CPUs can use buffers ‘in place’ – in DDR3 – using either of the CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR flags. See here: clCreateBuffer. This is one of the big bottlenecks for many kernels.

2) Clock speed. cpus currently have a big lead over gpus in clock speed. 2Ghz on the low end for most cpus, vs 1Ghz as a top end for most gpus these days. This is one factor that really helps the cpu ‘win’ over a gpu for small workloads.

3) Concurrent ‘threads’. High-end gpus usually have more compute units than their cpu counterparts. For example, the 6970 gpu (Cayman) has 24 opencl compute units, each of these is divided into 16 SIMD units. Most of the top desktop cpus have 8 cores, and server cpus currently stop at 16 cores. (cpu cores map 1:1 to compute unit count) A compute unit in opencl is a portion of the device which can do work that is different from the rest of the device.

4) Thread types. gpus have a SIMD architecture, with many graphic-oriented instructions. cpus have a lot of their area dedicated to branch prediction and general computations. A cpu may have a SIMD unit and/or floating point unit in every core, but the Cayman chip I mentioned above has 1536 units with the gpu instruction set available to each one. AMD calls them stream processors, and there are 4 in each of the SIMD units mentioned above (24x16x4 = 1536). No cpu will have that many sin(x) or dot-product-capable units unless the manufacturer wants to cut out some cache memory or branch prediction hardware. The SIMD layout of the gpus is probably the largest ‘win’ for large vector addition situations. That the also do other specialized functions is a big bonus.

5) Memory Bandwidth. cpus with DDR3: ~17GB/s. High-end gpus >100GB/s, speeds of over 200GB/s are becoming common lately. If your algorithm is not PCI-e limited (see #1), the gpu will outpace the cpu in raw memory access. The scheduling units in a gpu can hide memory latency further by running only tasks that aren’t waiting on memory access. AMD calls this a wavefront, Nvidia calls it a warp. cpus have a large and complicated caching system to help hide their memory access times in the case where the program is reusing the data. For your vector add problem, you will likely be limited more by the PCI-e bus since the vectors are generally used only once or twice each.

6) Power efficiency. A gpu (used properly) will usually be more electrically efficient than a cpu. Because cpus dominate in clock speed, one of the only ways to really reduce power consumption is to down-clock the chip. This obviously leads to longer compute times. Many of the top systems on the Green 500 list are heavily gpu accelerated. see here: green500.org

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Consider a simple example: vector addition. If I build a program for CL_DEVICE_TYPE_GPU, and

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply