I’m writing a theoretical assignment on the possibilities of heterogeneous computing.
I need to compare how efficiently a single, non-parallelizable thread runs when executed serially on the CPU versus on the GPU.
I know it’s an odd question, since it doesn’t make sense to execute a single thread on the GPU, but I could really use a rough guideline ratio for a heuristic I am developing.
I know that it could easily be tested, but I don’t have any practical experience with either CUDA or OpenCL, and I’m in a hurry.
GPU execution units tend to be in-order, and (in the case of nVidia GPUs at least) you typically get only one instruction per 4 clocks in a single-threaded context. Compare this with modern superscalar CPUs, where you can typically get a throughput of more than 1 instruction per clock, and the CPU wins by a factor of 4 or more on a clock-for-clock basis. CPU clock frequencies also tend to be much higher than GPU clocks, so there could easily be a further factor of 3 from clock speed, taking the CPU to 12x or more relative to the GPU.
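If it helps to plug this into your heuristic, the arithmetic above can be written as a small Python sketch. The function name, parameter names, and default values are my own illustration, taken from the rough figures in this answer (1 instruction per 4 GPU clocks, about 1 instruction per CPU clock, about 3x clock ratio); they are ballpark assumptions, not measurements of any particular hardware.

```python
def cpu_over_gpu_speedup(gpu_clocks_per_instr=4.0,
                         cpu_instr_per_clock=1.0,
                         clock_ratio=3.0):
    """Rough estimate of how much faster one serial thread runs on a CPU.

    All defaults are ballpark assumptions, not measured values:
      gpu_clocks_per_instr: GPU clocks needed per instruction for one thread
      cpu_instr_per_clock:  CPU instructions retired per clock (lower bound)
      clock_ratio:          CPU clock frequency divided by GPU clock frequency
    """
    # Instructions per clock for a single GPU thread.
    gpu_instr_per_clock = 1.0 / gpu_clocks_per_instr
    # Advantage of the CPU on a clock-for-clock basis.
    per_clock_advantage = cpu_instr_per_clock / gpu_instr_per_clock
    # Scale by the difference in clock frequency.
    return per_clock_advantage * clock_ratio


print(cpu_over_gpu_speedup())  # 12.0 with the default assumptions
```

Since a superscalar CPU usually sustains more than 1 instruction per clock, passing a larger `cpu_instr_per_clock` raises the estimate, which is why 12x should be read as a lower-end guideline rather than a fixed ratio.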