I googled around a bit, but it is still not clear to me whether GPUs programmed with CUDA can use instructions similar to those from the SSE SIMD extensions; for instance, whether we can sum two vectors of double-precision floats, each with 4 values. If so, I wonder whether it would be better to use SIMD or to use a separate lightweight thread for each of the 4 values of the vector.
CUDA programs compile to the PTX instruction set. That instruction set has no direct equivalent of the SSE instructions that operate on four packed floats or doubles at once. So, CUDA programs cannot make explicit use of SSE-style SIMD for a sum like this.
However, the whole idea of CUDA is to do SIMD on a grand scale. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions (although some of the instructions may be suppressed for some threads, giving the illusion of different execution sequences). NVIDIA calls it Single Instruction, Multiple Thread (SIMT), but it’s essentially SIMD.
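To make the "one lightweight thread per value" option concrete, here is a minimal sketch of the 4-element double-precision sum from the question. The kernel name `addVectors`, the variable names, and the launch configuration of one block of 32 threads are assumptions chosen for illustration, not anything prescribed by CUDA; error handling is kept to a single check for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds exactly one pair of elements.
// Threads in the same warp execute this code in lockstep, so the hardware
// performs the element-wise sum in a SIMD-like fashion.
__global__ void addVectors(const double* a, const double* b, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // threads with i >= n do no work
        out[i] = a[i] + b[i];
    }
}

int main()
{
    const int n = 4;
    const size_t bytes = n * sizeof(double);
    const double hostA[n] = {1.0, 2.0, 3.0, 4.0};
    const double hostB[n] = {10.0, 20.0, 30.0, 40.0};
    double hostOut[n] = {0.0, 0.0, 0.0, 0.0};

    double *devA = nullptr, *devB = nullptr, *devOut = nullptr;
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    cudaMalloc(&devOut, bytes);

    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

    // One block of 32 threads (assumed launch shape); only the first n threads pass the bounds check.
    addVectors<<<1, 32>>>(devA, devB, devOut, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t status = cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        return 1;
    }

    for (int i = 0; i < n; ++i) {
        printf("out[%d] = %g\n", i, hostOut[i]);
    }

    cudaFree(devA);
    cudaFree(devB);
    cudaFree(devOut);
    return 0;
}
```

Four elements are far too few to make a GPU worthwhile in practice; the same kernel is normally launched over many thousands of elements, with the number of blocks chosen to cover `n`.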