Why is the GPU more performant in numeric calculations than the CPU? And worse at branching? Can someone give me a detailed explanation of it?
Each SM in a GPU is a SIMD processor that executes the threads of a warp in lockstep, one thread per SIMD lane. When an application is compute-bound (few memory accesses) and has no divergent branches, it can approach the peak FLOPS of the GPU. Branching hurts because, when the threads of a warp take different paths, the GPU masks off the lanes on one side of the divergence and executes the other side first. The two paths are executed serially, so some SIMD lanes sit inactive on each path, and performance drops accordingly.
I’ve included a useful figure from Fung’s paper, which is publicly available at the reference mentioned, to show how performance actually drops:
Figure (a) shows a typical branch divergence inside a warp (4 threads in this example). Suppose you have the following kernel code:
Threads at A diverge into B and F. As shown in (b), some of the SIMD lanes are disabled over time, which lowers performance. Figures (c) to (e) show how the hardware executes the diverging paths serially and manages the divergence. For more information, refer to this paper, which is a great starting point.
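The masking behaviour described above can be mimicked in a few lines of Python. This is only a rough sketch, not the kernel from the paper: it assumes a single `if/else` at block A with sides B and F, a hypothetical reconvergence block called `join`, and it counts how many lane-slots do useful work. The function names `run_warp` and `utilization` are made up for illustration.

```python
def run_warp(conditions):
    """Simulate one warp executing `if cond: B else: F` with lane masking.

    conditions: one boolean per thread (lane) in the warp.
    Returns a list of (block_name, active_mask) steps in issue order.
    """
    warp_size = len(conditions)
    all_on = [True] * warp_size
    taken = list(conditions)
    not_taken = [not c for c in conditions]

    steps = [("A", all_on)]  # block A: every lane evaluates the branch
    # A side of the branch is issued only if at least one lane needs it.
    # Lanes that do not need it are masked off but still occupy the slot.
    if any(taken):
        steps.append(("B", taken))
    if any(not_taken):
        steps.append(("F", not_taken))
    steps.append(("join", all_on))  # hypothetical reconvergence point
    return steps


def utilization(steps):
    """Fraction of lane-slots that did useful work over all issued steps."""
    active = sum(sum(mask) for _, mask in steps)
    total = sum(len(mask) for _, mask in steps)
    return active / total


divergent = run_warp([True, True, False, True])  # 3 lanes go to B, 1 to F
uniform = run_warp([True, True, True, True])     # all lanes go to B

print(utilization(divergent))  # 0.75
print(utilization(uniform))    # 1.0
```

With a 4-thread warp where one thread takes the other side, the warp issues both B and F, so 4 of the 16 lane-slots are wasted. When all threads agree, only one side is issued and every lane stays busy.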
Compute-bound applications such as matrix multiplication or N-body simulation map well to GPUs and achieve very high performance. This is because they keep the SIMD lanes well occupied, follow a streaming model, and perform few memory accesses relative to the amount of computation.
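To put a rough number on "few memory accesses relative to the amount of computation", one can estimate FLOPs per byte moved. The sketch below is a back-of-the-envelope model, not a measurement: it assumes square n x n matrices of 4-byte floats and ideal reuse (each matrix element crosses the memory bus only once), and compares the result with an element-wise vector add. Both function names are hypothetical.

```python
def matmul_intensity(n, bytes_per_element=4):
    """FLOPs per byte for an n x n matrix multiply, assuming ideal reuse."""
    flops = 2 * n**3                            # one multiply + one add per inner step
    bytes_moved = 3 * n**2 * bytes_per_element  # read A, read B, write C once each
    return flops / bytes_moved


def vector_add_intensity(n, bytes_per_element=4):
    """FLOPs per byte for an element-wise add of two length-n vectors."""
    flops = n                                   # one add per element
    bytes_moved = 3 * n * bytes_per_element     # read x, read y, write z
    return flops / bytes_moved


print(matmul_intensity(1024))      # n / 6, about 170.7 FLOPs per byte
print(vector_add_intensity(1024))  # 1 / 12, about 0.083 FLOPs per byte
```

Under these assumptions the ratio for matrix multiply grows with n, while for vector add it stays constant, which is why the former can keep the arithmetic units busy and the latter tends to be limited by memory bandwidth.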