I understand that branching in CUDA is not recommended as it can adversely affect performance. In my work, I find myself having to implement large switch statements that contain upward of a few dozen cases.
Does anyone have any idea how badly this will affect performance? (The official documentation isn't very specific.) Also, does anyone have a more efficient way of handling this part of the code?
A good way to avoid a large switch is to implement a function table and select the function from the table by an index derived from your switch condition. CUDA allows you to call function pointers to `__device__` functions in kernels.
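A minimal sketch of the function-table idea, assuming every case can be written as a `__device__` function with the same signature (here `float(float)`). The four ops, `d_ops`, `kNumOps`, `dispatch_kernel` and `run_dispatch` are hypothetical names for illustration only. Calling through device function pointers requires compute capability 2.0 or later. The indirect call still serializes when threads in the same warp pick different entries, and it generally prevents the compiler from inlining the callee, so it is worth profiling this against the plain switch for your workload.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Common signature for every table entry (hypothetical example ops).
typedef float (*op_t)(float);

__device__ float op_identity(float x) { return x; }
__device__ float op_negate(float x)   { return -x; }
__device__ float op_double(float x)   { return 2.0f * x; }
__device__ float op_square(float x)   { return x * x; }

constexpr int kNumOps = 4;

// Function table in device memory: entry k replaces "case k:" of the switch.
__device__ op_t d_ops[kNumOps] = { op_identity, op_negate, op_double, op_square };

__global__ void dispatch_kernel(const int* op_idx, const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int k = op_idx[i];
        // Out-of-range selectors fall through unchanged, like a switch's default case.
        out[i] = (k >= 0 && k < kNumOps) ? d_ops[k](in[i]) : in[i];
    }
}

// Host helper: copies inputs to the GPU, launches the kernel, copies results back.
// Allocation errors are not checked, for brevity; launch and copy errors are returned.
cudaError_t run_dispatch(const int* h_idx, const float* h_in, float* h_out, int n)
{
    assert(h_idx && h_in && h_out);
    if (n <= 0) return cudaSuccess;  // nothing to do; avoids a zero-block launch

    int* d_idx = nullptr;
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    dispatch_kernel<<<blocks, threads>>>(d_idx, d_in, d_out, n);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)  // blocking copy on the default stream waits for the kernel
        err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_idx);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

For example, calling `run_dispatch` with selectors `{0, 1, 2, 3, 7}` and every input equal to `3.0f` should yield `{3, -3, 6, 9, 3}`; the last selector is out of range and takes the default path. Adding a case then means adding one function and one table entry instead of another `case` label.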