I think my kernel is memory bound (because most GPGPU code is memory bound), but I don’t actually know for sure. How can I find out for myself? Presumably one has to use the Visual Profiler, since the answer depends on the GPU being used.
If this is explained in the CUDA Programming Guide or in other NVIDIA documentation, don’t hesitate to just post a link with a page number, so I can read up on it myself.
Clarification
I would prefer a general “rule” for determining the limiting factor, but for my specific case you can find details about my kernel here: Using `overlap`, `kernel time` and `utilization` to optimize one's kernels
This presentation from NVIDIA talks about selectively disabling memory accesses and arithmetic in your kernel by modifying your source code, in order to determine if one of them is limiting your performance.
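As a rough sketch of that approach (not the code from the presentation), you could build a memory-only and a math-only variant of the kernel and time all three with CUDA events. Everything here is a hypothetical stand-in for your real code: `full_kernel`, `mem_only_kernel`, `math_only_kernel`, `heavy_math`, `time_ms`, and the sizes are made up for illustration. The math-only variant guards its store with a runtime value that is always zero, so the compiler cannot discard the arithmetic as dead code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the arithmetic done by the real kernel.
__device__ float heavy_math(float x, int iters) {
    for (int i = 0; i < iters; ++i)
        x = x * 1.0001f + 0.5f;
    return x;
}

// Full kernel: load, compute, store.
__global__ void full_kernel(const float* in, float* out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = heavy_math(in[i], iters);
}

// Memory-only variant: same loads and stores, arithmetic removed.
__global__ void mem_only_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Math-only variant: no loads. The store depends on the computed value and
// on a runtime flag (passed as 0), so it never executes, but the compiler
// still has to keep the arithmetic.
__global__ void math_only_kernel(float* out, int n, int iters, float flag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = heavy_math((float)i, iters);
        if (v * flag == 1.0f) out[i] = v;
    }
}

// Times one launch with CUDA events, after one untimed warm-up launch.
template <typename F>
float time_ms(F launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();  // warm-up
    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24, iters = 64, threads = 256;  // assumed sizes
    const int blocks = (n + threads - 1) / threads;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    float t_full = time_ms([&] { full_kernel<<<blocks, threads>>>(in, out, n, iters); });
    float t_mem  = time_ms([&] { mem_only_kernel<<<blocks, threads>>>(in, out, n); });
    float t_math = time_ms([&] { math_only_kernel<<<blocks, threads>>>(out, n, iters, 0.0f); });

    printf("full: %.3f ms  mem-only: %.3f ms  math-only: %.3f ms\n",
           t_full, t_mem, t_math);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Roughly: if the memory-only time is close to the full time, the kernel is probably memory bound; if the math-only time is close to the full time, it is probably compute bound; if the full time is close to the sum of the two, memory accesses and arithmetic are not overlapping well. Keep in mind that stripping code can change register usage and therefore occupancy, so treat the numbers as an indication and cross-check them with the profiler.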