I got some information from the CUDA Profiler, and I am confused about why
Replayed Instructions != Global memory replay + Local memory replay + Shared bank conflict replay?
Here is the information I got from the profiler:
Replayed Instructions(%): 81.60
Global memory replay(%): 21.80
Local memory replays(%): 0.00
Shared bank conflict replay(%): 0.00
Could you help me understand this? Are there other cases that cause instruction replay?
Because the SM can replay instructions due to other factors as well, such as divergent branching.
So I would guess that roughly 60% of your instructions are being reissued due to branching and about 20% due to global memory. Can you post a snippet?
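For reference, here is a quick sketch of where the "roughly 60%" figure comes from. It assumes the profiler percentages share the same base and can simply be subtracted, which the profiler output itself does not confirm. The variable and function names are made up for illustration.

```python
# Percentages copied from the profiler output quoted in the question.
replayed_instructions = 81.60
attributed = {
    "global memory replay": 21.80,
    "local memory replay": 0.00,
    "shared bank conflict replay": 0.00,
}


def unattributed_replay(total, parts):
    """Return the share of replays not covered by the listed counters.

    Assumption: all percentages are relative to the same instruction count,
    so they can be subtracted directly.
    """
    return round(total - sum(parts.values()), 2)


remainder = unattributed_replay(replayed_instructions, attributed)
print(f"Replay not explained by the three memory counters: {remainder:.2f}%")
```

Whatever is left over after subtracting the three memory counters has to come from some other replay cause, and that remainder is the part worth investigating in the kernel.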
From the F1 Help menu of the CUDA 4.0 profiler: