There used to be profiling counters in cudaprof for global memory (gst_coherent, gst_incoherent, gld_coherent, gld_incoherent) that were useful and clear to me because they told me how many uncoalesced global reads and writes I had.
Now, there seems to be only “gst requests” and “gld requests”. These are the total loads/stores per warp on mp 0. How do I determine if I have uncoalesced reads/writes? I’m guessing that there would be fewer requests if the requests were coalesced. Am I supposed to figure out how many I expect per thread and compare? Unfortunately, my kernel is too dynamic for that.
The coherent/incoherent counters are relevant on sm_10/sm_11 devices, where accesses have to be aligned and coalesced to avoid pathological performance. On sm_12 and sm_13 the hardware attempts to coalesce accesses into segment transactions wherever possible, and on sm_2x the L1 cache provides a similar function, with the additional benefit of caching for the cases where coalescing is not possible.
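To get a feel for what "coalescing into segment transactions" means, here is a rough host-side sketch. It is a simplified model, not the exact hardware rule: it assumes one transaction per distinct aligned segment touched by a group of threads, and it ignores details such as half-warp grouping and transaction-size reduction on sm_12/sm_13. The names `count_segments` and `strided_addresses` are made up for illustration, and the segment size is a parameter you would choose for your device.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Simplified model (an assumption, not the exact hardware behaviour):
// one transaction is counted for every distinct aligned segment that
// the given group of thread addresses touches.
std::size_t count_segments(const std::vector<std::uint64_t>& byte_addresses,
                           std::uint64_t segment_bytes) {
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : byte_addresses) {
        segments.insert(addr / segment_bytes);  // index of the aligned segment
    }
    return segments.size();
}

// Hypothetical helper: byte addresses for `threads` threads that each access
// one 4-byte word, with `stride_words` words between neighbouring threads.
std::vector<std::uint64_t> strided_addresses(std::uint64_t base_byte,
                                             std::size_t threads,
                                             std::size_t stride_words) {
    std::vector<std::uint64_t> addresses;
    addresses.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        addresses.push_back(base_byte + 4ull * stride_words * t);
    }
    return addresses;
}
```

Under this model, 32 threads reading consecutive 4-byte words from an aligned base touch a single 128-byte segment, while the same threads with a stride of 32 words touch 32 separate segments. Comparing such a count against the ideal one gives a rough idea of how far an access pattern is from fully coalesced.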
Ideally you would have a feel for how much data you are reading and writing and compare this with the achieved performance; this will give you an idea of the efficiency. However, given that your kernel is very data-dependent, you should take a look at a couple of the presentations from GTC 2010 to understand the other information that is available in the profiler. I’d recommend the Fundamental Performance Optimizations for GPUs talk and, more importantly but following on from the first one, the Analysis-Driven Performance Optimization talk.
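The comparison between data volume and achieved performance can be sketched as a small calculation. This is only an illustration under stated assumptions: the function names are made up, the byte counts are whatever you estimate your kernel needs to move, the elapsed time comes from your own timing of the kernel, and the peak bandwidth is a figure you look up for your device.

```cpp
#include <cstdint>

// Effective bandwidth in GB/s (1 GB = 1e9 bytes): the bytes the kernel
// needs to read and write, divided by the measured kernel time.
double effective_bandwidth_gbs(std::uint64_t bytes_read,
                               std::uint64_t bytes_written,
                               double elapsed_seconds) {
    return static_cast<double>(bytes_read + bytes_written) / elapsed_seconds / 1e9;
}

// Fraction of the device's peak bandwidth that was actually achieved.
// peak_gbs is supplied by the caller; no device figure is assumed here.
double bandwidth_efficiency(double effective_gbs, double peak_gbs) {
    return effective_gbs / peak_gbs;
}
```

For example, a kernel that reads 1e9 bytes and writes 1e9 bytes in 0.05 s has an effective bandwidth of about 40 GB/s. Against a hypothetical peak of 100 GB/s that is an efficiency of roughly 0.4, which would suggest that a good share of the memory traffic is wasted, for instance on poorly coalesced accesses.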
You could also consider instrumenting your code manually with a few extra counters.