I’ve been playing around with the profiling counters on a GPU (GTX580). Could anyone tell me what causes the variation in the profiling counter results? I have a very simple kernel that just copies one buffer to another, and I profile the number of instructions executed in it. For some configurations of work-item count and work-group size, the results are stable across runs. For other configurations, they differ significantly from run to run.
I’ve been told that this is because the mapping of warps (and work-groups) to SMs is non-deterministic. But as far as I know, all the warps belonging to a work-group are executed on a single SM, and there is no branch in the kernel, so in theory the results should be the same no matter how the warps are mapped to SMs.
Any help would be appreciated.
EDIT: This is the code in question:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
__kernel void histogram(__global float* x, __global float* y)
{
    int id = get_global_id(0);
    y[id] = x[id];
}
The profiler only instruments a subset of the multiprocessors (MPs, also called SMs) to collect data and then scales the results up to the full GPU. If your code doesn’t evenly “fill” every MP with the same number of blocks (work-groups in OpenCL terms), then the profiler output can vary depending on how much work the profiled MP(s) received on a given run. In extreme cases, with kernels that run a very small number of blocks on a large GPU, profiling can fail to collect any statistics at all because the profiled MP(s) ran no blocks.
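To make the scaling effect concrete, here is a rough toy simulation in Python. It is only a sketch under stated assumptions, not a model of the real hardware scheduler or profiler: it assumes a single profiled MP, a scheduler that gives every MP an equal share of work-groups and drops the leftovers on randomly chosen MPs, and a made-up cost of 100 instructions per work-group. The function names are hypothetical. The only hardware fact used is that the GTX 580 has 16 multiprocessors.

```python
import random

NUM_SMS = 16  # the GTX 580 has 16 multiprocessors


def groups_per_sm(num_groups, num_sms, rng):
    """Toy scheduler (assumption): equal share per MP, leftovers on random MPs."""
    base, extra = divmod(num_groups, num_sms)
    counts = [base] * num_sms
    for sm in rng.sample(range(num_sms), extra):
        counts[sm] += 1
    return counts


def profiled_estimate(num_groups, instr_per_group, rng,
                      num_sms=NUM_SMS, profiled_sm=0):
    """Count instructions on one profiled MP and scale up to the whole GPU."""
    counts = groups_per_sm(num_groups, num_sms, rng)
    return counts[profiled_sm] * instr_per_group * num_sms


def estimates_over_runs(num_groups, runs=200, seed=0, instr_per_group=100):
    """Return the set of distinct estimates seen over repeated runs."""
    rng = random.Random(seed)
    return {profiled_estimate(num_groups, instr_per_group, rng)
            for _ in range(runs)}


if __name__ == "__main__":
    for groups in (64, 72, 4):
        true_total = groups * 100
        print(groups, "groups, true total", true_total,
              "-> estimates", sorted(estimates_over_runs(groups)))
```

In this model, 64 work-groups divide evenly over 16 MPs, so every run reports the same total. With 72 work-groups the estimate flips between two values depending on whether the profiled MP happened to receive one of the leftover groups, and with only 4 work-groups some runs report zero, which corresponds to the "no statistics" case described above. If the launch configuration allows it, choosing a work-group count that is a multiple of the MP count should reduce this kind of variation.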