Please help me. I can’t understand why this function, which uses texture memory:
__global__ void corr(int *data)
{
    int idx = (blockIdx.y * blockDim.y + threadIdx.y) * 64 + (blockIdx.x * blockDim.x + threadIdx.x);
    data[idx] = 0;
    for (int i = 0; i < blockDim.y - threadIdx.y; i++)
        for (int j = 0; j < blockDim.x - threadIdx.x; j++)
            data[idx] = data[idx] + tex2D(g_TexRef, blockIdx.x * blockDim.x + threadIdx.x + j, blockIdx.y * blockDim.y + threadIdx.y + i);
}
runs slower than this other version of the function, which uses global memory:
__global__ void corr1(int *in, int *data)
{
    int idx = (blockIdx.y * blockDim.y + threadIdx.y) * 64 + (blockIdx.x * blockDim.x + threadIdx.x);
    data[idx] = 0;
    for (int i = 0; i < blockDim.y - threadIdx.y; i++)
        for (int j = 0; j < blockDim.x - threadIdx.x; j++)
            data[idx] = data[idx] + in[(blockIdx.y * blockDim.y + threadIdx.y + i) * 64 + blockIdx.x * blockDim.x + threadIdx.x + j];
}
On Fermi, global memory loads are cached in L1 and the L1 cache has higher bandwidth than the texture cache.
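If you stay with the global-memory version, one thing that may be worth trying on Fermi is asking the runtime to favor L1 over shared memory for that kernel. The following is only a sketch: corr1 is your kernel with the running sum moved into a register, and prefer_l1_for_corr1 is a hypothetical helper name I made up. cudaFuncSetCacheConfig expresses a preference, not a guarantee, so measure before and after.

```cuda
#include <cuda_runtime.h>

// corr1 from the question; the sum is accumulated in a register and
// written to global memory once at the end.
__global__ void corr1(const int *in, int *data)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int rows = blockDim.y - threadIdx.y;  // rows left in this block
    int cols = blockDim.x - threadIdx.x;  // columns left in this block
    int sum = 0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            sum += in[(y + i) * 64 + x + j];
    data[y * 64 + x] = sum;
}

// Hypothetical helper: request the larger-L1 split for corr1.
// Call it once on the host before launching the kernel.
cudaError_t prefer_l1_for_corr1()
{
    return cudaFuncSetCacheConfig(corr1, cudaFuncCachePreferL1);
}
```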
Also, your 2D spatial locality may not be high enough for you to benefit from using textures. If so, you may be able to refactor your kernels so that concurrently running threads access values in the cache that are closer to each other.
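As one possible refactoring (a sketch, not something I have benchmarked): each thread in your kernels only reads values that belong to its own block's tile, so the block could stage that tile in shared memory once and sum from there. This swaps in shared memory rather than the texture or L1 cache. The names corr_shared, run_corr_shared, WIDTH and TILE are mine, and I am assuming a 64x64 input launched with 16x16 blocks, since the question only shows the row width of 64.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define WIDTH 64  // row pitch in elements, hard-coded as 64 in the question
#define TILE  16  // assumed block size: TILE x TILE threads

__global__ void corr_shared(const int *in, int *data)
{
    __shared__ int tile[TILE][TILE];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Each thread loads one element; threads in the same row of the
    // block read consecutive addresses in global memory.
    tile[threadIdx.y][threadIdx.x] = in[y * WIDTH + x];
    __syncthreads();

    // Same sum as the original kernels: everything from this thread's
    // position to the bottom-right corner of the block's tile.
    int sum = 0;
    for (int i = threadIdx.y; i < TILE; i++)
        for (int j = threadIdx.x; j < TILE; j++)
            sum += tile[i][j];
    data[y * WIDTH + x] = sum;
}

// Hypothetical host wrapper for a WIDTH x WIDTH input (error checks omitted).
std::vector<int> run_corr_shared(const std::vector<int> &in)
{
    size_t bytes = WIDTH * WIDTH * sizeof(int);
    int *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(WIDTH / TILE, WIDTH / TILE);
    corr_shared<<<grid, block>>>(d_in, d_out);

    std::vector<int> out(WIDTH * WIDTH);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

A quick sanity check is to pass an all-ones input: the top-left element of every tile should then come out as 256 (16 x 16) and the bottom-right element as 1. Whether this is actually faster than corr1 on your card is something you would have to time.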
See this thread for more information, and further pointers: http://forums.nvidia.com/index.php?showtopic=181432