I have a large character array in device global memory that is accessed
in a coalesced manner by threads. I’ve read somewhere that I could speed up
memory access by reading 4 or 16 chars in one memory transaction per thread.
I believe I would have to use textures and the char4 or int4 structs. However,
I can’t find any documentation or examples on this. Could anyone here please
provide a simple example or pointers to where I can learn more about this?
In my code I define the char array as
char *database = NULL;
cudaMalloc( (void**) &database, SIZE * sizeof(char) );
What would the definition be if I want to use textures and char4 (or int4)?
Thanks very much.
I finally figured out the answer to my own question. The definition with char4
would be
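A minimal sketch of what such a definition could look like, assuming SIZE is a multiple of 4 (the value of SIZE below is a placeholder, not the one from my program). The allocation holds the same number of bytes as before, just viewed as SIZE / 4 elements of the built-in char4 type:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE (1 << 20)   // placeholder: number of chars, assumed to be a multiple of 4

int main()
{
    char4 *database = NULL;

    // Same byte count as the char version, expressed as SIZE / 4 char4 elements.
    cudaError_t err = cudaMalloc((void **) &database, (SIZE / 4) * sizeof(char4));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("allocated %d char4 elements\n", SIZE / 4);

    cudaFree(database);
    return 0;
}
```

If SIZE were not a multiple of 4, the leftover 1 to 3 chars would need separate handling or the buffer would need padding.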
Textures are not needed for this. The kernel speeds up by a factor of three
with char4, but the speedup drops to a factor of two when I unroll the loop.
For the sake of completeness, my kernel is
And with char4 it is
I also tried int4, but it is only 0.0002 seconds faster than the char4 version.
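For anyone landing here later, a rough illustration of the scalar versus char4 access pattern. This is not the kernel from my program: the workload (counting occurrences of a character) and the names countScalar, countChar4, and countOnDevice are made up for the example, and it assumes the array length is a multiple of 4. It only checks that both versions agree and says nothing about timings on any particular GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)

// Baseline: each thread reads one char per loop iteration.
__global__ void countScalar(const char *data, size_t n, char target, unsigned int *count)
{
    size_t stride = (size_t) gridDim.x * blockDim.x;
    unsigned int local = 0;
    for (size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += (data[i] == target);
    if (local) atomicAdd(count, local);
}

// Vectorized: each thread reads one char4 (4 bytes) per loop iteration.
__global__ void countChar4(const char4 *data, size_t n4, char target, unsigned int *count)
{
    size_t stride = (size_t) gridDim.x * blockDim.x;
    unsigned int local = 0;
    for (size_t i = (size_t) blockIdx.x * blockDim.x + threadIdx.x; i < n4; i += stride) {
        char4 v = data[i];   // single 32-bit load
        local += (v.x == target) + (v.y == target) + (v.z == target) + (v.w == target);
    }
    if (local) atomicAdd(count, local);
}

// Hypothetical helper: copies n chars to the device (n assumed to be a
// multiple of 4), runs one of the two kernels, and returns the count.
unsigned int countOnDevice(const char *host, size_t n, char target, bool useChar4)
{
    char4 *d_data = NULL;
    unsigned int *d_count = NULL;
    unsigned int result = 0;

    CHECK(cudaMalloc((void **) &d_data, n));
    CHECK(cudaMalloc((void **) &d_count, sizeof(unsigned int)));
    CHECK(cudaMemcpy(d_data, host, n, cudaMemcpyHostToDevice));
    CHECK(cudaMemset(d_count, 0, sizeof(unsigned int)));

    if (useChar4)
        countChar4<<<256, 256>>>(d_data, n / 4, target, d_count);
    else
        countScalar<<<256, 256>>>((const char *) d_data, n, target, d_count);
    CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(&result, d_count, sizeof(result), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_data));
    CHECK(cudaFree(d_count));
    return result;
}

int main()
{
    const size_t n = 1 << 20;                 // placeholder size, multiple of 4
    std::vector<char> host(n);
    for (size_t i = 0; i < n; ++i)
        host[i] = (char) ('a' + (i % 4));     // every fourth char is 'a'

    unsigned int scalar = countOnDevice(host.data(), n, 'a', false);
    unsigned int vec    = countOnDevice(host.data(), n, 'a', true);
    printf("scalar = %u, char4 = %u\n", scalar, vec);
    return (scalar == vec && scalar == n / 4) ? 0 : 1;
}
```

The char4 pointer comes straight from cudaMalloc, so it is suitably aligned for 4-byte loads. An int4 variant would follow the same shape with 16 bytes per load and an array length that is a multiple of 16.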