I want to fill the global memory with as much data as possible, and I know that every value in my data lies between 0 and 255. So I was thinking that instead of using an int type I could store the values in a short, or even better in a char, and access and compute the values on the device using the same type.
However, will this affect performance on a Tesla architecture? And will the copy from global to shared memory be coalesced?
Any ideas? Thanks!
The best strategy for optimising bandwidth utilization with 8 or 16 bit types depends on the memory access pattern within a warp. If a given warp reads in a highly scattered fashion, then using byte-sized types and forcing 32 byte transaction sizes per warp might be an advantage, because the larger default transaction sizes will waste a lot of the achieved bandwidth. But if the accesses within a warp hit contiguous or near-contiguous segments of memory, then the best strategy will probably be to use uchar4 or ushort2 and aim to achieve throughput using coalesced 32 bit reads per thread from global memory into a shared memory array, and have all the threads in a block read from that. It might also be worth evaluating the performance of a texture for loads if the per-warp accesses are scattered.

This pdf contains a lot of useful information about memory performance optimisation on Fermi. I would encourage you to spend some time reading it. Benchmarking is really the only way of evaluating the best approach for a given application, and that document has some excellent tips on how to understand the memory performance of a given piece of code.
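For the contiguous case, the uchar4 approach could look roughly like the sketch below. It is only an illustration under assumptions: the kernel name sumPackedBytes, the block size kBlockThreads, and the toy computation (each thread sums its own four bytes plus the first byte of its neighbour's element) are hypothetical, and the data is assumed to be padded to a multiple of four bytes. Each thread issues one 32 bit load, so a warp reads 128 contiguous bytes, and all further reads go to the shared memory tile.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size; the kernel assumes blockDim.x == kBlockThreads.
constexpr int kBlockThreads = 256;

// Each thread loads one uchar4 (four packed 8 bit values) from global memory
// into shared memory, then reads its own element and its neighbour's element
// from the shared tile.
__global__ void sumPackedBytes(const uchar4 *in, unsigned int *out, int n4)
{
    __shared__ uchar4 tile[kBlockThreads];

    const int tid = threadIdx.x;
    const int gid = blockIdx.x * blockDim.x + tid;

    // Consecutive threads read consecutive 32 bit words, so the load coalesces.
    tile[tid] = (gid < n4) ? in[gid] : make_uchar4(0, 0, 0, 0);
    __syncthreads();

    if (gid < n4) {
        const uchar4 a = tile[tid];
        const uchar4 b = tile[(tid + 1) % kBlockThreads];  // neighbour, wraps within the block
        out[gid] = a.x + a.y + a.z + a.w + b.x;
    }
}

int main()
{
    const int n4 = 4 * kBlockThreads;  // number of uchar4 elements (4 bytes each)
    const size_t inBytes = n4 * sizeof(uchar4);
    const size_t outBytes = n4 * sizeof(unsigned int);

    // Fill the host buffer with byte values in the range 0..255.
    std::vector<uchar4> h_in(n4);
    for (int i = 0; i < n4; ++i) {
        h_in[i] = make_uchar4((4 * i) & 0xFF, (4 * i + 1) & 0xFF,
                              (4 * i + 2) & 0xFF, (4 * i + 3) & 0xFF);
    }

    uchar4 *d_in = nullptr;
    unsigned int *d_out = nullptr;
    cudaMalloc(&d_in, inBytes);
    cudaMalloc(&d_out, outBytes);
    cudaMemcpy(d_in, h_in.data(), inBytes, cudaMemcpyHostToDevice);

    sumPackedBytes<<<n4 / kBlockThreads, kBlockThreads>>>(d_in, d_out, n4);

    std::vector<unsigned int> h_out(n4);
    const cudaError_t err =
        cudaMemcpy(h_out.data(), d_out, outBytes, cudaMemcpyDeviceToHost);

    // Recompute the same sums on the host and compare.
    int mismatches = 0;
    for (int i = 0; i < n4; ++i) {
        const int block = i / kBlockThreads;
        const int tid = i % kBlockThreads;
        const int nb = block * kBlockThreads + (tid + 1) % kBlockThreads;
        const unsigned int expected =
            h_in[i].x + h_in[i].y + h_in[i].z + h_in[i].w + h_in[nb].x;
        if (h_out[i] != expected) ++mismatches;
    }
    printf("status: %s, mismatches: %d\n", cudaGetErrorString(err), mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return (err != cudaSuccess || mismatches != 0) ? 1 : 0;
}
```

Whether this actually beats plain per-thread byte loads or a texture fetch for a given access pattern is something only a benchmark on the target card can answer.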