I need to load 128 bits of data per thread in CUDA C++. What is the best approach in this case for maximum performance and for compatibility with the CPU version of the code?
Will the following ways of accessing the data perform equally?
1: Use two 64-bit loads:
unsigned __int64 src1 = arr[threadIdx.x*2];
unsigned __int64 src2 = arr[threadIdx.x*2 + 1];
2: Use a struct of two 64-bit fields:
struct T_src { unsigned __int64 src1, src2; };
T_src src = arr[threadIdx.x];
3: Use a CUDA built-in vector type:
ulong2 src = arr[threadIdx.x];
Accessing memory in the GPU’s “native” terms, using CUDA-defined types and primitives, is the most likely way to maximize performance. This means option #3 in your question.
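To illustrate why the built-in type tends to win: the compiler can generally emit a single 128-bit load only when it knows the address is 16-byte aligned. A plain struct of two 64-bit integers (option #2) normally carries only the alignment of its members, while the CUDA vector types are declared with 16-byte alignment. The sketch below mimics that difference in plain C++ so it can be checked on the host; `PlainPair`, `AlignedPair`, and the helper functions are hypothetical names, not CUDA types. One caveat: `ulong2` is built from `unsigned long`, which is 32 bits wide on Windows, so if you are using `unsigned __int64` there, `ulonglong2` is probably the 128-bit type you want.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for option #2: a plain struct of two 64-bit fields.
// Its alignment is just that of its members (typically 8 bytes).
struct PlainPair { unsigned long long src1, src2; };

// Hypothetical stand-in for a 16-byte-aligned vector type such as ulonglong2.
// The explicit alignment is what allows one 128-bit access per element.
struct alignas(16) AlignedPair { unsigned long long x, y; };

// Both layouts hold 128 bits; only the alignment guarantee differs.
static_assert(sizeof(PlainPair) == 16, "two 64-bit fields");
static_assert(sizeof(AlignedPair) == 16, "two 64-bit fields");
static_assert(alignof(AlignedPair) == 16, "16-byte aligned");

// Returns true when p sits on a 16-byte boundary.
inline bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Checks that every element of an AlignedPair array is 16-byte aligned.
inline void check_array_alignment(const AlignedPair* arr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        assert(is_16_byte_aligned(&arr[i]));
    }
}
```

If you want to stay with option #2, adding the same 16-byte alignment to your own struct should give the compiler the same information.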
If you intend to write code that runs on CUDA and can also run on a stand-alone CPU when recompiled, I’d suggest coding for CUDA performance first and then back-porting for host CPU execution. CUDA is pickier about how things must be set up or structured than most host CPU architectures, and the performance benefits of doing things “right” for CUDA will far exceed the costs of doing things slightly suboptimally for the host CPU case.
I’d still use option #3 for the CUDA case and define an equivalent ulong2 structure for the host CPU case. Copying that structure around on the host CPU will still require two (or four) memory moves behind the scenes, but that will be required no matter what you write in the source code. Use the simplest, most readable source style and let the compiler take care of the heavy lifting.
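Here is a minimal sketch of that idea, assuming you are happy to hide the type behind a typedef. When the file is compiled by nvcc (`__CUDACC__` is defined) it uses the built-in `ulonglong2`; otherwise it falls back to a hand-written struct with the same field names and alignment. The names `u64_pair`, `PAIR_HD`, `load_pair`, `fold_pair`, and `fold_all_host` are made up for illustration.

```cpp
#include <cassert>

#ifdef __CUDACC__
  // Compiling with nvcc: use the native 128-bit vector type.
  #include <vector_types.h>
  #define PAIR_HD __host__ __device__
  typedef ulonglong2 u64_pair;
#else
  // Plain host compiler: define an equivalent structure ourselves.
  #define PAIR_HD
  struct alignas(16) u64_pair { unsigned long long x, y; };
#endif

// Loads one 128-bit element; the same source works on device and host.
PAIR_HD inline u64_pair load_pair(const u64_pair* arr, unsigned int idx) {
    return arr[idx];
}

// Example consumer: combines the two halves with XOR.
PAIR_HD inline unsigned long long fold_pair(u64_pair v) {
    return v.x ^ v.y;
}

// Host-only helper that folds a whole array, one element per "thread".
inline unsigned long long fold_all_host(const u64_pair* arr, unsigned int n) {
    assert(arr != nullptr || n == 0);
    unsigned long long acc = 0;
    for (unsigned int i = 0; i < n; ++i) {
        acc ^= fold_pair(load_pair(arr, i));
    }
    return acc;
}
```

Inside a kernel the call would look like `u64_pair src = load_pair(arr, threadIdx.x);`, and the CPU build can call the same function from an ordinary loop.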