Given the CUDA vector type int4, how can I load 128 bits of data from constant memory?
This doesn’t seem to work:
#include <stdio.h>
#include <cuda.h>
__constant__ int constant_mem[4];
__global__ void kernel(){
int4 vec;
vec = constant_mem[0];
}
int main(void){return 0;}
On the seventh line I’m trying to load all 4 integer values in the constant memory into the 128-bit vector type. This operation results in the following compilation error:
vectest.cu(7): error: no operator "=" matches these operands
operand types are: int4 = int
Also, is it possible to access the vector type directly without having to cast it, like so:
int data = vec[0];
Switch statement in PTX assembly:
@%p1 bra BB1_55;
setp.eq.s32 %p26, %r1, 1;
@%p26 bra BB1_54;
setp.eq.s32 %p27, %r1, 2;
@%p27 bra BB1_53;
setp.ne.s32 %p28, %r1, 3;
@%p28 bra BB1_55;
mov.u32 %r961, %r61;
bra.uni BB1_56;
BB1_53:
mov.u32 %r961, %r60;
bra.uni BB1_56;
BB1_54:
mov.u32 %r961, %r59;
bra.uni BB1_56;
BB1_55:
mov.u32 %r961, %r58;
BB1_56:
In the first case, casting is probably the simplest solution, so something like this:
(disclaimer written in browser, not compiled or tested, use at own risk)
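A minimal sketch of the cast described above, reusing the constant_mem array and kernel from the question. The __align__(16) qualifier, the out parameter and the host helper load_int4_sum are assumptions added for illustration: the alignment keeps the 16-byte load legal, the output pointer stops the compiler from discarding the load, and the helper only exists so the result can be checked from the host.

```cuda
#include <cuda_runtime.h>

// Same array as in the question. The 16-byte alignment is an assumption
// added here: a 128 bit load needs a 16-byte aligned address.
__constant__ __align__(16) int constant_mem[4];

__global__ void kernel(int4 *out)
{
    // View the four ints as a single int4 and load all 128 bits at once.
    int4 vec = *reinterpret_cast<const int4 *>(constant_mem);
    *out = vec;  // store the value so the load is not optimized away
}

// Hypothetical host helper: fills constant memory with {1, 2, 3, 4}, runs
// the kernel and returns the sum of the four components (-1 on a CUDA error).
int load_int4_sum()
{
    const int host_vals[4] = {1, 2, 3, 4};
    int4 *d_out = nullptr;
    int4 h_out = make_int4(0, 0, 0, 0);

    if (cudaMemcpyToSymbol(constant_mem, host_vals, sizeof(host_vals)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_out, sizeof(int4)) != cudaSuccess)
        return -1;

    kernel<<<1, 1>>>(d_out);

    // Blocking copy: waits for the kernel to finish before reading the result.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int4), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return -1;

    return h_out.x + h_out.y + h_out.z + h_out.w;
}
```

The individual words are then available as vec.x, vec.y, vec.z and vec.w; int4 is a plain struct with those four members and has no operator[], so vec[0] will not compile.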
Using the C++ reinterpret_cast operator will force the compiler to emit a 128 bit load instruction.
In the second case, it sounds like you want to directly address 32 bit words stored in an array of 128 bit vector types, using 128 bit memory transactions. That requires some helper functions, perhaps something like:
(disclaimer written in browser, not compiled or tested, use at own risk)
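One possible shape for such helpers, as a sketch only. The names fetch4, fetch_word, constant_vecs and read_word, and the array size of 128, are hypothetical and chosen for illustration. The code assumes 0 <= idx < 512 and does no bounds checking.

```cuda
#include <cuda_runtime.h>

// Hypothetical table of 128 bit vectors in constant memory (size is illustrative).
__constant__ int4 constant_vecs[128];

// Return the n-th 32 bit word (0..3) of an int4 that was loaded whole.
__device__ __forceinline__ int fetch4(const int4 val, const int n)
{
    // Touch every member so older compilers keep the full vector load.
    (void) val.x; (void) val.y; (void) val.z; (void) val.w;
    switch (n) {
        case 3:  return val.w;
        case 2:  return val.z;
        case 1:  return val.y;
        case 0:
        default: return val.x;
    }
}

// Address the int4 array as if it were a flat array of 32 bit words.
__device__ __forceinline__ int fetch_word(const int idx)
{
    const int4 val = constant_vecs[idx / 4];  // read the whole int4
    return fetch4(val, idx % 4);              // then pick one word out of it
}

__global__ void kernel(int *out, const int idx)
{
    *out = fetch_word(idx);
}

// Hypothetical host helper: stores the value i in flat word i, then reads
// word idx back through the kernel (returns -1 on a CUDA error).
int read_word(const int idx)
{
    int4 host_vecs[128];
    for (int i = 0; i < 128; ++i)
        host_vecs[i] = make_int4(4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3);

    int *d_out = nullptr;
    int h_out = -1;

    if (cudaMemcpyToSymbol(constant_vecs, host_vecs, sizeof(host_vecs)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess)
        return -1;

    kernel<<<1, 1>>>(d_out, idx);

    // Blocking copy: waits for the kernel to finish before reading the result.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;
}
```

With the table filled this way, read_word(6) selects element 1 of the array and returns its z member, so flat word 6 comes back as the value 6.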
Here we force a 128 bit transaction by reading whole int4 values and parsing their contents (the casts to void are an incantation necessary for older versions of the open64 compiler, which was prone to optimize away vector loads if it thought members were unused). There are a few integer operations (IOPs) of overhead to do the indexing, but they are potentially worth it if the load bandwidth of the resulting transaction is higher. The switch statement is probably compiled using conditional execution, so there shouldn’t be a branch divergence penalty. Be aware that very random access to an array of int4 values can potentially waste a lot of bandwidth and cause warp serialization, which can have a big negative impact on performance.