Given a CUDA vector type int4, how can I load 128 bits of data from constant memory?

Editorial Team

Asked: June 8, 2026

Given a CUDA vector type int4, how can I load 128 bits of data from constant memory?

This doesn’t seem to work:

#include <stdio.h>
#include <cuda.h>

__constant__ int constant_mem[4];
__global__ void kernel(){
    int4 vec;
    vec = constant_mem[0];
}
int main(void){return 0;}

On the seventh line I’m trying to load all 4 integer values in the constant memory into the 128-bit vector type. This operation results in the following compilation error:

vectest.cu(7): error: no operator "=" matches these operands
            operand types are: int4 = int

Also, is it possible to access the vector type directly without having to cast it, like so:

int data = vec[0];

Switch statement in PTX assembly:

    @%p1 bra    BB1_55;

    setp.eq.s32     %p26, %r1, 1;
    @%p26 bra   BB1_54;

    setp.eq.s32     %p27, %r1, 2;
    @%p27 bra   BB1_53;

    setp.ne.s32     %p28, %r1, 3;
    @%p28 bra   BB1_55;

    mov.u32     %r961, %r61;
    bra.uni     BB1_56;

BB1_53:
    mov.u32     %r961, %r60;
    bra.uni     BB1_56;

BB1_54:
    mov.u32     %r961, %r59;
    bra.uni     BB1_56;

BB1_55:
    mov.u32     %r961, %r58;

BB1_56:

1 Answer

Editorial Team · Answer 1 · 2026-06-08T19:01:09+00:00

In the first case, casting is probably the simplest solution, so something like this:

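__constant__ __align__(16) int constant_mem[4];  // a 128-bit load needs 16-byte alignment
__global__ void kernel(){
    // Reinterpret the four ints as one int4 so they are read together
    int4 vec = *reinterpret_cast<const int4 *>(constant_mem);
}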

(disclaimer: written in browser, not compiled or tested, use at own risk)

Using the C++ reinterpret_cast operator should make the compiler emit a single 128-bit load instruction, provided the array is 16-byte aligned.
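An alternative to the cast, if the layout of the data is under your control, is to declare the constant symbol as an int4 in the first place, which gives it the alignment a 128-bit load needs. The following is a minimal sketch of that idea rather than the approach above; the names constant_vec, copy_vec and round_trip are hypothetical, and whether a single 128-bit load is actually emitted is worth confirming in the generated PTX or SASS.

```cuda
#include <cuda_runtime.h>

// Assumption: the data can be declared as int4 directly, so no cast
// or explicit __align__ is needed.
__constant__ int4 constant_vec;

// Reads the whole vector from constant memory and writes it out.
__global__ void copy_vec(int4 *out)
{
    int4 vec = constant_vec;  // all four ints are read as one vector
    *out = vec;
}

// Hypothetical host helper: uploads `in` to constant memory, runs the
// kernel on one thread, and copies the value the kernel read into `result`.
cudaError_t round_trip(const int4 &in, int4 *result)
{
    cudaError_t err = cudaMemcpyToSymbol(constant_vec, &in, sizeof(int4));
    if (err != cudaSuccess) return err;

    int4 *d_out = nullptr;
    err = cudaMalloc(&d_out, sizeof(int4));
    if (err != cudaSuccess) return err;

    copy_vec<<<1, 1>>>(d_out);

    // cudaMemcpy waits for the kernel and reports any launch error.
    err = cudaMemcpy(result, d_out, sizeof(int4), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}
```

Calling round_trip(make_int4(10, 20, 30, 40), &r) from host code should leave the same four values in r.x, r.y, r.z and r.w when a CUDA device is available.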

In the second case, it sounds like you want to directly address 32 bit words stored in an array of 128 bit vector types, using 128 bit memory transactions. That requires some helper functions, perhaps something like:

__inline__ __device__ int fetch4(const int4 val, const int n)
{
    (void) val.x; (void) val.y; (void) val.z; (void) val.w;
    switch(n) {
        case 3:
            return val.w;
        case 2:
            return val.z;
        case 1:
            return val.y;
        case 0:
        default:
            return val.x;
    }
}

__device__ int index4(const int4 * array, const int n)
{
    int div = n / 4;
    int mod = n - (div * 4);

    int4 val = array[div]; // 128 bit load here

    return fetch4(val, mod);
}

__constant__ int4 constant_mem[32];  // 128 ints stored as 32 int4 vectors
__global__ void kernel(){
    int val = index4(constant_mem, threadIdx.x);  // valid for threadIdx.x < 128
}

(disclaimer: written in browser, not compiled or tested, use at own risk)

Here we force a 128-bit transaction by reading whole int4 values and then picking out the requested component. (The casts to void are an incantation needed for older versions of the Open64 compiler, which was prone to optimize away vector loads if it thought members were unused.) The indexing costs a few integer operations, but they are potentially worth it if the wider transaction gives higher load bandwidth. The switch statement is probably compiled using conditional execution, so there shouldn’t be a branch divergence penalty.

Be aware that very random access to an array of int4 values can waste a lot of bandwidth and cause warp serialization, which can have a big negative performance impact.
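On the question of writing vec[0] directly: the built-in int4 type only exposes the members x, y, z and w and has no subscript operator. If the bracket syntax matters, one option is a thin wrapper that applies the same component selection as fetch4. This is only a sketch; IndexableInt4 is a hypothetical name, not part of CUDA, and out-of-range indices fall back to x just like the default case above.

```cuda
#include <cuda_runtime.h>

// Hypothetical wrapper (not a CUDA type) that adds read-only
// operator[] on top of the built-in int4.
struct IndexableInt4 {
    int4 v;

    // Select one component by index; anything outside 1..3 returns x,
    // mirroring the default case of fetch4.
    __host__ __device__ int operator[](int n) const
    {
        switch (n) {
            case 3:  return v.w;
            case 2:  return v.z;
            case 1:  return v.y;
            default: return v.x;
        }
    }
};
```

With this in place, IndexableInt4 a{vec}; int data = a[0]; reads the x component. The wrapper adds no storage beyond the int4 itself, so an array of these could be loaded the same way as an array of int4.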
