I’m trying to understand the difference between working with just long, long2, long3, long4, long8, long16. Let’s assume that my CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG is 2.
When should I be working with long, long2, long3, long4, long8, long16? Assume that I want my kernel to XOR a bunch of bitvectors of, say, length 500.
- If using long, I need to XOR ceil(500/64)=8 long.
- If using long2, I need to XOR ceil(500/128)=4 long2.
- If using long8, I need to XOR ceil(500/512)=1 long8.
So what would be the difference between xorring long[8], long2[4] or long8? Is there any advantage of going beyond the CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG at all?
Edit: I added a small sample script to make it more clear: the for (int j=0; j<8; j++) loops over a vector of length 8, and I wondered if I’d better do this for one long8, four long2s or eight longs.
while (i < to) {
ui64[] row = rows[rowIndex];
ui64 bitchange = i++;
bitchange ^= i;
rowIndex = 63-__builtin_clzll(bitchange);
ui64 cardinality = 0;
for (int j=0; j<8; j++) {
curr[j] ^= row[j];
cardinality += __builtin_popcountll(curr[j]);
}
popcountpolynomial[cardinality]++;
}
mfa is correct aout the preferred width, but using wider vectors usually is good. The device will sequentially issue instructions to process in the widest format vectors it supports, which is good because it helps hide the latency of the operation. This is much more true on GPUs and much less true on CPUs, GPUs tend to have a lot of registers (> 1000).
Think of the preferred width as the width that guarantees you won’t “waste” vector lanes on vector architecture processors – if the GPU has vector ALUs, issuing instructions that don’t use the whole width (say, only use the first item in the vector), then the other lanes might go unused in that instruction, wasting potential computing power. Think of SSE where it could be doing 4 adds with one instruction, but you’re only getting one number as a result because you’re not using 3 of the 4 pieces of the vector.
OpenCL compilers (on vector ALU hardware) try to restructure your code to “vectorize” if you aren’t using the full vector width, but obviously there are limitations to that.
Of course, only use the wider vectors when it feels natural in your algorithm. Never contort your program to try to use really wide vectors.
Using less registers is a good thing too though, if you’re using too many registers, it may restrict the number of wavefronts/warps that can be run in parallel.
Using vectors can actually reduce register pressure if the auto-vectorizer fails to find a vectorized solution in scalar code, in the case that the hardware does use a vector ALU – you’ll “waste” less vector lanes because more will fit in each register.