Let’s say you have a uint64_t and care only about the high-order bit of each byte in it. Like so:
uint32_t:
0000 … 1000 0000 1000 0000 1000 0000 1000 0000 --> 0000 1111
Is there a faster way than:
return
(
    ((x >> 56) & 128) +
    ((x >> 49) & 64) +
    ((x >> 42) & 32) +
    ((x >> 35) & 16) +
    ((x >> 28) & 8) +
    ((x >> 21) & 4) +
    ((x >> 14) & 2) +
    ((x >> 7) & 1)
);
That is, shifting x, masking, and adding the correct bit for each byte? This compiles to a lot of assembly, and I’m looking for a quicker way. The machine I’m using only supports instructions up to SSE2, and I couldn’t find any helpful SIMD ops.
Thanks for the help.
As I mentioned in a comment, pmovmskb does what you want. Here’s how you could use it:
MMX + SSE1:
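A minimal sketch of the MMX + SSE1 route, assuming GCC or Clang on x86. `_mm_movemask_pi8` is the intrinsic form of `pmovmskb` on a 64-bit MMX register; the function name `high_bits_mmx_sse1` is made up for illustration.

```c
#include <stdint.h>
#include <mmintrin.h>   /* MMX: __m64, _mm_set_pi32, _mm_empty */
#include <xmmintrin.h>  /* SSE1: _mm_movemask_pi8 (pmovmskb on an MMX register) */

/* Hypothetical helper: collects the top bit of each of the 8 bytes of x.
   Bit i of the result is the high bit of byte i (byte 0 = least significant). */
unsigned high_bits_mmx_sse1(uint64_t x)
{
    /* Build the MMX value from two 32-bit halves: high half first, then low. */
    __m64 v = _mm_set_pi32((int)(uint32_t)(x >> 32), (int)(uint32_t)x);
    int mask = _mm_movemask_pi8(v);
    _mm_empty();  /* leave MMX state so later x87 floating-point code is safe */
    return (unsigned)mask;
}
```

The `_mm_empty()` call is the price of using MMX registers; the SSE2 variant avoids it.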
SSE2:
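A minimal sketch of the SSE2 route, assuming an x86 compiler that ships `<emmintrin.h>`. `_mm_movemask_epi8` is `pmovmskb` on a 128-bit XMM register, so the value is loaded into the low 64 bits and the upper half is left zero. The function name `high_bits_sse2` is made up for illustration.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_loadl_epi64, _mm_movemask_epi8 */

/* Hypothetical helper: collects the top bit of each of the 8 bytes of x.
   Bit i of the result is the high bit of byte i (byte 0 = least significant). */
unsigned high_bits_sse2(uint64_t x)
{
    /* Load x into the low 64 bits of an XMM register; the upper 64 bits are zeroed. */
    __m128i v = _mm_loadl_epi64((const __m128i *)&x);
    /* pmovmskb yields 16 mask bits; the upper 8 come from the zeroed half and are 0. */
    return (unsigned)_mm_movemask_epi8(v);
}
```

For the example in the question, `high_bits_sse2(0x80808080)` gives `0x0F`.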
And I looked up the newer way to do it:
BMI2:
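A minimal sketch of the BMI2 route, assuming GCC or Clang on x86-64. `_pext_u64` packs the bits selected by the mask into the low bits of the result, so a mask of `0x8080808080808080` gathers the eight high bits in one instruction. The `target("bmi2")` attribute and `__builtin_cpu_supports` are compiler extensions, and all function names here are made up for illustration. Since a machine limited to SSE2 has no BMI2, the sketch falls back to plain shifts and masks at run time.

```c
#include <stdint.h>
#include <immintrin.h>  /* BMI2: _pext_u64 */

/* Portable fallback with the same bit mapping as the code in the question. */
unsigned high_bits_portable(uint64_t x)
{
    unsigned r = 0;
    for (int i = 0; i < 8; i++)
        r |= (unsigned)((x >> (8 * i + 7)) & 1u) << i;
    return r;
}

/* GCC/Clang extension: enable BMI2 for this function only,
   so the file does not need to be built with -mbmi2. */
__attribute__((target("bmi2")))
unsigned high_bits_pext(uint64_t x)
{
    /* pext gathers the masked bits, lowest first, into the low 8 bits. */
    return (unsigned)_pext_u64(x, 0x8080808080808080ULL);
}

/* Hypothetical dispatcher: use pext only when the CPU reports BMI2 support. */
unsigned high_bits_bmi2(uint64_t x)
{
    if (__builtin_cpu_supports("bmi2"))
        return high_bits_pext(x);
    return high_bits_portable(x);
}
```

In a hot loop you would hoist the CPU check out of the per-call path, or just build with `-mbmi2` and call `_pext_u64` directly when the target is known to support it.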