Update: Please read the code, it is NOT about counting bits in one int
Is it possible to improve the performance of the following code with some clever assembly?
uint bit_counter[64];

void Count(uint64 bits) {
    bit_counter[0] += (bits >> 0) & 1;
    bit_counter[1] += (bits >> 1) & 1;
    // ..
    bit_counter[63] += (bits >> 63) & 1;
}
Count is in the innermost loop of my algorithm.
Update:
Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older tech can be used, but not AVX2 or BMI1/2.
The bits variable holds nearly random bits (close to half zeros and half ones).
Maybe you can do 8 at once, by taking 8 bits spaced 8 apart and keeping 8 uint64s for the counts. That's only 1 byte per counter though, so you can accumulate at most 255 invocations of Count before you have to unpack those uint64s.
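
A minimal sketch of that idea in plain C, with no intrinsics and not benchmarked on Sandy Bridge. The names `packed`, `pending`, `LANE_MASK` and `Flush` are made up for illustration, and `uint32_t`/`uint64_t` stand in for the question's `uint`/`uint64`. Byte k of `packed[j]` counts bit 8*k + j, so each call does 8 shift/mask/add steps instead of 64, and the byte lanes are unpacked every 255 calls, before any of them can overflow.

```c
#include <stdint.h>

/* One set bit at the bottom of each byte lane. */
#define LANE_MASK 0x0101010101010101ULL

static uint32_t bit_counter[64];
static uint64_t packed[8];   /* byte k of packed[j] counts bit 8*k + j */
static unsigned pending;     /* calls accumulated since the last flush */

/* Unpack the 64 byte-wide counters into bit_counter and reset them. */
static void Flush(void) {
    for (int j = 0; j < 8; ++j) {
        for (int k = 0; k < 8; ++k)
            bit_counter[8 * k + j] += (uint32_t)((packed[j] >> (8 * k)) & 0xFF);
        packed[j] = 0;
    }
    pending = 0;
}

static void Count(uint64_t bits) {
    /* Shifting by j moves bits j, 8+j, 16+j, ... to the bottom of each byte. */
    for (int j = 0; j < 8; ++j)
        packed[j] += (bits >> j) & LANE_MASK;

    /* Each lane grows by at most 1 per call, so 255 calls cannot overflow. */
    if (++pending == 255)
        Flush();
}
```

Call `Flush()` once more after the last `Count` before reading `bit_counter`, since up to 254 calls may still be sitting in `packed`.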