I use a population count (hamming weight) function intensively in a windows c application and have to optimize it as much as possible in order to boost performance. More than half the cases where I use the function I only need to know the value to a maximum of 15. The software will run on a wide range of processors, both old and new. I already make use of the POPCNT instruction when Intel’s SSE4.2 or AMD’s SSE4a is present, but would like to optimize the software implementation (used as a fall back if no SSE4 is present) as much as possible.
Currently I have the following software implementation of the function for 64bit (platform) mode:
int population_count64(unsigned __int64 w) {
w -= (w >> 1) & 0x5555555555555555ULL;
w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
return int((w * 0x0101010101010101ULL) >> 56);
}
So to summarize:
(1) I would like to know if it is possible to optimize this for the case when I only want to know the value to a maximum of 15.
(2) Is there a faster software implementation (for both Intel and AMD CPU’s) than the function above (for unsigned 64bit integers)?
It is indeed possible to optimise your function for the “maximum 15” case. The following shaves off a few operations:
Inlining the function (using the inline keyword as above) should also increase performance.