When removing conditional branches from high-performance code, it can be useful to convert a true boolean into an integer with all bits set, as in unsigned long i = -1.
I came up with a way to obtain this integer-mask-boolean from an input int b (or bool b) taking values either 1 or 0:
unsigned long boolean_mask = -(!b);
To get the opposite value:
unsigned long boolean_mask = -b;
Has anybody seen this construction before? Am I on to something? When an int value of -1 (which I assume -b or -(!b) does produce) is promoted to a bigger unsigned int type, is it guaranteed to set all the bits?
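For reference, here is a minimal sketch of the two masks as helper functions. The names mask_if_true and mask_if_false are made up for illustration. It relies on the C rule that converting a negative value to an unsigned type reduces it modulo 2^N, so an int -1 becomes all ones in any unsigned width.

```c
#include <stdint.h>

/* Hypothetical helper: all bits set when b is nonzero, all bits clear otherwise. */
static inline uint64_t mask_if_true(int b)
{
    /* !!b normalizes any nonzero value to 1, so the negation is the int 0 or -1.
       Converting -1 to uint64_t is reduction modulo 2^64, i.e. all ones. */
    return (uint64_t)-(!!b);
}

/* Hypothetical helper: the opposite mask, all bits set when b is zero. */
static inline uint64_t mask_if_false(int b)
{
    return (uint64_t)-(!b);
}
```

One caveat: this assumes the negation happens on a signed int (or on the final wide type). If b were an unsigned int, -b would be computed in 32 bits and then zero-extended, leaving the upper half clear on platforms where unsigned long is wider than unsigned int.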
Here’s the context:
uint64_t ffz_flipped = ~i&~(~i-1); // least sig bit unset
// only set our least unset bit if we are not pow2-1
i |= (ffz_flipped < i) ? ffz_flipped : 0;
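For comparison, a branch-free variant of the snippet above could apply the mask trick directly. This is only a sketch: the function name is hypothetical, and a compiler may well emit branchless code for the ternary version anyway.

```c
#include <stdint.h>

/* Hypothetical wrapper around the snippet: sets the lowest clear bit of i,
   unless i is of the form 2^k - 1 (in which case i is returned unchanged). */
static inline uint64_t set_lowest_zero_unless_pow2m1(uint64_t i)
{
    uint64_t ffz_flipped = ~i & ~(~i - 1);         /* lowest clear bit of i, isolated */
    uint64_t mask = -(uint64_t)(ffz_flipped < i);  /* all ones if true, zero if false */
    return i | (ffz_flipped & mask);               /* same effect as the ?: version */
}
```

For example, 11 (binary 1011) becomes 15, while 7 (binary 0111) stays 7 because its lowest clear bit, 8, is not less than 7.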
I will inspect the generated asm before asking questions like this next time. It sounds very likely that the compiler will not burden the CPU with a branch here.
The question you should be asking yourself is this: If you write:
then
it_was_true will be either 1 or 0. But where did that 1 come from?
The machine’s instruction set doesn’t contain an instruction of the form:
or, indeed, anything like that. (I put a note on SSE at the end of this answer, illustrating that the former statement is not quite true.) The machine has an internal condition register, consisting of several condition bits, and the compare instruction — and a number of other arithmetic operations — cause those condition bits to be modified in specific ways. Subsequently, you can do a conditional branch, based on some condition bits, or a conditional load, and sometimes other conditional operations.
So actually, it could be a lot less efficient to store that 1 in a variable than it would have been to have directly done some conditional operation. Could have been, but maybe not, because the compiler (or at least, the guys who wrote the compiler) may well be cleverer than you, and it might just remember that it should have put a 1 into it_was_true so that when you actually get around to checking the value, the compiler can emit an appropriate branch or whatever.
So, speaking of clever compilers, you should take a careful look at the assembly code produced by:
uint64_t ffz_flipped = ~i&~(~i-1);
Looking at that expression, I can count five operations: three bitwise negations, one bitwise conjunction (and), and one subtract. But you won’t find five operations in the assembly output (at least, if you use gcc -O3). You’ll find three.
Before we look at the assembly output, let’s do some basic algebra. Here’s the most important identity:
-X == ~X + 1
Can you see why that’s true?
-X, in 2’s complement, is just another way of saying 2^n − X, where n is the number of bits in the word. In fact, that’s why it’s called “2’s complement”. What about ~X? Well, we can think of that as the result of subtracting every bit in X from the corresponding power of 2. For example, if we have four bits in our word, and X is 0101 (which is 5, or 2^2 + 2^0), then ~X is 1010, which we can think of as 2^3×(1−0) + 2^2×(1−1) + 2^1×(1−0) + 2^0×(1−1), which is exactly the same as 1111 − 0101. Or, in other words:
−X == 2^n − X
~X == (2^n − 1) − X
which means that
~X == (−X) − 1
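The identity ~X == (−X) − 1 is easy to spot-check in C using unsigned arithmetic, which wraps modulo 2^32 and so matches the 2^n reasoning above. This is just an illustrative sketch; the function name is made up.

```c
#include <stdint.h>

/* Hypothetical checker: returns 1 if ~x == -x - 1 (and, equivalently,
   -x == ~x + 1) holds for the given 32-bit value. */
static int complement_identity_holds(uint32_t x)
{
    uint32_t neg = (uint32_t)0 - x;              /* 2^32 - x, reduced modulo 2^32 */
    return (uint32_t)~x == (uint32_t)(neg - 1u)  /* ~X == (-X) - 1 */
        && neg == (uint32_t)(~x + 1u);           /* -X == ~X + 1   */
}
```

Calling it with 5, 0, or 0xFFFFFFFF returns 1 in each case, including the values where the subtraction wraps around.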
Remember that we had
~i & ~(~i−1)
But we now know that we can change ~(~i−1) into minus operations:
~(~i−1)
== −(~i−1) − 1
== −(−i - 1 - 1) − 1
== (i + 2) - 1
== i + 1
How cool is that! We could have just written:
uint64_t ffz_flipped = ~i & (i + 1);
which is only three operations, instead of five.
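To convince yourself that the simplification ~(~i−1) == i + 1 really carries over to the full expression, the two forms can be put side by side. A small sketch, with function names invented for the comparison:

```c
#include <stdint.h>

/* The expression as written in the question: five operations. */
static inline uint64_t ffz_five_ops(uint64_t i)
{
    return ~i & ~(~i - 1);
}

/* The algebraically simplified form: three operations. */
static inline uint64_t ffz_three_ops(uint64_t i)
{
    return ~i & (i + 1);
}
```

Both isolate the lowest clear bit of i, so for i = 11 (binary 1011) each returns 4, and for an all-ones input each returns 0 because there is no clear bit left.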
Now, I don’t know if you followed that, and it took me a bit of time to get it right, but now let’s look at gcc’s output:
So gcc just went and figured all that out on its own.
The promised note about SSE: It turns out that SSE can do parallel comparisons, even to the point of doing 16 byte-wise comparisons at a time between two 16-byte registers. Condition registers weren’t designed for that, and anyway no-one wants to branch when they don’t have to. So the CPU does actually change one of the SSE registers (a vector of 16 bytes, or 8 “words” or 4 “double words”, whatever the operation says) into a vector of boolean indicators. But it doesn’t use 1 for true; instead, it uses a mask of all 1s. Why? Because it’s likely that the next thing the programmer is going to do with that comparison result is use it to mask out values, which I think is just exactly what you were planning to do with your -(!B) trick, except in the parallel streaming version.
So, rest assured, it’s been covered.
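As a rough illustration of that SSE pattern, the sketch below compares 16 bytes at once with the SSE2 intrinsic _mm_cmpeq_epi8 and then uses the resulting all-ones/all-zeros mask to keep only the matching bytes. The helper name is hypothetical, and the scalar fallback is there only so the example still builds where SSE2 is not available.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[k] = a[k] where a[k] == b[k], otherwise 0. */
static void keep_equal_bytes(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);   /* 0xFF in each equal byte, 0x00 elsewhere */
    _mm_storeu_si128((__m128i *)out, _mm_and_si128(eq, va));  /* mask, no branch */
#else
    for (int k = 0; k < 16; ++k) {
        uint8_t eq = (uint8_t)-(a[k] == b[k]);  /* scalar version of the same mask */
        out[k] = a[k] & eq;
    }
#endif
}
```

The scalar branch is the -(condition) trick from the question applied one byte at a time; the SSE2 branch gets the same mask for all 16 lanes from a single compare instruction.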