First Value:
I have a binary value which is actually a compact series of 2-bit values. (That is, each 2 bits in the binary value represents 0, 1, 2, or 3.) So, for example, 0, 3, 1, 2 becomes 00110110. In this binary string, all I care about are the 3’s (or alternately, I could flip the bits and only care about the 0’s, if that makes your answer easier). All the other numbers are irrelevant (for reasons we’ll get into in a bit).
Second Value:
I have a second binary value which is also a compacted series of 2-bit values represented the same way. It has an identical length to the First Value.
Math:
I want the sum of the 2-bit numbers in the Second Value that have the same position as a 3 from the First Value. In other words, if I have:
First: 11000011
Second: 01111101
Then my answer would be “2” (I added the first number and the last number from “Second” together, because those were the only ones that had a “11” in the First Value that matched them.)
I want to do this in as few clock cycles as possible (either on a GPU or on an x86 architecture). However, I’m generally looking for an algorithm, not an assembler solution. Is there any way faster than masking off two bits at a time from each number and running several loops?
Sure.
Now create a mask from a that contains ‘1’ bit at an odd position only if in a there was ’11’ ending at same position.
Expand it to get the ’11’ pattern back:
So now if a was 1101100011, mask is 1100000011.
Then mask b with the mask:
You can then perform the addition of (masked) numbers from b in parallel:
For a 32-bit number, the sum is now at the lowest bits of b. This is a commonly known pattern for parallel addition of bit fields. For larger than 32-bit numbers, you would add one more round for 64-bit numbers and two rounds for 128-bit numbers.