I'm working on a rules-agnostic poker simulator for fun. While testing enumeration performance, for hands that would always get pulled from the “unique” array, I found an interesting bottleneck. I measured the average computation time of running each of the variations below 1,000,000,000 times, then took the best of 100 repetitions of that to give the HotSpot JIT time to work its magic. What I found was a difference in computation time (6ns vs 27ns) between
    public int getRank7(int ... cards) {
        int q = (cards[0] >> 16) | (cards[1] >> 16) | (cards[2] >> 16) | (cards[3] >> 16) | (cards[4] >> 16) | (cards[5] >> 16) | (cards[6] >> 16);
        int product = ((cards[0] & 0xFF) * (cards[1] & 0xFF) * (cards[2] & 0xFF) * (cards[3] & 0xFF) * (cards[4] & 0xFF) * (cards[5] & 0xFF) * (cards[6] & 0xFF));
        if(flushes[q] > 0) return flushes[q];
        if(unique[q] > 0) return unique[q];
        int x = Arrays.binarySearch(products, product);
        return rankings[x];
    }
and
    public int getRank(int ... cards) {
        int q = 0;
        long product = 1;
        for(int c : cards) {
            q |= (c >> 16);
            product *= (c & 0xFF);
        }
        if(flushes[q] > 0) return flushes[q];
        if(unique[q] > 0) return unique[q];
        int x = Arrays.binarySearch(products, product);
        return rankings[x];
    }
The issue is definitely the for loop, not the addition of handling multiplication at the top of the function. I’m a little baffled by this, since both versions perform the same number of operations. I realized this function will always receive 6 or more cards, so I narrowed the gap by changing the signature to
public int getRank(int c0, int c1, int c2, int c3, int c4, int c5, int ... cards)
But I’m going to have the same bottleneck as the number of cards goes up. Is there any way to get around this fact, and if not, could somebody explain to me why a for loop for the same number of operations is so much slower?
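To make the signature change above concrete, here is a minimal sketch of how the body might look with six fixed parameters and a varargs tail. It only covers the accumulation of `q` and `product`; the table lookups (`flushes`, `unique`, `products`, `rankings`) are left out. The class name `HybridArgsSketch` and the method names `rankBits` and `primeProduct` are hypothetical, chosen for illustration only.

```java
final class HybridArgsSketch {
    // OR together the high 16 bits of every card: six fixed arguments
    // are combined in straight-line code, only the tail is looped over.
    static int rankBits(int c0, int c1, int c2, int c3, int c4, int c5, int... rest) {
        int q = (c0 >> 16) | (c1 >> 16) | (c2 >> 16) | (c3 >> 16) | (c4 >> 16) | (c5 >> 16);
        for (int c : rest) {
            q |= (c >> 16);
        }
        return q;
    }

    // Multiply the low byte of every card. The cast makes the whole
    // chain of multiplications happen in long arithmetic.
    static long primeProduct(int c0, int c1, int c2, int c3, int c4, int c5, int... rest) {
        long product = (long) (c0 & 0xFF) * (c1 & 0xFF) * (c2 & 0xFF)
                * (c3 & 0xFF) * (c4 & 0xFF) * (c5 & 0xFF);
        for (int c : rest) {
            product *= (c & 0xFF);
        }
        return product;
    }
}
```

With this shape, only the cards beyond the sixth go through the loop, which is why the gap narrows but does not disappear as the hand size grows.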
I think you’ll find that the big difference is branching. The for loop version requires a bounds check and a conditional branch on every iteration. Your CPU will try to predict which way each branch goes and pipeline instructions accordingly, but when it mispredicts (at least once per function call, when the loop terminates), the pipeline stalls, which is very expensive.
One thing to try would be a regular for loop with a fixed upper bound (rather than one based on the length of the array); the JIT compiler may unroll such a loop, which would result in the same sequence of operations as your more efficient version.
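As a rough sketch of that suggestion, the loop below uses a compile-time constant as its bound instead of `cards.length`. Whether the JIT actually unrolls it depends on the JVM and its settings, so treat this as something to benchmark rather than a guaranteed fix. The class name `FixedBoundLoopSketch`, the constant `HAND_SIZE`, and the method names are made up for this example, and the table lookups from the question are omitted.

```java
final class FixedBoundLoopSketch {
    // Assumed hand size; the bound is a constant, not cards.length.
    static final int HAND_SIZE = 7;

    // OR together the high 16 bits of exactly HAND_SIZE cards.
    static int rankBitsFixed7(int[] cards) {
        int q = 0;
        for (int i = 0; i < HAND_SIZE; i++) {
            q |= (cards[i] >> 16);
        }
        return q;
    }

    // Multiply the low byte of exactly HAND_SIZE cards, in long arithmetic.
    static long primeProductFixed7(int[] cards) {
        long product = 1;
        for (int i = 0; i < HAND_SIZE; i++) {
            product *= (cards[i] & 0xFF);
        }
        return product;
    }

    // Variable-length version from the question, kept for comparison.
    static long primeProductLoop(int... cards) {
        long product = 1;
        for (int c : cards) {
            product *= (c & 0xFF);
        }
        return product;
    }

    public static void main(String[] args) {
        int[] cards = {0x10002, 0x20003, 0x40005, 0x80007, 0x10000B, 0x20000D, 0x400011};
        System.out.println(rankBitsFixed7(cards) + " " + primeProductFixed7(cards));
    }
}
```

Both product methods return the same value for a seven-card array; the only difference is the loop bound the compiler sees. One fixed-bound method per supported hand size would be needed, which is the trade-off for giving the JIT a constant trip count.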