The code without fission looks like this:
int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[hash(keys[i])];
    }
    return ret;
}
With fission:
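int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    // tmp is not declared in this snippet; it is assumed to be a scratch
    // buffer with room for at least n entries, defined elsewhere.
    for(int i = 0; i < n; ++i){
        tmp[i] = map[hash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}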
Notes:
- The bottleneck is map[hash(keys[i])], which accesses memory randomly.
- Normally, it would be if(tmp[i]) res[ret++] = i; to avoid the if, I’m using ret += tmp[i].
- map[..] is always 0 or 1.
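The equivalence claimed in the second note can be checked with a small sketch. The names compact_branchy and compact_branchless are made up for illustration, and flags stands in for tmp. It relies on every flag being exactly 0 or 1, and on res having room for n entries, since the branchless form stores on every iteration:

```c
/* Branching form: store i only when its flag is set. */
int compact_branchy(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) res[ret++] = i;
    }
    return ret;
}

/* Branchless form: always store i, then advance ret only when the flag
   is 1. A slot written with a rejected i is overwritten by the next store. */
int compact_branchless(int *res, const char *flags, int n) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[ret] = i;
        ret += flags[i];
    }
    return ret;
}
```

Both functions return the same count and agree on the first ret entries of res; entries at or beyond ret are left as scratch by the branchless form and should be ignored.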
The fission version is usually significantly faster and I am trying to explain why. My best guess is that ret += map[..] still introduces some dependency and that prevents speculative execution.
I would like to hear if anyone has a better explanation.
From my tests, I get roughly 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.
(Refer to bottom for the full test code.)
Although I’m not 100% sure, I suspect that this is due to a combination of two things:
- Saturation of the load-store buffer by the cache misses from map[gethash(keys[i])].
- An added dependency on the ret variable.
It’s obvious that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.
Now let’s look at the added dependency. The issue is the ret variable.
The ret variable is needed for address resolution of the store res[ret] = i;.
- In the fused loop, ret is coming from a sure cache miss.
- In the split loop, ret is coming from tmp[i], which is much faster.
This delay in address resolution in the fused loop case likely causes the store res[ret] = i to clog up the load-store buffer along with map[gethash(keys[i])].
Since the load-store buffer has a fixed size, but you have double the junk in it, you are only able to overlap the cache misses half as much as before. Thus the 2x slow-down.
Suppose we changed the fused loop to this:
This will break the address resolution dependency.
(Note that it’s not the same anymore, but it’s just to demonstrate the performance difference.)
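The modified loop itself is not shown above. A minimal sketch of one change that would have this effect, assuming the store is indexed by the loop counter i instead of ret, might look like the following. Both toy_hash and check_no_dep are hypothetical names, and toy_hash is only a stand-in for the real hash function (it assumes a 32-bit unsigned int and a 256-entry map):

```c
/* Hypothetical stand-in for the real hash: maps a key to a slot in a
   256-entry map, assuming 32-bit unsigned int. The multiplier is arbitrary. */
static unsigned toy_hash(int key) {
    return ((unsigned)key * 2654435761u) >> 24;
}

/* Fused loop with the store address decoupled from ret. The address of
   res[i] depends only on i, not on the value loaded from map.
   The result is no longer a compacted list: res[i] == i for every i,
   and only the returned count depends on map. */
int check_no_dep(int *res, const char *map, int n, const int *keys) {
    int ret = 0;
    for (int i = 0; i < n; ++i) {
        res[i] = i;
        ret += map[toy_hash(keys[i])];
    }
    return ret;
}
```

If the explanation above is right, the store no longer has to wait for the map load before its address is known, so it should spend less time occupying the load-store buffer.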
Then we get similar timings:
Here’s the complete test code: