I am writing a program to parse a file. Its main loop reads the file character by character and processes each one. Here is the main loop:
char c;
char * ptr;
for( size_t i = 0; i < size ; ++i )
{
    ptr = ( static_cast<char*>(sentenceMap) + i );
    c = *ptr;
    __builtin_prefetch( ptr + i + 1 );
    // some treatment on ptr and c
}
As you can see, I added a __builtin_prefetch call, hoping to bring the data for the next iteration of my loop into the cache. I tried different addresses (ptr+i+1, ptr+i+2, ptr+i+10), but nothing seems to change.
To measure performance, I use Valgrind's Cachegrind tool, which reports the number of cache misses. On the line c = *ptr, Cachegrind records 632,378 DLmr (last-level cache read misses, L3 here) when __builtin_prefetch is not used. What's weird, though, is that this value does not change, regardless of the argument I pass to __builtin_prefetch.
Is there an explanation for that?
That’s because the hardware is years ahead of you. 🙂
There are hardware prefetchers that are designed to recognize simple patterns and do the prefetching for you. In this case, you have a simple sequential access pattern, which is trivial for the hardware prefetcher to detect.
Manual prefetching only comes in handy when you have access patterns that the hardware cannot predict.
Here’s one such example: Prefetching Examples?
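To make the contrast with your sequential loop concrete, here is a minimal sketch of a pattern where a manual prefetch might pay off: a gather through an index array, where the next address depends on data rather than on a fixed stride. The function name gather_sum and the constant kPrefetchDistance are made up for this illustration, and the distance of 16 is an arbitrary starting point, not a tuned value. It assumes GCC or Clang (for __builtin_prefetch) and that every index is in range. Whether it actually helps depends on the machine and the data, so measure on real hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tuning knob: how many iterations ahead to prefetch.
constexpr std::size_t kPrefetchDistance = 16;

// Sums data[indices[i]] for every i. The order in which `data` is touched
// is dictated by the contents of `indices`, so it is not a simple stride.
std::uint64_t gather_sum(const std::vector<std::uint32_t>& data,
                         const std::vector<std::size_t>& indices)
{
    std::uint64_t sum = 0;
    const std::size_t n = indices.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        if (i + kPrefetchDistance < n)
        {
            // Hint only: request the element needed kPrefetchDistance
            // iterations from now (0 = read access, 1 = low temporal locality).
            __builtin_prefetch(&data[indices[i + kPrefetchDistance]], 0, 1);
        }
        assert(indices[i] < data.size());
        sum += data[indices[i]];
    }
    return sum;
}
```

The indices array itself is still read sequentially, so the hardware prefetcher handles that part; only the dependent load from data gets the explicit hint.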