Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I’d like the example to meet the following criteria:
- It is a simple, small, self-contained example.
- Removing the __builtin_prefetch instruction results in performance degradation.
- Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.
That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn’t be managed without it.
Here’s an actual piece of code that I’ve pulled out of a larger project. (Sorry, it’s the shortest one I could find that had a noticeable speedup from prefetching.)
This code performs a very large data transpose.
This example uses the SSE prefetch instructions, which may be the same ones that GCC emits.
To run this example, you will need to compile it for x64 and have more than 4 GB of memory. You can run it with a smaller data size, but then it will finish too quickly to time.
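As a rough illustration of the technique (this is not the code from the project, just a minimal sketch), a transpose with strided writes and a prefetch issued a few iterations ahead might look like the following. The function name transpose_prefetch, the PREFETCH_DISTANCE value, and the uint64_t element type are assumptions picked for illustration; only the ENABLE_PREFETCH switch mirrors the toggle mentioned here. With arguments (addr, 0, 3), GCC on x86 usually emits prefetcht0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
   The best value depends on the machine and has to be measured. */
#define PREFETCH_DISTANCE 8

/* Transpose a rows x cols row-major matrix `src` into `dst` (cols x rows).
   Reads from src are sequential; writes to dst are strided by `rows`,
   which is the access pattern a hardware prefetcher tends to handle poorly. */
void transpose_prefetch(const uint64_t *src, uint64_t *dst,
                        size_t rows, size_t cols)
{
    assert(src != dst); /* in-place transpose is not supported */

    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
#ifdef ENABLE_PREFETCH
            /* Hint the cache line that will be written PREFETCH_DISTANCE
               iterations from now. A prefetch is only a hint and never
               changes the result of the program. */
            if (c + PREFETCH_DISTANCE < cols)
                __builtin_prefetch(&dst[(c + PREFETCH_DISTANCE) * rows + r], 0, 3);
#endif
            dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

Build it once with -DENABLE_PREFETCH and once without, and time both on a matrix much larger than the last-level cache. Whether the prefetch helps, and by how much, depends on the hardware, so the distance should be tuned by measurement rather than taken from this sketch.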
When I run it with ENABLE_PREFETCH enabled, this is the output:
When I run it with ENABLE_PREFETCH disabled, this is the output:
So there’s a 13% speedup from prefetching.
EDIT:
Here are some more results: