Let’s say I’m writing my own StringBuilder in a compiled language (e.g. C++).
What is the best way to measure the performance of various implementations? Simply timing a few hundred thousand runs yields highly inconsistent results: timings can differ by as much as 15% from one batch to the next, which makes it impossible to accurately assess optimizations whose gains are smaller than that.
I’ve done the following:
- Disable SpeedStep
- Use RDTSC for timing
- Run the process with realtime priority
- Set the affinity to a single CPU core
This stabilized the results somewhat. Any other ideas?
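For reference, here is a minimal sketch of the kind of batch-timing loop described above. It assumes `std::chrono::steady_clock` in place of RDTSC for portability, and `build_string`, `run_batches`, and `BenchResult` are hypothetical names used only for illustration. It reports the fastest and the median batch rather than a single average, on the assumption that those figures are less sensitive to occasional slow batches.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Summary of the per-batch timings, in nanoseconds per call.
struct BenchResult {
    double min_ns;     // fastest batch
    double median_ns;  // middle batch after sorting
};

// Hypothetical stand-in for the code under test: appends to a std::string.
inline std::size_t build_string(std::size_t pieces) {
    std::string s;
    for (std::size_t i = 0; i < pieces; ++i) s += "abc";
    return s.size();
}

// Runs `batches` batches of `calls_per_batch` calls to `fn` (which must
// return a value convertible to std::size_t) and times each batch.
template <typename Fn>
BenchResult run_batches(Fn&& fn, std::size_t batches, std::size_t calls_per_batch) {
    assert(batches > 0 && calls_per_batch > 0);
    using clock = std::chrono::steady_clock;

    std::vector<double> per_call_ns;
    per_call_ns.reserve(batches);

    // Writing results to a volatile discourages the compiler from
    // discarding the calls as dead code.
    volatile std::size_t sink = 0;

    for (std::size_t b = 0; b < batches; ++b) {
        const auto start = clock::now();
        for (std::size_t c = 0; c < calls_per_batch; ++c) {
            sink = sink + fn();
        }
        const auto stop = clock::now();
        const std::chrono::duration<double, std::nano> elapsed = stop - start;
        per_call_ns.push_back(elapsed.count() / static_cast<double>(calls_per_batch));
    }

    std::sort(per_call_ns.begin(), per_call_ns.end());
    return {per_call_ns.front(), per_call_ns[per_call_ns.size() / 2]};
}
```

Calling `run_batches([] { return build_string(16); }, 21, 100000)` would time 21 batches of 100,000 calls each. Comparing `min_ns` between two implementations is one way to look past batch-to-batch noise, though it does not remove it.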
It’s really hard to measure a piece of code precisely. For such requirements, I recommend having a look at Agner Fog’s test suite. With it, you can measure clock cycles and collect other important figures, such as cache misses and branch mispredictions.
I also recommend having a look at the PDF document on Agner’s site. It is invaluable for this kind of micro-optimization.
As a side note, actual performance is not just a function of “clock cycles”. In a real application, cache misses can change everything from one run to the next, so I would optimize for cache misses first. Simply running a piece of code several times over the same region of memory reduces cache misses dramatically, which is what makes precise measurement hard. Tuning the whole application is usually a better idea, IMO. Intel VTune and similar tools are really good for that.
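To illustrate the cache effect, here is a rough sketch that times the same function once after a best-effort cache eviction and once immediately afterwards, when the data is likely still cached. The names `sum_buffer`, `evict_caches`, and `time_cold_and_warm` are hypothetical, and the eviction assumes the scratch buffer passed in is larger than the last-level cache, which the code cannot verify.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Sums a buffer; stands in for any code whose cost depends on memory access.
inline std::uint64_t sum_buffer(const std::vector<std::uint32_t>& data) {
    return std::accumulate(data.begin(), data.end(), std::uint64_t{0});
}

// Writes to every element of `scratch`. If `scratch` is larger than the
// last-level cache (an assumption), this likely evicts other cached data.
inline std::uint64_t evict_caches(std::vector<std::uint32_t>& scratch) {
    assert(!scratch.empty());
    for (auto& x : scratch) x += 1;
    return scratch.front();
}

struct ColdWarm {
    double cold_ns;          // first pass, right after the eviction attempt
    double warm_ns;          // second pass over the same memory
    std::uint64_t checksum;  // sum of `data`, identical for both passes
};

inline ColdWarm time_cold_and_warm(const std::vector<std::uint32_t>& data,
                                   std::vector<std::uint32_t>& scratch) {
    using clock = std::chrono::steady_clock;
    using ns = std::chrono::duration<double, std::nano>;

    // Volatile sink so the eviction pass and the sums are not optimized away.
    volatile std::uint64_t sink = evict_caches(scratch);

    const auto t0 = clock::now();
    const std::uint64_t cold_sum = sum_buffer(data);
    const auto t1 = clock::now();
    const std::uint64_t warm_sum = sum_buffer(data);
    const auto t2 = clock::now();

    sink = cold_sum + warm_sum;
    (void)sink;
    assert(cold_sum == warm_sum);
    return {ns(t1 - t0).count(), ns(t2 - t1).count(), cold_sum};
}
```

On typical hardware the first pass would be expected to take longer than the second, but the size of the gap depends on the buffer sizes, the cache hierarchy, and the prefetcher, so a single pair of timings should be read as an illustration only.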