Common wisdom is that rep movsb is much slower than rep movsd (or, on 64-bit, rep movsq) when performing identical operations. However, I’ve been testing on a few modern machines, and the run times are coming out identical (up to measurement noise) across a huge range of buffer sizes (10 bytes to 2 MB). So far I have tested on only two machines (a 32-bit Intel Atom D510 and a 64-bit AMD FX 8120).
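For reference, the two kinds of copy being compared can be sketched roughly as follows. This is only a minimal sketch, not the code used for the measurements above: it assumes GCC or Clang inline assembly on x86, assumes the direction flag is clear (as the usual ABIs guarantee), and the names copy_bytes, copy_wide, HAVE_X86_ASM, WIDE_BYTES and REP_WIDE are made up for illustration. On other compilers or architectures it falls back to memcpy so that it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: GCC/Clang extended asm on x86; anything else uses memcpy. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_ASM 1
#else
#define HAVE_X86_ASM 0
#endif

#if defined(__x86_64__)
#define WIDE_BYTES 8u
#define REP_WIDE "rep movsq"   /* 8 bytes per iteration */
#else
#define WIDE_BYTES 4u
#define REP_WIDE "rep movsl"   /* AT&T spelling of rep movsd, 4 bytes per iteration */
#endif

/* Copy n bytes one byte at a time with rep movsb. */
static void copy_bytes(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if HAVE_X86_ASM
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);
#endif
}

/* Copy whole words with rep movsd/movsq, then the leftover bytes with rep movsb. */
static void copy_wide(void *dst, const void *src, size_t n)
{
    size_t words = n / WIDE_BYTES;
    size_t rest = n % WIDE_BYTES;
    assert(n == 0 || (dst != NULL && src != NULL));
#if HAVE_X86_ASM
    __asm__ volatile (REP_WIDE
                      : "+D"(dst), "+S"(src), "+c"(words)
                      :
                      : "memory");
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(rest)
                      :
                      : "memory");
#else
    (void)words;
    (void)rest;
    memcpy(dst, src, n);
#endif
}
```

Timing each function in a loop over the same pair of buffers, for a range of sizes, gives the kind of comparison described above.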
- Are there any modern x86 (32- or 64-bit) machines where rep movsb is slower than rep movsd (or rep movsq)?
- If not, what was the last machine where the difference was significant, and how significant was it?
I’m asking this question from a standpoint of wanting to avoid cargo-culting a bunch of tests to break memory up into unaligned head/tail and aligned middle for the sake of using rep movsd or rep movsq if there’s no actual benefit to doing this…
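To make concrete what that extra bookkeeping looks like, here is a minimal sketch of the head/middle/tail calculation for the 4-byte (rep movsd) case. It is an illustration only: the struct copy_split and the function split_for_movsd are hypothetical names, and it aligns on the destination pointer, which is one common choice rather than the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: sizes of the three pieces of one copy. */
struct copy_split {
    size_t head;    /* bytes copied singly until dst is 4-byte aligned */
    size_t middle;  /* count of whole 4-byte words (the rep movsd count) */
    size_t tail;    /* leftover bytes after the last whole word */
};

/* Work out how an n-byte copy to dst splits into head, middle and tail. */
static struct copy_split split_for_movsd(const void *dst, size_t n)
{
    struct copy_split s;
    size_t misalign = (size_t)((uintptr_t)dst & 3u);

    s.head = misalign ? 4u - misalign : 0u;
    if (s.head > n) {
        s.head = n;             /* buffer ends before the first boundary */
    }
    s.middle = (n - s.head) / 4u;
    s.tail = (n - s.head) % 4u;

    /* The three pieces must account for every byte exactly once. */
    assert(s.head + 4u * s.middle + s.tail == n);
    return s;
}
```

A copy routine built on this would issue rep movsb for head, rep movsd for middle and rep movsb again for tail. These are exactly the tests and branches that become pointless if a single rep movsb over the whole buffer runs just as fast.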
Lots of benchmarks here: instlatx64.atw.hu
For example (Intel Core 2 Duo E6700):
Which shows that there is a difference, but it’s tiny.
This one for Sandy Bridge is a little weird:
There seems to be a big difference on some Atoms (it seems to have disappeared with the D5xx, so you just missed it):
I haven’t found such a big difference on anything else that can be considered new.