We have Core2 machines (Dell T5400) with XP64.
We observe that in 32-bit processes memcpy achieves on the order of 1.2 GByte/s, whereas in a 64-bit process it achieves about 2.2 GByte/s (or 2.4 GByte/s with the Intel compiler CRT’s memcpy). The initial reaction might be to explain this away as due to the wider registers available in 64-bit code. However, our own memcpy-like SSE assembly code, which should be using 128-bit-wide loads and stores regardless of whether the process is 32-bit or 64-bit, shows similar upper limits on the copy bandwidth it achieves.
My question is: what is this difference actually due to? Do 32-bit processes have to jump through some extra WOW64 hoops to get at the RAM? Is it something to do with TLBs or prefetchers or… what?
Thanks for any insight.
Also raised on Intel forums.
Of course, you really need to look at the actual machine instructions that are being executed inside the innermost loop of the memcpy, by stepping into the machine code with a debugger. Anything else is just speculation.
My guess is that it probably doesn’t have anything to do with 32-bit versus 64-bit per se; more likely, the faster library routine was written using SSE non-temporal stores.
If the inner loop contains any variation of conventional load-store instructions, then the destination memory must be read into the machine’s cache, modified, and written back out. Since that read is totally unnecessary (the bits being read are overwritten immediately), you can halve the memory traffic to the destination by using the ‘non-temporal’ write instructions, which bypass the caches. That way, the destination memory is just written, making a one-way trip to memory instead of a round trip.
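To make the difference concrete, here is a minimal sketch of the two kinds of inner loop, written with SSE2 intrinsics in C rather than assembly. The function names copy_cached and copy_streaming are made up for illustration, and the sketch assumes both buffers are 16-byte aligned and the length is a multiple of 16. It is not taken from any CRT; it only shows the conventional store (which compiles to movdqa) next to the non-temporal store (movntdq).

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is 16-byte aligned. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Conventional copy: each store goes through the cache, so the
   destination line is first read from memory, then written back later. */
void copy_cached(void *dst, const void *src, size_t bytes)
{
    assert(is_aligned16(dst) && is_aligned16(src) && bytes % 16 == 0);
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* Non-temporal copy: the store bypasses the cache, so the destination
   is only written, never read. */
void copy_streaming(void *dst, const void *src, size_t bytes)
{
    assert(is_aligned16(dst) && is_aligned16(src) && bytes % 16 == 0);
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence(); /* make the streamed stores visible before returning */
}
```

Both functions produce the same bytes in the destination; any difference would only show up in timing, and only for buffers much larger than the cache.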
I don’t know the Intel compiler’s CRT library, so this is just a guess. There’s no particular reason why the 32-bit libCRT can’t do the same thing, but the speedup you quote is in the ballpark of what I would expect just by converting the movdqa instructions to movnt…
Since memcpy is not doing any calculations, it’s always bound by how fast you can read and write memory.
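If you want to compare copy routines yourself, a rough sketch of a throughput measurement follows. The function name memcpy_gbytes_per_s and its parameters are hypothetical, and the approach is deliberately simple: it times repeated memcpy calls with clock(), which reports CPU time and may be coarse on some systems. Use a buffer much larger than the cache (tens of megabytes, say) so the result reflects memory rather than cache bandwidth.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Returns copy throughput in GByte/s (1e9 bytes copied per second),
   or -1.0 if allocation fails or the timer was too coarse to measure. */
double memcpy_gbytes_per_s(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch every page first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        src[0] = (unsigned char)r; /* vary the input between repetitions */
        memcpy(dst, src, bytes);
    }
    clock_t t1 = clock();

    /* Read the result so the copies cannot be discarded as dead code. */
    volatile unsigned char sink = dst[bytes - 1];
    (void)sink;

    free(src);
    free(dst);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (double)bytes * reps / secs / 1e9 : -1.0;
}
```

For example, memcpy_gbytes_per_s((size_t)64 << 20, 20) copies a 64 MiB buffer twenty times. The figure counts bytes copied, not total bus traffic, which is the same convention as the numbers quoted in the question.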