The theoretical maximum memory bandwidth for a Core 2 processor with dual-channel DDR3 memory is impressive: according to the Wikipedia article on the architecture, 10+ or 20+ gigabytes per second. However, stock memcpy() calls do not attain this. (3 GB/s is the highest I’ve seen on such systems.) Likely, this is because the OS vendor cannot tune memcpy() for every processor line based on that processor’s characteristics, so a stock memcpy() implementation has to be reasonable across a wide range of brands and lines.
My question: Is there a freely available, highly tuned memcpy() for Core 2 or Core i7 processors that can be used in a C program? I’m sure that I’m not the only person in need of one, and it would be a big waste of effort for everyone to micro-optimize their own memcpy().
If you specify /arch:SSE2 to MSVC, it should provide you with a tuned memcpy() (at least, mine does).
Failing that, use the SSE aligned load/store intrinsics yourself to copy the memory in large chunks, employing a Duff’s Device of word-sized reads where necessary to handle the head and tail of the data, so that the bulk of the copy starts and ends on an aligned boundary. You’ll need to use the cache management intrinsics as well (prefetches and non-temporal stores, for example _mm_prefetch and _mm_stream_si128) to get good performance.
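A minimal sketch of that approach, assuming an SSE2-capable x86 target and a compiler that ships <emmintrin.h>. The function name sse2_stream_copy is hypothetical, not a library routine. For brevity it uses plain memcpy() for the unaligned head and tail instead of a Duff’s Device, and it uses unaligned loads because the source is not guaranteed to share the destination’s alignment. The 256-byte prefetch distance is a guess that would need tuning per machine.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch and _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using 16-byte non-temporal stores.
   The buffers must not overlap, as with memcpy(). */
void *sse2_stream_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t head;

    assert(dst != NULL && src != NULL);

    /* Head: plain copy until the destination is 16-byte aligned. */
    head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Body: 64 bytes (one cache line on Core 2 / Core i7) per iteration. */
    while (n >= 64) {
        __m128i a, b, c, e;

        /* Assumed prefetch distance; only issued while it stays in bounds. */
        if (n >= 64 + 256)
            _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);

        /* Unaligned loads: the source may not be 16-byte aligned. */
        a = _mm_loadu_si128((const __m128i *)(s + 0));
        b = _mm_loadu_si128((const __m128i *)(s + 16));
        c = _mm_loadu_si128((const __m128i *)(s + 32));
        e = _mm_loadu_si128((const __m128i *)(s + 48));

        /* Non-temporal stores bypass the cache; d is 16-byte aligned here. */
        _mm_stream_si128((__m128i *)(d + 0), a);
        _mm_stream_si128((__m128i *)(d + 16), b);
        _mm_stream_si128((__m128i *)(d + 32), c);
        _mm_stream_si128((__m128i *)(d + 48), e);

        s += 64;
        d += 64;
        n -= 64;
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();

    /* Tail: fewer than 64 bytes remain. */
    memcpy(d, s, n);
    return dst;
}
```

Non-temporal stores tend to pay off only for copies much larger than the cache; for small buffers that are read again soon afterwards, ordinary stores (or the stock memcpy()) are likely to be faster. On 32-bit GCC or Clang, this would need -msse2.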
Your limiting factor is probably cache misses and memory controller bandwidth (the northbridge on Core 2, the on-die controller on Core i7), rather than CPU cycles. Given that there’s always going to be lots of other traffic on the memory bus, I’m usually happy to get to about 90% of theoretical memory bandwidth in such operations.
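To see how close a given memcpy() gets to that figure, a rough measurement along these lines may help. This is only a sketch: the function names ddr_peak_gbps and memcpy_gbps are made up for illustration, clock() has coarse resolution, and the peak formula assumes a 64-bit (8-byte) data bus per channel, which holds for DDR3. For dual-channel DDR3-1333 it gives about 21.3 GB/s.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Theoretical peak in GB/s: mega-transfers per second x 8 bytes per
   64-bit channel x number of channels. */
double ddr_peak_gbps(double mega_transfers_per_s, int channels)
{
    return mega_transfers_per_s * 8.0 * channels / 1000.0;
}

/* Hypothetical helper: bytes copied per second, in GB/s, for `reps`
   copies of a `bytes`-sized buffer. Returns 0.0 if allocation fails or
   the elapsed time is too short for clock() to resolve. */
double memcpy_gbps(size_t bytes, int reps)
{
    unsigned char *src = (unsigned char *)malloc(bytes);
    unsigned char *dst = (unsigned char *)malloc(bytes);
    volatile unsigned char sink = 0;  /* keeps the copies from being optimized away */
    clock_t t0;
    double secs;
    int i;

    if (src == NULL || dst == NULL || bytes == 0) {
        free(src);
        free(dst);
        return 0.0;
    }

    /* Touch both buffers first so page faults are not timed. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    t0 = clock();
    for (i = 0; i < reps; i++) {
        memcpy(dst, src, bytes);
        sink = dst[(size_t)i % bytes];
    }
    secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    (void)sink;

    free(src);
    free(dst);
    return secs > 0.0 ? (double)bytes * reps / secs / 1e9 : 0.0;
}
```

Use a buffer much larger than the last-level cache (say 256 MB) so the result reflects main memory rather than cache. Note that each copied byte is both read and written, so the traffic on the memory bus is at least twice the figure this returns.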