When debugging, I frequently step into the handwritten assembly implementations of memcpy and memset. These usually use streaming instructions where available, along with loop unrolling, alignment handling, etc. I also recently encountered this 'bug' due to a memcpy optimization in glibc.
The question is: why can't the hardware manufacturers (Intel, AMD) recognize the specific cases of rep stos and rep movs and perform the fastest possible fill and copy for their own architecture?
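For concreteness, here is a minimal sketch of what using the two instructions directly looks like from C. It assumes GCC or Clang extended inline assembly on x86 or x86-64 and falls back to the library routines elsewhere. The helper names copy_rep_movsb and fill_rep_stosb are made up for illustration and are not part of glibc or any other library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: forward copy of n bytes using `rep movsb`.
 * Like memcpy, it assumes the regions do not overlap. The direction
 * flag is assumed clear, as the usual x86 calling conventions require. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* (R|E)DI = destination, (R|E)SI = source, (R|E)CX = byte count. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    /* Not x86, or no GNU-style inline asm: use the library routine. */
    return memcpy(dst, src, n);
#endif
}

/* Hypothetical helper: fill n bytes with the low byte of value
 * using `rep stosb`. */
static void *fill_rep_stosb(void *dst, int value, size_t n)
{
    assert(n == 0 || dst != NULL);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    void *d = dst;
    /* (R|E)DI = destination, (R|E)CX = byte count, AL = fill byte. */
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(n)
                     : "a"(value)
                     : "memory");
    return dst;
#else
    return memset(dst, value, n);
#endif
}
```

Whether these are faster or slower than the library memcpy and memset depends on the processor and the copy size, so treat this as something to benchmark, not as a recommended replacement.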
One thing I'd like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance, [Highlighting is mine.]
Reference: Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].