When debugging, I frequently stepped into the handwritten assembly implementation of memcpy and memset.

Question

Asked: May 29, 2026 (2026-05-29T10:19:59+00:00)

When debugging, I frequently stepped into the handwritten assembly implementation of memcpy and memset. These are usually implemented using streaming instructions where available, with loop unrolling, alignment optimization, and so on. I also recently encountered this ‘bug’ caused by a memcpy optimization in glibc.

The question is: why can’t the hardware manufacturers (Intel, AMD) optimize the specific cases of rep stos and rep movs so that they are recognized as such and perform the fastest possible fill and copy on their own architecture?

1 Answer

Editorial Team · Answer 1 · 2026-05-29T10:20:00+00:00

One thing I’d like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance:

Usually, the REP MOVS instruction has a large overhead for choosing
and setting up the right method. Therefore, it is not optimal for
small blocks of data. For large blocks of data, it may be quite
efficient when certain conditions for alignment etc. are met. These
conditions depend on the specific CPU (see page 143). On Intel Nehalem
and Sandy Bridge processors, this is the fastest method for moving
large blocks of data, even if the data are unaligned.

[Highlighting is mine.] Reference: Agner Fog, Optimizing subroutines in assembly language: An optimization guide for x86 platforms, p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].
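
To make the instructions under discussion concrete, here is a minimal sketch of issuing rep movsb and rep stosb directly from C. It assumes an x86-64 target and a GCC- or Clang-compatible compiler (extended inline assembly); the function names copy_rep_movsb and fill_rep_stosb are hypothetical, chosen only for illustration, and this is not how glibc or any particular library implements memcpy or memset. Whether it beats a tuned library routine depends on the CPU and the block size, as the quoted passage notes.

```c
#include <stddef.h>

/* Hypothetical helper: copy n bytes from src to dst with one REP MOVSB.
   Assumes x86-64, GCC/Clang inline asm, and non-overlapping regions.
   The direction flag is clear on function entry under the usual ABIs,
   so the copy proceeds from low to high addresses. */
static inline void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    /* RDI = destination, RSI = source, RCX = byte count; all three are
       modified by the instruction, hence the "+" constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}

/* Hypothetical helper: fill n bytes at dst with the low byte of value
   using one REP STOSB. Same platform assumptions as above. */
static inline void *fill_rep_stosb(void *dst, int value, size_t n)
{
    void *ret = dst;
    /* RDI = destination, RCX = byte count, AL = fill byte. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(value)
                     : "memory");
    return ret;
}
```

Both helpers return the original destination pointer, mirroring the memcpy and memset convention. To compare against the library routines on a given machine, time them over a range of sizes; small blocks are where the setup overhead mentioned in the quote is most likely to show.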
