Consider the following condensed code:
    /* Compile: gcc -pthread -m32 -ansi x.c */
    #include <stdio.h>
    #include <inttypes.h>
    #include <pthread.h>

    static volatile uint64_t v = 0;

    void *func (void *x) {
        __sync_add_and_fetch (&v, 1);
        return x;
    }

    int main (void) {
        pthread_t t;
        pthread_create (&t, NULL, func, NULL);
        pthread_join (t, NULL);
        printf ("v = %"PRIu64"\n", v);
        return 0;
    }
I have a uint64_t variable that I want to increment atomically, because the variable is a counter in a multi-threaded program.
To achieve the atomicity I use GCC’s atomic builtins.
If I compile for an amd64 system (-m64) the produced assembler code is easy to understand.
With lock addq, the processor guarantees that the increment is atomic.
    400660:   f0 48 83 05 d7 09 20    lock addq $0x1,0x2009d7(%rip)
But the same C code produces much more complicated assembly on an ia32 system (-m32):
    804855a:  a1 28 a0 04 08          mov    0x804a028,%eax
    804855f:  8b 15 2c a0 04 08       mov    0x804a02c,%edx
    8048565:  89 c1                   mov    %eax,%ecx
    8048567:  89 d3                   mov    %edx,%ebx
    8048569:  83 c1 01                add    $0x1,%ecx
    804856c:  83 d3 00                adc    $0x0,%ebx
    804856f:  89 ce                   mov    %ecx,%esi
    8048571:  89 d9                   mov    %ebx,%ecx
    8048573:  89 f3                   mov    %esi,%ebx
    8048575:  f0 0f c7 0d 28 a0 04    lock cmpxchg8b 0x804a028
    804857c:  08
    804857d:  75 e6                   jne    8048565 <func+0x15>
Here is what I don’t understand:
- lock cmpxchg8b does guarantee that the changed variable is only written if the expected value still resides in the target address. The compare-and-swap is guaranteed to happen atomically.
- But what guarantees that the reads of the variable at 0x804855a and 0x804855f are atomic?
Probably it does not matter if there was a “dirty read”, but could someone please outline a short proof that there is no problem?
Further: Why does the generated code jump back to 0x8048565 and not 0x804855a? I am positive that this is only correct if other writers, too, only increment the variable. Is this an implied requirement for the __sync_add_and_fetch function?
The initial read with 2 separate mov instructions is not atomic, but it's not in the loop. @interjay's answer explains why this is fine.

Fun fact: the read done by cmpxchg8b would be atomic even without a lock prefix. (But this code does use a lock prefix to make the entire RMW operation atomic, rather than separate atomic load and atomic store.)

It's guaranteed to be atomic due to it being aligned correctly (and it fits on one cache line) and because Intel made the spec this way, see the Intel Architecture manual Vol 1, 4.4.1:
Vol 3A 8.1.1:
Thus by being aligned, it can be read in 1 cycle, and it fits into one cache line, making cmpxchg8b's read atomic.

If the data had been misaligned, the lock prefix would still make it atomic, but the performance cost would be very high because a simple cache-lock (delaying response to MESI Invalidate requests for that one cache line) would no longer be sufficient.

The code jumps back to 0x8048565 (just after the two mov loads, so the loop still includes the copy and the add-1) because v has already been loaded; there is no need to load it again, as CMPXCHG8B will set EDX:EAX to the value in the destination if it fails. The CMPXCHG8B description in the Intel ISA manual Vol. 2A says as much: the instruction compares EDX:EAX with the 64-bit destination; if they are equal, ECX:EBX is stored to the destination, otherwise the destination is loaded into EDX:EAX.

Thus the code needs only to increment the newly returned value and try again.
If we look at this in C code it becomes easier:
The value = dest is actually from the same read that cmpxchg8b used for the compare part. There isn't a separate reload inside the loop.

In fact, C11 atomic_compare_exchange_weak / _strong has this behaviour built-in: it updates the "expected" operand.

So does gcc's modern builtin __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder): it takes the expected value by reference.

With GCC's older, legacy __sync builtins, __sync_val_compare_and_swap returns the old value (instead of the boolean swapped / didn't-swap result you get from __sync_bool_compare_and_swap).