Editorial Team
Asked: May 30, 2026

Consider the following condensed code:

/* Compile: gcc -pthread -m32 -ansi x.c */
#include <stdio.h>
#include <inttypes.h>
#include <pthread.h>

static volatile uint64_t v = 0;

void *func (void *x) {
    __sync_add_and_fetch (&v, 1);
    return x;
}

int main (void) {
    pthread_t t;
    pthread_create (&t, NULL, func, NULL);
    pthread_join (t, NULL);
    printf ("v = %"PRIu64"\n", v);
    return 0;
}

I have a uint64_t variable that I want to increment atomically, because the variable is a counter in a multi-threaded program.
To achieve the atomicity I use GCC’s atomic builtins.

If I compile for an amd64 system (-m64) the produced assembler code is easy to understand.
By using a lock addq, the processor guarantees the increment to be atomic.

 400660:       f0 48 83 05 d7 09 20    lock addq $0x1,0x2009d7(%rip)

But the same C code produces much more complicated assembly on an ia32 system (-m32):

804855a:       a1 28 a0 04 08          mov    0x804a028,%eax
804855f:       8b 15 2c a0 04 08       mov    0x804a02c,%edx
8048565:       89 c1                   mov    %eax,%ecx
8048567:       89 d3                   mov    %edx,%ebx
8048569:       83 c1 01                add    $0x1,%ecx
804856c:       83 d3 00                adc    $0x0,%ebx
804856f:       89 ce                   mov    %ecx,%esi
8048571:       89 d9                   mov    %ebx,%ecx
8048573:       89 f3                   mov    %esi,%ebx
8048575:       f0 0f c7 0d 28 a0 04    lock cmpxchg8b 0x804a028
804857c:       08 
804857d:       75 e6                   jne    8048565 <func+0x15>

Here is what I don’t understand:

  • lock cmpxchg8b does guarantee that the changed variable is only written if the expected value still resides in the target address. The compare-and-swap is guaranteed to happen atomically.
  • But what guarantees that the reads of the variable at 0x804855a and 0x804855f are atomic?

Probably it does not matter if there was a “dirty read”, but could someone please outline a short proof that there is no problem?

Further: Why does the generated code jump back to 0x8048565 and not to 0x804855a? I am positive that this is only correct if other writers, too, only increment the variable. Is this an implied requirement of the __sync_add_and_fetch function?

1 Answer

  1. Editorial Team
    Added an answer on May 30, 2026 at 12:12 pm

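    The initial read with two separate mov instructions is not atomic, but it is outside the loop. @interjay’s answer explains why this is fine. In short, a torn initial value is harmless: if EDX:EAX does not match v when lock cmpxchg8b executes, the compare fails and the instruction reloads EDX:EAX atomically; if it happens to match, the increment is applied to the value v really holds.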


    Fun fact: the read done by cmpxchg8b would be atomic even without a lock prefix. (But this code does use a lock prefix to make the entire RMW operation atomic, rather than separate atomic load and atomic store.)

    The read is guaranteed to be atomic because the operand is correctly aligned (so it also fits in one cache line) and because Intel specifies it that way; see the Intel Architecture manual, Vol. 1, 4.4.1:

    A word or doubleword operand that crosses a 4-byte boundary or a
    quadword operand that crosses an 8-byte boundary is considered
    unaligned and requires two separate memory bus cycles for access.

    Vol 3A 8.1.1:

    The Pentium processor (and newer processors since) guarantees that the
    following additional memory operations will always be carried out
    atomically:

    • Reading or writing a quadword aligned on a 64-bit boundary
    • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

    The P6 family processors (and newer processors since) guarantee that the
    following additional memory operation will always be carried out
    atomically:

    • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line

    Thus, because v is aligned, it can be read in a single bus cycle, and it fits into one cache line, which makes cmpxchg8b’s read atomic.

    If the data had been misaligned, the lock prefix would still make it atomic, but the performance cost would be very high because a simple cache-lock (delaying response to MESI Invalidate requests for that one cache line) would no longer be sufficient.
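The two conditions above (8-byte alignment and staying within one cache line) can be checked for any address, for example 0x804a028 from the question’s disassembly. The following is only a minimal sketch: `is_aligned`, `fits_in_cache_line`, and the 64-byte `CACHE_LINE_SIZE` are names and values assumed for illustration, not something taken from the Intel manual or from GCC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is common on x86 but not guaranteed. */
#define CACHE_LINE_SIZE 64u

/* True if `addr` is a multiple of `align` (`align` must be a power of two). */
bool is_aligned(uintptr_t addr, size_t align) {
    return (addr & (align - 1)) == 0;
}

/* True if the `size`-byte object starting at `addr` lies within one cache line. */
bool fits_in_cache_line(uintptr_t addr, size_t size) {
    return addr / CACHE_LINE_SIZE == (addr + size - 1) / CACHE_LINE_SIZE;
}
```

Under these assumptions, `is_aligned(0x804a028, 8)` and `fits_in_cache_line(0x804a028, 8)` both hold, so v meets both conditions. An 8-byte object starting at offset 60 of a 64-byte line would fail both checks.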


    The code jumps back to 0x8048565 (just after the two mov loads, so the loop still contains the register copy and the add of 1) because v has already been loaded; there is no need to load it again, as CMPXCHG8B sets EDX:EAX to the value in the destination if it fails:

    CMPXCHG8B description from the Intel ISA manual, Vol. 2A:

    Compare EDX:EAX with m64. If equal, set ZF and load ECX:EBX into m64.
    Else, clear ZF and load m64 into EDX:EAX.

    Thus the code only needs to increment the newly returned value and try again.
    Written as C-like pseudocode, this becomes easier to see:

    value = dest;                    // non-atomic but usually won't tear
    while(!CAS8B(&dest,value,value + 1))
    {
        value = dest;                // atomic; part of lock cmpxchg8b
    }
    

    The value = dest inside the loop actually comes from the same read that cmpxchg8b used for the compare part. There isn’t a separate reload inside the loop.

    In fact, C11 atomic_compare_exchange_weak / _strong has this behaviour built-in: it updates the “expected” operand.

    So does gcc’s modern builtin __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder) – it takes the expected value by reference.
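As a rough illustration of the by-reference `expected` behaviour, the retry loop could be written with `__atomic_compare_exchange_n` as below. This is a sketch, not what GCC generates for `__sync_add_and_fetch`: `increment_u64` is a hypothetical helper name, and unlike the disassembly in the question it makes the initial read an atomic (relaxed) load.

```c
#include <stdbool.h>
#include <stdint.h>

/* Atomically increments *dest and returns the new value.
   `increment_u64` is a hypothetical helper, not a GCC builtin. */
uint64_t increment_u64(uint64_t *dest) {
    /* Initial guess; a stale value only makes the first CAS fail. */
    uint64_t expected = __atomic_load_n(dest, __ATOMIC_RELAXED);

    while (!__atomic_compare_exchange_n(dest, &expected, expected + 1,
                                        false, /* strong CAS */
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
        /* On failure the builtin has already stored the current *dest
           into `expected`, so there is nothing to reload here. */
    }
    /* On success `expected` still holds the old value. */
    return expected + 1;
}
```

The empty loop body is the point of the example: the reload the pseudocode spells out as `value = dest` happens inside the builtin.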

    With GCC’s older, obsolete __sync builtins, __sync_val_compare_and_swap returns the old value (whereas __sync_bool_compare_and_swap returns only a boolean swapped / didn’t-swap result).
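The same loop can also be sketched with `__sync_val_compare_and_swap`, which is closer in shape to the -m32 disassembly: a plain initial read, then retries that reuse the value the failed CAS returned. `sync_increment_u64` is a hypothetical name used only for this illustration, and the plain first read mirrors the generated code rather than being a recommendation.

```c
#include <stdint.h>

/* Atomically increments *dest and returns the new value.
   `sync_increment_u64` is a hypothetical helper for illustration. */
uint64_t sync_increment_u64(volatile uint64_t *dest) {
    /* Plain read: on a 32-bit target this may be two loads and can tear. */
    uint64_t expected = *dest;

    for (;;) {
        /* Returns the value *dest held immediately before the operation. */
        uint64_t seen = __sync_val_compare_and_swap(dest, expected, expected + 1);
        if (seen == expected)
            return expected + 1;   /* the swap happened */
        expected = seen;           /* retry with the value the CAS observed */
    }
}
```

If the first read is torn or stale, the first compare either fails, and `expected` is replaced by the atomically observed value, or succeeds because the guess happened to equal the real contents. The result is correct in both cases, which is the argument made for the generated code above.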
