The Archive Base

Editorial Team
Asked: May 26, 2026

Consider a single memory access (a single read or a single write, not read+write)


Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.

The document “Intel® 64 Architecture Memory Ordering White Paper” states that for “Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary” the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc. x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16-byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/…)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer: PDF document lookup, brute-force testing, mathematical proof, or whatever other method you used.

This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html


Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>   // abort()
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%d", (i>>j)&1);

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The CPU in my notebook is a Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads and writes with a granularity of 8 bytes. The output is:

0000    96905702      10512
0001           0          0
0010           0          0
0011          22      12924  Not a single memory access!
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100     3092557       1175  Not a single memory access!
1101           0          0
1110           0          0
1111        1719   99975389
1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 1:36 am

    In the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, section 8.1.1 states:

    The Intel486 processor (and newer processors since) guarantees that
    the following basic memory operations will always be carried out
    atomically:

    • Reading or writing a byte.

    • Reading or writing a word aligned on a 16-bit boundary.

    • Reading or writing a doubleword aligned on a 32-bit boundary.

    The Pentium processor (and newer processors since) guarantees that the
    following additional memory operations will always be carried out
    atomically:

    • Reading or writing a quadword aligned on a 64-bit boundary.

    • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

    The P6 family processors (and newer processors since) guarantee that
    the following additional memory operation will always be carried out
    atomically:

    • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

    Processors that enumerate support for Intel® AVX (by setting the
    feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte
    memory operations performed by the following instructions will always
    be carried out atomically:

    • MOVAPD, MOVAPS, and MOVDQA.
    • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
    • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

    (Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

    Each of the writes x = (v4si){0,0,0,0} and x = (v4si){-1,-1,-1,-1} is probably compiled into a single 16-byte MOVAPS. The address of x is 16-byte aligned. On an Intel processor that supports AVX, these writes are atomic. Otherwise, they are not guaranteed to be atomic.

    On AMD processors, AMD64 Architecture Programmer’s Manual, Section 7.3.2 Access Atomicity states that

    Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor
    model, as are misaligned loads or stores of less than a quadword that are contained entirely within a
    naturally-aligned quadword. Misaligned load or store accesses typically incur a small latency penalty.
    Model-specific relaxations of this quadword atomicity boundary, with respect to this latency penalty,
    may be found in a given processor’s Software Optimization Guide.
    Misaligned accesses can be subject to interleaved accesses from other processors or cache-coherent
    devices which can result in unintended behavior. Atomicity for misaligned accesses can be achieved
    where necessary by using the XCHG instruction or any suitable LOCK-prefixed instruction.
    Processors that report CPUID Fn0000_0001_ECX[AVX](bit 28) = 1 extend the atomicity for
    cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

    That is, on processors that support AVX, AMD, like Intel, guarantees that cacheable, naturally aligned 16-byte loads and stores are atomic.

    On Intel and AMD processors that don’t support AVX, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).

    EDIT: Test program results

    (Test program modified to increase #iterations by a factor of 10)

    On a Xeon X3450 (x86-64):

    0000   999998139       1572
    0001           0          0
    0010           0          0
    0011           0          0
    0100           0          0
    0101           0          0
    0110           0          0
    0111           0          0
    1000           0          0
    1001           0          0
    1010           0          0
    1011           0          0
    1100           0          0
    1101           0          0
    1110           0          0
    1111        1861  999998428
    

    On a Xeon 5150 (32-bit):

    0000   999243100     283087
    0001           0          0
    0010           0          0
    0011           0          0
    0100           0          0
    0101           0          0
    0110           0          0
    0111           0          0
    1000           0          0
    1001           0          0
    1010           0          0
    1011           0          0
    1100           0          0
    1101           0          0
    1110           0          0
    1111      756900  999716913
    

    On an Opteron 2435 (x86-64):

    0000   999995893       1901
    0001           0          0
    0010           0          0
    0011           0          0
    0100           0          0
    0101           0          0
    0110           0          0
    0111           0          0
    1000           0          0
    1001           0          0
    1010           0          0
    1011           0          0
    1100           0          0
    1101           0          0
    1110           0          0
    1111        4107  999998099
    

    Note that the Intel Xeon X3450 and Xeon 5150 don’t support AVX. The Opteron 2435 is an AMD processor (K10 "Istanbul") that also does not support AVX.

    Does this mean that Intel and/or AMD guarantee that 16-byte memory accesses are atomic on these machines? IMHO, it does not. It is not documented as guaranteed architectural behavior, so one cannot know whether 16-byte memory accesses really are atomic on these particular processors or whether the test program merely fails to trigger a torn access for one reason or another. Relying on it is therefore dangerous.

    EDIT 2: How to make the test program fail

    Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:

    0000   999998634       5990
    0001           0          0
    0010           0          0
    0011           0          0
    0100           0          0
    0101           0          0
    0110           0          0
    0111           0          0
    1000           0          0
    1001           0          0
    1010           0          0
    1011           0          0
    1100           0          1  Not a single memory access!
    1101           0          0
    1110           0          0
    1111        1366  999994009
    

    So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

    EDIT 3: ASM for the thread functions, on request of "GJ."

    Here’s the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:

    
    .globl thread2
            .type   thread2, @function
    thread2:
    .LFB537:
            .cfi_startproc
            movdqa  .LC3(%rip), %xmm1
            xorl    %eax, %eax
            .p2align 5,,24
            .p2align 3
    .L11:
            movaps  x(%rip), %xmm0
            incl    %eax
            movaps  %xmm1, x(%rip)
            movmskps        %xmm0, %edx
            movslq  %edx, %rdx
            incl    n2(,%rdx,4)
            cmpl    $1000000000, %eax
            jne     .L11
            xorl    %eax, %eax
            ret
            .cfi_endproc
    .LFE537:
            .size   thread2, .-thread2
            .p2align 5,,31
    .globl thread1
            .type   thread1, @function
    thread1:
    .LFB536:
            .cfi_startproc
            pxor    %xmm1, %xmm1
            xorl    %eax, %eax
            .p2align 5,,24
            .p2align 3
    .L15:
            movaps  x(%rip), %xmm0
            incl    %eax
            movaps  %xmm1, x(%rip)
            movmskps        %xmm0, %edx
            movslq  %edx, %rdx
            incl    n1(,%rdx,4)
            cmpl    $1000000000, %eax
            jne     .L15
            xorl    %eax, %eax
            ret
            .cfi_endproc
    

    and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:

    
    .LC3:
            .long   -1
            .long   -1
            .long   -1
            .long   -1
            .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
            .section        .note.GNU-stack,"",@progbits
    

    Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS. That does not matter, though: with -march=core2 GCC uses MOVDQA for storing to x, and I can still reproduce the failures.



© 2021 The Archive Base. All Rights Reserved