Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.
The document “Intel® 64 Architecture Memory Ordering White Paper” states that for “Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary” the memory operation appears to execute as a single memory access regardless of memory type.
The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? Is so, which particular type of CPU is it (Core2/Atom/K8/Phenom/…)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer – PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.
This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html
Update:
I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.
// Compile with:
// gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.
#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;
unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));
void* thread1(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n1[mask]++;
x = (v4si){0,0,0,0};
}
return NULL;
}
void* thread2(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n2[mask]++;
x = (v4si){-1,-1,-1,-1};
}
return NULL;
}
int main() {
// Check memory alignment
if ( (((uintptr_t)&x) & 0x0f) != 0 )
abort();
memset(n1, 0, sizeof(n1));
memset(n2, 0, sizeof(n2));
pthread_t t1, t2;
pthread_create(&t1, NULL, thread1, NULL);
pthread_create(&t2, NULL, thread2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
for (unsigned i=0; i<16; i++) {
for (int j=3; j>=0; j--)
printf("%d", (i>>j)&1);
printf(" %10u %10u", n1[i], n2[i]);
if(i>0 && i<0x0f) {
if(n1[i] || n2[i])
printf(" Not a single memory access!");
}
printf("\n");
}
return 0;
}
The CPU I have in my notebook is Core Duo (not Core2). This particular CPU fails the test, it implements 16-byte memory read/writes with a granularity of 8 bytes. The output is:
0000 96905702 10512
0001 0 0
0010 0 0
0011 22 12924 Not a single memory access!
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 3092557 1175 Not a single memory access!
1101 0 0
1110 0 0
1111 1719 99975389
In the Intel® 64 and IA-32 Architectures Developer’s Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.1.1 that:
Each of the writes
x = (v4si){0,0,0,0}andx = (v4si){-1,-1,-1,-1}are probably compiled into a single 16-byteMOVAPS. The address ofxis 16-byte aligned. On an Intel processor that supports AVX, these writes are atomic. Otherwise, they are not atomic.On AMD processors, AMD64 Architecture Programmer’s Manual, Section 7.3.2 Access Atomicity states that
That is, AMD processors, similarly to Intel, do guarantee that for processors supporting AVX instructions 16-byte atomicity is provided by 16-byte load and store instructions.
On Intel and AMD processors that don’t support AVX, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
EDIT: Test program results
(Test program modified to increase #iterations by a factor of 10)
On a Xeon X3450 (x86-64):
On a Xeon 5150 (32-bit):
On an Opteron 2435 (x86-64):
Note that the Intel Xeon X3450 and Xeon 5150 don’t support AVX. The Opteron 2435 is an AMD processor (K10 "Istanbul") that also does not support AVX.
Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It’s not in the documentation as guaranteed architectural behavior, and thus one cannot know if on these particular processors 16 byte memory accesses really are atomic or whether the test program merely fails to trigger them for one reason or another. And thus relying on it is dangerous.
EDIT 2: How to make the test program fail
Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:
So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.
EDIT 3: ASM for the thread functions, on request of "GJ."
Here’s the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:
and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:
Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native which makes GCC prefer MOVAPS; but it doesn’t matter, if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.