Assume I have the following code:
    int x[200];

    void thread1() {
        for (int i = 0; i < 100; i++)
            x[i*2] = 1;
    }

    void thread2() {
        for (int i = 0; i < 100; i++)
            x[i*2 + 1] = 1;
    }
Is this code correct under the x86-64 memory model (from what I understand, it is), assuming the page is mapped with the default write cache policy in Linux? What is the performance impact of such code (from what I understand, none)?
PS. As for performance, I am mostly interested in Sandy Bridge.
EDIT: As for my expectation: I want to write to aligned locations from different threads. After both threads have finished and a barrier has been passed, I expect x to contain {1,1,1,...} rather than {0,1,0,1,...} or {1,0,1,0,...}.
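To make the expectation concrete, it can be written as a small self-checking sketch. This is only an illustration, assuming C++11 `std::thread`; the helper name `count_ones_after_join` is hypothetical, and `join()` stands in for the barrier mentioned above.

```cpp
#include <thread>

int x[200];  // global array, as in the question

void thread1() {
    for (int i = 0; i < 100; i++)
        x[i * 2] = 1;      // even indices only
}

void thread2() {
    for (int i = 0; i < 100; i++)
        x[i * 2 + 1] = 1;  // odd indices only
}

// Hypothetical helper: runs both writers concurrently, waits for them,
// then counts how many of the 200 elements hold the value 1.
int count_ones_after_join() {
    for (int &v : x) v = 0;  // reset so the check can be repeated

    std::thread a(thread1);
    std::thread b(thread2);
    a.join();  // completion of each thread synchronizes with join(),
    b.join();  // which plays the role of the barrier here

    int ones = 0;
    for (int v : x) ones += (v == 1);
    return ones;
}
```

If the expectation holds, `count_ones_after_join()` returns 200 on every run. The two threads never touch the same array element, so each write goes to a distinct memory location even though neighbouring elements share a cache line.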
If I understand correctly, the writes will eventually propagate via snoop requests. Sandy Bridge uses QuickPath between cores, so the snooping would not hit the FSB but would use a much quicker interconnect. As it is not based on cache-invalidation-on-write, it should be 'fairly' quick, although I wasn't able to find the overhead of conflict resolution (it is probably lower than an L3 write).
EDIT: According to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, a clean hit costs 43 cycles and a dirty hit costs 60 cycles (compared with the normal latency of 4 cycles for L1, 12 for L2 and 26-31 for L3).
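One way to check whether these per-access costs show up in practice is to time the interleaved layout against a partitioned one, where each thread writes its own contiguous half. The following is a rough measurement sketch rather than a proper benchmark, assuming C++11; the names `y`, `writer` and `time_two_writers` and the repetition count are hypothetical, and `volatile` is used only to keep the compiler from collapsing the repeated stores.

```cpp
#include <chrono>
#include <thread>

volatile int y[200];  // volatile so every store is actually issued

// Writes 100 elements starting at `start` with the given stride, `reps` times.
void writer(int start, int stride, int reps) {
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < 100; i++)
            y[start + i * stride] = 1;
}

// Runs two writers concurrently and returns the elapsed wall time in ms.
// The first writer always starts at index 0; the second starts at `start2`.
double time_two_writers(int start2, int stride, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(writer, 0, stride, reps);
    std::thread b(writer, start2, stride, reps);
    a.join();
    b.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Interleaved (the layout from the question): even vs. odd indices,
// so every cache line is written by both threads.
double time_interleaved(int reps) { return time_two_writers(1, 2, reps); }

// Partitioned: indices 0..99 vs. 100..199. With 4-byte ints and 64-byte
// lines, at most the line around the boundary is shared.
double time_partitioned(int reps) { return time_two_writers(100, 1, reps); }
```

Comparing `time_interleaved(reps)` with `time_partitioned(reps)` for a large `reps` (say, a few million) gives a rough idea of how much the shared cache lines cost on a given machine. The result depends on whether the two threads are scheduled on different cores, so pinning the threads would make the comparison more meaningful.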