It's been noted that accesses to data elements that fall in the same cache line perform badly due to the ping-pong effect.
However, the code I wrote doesn't show this behaviour, and neither does a run under valgrind --tool=cachegrind. I would appreciate any insights into why.
Below is the function that each pthread executes:
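/* Thread entry point; shared[] is a global array declared elsewhere (not shown). */
void *test_cache(void *arg)
{
    long id = (long) arg;
    uint32_t idx = (uint32_t) id;
    uint32_t ctr = 0;
    uint32_t total_sum = 0;
    for (; ctr < 500000; ++ctr)
    {
        total_sum += shared[idx];               /* plain read of this thread's slot */
        AO_fetch_and_add(&shared[idx], idx);    /* atomic add on the same slot */
    }
    printf("%ld %u,\n", id, (unsigned) total_sum);  /* %ld for long, %u for unsigned */
    return NULL;                                /* pthread start routines return void * */
}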
If you are running on a single dual-core chip, both threads are probably hitting a shared cache. You need separate physical CPUs to see the ping-pong effect. Please include your hardware spec in the question.
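One way to check this on your own hardware is to time the same loop twice natively: once with the per-thread counters adjacent, once with each counter on its own cache line. Time it natively because Valgrind runs threads one at a time, so a cachegrind run will not show cross-core effects. What follows is only a rough sketch, not your original code: it uses C11 atomics instead of AO_fetch_and_add, and the 64-byte line size, the two-thread count, and the names packed, padded_counter, bump and run are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define NTHREADS   2
#define ITERATIONS 500000u
#define LINE_BYTES 64   /* ASSUMED cache-line size; check your CPU */

/* Adjacent counters, aligned so that both sit in the same (assumed) line. */
static _Alignas(LINE_BYTES) _Atomic uint32_t packed[NTHREADS];

/* One counter per (assumed) cache line: the alignment pads the struct to 64 bytes. */
struct padded_counter {
    _Alignas(LINE_BYTES) _Atomic uint32_t value;
};
static struct padded_counter padded[NTHREADS];

/* Each thread atomically increments only its own counter. */
static void *bump(void *arg)
{
    _Atomic uint32_t *counter = arg;
    for (uint32_t i = 0; i < ITERATIONS; ++i)
        atomic_fetch_add(counter, 1);
    return NULL;
}

/* Runs NTHREADS threads on the packed or padded counters and
 * returns the elapsed wall-clock time in seconds. */
static double run(int use_padding)
{
    pthread_t threads[NTHREADS];
    struct timespec start, end;

    /* Reset the counters before the timed region. */
    for (int t = 0; t < NTHREADS; ++t)
        atomic_store(use_padding ? &padded[t].value : &packed[t], 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int t = 0; t < NTHREADS; ++t) {
        _Atomic uint32_t *c = use_padding ? &padded[t].value : &packed[t];
        int rc = pthread_create(&threads[t], NULL, bump, c);
        assert(rc == 0);
        (void) rc;
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(threads[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double) (end.tv_sec - start.tv_sec)
         + (double) (end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Call run(0) and run(1) from main, print both times, and build with -pthread. If the two numbers come out about the same, the threads are probably sharing a cache (or landing on the same core). With only 500000 iterations, thread start-up may dominate the measurement, so you may need to raise ITERATIONS to see a clear difference.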