So, I just got a new 16-core server with AMD 6212 processors. I have code that I’ve also run on a variety of Intel processors. It uses locked queues to distribute work to pthreads, which then write their results back to shared memory, with locks on the writes as well. The workload is primarily compute bound.
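For readers who want something concrete to reason about, here is a minimal sketch of the kind of structure described above: a mutex-protected queue handing out work items to pthreads, with a second mutex around the writes to shared results. This is not the actual code. Every name (work_pool, compute, pool_worker, run_pool) is made up for illustration, and the "work" is just a sum of squares standing in for the real computation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* All names here are hypothetical; the real program is not shown in the post. */
typedef struct {
    pthread_mutex_t queue_lock;   /* guards next_item */
    pthread_mutex_t result_lock;  /* guards writes to results[] */
    int next_item;                /* index of the next unclaimed work item */
    int num_items;
    long *results;                /* shared output array, one slot per item */
} work_pool;

/* Stand-in for the compute-bound work: sum of squares 0..item. */
static long compute(int item) {
    long acc = 0;
    for (int i = 0; i <= item; i++)
        acc += (long)i * i;
    return acc;
}

static void *pool_worker(void *arg) {
    work_pool *pool = arg;
    for (;;) {
        int item = -1;

        /* Take the next item off the "queue" under its lock. */
        pthread_mutex_lock(&pool->queue_lock);
        if (pool->next_item < pool->num_items)
            item = pool->next_item++;
        pthread_mutex_unlock(&pool->queue_lock);

        if (item < 0)
            break;  /* no work left */

        long value = compute(item);  /* done outside any lock */

        /* Write the result back to shared memory under a lock. */
        pthread_mutex_lock(&pool->result_lock);
        pool->results[item] = value;
        pthread_mutex_unlock(&pool->result_lock);
    }
    return NULL;
}

/* Runs num_items work items on num_threads threads.
   Returns 0 on success, -1 if allocation or thread creation failed. */
int run_pool(int num_threads, int num_items, long *results) {
    assert(num_threads > 0 && num_items >= 0 && results != NULL);

    pthread_t *threads = malloc(sizeof *threads * (size_t)num_threads);
    if (threads == NULL)
        return -1;

    work_pool pool;
    pool.next_item = 0;
    pool.num_items = num_items;
    pool.results = results;
    pthread_mutex_init(&pool.queue_lock, NULL);
    pthread_mutex_init(&pool.result_lock, NULL);

    int started = 0;
    while (started < num_threads &&
           pthread_create(&threads[started], NULL, pool_worker, &pool) == 0)
        started++;

    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    pthread_mutex_destroy(&pool.queue_lock);
    pthread_mutex_destroy(&pool.result_lock);
    free(threads);
    return started == num_threads ? 0 : -1;
}
```

Calling run_pool(4, 8, results) with a long results[8] fills each slot with the sum of squares up to its index. Timing the same call with different thread counts is one way to reproduce the kind of scaling comparison described here.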
On Intel processors, as I increase the number of threads, my performance immediately increases. Going from 1 to 2 threads nearly doubles performance.
With the same code on the AMD processors, I get no gain (in fact, a slight slowdown) even with 4 threads. But when I use 128 threads, I see a 6x speedup.
Does anyone have an idea why that might be happening?
As for the OS specs, if I type:
cat /proc/version
I get:
Linux version 2.6.32-5-amd64 (Debian 2.6.32-39) (dannf@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu Nov 3 03:41:26 UTC 2011
So, this hasn’t been 100% resolved yet, but it appears that the problem was related to memory access. The code does no dynamic memory allocation during most of the run, but each thread was allocating about 100 small chunks of memory on the heap when it started up. In small sample versions of the program, I was able to eliminate the bottleneck by allocating the memory for each thread on the stack instead of the heap.
Without looking too deeply into architecture issues, it appears that the heap allocations from different threads might have been interleaved, so that the threads ended up sharing the same regions of memory, thus destroying parallel performance.
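As a rough illustration of the workaround, here is a minimal sketch in which each thread keeps its scratch data in an array on its own stack instead of in many small heap chunks allocated at startup. It is a sketch under assumptions, not the real program: the names (scratch_arg, scratch_worker, run_stack_scratch), the buffer size, and the number of rounds are all invented, and the sketch does not establish whether the original slowdown came from shared cache lines, from the allocator, or from something else.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define SCRATCH_LEN 100  /* assumed size, echoing "about 100 small chunks" */
#define ROUNDS 1000      /* arbitrary amount of repeated work */

/* Hypothetical per-thread argument: an id in, one result out. */
typedef struct {
    int id;
    long result;
} scratch_arg;

static void *scratch_worker(void *p) {
    scratch_arg *arg = p;
    int id = arg->id;  /* copy to a local so the hot loop touches only the stack */

    /* Scratch space lives on this thread's own stack, so it cannot be
       interleaved with another thread's small heap allocations. */
    long scratch[SCRATCH_LEN] = {0};

    for (int round = 0; round < ROUNDS; round++)
        for (int i = 0; i < SCRATCH_LEN; i++)
            scratch[i] += id + 1;

    long sum = 0;
    for (int i = 0; i < SCRATCH_LEN; i++)
        sum += scratch[i];

    arg->result = sum;  /* single write to shared memory at the very end */
    return NULL;
}

/* Runs num_threads workers and copies each result into out[i].
   Returns 0 on success, -1 if allocation or thread creation failed. */
int run_stack_scratch(int num_threads, long *out) {
    assert(num_threads > 0 && out != NULL);

    pthread_t *threads = malloc(sizeof *threads * (size_t)num_threads);
    scratch_arg *args = malloc(sizeof *args * (size_t)num_threads);
    if (threads == NULL || args == NULL) {
        free(threads);
        free(args);
        return -1;
    }

    int started = 0;
    for (; started < num_threads; started++) {
        args[started].id = started;
        args[started].result = 0;
        if (pthread_create(&threads[started], NULL, scratch_worker,
                           &args[started]) != 0)
            break;
    }

    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        out[i] = args[i].result;
    }

    free(threads);
    free(args);
    return started == num_threads ? 0 : -1;
}
```

To compare against the heap version, one could replace the stack array with SCRATCH_LEN separate malloc calls of one long each inside scratch_worker and time both variants at the same thread count.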