I benchmarked Eigen's SGEMM operation using one thread and using 8 threads, and found that performance peaked at 512×512 but then dropped once the matrices exceeded that size. Is there a specific reason for this, perhaps something to do with the complexity of the larger matrices? I looked at the matrix-matrix benchmarks on the Eigen website but didn't see anything similar.
At 512×512 the parallel run was about 4x faster than the single-threaded one, but at 4096×4096 it was barely 2x faster. I am using OpenMP for parallelism, and to bring it down to one thread I set num_of_threads to two.
Your results suggest that this algorithm becomes primarily memory-bandwidth bound at large matrix sizes. A single 4K×4K matrix of floats (which is what SGEMM implies) takes 64 MiB, which exceeds the cache size of any CPU available to mere mortals, while a 512×512 matrix takes only 1 MiB and will fit comfortably into the L3 cache of most modern CPUs.