I just wrote my first OpenMP program, which parallelizes a simple for loop. I ran the code on my dual-core machine and saw some speedup when going from 1 thread to 2 threads. However, when I ran the same code on a school Linux server, I saw no speedup. After trying different things, I finally realized that removing some useless printf statements gave the code a significant speedup. Below is the main part of the code that I parallelized:
#pragma omp parallel for private(i)
for (i = 2; i <= n; i++)
{
    printf("useless statement");
    prime[i-2] = is_prime(i);
}
I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?
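Speculating, but maybe stdout is guarded by a lock? If so, threads that call printf at the same time would have to wait for one another, which serializes that part of the loop.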
In general, printf is an expensive operation because it interacts with other resources (such as files and the console).
My empirical experience is that printf is very slow on a Windows console, comparatively much faster on a Linux console, and fastest of all when redirected to a file or /dev/null.
I’ve found that printf-debugging can seriously impact the performance of my apps, and I use it sparingly.
Try running your application with its output redirected to a file or to /dev/null to see if this has any appreciable impact; this will help narrow down where the problem lies.
Of course, if the printfs are useless, why are they in the loop at all?
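If some of the output is actually wanted, one option is to keep all I/O out of the parallel loop and print from a single thread once the results are in. The following is only a rough sketch of that idea, not the asker's actual program: is_prime here is a hypothetical trial-division stand-in for the function in the question, and fill_primes and print_primes are names made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the question's is_prime(): plain trial division. */
int is_prime(int x)
{
    if (x < 2)
        return 0;
    for (int d = 2; (long)d * d <= x; d++)
        if (x % d == 0)
            return 0;
    return 1;
}

/* Fill prime[0..n-2] with the results for 2..n.
 * The loop body does no I/O, so threads never contend for stdout.
 * Each iteration writes to its own array slot, so no locking is needed. */
void fill_primes(int *prime, int n)
{
    int i; /* the loop variable of a parallel for is private automatically */

    assert(prime != NULL);

    #pragma omp parallel for
    for (i = 2; i <= n; i++)
        prime[i - 2] = is_prime(i);
}

/* Print the primes serially, after the parallel region has finished.
 * Returns how many primes were printed. */
int print_primes(const int *prime, int n)
{
    int count = 0;

    for (int i = 2; i <= n; i++) {
        if (prime[i - 2]) {
            printf("%d\n", i);
            count++;
        }
    }
    return count;
}
```

A caller would allocate an int array of n - 1 entries, call fill_primes(prime, n), and then call print_primes(prime, n). Built with OpenMP enabled (for example -fopenmp with GCC), only the compute loop runs in parallel; built without it, the pragma is ignored and the code still runs serially.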