Code:
double x(){return (double)rand()/(double)RAND_MAX;}
double y(){return (double)rand()/(double)RAND_MAX;}
double z(){return (double)rand()/(double)RAND_MAX;}
int d(double x, double y, double z){
if ( ( (pow(x,2)+pow(y,2)) <1 ) && ( z<=1 && z>=0 )) return 1;
return 0;
}
double f(double x, double y, double z){
return 1;
}
#pragma omp parallel default(none) private(id,numt,j,local_sum,local_good_dots,local_coi,x_,y_,z_) shared(total_sum,good_dots,count_of_iterations)
{
local_coi = count_of_iterations;
id = omp_get_thread_num() + 1;
numt = omp_get_num_threads();
#pragma omp for
for (j = 1; j <= local_coi; j++){
x_=x();
y_=y();
z_=z();
if (d(x_,y_,z_) == 1){
local_sum += f(x_,y_,z_);
local_good_dots += 1;
}
}
#pragma omp critical
{
total_sum = total_sum + local_sum;
good_dots = good_dots + local_good_dots;
}
}
Comment: This code is a realization of Monte-Carlo method for calculation of three-dimensional integral of function f() in area d().
I expect, that this code will work faster in multi-thread mode (openmp).
But something go wrong.
After several hours of modifications (reduction in openmp pragma, simplifactions of if-condition (like f(x_,y_,z_) * d(x_,y_,z_))) i’d not understood, why this simple loop become more slower on bigger numbers of threads.
But after i generate a 3-dimension array for each coordinate before loop and drop it in shared, my program become more faster.
So, question:
How to modificate this code and which functions(operations) are allowed in parallel-blocks?
P.S: as i see, that rand function are not allowed (or i’m wrong?)
Thanks for help!
Modification (with @HristoIliev’s help)
double x(){return (double)rand()/(double)RAND_MAX;}
double y(){return (double)rand()/(double)RAND_MAX;}
double z(){return (double)rand()/(double)RAND_MAX;}
int d(double x, double y, double z){
if ( ( (pow(x,2)+pow(y,2)) <1 ) && ( z<=1 && z>=0 )) return 1;
return 0;
}
double f(double x, double y, double z){
return 1;
}
#pragma omp parallel default(none) private(j,local_coi,x_,y_,z_) shared(count_of_iterations) reduction(+:total_sum,good_dots)
{
local_coi = count_of_iterations;
#pragma omp for(prng)
for (j = 1; j <= local_coi; j++){
#pragma omp critical(prng)
{
x_=x();
y_=y();
z_=z();
}
if (d(x_,y_,z_) == 1){
total_sum += f(x_,y_,z_);
good_dots += 1;
}
}
}
The random number generator
rand()uses a global statically allocated state, shared by all threads and is thus not thread safe. Using it from multiple threads you run into a very bad case of unprotected access to shared variables which trashes the cache and slows the program down. You should be usingrand_r()orerand48()instead – they use separate state storages that you have to provide. You have to declare one state per thread (e.g. have itprivate), basically creating different PRNG for each thread. Then you have to seed them accordingly, otherwise you would get statistically bad results. In principle you can use the output of onerand48()generator to seed the others – it should be enough in order to get moderately long uncorrelated sequences.Here is a sample implementation using
rand_r()(not that this is a very bad generator for Monte Carlo simulations,erand48is better and the best would be to use the “Mersenne Twister” type generator from GNU Scientific Library if available):This is just a very bad (from quality point of view) implementation, but it should give you the idea of how things work. It is also how you can achieve thread safety with minimal changes to your original code. The basic points are:
prng_stateis made private to each thread by the OpenMPthreadprivatedirective;rand_r()with the thread-specific state variable is used instead ofrand()inx(),y()andz();prng_state = 67894 + 1337*id;, so that different threads would (hopefully) get uncorrelated streams of pseudo-random numbers.Note that
rand()andrand_r()are of such bad quality that this is just an academic example. With longer PRNG sequences you would get correlated streams in the different threads which would spoil the statistics. I leave it up to you to rewrite the code usingerand48().To answer your initial question – all thread-safe function calls are allowed inside a
parallelblock. You could also call non-thread-safe functions but you have to protect the calls inside (named)criticalconstructs, e.g.:This would ensure that no calls to
rand()would be made in parallel. But you would still have read-modify-write kind of access to the shared state, hence the cache-related slowdown.Also, do not try to reimplement OpenMP
reductionor similar constructs. Compiler vendors are already putting tremendous efforts into making sure they are implemented in the best (read fastest) way possible.