I have a loop which updates a matrix A and I want to make it openmp but I’m not sure what variables should be shared and private. I would have thought just ii and jj would work but it doesn’t. I think I need an !$OMP ATOMIC UPDATE somewhere too…
The loop just calculates the distance between N and N-1 particles and updates a matrix A.
!$OMP PARALLEL DO PRIVATE(ii,jj)
do ii=1,N-1
do jj=ii+1,N
distance_vector=X(ii,:)-X(jj,:)
distance2=sum(distance_vector*distance_vector)
distance=DSQRT(distance2)
coff=distance*distance*distance
PE=PE-M(II)*M(JJ)/distance
A(jj,:)=A(jj,:)+(M(ii)/coff)*(distance_vector)
A(ii,:)=A(ii,:)-(M(jj)/coff)*(distance_vector)
end do
end do
!$OMP END PARALLEL DO
The golden rule of OpenMP is that all variables (with some exclusions), that are defined in an outer scope, are shared by default in the parallel region. Since in Fortran before 2008 there are no local scopes (i.e. there is no
BLOCK ... END BLOCKin earlier versions), all variables (exceptthreadprivateones) are shared, which is very natural for me (unlike Ian Bush, I am not a big fan of usingdefault(none)and then redeclaring the visibility of all 100+ local variables in various complex scientific codes).Here is how to determine the sharing class of each variable:
N– shared, because it should be the same in all threads and they only read its value.ii– it is the counter of loop, subject to a worksharing directive, so its sharing class is predetermined to beprivate. It doesn’t hurt to explicitly declare it in aPRIVATEclause, but that is not really necessary.jj– loop counter of a loop, which is not subject to a worksharing directive, hencejjshould beprivate.X– shared, because all threads reference and only read from it.distance_vector– obviously should beprivateas each thread works on different pairs of particles.distance,distance2, andcoff– ditto.M– should be shared for the same reasons asX.PE– acts as an accumulator variable (I guess this is the potential energy of the system) and should be a subject of an reduction operation, i.e. should be put in aREDUCTION(+:....)clause.A– this one is tricky. It could be either shared and updates toA(jj,:)protected with synchronising constructs, or you could use reduction (OpenMP allows reductions over array variables in Fortran unlike in C/C++).A(ii,:)is never modified by more than one thread so it does not need special treatment.With reduction over
Ain place, each thread would get its private copy ofAand this could be a memory hog, although I doubt you would use this direct O(N2) simulation code to compute systems with very large number of particles. There is also a certain overhead associated with the reduction implementation. In this case you simply need to addAto the list of theREDUCTION(+:...)clause.With synchronising constructs you have two options. You could either use the
ATOMICconstruct or theCRITICALconstruct. AsATOMICis only applicable to scalar contexts, you would have to “unvectorise” the assignment loop and applyATOMICto each statement separately, e.g.:You may also rewrite this as a loop – do not forget to declare the loop counter
private.With
CRITICALthere is no need to unvectorise the loop:Naming critical regions is optional and a bit unnecessary in this particular case but in general it allows to separate unrelated critical regions.
Which is faster? Unrolled with
ATOMICorCRITICAL? It depends on many things. UsuallyCRITICALis way slower since it often involves function calls to the OpenMP runtime while atomic increments, at least on x86, are implemented with locked addition instructions. As they often say, YMMV.To recapitulate, a working version of your loop should be something like:
I’ve assumed that your system is 3-dimensional.
With all this said, I second Ian Bush that you need to rethink how position and acceleration matrices are laid out in memory. Proper cache usage could boost your code and would also allow for certain operations, e.g.
X(:,ii)-X(:,jj)to be vectorised, i.e. implemented using vector SIMD instructions.