This is in F90, but the question holds for any language with OpenMP support. A typical way of structuring data for a simulation code that needs multiple storage arrays for time integration would be (two-dimensional for now):
REAL, DIMENSION(imax,jmax,n_sub_timesteps) :: vars
Which would then be updated with something like:
DO J = 1, jmax
DO I = 1, imax
vars(I,J,2) = func(vars(:,:,1))
END DO
END DO
In my experience, OpenMP won’t actually parallelize those loops because it thinks vars is not thread-safe. But to the programmer, it obviously is.
And let’s assume that in realistic cases making vars thread-local is not an option, because copying the data into each thread’s copy would be far too expensive.
So, is there a way to gently hint (aka coerce) OpenMP into not locking vars, given that it may not be able to figure out that there are no thread dependency issues even though there really aren’t any? I know there are ways to tell it that something is not thread-safe and needs locking, but is there a way to specify the inverse without making a copy for each thread?
Looks like you mistake OpenMP for automatic parallelisation. I am not aware of any OpenMP implementation that performs data locking unless explicitly told to do so by the introduction of a CRITICAL section or an ATOMIC statement (or at the end of a parallel region with a REDUCTION clause). OpenMP compilers do not examine your code for possible data dependencies and prevent you from running in parallel; this is entirely left to you. If you want to do unprotected concurrent access, you can do it, and no OpenMP-enabled compiler would stop you from doing so. Enclosing your loop nest in an !$OMP PARALLEL DO directive would always produce a parallel region and would distribute the outer loop among the threads in the team.

On the other hand, the built-in automatic parallelisers in most compilers are very conservative and cautious, and usually would not parallelise a case like yours without explicit hints from the programmer. Those hints are usually in the form of compiler-specific directives (formatted as comments in Fortran or as pragmas in C/C++). For example, Intel Fortran supports the !DEC$ PARALLEL directive, which tells it to ignore assumed data dependencies in the loop that follows the directive.

Many compilers reuse their OpenMP implementations and runtime libraries in order to implement the automatic parallelisation feature, and hence the behaviour of the resultant executables is usually controlled with OpenMP environment variables like OMP_NUM_THREADS.

If your parallel OpenMP program runs slower than expected, there are many other possible reasons, mostly related to false sharing, cache thrashing, TLB thrashing, memory bandwidth limitations, non-local memory access on NUMA systems, using non-temporal loads/stores to shared variables, etc. So it might look like OpenMP performs automatic data locking, but it doesn’t.
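
To make the OpenMP case concrete, here is a minimal sketch of the loop from the question wrapped in an explicit PARALLEL DO. The array sizes and the body of func are assumptions made only so that the program is self-contained (the question does not say what func does); here func just doubles the value at (I,J) of the first storage level. vars stays shared, no per-thread copy is made, and no locking takes place: each iteration only reads level 1 and writes its own element of level 2.

```fortran
PROGRAM omp_update
  IMPLICIT NONE
  ! Sizes are assumptions for illustration only.
  INTEGER, PARAMETER :: imax = 64, jmax = 64, n_sub_timesteps = 2
  REAL, DIMENSION(imax,jmax,n_sub_timesteps) :: vars
  INTEGER :: I, J

  ! Fill the first storage level with known values.
  DO J = 1, jmax
    DO I = 1, imax
      vars(I,J,1) = REAL(I + J)
    END DO
  END DO

  ! vars is shared by default. The outer loop is split among the threads;
  ! J is private automatically, I is listed explicitly for clarity.
  !$OMP PARALLEL DO PRIVATE(I)
  DO J = 1, jmax
    DO I = 1, imax
      vars(I,J,2) = func(vars(:,:,1), I, J)
    END DO
  END DO
  !$OMP END PARALLEL DO

  ! Sanity check: every element of level 2 must be twice level 1.
  IF (ANY(vars(:,:,2) /= 2.0 * vars(:,:,1))) ERROR STOP 'mismatch'
  PRINT *, 'vars(3,4,2) =', vars(3,4,2)

CONTAINS

  ! Hypothetical stand-in for the unspecified func of the question:
  ! reads the whole first level but uses only the element at (ii,jj).
  PURE REAL FUNCTION func(level, ii, jj)
    REAL, INTENT(IN) :: level(:,:)
    INTEGER, INTENT(IN) :: ii, jj
    func = 2.0 * level(ii,jj)
  END FUNCTION func

END PROGRAM omp_update
```

The directive only takes effect when OpenMP is enabled at compile time (for example with -fopenmp in gfortran); otherwise it is treated as an ordinary comment and the loop runs serially with the same result.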