I would like to iterate over all elements of a std::list in parallel using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a Random Access Iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0, as I don’t have full control over which compilers are available to me.
If my container were a vector, I might use:
#pragma omp parallel for
for (auto it = v.begin(); it != v.end(); ++it) {
    it->process();
}
I understand that I could copy the list into a vector, do the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.
If you decide to use OpenMP 3.0, you can use the task feature. This will execute the loop in one thread, but delegate the processing of the elements to the other threads.
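The task approach described above can be sketched roughly as follows. This is only a minimal sketch: the Item type, its process() method, and the function names are hypothetical stand-ins for whatever the list actually holds, and it needs a compiler with OpenMP 3.0 support (without OpenMP the pragmas are ignored and the loop simply runs serially).

```cpp
#include <cassert>
#include <list>

// Hypothetical element type; process() stands in for the real work.
struct Item {
    int value;
    void process() { value *= 2; }
};

// One thread walks the list and spawns a task per element;
// the other threads of the team pick up and execute those tasks.
void process_all_with_tasks(std::list<Item>& items) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (std::list<Item>::iterator it = items.begin(); it != items.end(); ++it) {
                // Each task gets its own copy of the iterator.
                #pragma omp task firstprivate(it)
                it->process();
            }
        } // implicit barrier: all tasks are finished once the threads pass this point
    }
}

// Usage sketch: every value should come back doubled.
void demo_tasks() {
    std::list<Item> items;
    for (int i = 1; i <= 8; ++i) items.push_back(Item{i});
    process_all_with_tasks(items);
    int expected = 2;
    for (const Item& item : items) {
        assert(item.value == expected);
        expected += 2;
    }
}
```

Creating one task per element only pays off if process() does a reasonable amount of work; for very cheap elements the task overhead will probably dominate.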
Without OpenMP 3.0, the easiest way would be to write pointers to all elements of the list (or iterators) into a vector and iterate over that vector. This way you don’t have to copy anything back, and you avoid the overhead of copying the elements themselves, so it shouldn’t have too much overhead.
If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range into roughly equal contiguous parts before iterating and then iterate over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. Each thread determines its own range inside the parallel region, waits at a barrier, and then processes its part.
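A hand-written version of the contiguous split could look roughly like this. It is a sketch under assumptions, not a tuned implementation: Item, process() and the function name are hypothetical, and the `_OPENMP` guard is only there so the code still builds (serially) when OpenMP is switched off. It uses nothing beyond OpenMP 2.0 (parallel, barrier, and the thread-number queries).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical element type; process() stands in for the real work.
struct Item {
    int value;
    void process() { value += 1; }
};

void process_in_contiguous_chunks(std::list<Item>& items) {
    const std::size_t size = items.size();
    #pragma omp parallel
    {
#ifdef _OPENMP
        const int thread_count = omp_get_num_threads();
        const int thread_num = omp_get_thread_num();
#else
        const int thread_count = 1; // serial fallback
        const int thread_num = 0;
#endif
        assert(thread_count > 0 && thread_num >= 0 && thread_num < thread_count);
        const std::size_t chunk = size / static_cast<std::size_t>(thread_count);

        // Every thread walks from the front of the list to the start of its own chunk.
        std::list<Item>::iterator first = items.begin();
        std::advance(first, static_cast<std::ptrdiff_t>(
                                static_cast<std::size_t>(thread_num) * chunk));
        std::list<Item>::iterator last = first;
        if (thread_num == thread_count - 1)
            last = items.end(); // the last thread also takes the remainder
        else
            std::advance(last, static_cast<std::ptrdiff_t>(chunk));

        // Wait until all threads have found their range before anyone mutates elements.
        #pragma omp barrier
        for (std::list<Item>::iterator it = first; it != last; ++it)
            it->process();
    }
}
```

Note that the last thread receives up to thread_count - 1 extra elements, which is usually negligible for long lists.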
The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it, since threads would iterate over parts of the sequence that are already being mutated. This approach iterates 3*n times over the sequence (where n is the number of threads), so scaling might be less than optimal for a high number of threads.
To reduce the overhead, you could move the generation of the ranges outside of the #pragma omp parallel; however, you then need to know how many threads will form the parallel section. So you’d probably have to set num_threads manually, or use omp_get_max_threads() and handle the case where fewer threads are created than omp_get_max_threads() (which is only an upper bound). That case could be handled by assigning each thread several chunks (using #pragma omp for should do that).
This takes only three iterations over the list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).
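The variant with the ranges generated outside the parallel section might be sketched like this. Again, Item, process(), chunk_boundaries and process_with_precomputed_chunks are hypothetical names chosen for illustration, and the `_OPENMP` guard is an added assumption so the code also builds without OpenMP. The loop index is a plain signed int, which is what OpenMP 2.0 expects for a parallel for loop.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical element type; process() stands in for the real work.
struct Item {
    int value;
    void process() { value += 1; }
};

typedef std::list<Item>::iterator ItemIter;

// Walks the list once and returns parts + 1 iterators; chunk c is [bounds[c], bounds[c + 1]).
std::vector<ItemIter> chunk_boundaries(std::list<Item>& items, int parts) {
    assert(parts > 0);
    const std::size_t count = static_cast<std::size_t>(parts);
    const std::size_t base = items.size() / count;
    const std::size_t extra = items.size() % count; // the first `extra` chunks get one more element
    std::vector<ItemIter> bounds;
    bounds.reserve(count + 1);
    ItemIter it = items.begin();
    bounds.push_back(it);
    for (std::size_t p = 0; p < count; ++p) {
        std::advance(it, static_cast<std::ptrdiff_t>(base + (p < extra ? 1 : 0)));
        bounds.push_back(it);
    }
    return bounds;
}

void process_with_precomputed_chunks(std::list<Item>& items) {
#ifdef _OPENMP
    const int parts = omp_get_max_threads(); // upper bound on the team size
#else
    const int parts = 1;                     // serial fallback
#endif
    const std::vector<ItemIter> bounds = chunk_boundaries(items, parts);

    // If fewer threads are created than `parts`, omp for hands some threads several chunks.
    #pragma omp parallel for
    for (int c = 0; c < parts; ++c) {
        for (ItemIter it = bounds[c]; it != bounds[c + 1]; ++it)
            it->process();
    }
}
```

Because the boundaries are computed once by a single thread, the threads no longer each walk the list from the front, and no explicit barrier is needed before processing starts.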