I have a function that I eventually want to parallelize.
Currently, I call things in a for loop.
double temp = 0;
int y = 123; // is a value set by other code
for(vector<double>::iterator i=data.begin(); i != data.end(); i++){
temp += doStuff(i, y);
}
doStuff needs to know how far down the list it is, so I use i - data.begin() to calculate the index.
Next, I’d like to use the std::for_each function instead. My challenge is that I need to pass the address of my iterator and the value of y. I’ve seen examples of using bind2nd to pass a parameter to the function, but how can I pass the address of the iterator as the first parameter?
The BOOST_FOREACH macro also looks like a possibility; however, I do not know if it will parallelize auto-magically like the STL version does.
Thoughts, ideas, suggestions?
If you want real parallelization here, use GCC with tree vectorization turned on (-O3) and SIMD enabled (e.g. -march=native to get SSE support). If the operation (doStuff) is non-trivial, you could opt to do it ahead of time (std::transform or std::for_each) and accumulate afterwards (std::accumulate), since the accumulation is exactly the kind of loop that can be vectorized with SSE instructions.
Note that although this will not actually run on multiple threads, the performance gain can be substantial, since SSE instructions handle several floating-point operations in parallel on a single core.
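The two-step approach above (compute first, accumulate afterwards) can be sketched as follows. This is only an illustration: the doStuff shown here is a hypothetical stand-in that takes the element, its index, and y, and sumAheadOfTime is a made-up name. As far as I know, GCC only vectorizes a floating-point sum when it is allowed to reorder the additions (e.g. with -ffast-math or -fassociative-math), so it is worth checking the vectorizer report (-fopt-info-vec) rather than assuming.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the question's doStuff: it gets the element,
// its position in the vector, and the extra parameter y.
double doStuff(double value, std::size_t index, int y) {
    return value * static_cast<double>(index) + y;
}

// Step 1: evaluate doStuff for every element ahead of time.
// Step 2: sum the stored results in one simple pass, which is the part
//         the compiler has a chance to vectorize.
double sumAheadOfTime(const std::vector<double>& data, int y) {
    std::vector<double> results(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        results[i] = doStuff(data[i], i, y);
    }
    return std::accumulate(results.begin(), results.end(), 0.0);
}
```

The index loop in step 1 replaces the i - data.begin() arithmetic from the question; the extra vector costs memory but keeps the summation loop trivial.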
If you want true parallelism, use one of the following:
GNU Parallel Mode: compile with g++ -fopenmp -D_GLIBCXX_PARALLEL
OpenMP directly: compile with g++ -fopenmp
This will result in the loop being parallelized into as many threads (the OMP team) as there are (logical) CPU cores on the actual machine, and the result ‘magically’ combined and synchronized.
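For the OpenMP route, a minimal sketch of such a loop might look like this. It assumes a hypothetical doStuff that takes the element, its index, and y (not the iterator-based signature from the question), and sumParallel is an illustrative name. Without -fopenmp the pragma is simply ignored and the loop runs serially.

```cpp
#include <vector>

// Hypothetical stand-in for the question's doStuff.
double doStuff(double value, long index, int y) {
    return value * static_cast<double>(index) + y;
}

double sumParallel(const std::vector<double>& data, int y) {
    double temp = 0.0;
    // A signed loop counter keeps this valid for older OpenMP versions too.
    const long n = static_cast<long>(data.size());

    // Each thread sums into its own private copy of temp;
    // OpenMP adds the private copies together when the loop ends.
    #pragma omp parallel for reduction(+:temp)
    for (long i = 0; i < n; ++i) {
        temp += doStuff(data[i], i, y);
    }
    return temp;
}
```

Because the loop is indexed, doStuff gets its position directly and no shared iterator state is needed. Note that the order of the floating-point additions is not fixed across threads, so results can differ in the last bits from the serial version.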
Final remarks:
You can simulate a binary function for for_each by using a stateful function object. This is not exactly recommended practice. It will also appear to be very inefficient (and when compiling without optimization, it is). This is due to the fact that function objects are passed by value throughout the STL. However, it is reasonable to expect a compiler to completely optimize that potential overhead away, especially for simple cases like the following:
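The original example is not reproduced here. As a rough sketch of what a stateful function object for this case could look like, assuming a hypothetical doStuff that takes the element, its index, and y (the names IndexedSum and sumWithForEach are made up for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's doStuff.
double doStuff(double value, std::size_t index, int y) {
    return value * static_cast<double>(index) + y;
}

// Function object that carries y, the running index, and the running total.
struct IndexedSum {
    int y;
    std::size_t index;
    double total;

    explicit IndexedSum(int y_) : y(y_), index(0), total(0.0) {}

    void operator()(double value) {
        total += doStuff(value, index, y);
        ++index;  // relies on elements being visited in order
    }
};

double sumWithForEach(const std::vector<double>& data, int y) {
    // for_each takes the functor by value and returns the copy it used,
    // so the state has to be read from the return value.
    IndexedSum result = std::for_each(data.begin(), data.end(), IndexedSum(y));
    return result.total;
}
```

The index counter only works because the sequential std::for_each visits the elements in order; a parallel for_each would race on index and total, which is one reason this is not recommended as a path to parallelization.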