I’m playing around with threads in C++, in particular using them to parallelize a map operation.
Here’s the code:
#include <thread>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <math.h>
#include <stdio.h>
double multByTwo(double x){
return x*2;
}
double doJunk(double x){
return cos(pow(sin(x*2),3));
}
template <typename T>
void map(T* data, int n, T (*ptr)(T)){
for (int i=0; i<n; i++)
data[i] = (*ptr)(data[i]);
}
template <typename T>
void parallelMap(T* data, int n, T (*ptr)(T)){
int NUMCORES = 3;
std::vector<std::thread> threads;
for (int i=0; i<NUMCORES; i++)
threads.push_back(std::thread(&map<T>, data + i*n/NUMCORES, n/NUMCORES, ptr));
for (std::thread& t : threads)
t.join();
}
int main()
{
int n = 1000000000;
double* nums = new double[n];
for (int i=0; i<n; i++)
nums[i] = i;
std::cout<<"go"<<std::endl;
clock_t c1 = clock();
struct timespec start, finish;
double elapsed;
clock_gettime(CLOCK_MONOTONIC, &start);
// also try with &doJunk
//parallelMap(nums, n, &multByTwo);
map(nums, n, &doJunk);
std::cout << nums[342] << std::endl;
clock_gettime(CLOCK_MONOTONIC, &finish);
printf("CPU elapsed time is %f seconds\n", double(clock()-c1)/CLOCKS_PER_SEC);
elapsed = (finish.tv_sec - start.tv_sec);
elapsed += (finish.tv_nsec - start.tv_nsec) / 1000000000.0;
printf("Actual elapsed time is %f seconds\n", elapsed);
}
With multByTwo the parallel version is actually slightly slower (1.01 seconds versus .95 real time), and with doJunk its faster (51 versus 136 real time). This implies to me that
- the parallelization is working, and
- there is a REALLY large overhead with declaring
new threads. Any thoughts as to why the overhead is so large, and how I can avoid it?
Just a guess: what you’re likely seeing is that the
multByTwocode is so fast that you’re achieving memory saturation. The code will never run any faster no matter how much processor power you throw at it, because it’s already going as fast as it can get the bits to and from RAM.