I have code that consumes a large number (millions currently, eventually billions) of relatively short (5-100 elements) arrays of random numbers and does some not-very-strenuous math with them. Random numbers being, well, random, ideally I’d like to generate them on multiple cores, since random number generation is > 50% of my runtime in profiling. However, I’m having difficulty distributing a large number of small tasks in a way that’s not slower than the single-threaded approach.
My code currently looks something like this:
for (int iter = 0; iter < 1000000; iter++) {
    for (RealVector d : data) {
        while (!converged) {
            double[] shortVec = new double[5];
            for (int i = 0; i < 5; i++) shortVec[i] = rng.nextGaussian();
            double[] longerVec = new double[50];
            for (int i = 0; i < 50; i++) longerVec[i] = rng.nextGaussian();
            /* Do some relatively fast math, which eventually sets converged */
        }
    }
}
Approaches I’ve taken that have not worked are:
- 1+ threads populating an ArrayBlockingQueue, and my main loop consuming and populating the array (the boxing/unboxing was killer here)
- Generating the vectors with a Callable (yielding a future) while doing the non-dependent parts of the math (it appears the overhead of the indirection outweighed whatever parallelism gains I got)
- Using two ArrayBlockingQueues, each populated by its own thread, one for the short and one for the long arrays (still roughly twice as slow as the direct single-threaded case).
I’m not looking for “solutions” to my particular problem so much as how to handle the general case of generating large streams of small, independent primitives in parallel and consuming them from a single thread.
Passing the values in batches is more efficient than using a Queue because each hand-off is a whole double[], meaning the background thread can generate more data before having to pass it off.
Random is thread safe, and nextGaussian() is synchronized, so threads sharing one instance contend for it. This means each thread needs its own Random to generate numbers concurrently.
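As a minimal sketch of the one-Random-per-thread point, the following fills one row of Gaussians per worker thread. The class name PerThreadRandomDemo, the generate method, and the baseSeed + id seeding scheme are all hypothetical and chosen only for illustration; they are not from the question.

```java
import java.util.Random;

// Hypothetical demo: every worker thread owns its Random, so no two
// threads ever contend on the same generator.
class PerThreadRandomDemo {

    /** Fills one row per worker thread. The seeding scheme is illustrative only. */
    static double[][] generate(int threads, int perThread, long baseSeed) {
        double[][] out = new double[threads][perThread];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                Random rng = new Random(baseSeed + id); // private to this thread
                double[] row = out[id];                 // each thread writes only its own row
                for (int i = 0; i < row.length; i++) {
                    row[i] = rng.nextGaussian();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join() also makes the worker's writes visible to the caller
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = generate(4, 1_000_000, 42L);
        System.out.println(data.length + " rows of " + data[0].length + " Gaussians");
    }
}
```

If reproducible seeding is not needed, ThreadLocalRandom.current().nextGaussian() gives each thread its own generator without any explicit bookkeeping.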
I would use an Exchanger<double[][]> to populate values in the background and pass them efficiently (without much GC overhead).
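A rough sketch of how that could look, assuming one producer thread and two preallocated buffers that are swapped back and forth so nothing is allocated in the steady state. The class name ExchangerBatchDemo, the BATCH and VEC_LEN sizes, and the sumBatches method (which just sums the values in place of the real math) are hypothetical.

```java
import java.util.Random;
import java.util.concurrent.Exchanger;

// Hypothetical demo: a producer fills a double[][] batch while the consumer
// works on the previous one; the two buffers are swapped via an Exchanger.
class ExchangerBatchDemo {
    static final int BATCH = 1024; // vectors per hand-off (illustrative)
    static final int VEC_LEN = 5;  // length of each short vector

    /** Consumes `batches` batches and returns the sum of all values (stand-in for real math). */
    static double sumBatches(int batches, long seed) {
        Exchanger<double[][]> exchanger = new Exchanger<>();

        Thread producer = new Thread(() -> {
            Random rng = new Random(seed); // owned by the producer thread only
            double[][] buf = new double[BATCH][VEC_LEN];
            try {
                for (int b = 0; b < batches; b++) {
                    for (double[] vec : buf) {
                        for (int i = 0; i < vec.length; i++) vec[i] = rng.nextGaussian();
                    }
                    buf = exchanger.exchange(buf); // hand over the full buffer, get a spent one back
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();

        double[][] buf = new double[BATCH][VEC_LEN]; // the consumer's initial "spent" buffer
        double sum = 0;
        try {
            for (int b = 0; b < batches; b++) {
                buf = exchanger.exchange(buf); // give back the spent buffer, receive a full one
                for (double[] vec : buf) {
                    for (double x : vec) sum += x;
                }
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("sum = " + sumBatches(100, 42L));
    }
}
```

The batch size controls the trade-off: larger batches mean fewer hand-offs per value, at the cost of more memory held in the two buffers. The same pattern extends to several producers by giving each one its own Exchanger and Random.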