Question
I’ve noticed that foreach/%dopar% sets up the cluster sequentially, not in parallel, before executing tasks in parallel. If each worker requires a dataset and it takes N seconds to transfer the dataset to one worker, then foreach/%dopar% spends (number of workers) * N seconds on setup. This can be significant with many workers or a large N (large datasets to transfer).
My question is whether this is by design, or whether there is some parameter/setting that I’m missing in foreach or perhaps in cluster generation.
Setup
- R 2.15.2
- latest versions of foreach/parallel/doParallel as of today (1/7/2013)
- Windows 7 x64
Example
library( foreach )
library( parallel )
library( doParallel )
# lots of data
data = rnorm( 100000000 )
# make cluster/register - creates 6 nodes fairly quickly
cluster = makePSOCKcluster( 6 , outfile = "" )
registerDoParallel( cluster )
# fire up Task Manager. Observe that each node receives data sequentially.
# When last node gets data, then all nodes process at the same time
results = foreach( i = 1 : 500 ) %dopar%
{
print( data[ i ] )
return( data[ i ] )
}
Answer
Thanks to Rich at Revolution Computing for helping with this one.
clusterCall uses a for loop to send data to each worker. Because R is not multi-threaded, the for loop must be sequential. There are a few solutions (which would require someone to code them up). R could call out to C/C++ to thread the worker setup. Or the workers could pull the data from a file on disk. Or the workers could listen on the same socket and the master could write to the socket just once and have the data broadcast to all workers.
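Of the options above, having the workers pull the data from a file on disk can be tried with the parallel package alone. The following is a minimal sketch, not a tested benchmark: it assumes all workers run on the same machine (so a temporary file is visible to each of them), uses a much smaller dataset and only 2 workers, and the name data_path is chosen here purely for illustration.

```r
library(parallel)

# Smaller stand-in for the large dataset in the question.
data <- rnorm(1e6)

# Write the data once to a file every worker can read.
# Assumption: workers are on the same machine as the master; on a
# multi-machine cluster this would have to be a shared path instead.
data_path <- tempfile(fileext = ".rds")
saveRDS(data, data_path)

cluster <- makePSOCKcluster(2)

# Only the short path string is sent to each worker.
clusterExport(cluster, "data_path")

# Each worker loads the file itself. The requests are tiny, so the
# workers end up reading the file concurrently.
invisible(clusterEvalQ(cluster, {
  data <- readRDS(data_path)
  NULL  # return NULL so the data is not sent back to the master
}))

# Tasks now use the worker-local copy of `data`.
results <- parLapply(cluster, 1:10, function(i) data[i])

stopCluster(cluster)
unlink(data_path)
```

With foreach/%dopar%, the same idea should apply, but foreach would presumably still export a master-side variable named data that the loop body references; passing .noexport = "data" to foreach, or not keeping data in the master session, is one way this might be avoided.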