I have a loop that loads decent-sized files, around 5 MB each, and then runs some computations on them. I need to load 500-1000 of them. Seems like an easy job for foreach.
I am doing this but the performance of doSNOW seems to be horrendous.
I found this post and this fellow seems to have had the same issues:
http://statsadventure.blogspot.com/2012/06/performance-with-foreach-dosnow-and.html
So a couple of questions.
- Is there an alternative to doSNOW? I realize there is doMC, but I am running Windows.
- Is doMC on Linux that much faster than doSNOW?
- Is there any way to output to the screen from a worker, so I can at least get some idea of how my job is progressing?
Thank you in advance!
Multiple worker processes trying to access different files on the hard disk at the same time can lead to very bad performance.
However, load-balanced parallelization may still bring an improvement if enough time goes into the calculations: the nodes will drift out of synchronization, so hard disk requests will come in one after the other instead of all at the same time.
Here’s a simple example of snow::clusterApply vs. the load-balanced snow::clusterApplyLB. I use snow instead of parallel as it provides timing and plotting.
In the timing plots, the black dots mark the start of a new function call. If the function starts with loading the file, all nodes will always try to access the hard disk at the same time with clusterApply, because the cluster waits for all nodes to return results before dealing out the next round of tasks. With clusterApplyLB, the next task is handed out as soon as a node returns its result. Even if the tasks take basically the same time, they will get out of synchronization rather fast, and the file loading will not happen at exactly the same time.
(I don’t know whether this is the actual problem, though.)
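A minimal sketch of such a comparison could look like the following. It assumes a local socket cluster with 4 workers, and slow_task is a made-up stand-in that just sleeps for slightly varying times instead of loading a file and computing on it, so the numbers will not match a real workload:

```r
library(snow)

# Hypothetical stand-in for "load a file, then compute": sleeps for a
# slightly different time per task so the workers can drift apart.
slow_task <- function(i) {
  Sys.sleep(0.05 + 0.01 * (i %% 3))
  i^2
}

cl <- makeCluster(4, type = "SOCK")  # assumption: 4 local socket workers
tasks <- 1:20

# clusterApply: tasks are dealt out in rounds, one per node; the next
# round starts only after every node has returned its result.
t_plain <- snow.time(res_plain <- clusterApply(cl, tasks, slow_task))

# clusterApplyLB: a node gets its next task as soon as it returns a result.
t_lb <- snow.time(res_lb <- clusterApplyLB(cl, tasks, slow_task))

stopCluster(cl)

# Timing summaries and per-node usage plots for both variants.
print(t_plain)
print(t_lb)
plot(t_plain)
plot(t_lb)
```

Both calls return the same results in the same order; only the scheduling differs, which is what the two usage plots make visible.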