This is more of a beginner’s question. Say I have the following code:
library("multicore")
library("iterators")
library("foreach")
library("doMC")
registerDoMC(16)
foreach(i = 1:M) %dopar% {
##do stuff
}
This code will then run on 16 cores, if they are available. If I understand correctly, a single Amazon EC2 instance gives me only a few cores, depending on the instance type. So if I want to run simulations on 16 cores, I need several instances, which as far as I understand means launching new R processes. But then I need to write additional code outside of R to gather the results.
So my question is: is there an R package that lets me launch EC2 instances from within R, automagically distributes the load between these instances, and gathers the results in the R session that launched them?
To be precise, the largest instance type on EC2 currently has 8 cores, so anyone, even users of R, would need multiple instances to run concurrently on more than 8 cores.
If you want to use more instances, then you have two options for deploying R: “regular” R invocations or MapReduce invocations. In the former case, you will have to set up code to launch instances, distribute tasks (e.g. the independent iterations in foreach), return results, etc. This is doable, but you’re not likely to enjoy it. In this case, you can use something like rmr or RHIPE to manage a MapReduce grid, or you can use snow and many other HPC tools to create a simple grid. Use of snow may make it easier to keep your code intact, but you will have to learn how to tie this stuff together.
In the latter case, you can build upon infrastructure that Amazon has provided, such as Elastic MapReduce (EMR), and packages that make that simpler, such as JD’s segue. I’d recommend segue as a good starting point, as others have done, as it has a gentler learning curve. The developer is also on SO, so you can easily query him when it breaks.
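For the snow-style route mentioned above, here is a minimal sketch of what "distribute tasks and return results" can look like. It uses the parallel package that ships with R and includes snow's socket-cluster functions. Two local workers stand in for EC2 instances; the function simulate_one, the value of M, and the host names in the comment are hypothetical placeholders, and launching the instances themselves is not covered.

```r
library(parallel)  # ships with R; includes the socket-cluster API derived from snow

# Assumption: two local worker processes stand in for remote EC2 instances.
# For a real grid you would instead pass a character vector of host names
# (hypothetical, e.g. c("ec2-host-1", "ec2-host-2")) that are reachable over
# SSH and have R installed.
cl <- makeCluster(2)

M <- 8

# Hypothetical placeholder for "do stuff": any independent computation.
simulate_one <- function(i) {
  i^2
}

# Distribute the M independent iterations across the workers and gather
# the results back in the master R session as a list.
results <- parLapply(cl, seq_len(M), simulate_one)

# Always shut the workers down when finished.
stopCluster(cl)
```

If you would rather keep the foreach loop as it is, the doSNOW package provides registerDoSNOW(cl), which registers a cluster like this one as the %dopar% backend in place of registerDoMC(16).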