As we all know R isn’t the most efficient platform to run large analyses.

Question

Asked: June 3, 2026

As we all know R isn’t the most efficient platform to run large analyses.
If I had a large data frame containing three parameters:

GROUP   X  Y
A       1  2
A       2  2
A       2  3
...
B       1  1
B       2  3
B       1  4
...
millions of rows

and I wanted to run a computation on each group (e.g. compute Pearson’s r on X,Y) and store the results in a new data frame, I can do it like this:

df <- loadDataFrameFrom( someFile )
results <- data.frame()
for ( g in unique( df$GROUP ) ) {
    gdf <- subset( df, GROUP == g )
    partialRes <- slowStuff( gdf$X, gdf$Y )
    results <- rbind( results, data.frame( GROUP = g, RES = partialRes ) )
}
# results contains all the results here.
useResults( results )

The obvious problem is that this is VERY slow, even on a powerful multi-core machine.

My question is: is it possible to parallelise this computation, having for example a separate thread for each group or a block of groups?
Is there a clean R pattern to solve this simple divide et impera problem?

Thanks,
Mulone

1 Answer

Editorial Team · Answer 1 · 2026-06-03T13:53:20+00:00

First off, R is not necessarily slow. As with any language, its speed depends largely on using it correctly. A few changes can speed up your code without altering it much: preallocate your results data.frame before you begin; use a list, matrix, or vector instead of a data.frame; or switch to data.table. The list goes on, but The R Inferno is an excellent place to start.
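
To make the first two suggestions concrete, here is a minimal sketch using only base R. The toy data frame is made up for illustration, and cor() stands in for the question's slowStuff(). It splits the data once and collects the results in a vector that is sized up front, rather than growing a data frame with rbind() inside a loop.

```r
# Toy data standing in for the question's data frame (made up for illustration).
set.seed(42)
df <- data.frame(
  GROUP = rep(c("A", "B", "C"), each = 100),
  X     = rnorm(300),
  Y     = rnorm(300)
)

# Split the rows once instead of calling subset() for every group.
pieces <- split(df[, c("X", "Y")], df$GROUP)

# vapply() returns a numeric vector with one slot per group;
# cor() computes Pearson's r by default.
res <- vapply(pieces, function(p) cor(p$X, p$Y), numeric(1))

# Build the results data frame in one step rather than rbind()-ing in a loop.
results <- data.frame(GROUP = names(res), RES = unname(res))
```

Whether this is fast enough on its own depends on how expensive the per-group function is; it mainly removes the overhead of repeated subsetting and copying.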

Also, take a look here. It provides a good summary on how to take advantage of multi-core machines.

The “clean R pattern” was succinctly provided by Hadley Wickham in his plyr package, specifically ddply:

library(plyr)
library(doMC)
registerDoMC()
ddply(df, .(GROUP), your.function, .parallel=TRUE)

However, ddply is not necessarily fast. Alternatively, you can use mclapply from the parallel package:

library(parallel)
mclapply(unique(df$GROUP), function(x, df)  ...)
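
A fuller sketch of the mclapply approach might look like the following. The toy data and the choice of two cores are assumptions, and cor() again stands in for slowStuff(). Note that mclapply relies on forking, so on Windows it only runs with mc.cores = 1.

```r
library(parallel)

# Toy data standing in for the question's data frame (made up for illustration).
set.seed(1)
df <- data.frame(
  GROUP = rep(c("A", "B", "C", "D"), each = 50),
  X     = rnorm(200),
  Y     = rnorm(200)
)

# One data frame per group, named by group.
pieces <- split(df, df$GROUP)

# Forking is unavailable on Windows, so fall back to a single core there.
# Two cores is an arbitrary choice for this sketch.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each group is processed in a forked worker; results come back as a named list.
res <- mclapply(pieces, function(p) cor(p$X, p$Y), mc.cores = n_cores)

results <- data.frame(GROUP = names(res), RES = unlist(res, use.names = FALSE))
```

With millions of rows and many small groups, it may pay to hand each worker a block of groups rather than a single one, so the per-task overhead does not dominate.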

Or finally, you can use the foreach package:

library(foreach)
# A parallel backend (e.g. registerDoMC() as above) must be registered first.
foreach(g = unique(df$GROUP), ...) %dopar% {
   your.analysis
}
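
For completeness, a self-contained foreach sketch could look like this. It assumes the doParallel package as the backend (doMC, used above, should work similarly on Unix-like systems). The toy data, the two-worker cluster, and cor() standing in for slowStuff() are likewise assumptions.

```r
library(foreach)
library(doParallel)

# Toy data standing in for the question's data frame (made up for illustration).
set.seed(7)
df <- data.frame(
  GROUP = rep(c("A", "B", "C", "D"), each = 50),
  X     = rnorm(200),
  Y     = rnorm(200)
)

# One data frame per group, named by group.
pieces <- split(df, df$GROUP)

# Start two worker processes (an arbitrary number for this sketch)
# and register them as the backend for %dopar%.
cl <- makeCluster(2)
registerDoParallel(cl)

# Iterate over the pieces and their names together; each task returns a
# one-row data frame, and .combine = rbind stacks them in order.
results <- foreach(p = pieces, g = names(pieces), .combine = rbind) %dopar% {
  data.frame(GROUP = g, RES = cor(p$X, p$Y))
}

stopCluster(cl)
```

Iterating over the split pieces, rather than over group names with a subset() inside the loop body, means each worker receives only the rows it needs.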
