I think I am using plyr incorrectly. Could someone please tell me if this is ‘efficient’ plyr code?
require(plyr)
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume))
A little context: I have a few large aggregation problems and I have noted that they were each taking some time. In trying to solve the issues, I became interested in the performance of various aggregation procedures in R.
I tested a few aggregation methods – and found myself waiting around all day.
When I finally got results back, I discovered a huge gap between the plyr method and the others – which makes me think that I’ve done something dead wrong.
I ran the following code (I thought I’d check out the new dataframe package while I was at it):
require(plyr)
require(data.table)
require(dataframe)
require(rbenchmark)
require(xts)
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume))
t.apply <- function(dd) unlist(tapply(dd$volume, dd$price, sum))
t.apply.x <- function(dd) unlist(tapply(dd[,2], dd[,1], sum))
l.apply <- function(dd) unlist(lapply(split(dd$volume, dd$price), sum))
l.apply.x <- function(dd) unlist(lapply(split(dd[,2], dd[,1]), sum))
b.y <- function(dd) unlist(by(dd$volume, dd$price, sum))
b.y.x <- function(dd) unlist(by(dd[,2], dd[,1], sum))
agg <- function(dd) aggregate(dd$volume, list(dd$price), sum)
agg.x <- function(dd) aggregate(dd[,2], list(dd[,1]), sum)
dtd <- function(dd) dd[, sum(volume), by=(price)]
obs <- c(5e1, 5e2, 5e3, 5e4, 5e5, 5e6, 5e6, 5e7, 5e8)
timS <- timeBasedSeq('20110101 083000/20120101 083000')
bmkRL <- list(NULL)
for (i in 1:5){
  tt <- timS[1:obs[i]]
  for (j in 1:8){
    pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
    px <- sample(pxl, length(tt), replace=TRUE)
    vol <- rnorm(length(tt), 1000, 100)
    d.df <- base::data.frame(time=tt, price=px, volume=vol)
    d.dfp <- dataframe::data.frame(time=tt, price=px, volume=vol)
    d.matrix <- as.matrix(d.df[,-1])
    d.dt <- data.table(d.df)
    listLabel <- paste('i=',i, 'j=',j)
    bmkRL[[listLabel]] <- benchmark(plyr(d.df), plyr(d.dfp), t.apply(d.df),
                                    t.apply(d.dfp), t.apply.x(d.matrix),
                                    l.apply(d.df), l.apply(d.dfp), l.apply.x(d.matrix),
                                    b.y(d.df), b.y(d.dfp), b.y.x(d.matrix), agg(d.df),
                                    agg(d.dfp), agg.x(d.matrix), dtd(d.dt),
                                    columns =c('test', 'elapsed', 'relative'),
                                    replications = 10,
                                    order = 'elapsed')
  }
}
The test was supposed to check up to 5e8, but it took too long, mostly due to plyr. The final table, for 5e5 observations, shows the problem:
$`i= 5 j= 8`
                  test  elapsed    relative
15           dtd(d.dt)    4.156    1.000000
 6       l.apply(d.df)   15.687    3.774543
 7      l.apply(d.dfp)   16.066    3.865736
 8 l.apply.x(d.matrix)   16.659    4.008422
 4      t.apply(d.dfp)   21.387    5.146054
 3       t.apply(d.df)   21.488    5.170356
 5 t.apply.x(d.matrix)   22.014    5.296920
13          agg(d.dfp)   32.254    7.760828
14     agg.x(d.matrix)   32.435    7.804379
12           agg(d.df)   32.593    7.842397
10          b.y(d.dfp)   98.006   23.581809
11     b.y.x(d.matrix)   98.134   23.612608
 9           b.y(d.df)   98.337   23.661453
 1          plyr(d.df) 9384.135 2257.972810
 2         plyr(d.dfp) 9384.448 2258.048123
Is this right? Why is plyr 2250x slower than data.table? And why didn’t using the new data frame package make a difference?
The session info is:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xts_0.8-6 zoo_1.7-7 rbenchmark_0.3 dataframe_2.5 data.table_1.8.1 plyr_1.7.1
loaded via a namespace (and not attached):
[1] grid_2.15.1 lattice_0.20-6 tools_2.15.1
Why is it so slow? A little research located a mailing-list posting from Aug. 2011 in which @hadley, the package author, states
As for whether this is efficient plyr code, I didn’t know either. After a bunch of parameter testing and benchmarking, it looks like we can do better.
The summarize() in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it isn’t helping with anything that isn’t already simple, and the .data and .(price) arguments can be made more explicit. The result is:
summarize may seem nice, but it just isn’t quicker than a simple function call. It makes sense; just look at our little function versus the code for summarize. Running your benchmarks with the revised formula yields a noticeable gain. Don’t take that to mean you’ve used plyr incorrectly. You haven’t; it just isn’t efficient, and nothing you can do with it will make it as fast as other options.
In my opinion, the optimized function still stinks: it isn’t clear and must be mentally parsed, and it is still ridiculously slow compared with data.table (even with a 60% gain).
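To make the idea of swapping summarise for a plain per-group function concrete, here is a minimal sketch. It is not necessarily the exact call from the original answer. The toy data, the seed, and the names ref, with_summarise and with_function are assumptions made up for illustration, and the plyr part only runs if plyr is installed.

```r
# Toy data: column names mirror the question, the values are invented.
set.seed(1)
dd <- data.frame(price  = sample(c(0.9, 1.0, 1.1), 30, replace = TRUE),
                 volume = rnorm(30, 1000, 100))

# Base-R reference: total volume per price level, sorted by price.
ref <- tapply(dd$volume, dd$price, sum)

if (requireNamespace("plyr", quietly = TRUE)) {
  library(plyr)

  # Form used in the question: summarise as the per-group helper.
  with_summarise <- ddply(dd, .(price), summarise, ss = sum(volume))

  # Hypothetical revision: explicit arguments and a plain function
  # that returns one number per group.
  with_function <- ddply(.data = dd, .variables = "price",
                         .fun = function(x) sum(x$volume))

  print(with_summarise)
  print(with_function)
}
```

Both calls should give the same per-price totals. When the function returns a bare number, plyr assigns a default column name (V1 in the versions I have seen) instead of ss, which is part of why the revised call is harder to read.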
In the same thread mentioned above, regarding the slowness of plyr, a plyr2 project is mentioned. Since the time of the original answer to the question, the plyr author has released dplyr as the successor of plyr. While both plyr and dplyr are billed as data manipulation tools and your primary stated interest is aggregation, you may still be interested in benchmark results for the new package for comparison, as it has a reworked backend to improve performance.
The dataframe package has been removed from CRAN and subsequently from the tests, along with the matrix function versions.
Here are the i=5, j=8 benchmark results:
No doubt the optimizing helped a bit. Take a look at the d.df functions; they just can’t compete.
For a little perspective on the slowness of the data.frame structure, here are micro-benchmarks of the aggregation times of data.table and dplyr using a larger test dataset (i=8, j=8).
The data.frame is still left in the dust. Not only that, but here’s the elapsed system.time to populate the data structures with the test data:
Both creation and aggregation of the data.frame are slower than those of the data.table.
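For anyone who wants to reproduce the creation-time comparison, a rough sketch of how it could be measured with system.time follows. The row count n and the variable names are assumptions chosen for illustration, not the sizes used in the benchmarks above, and the data.table part only runs if that package is installed. No timings are quoted here because they depend on the machine and package versions.

```r
# Hypothetical size: far smaller than the benchmarks discussed above.
n   <- 1e5
px  <- sample(seq(0.9, 1.1, by = 0.01), n, replace = TRUE)
vol <- rnorm(n, 1000, 100)

# Time how long it takes to populate a base data.frame.
t_df <- system.time(d.df <- data.frame(price = px, volume = vol))
print(t_df["elapsed"])

# Same data in a data.table, if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  t_dt <- system.time(d.dt <- data.table::data.table(price = px, volume = vol))
  print(t_dt["elapsed"])
}
```

At this size a single run is dominated by noise, so repeat the measurement or raise n before drawing conclusions.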
Working with the data.frame in R is slower than some alternatives, but as the benchmarks show, the built-in R functions blow plyr out of the water. Even managing the data.frame as dplyr does, which improves upon the built-ins, doesn’t give optimal speed, whereas data.table is faster in both creation and aggregation, and it does so while working with and upon data.frames.
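For reference, the same sum-of-volume-by-price aggregation can be written in each of the three styles compared above. This is a small sketch on invented data, assuming the names base_sum, dt_sum and dplyr_sum. The data.table and dplyr parts run only if those packages are installed, and the sketch checks equivalence rather than speed.

```r
# Toy data: column names mirror the question, the values are invented.
set.seed(42)
dd <- data.frame(price  = sample(c(0.9, 1.0, 1.1), 1000, replace = TRUE),
                 volume = rnorm(1000, 1000, 100))

# Built-in R: one total per price level, sorted by price.
base_sum <- tapply(dd$volume, dd$price, sum)

# data.table: keyby groups by price and returns the groups sorted.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt     <- data.table::as.data.table(dd)
  dt_sum <- dt[, list(ss = sum(volume)), keyby = price]
  print(dt_sum)
}

# dplyr: group_by followed by summarise.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_sum <- dplyr::summarise(dplyr::group_by(dd, price), ss = sum(volume))
  print(dplyr_sum)
}
```

All three produce one total per distinct price, so any of them can stand in for the ddply call when only the aggregate is needed.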
In the end…
Plyr is slow because of the way it works with and manages the data.frame manipulation.
[punt:: see the comments to the original question].
Data-Generating gist .rmd