Editorial Team
Asked: June 9, 2026 (2026-06-09T14:33:23+00:00)

I think I am using plyr incorrectly. Could someone please tell me if this is ‘efficient’ plyr code?

require(plyr)
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume)) 

A little context: I have a few large aggregation problems and I have noted that they were each taking some time. In trying to solve the issues, I became interested in the performance of various aggregation procedures in R.

I tested a few aggregation methods – and found myself waiting around all day.

When I finally got results back, I discovered a huge gap between the plyr method and the others – which makes me think that I’ve done something dead wrong.

I ran the following code (I thought I’d check out the new dataframe package while I was at it):

require(plyr)
require(data.table)
require(dataframe)
require(rbenchmark)
require(xts)

# plyr: split the data.frame by price and sum volume within each group
plyr <- function(dd) ddply(dd, .(price), summarise, ss=sum(volume))
# base R variants; the .x versions index columns by position for the matrix input
# (column 1 = price, column 2 = volume)
t.apply <- function(dd) unlist(tapply(dd$volume, dd$price, sum))
t.apply.x <- function(dd) unlist(tapply(dd[,2], dd[,1], sum))
l.apply <- function(dd) unlist(lapply(split(dd$volume, dd$price), sum))
l.apply.x <- function(dd) unlist(lapply(split(dd[,2], dd[,1]), sum))
b.y <- function(dd) unlist(by(dd$volume, dd$price, sum))
b.y.x <- function(dd) unlist(by(dd[,2], dd[,1], sum))
agg <- function(dd) aggregate(dd$volume, list(dd$price), sum)
agg.x <- function(dd) aggregate(dd[,2], list(dd[,1]), sum)
# data.table: grouped sum
dtd <- function(dd) dd[, sum(volume), by=(price)]

obs <- c(5e1, 5e2, 5e3, 5e4, 5e5, 5e6, 5e6, 5e7, 5e8)
timS <- timeBasedSeq('20110101 083000/20120101 083000')

bmkRL <- list(NULL)

for (i in 1:5){
  tt <- timS[1:obs[i]]

  for (j in 1:8){
    pxl <- seq(0.9, 1.1, by= (1.1 - 0.9)/floor(obs[i]/(11-j)))
    px <- sample(pxl, length(tt), replace=TRUE)
    vol <- rnorm(length(tt), 1000, 100)

    d.df <- base::data.frame(time=tt, price=px, volume=vol)
    d.dfp <- dataframe::data.frame(time=tt, price=px, volume=vol)
    d.matrix <- as.matrix(d.df[,-1])
    d.dt <- data.table(d.df)

    listLabel <- paste('i=',i, 'j=',j)

    bmkRL[[listLabel]] <- benchmark(plyr(d.df), plyr(d.dfp), t.apply(d.df),     
                         t.apply(d.dfp), t.apply.x(d.matrix), 
                         l.apply(d.df), l.apply(d.dfp), l.apply.x(d.matrix),
                         b.y(d.df), b.y(d.dfp), b.y.x(d.matrix), agg(d.df),
                         agg(d.dfp), agg.x(d.matrix), dtd(d.dt),
          columns =c('test', 'elapsed', 'relative'),
          replications = 10,
          order = 'elapsed')
  }
}
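
Before comparing timings, it is worth confirming that the base R variants compute the same grouped sums. The following is a minimal sketch on a tiny made-up data set (the seed, the sizes, and the variable names are assumptions for illustration and are not part of the original benchmark). It uses base R only.

```r
# Tiny illustrative data set; values and seed are assumptions, not benchmark data.
set.seed(1)
dd <- data.frame(price  = sample(c(0.9, 1.0, 1.1), 30, replace = TRUE),
                 volume = rnorm(30, 1000, 100))

# Three of the base R aggregation styles used in the benchmark.
by_tapply <- tapply(dd$volume, dd$price, sum)
by_lapply <- unlist(lapply(split(dd$volume, dd$price), sum))
by_agg    <- aggregate(dd$volume, list(dd$price), sum)

# All three order their groups by price, so the sums can be compared directly.
same_1 <- isTRUE(all.equal(as.vector(by_tapply), unname(by_lapply)))
same_2 <- isTRUE(all.equal(unname(by_lapply), by_agg$x))
cat("tapply vs lapply/split agree:", same_1, "\n")
cat("lapply/split vs aggregate agree:", same_2, "\n")
```

If the methods agree on a small input, differences in the benchmark reflect speed only, not different results.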

The test was supposed to run up to 5e8 observations, but it took too long, mostly because of plyr. The final table, for 5e5 observations, shows the problem:

$`i= 5 j= 8`
                  test  elapsed    relative
15           dtd(d.dt)    4.156    1.000000
6        l.apply(d.df)   15.687    3.774543
7       l.apply(d.dfp)   16.066    3.865736
8  l.apply.x(d.matrix)   16.659    4.008422
4       t.apply(d.dfp)   21.387    5.146054
3        t.apply(d.df)   21.488    5.170356
5  t.apply.x(d.matrix)   22.014    5.296920
13          agg(d.dfp)   32.254    7.760828
14     agg.x(d.matrix)   32.435    7.804379
12           agg(d.df)   32.593    7.842397
10          b.y(d.dfp)   98.006   23.581809
11     b.y.x(d.matrix)   98.134   23.612608
9            b.y(d.df)   98.337   23.661453
1           plyr(d.df) 9384.135 2257.972810
2          plyr(d.dfp) 9384.448 2258.048123

Is this right? Why is plyr 2250x slower than data.table? And why didn’t using the new data frame package make a difference?

The session info is:

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.8-6        zoo_1.7-7        rbenchmark_0.3   dataframe_2.5    data.table_1.8.1     plyr_1.7.1      

loaded via a namespace (and not attached):
[1] grid_2.15.1    lattice_0.20-6 tools_2.15.1 
1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 2:33 pm (2026-06-09T14:33:24+00:00)

    Why is it so slow? A little research turned up a mailing-list post from August 2011 in which @hadley, the package author, states:

    This is a drawback of the way that ddply always works with data
    frames. It will be a bit faster if you use summarise instead of
    data.frame (because data.frame is very slow), but I’m still thinking
    about how to overcome this fundamental limitation of the ddply
    approach.


    As for whether this is efficient plyr code, I didn’t know either. After a round of parameter testing and benchmarking, it looks like we can do better.

    The summarise() in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it isn’t helping with anything that isn’t already simple, and the .data and .(price) arguments can be made more explicit. The result is:

    ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )
    

    The summarise helper may seem nice, but it just isn’t quicker than a simple function call. That makes sense; just compare our little function with the code for summarise. Running your benchmarks with the revised call yields a noticeable gain. Don’t take that to mean you’ve used plyr incorrectly. You haven’t; it just isn’t efficient, and nothing you can do with it will make it as fast as the other options.

    In my opinion the optimized function still stinks: it isn’t clear, it has to be mentally parsed, and it is still ridiculously slow compared with data.table (even with a 60% gain).


    The same thread, on the slowness of plyr, mentions a plyr2 project. Since the original answer was written, the plyr author has released dplyr as the successor to plyr. Both plyr and dplyr are billed as data manipulation tools and your stated interest is aggregation, but you may still want benchmark results for the new package for comparison, since it has a reworked backend to improve performance.

    plyr_Original   <- function(dd) ddply( dd, .(price), summarise, ss=sum(volume))
    plyr_Optimized  <- function(dd) ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )
    
    # %.% was the chaining operator in dplyr 0.1.x; later releases use %>% instead
    dplyr <- function(dd) dd %.% group_by(price) %.% summarize( sum(volume) )
    
    data_table <- function(dd) dd[, sum(volume), keyby=price]
    

    The dataframe package has been removed from CRAN and subsequently from the tests, along with the matrix function versions.

    Here are the i=5, j=8 benchmark results:

    $`obs= 500,000 unique prices= 158,286 reps= 5`
                      test elapsed relative
    9     data_table(d.dt)   0.074    1.000
    4          dplyr(d.dt)   0.133    1.797
    3          dplyr(d.df)   1.832   24.757
    6        l.apply(d.df)   5.049   68.230
    5        t.apply(d.df)   8.078  109.162
    8            agg(d.df)  11.822  159.757
    7            b.y(d.df)  48.569  656.338
    2 plyr_Optimized(d.df) 148.030 2000.405
    1  plyr_Original(d.df) 401.890 5430.946
    

    No doubt the optimizing helped a bit. Take a look at the d.df functions; they just can’t compete.
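
The benchmark header above reports 158,286 unique prices among 500,000 observations, so every method is summing a very large number of very small groups. As a rough sketch of where that group count comes from, the snippet below reuses the price-grid construction from the question for i=5, j=8. The seed is an assumption, so the exact count will differ slightly from the one reported.

```r
# Illustrative only: the seed is an assumption, so the exact count will
# differ from the 158,286 reported in the benchmark header.
set.seed(1)
n <- 5e5   # observations for i = 5
j <- 8

# Same price grid as in the question's benchmark loop.
pxl <- seq(0.9, 1.1, by = (1.1 - 0.9) / floor(n / (11 - j)))
px  <- sample(pxl, n, replace = TRUE)

n_groups        <- length(unique(px))
mean_group_size <- n / n_groups
cat("groups:", n_groups, " mean rows per group:", round(mean_group_size, 2), "\n")
```

With only a handful of rows per group, any fixed per-group cost dominates the total running time, which is the situation the timings above reflect.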

    For a little perspective on the slowness of the data.frame structure here are micro-benchmarks of the aggregation times of data_table and dplyr using a larger test dataset (i=8,j=8).

    $`obs= 50,000,000 unique prices= 15,836,476 reps= 5`
    Unit: seconds
                 expr    min     lq median     uq    max neval
     data_table(d.dt)  1.190  1.193  1.198  1.460  1.574    10
          dplyr(d.dt)  2.346  2.434  2.542  2.942  9.856    10
          dplyr(d.df) 66.238 66.688 67.436 69.226 86.641    10
    

    The data.frame is still left in the dust. Not only that, but here’s the elapsed system.time to populate the data structures with the test data:

    `d.df` (data.frame)  3.181 seconds.
    `d.dt` (data.table)  0.418 seconds.
    

    Both creation and aggregation of the data.frame are slower than those of the data.table.

    Working with a data.frame in R is slower than some alternatives, but as the benchmarks show, the built-in R functions blow plyr out of the water. Even managing the data.frame as dplyr does, which improves upon the built-ins, doesn’t give optimal speed, whereas data.table is faster in both creation and aggregation, and it does so while still working with and upon data.frames.

    In the end…

    Plyr is slow because of the way it works with and manages the data.frame manipulation.
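
To give a feel for that kind of overhead, here is a minimal base R sketch that contrasts two styles of grouped sum: building a small data.frame for every group, versus splitting only the numeric vector. It is an illustration of the general cost pattern under assumed data sizes, not plyr's actual implementation; the function names `df_style` and `vec_style` are hypothetical.

```r
# Assumed sizes and seed, chosen only so the example runs quickly.
set.seed(42)
n  <- 10000
dd <- data.frame(price  = sample(seq(0.9, 1.1, length.out = 2000), n, replace = TRUE),
                 volume = rnorm(n, 1000, 100))

# Style 1 (hypothetical helper): split the whole data.frame, build a one-row
# data.frame per group, then bind all the pieces back together.
df_style <- function(dd) {
  pieces <- split(dd, dd$price)
  rows <- lapply(pieces, function(x) data.frame(price = x$price[1], ss = sum(x$volume)))
  do.call(rbind, rows)
}

# Style 2 (hypothetical helper): split only the volume vector and sum each piece.
vec_style <- function(dd) {
  vapply(split(dd$volume, dd$price), sum, numeric(1))
}

t_df  <- system.time(res_df  <- df_style(dd))[["elapsed"]]
t_vec <- system.time(res_vec <- vec_style(dd))[["elapsed"]]
cat("data.frame per group:", t_df, "s; vector split:", t_vec, "s\n")
```

Both styles return the same sums; only the amount of per-group data.frame work differs, and timings will vary by machine.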

    [punt:: see the comments to the original question].


    ## R version 3.0.2 (2013-09-25)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## 
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base     
    ## 
    ## other attached packages:
    ## [1] microbenchmark_1.3-0 rbenchmark_1.0.0     xts_0.9-7           
    ## [4] zoo_1.7-11           data.table_1.9.2     dplyr_0.1.2         
    ## [7] plyr_1.8.1           knitr_1.5.22        
    ## 
    ## loaded via a namespace (and not attached):
    ## [1] assertthat_0.1  evaluate_0.5.2  formatR_0.10.4  grid_3.0.2     
    ## [5] lattice_0.20-27 Rcpp_0.11.0     reshape2_1.2.2  stringr_0.6.2  
    ## [9] tools_3.0.2
    

    Data-Generating gist .rmd
