The Archive Base

Editorial Team
Asked: June 17, 2026

I am trying to clean data using ddply but it is running very slowly


I am trying to clean data using ddply but it is running very slowly on 1.3M rows.

Sample code:

# Create sample data frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=TRUE),
                Consumption=sample(-20:20, num_rows, replace=TRUE),
                StartDate=as.Date(sample(15000:15020, num_rows, replace=TRUE), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
# Record the sign of each value, then store only the magnitude.
# Needed so that ddply groups rows with the same positive and negative values together
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)

I have written a function to remove rows where the consumption value in one row is equal in magnitude but opposite in sign to the consumption value in another row (for the same id and dates).

# Remove rows from a data frame where there is an equal but opposite consumption value.
# Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3){
  copies <- abs(sum(x3$Neg))               # number of rows left after cancelling
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)    # sign of the surviving rows
  x3 <- x3[seq_len(copies),]               # keep the first `copies` rows (none if 0)
  x3$Consumption <- sgn*x3$Consumption
  x3$Neg <- NULL
  x3
}
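
To make the intended behaviour concrete, here is a minimal sketch on a single toy group. The toy data and the slightly condensed function body are illustrative assumptions, not the code above verbatim; the logic (keep abs(sum(Neg)) rows and give them the net sign) is the same.

```r
# Toy group: one id/date/consumption key with three positive rows and one negative.
grp <- data.frame(
  id          = 1L,
  Consumption = c(5, 5, 5, 5),   # already abs(), as in the sample data
  Neg         = c(1, 1, -1, 1)   # sign of the original value
)

clean_negatives <- function(x3) {
  net  <- sum(x3$Neg)              # positives minus negatives
  keep <- x3[seq_len(abs(net)), ]  # keep |net| rows; zero rows if everything cancels
  keep$Consumption <- sign(net) * keep$Consumption
  keep$Neg <- NULL
  keep
}

res <- clean_negatives(grp)
print(res)   # two rows remain, both with Consumption 5
```

One negative row cancels one positive row, so two of the three positive rows survive. A group whose positives and negatives balance exactly returns zero rows.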

I then use ddply to apply that function to each group, removing these erroneous rows from the data:

library(plyr)
ptm <- proc.time()
df_cleaned <- ddply(df, .(id,StartDate, EndDate, Consumption),
                    function(x){clean_negatives(x)})
proc.time() - ptm

I was hoping I could use data.table to make this go faster but I couldn’t work out how to employ data.table to help.

With 1.3M rows, so far it is taking my desktop all day to compute and still hasn’t finished.

1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 4:59 pm

    Since you asked about a data.table implementation, here is one. Your function can also be simplified considerably: first get the sign of each group by summing Neg, then keep the first abs(sum(Neg)) rows of each group, and finally multiply Consumption by that sign (as shown below).

    library(data.table)
    # convert df to a data.table keyed on the grouping columns
    dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
    # first obtain the sign of each group directly (:= adds the column by reference)
    dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
    # then keep the first abs(sum(Neg)) rows of each group
    dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
    # apply the sign for the final output (the $ version was commented out after Statquant's comment)
    # dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
    dt.fil[, Consumption := (Consumption*sign)]
    # drop the helper columns
    dt.fil <- subset(dt.fil, select=-c(Neg, sign))
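
    As a possible variation on the filtering step, the rows can be selected through data.table's `.I` symbol (row numbers within the original table) instead of subsetting `.SD` inside every group. This is only a sketch and was not part of the benchmark below; the names `keys`, `sgn`, `idx` and `dt_fil`, the seed and the smaller row count are assumptions made for illustration. It should return the same rows as the `.SD` version, and this idiom is often reported to be faster when there are many small groups.

    ```r
    library(data.table)

    set.seed(1)  # arbitrary seed so the toy data is reproducible
    n  <- 1e4
    df <- data.frame(
      id          = sample(1:20, n, replace = TRUE),
      Consumption = sample(-20:20, n, replace = TRUE),
      StartDate   = as.Date(sample(15000:15020, n, replace = TRUE), origin = "1970-01-01")
    )
    df$EndDate     <- df$StartDate + 90
    df$Neg         <- ifelse(df$Consumption < 0, -1, 1)
    df$Consumption <- abs(df$Consumption)

    keys <- c("id", "StartDate", "EndDate", "Consumption")
    dt   <- data.table(df, key = keys)

    # Net sign per group, stored as integer to match the integer Consumption column
    dt[, sgn := as.integer(sign(sum(Neg))), by = keys]

    # .I holds row numbers in dt; collect the first |sum(Neg)| row numbers of each group
    idx    <- dt[, .I[seq_len(abs(sum(Neg)))], by = keys]$V1
    dt_fil <- dt[idx]

    # Restore the sign and drop the helper columns by reference
    dt_fil[, Consumption := Consumption * sgn]
    dt_fil[, c("Neg", "sgn") := NULL]
    ```

    Using a helper column called `sgn` rather than `sign` also avoids having a column that shares its name with the base function `sign()`.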
    

    Benchmarking

    • The data with one million rows:

      # Create sample data frame
      num_rows <- 1e6
      df <- data.frame(id=sample(1:20, num_rows, replace=TRUE),
                      Consumption=sample(-20:20, num_rows, replace=TRUE),
                      StartDate=as.Date(sample(15000:15020, num_rows, replace=TRUE), origin = "1970-01-01"))
      df$EndDate <- df$StartDate + 90
      df$Neg <- ifelse(df$Consumption < 0, -1, 1)
      df$Consumption <- abs(df$Consumption)
      
    • The data.table function:

      FUN.DT <- function() {
          require(data.table)
          dt <- data.table(df, key=c("id", "StartDate", "EndDate", "Consumption"))
          dt <- dt[, sign := sign(sum(Neg)), 
                     by = c("id", "StartDate", "EndDate", "Consumption")]
          dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], 
                     by=c("id", "StartDate", "EndDate", "Consumption")]
          dt.fil[, Consumption := (Consumption*sign)]
          dt.fil <- subset(dt.fil, select=-c(Neg, sign))
      }
      
    • Your function with ddply:

      FUN.PLYR <- function() {
          require(plyr)
          clean_negatives <- function(x3) {
              copies <- abs(sum(x3$Neg))
              sgn <- ifelse(sum(x3$Neg) <0, -1, 1) 
              x3 <- x3[0:copies,]
              x3$Consumption <- sgn*x3$Consumption
              x3$Neg <- NULL
              x3
          }
          df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption), 
                                 function(x) clean_negatives(x))
      }
      
    • Benchmarking with rbenchmark (with 1 run only):

      require(rbenchmark)
      benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")
      
              test replications elapsed relative user.self sys.self user.child sys.child
      1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
      2 FUN.PLYR()            1 242.268   39.477   152.855   82.881          0         0
      

    My data.table implementation is about 39 times faster than your current plyr implementation on this data (6.1 s versus 242.3 s elapsed). I compare the two complete implementations rather than a single shared function because the functions themselves differ.

    Note: I loaded the packages inside each function so that the timings cover the complete cost of obtaining the result. For the same reason, the conversion from data.frame to a keyed data.table also happens inside the benchmarked function. The measured speed-up is therefore a lower bound.
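
    If you want to see how much of the data.table time is setup and how much is the grouped work itself, the two parts can be timed separately with `system.time`. The following is a rough sketch under assumed settings (a smaller `n`, an arbitrary seed, and the hypothetical names `t_setup`, `t_compute` and `out`); the timings will vary by machine, so no expected numbers are given.

    ```r
    library(data.table)

    set.seed(42)  # arbitrary seed
    n  <- 1e5     # smaller than the 1e6 rows above so the sketch runs quickly
    df <- data.frame(
      id          = sample(1:20, n, replace = TRUE),
      Consumption = sample(-20:20, n, replace = TRUE),
      StartDate   = as.Date(sample(15000:15020, n, replace = TRUE), origin = "1970-01-01")
    )
    df$EndDate     <- df$StartDate + 90
    df$Neg         <- ifelse(df$Consumption < 0, -1, 1)
    df$Consumption <- abs(df$Consumption)

    keys <- c("id", "StartDate", "EndDate", "Consumption")

    # Setup cost: conversion to a keyed data.table (includes sorting by the key)
    t_setup <- system.time(
      dt <- data.table(df, key = keys)
    )

    # Computation cost: the grouped filter and the sign fix only
    t_compute <- system.time({
      dt[, sgn := as.integer(sign(sum(Neg))), by = keys]
      out <- dt[, .SD[seq_len(abs(sum(Neg)))], by = keys]
      out[, Consumption := Consumption * sgn]
      out[, c("Neg", "sgn") := NULL]
    })

    # Elapsed seconds for each part
    print(rbind(setup = t_setup, compute = t_compute)[, "elapsed", drop = FALSE])
    ```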

© 2021 The Archive Base. All Rights Reserved