I am trying to clean data using ddply but it is running very slowly on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id = sample(1:20, num_rows, replace = TRUE),
                 Consumption = sample(-20:20, num_rows, replace = TRUE),
                 StartDate = as.Date(sample(15000:15020, num_rows, replace = TRUE), origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
#Are values negative?
# Needed for subsetting in ddply rows with same positive and negative values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I have written a function to remove rows where a consumption value in one row is equal in magnitude but opposite in sign to a consumption value in another row (for the same id and dates).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
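clean_negatives <- function(x3) {
  # Net number of unmatched rows in this group (positives minus negatives)
  copies <- abs(sum(x3$Neg))
  # Sign of the rows that survive the cancellation
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
  # Keep only as many rows as remain unmatched (zero rows if everything cancels)
  x3 <- x3[seq_len(copies), ]
  x3$Consumption <- sgn * x3$Consumption
  x3$Neg <- NULL
  x3
}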
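To make the intended behaviour concrete, here is a small worked example on a made-up group. The `toy` data frame is hypothetical and not part of the real data; the function is repeated (same logic as above) only so the snippet runs on its own.

```r
# Same logic as clean_negatives above, repeated so this snippet is self-contained
clean_negatives <- function(x3) {
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
  x3 <- x3[seq_len(copies), ]
  x3$Consumption <- sgn * x3$Consumption
  x3$Neg <- NULL
  x3
}

# Hypothetical group: three rows of +5 and one row of -5 (already split into abs value and sign)
toy <- data.frame(id = 1, Consumption = 5, Neg = c(1, 1, -1, 1))
res <- clean_negatives(toy)
print(res)  # one +5 is cancelled by the -5, so two rows of +5 remain

# A group that cancels completely returns zero rows
cancelled <- clean_negatives(data.frame(id = 1, Consumption = 5, Neg = c(1, -1)))
print(nrow(cancelled))
```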
I then use ddply to apply that function and remove these erroneous rows from the data.
library(plyr)  # provides ddply
ptm <- proc.time()
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption), clean_negatives)
proc.time() - ptm
I was hoping data.table could make this faster, but I couldn’t work out how to apply it here.
With 1.3M rows, it has been running on my desktop all day and still hasn’t finished.
Your question asks about a data.table implementation. So, I’ve shown it here. Your function could be drastically simplified as well. You can first get the sign by summing up Neg and then filter the table and then multiply Consumption by sign (as shown below).
Benchmarking
The data with million rows:
The data.table function:
Your function with ddply:
Benchmarking with rbenchmark (with 1 run only):
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine to your implementation because the functions are different).
Note: I loaded the packages within the function in order to obtain the complete time to obtain the result. Also, for the same reason I converted the data.frame to data.table with keys inside the benchmarking function. This is therefore the minimum speed-up.
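A minimal sketch of the idea described above (sum Neg per group, drop groups that cancel, apply the sign), which is not necessarily the exact code that was benchmarked. It assumes the data frame has only the columns from the question, so every row within a group is identical and the surviving rows can be rebuilt by repeating each group row. The names `grp`, `idx`, and `dt_cleaned` and the fixed seed are my own choices for illustration.

```r
library(data.table)

# Sample data as in the question (seed is an assumption, for reproducibility only)
set.seed(1)
num_rows <- 10000
df <- data.frame(id = sample(1:20, num_rows, replace = TRUE),
                 Consumption = sample(-20:20, num_rows, replace = TRUE),
                 StartDate = as.Date(sample(15000:15020, num_rows, replace = TRUE),
                                     origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)

dt <- as.data.table(df)

# One row per group: net = (number of positive rows) - (number of negative rows)
grp <- dt[, .(net = sum(Neg)), keyby = .(id, StartDate, EndDate, Consumption)]

# Groups whose positives and negatives cancel exactly disappear
grp <- grp[net != 0]

# Repeat each remaining group abs(net) times and restore the sign
idx <- rep(seq_len(nrow(grp)), times = abs(grp$net))
dt_cleaned <- grp[idx, .(id, StartDate, EndDate,
                         Consumption = sign(net) * Consumption)]
```

Because the grouping is done once in a single vectorised call rather than by invoking an R function per group, this should scale far better than the ddply version; wrapping the last three steps in `system.time()` is a simple way to check on your own data.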