I often find myself needing to apply a small number of rule-based transformations to dataframes based on certain conditions, typically a fixed number of fields having certain values. The transformations can modify any number of columns, usually one to three. The number of rows involved in these transformations is small compared to the total number of rows in the data frames. Currently I am using ddply but the performance is lacking because ddply modifies all rows.
I am looking for a way to solve this problem in an elegant, generic manner taking advantage of the fact that only a small number of rows needs to be changed. Below is a simplified example of the types of transformations I’m dealing with.
df <- data.frame(Product=gl(4,10,labels=c("A","B", "C", "D")),
Year=sort(rep(2002:2011,4)),
Quarter=rep(c("Q1","Q2", "Q3", "Q4"), 10),
Sales=1:40)
> head(df)
Product Year Quarter Sales
1 A 2002 Q1 1
2 A 2002 Q2 2
3 A 2002 Q3 3
4 A 2002 Q4 4
5 A 2003 Q1 5
6 A 2003 Q2 6
>
transformations <- function(df) {
if (df$Year == 2002 && df$Product == 'A') {
df$Sales <- df$Sales + 3
} else if (df$Year == 2009 && df$Product == 'C') {
df$Sales <- df$Sales - 10
df$Product <- 'E'
}
df
}
library(plyr)
df <- ddply(df, .(Product, Year), transformations)
> head(df)
Product Year Quarter Sales
1 A 2002 Q1 4
2 A 2002 Q2 5
3 A 2002 Q3 6
4 A 2002 Q4 7
5 A 2003 Q1 5
6 A 2003 Q2 6
Insted of hard-coded conditionals I’m using a pairlist of conditional and transformation functions, e.g., the code below, but that’s not a meaningful improvement.
transformation_rules <- list(
list(
condition = function(df) df$Year == 2002 && df$Product == 'A',
transformation = function(df) {
df$Sales <- df$Sales + 3
df
}
)
)
What are some better ways to approach this problem?
I don’t think you need to use
plyrfor this problem at all. I think you can simply useifelse()and take advantage of the fact that R is vectorized and get the same result.Since your function directly modifies the
Salescolumn, I made a copy of it like so before running plyr:df2 <- df. I also had my example make a new columnSales2instead of overwriting theSalescolumn.And then rewrote your function like this:
And testing whether or not the output is equal:
And comparing the system timing between the two shows that the vectorized version avoiding ddply is much faster:
So, unless I’m missing something – you can avoid
plyrall together here and get some nice speed improvements. Ififelse()proves too slow, you could write up some boolean functions to go even faster, though I doubt that will be necessary.