I’m a bit confused. I routinely use transform like this ddply(data.frame, 1, transform, new.column

Question

Asked: May 31, 2026 (2026-05-31T06:12:50+00:00)

I’m a bit confused. I routinely use transform like this

    ddply(data.frame, 1, transform, new.column = function(old.col.1,old.col.2,...))

This is also recommended by Hadley.

But recently I asked a question and Hadley stated this:

Don’t use transform. It’s a helper function suitable for interactive use, not for programming with.

So what’s wrong with transform? I think I’m convinced now that this is stupid:

    transform(data.frame, col2 = fun(col1))

But is it not very useful in the ddply setting?


1 Answer

Editorial Team · Answer 1 · 2026-05-31T06:12:51+00:00

There’s a difference between using transform within ddply and using the function transform() on its own. In the standalone case, it is far better (and quicker) to just do:

Mydata$col3 <- fun(Mydata$col1, Mydata$col2)

The ddply/transform combination is especially useful if you have more than one column to change, e.g.

Mynewdata <- ddply(Mydata,1,transform,col3=fun1(col1,col2), col4=fun2(col1,col2))

And even then, you have the more flexible option of using within(), which allows you to use calculated results to calculate the next column:

Mynewdata <- within(Mydata,{
    col2 <- fun1(col1)
    col3 <- fun2(col1,col2)
})
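To make the difference concrete, here is a minimal sketch with made-up data (the data frame `df` and its values are assumptions for illustration, not taken from the question). within() runs its statements in sequence, so col3 can use the freshly created col2. transform() evaluates all of its arguments against the original data, so the same attempt fails unless an object named col2 happens to exist somewhere else in scope.

```r
# Hypothetical toy data, only for illustration
df <- data.frame(col1 = c(1, 2, 3))

# within(): statements run in order, so col3 can see the new col2
res <- within(df, {
    col2 <- col1 * 2
    col3 <- col1 + col2
})
print(res$col3)

# transform(): every argument is evaluated against the original columns,
# so col2 is not yet available when col3 is computed
attempt <- try(transform(df, col2 = col1 * 2, col3 = col1 + col2),
               silent = TRUE)
print(inherits(attempt, "try-error"))
```

With transform() you would need two separate calls (or nested calls) to get the same result.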

The thing with transform() is that it was written specifically to be used interactively. If you use it within a function, you might run into trouble. It is similar to subset() in that way: they’re convenience functions, but they’re neither fast nor very safe to use within more complex code.
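One way this can go wrong inside a function is a name clash. This is a sketch of a single pitfall, not necessarily the one Hadley had in mind, and the data frame and function names below are hypothetical. transform() looks names up in the data frame first and only then in the calling function, so a column that shares its name with a function argument silently wins.

```r
# Hypothetical data: the column name 'offset' clashes with the function argument
df <- data.frame(x = 1:3, offset = c(100, 200, 300))

shift_transform <- function(data, offset) {
    # 'offset' is looked up in the data frame first, so the column wins
    transform(data, y = x + offset)
}

shift_explicit <- function(data, offset) {
    # explicit indexing: 'offset' can only be the function argument here
    data$y <- data$x + offset
    data
}

y_transform <- shift_transform(df, offset = 1)$y
y_explicit  <- shift_explicit(df, offset = 1)$y

print(y_transform)  # the column was used: 101 202 303
print(y_explicit)   # the argument was used: 2 3 4
```

Interactively you know which columns your data has, so this rarely bites. In a function that receives arbitrary data frames, you don’t.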

Opinions differ on ddply(). In some cases it works quickly and gives very clean and readable code; in other cases I consider it serious overkill. ddply() is often faster and easier when you have to use non-vectorized functions, in which case the options above wouldn’t work. But for that, you also have the option of using mapply:

Mynewdata <- within(Mydata, col3 <- mapply(myfun,col1,col2))

mapply can also be considerably faster in this case. To give you a basic example:

library(plyr)        # provides ddply()
library(rbenchmark)  # provides benchmark()

Mydata <- data.frame(col1=rnorm(5),col2=rpois(5,3))
myfun <- function(x,y){
    if(y == 0) mean(x) else
     mean(c(x,seq(1,y,by=1)))
}

code1 <- expression(Newdata <- ddply(Mydata,1,transform,col3=myfun(col1,col2)))
code2 <- expression(Newdata2 <- within(Mydata, col3 <- mapply(myfun,col1,col2)))

> benchmark(code1,code2)
   test replications elapsed relative 
1 code1          100    0.50     12.5 
2 code2          100    0.04      1.0

The main problem I have with ddply() is that the original order of your observations is not preserved: the result comes back sorted by the grouping variable (col1 here), as you can see in the example output below:

Mydata              Newdata2                    Newdata
        col1 col2         col1 col2      col3         col1 col2      col3
1 0.07060223    4 | 0.07060223    4 2.0141204 | 0.05658259    2 1.0188609
2 1.84645791    2 | 1.84645791    2 1.6154860 | 0.07060223    4 2.0141204
3 0.05658259    2 | 0.05658259    2 1.0188609 | 0.84119845    1 0.9205992
4 0.89998084    5 | 0.89998084    5 2.6499968 | 0.89998084    5 2.6499968
5 0.84119845    1 | 0.84119845    1 0.9205992 | 1.84645791    2 1.6154860

Both approaches calculate the correct result, but mapply() does so faster in this case while preserving the order of the observations in the data frame.
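As a quick sanity check, the col3 values can be reproduced by hand from the printed output. The sketch below uses only the first two rows of Mydata as shown above. The `if (y == 0)` test in myfun expects a single value, which is why the function has to be applied pair by pair rather than to whole columns.

```r
# Same function as in the benchmark above
myfun <- function(x, y) {
    if (y == 0) mean(x) else mean(c(x, seq(1, y, by = 1)))
}

# First two rows of Mydata as printed in the output above
col1 <- c(0.07060223, 1.84645791)
col2 <- c(4, 2)

# mapply() calls myfun once per pair (col1[i], col2[i]), in the original order
col3 <- mapply(myfun, col1, col2)
print(round(col3, 7))
```

For the first row this is mean(c(0.07060223, 1, 2, 3, 4)), which matches the 2.0141204 shown in both Newdata and Newdata2.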
