In the book Software for Data Analysis: Programming with R, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, good scripts that use data.table objects specifically avoid object assignment with <-, which is typically used to store the result of a function.
First, a technical question. Imagine an R function called proc1 that accepts a data.table object x as its argument (in addition to, maybe, other parameters). proc1 returns NULL but modifies x using :=. From what I understand, calling proc1(x=x1) makes a copy of x1 just because of the way that promises work. However, as demonstrated below, the original object x1 is still modified by proc1. Why/how is this?
> require(data.table)
> x1 <- CJ(1:2, 2:3)
> x1
V1 V2
1: 1 2
2: 1 3
3: 2 2
4: 2 3
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> proc1(x1)
NULL
> x1
V1 V2 y
1: 1 2 2
2: 1 3 3
3: 2 2 4
4: 2 3 6
>
Furthermore, it seems that using proc1(x=x1) isn’t any slower than doing the procedure directly on x1, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:
> x1 <- CJ(1:2000, 1:500)
> x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
> proc1 <- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
> system.time(proc1(x1))
user system elapsed
0.00 0.02 0.02
> x1 <- CJ(1:2000, 1:500)
> system.time(x1[,y:= V1*V2])
user system elapsed
0.03 0.00 0.03
So, given that passing a data.table argument to a function doesn’t add time, that makes it possible to write procedures for data.table objects, incorporating both the speed of data.table and the generalizability of a function. However, given what John Chambers said, that functions should not have side-effects, is it really “ok” to write this type of procedural programming in R? Why was he arguing that side effects are “bad”? If I’m going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write “good” data.table procedures?
Yes, the addition, modification, and deletion of columns in data.tables is done by reference. In a sense, it is a good thing because a data.table usually holds a lot of data, and it would be very memory- and time-consuming to reassign it all every time a change to it is made. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs or your environment won’t be affected, and you can just focus on the function’s output. It’s simple, hence comfortable.
Of course it is ok to disregard John Chambers’s advice if you know what you are doing. About writing “good” data.table procedures, here are a couple of rules I would consider if I were you, as a way to limit complexity and the number of side-effects:
do.something.to(table) and not table <- do.something.to(table). If instead the function had another (“real”) output, then when calling result <- do.something.to(table), it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.
While “one output / no-side-effect” functions are the norm in R, the above rules allow for “one output or side-effect”. If you agree that a side-effect is somehow a form of output, then you’ll agree I am not bending the rules too much by loosely sticking to R’s one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can’t do it, but I would try to avoid it if possible.
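To make the "one output or side-effect" idea concrete, here is a rough sketch of how such a procedure could look. The function name add_product and the small example table are made up for illustration; address() and copy() are existing data.table functions. The sketch assumes the table is returned invisibly so that calling the function without reassignment prints nothing, and it uses copy() for the case where the caller's table must stay untouched.

```r
library(data.table)

# Hypothetical helper (name invented for this example): adds column y to
# its argument by reference and returns the same table invisibly, so the
# modified table is the function's only "output".
add_product <- function(dt) {
  dt[, y := V1 * V2]
  invisible(dt)
}

x1 <- data.table(V1 = c(1L, 1L, 2L, 2L), V2 = c(2L, 3L, 2L, 3L))
addr_before <- address(x1)      # memory address of the table before the call

add_product(x1)                 # called for its side effect, no reassignment

"y" %in% names(x1)              # TRUE: the caller's table was modified
address(x1) == addr_before      # TRUE: still the same object, nothing was copied

# If the caller's table must not change, make the copy explicit.
x2 <- data.table(V1 = c(1L, 1L, 2L, 2L), V2 = c(2L, 3L, 2L, 3L))
x3 <- add_product(copy(x2))

"y" %in% names(x2)              # FALSE: the original is untouched
"y" %in% names(x3)              # TRUE: only the copy gained the column
```

Keeping the copy() at the call site, rather than hidden inside the function, leaves the choice between speed and no side effects visible to whoever reads the calling code.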