Using data.table package, is it possible to summarise data preserving combinations of variables that

Question

0

Asked: June 17, 20262026-06-17T18:43:12+00:00 2026-06-17T18:43:12+00:00

Using data.table package, is it possible to summarise data preserving combinations of variables that

0

Using data.table package, is it possible to summarise data preserving combinations of variables that do not appear in the input?

With plyr package I know how to do this with the .drop argument, for example:

require(plyr)
df <- data.frame(categories = c(rep("A",3), rep("B",3), rep("C",3)), groups = c(rep(c("X", "Y"),4), "Z"), values = rep(1, 9))

df1 <- ddply(df, c("categories","groups"), .drop = F, summarise, sum = sum(values))

output:

 categories groups sum
1          A      X   2
2          A      Y   1
3          A      Z   0
4          B      X   1
5          B      Y   2
6          B      Z   0
7          C      X   1
8          C      Y   1
9          C      Z   1

In this case I preserve all groups/categories combinations even if its sum is 0.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-17T18:43:14+00:00

Great question. Here are two ways. They both use by-without-by.

DT = as.data.table(df)
setkey(DT,categories,groups)
DT[CJ(unique(categories),unique(groups)), sum(values,na.rm=TRUE)]

   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z  0
4:          B      X  1
5:          B      Y  2
6:          B      Z  0
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

where CJ stands for Cross Join, see ?CJ. by-without-by just means that j gets executed on each group that each row of i joins to.

Admitedly it looks tricky on first sight. The idea is that if you have a known subset of groups, this syntax is faster than grouping everything and then selecting just the results from that you need. But in this case you’d like everything anyway so there’s not much advantage, other than being able to lookup groups that don’t exist in the data (which you can’t do with by).

Another way is to by first as normal, then join the CJ() result to that :

DT[,sum(values),keyby='categories,groups'][CJ(unique(categories),unique(groups))]
   categories groups V1
1:          A      X  2
2:          A      Y  1
3:          A      Z NA
4:          B      X  1
5:          B      Y  2
6:          B      Z NA
7:          C      X  1
8:          C      Y  1
9:          C      Z  1

but then you get NA instead of the desired 0. Those could be replaced using set() if need be. The second way might be faster because the two unique calls are given much smaller input.

Both methods can be wrapped up into small helper functions if you do this a lot.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

Using data.table package, is it possible to summarise data preserving combinations of variables that

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply