Using data.table package, is it possible to summarise data preserving combinations of variables that do not appear in the input?
With plyr package I know how to do this with the .drop argument, for example:
require(plyr)
df <- data.frame(categories = c(rep("A",3), rep("B",3), rep("C",3)), groups = c(rep(c("X", "Y"),4), "Z"), values = rep(1, 9))
df1 <- ddply(df, c("categories","groups"), .drop = F, summarise, sum = sum(values))
output:
categories groups sum
1 A X 2
2 A Y 1
3 A Z 0
4 B X 1
5 B Y 2
6 B Z 0
7 C X 1
8 C Y 1
9 C Z 1
In this case I preserve all groups/categories combinations even if its sum is 0.
Great question. Here are two ways. They both use by-without-by.
where
CJstands for Cross Join, see?CJ. by-without-by just means thatjgets executed on each group that each row ofijoins to.Admitedly it looks tricky on first sight. The idea is that if you have a known subset of groups, this syntax is faster than grouping everything and then selecting just the results from that you need. But in this case you’d like everything anyway so there’s not much advantage, other than being able to lookup groups that don’t exist in the data (which you can’t do with
by).Another way is to
byfirst as normal, then join theCJ()result to that :but then you get NA instead of the desired 0. Those could be replaced using
set()if need be. The second way might be faster because the twouniquecalls are given much smaller input.Both methods can be wrapped up into small helper functions if you do this a lot.