Following on from "How to optimise filtering and counting for every row in a large R data frame", I have a data.table such as the following:
  name day wages hour colour
1  Ann   1   100    6  Green
2  Ann   1   150   18   Blue
3  Ann   2   200   10   Blue
4  Ann   3   150   10  Green
5  Bob   1   100   11    Red
6  Bob   1   200   17    Red
7  Bob   1   150   20  Green
8  Bob   2   100   11    Red
I wish to know, for every unique name/day pair and for each of four time periods, a number of facts. The time periods I care about are:
t1 (hour < 9)
t2 (hour < 17)
t3 (hour > 9)
t4 (hour > 17)
Some examples of facts might be:
wages > 175
colour = "Green"
I can accomplish this with the following data.table query:
setkey(dt, name, day)
result <- dt[, list(wages.t1 = sum(wages > 175 & hour < 9),
                    wages.t2 = sum(wages > 175 & hour < 17),
                    wages.t3 = sum(wages > 175 & hour > 9),
                    wages.t4 = sum(wages > 175 & hour > 17),
                    green.t1 = sum(colour == "Green" & hour < 9),
                    green.t2 = sum(colour == "Green" & hour < 17),
                    green.t3 = sum(colour == "Green" & hour > 9),
                    green.t4 = sum(colour == "Green" & hour > 17)),
             by = list(name, day)]
Giving me:
     name day wages.t1 wages.t2 wages.t3 wages.t4 green.t1 green.t2 green.t3 green.t4
[1,]  Ann   1        0        0        0        0        1        1        0        0
[2,]  Ann   2        0        1        1        0        0        0        0        0
[3,]  Ann   3        0        0        0        0        0        1        1        0
[4,]  Bob   1        0        0        1        0        0        0        1        1
[5,]  Bob   2        0        0        0        0        0        0        0        0
But this a) is horrible to read and write, and b) seems inefficient.
Any tips on how I can do better? Note that in my real scenario I have many hundreds of thousands of rows, four time periods, and 30-35 facts per time period.
Code to create dt:
dt = data.table(
name = factor(c("Ann", "Ann", "Ann", "Ann",
"Bob", "Bob", "Bob", "Bob")),
day = c(1, 1, 2, 3, 1, 1, 1, 2),
wages = c(100, 150, 200, 150, 100, 200, 150, 100),
hour = c(6, 18, 10, 10, 11, 17, 20, 11),
colour = c("Green", "Blue", "Blue", "Green", "Red",
"Red", "Green", "Red")
)
How about something like:
Clearly the column names aren’t tackled but that could be added if this approach is ok.
This should be more efficient because each t (time-period test) and each f (fact test) is evaluated only once per group, and those results are then combined pairwise.
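A minimal sketch of that idea, not necessarily the code originally posted: inside each name/day group, every time-period test and every fact test is computed once as a logical vector, and each fact/period pair is then counted with `sum(f & t)`. The list names `periods` and `facts`, and the `paste()`-based column naming, are illustrative assumptions.

```r
library(data.table)

dt <- data.table(
  name   = factor(c("Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Bob")),
  day    = c(1, 1, 2, 3, 1, 1, 1, 2),
  wages  = c(100, 150, 200, 150, 100, 200, 150, 100),
  hour   = c(6, 18, 10, 10, 11, 17, 20, 11),
  colour = c("Green", "Blue", "Blue", "Green", "Red", "Red", "Green", "Red")
)

result <- dt[, {
  # Each time-period test is evaluated once per group.
  periods <- list(t1 = hour < 9, t2 = hour < 17, t3 = hour > 9, t4 = hour > 17)
  # Each fact test is evaluated once per group.
  facts <- list(wages = wages > 175, green = colour == "Green")

  # Count the rows where both the fact and the period hold,
  # naming each column <fact>.<period>.
  out <- list()
  for (fn in names(facts)) {
    for (tn in names(periods)) {
      out[[paste(fn, tn, sep = ".")]] <- sum(facts[[fn]] & periods[[tn]])
    }
  }
  out
}, by = list(name, day)]

print(result)
```

With 30-35 facts, only the `facts` list needs to grow; the loop that forms the combinations stays the same. Whether this is actually faster than the hand-written version on hundreds of thousands of rows is worth checking with `system.time()` on real data.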