Following on from How to optimise filtering and counting for every row in a

Editorial Team

Asked: June 2, 2026

Following on from How to optimise filtering and counting for every row in a large R data frame

I have a data.table such as the following:

  name day wages hour colour
1  Ann   1   100    6  Green
2  Ann   1   150   18   Blue
3  Ann   2   200   10   Blue
4  Ann   3   150   10  Green
5  Bob   1   100   11    Red
6  Bob   1   200   17    Red
7  Bob   1   150   20  Green
8  Bob   2   100   11    Red

I wish to know, for every unique name/day pair and for each of four time periods, a number of facts. The time periods I care about are:

t1 (hour < 9) 
t2 (hour < 17) 
t3 (hour > 9) 
t4 (hour > 17)

Some examples of facts might be:

wages > 175
colour == "Green"

I can accomplish this with the following data.table query:

setkey(dt, name, day)
result <- dt[, list(wages.t1 = sum(wages > 175 & hour < 9),
                    wages.t2 = sum(wages > 175 & hour < 17),
                    wages.t3 = sum(wages > 175 & hour > 9),
                    wages.t4 = sum(wages > 175 & hour > 17),
                    green.t1 = sum(colour == "Green" & hour < 9),
                    green.t2 = sum(colour == "Green" & hour < 17),
                    green.t3 = sum(colour == "Green" & hour > 9),
                    green.t4 = sum(colour == "Green" & hour > 17)),
             by = list(name, day)]

Giving me

     name day wages.t1 wages.t2 wages.t3 wages.t4 green.t1 green.t2 green.t3 green.t4
[1,]  Ann   1        0        0        0        0        1        1        0        0
[2,]  Ann   2        0        1        1        0        0        0        0        0
[3,]  Ann   3        0        0        0        0        0        1        1        0
[4,]  Bob   1        0        0        1        0        0        0        1        1
[5,]  Bob   2        0        0        0        0        0        0        0        0

But this (a) is horrible to read and write, and (b) seems inefficient.

Any tips on how I can do better? Note that in my real scenario I have many hundreds of thousands of rows, four time periods, and 30-35 facts per time period.

Code to create dt:

library(data.table)

dt = data.table(
  name = factor(c("Ann", "Ann", "Ann", "Ann",
                  "Bob", "Bob", "Bob", "Bob")),
  day = c(1, 1, 2, 3, 1, 1, 1, 2),
  wages = c(100, 150, 200, 150, 100, 200, 150, 100),
  hour = c(6, 18, 10, 10, 11, 17, 20, 11),
  colour = c("Green", "Blue", "Blue", "Green", "Red",
             "Red", "Green", "Red")
)

1 Answer

Editorial Team · Answer 1 · 2026-06-02T09:06:17+00:00

How about something like:

# Facts and time periods as unevaluated expressions
f = list(quote(wages>175),quote(colour=="Green"))
t = list(quote(hour<9),quote(hour<17),quote(hour>9),quote(hour>17))
# dt is the data.table created in the question
# Within each group, evaluate every t and every f once, then take the
# inner product of each (t, f) pair of logical vectors
dt[,as.list(mapply("%*%",
            lapply(t,eval,.SD),
            rep(lapply(f,eval,.SD),each=length(t))
           )), by=list(name,day)]

     name day V1 V2 V3 V4 V5 V6 V7 V8
[1,]  Ann   1  0  0  0  0  1  1  0  0
[2,]  Ann   2  0  1  1  0  0  0  0  0
[3,]  Ann   3  0  0  0  0  0  1  1  0
[4,]  Bob   1  0  0  1  0  0  0  1  1
[5,]  Bob   2  0  0  0  0  0  0  0  0

Clearly the column names aren’t tackled but that could be added if this approach is ok.
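One possible way to add the column names, as a rough sketch rather than a tested recipe for the real data: give both lists names and build the labels in the same order in which mapply() pairs the conditions (all periods for the first fact, then all periods for the second, and so on). The names `facts`, `periods`, and `col_names` are my own choices for illustration and do not appear in the question.

```r
library(data.table)

# Same example data as in the question.
dt <- data.table(
  name   = factor(c("Ann", "Ann", "Ann", "Ann", "Bob", "Bob", "Bob", "Bob")),
  day    = c(1, 1, 2, 3, 1, 1, 1, 2),
  wages  = c(100, 150, 200, 150, 100, 200, 150, 100),
  hour   = c(6, 18, 10, 10, 11, 17, 20, 11),
  colour = c("Green", "Blue", "Blue", "Green", "Red", "Red", "Green", "Red")
)

# Named lists of unevaluated conditions (names are illustrative).
facts   <- list(wages = quote(wages > 175), green = quote(colour == "Green"))
periods <- list(t1 = quote(hour < 9), t2 = quote(hour < 17),
                t3 = quote(hour > 9), t4 = quote(hour > 17))

# Labels in pairing order: wages.t1 ... wages.t4, green.t1 ... green.t4.
col_names <- paste(rep(names(facts), each = length(periods)),
                   names(periods), sep = ".")

result <- dt[, {
  p  <- lapply(periods, eval, .SD)  # each period evaluated once per group
  fv <- lapply(facts, eval, .SD)    # each fact evaluated once per group
  counts <- mapply("%*%", p, rep(fv, each = length(p)), USE.NAMES = FALSE)
  setNames(as.list(counts), col_names)
}, by = list(name, day)]

print(result)
```

Adding a fact or a period then only means adding one entry to the corresponding list; the labels follow automatically.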

This should be more efficient because each t and each f is evaluated only once per group; the resulting logical vectors are then combined pairwise, and the inner product of two logical vectors is the number of rows where both are TRUE.
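To see why the inner product gives the count, here is a small base R check on made-up vectors (`in_period` and `has_fact` are illustrative names, standing in for one evaluated period and one evaluated fact within a group). If I count correctly, with 4 periods and 30 facts this means 34 condition evaluations per group instead of the 240 comparisons in the spelled-out version, though I have not benchmarked it.

```r
# Two logical vectors of equal length, one per condition.
in_period <- c(TRUE, TRUE, FALSE, TRUE)
has_fact  <- c(TRUE, FALSE, FALSE, TRUE)

# %*% treats TRUE/FALSE as 1/0, so the inner product counts the rows
# where both conditions hold; drop() turns the 1 x 1 matrix into a scalar.
via_product <- drop(in_period %*% has_fact)

# The same count written as a logical AND followed by sum().
via_and <- sum(in_period & has_fact)

print(via_product)  # 2
print(via_and)      # 2
```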
