A common use case in R (at least for me) is identifying observations in

Question

Asked: May 22, 2026 (2026-05-22T21:17:48+00:00)

A common use case in R (at least for me) is identifying observations in a data frame that have some characteristic that depends on the values in some subset of other observations.

To make this more concrete, suppose I have a number of workers (indexed by WorkerId), each of which
has an associated “Iteration”:

    raw <- data.frame(WorkerId=c(1,1,1,1,2,2,2,2,3,3,3,3),
              Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

and I want to eventually subset the data frame to exclude the “last” iteration (by creating a “remove” boolean) for each worker. I can write a function to do this:

    raw$remove <- mapply(function(wid, iter) {
                           iter == max(raw$Iteration[raw$WorkerId == wid])},
                         raw$WorkerId, raw$Iteration)

    > raw$remove
     [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

but this gets very slow as the data frame gets larger (presumably because I’m needlessly computing the max for every observation).

My question is: what’s the more efficient (and idiomatic) way of doing this in the functional programming style? Is it first creating a WorkerId-to-max-value dictionary and then using that as a parameter in another function that operates on each observation?

1 Answer

Editorial Team · Answer 1 · 2026-05-22T21:17:48+00:00

The “most natural way” IMO is the split-lapply-rbind method. You start by split()-ting the data frame into a list of per-worker groups, then lapply() the processing rule (in this case removing the last row of each group, which assumes the rows are ordered by Iteration within each worker) and then rbind() the pieces back together. It’s all doable as a nested set of function calls. The inner two steps are illustrated here and the final one-liner is presented at the bottom:

    > lapply(split(raw, raw$WorkerId), function(x) x[-NROW(x), ])
    $`1`
      WorkerId Iteration
    1        1         1
    2        1         2
    3        1         3

    $`2`
      WorkerId Iteration
    5        2         1
    6        2         2
    7        2         3

    $`3`
       WorkerId Iteration
    9         3         1
    10        3         2
    11        3         3

    do.call(rbind, lapply(split(raw, raw$WorkerId), function(x) x[-NROW(x), ]))
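
If you want the “remove” flag itself rather than the trimmed data frame, a vectorized alternative in base R is to compute each worker's maximum once and compare against it. The sketch below shows two variants: ave(), which returns the per-group maximum already aligned with the rows, and a tapply() lookup table, which is roughly the “dictionary” idea from the question. The names worker_max, max_by_worker, remove_lookup and trimmed are only illustrative. Unlike dropping the last row of each group, neither variant assumes the rows are sorted by Iteration.

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Variant 1: ave() applies max() within each WorkerId group and returns
# a vector with one value per row of raw.
worker_max <- ave(raw$Iteration, raw$WorkerId, FUN = max)
raw$remove <- raw$Iteration == worker_max

# Variant 2: build a named lookup table (WorkerId -> max Iteration) once,
# then index it by each row's WorkerId.
max_by_worker <- tapply(raw$Iteration, raw$WorkerId, max)
remove_lookup <- as.vector(
  raw$Iteration == max_by_worker[as.character(raw$WorkerId)]
)

# Keep only the rows that are not flagged.
trimmed <- raw[!raw$remove, ]
```

Both variants compute the maximum once per worker instead of once per row, which is where the mapply() version in the question loses time, and both give the same flag vector as shown in the question.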

Hadley Wickham has developed a wide set of tools, the plyr package, that extend this strategy to a wider variety of tasks.
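
As a rough sketch of what that looks like for this problem, assuming the plyr package is installed, ddply() does the split, apply and recombine steps in one call. The helper name kept is only illustrative, and the per-group function here filters on the maximum Iteration rather than on row position.

```r
raw <- data.frame(WorkerId  = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  Iteration = c(1,2,3,4,1,2,3,4,1,2,3,4))

# Guarded so the script still runs where plyr is not installed.
if (requireNamespace("plyr", quietly = TRUE)) {
  # Split by WorkerId, drop each group's max-Iteration row, recombine.
  kept <- plyr::ddply(raw, "WorkerId",
                      function(d) d[d$Iteration != max(d$Iteration), ])
  print(kept)
}
```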
