I am working on a big dataset and have got a problem with data

Asked: June 5, 2026 (2026-06-05T06:56:23+00:00)

I am working on a big dataset and have got a problem with data cleaning. My data set looks like this:

data <- cbind (group = c(1,1,1,2,2,3,3,3,4,4,4,4,4), 
               member = c(1,2,3,1,2,1,2,3,1,2,3,4,5), 
               score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))

I want to keep the groups in which the sum of score equals 1 and remove every group in which the sum of score equals 0. For a group in which the sum of score is greater than 1, I want to randomly remove members with score equal to 1 until only one such member is left; e.g., when the sum of score is 3, two randomly selected members with score equal to 1 are removed. The data may then look like this:

newdata <- cbind (group = c(1,1,1,3,3,4,4,4), 
                  member = c(1,2,3,2,3,1,3,5), 
                  score = c(0,1,0,0,1,0,1,0))

Can anybody help me get this done?


1 Answer

Editorial Team · Answer 1 · 2026-06-05T06:56:24+00:00

I would write a function that combines the various manipulations for you. Here is one such function, heavily commented:

process <- function(x) {
    ## this adds a vector with the group sum score
    x <- within(x, sumScore <- ave(score, group, FUN = sum))
    ## drop the groups with sumScore == 0; a logical index keeps every row
    ## when no group has sumScore == 0, whereas -which() would drop them all
    x <- x[x$sumScore != 0L, , drop = FALSE]
    ## choose groups with sumScore > 1
    ## sample sumScore - 1 of the rows where score == 1L
    foo <- function(x) {
        scr <- unique(x$sumScore) ## sanity & take only 1 of the sumScore
        ## which of the group's observations have score = 1L
        want <- which(x$score == 1L)
        ## want to sample all bar one of these
        want <- sample(want, scr-1)
        ## remove the selected rows & return
        x[-want, , drop = FALSE]
    }
    ## which rows belong to groups with sumScore > 1 (a logical index, so
    ## the code still works when no group qualifies)
    multi <- x$sumScore > 1L
    ## select only those rows, split them up by group, lapply foo to each
    ## group, then rbind the resulting data frames together
    newX <- do.call(rbind,
                    lapply(split(x[multi, , drop = FALSE], x[multi, "group"]),
                           FUN = foo))
    ## bind the sampled groups on to the rest of x (groups with sumScore == 1)
    newX <- rbind(x[!multi, , drop = FALSE], newX)
    ## remove row labels
    rownames(newX) <- NULL
    ## return the data without the sumScore column
    newX[, 1:3]
}

With your data:

dat <- data.frame(group = c(1,1,1,2,2,3,3,3,4,4,4,4,4), 
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5), 
                  score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))

gives:

> set.seed(42)
> process(dat)
  group member score
1     1      1     0
2     1      2     1
3     1      3     0
4     3      1     1
5     3      2     0
6     4      1     0
7     4      3     1
8     4      5     0

Which I think is what was wanted.
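A side note on the row-dropping steps in process(): removing rows with a negative index built from which() is fragile, because when nothing matches, which() returns integer(0) and the negative index then selects no rows at all. The toy data frame below is made up purely to illustrate this; a logical index does not have the problem.

```r
## Toy data (hypothetical, not the asker's data): no row has sumScore == 0
df <- data.frame(id = 1:3, sumScore = c(2, 1, 1))

## which() finds no matches, so idx is integer(0)
idx <- which(df$sumScore == 0)

## negating integer(0) still gives integer(0): every row is dropped
nrow(df[-idx, , drop = FALSE])              # 0

## a logical index keeps all rows, as intended
nrow(df[df$sumScore != 0, , drop = FALSE])  # 3
```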

Update: In process() above, the internal function foo() could be rewritten to sample the 1 row with score == 1L that is kept and remove the other rows with score == 1L, i.e., replace foo() with the one below:

foo <- function(x) {
    ## which of the group's observations have score = 1L
    ones <- which(x$score == 1L)
    ## want to sample just one of these to keep; indexing via sample.int()
    ## also behaves correctly if 'ones' has length one
    keep <- ones[sample.int(length(ones), 1)]
    ## return the rows with score != 1L plus the one selected row
    x[sort(c(which(x$score != 1L), keep)), , drop = FALSE]
}

The two versions are essentially the same operation, but the foo() that selects just 1 row makes the intended behaviour explicit: we want to keep 1 row at random from those with score == 1L, rather than sample scr-1 rows to remove.
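For comparison, the same idea can be sketched as a single pass over the groups. This is only an illustrative alternative, not a replacement for process(): the function name keep_one_scorer() is made up here, and it assumes score only takes the values 0 and 1.

```r
## Hypothetical helper (name invented for this sketch); assumes score is 0/1
keep_one_scorer <- function(x) {
    pieces <- lapply(split(x, x$group), function(g) {
        ## positions of the rows with score == 1 within this group
        ones <- which(g$score == 1)
        ## a group with no score of 1 contributes zero rows
        if (length(ones) == 0L) return(g[0, , drop = FALSE])
        ## choose one scoring row by position; sample.int() sidesteps the
        ## sample(n, 1) behaviour when 'ones' holds a single number
        keep <- ones[sample.int(length(ones), 1L)]
        ## keep every non-scoring row plus the chosen one, in original order
        g[sort(c(which(g$score != 1), keep)), , drop = FALSE]
    })
    out <- do.call(rbind, pieces)
    rownames(out) <- NULL
    out
}

dat <- data.frame(group = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
                  member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
                  score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))

set.seed(1)
res <- keep_one_scorer(dat)
## check: each remaining group should have a score sum of exactly 1
tapply(res$score, res$group, sum)
```

Unlike process(), this version returns the rows ordered by group, because each group is handled in turn and the pieces are bound together in split() order.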
