I am working on a big dataset and have a problem with data cleaning. My dataset looks like this:
data <- cbind (group = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
score = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
I want to keep every group in which the sum of score is equal to 1 and remove every group in which the sum of score is equal to 0. For a group in which the sum of score is greater than 1, e.g. sum of score = 3, I want to randomly select two group members with score equal to 1 and remove them from the group, so that only one member with score 1 remains. The data might then look like this:
newdata <- cbind (group = c(1,1,1,3,3,4,4,4),
member = c(1,2,3,2,3,1,3,5),
score = c(0,1,0,0,1,0,1,0))
Can anybody help me get this done?
I would write a function that combines the various manipulations for you. Here is one such function, heavily commented:
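A minimal sketch of such a function follows, assuming the data can be coerced to a data frame and that score only ever takes the values 0 and 1. The names process(), foo() and scr come from this answer; the function bodies are an illustration and may differ from the original code.

```r
# Sketch only: an illustrative reconstruction of process(), not the original.
process <- function(x) {
    # accept the matrix built with cbind() as well as a data frame
    x <- as.data.frame(x)

    # foo() handles a single group: remove all but one row with score == 1
    foo <- function(d) {
        ones <- which(d$score == 1L)   # positions of rows with score 1
        scr <- length(ones)            # equals the score sum when scores are 0/1
        if (scr > 1L) {
            # choose scr - 1 of those rows at random and remove them;
            # sample.int() avoids the length-1 pitfall of sample()
            rm_rows <- ones[sample.int(scr, scr - 1L)]
            d <- d[-rm_rows, , drop = FALSE]
        }
        d
    }

    # per-row copy of each group's score sum
    sums <- ave(x$score, x$group, FUN = sum)
    # remove whole groups whose score sum is 0
    x <- x[sums > 0, , drop = FALSE]

    # apply foo() to each remaining group and bind the pieces back together
    out <- do.call(rbind, lapply(split(x, x$group), foo))
    rownames(out) <- NULL
    out
}

# Usage with the data from the question
data <- cbind(group  = c(1,1,1,2,2,3,3,3,4,4,4,4,4),
              member = c(1,2,3,1,2,1,2,3,1,2,3,4,5),
              score  = c(0,1,0,0,0,1,0,1,0,1,1,1,0))
set.seed(1)  # only so that the random choice is repeatable
newdata <- process(data)
newdata
```

Because the rows to remove are chosen at random, which members survive in groups 3 and 4 varies from run to run; only the group totals are fixed.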
Applied to your data, that gives what I think was wanted.
Update: In process() above, the internal function foo() could be rewritten to sample only 1 row and remove the others, i.e. replace foo() with a version that does exactly that. The two are essentially the same operation, but a foo() that selects just 1 row makes the intended behaviour explicit: we want to select 1 row at random from those with score == 1L, rather than sample scr - 1 values.
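A version of foo() along those lines might look like the sketch below. It is written here as a standalone function so it can be tried on its own, and the small data frame g4 is a hypothetical test group rather than part of the original answer.

```r
# Hypothetical rewrite of foo(): keep one randomly chosen row with score == 1
foo <- function(d) {
    ones <- which(d$score == 1L)       # positions of rows with score 1
    if (length(ones) > 1L) {
        # the single row to retain, chosen at random
        keep <- ones[sample.int(length(ones), 1L)]
        # keep every other-score row plus the chosen row, in original order
        d <- d[sort(c(which(d$score != 1L), keep)), , drop = FALSE]
    }
    d
}

# Try it on one group with three candidate rows (members 2, 3 and 4)
g4 <- data.frame(group = 4, member = 1:5, score = c(0, 1, 1, 1, 0))
res <- foo(g4)
res
```

Groups with at most one score == 1 row pass through unchanged, so this foo() can replace the internal one without touching the rest of process().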