Here is an example dataframe:
set.seed(0)
x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)
x3 <- c(1, 1, 2, 2, 4, 1, 1, 2, 1)
n <- c(1, 1, 1, 5, 5, 1, 1, 1, 1)
y <- rnorm(9)
mydf <- data.frame(x1, x2, x3, n, y)
What I would like to do is
- identify rows with n=1 and which share identical values of (x1, x2, x3)
- return a single row for each subset with y = mean(y) and n = length(y)
- keep other rows the same.
for example, the new dataframe would be
x1 <- c(1, 1, 1, 1, 2, 2)
x2 <- c(1, 0, 0, 0, 1, 1)
x3 <- c(1, 2, 2, 4, 1, 2)
n <- c(2, 1, 5, 5, 3, 1)
y <- c(mean(y[1:2]), y[3], y[4], y[5], mean(y[c(6:7,9)]), y[8])
newdf <- data.frame(x1, x2, x3, n, y)
I can figure this out with conditionals and loops, but I would prefer to learn more elegant way to do this.
By “identical values in other columns”, I take it you mean that each subset is defined by the same value of
x1in each of the rows of the subset, not thatx1is equal tox2. Thanks for the example to see what you meant.To get parts one and two
This can be
rbind-ed with the part ofmydfwheren!=1to get what you saidThis doesn’t have the same order as you listed. If that is really important, you can add some auxiliary sorting variables.