There is a nice explanation here describing how to eliminate duplicates in a data frame by picking the maximum variable.
I can also see how this can be applied to pick the duplicate with the minimum variable.
My question now is: how do I display the mean of all duplicates?
For example:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
I would like the output:
# id var
# 1 3 (mean of 2 and 4)
# 2 2 (mean of 1 and 3)
# 3 5
# 4 2
My current code is:
averages<-do.call(rbind,lapply(split(z,z$id),function(chunk) mean(chunk$var)))
z<-z[order(z$id),]
z<-z[!duplicated(z$id),]
z$var<-averages
My code runs very slowly and takes about 10 times longer than the method for picking the maximum. How do I optimize this code?
I would use a combination of ave and unique:
UPDATE: After actually timing the solution, here’s something that’s a little faster.
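As a minimal sketch of the two ideas, using the example data frame z from the question: the first variant combines ave and unique as described, and the second uses tapply, which is only an assumed candidate for the faster alternative and has not been timed here. The names z_ave, z_tapply and m are hypothetical and chosen for illustration.

```r
# Example data from the question
z <- data.frame(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))

# Variant 1: ave() replaces each var with the mean of its id group,
# so duplicated ids become identical rows that unique() can drop.
z_ave <- z
z_ave$var <- ave(z_ave$var, z_ave$id, FUN = mean)
z_ave <- unique(z_ave)

# Variant 2 (assumption, not benchmarked): tapply() computes one mean
# per id directly, and the result is rebuilt into a data frame.
m <- tapply(z$var, z$id, mean)
z_tapply <- data.frame(id = as.numeric(names(m)), var = as.vector(m))

print(z_ave)
print(z_tapply)
```

Both variants return one row per id. Variant 1 keeps any other columns of z only if they are identical within each id, while variant 2 returns just id and var. Which is faster will depend on the size of the data and the number of groups, so it is worth checking with system.time on the real data.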