I have a data frame with 50000 rows and 200 columns. There are duplicate rows in the data and I want to aggregate the data by choosing the row with maximum coefficient of variation among the duplicates using aggregate function in R. With aggregate I can use "mean", "sum" by default but not coefficient variation.
For example
aggregate(data, as.columnname, FUN=mean)
Works fine.
I have a custom function for calculating coefficient of variation but not sure how to use it with aggregate.
co.var <- function(x)
(
100*sd(x)/mean(x)
)
I have tried
aggregate(data, as.columnname, function (x) max (co.var (x, data[index (x),])
but it is giving an error as object x is not found.
Assuming that I understand your problem, I would suggest using
tapply()instead ofaggregate()(see?tapplyfor more info). However, a minimal working example would be very helpful.EDIT:
There are faster ways, one of which is below. However, with a 40000 by 100 dataset, the above code only took between 16 and 20 seconds on my machine.
Time the original method
Time second method
And check to see that we get the same results:
There is an additional change from the original post, in that the edited code considers that there may be more than one row with the highest CV for a given ID. Therefore, to get the results from the edited code, you must
unlistthemySelormySel3objects: