I have a large data frame that looks similar to this:
df <- data.frame(dive = factor(sample(c("dive1","dive2"), 10, replace=TRUE)),
speed = runif(10)
)
> df
dive speed
1 dive1 0.80668490
2 dive1 0.53349584
3 dive2 0.07571784
4 dive2 0.39518628
5 dive1 0.84557955
6 dive1 0.69121443
7 dive1 0.38124950
8 dive2 0.22536126
9 dive1 0.04704750
10 dive2 0.93561651
My goal is to obtain the average of values in one column when another column is equal to a certain value and repeat this for all values. i.e. in the example above I would like to return an average for the column speed for every unique value of the column dive. So when dive==dive1, the average for speed is this and so on for each value of dive.
There are many ways to do this in R. Specifically,
by,aggregate,split, andplyr,cast,tapply,data.table,dplyr, and so forth.Broadly speaking, these problems are of the form split-apply-combine. Hadley Wickham has written a beautiful article that will give you deeper insight into the whole category of problems, and it is well worth reading. His
plyrpackage implements the strategy for general data structures, anddplyris a newer implementation performance tuned for data frames. They allow for solving problems of the same form but of even greater complexity than this one. They are well worth learning as a general tool for solving data manipulation problems.Performance is an issue on very large datasets, and for that it is hard to beat solutions based on
data.table. If you only deal with medium-sized datasets or smaller, however, taking the time to learndata.tableis likely not worth the effort.dplyrcan also be fast, so it is a good choice if you want to speed things up, but don’t quite need the scalability ofdata.table.Many of the other solutions below do not require any additional packages. Some of them are even fairly fast on medium-large datasets. Their primary disadvantage is either one of metaphor or of flexibility. By metaphor I mean that it is a tool designed for something else being coerced to solve this particular type of problem in a ‘clever’ way. By flexibility I mean they lack the ability to solve as wide a range of similar problems or to easily produce tidy output.
Examples
basefunctionstapply:aggregate:aggregatetakes in data.frames, outputs data.frames, and uses a formula interface.by:In its most user-friendly form, it takes in vectors and applies a function to them. However, its output is not in a very manipulable form.:
To get around this, for simple uses of
bytheas.data.framemethod in thetaRifxlibrary works:split:As the name suggests, it performs only the “split” part of the split-apply-combine strategy. To make the rest work, I’ll write a small function that uses
sapplyfor apply-combine.sapplyautomatically simplifies the result as much as possible. In our case, that means a vector rather than a data.frame, since we’ve got only 1 dimension of results.External packages
data.table:
dplyr:plyr(the pre-cursor ofdplyr)Here’s what the official page has to say about
plyr:In other words, if you learn one tool for split-apply-combine manipulation it should be
plyr.reshape2:
The
reshape2library is not designed with split-apply-combine as its primary focus. Instead, it uses a two-part melt/cast strategy to perform a wide variety of data reshaping tasks. However, since it allows an aggregation function it can be used for this problem. It would not be my first choice for split-apply-combine operations, but its reshaping capabilities are powerful and thus you should learn this package as well.Benchmarks
10 rows, 2 groups
As usual,
data.tablehas a little more overhead so comes in about average for small datasets. These are microseconds, though, so the differences are trivial. Any of the approaches works fine here, and you should choose based on:plyris always worth learning for its flexibility;data.tableis worth learning if you plan to analyze huge datasets;byandaggregateandsplitare all base R functions and thus universally available)10 million rows, 10 groups
But what if we have a big dataset? Let’s try 10^7 rows split over ten groups.
Then
data.tableordplyrusing operating ondata.tables is clearly the way to go. Certain approaches (aggregateanddcast) are beginning to look very slow.10 million rows, 1,000 groups
If you have more groups, the difference becomes more pronounced. With 1,000 groups and the same 10^7 rows:
So
data.tablecontinues scaling well, anddplyroperating on adata.tablealso works well, withdplyrondata.frameclose to an order of magnitude slower. Thesplit/sapplystrategy seems to scale poorly in the number of groups (meaning thesplit()is likely slow and thesapplyis fast).bycontinues to be relatively efficient–at 5 seconds, it’s definitely noticeable to the user but for a dataset this large still not unreasonable. Still, if you’re routinely working with datasets of this size,data.tableis clearly the way to go – 100% data.table for the best performance ordplyrwithdplyrusingdata.tableas a viable alternative.