I’m trying to find the idiomatic way in R to partition a numerical vector by some index vector, find the sum of all numbers in that partition and then divide each individual entry by that partition sum. In other words, if I start with this:
df <- data.frame(x = c(1,2,3,4,5,6), index = c('a', 'a', 'b', 'b', 'c', 'c'))
I want the output to create a vector (let’s call it z):
c(1/(1+2), 2/(1+2), 3/(3+4), 4/(3+4), 5/(5+6), 6/(5+6))
If I were doing this in SQL and could use window functions, I would do this:
select
x / sum(x) over (partition by index) as z
from df
and if I were using plyr, I would do something like this:
ddply(df, .(index), transform, z = x / sum(x))
but I’d like to know how to do it using the standard R functional programming tools like mapply/aggregate etc.
Yet another option is ave. For good measure, I’ve collected the answers above, tried my best to make their output equivalent (a vector), and provided timings over 1000 runs using your example data as an input. First, my answer using ave: ave(df$x, df$index, FUN = function(z) z/sum(z)). I also show an example using the data.table package since it is usually pretty quick, but I know you’re looking for base solutions, so you can ignore that if you want.
And now a bunch of timings:
The lapply() solution seems to win in this case and data.table() is surprisingly slow. Let’s see how this scales to a bigger aggregation problem:
That seems more consistent with what I’d expect.
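If you want to run a comparison like this on your own machine, here is a minimal sketch using base R's system.time(). The synthetic data, the row count n, and the group count n_groups are all assumptions chosen so the example finishes quickly. It is not the benchmark setup used for the timings discussed above, and the numbers it prints will vary by machine.

```r
# Synthetic data: n rows spread at random over n_groups groups (assumed sizes)
set.seed(1)
n <- 1e5
n_groups <- 100
big <- data.frame(x = runif(n),
                  index = sample(n_groups, n, replace = TRUE))

# Time ave(); the assignment inside system.time() is kept in this environment
t_ave <- system.time(
  z_ave <- ave(big$x, big$index, FUN = function(z) z / sum(z))
)

# Time a split()/lapply()/unsplit() version of the same calculation
t_split <- system.time(
  z_split <- unsplit(lapply(split(big$x, big$index),
                            function(z) z / sum(z)),
                     big$index)
)

# Elapsed seconds for each approach
print(c(ave = t_ave[["elapsed"]], split_lapply = t_split[["elapsed"]]))

# Both approaches should give the same vector
stopifnot(isTRUE(all.equal(z_ave, z_split)))
```

Raising n toward 1e7 gives a rough sense of how each approach scales, memory permitting.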
In summary, you’ve got plenty of good options. Find one or two methods that work with your mental model of how aggregation tasks should work and master that function. Many ways to skin a cat.
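To make a few of those options concrete, here is a minimal base R sketch that computes the vector from the question three ways and checks that they agree. The names z_ave, z_tapply and z_split are only illustrative, and each version is one possible way to write the approach, not necessarily how the other answers wrote theirs.

```r
# Example data from the question
df <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                 index = c("a", "a", "b", "b", "c", "c"))

# 1. ave(): applies FUN within each group, returns values in the original row order
z_ave <- ave(df$x, df$index, FUN = function(z) z / sum(z))

# 2. tapply(): one sum per group, then look up each row's group sum by name
group_sums <- tapply(df$x, df$index, sum)
z_tapply <- as.vector(df$x / group_sums[as.character(df$index)])

# 3. split()/lapply()/unsplit(): scale each piece, then restore the original order
z_split <- unsplit(lapply(split(df$x, df$index),
                          function(z) z / sum(z)),
                   df$index)

# All three should match the vector asked for in the question
expected <- c(1/(1+2), 2/(1+2), 3/(3+4), 4/(3+4), 5/(5+6), 6/(5+6))
stopifnot(isTRUE(all.equal(z_ave, expected)),
          isTRUE(all.equal(z_tapply, expected)),
          isTRUE(all.equal(z_split, expected)))
```

ave() is the shortest of the three because it handles both the grouping and the reordering; the other two spell those steps out explicitly.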
Edit – and an example with 1e7 rows
Probably not large enough for Matt, but as big as my laptop can handle without crashing: