I’m a power Excel pivot table user who is forcing himself to learn R. I know exactly how to do this analysis in Excel, but can’t figure out the right way to code it in R.
I’m trying to group user data by 2 different variables, while grouping the variables into ranges (or bins), then summarizing other variables.
Here is what the data looks like:
userid visits posts revenue
1 25 0 25
2 2 2 0
3 86 7 8
4 128 24 94
5 30 5 18
… … … …
280000 80 10 100
280001 42 4 25
280002 31 8 17
Here is what I am trying to get the output to look like:
VisitRange PostRange # of Users Total Revenue Average Revenue
0 0 X Y Z
1-10 0 X Y Z
11-20 0 X Y Z
21-30 0 X Y Z
31-40 0 X Y Z
41-50 0 X Y Z
> 50 0 X Y Z
0 1-10 X Y Z
1-10 1-10 X Y Z
11-20 1-10 X Y Z
21-30 1-10 X Y Z
31-40 1-10 X Y Z
41-50 1-10 X Y Z
> 50 1-10 X Y Z
I want to group visits and posts in bins of 10 up to a certain level, then group anything higher than 50 as ‘> 50’.
I’ve looked at tapply and ddply as ways to accomplish this, but I don’t think they will work the way I am expecting, though I could be wrong.
Lastly, I know I could do this in SQL using an if/then statement to identify the range of visits and the range of posts (for example – If visits between 1 and 10, then ‘1-10’), then just group by visit range and post range, but my goal here is to start forcing myself to use R. Maybe R isn’t the right tool here, but I think it is…
All help would be appreciated. Thanks in advance.
The idiom in the plyr package, and ddply in particular, is very similar to pivot tables in Excel.
In your example, the only thing you need to do is cut your grouping variables into the desired breaks before passing them to ddply. Here is an example:
First, create some sample data:
Now, use cut to divide your grouping variables into the desired ranges:
Finally, use ddply with summarise:
The results:
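The three steps above can be sketched roughly as follows. This is a minimal illustration, assuming the plyr package is installed. The simulated data, the seed, the new column names (visitRange, postRange) and the summary column names are assumptions made for the example, not the asker's data or necessarily the code originally posted here.

```r
library(plyr)  # assumed to be installed: install.packages("plyr")

# Step 1: simulated sample data (made-up values, for illustration only)
set.seed(1)
n <- 1000
dat <- data.frame(
  userid  = seq_len(n),
  visits  = sample(0:150, n, replace = TRUE),
  posts   = sample(0:30,  n, replace = TRUE),
  revenue = sample(0:100, n, replace = TRUE)
)

# Step 2: bin the grouping variables with cut().
# cut() uses right-closed intervals by default, so these breaks give
# (-Inf,0], (0,10], (10,20], ... (50,Inf], i.e. 0, 1-10, 11-20, ..., > 50.
breaks <- c(-Inf, 0, 10, 20, 30, 40, 50, Inf)
labels <- c("0", "1-10", "11-20", "21-30", "31-40", "41-50", "> 50")
dat$visitRange <- cut(dat$visits, breaks = breaks, labels = labels)
dat$postRange  <- cut(dat$posts,  breaks = breaks, labels = labels)

# Step 3: summarise each (visitRange, postRange) group with ddply
result <- ddply(dat, .(visitRange, postRange), summarise,
  users          = length(userid),
  totalRevenue   = sum(revenue),
  averageRevenue = mean(revenue)
)

head(result)
```

Because cut() returns factors with the labels in the order given, the groups sort as 0, 1-10, 11-20 and so on rather than alphabetically. By default ddply only returns combinations that actually occur in the data; passing .drop = FALSE should keep empty combinations as well.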