This is a really simple problem, but I cannot figure out how to script it. I cannot move forward until I figure it out. I’m really new to R and to using code, and I’m going through several introductory manuals, but haven’t found anything for this specific problem yet.
Generally, here is the issue. Let’s say I have a data frame called x that looks like:
a <- c(1995,1995,1995,1996,1997,1997,1997,1998)
b <- c(1,2,3,1,2,3,4,1)
c <- c(5,7,9,2,4,5,7,8)
(x <- data.frame(a,b,c))
     a b c
1 1995 1 5
2 1995 2 7
3 1995 3 9
4 1996 1 2
5 1997 2 4
6 1997 3 5
7 1997 4 7
8 1998 1 8
Some years appear more than once in column a (e.g. 1995 appears 3 times), when really I want just one entry per year. If I plot column a against column c, I end up with multiple points for each year, which is not helpful. I don’t care about column b, but I want to sum the entries in column c for each year, so that I end up with a data frame with one row per year. Given the above data, the resulting data frame would look like:
     a  c
1 1995 21
2 1996  2
3 1997 16
4 1998  8
Any ideas?
The plyr library is useful for aggregation tasks such as these. plyr also plays very well with ggplot2 graphics. In my opinion, the benefit of plyr is that you explicitly define the structure of the input and output. Here we are passing in a data.frame object and also want a data.frame after processing, so we will use ddply. The first letter corresponds to the input object, and the second to the output. So if we wanted to go from a list object to a data.frame, we’d use ldply, etc.
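As a minimal sketch of that approach, assuming plyr is installed, something like the following should collapse x to one row per year. The values for column c are taken from the printed table in the question, and the name result is just an illustrative choice:

```r
library(plyr)

# Rebuild the example data, using the values shown in the printed table
a <- c(1995, 1995, 1995, 1996, 1997, 1997, 1997, 1998)
b <- c(1, 2, 3, 1, 2, 3, 4, 1)
c <- c(5, 7, 9, 2, 4, 5, 7, 8)
x <- data.frame(a, b, c)

# Split x by year (column a), sum column c within each piece,
# and stack the per-year sums back into a data.frame
result <- ddply(x, "a", summarise, c = sum(c))
print(result)
```

This should give one row per year with sums of 21, 2, 16 and 8. Column b is dropped because summarise only keeps the grouping variable and the columns you define. If you would rather not load a package, base R's aggregate(c ~ a, data = x, FUN = sum) is an alternative that should produce the same table.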