I’m trying to do the equivalent of a SQL group-by summary in R using the plyr function ddply. I have a data frame with three columns (say id, period and event). I’d like to count how many times each id appears in the data frame (count(*) ... group by id in SQL) and get the last value of the event column for each id.
Here is an example of what I have and what I’m trying to obtain:
id period event #original data frame
1 1 1
2 1 0
2 2 1
3 1 1
4 1 1
4 1 0
id t x #what I want to obtain
1 1 1
2 2 1
3 1 1
4 2 0
This is the simple code I’ve been using for that:
teachers.pp<-read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=T) # whole data frame
datos=ddply(teachers.pp,.(id),function(x) c(t=length(x$id), x=x[length(x$id),3])) #This is working fine.
Now, I’ve been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example using syntax equivalent to the one below:
datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise but the result is not what I want.
This is the data frame I get with datos2:
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get using the first piece of code (datos)? What am I doing wrong?
It is not clear to me when I should use summarise and when transform. Could you tell me the correct syntax for the ddply function?
When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names. You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
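The snippet that followed is not reproduced here, so what follows is only a minimal sketch of such a call, not necessarily the exact code originally suggested. It uses the six-row example from the question instead of the full CSV; the name toy is made up for illustration. The point is that teachers.pp[length(id),3] indexes the whole data frame by the group size, so a group of two rows always returns the event value in row 2 of the full table, whichever id is being processed. Inside summarise, the bare column names id and event already refer to the current group.

```r
library(plyr)

# Toy data copied from the example in the question (`toy` is a hypothetical name).
toy <- data.frame(
  id     = c(1, 2, 2, 3, 4, 4),
  period = c(1, 1, 2, 1, 1, 1),
  event  = c(1, 0, 1, 1, 1, 0)
)

# What the original expression looks up for any group of size 2:
# row 2, column 3 of the *whole* data frame, not of the group.
toy[2, 3]

# summarise: one row per id, written in terms of the group's own columns.
res <- ddply(toy, .(id), summarise,
             t = length(id),              # number of rows for this id
             x = event[length(event)])    # last event for this id; tail(event, 1) also works

# transform: keeps every original row and adds a per-group column.
per_row <- ddply(toy, .(id), transform, n = length(id))

print(res)
print(per_row)
```

As a rule of thumb, summarise suits cases where each group collapses to a single row (as in the count-and-last-event summary here), while transform suits cases where every original row is kept and a group-level value is attached to it.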