Suppose I have a data frame with many columns and a particular summary procedure that I wish to apply. There may be several columns I am interested in summarizing by, e.g. columns 2, 3 and 4 of the baseball dataset:
library(plyr)  # provides ddply() and the baseball dataset
ddply(baseball, .(year), "nrow")
ddply(baseball, .(stint), "nrow")
ddply(baseball, .(team), "nrow")
Of course I may wish to apply a more complicated summary and have more columns of output, but let’s stick with the assumption that the summary is going to be done by a single column, and there are several columns I may wish to summarize by. So let’s write a function for the summary, so I can easily vary the column to use for the .(var):
baseballByCol <- function(col) {
  ddply(baseball, .(baseball[,col]), "nrow")
}
This ALMOST works: baseballByCol(2) gives the same output as ddply(baseball, .(year), "nrow"), except that colnames(baseballByCol(2)) is c("baseball[, col]", "nrow"), while colnames(ddply(baseball, .(year), "nrow")) is the desired c("year", "nrow").
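To see where the odd column name comes from, here is a small sketch. As far as I understand it, .() stores the expression unevaluated and ddply labels the grouping column with the deparsed expression. The variable names quotedByName and quotedByExpr are just illustrative.

```r
library(plyr)

# .() captures its arguments as unevaluated expressions.
quotedByName <- .(year)
quotedByExpr <- .(baseball[,col])  # 'col' need not exist; nothing is evaluated here

# names() on a quoted object returns the deparsed expressions,
# which is what ends up as the grouping column name in the result.
nameFromSymbol <- names(quotedByName)
nameFromExpr <- names(quotedByExpr)

print(nameFromSymbol)
print(nameFromExpr)
```

So the label is derived from the text of the expression rather than from the column it happens to select.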
Of course we can solve that:
baseballByCol <- function(col) {
  df <- ddply(baseball, .(baseball[,col]), "nrow")
  colnames(df)[1] <- colnames(baseball)[col]
  return(df)
}
And now baseballByCol(2) is completely identical to the output from ddply(baseball, .(year), "nrow"). To summarize by stint I can use baseballByCol(3), and so on.
But this feels a bit ugly. Is there really no better way to refer to the “by” variable by its column index rather than by name, other than .(baseball[,col]), which messes up the column name?
And is there a cleaner solution in which the function takes the variable name as an argument rather than a column index?
Ideally the function would work with both a column index and a column name.
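One possible approach, as a minimal sketch: ddply also accepts a character vector of column names for its .variables argument, so the column can be passed as a string instead of going through .(). The is.numeric() check used to translate an index into a name is my own addition for illustration, and byIndex and byName are just example variables.

```r
library(plyr)

# Sketch: 'col' may be a column index or a column name.
baseballByCol <- function(col) {
  # Translate a numeric index into the corresponding column name.
  if (is.numeric(col)) {
    col <- colnames(baseball)[col]
  }
  # A character vector is accepted in place of .(var),
  # so the result keeps the real column name.
  ddply(baseball, col, "nrow")
}

byIndex <- baseballByCol(2)
byName <- baseballByCol("year")
print(colnames(byIndex))
```

Because the grouping variable is passed by name, there is no need to rename the first column afterwards.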