Suppose I have a data frame with many columns and a particular summary procedure

Question

Asked: June 15, 2026

Suppose I have a data frame with many columns and a particular summary procedure that I wish to apply. There may be several columns I am interested in summarizing by, e.g. columns 2, 3 and 4 of the baseball dataset:

   ddply(baseball, .(year), "nrow")
   ddply(baseball, .(stint), "nrow")
   ddply(baseball, .(team), "nrow")

Of course I may wish to apply a more complicated summary and have more columns of output, but let’s stick with the assumption that the summary is going to be done by a single column, and there are several columns I may wish to summarize by. So let’s write a function for the summary, so I can easily vary the column to use for the .(var):

   baseballByCol <- function(col) {
       ddply(baseball, .(baseball[,col]), "nrow")
   }

This ALMOST works: baseballByCol(2) is identical to the output from ddply(baseball, .(year), "nrow"), except that colnames(baseballByCol(2)) is c("baseball[, col]", "nrow") while colnames(ddply(baseball, .(year), "nrow")) is the desired c("year", "nrow").

Of course we can solve that:

   baseballByCol <- function(col) {
       df <- ddply(baseball, .(baseball[,col]), "nrow")
       colnames(df)[1] <- colnames(baseball)[col]
       return(df)
   }

And now baseballByCol(2) is completely identical to the output from ddply(baseball, .(year), "nrow"). To summarize by stint I can use baseballByCol(3), and so on.

But this smells a bit ugly. Is there really no better way to refer to the “by” variable by its column index rather than its name, other than .(baseball[,col]), which messes up the column name?

And is there a cleaner solution in which the function takes the variable name as an argument rather than a column index?


1 Answer

Editorial Team · Answer 1 · 2026-06-15T23:55:01+00:00

baseballByCol <- function(col) {
    ddply(baseball, col, "nrow")
}

This works whether col is a column index or a column name.
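As a quick check, the function can be called both ways and the resulting column names compared. This is only a sketch: it assumes the plyr package is installed and that column 2 of baseball is year, as stated in the question. The names byIndex and byName are introduced here for illustration.

```r
library(plyr)  # provides ddply() and the baseball dataset

# The function from the answer, unchanged:
# col may be a column index or a column name.
baseballByCol <- function(col) {
    ddply(baseball, col, "nrow")
}

byIndex <- baseballByCol(2)       # column 2 of baseball is year (per the question)
byName  <- baseballByCol("year")  # same grouping, given by name

# Inspect the column names of each result; the aim is to see "year"
# rather than the "baseball[, col]" label produced by .(baseball[, col]).
colnames(byIndex)
colnames(byName)
```

If both calls report the same column names, the colnames() fix-up from the question should no longer be needed.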
