I would like to aggregate a data.frame by an identifier variable called ensg .

Question

0

Asked: May 27, 20262026-05-27T15:55:49+00:00 2026-05-27T15:55:49+00:00

I would like to aggregate a data.frame by an identifier variable called ensg .

0

I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:

  chromosome probeset               ensg symbol    XXA_00    XXA_36    XXB_00
1          X  4938842 ENSMUSG00000000003   Pbsn  4.796123  4.737717  5.326664

I want to compute the mean for each numeric column over rows with same ensg value. The problem here is that I would like to leave the other identity variables chromosome and symbol untouched as they are also the same for same ensg.

In the end I would like to have a data.frame with identity columns chromosome, ensg, symbol and mean of numeric columns over rows with same identifier. I implemented this in ddply, but it is very slow when compared to aggregate:

spec.mean <- function(eset.piece)
  {
    cbind(eset.piece[1,-numeric.columns],t(colMeans(eset.piece[,numeric.columns])))
  }
t
mean.eset <- ddply(eset.consensus.grand,.(ensg),spec.mean,.progress="tk")

My first aggregate implementation looks like this,

mean.eset=aggregate(eset[,numeric.columns], by=list(eset$ensg), FUN=mean, na.rm=TRUE);

and is much faster. But the problem with aggregate is that I have to reattach the describing variables. I have not figured out how to use my custom function with aggregate since aggregate does not pass data frames but only vectors.

Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-27T15:55:50+00:00

First let’s define a toy example:

df <- data.frame(chromosome = gl(3,  10,  labels = c('A',  'B',  'C')),
             probeset = gl(3,  10,  labels = c('X',  'Y',  'Z')),
             ensg =  gl(3,  10,  labels = c('E1',  'E2',  'E3')),
             symbol = gl(3,  10,  labels = c('S1',  'S2',  'S3')),
             XXA_00 = rnorm(30),
             XXA_36 = rnorm(30),
             XXB_00 = rnorm(30))

And then we use aggregate with the formula interface:

df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,  
    data = df,  FUN = mean)

> df1
  ensg chromosome symbol      XXA_00      XXA_36      XXB_00
1   E1          A     S1 -0.02533499 -0.06150447 -0.01234508
2   E2          B     S2 -0.25165987  0.02494902 -0.01116426
3   E3          C     S3  0.09454154 -0.48468517 -0.25644569

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I would like to aggregate a data.frame by an identifier variable called ensg .

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply