I have a data frame that is some 35,000 rows, by 7 columns. it

Question

Editorial Team

Asked: June 5, 2026 (2026-06-05T18:52:55+00:00)

I have a data frame of some 35,000 rows by 7 columns. It looks like this:

head(nuc)

  chr feature    start      end   gene_id    pctAT    pctGC length
1   1     CDS 67000042 67000051 NM_032291 0.600000 0.400000     10
2   1     CDS 67091530 67091593 NM_032291 0.609375 0.390625     64
3   1     CDS 67098753 67098777 NM_032291 0.600000 0.400000     25
4   1     CDS 67101627 67101698 NM_032291 0.472222 0.527778     72
5   1     CDS 67105460 67105516 NM_032291 0.631579 0.368421     57
6   1     CDS 67108493 67108547 NM_032291 0.436364 0.563636     55

gene_id is a factor with about 3,500 unique levels. For each level of gene_id, I want to get min(start), max(end), mean(pctAT), mean(pctGC), and sum(length).

I tried using lapply and do.call for this, but it’s taking forever (30+ minutes) to run.
The code I’m using is:

nuc_prof = lapply(levels(nuc$gene_id), function(gene){
  t = nuc[nuc$gene_id==gene, ]
  return(list(gene_id=gene, start=min(t$start), end=max(t$end), pctGC =
              mean(t$pctGC), pct = mean(t$pctAT), cdslength = sum(t$length))) 
})
nuc_prof = do.call(rbind, nuc_prof)

I’m certain I’m doing something wrong to slow this down. I haven’t waited for it to finish as I’m sure it can be faster. Any ideas?

1 Answer

Editorial Team · Answer 1 · 2026-06-05T18:52:56+00:00

Since I’m in an evangelizing mood … here’s what the fast data.table solution would look like:

library(data.table)
dt <- data.table(nuc, key="gene_id")

dt[,list(A=min(start),
         B=max(end),
         C=mean(pctAT),
         D=mean(pctGC),
         E=sum(length)), by=key(dt)]
#      gene_id        A        B         C         D   E
# 1: NM_032291 67000042 67108547 0.5582567 0.4417433 283
# 2:       ZZZ 67000042 67108547 0.5582567 0.4417433 283
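
For anyone who wants to try this without the full data set, here is a minimal, self-contained sketch. It assumes the toy data is just the six rows from head(nuc) in the question, duplicated under a made-up second gene ID ("ZZZ", which seems to be what the output above was built from). The output column names (start, end, pctAT, pctGC, cdslength) are my own choice, mirroring the names in the question rather than the A to E used above.

```r
library(data.table)

# Toy data: the six rows shown in the question's head(nuc) output.
one_gene <- data.frame(
  chr     = 1,
  feature = "CDS",
  start   = c(67000042, 67091530, 67098753, 67101627, 67105460, 67108493),
  end     = c(67000051, 67091593, 67098777, 67101698, 67105516, 67108547),
  gene_id = "NM_032291",
  pctAT   = c(0.600000, 0.609375, 0.600000, 0.472222, 0.631579, 0.436364),
  pctGC   = c(0.400000, 0.390625, 0.400000, 0.527778, 0.368421, 0.563636),
  length  = c(10, 64, 25, 72, 57, 55),
  stringsAsFactors = FALSE
)

# Hypothetical second gene: same rows under an invented ID, for illustration only.
second_gene <- one_gene
second_gene$gene_id <- "ZZZ"

nuc <- rbind(one_gene, second_gene)
nuc$gene_id <- factor(nuc$gene_id)  # gene_id is a factor in the question

# Convert once, then compute all five summaries per gene in a single grouped call.
dt <- as.data.table(nuc)
nuc_prof <- dt[, .(start     = min(start),
                   end       = max(end),
                   pctAT     = mean(pctAT),
                   pctGC     = mean(pctGC),
                   cdslength = sum(length)),
               by = gene_id]

print(nuc_prof)
```

The result is one row per gene_id, so on the real data you would expect roughly 3,500 rows. No do.call(rbind, ...) step is needed because the grouped call already returns a single table.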
