I have a dataset composed of values obtained from studies and experiments. Experiments are

Question

Asked: June 11, 2026

I have a dataset composed of values obtained from studies and experiments. Experiments are nested within studies. I want to subsample the dataset so that only 1 experiment is represented for each study. I want to repeat this procedure 10,000 times, randomly drawing the 1 experiment each time, and then calculate some summary statistics for the values. Here is an example dataset:

df=data.frame(study=c(1,1,2,2,2,3,4,4),expt=c(1,2,1,2,3,1,1,2),value=runif(8))

I wrote the following function to do the above, but it is taking forever. Does anyone have any suggestions for streamlining this code? Thanks!

subsample=function(x,A) {
  subsample.list=sapply(1:A,function(m) {
    idx=ddply(x,c("study"),function(i) sample(1:nrow(i),1)) #Sample one experiment from each study
    x[paste(x$study,x$expt,sep="-") %in% paste(idx$study,idx$V1,sep="-"),"value"] } ) #Match the study-experiment combinations and retrieve values
  means.list=ldply(subsample.list,mean) #Calculate the mean of 'values' for each iteration
  c(quantile(means.list$V1,0.025),mean(means.list$V1),upper=quantile(means.list$V1,0.975)) } #Calculate overall means and 95% CIs

1 Answer

Editorial Team · Answer 1 · 2026-06-11T17:20:54+00:00

You can vectorise this much further (even with plyr) and make it much faster:

library(plyr) # provides daply, aaply and .()
yoursummary=function(x)c(quantile(x,0.025),mean(x),upper=quantile(x,0.975))
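subsampleX=function(x,M)
  yoursummary(
    aaply(
      daply(.drop_o=F,x,.(study), # one row per study, one column per iteration
        function(s)s$value[sample.int(nrow(s),M,replace=T)] # M draws per study; index-based so one-experiment studies are safe
      ),2,mean # mean across studies within each iteration
    )
  )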

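The trick here is to do all the sampling up front. If you want M subsamples, draw all M values for a study while you already have that study's rows in hand, instead of re-splitting the data on every iteration. The draws are independent across studies, so this is equivalent to repeating the one-experiment-per-study draw M times.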
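For comparison, here is a minimal base-R sketch of the same up-front sampling idea, without plyr. It assumes the `study` and `value` column names from the example dataset; `subsample_base` is a hypothetical name chosen for illustration, and this is not the code that was timed in this answer.

```r
# Hypothetical helper: up-front sampling in base R only.
subsample_base <- function(x, M) {
  # One list element per study, holding that study's experiment values.
  by_study <- split(x$value, x$study)
  # Draw all M values for each study in one call. Indexing with sample.int()
  # avoids the sample() surprise when a study has a single value >= 1.
  draws <- vapply(
    by_study,
    function(v) v[sample.int(length(v), M, replace = TRUE)],
    numeric(M)
  )
  # Force an M x (number of studies) matrix, also when M is 1.
  draws <- matrix(draws, nrow = M)
  # Row m holds iteration m: one experiment per study.
  iter_means <- rowMeans(draws)
  c(quantile(iter_means, 0.025), mean = mean(iter_means),
    upper = quantile(iter_means, 0.975))
}

set.seed(1)
df <- data.frame(study = c(1, 1, 2, 2, 2, 3, 4, 4),
                 expt  = c(1, 2, 1, 2, 3, 1, 1, 2),
                 value = runif(8))
res <- subsample_base(df, 10000)
print(res)
```

Each row of `draws` is one subsample, so `rowMeans` gives the per-iteration means that the quantiles are then taken over.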

Original code:

> system.time(subsample(df,20000))
   user  system elapsed 
 123.23    0.06  124.74

New vectorised code:

> system.time(subsampleX(df,20000))
   user  system elapsed 
   0.24    0.00    0.25

That’s about 500x faster.
