I need to generate simulated data in which the proportion of censored observations is never exactly 0 or 1, which is why I use a while loop. The problem is that if I increase count to 10,000 (instead of 5), the program becomes very slow, and I have to repeat this for 400 different scenarios. I'm trying to find places where I can vectorize my code piece by piece. How can I avoid the while loop and still keep the condition?
Another approach is to keep the while loop, generate a list of 10,000 datasets that meet my criteria, and then apply the function to the list. Here I use the summary function as an example, but my real function uses both X_after and delta (i.e. mle(X_after, delta)). Is this a better option if I have to use a while loop?
My other concern is memory. How can I avoid running out of memory in such a large simulation?
mu=1 ; sigma=3 ; n=10 ; p=0.10
dset <- function (mu,sigma, n, p) {
  Mean <- array()
  Median <- array()
  Pct_cens_array <- array()
  count = 0
  while(count < 5) {
    lod <- quantile(rlnorm(100000, log(mu), log(sigma)), p = p)
    X_before <- rlnorm(n, log(mu), log(sigma))
    X_after <- ifelse(X_before <= lod, lod, X_before)
    delta <- ifelse(X_before <= lod, 1, 0)
    pct_cens <- sum(delta)/length(delta)
    # print(pct_cens)
    if (pct_cens == 0 | pct_cens == 1 ) next
    else {
      count <- count +1
      if (pct_cens > 0 & pct_cens < 1) {
        sumStats <- summary(X_after)
        Median[count] <- sumStats[3]
        Mean [count]<- sumStats[4]
        Pct_cens_array [count] <- pct_cens
        print(list(pct_cens=pct_cens,X_after=X_after, delta=delta, Median=Median,Mean=Mean,Pct_cens_array=Pct_cens_array))
      }
    }
  }
  return(data.frame(Pct_cens_array=Pct_cens_array, Mean=Mean, Median=Median))
}
I’ve made a few little tweaks to your code without changing the whole style of it. It would be good to heed Yoong Kim’s advice and try to break up the code into smaller pieces, to make it more readable and maintainable.
Your function now gets two “n” arguments, for how many samples you have in each row, and how many iterations (columns) you want.
You were growing the arrays Median and Mean in the loop, which requires a lot of messing about reallocating memory and copying things, which slows everything down. I've predefined X_after and moved the mean and median calculations after the loop to avoid this. (As a bonus, mean and median only get called once instead of n_iteration times.)
The calls to ifelse weren't really needed.
It is a little quicker to call rlnorm once, generating enough values for x and the lod, than to call it twice.
Here's the updated function.
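A minimal sketch of a function along these lines, assuming the names dset2, n_samples, n_iterations and n_lod (none of which are confirmed by the text above). It keeps the while loop and the rejection rule but preallocates the storage, draws from rlnorm once per iteration, and uses pmax in place of ifelse:

```r
# Sketch only: dset2, n_samples, n_iterations and n_lod are assumed names.
dset2 <- function(mu, sigma, n_samples, n_iterations, p, n_lod = 1e5) {
  # Preallocate the outputs instead of growing them inside the loop.
  X_after  <- matrix(NA_real_, nrow = n_samples, ncol = n_iterations)
  pct_cens <- numeric(n_iterations)

  count <- 0
  while (count < n_iterations) {
    # One rlnorm call supplies both the LOD reference sample and the data.
    x        <- rlnorm(n_lod + n_samples, log(mu), log(sigma))
    lod      <- quantile(x[seq_len(n_lod)], probs = p, names = FALSE)
    X_before <- x[n_lod + seq_len(n_samples)]

    censored <- X_before <= lod      # logical censoring indicator (delta)
    pct      <- mean(censored)
    if (pct == 0 || pct == 1) next   # reject draws with no or all values censored

    count <- count + 1
    X_after[, count] <- pmax(X_before, lod)  # same result as the ifelse call
    pct_cens[count]  <- pct
  }

  # Summaries are computed once, after the loop, one value per column.
  data.frame(
    Pct_cens_array = pct_cens,
    Mean           = colMeans(X_after),
    Median         = apply(X_after, 2, median)
  )
}

set.seed(1)
res <- dset2(mu = 1, sigma = 3, n_samples = 10, n_iterations = 5, p = 0.10)
print(res)
```

If the real analysis needs delta as well (for example mle(X_after, delta)), the censored vector could be stored in a second preallocated matrix in the same way.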
Compare timings with, for example,
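One way to run such a comparison is with system.time. The sketch below is self-contained, so it times two stand-in functions (the hypothetical grow_means and prealloc_means) that differ only in whether the result vector is grown or preallocated; it is not the comparison of the functions above. Substituting calls to the original and the updated function gives the actual comparison.

```r
# Hypothetical stand-ins: both return the means of n_iter small lognormal samples.
grow_means <- function(n_iter, n = 10) {
  out <- c()                          # grows by one element per iteration
  for (i in seq_len(n_iter)) out <- c(out, mean(rlnorm(n)))
  out
}

prealloc_means <- function(n_iter, n = 10) {
  out <- numeric(n_iter)              # allocated once, filled in place
  for (i in seq_len(n_iter)) out[i] <- mean(rlnorm(n))
  out
}

# Use the same seed for both runs so they do the same random work.
set.seed(42)
t_grow <- system.time(res_grow <- grow_means(20000))
set.seed(42)
t_pre  <- system.time(res_pre <- prealloc_means(20000))

# Elapsed wall-clock seconds for each version.
print(c(grow = t_grow[["elapsed"]], prealloc = t_pre[["elapsed"]]))
```

Timings vary between machines and runs, so repeating each call a few times gives a steadier picture than a single measurement.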
On my machine, there is a factor of 3 speedup.