I have a double loop that I not only don’t like, but would take

Asked: May 30, 2026 (2026-05-30T22:12:25+00:00)

I have a double loop that I not only don’t like, but that would also take about 14 days to run on my computer, since it goes over 3200 records and 1090 variables at about 0.12 seconds per iteration.

Here is a smaller reproducible example. It simply counts, for each pair of records, how many columns hold the same number, ignoring NAs. Then it attaches the results to the original data frame.

y <- data.frame(c(1,2,1,NA,NA),c(3,3,3,4,NA),c(5,4,5,7,7),c(7,8,7,9,10))
resultdf <- NULL
for(i in 1:nrow(y))
{
  results <- NULL
  for(j in 1:nrow(y))
  {
    results <- c(results,sum((y[i,]==y[j,]),na.rm=TRUE))
  }
  resultdf <- cbind(resultdf,results)
}
y <- cbind(y,resultdf)

Some of these calculations are repeated and could possibly be avoided, which would still leave about 7 days.

If I understand correctly, a few of the apply functions are implemented in C and might be faster, but I haven’t been able to get any of them to work. I’m also curious whether there is a package that would run faster. Can anyone help speed up the calculation?

Thank you!

1 Answer

Editorial Team · Answer 1 · 2026-05-30T22:12:27+00:00

I created data to your specifications, following @BenBolker’s suggestion to store it in a matrix:

> y <- matrix(sample(c(1:9, NA), 3200 * 1090, replace = TRUE),
+             nrow = 3200, ncol = 1090)

and compared the computation times for three different implementations:

f1 was suggested by @Andrei:

> f1 <- function(y)apply(y, 1, function(r1)
+                  apply(y, 1, function(r2)sum(r1==r2, na.rm=TRUE)))

> system.time(r1 <- f1(y))
   user  system elapsed 
 523.51    0.77  528.73

f2 was suggested by @VincentZoonekynd:

> f2 <- function(y) {
+   f <- function(i,j) sum(y[i,] == y[j,], na.rm=TRUE)
+   d <- outer( 1:nrow(y), 1:nrow(y), Vectorize(f) )
+   return(d)
+ }
> system.time(r2 <- f2(y))
   user  system elapsed 
 658.94    1.96  710.67

f3 is a double loop over the upper triangle, as suggested by @BenBolker. It is also a bit more efficient than the loop in your original post in that it pre-allocates the output matrix:

> f3 <- function(y) {
+   result <- matrix(NA, nrow(y), nrow(y))
+   for (i in 1:nrow(y)) {
+     row1 <- y[i, ]
+     for (j in i:nrow(y)) {
+       row2 <- y[j, ]
+       num.matches  <- sum(row1 == row2, na.rm = TRUE)
+       result[i, j] <- num.matches
+       result[j, i] <- num.matches
+     }
+   }
+   return(result)
+ }

> system.time(r3 <- f3(y))
   user  system elapsed 
 167.66    0.08  168.72
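
Before trusting the timings, it may be worth confirming that the three functions return the same counts. The following is a minimal sketch of such a check on the five-row example from the question, stored as a matrix; the names `y_small`, `s1`, `s2`, `s3` and `agree` are introduced here for illustration only, and the function bodies are repeated so the block runs on its own.

```r
# Five-row example from the question, stored as a matrix (filled column by column).
y_small <- matrix(c(1, 2, 1, NA, NA,
                    3, 3, 3, 4, NA,
                    5, 4, 5, 7, 7,
                    7, 8, 7, 9, 10), nrow = 5)

# Nested apply over rows.
f1 <- function(y) apply(y, 1, function(r1)
  apply(y, 1, function(r2) sum(r1 == r2, na.rm = TRUE)))

# outer() over row indices with a vectorized pairwise function.
f2 <- function(y) {
  f <- function(i, j) sum(y[i, ] == y[j, ], na.rm = TRUE)
  outer(seq_len(nrow(y)), seq_len(nrow(y)), Vectorize(f))
}

# Double loop over the upper triangle with a pre-allocated result.
f3 <- function(y) {
  n <- nrow(y)
  result <- matrix(NA, n, n)
  for (i in seq_len(n)) {
    row1 <- y[i, ]
    for (j in i:n) {
      num.matches <- sum(row1 == y[j, ], na.rm = TRUE)
      result[i, j] <- num.matches
      result[j, i] <- num.matches
    }
  }
  result
}

s1 <- f1(y_small)
s2 <- f2(y_small)
s3 <- f3(y_small)

# TRUE only if all three implementations give the same counts in every cell.
agree <- all(s1 == s3) && all(s2 == s3)
print(agree)
print(s3)
```

Because NA == NA evaluates to NA and is dropped by na.rm = TRUE, the diagonal entry for a row equals its number of non-NA values rather than the number of columns.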

So the double loop is the fastest of the three (about 169 seconds elapsed, versus about 529 for f1 and 711 for f2), although it is not as elegant or compact as the other two answers.
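
A further option, not one of the three timed above and not benchmarked here, is to replace the row-by-row comparisons with matrix products. This is a sketch under the assumption that the data take only a small number of distinct values (as in the simulated 1:9 data); the function name `count_matches_crossprod` is hypothetical. For each distinct value it builds a 0/1 indicator matrix and accumulates `tcrossprod()` of that indicator, so cell (i, j) ends up holding the number of columns where rows i and j share the same non-NA value.

```r
# Hypothetical helper: pairwise match counts via one cross product per distinct value.
count_matches_crossprod <- function(y) {
  n <- nrow(y)
  result <- matrix(0, n, n)
  vals <- unique(y[!is.na(y)])          # distinct non-NA values in the data
  for (v in vals) {
    # 1 where the cell equals v, 0 otherwise; NA cells become 0.
    ind <- (!is.na(y) & y == v) + 0
    # ind %*% t(ind): cell (i, j) counts columns where rows i and j both equal v.
    result <- result + tcrossprod(ind)
  }
  result
}

# Five-row example from the question, stored as a matrix.
y_small <- matrix(c(1, 2, 1, NA, NA,
                    3, 3, 3, 4, NA,
                    5, 4, 5, 7, 7,
                    7, 8, 7, 9, 10), nrow = 5)

m <- count_matches_crossprod(y_small)
print(m)
```

The cost grows with the number of distinct values, since each one needs its own cross product, so this is likely only worth trying when that number is small; with many distinct values the double loop may well remain the better choice.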
