I have a large dataframe. For some purposes I need to do the following:

Editorial Team

Asked: June 15, 2026 (2026-06-15T19:08:16+00:00)

I have a large dataframe. For some purposes I need to do the following:

1. Select one column of the data frame.
2. Iterate over all rows of the data frame, ignoring the selected column.
3. Find all rows that are equal in every element except the selected column.
4. Group them so that each group's name is a row index and its values are the indexes of the rows that duplicate that row.

I have written a function for this task, but it is slow because of a nested loop. I would like some ideas on how this code can be improved.

Say we have a dataframe like this:

  V1 V2 V3 V4
1  1  2  1  2
2  1  2  2  1
3  1  1  1  2
4  1  1  2  1
5  2  2  1  2

And we want to get this list as the output:

diff.dataframe("V2", conf.new, conf.new)

Output:

$`1`
[1] 1

$`2`
[1] 2

$`3`
[1] 1 3

$`4`
[1] 2 4

$`5`
[1] 5

The following code reaches the goal, but it is too slow. Is it possible to improve it somehow?

diff.dataframe <- function(param, df1, df2){
  excl.names <- c(param)
  df1.excl <- data.frame(lapply(df1[, !names(df1) %in% excl.names], as.character), stringsAsFactors=FALSE)
  df2.excl <- data.frame(lapply(df2[, !names(df2) %in% excl.names], as.character), stringsAsFactors=FALSE)
  list.out <- list()

  for (i in 1:nrow(df1.excl)){
     for (j in 1:nrow(df2.excl)){
        if (paste(df1.excl[i,],collapse='') == paste(df2.excl[j,], collapse='')){
          if (!as.character(i) %in% unlist(list.out)){
            list.out[[as.character(i)]] <- c(list.out[[as.character(i)]], j)
          }
        }
     }
  }
  return(list.out)
}

1 Answer

Editorial Team · Answer 1 · 2026-06-15T19:08:17+00:00

Let’s generate some data first:

df <- as.data.frame(matrix(sample(2, 20, TRUE), 5))

# Produces df like this
  V1 V2 V3 V4
1  2  1  1  1
2  2  1  2  2
3  1  1  2  2
4  1  2  1  1
5  1  2  1  1
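
Because sample() draws at random, rerunning the line above will generally give different rows. A minimal sketch for rebuilding the exact data frame shown, typed in by hand (the layout of the call is my own choice, not part of the original answer):

```r
# Rebuild the example data frame row by row so the results are reproducible.
# as.data.frame() on an unnamed matrix assigns the column names V1..V4.
df <- as.data.frame(matrix(c(2, 1, 1, 1,
                             2, 1, 2, 2,
                             1, 1, 2, 2,
                             1, 2, 1, 1,
                             1, 2, 1, 1),
                           nrow = 5, byrow = TRUE))
```

Calling set.seed() before sample() would be another way to make the random draw repeatable, although the rows it produces need not match the ones printed here.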

We then loop over the rows with lapply. Each row i is compared with every row of df (including itself) using apply. Rows that differ from row i in at most one column return TRUE and the others return FALSE, producing a logical vector, which we convert to a vector of row indices with which.

lapply(1:nrow(df), function(i)
    # which() wraps apply() so that it sees the whole logical vector
    which(apply(df, 1, function(x) sum(x != df[i,]) <= 1)))

# Produces output like this
[[1]]
[1] 1

[[2]]
[1] 2 3

[[3]]
[1] 2 3

[[4]]
[1] 4 5

[[5]]
[1] 4 5
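
The approach above still compares every row with every other row. For the task as stated in the question (rows equal in all columns except one selected column), a loop-free alternative is to build one key per row and split the row indices by that key. The following is a minimal sketch under that assumption; the helper name group_rows_except is hypothetical, and unlike diff.dataframe it returns each group once, named by its first row index, rather than one entry per row.

```r
# Hypothetical helper (the name is an assumption, not from the question):
# group row indices by the values of every column except `param`.
group_rows_except <- function(param, df) {
  keep <- setdiff(names(df), param)
  # One string key per row; the separator keeps values such as
  # ("1", "12") and ("11", "2") from collapsing into the same key.
  key <- do.call(paste, c(unname(as.list(df[keep])), sep = "\r"))
  # Keep groups in order of first appearance
  key <- factor(key, levels = unique(key))
  groups <- unname(split(seq_len(nrow(df)), key))
  # Name each group by the index of its first row
  names(groups) <- vapply(groups, function(g) as.character(g[1]), character(1))
  groups
}

# The example data frame from the question
conf.new <- data.frame(V1 = c(1, 1, 1, 1, 2),
                       V2 = c(2, 2, 1, 1, 2),
                       V3 = c(1, 2, 1, 2, 1),
                       V4 = c(2, 1, 2, 1, 2))

res <- group_rows_except("V2", conf.new)
# res$`1` is 1 3, res$`2` is 2 4, res$`5` is 5
```

Pasting and splitting are both vectorised, so the work grows roughly with the number of rows instead of with its square. If the per-row layout of the original output is needed, it can be derived from these groups afterwards.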
