I have 1 data.frame named A, there are 5000 columns in it. How can

Asked: June 2, 2026

I have a data.frame named A with 5000 columns. How can I find the columns in this data.frame that are equal to each other?

1 Answer

Editorial Team · Answer 1 · 2026-06-02T10:59:16+00:00

As @John mentioned, there are problems with using duplicated. I would add that transposing the data.frame forces all the data into the same data type before it is even compared with duplicated. As an example, here is a data.frame:

df <- data.frame( a = LETTERS[1:3],
                  b = 1:3,
                  c = as.character(1:3),
                  d = LETTERS[1:3],
                  e = 1:3,
                  f = 1:3)
df
#   a b c d e f
# 1 A 1 1 A 1 1
# 2 B 2 2 B 2 2
# 3 C 3 3 C 3 3

Note that column c looks the same as columns b, e, and f, but is not identical to them because of the different types (character versus integer). The solution suggested by @Jubbles would disregard these differences.
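To make the coercion concrete, here is a minimal sketch that rebuilds the example data.frame and inspects what transposing does to it. It uses base R only; the names classes, transposed and dup are introduced here purely for illustration.

```r
df <- data.frame(a = LETTERS[1:3], b = 1:3, c = as.character(1:3),
                 d = LETTERS[1:3], e = 1:3, f = 1:3)

# Column classes in the original data.frame: b, e and f are integer, c is not
classes <- sapply(df, class)
classes

# t() goes through as.matrix(), which yields a single-type (character) matrix
transposed <- t(df)
typeof(transposed)

# After the coercion, column c can no longer be told apart from column b
dup <- duplicated(transposed)
dup

# identical() on the original columns still sees the type difference
identical(df$b, df$c)
```

In this small example the transposed matrix is of type character, so duplicated flags c as a repeat of b even though the two columns are stored differently.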

Instead, it seems more appropriate to use the identical function on the columns of your data.frame. You can compare every pair of columns using outer:

are.cols.identical <- function(col1, col2) identical(df[,col1], df[,col2])
identical.mat      <- outer(colnames(df), colnames(df),
                            FUN = Vectorize(are.cols.identical))
identical.mat
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# [1,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [3,] FALSE FALSE  TRUE FALSE FALSE FALSE
# [4,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [5,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [6,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
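With many columns the bare [i,j] indices become hard to read. One option, sketched below, is to label the matrix with the column names so that individual pairs can be looked up by name. The helper name same and the variable cols are hypothetical, and the comparison is rebuilt here so the snippet runs on its own.

```r
df <- data.frame(a = LETTERS[1:3], b = 1:3, c = as.character(1:3),
                 d = LETTERS[1:3], e = 1:3, f = 1:3)
cols <- colnames(df)

# Hypothetical helper: compare two columns, selected by name
same <- function(i, j) identical(df[[i]], df[[j]])

# outer() passes whole vectors of names, so the helper must be vectorized
identical.mat <- outer(cols, cols, FUN = Vectorize(same))

# Label rows and columns with the column names
dimnames(identical.mat) <- list(cols, cols)

identical.mat["b", "c"]  # FALSE: integer versus character
identical.mat["b", "f"]  # TRUE: same type and same values
```

Note that outer evaluates every ordered pair, so the number of comparisons grows with the square of the number of columns.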

From here, you can use clustering to identify groups of identical columns (there may be better ways, so if you know one, feel free to comment or even edit my answer).

# hclust() and cutree() are in the stats package, which is loaded by default
distances <- as.dist(!identical.mat)
tree      <- hclust(distances)
cut       <- cutree(tree, h = 0.5)
cut
# [1] 1 2 3 1 2 2

split(colnames(df), cut)
# $`1`
# [1] "a" "d"
# 
# $`2`
# [1] "b" "e" "f"
# 
# $`3`
# [1] "c"

Edit 1: to disregard small differences in floating-point values, one can use

are.cols.identical <- function(col1, col2) isTRUE(all.equal(df[,col1], df[,col2]))
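A short sketch of why the tolerance-based comparison can matter, using two made-up numeric vectors x and y (not taken from the question's data):

```r
# Two vectors that are equal on paper but differ in floating-point representation
x <- c(0.1 + 0.2, 1.5)
y <- c(0.3, 1.5)

identical(x, y)          # FALSE: the bit patterns differ
isTRUE(all.equal(x, y))  # TRUE: the difference is within the default tolerance

# isTRUE() is needed because all.equal() returns a character
# description of the difference, not FALSE, when the values differ
res <- all.equal(c(1, 2), c(1, 3))
is.character(res)        # TRUE
```

Whether a tolerance is appropriate depends on the data: it makes the comparison looser than identical, so columns that differ only slightly will be reported as equal.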

Edit 2: a more efficient method than clustering for grouping the names of identical columns is

cut <- apply(identical.mat, 1, function(x)match(TRUE, x))
split(colnames(df), cut)
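The idea behind this shortcut can be sketched end to end as follows. For each row of the matrix, match(TRUE, x) returns the position of the first column that is identical to it, and because "is identical to" is an equivalence relation, every column in a group ends up with the same position as its label. The names cols, first.match and groups are illustrative, and the final filtering step is an addition not present in the original answer.

```r
df <- data.frame(a = LETTERS[1:3], b = 1:3, c = as.character(1:3),
                 d = LETTERS[1:3], e = 1:3, f = 1:3)
cols <- colnames(df)

# Compare every pair of columns
identical.mat <- outer(cols, cols,
                       FUN = Vectorize(function(i, j) identical(df[[i]], df[[j]])))

# Label each column with the index of the first column identical to it
first.match <- apply(identical.mat, 1, function(x) match(TRUE, x))
first.match
# [1] 1 2 3 1 2 2

# Group the column names by label
groups <- split(cols, first.match)

# Assumed extra step: keep only the groups that contain actual duplicates
groups[lengths(groups) > 1]
# $`1`
# [1] "a" "d"
#
# $`2`
# [1] "b" "e" "f"
```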
