Overview Give a large (nrows > 5,000,000+) data frame, A , with string row

Question


Asked: June 13, 2026 (2026-06-13T08:05:26+00:00)


Overview

Given a large data frame, A (more than 5,000,000 rows), with string row names, and a list, B, of disjoint sets (20,000 or more), where each set consists of row names from A, what is the best way to create a vector that labels every row of A with a unique value identifying the set in B it belongs to?

Illustration

Below is an example illustrating this problem:

# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+

The desired result would be:

# An index of NA means that the row is not part of any set in B.
> head(A, 10)
        d index
4655297 A     1
3328423 A     1
2911946 A     2
2829484 A     2
3871770 A    NA
2702914 A    NA
2581677 A    NA
4106410 A    NA
3755846 A    NA
3177816 A     1

Naive Attempt

The desired result can be produced by looping over B and labelling one set at a time:

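n <- 0
A$index <- NA
# For each set, bump the counter and write it into the matching rows of A.
# Each iteration indexes A by row name and updates the global A via <<-.
lapply(B, function(x){
  n <<- n + 1
  A[x, "index"] <<- n
})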

Problem

However, this is unreasonably slow (several hours) because it indexes A once per set, and it is neither idiomatic R nor elegant.

How can the desired result be generated in a quick and efficient manner?


1 Answer

Editorial Team · Answer 1 · 2026-06-13T08:05:27+00:00

Here is a suggestion using base R that performs reasonably well compared to your current method.

Sample data:

A <- data.frame(d   = rep("A", 5e6),
                set = sample(c(NA, 1:20000), 5e6, replace = TRUE),
                row.names = as.character(sample(1:5e6)))
B <- split(rownames(A), A$set)

Base method:

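system.time({
  A$index <- NA
  # One assignment for all sets: unlist(B) gives the row names set by set,
  # and rep() repeats each set's number once per member.
  A[unlist(B), "index"] <- rep(seq_along(B), times = lengths(B))
})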
#    user  system elapsed 
#   15.30    0.19   15.50
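
To make it easier to see why the single assignment lines up, here is a minimal sketch on the ten rows shown in the question. The names `A_small`, `B_small`, `members` and `labels` are made up for this illustration and are not part of the original code.

```r
# Ten rows from the question's illustration; only five of them belong to a set.
ids <- c("4655297", "3328423", "2911946", "2829484", "3871770",
         "2702914", "2581677", "4106410", "3755846", "3177816")
A_small <- data.frame(d = rep("A", length(ids)), row.names = ids)
B_small <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"))

# Flatten the sets and build a parallel vector of set numbers.
members <- unlist(B_small)
labels  <- rep(seq_along(B_small), times = lengths(B_small))

# One vectorised assignment instead of one assignment per set.
A_small$index <- NA_integer_
A_small[members, "index"] <- labels
print(A_small)
```

Because `members` and `labels` have the same length and the same order, position i of `labels` is always the set number of the row named at position i of `members`.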

Check:

identical(A$set, A$index)
# TRUE
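
A possible variant of the base method, sketched here as an assumption rather than something benchmarked in this answer, is to look up each row name with `match()` and build the whole column as a plain vector, so the data frame is never subset-assigned by row name. The small inputs and the name `set_of_member` are hypothetical.

```r
# Small stand-in data (hypothetical); the real A and B are far larger.
A_small <- data.frame(d = rep("A", 6),
                      row.names = c("r1", "r2", "r3", "r4", "r5", "r6"))
B_small <- list(c("r5", "r1"), c("r3"), c("r6", "r2"))

# Set number for every member, in the same order as unlist(B_small).
set_of_member <- rep(seq_along(B_small), times = lengths(B_small))

# match() returns NA for row names that appear in no set,
# and indexing with NA yields NA, which is exactly the desired label.
A_small$index <- set_of_member[match(rownames(A_small), unlist(B_small))]
print(A_small)
```

Whether this is faster than the assignment by row name on 5,000,000 rows would need to be timed on the real data.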

For anything faster, I suppose data.table would come in handy.
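
As a rough sketch of what a data.table version might look like (not timed here, and assuming the data.table package is installed), the sets can be flattened into a lookup table and joined onto A by row name. The column name `id` and the objects `A_dt` and `lookup` are made up for this example.

```r
library(data.table)

# Small stand-in data (hypothetical); the real A and B are far larger.
A_small <- data.frame(d = rep("A", 5),
                      row.names = c("a", "b", "c", "d", "e"))
B_small <- list(c("a", "e"), c("c"))

# data.table has no row names, so move them into a regular column.
A_dt <- as.data.table(A_small, keep.rownames = "id")

# One row per set member, carrying the number of its set.
lookup <- data.table(id    = unlist(B_small),
                     index = rep(seq_along(B_small), times = lengths(B_small)))

# Update join: rows of A_dt that match lookup get its index; the rest stay NA.
A_dt[lookup, on = "id", index := i.index]
print(A_dt)
```

The `i.` prefix refers to the column coming from `lookup`, and `:=` adds the new column to `A_dt` by reference instead of copying the table.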
