Overview
Given a large data frame, A (more than 5,000,000 rows), with string row names, and a list, B, of 20,000+ disjoint sets, where each set consists of row names from A, what is the best way to create a vector that labels each row of A with a unique value identifying the set in B it belongs to?
Illustration
Below is an example illustrating this problem:
# Input
A <- data.frame(d = rep("A", 5e6), row.names = as.character(sample(1:5e6)))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"), ...) # Size 20,000+
The desired result would be:
# An index of NA represents that the row is not part of any set in B.
> head(A, 10)
d index
4655297 A 1
3328423 A 1
2911946 A 2
2829484 A 2
3871770 A NA
2702914 A NA
2581677 A NA
4106410 A NA
3755846 A NA
3177816 A 1
Naive Attempt
Something like this can be achieved using the following method.
n <- 0
A$index <- NA
lapply(B, function(x){
n <<- n + 1
A[x, "index"] <<- n
})
Problem
However, this is unreasonably slow (several hours) because A is indexed by row name once for every set in B, and it is not very R-esque or elegant.
How can the desired result be generated in a quick and efficient manner?
Here is a suggestion using base R that isn’t too bad when compared to your current method.
Sample data:
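A minimal sketch of sample data with the same shape as in the question, scaled down so it runs quickly. The seed, the sizes, and the fixed set size of 5 are assumptions chosen for illustration, not values from the original post.

```r
set.seed(1)       # assumed seed, only for reproducibility
n_rows   <- 1e5   # scaled down from the 5e6 rows in the question
n_sets   <- 2000  # scaled down from 20,000+ sets
set_size <- 5     # assumed: every set has the same size

A <- data.frame(d = rep("A", n_rows),
                row.names = as.character(sample(n_rows)))

# Draw distinct row names, then cut them into disjoint sets.
members <- sample(rownames(A), n_sets * set_size)
B <- unname(split(members, rep(seq_len(n_sets), each = set_size)))

str(B[1:2])
```

Because `sample()` draws without replacement by default, no row name can land in two sets, so the sets are disjoint by construction.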
Base method:
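One possible base R approach, sketched under the assumption that the sets are disjoint as stated in the question: flatten B once, build one label per member, and look up every row name with a single `match()` call instead of indexing A once per set. The helper name `index_sets` is hypothetical, and the tiny data frame reuses row names from the question's illustration.

```r
# Hypothetical helper: returns one set label per row name (NA if in no set).
index_sets <- function(row_names, sets) {
  # Set i contributes lengths(sets)[i] copies of the label i.
  labels <- rep(seq_along(sets), times = lengths(sets))
  # Single lookup of all row names among the flattened set members.
  labels[match(row_names, unlist(sets, use.names = FALSE))]
}

# Tiny example based on the illustration in the question
A <- data.frame(d = rep("A", 6),
                row.names = c("4655297", "3328423", "2911946",
                              "2829484", "3871770", "3177816"))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"))

A$index <- index_sets(rownames(A), B)
A
```

The work is done in one vectorised pass over the row names, so the cost no longer grows with the number of separate assignments into A.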
Check:
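One way to check a vectorised result is to compare it against the per-set loop from the question on a small input. The sketch below uses assumed sizes (1,000 rows, 100 sets of 3) and rebuilds both results from scratch, so it does not depend on any objects defined elsewhere.

```r
set.seed(42)  # assumed seed, only for reproducibility
A <- data.frame(d = rep("A", 1000), row.names = as.character(sample(1000)))
B <- unname(split(sample(rownames(A), 300), rep(1:100, each = 3)))

# Reference result: assign the set number one set at a time.
naive <- rep(NA_integer_, nrow(A))
names(naive) <- rownames(A)
for (i in seq_along(B)) naive[B[[i]]] <- i

# Vectorised result: one match() over all set members.
fast <- rep(seq_along(B), lengths(B))[match(rownames(A), unlist(B))]

identical(unname(naive), fast)  # should be TRUE if both approaches agree
table(is.na(fast))              # rows outside every set vs. rows in a set
```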
For anything faster, I suppose data.table will come in handy.
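A rough sketch of what a data.table version could look like, assuming the data.table package is installed. It uses an update join on a long lookup table; the column names `id` and `index` and the object names `lookup` and `dt` are chosen here for illustration. No timings are claimed, so benchmark on your own data before relying on it.

```r
library(data.table)

# Tiny example based on the illustration in the question
A <- data.frame(d = rep("A", 6),
                row.names = c("4655297", "3328423", "2911946",
                              "2829484", "3871770", "3177816"))
B <- list(c("4655297", "3177816", "3328423"), c("2911946", "2829484"))

# Long lookup table: one row per set member, labelled with its set number.
lookup <- data.table(id = unlist(B),
                     index = rep(seq_along(B), lengths(B)))

# data.table does not keep row names, so move them into an "id" column.
dt <- as.data.table(A, keep.rownames = "id")

# Update join: adds `index` by reference; rows with no match stay NA.
dt[lookup, on = "id", index := i.index]
dt
```

The join modifies `dt` in place rather than copying it, which is usually the main reason to reach for data.table with a table of this size.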