The objective is to create indicators for a factor/string variable in a data frame.

Asked: June 1, 2026 (2026-06-01T00:57:20+00:00)

The objective is to create indicators for a factor/string variable in a data frame. The data frame has more than 2 million rows, and because I am running R on Windows, I don’t have the option of using plyr with .parallel=T. So I’m taking the “divide and conquer” route with plyr and reshape2.

Running melt and cast runs out of memory, and using

ddply( idata.frame(items) , c("ID") , function(x){
       (    colSums( model.matrix( ~ x$element - 1) ) > 0   )
} , .progress="text" )

or

ddply( idata.frame(items) , c("ID") , function(x){
           (    elements %in% x$element   )
    } , .progress="text" )

takes a while. Of the two, the %in% version runs faster than the model.matrix version, but the fastest approach I have found so far is the call to tapply below. Do you see a way to speed this up? Thanks.

set.seed(123)

dd <- data.frame(
  id  = sample( 1:5, size=10 , replace=T ) ,
  prd = letters[sample( 1:5, size=10 , replace=T )]
  )

prds <- unique(dd$prd)

tapply( dd$prd , dd$id , function(x) prds %in% x )

Editorial Team · Answer 1 · 2026-06-01T00:57:21+00:00

For this problem, the packages bigmemory and bigtabulate might be your friends. Here is a slightly more ambitious example:

library(bigmemory)
library(bigtabulate)
library(rbenchmark)  # provides benchmark()

set.seed(123)

dd <- data.frame(
  id = sample( 1:15, size=2e6 , replace=T ), 
  prd = letters[sample( 1:15, size=2e6 , replace=T )]
  )

prds <- unique(dd$prd)

benchmark(
bigtable(dd,c(1,2))>0,
table(dd[,1],dd[,2])>0,
xtabs(~id+prd,data=dd)>0,
tapply( dd$prd , dd$id , function(x) prds %in% x )
)

And the results of benchmarking (I’m learning new things all the time):

                                            test replications elapsed relative user.self sys.self user.child sys.child
1                      bigtable(dd, c(1, 2)) > 0          100  54.401 1.000000    51.759    3.817          0         0
2                    table(dd[, 1], dd[, 2]) > 0          100 112.361 2.065422   107.526    6.614          0         0
4 tapply(dd$prd, dd$id, function(x) prds %in% x)          100 178.308 3.277660   166.544   13.275          0         0
3                xtabs(~id + prd, data = dd) > 0          100 229.435 4.217478   217.014   16.660          0         0

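This shows bigtable winning by a considerable margin: table takes about twice as long, tapply over three times as long, and xtabs over four times as long. With only 15 ids and 15 products spread over 2 million rows, the result is pretty much that all prds are in all IDs, but see ?bigtable for details on the format of the results.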
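To see what the thresholded table looks like on data where not every product appears under every id, here is a minimal base-R sketch using the small data set from the question. It assumes only base R; the names ind_table and ind_tapply are made up for illustration. It checks that table(...) > 0 marks the same (id, prd) pairs as the tapply approach once the tapply list is bound into a matrix.

```r
# Toy data from the question: 10 rows, ids drawn from 1..5, products from a..e.
set.seed(123)
dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

# Contingency table of counts, thresholded to presence/absence indicators.
# Rows are ids, columns are products, both in sorted order.
ind_table <- table(dd$id, dd$prd) > 0

# The tapply approach returns one logical vector per id; bind them into a
# matrix whose columns follow the same sorted product order.
prds <- sort(unique(as.character(dd$prd)))
ind_tapply <- do.call(rbind, tapply(dd$prd, dd$id, function(x) prds %in% x))
colnames(ind_tapply) <- prds

# Both approaches should flag exactly the same (id, prd) combinations.
stopifnot(all(ind_table == ind_tapply))

print(ind_table)
```

The same comparison should carry over to bigtable(dd, c(1, 2)) > 0, but that is not checked here because the layout of its result is documented in ?bigtable rather than shown above.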
