The objective is to create indicators for a factor/string variable in a data frame. That data frame has more than 2 million rows, and since I’m running R on Windows, I don’t have the option of using plyr with .parallel=T. So I’m taking the “divide and conquer” route with plyr and reshape2.
Running melt and cast runs out of memory, and using
ddply( idata.frame(items) , c("ID") , function(x){
    ( colSums( model.matrix( ~ x$element - 1) ) > 0 )
} , .progress="text" )
or
ddply( idata.frame(items) , c("ID") , function(x){
    ( elements %in% x$element )
} , .progress="text" )
does take a while. The fastest approach so far is the call to tapply below, and the %in% version also runs faster than the model.matrix call. Do you see a way to speed this up? Thanks.
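set.seed(123)
# Small reproducible example: 10 rows, ids 1-5, products a-e
dd <- data.frame(
    id = sample( 1:5, size=10 , replace=TRUE ) ,
    prd = letters[sample( 1:5, size=10 , replace=TRUE )]
)
prds <- unique(dd$prd)
# For each id, flag which of the distinct products occur
tapply( dd$prd , dd$id , function(x) prds %in% x )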
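The tapply call returns one logical vector per id rather than a single table. As a minimal sketch (the names `res` and `ind` are made up here for illustration, they are not part of the original code), the per-id vectors could be stacked into an id-by-product indicator matrix like this:

```r
set.seed(123)
dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)
prds <- as.character(unique(dd$prd))

# One logical vector per id, same idea as the tapply call above
res <- tapply(dd$prd, dd$id, function(x) prds %in% x)

# Stack the per-id vectors row-wise; rows are ids, columns are products
# (res and ind are illustrative names, not from the original post)
ind <- do.call(rbind, res)
rownames(ind) <- names(res)
colnames(ind) <- prds
ind
```

This only reshapes the output; it should not change the cost of the tapply step itself.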
For this problem, the packages bigmemory and bigtabulate might be your friends. Here is a slightly more ambitious example:
And the results of benchmarking (I’m learning new things all the time):
And that shows bigtable winning out by a considerable amount. The results are pretty much that all prds are in all IDs, but see ?bigtable for details on the format of the results.
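For readers who want to see the cross-tabulation idea without installing anything, here is a rough base R sketch using table() on the toy data from the question. This is not the bigtabulate code and makes no claim about its speed or memory use on 2 million rows; it only illustrates that a count above zero in an id-by-prd cross-tabulation is the indicator being asked for. The names `tab` and `ind` are assumptions for this sketch.

```r
set.seed(123)
dd <- data.frame(
  id  = sample(1:5, size = 10, replace = TRUE),
  prd = letters[sample(1:5, size = 10, replace = TRUE)]
)

# Count occurrences of each product within each id
tab <- table(dd$id, dd$prd)

# A count above zero means the product occurs for that id;
# unclass() drops the "table" class so the result is a plain logical matrix
ind <- unclass(tab) > 0
ind
```

Rows are ids and columns are products, so `ind["3", ]` (if id 3 is present) gives the indicator vector for that id.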