I am using the blackboost function from the mboost package to estimate a model on a dataset of roughly 500 MB, on a Windows 7 64-bit machine with 8 GB of RAM. During execution, R uses virtually all available memory. After the calculation is done, more than 4.5 GB remains allocated to R, even after calling the garbage collector with gc() or saving the workspace and reloading it in a new R session. Using .ls.objects (1358003), I found that the size of all visible objects is about 550 MB.
The output of gc() tells me that the bulk of the data is in vector cells (Vcells), although I’m not sure what that means:
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   2856967  152.6    4418719  236.0   3933533  210.1
Vcells 526859527 4019.7  610311178 4656.4 558577920 4261.7
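For reference, here is a rough sketch of how the counts above map to megabytes. It assumes the cell sizes given on R's ?Memory help page: 8 bytes per Vcell, and 56 bytes per Ncell on a 64-bit build. The variable names are only illustrative.

```r
# "used" counts copied from the gc() output above
ncells_used <- 2856967
vcells_used <- 526859527

# Assumed cell sizes (see ?Memory): 56 bytes per Ncell on 64-bit R,
# 8 bytes per Vcell
ncell_bytes <- 56
vcell_bytes <- 8

# Convert to megabytes (1 Mb = 1024^2 bytes)
ncells_mb <- ncells_used * ncell_bytes / 1024^2
vcells_mb <- vcells_used * vcell_bytes / 1024^2

# These should land close to the (Mb) column printed by gc()
print(c(Ncells = ncells_mb, Vcells = vcells_mb))
```

So the roughly 4 GB reported for Vcells is vector data (numeric, integer, character vectors and so on) that R still considers in use, not just memory that has been reserved.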
This is what I’m doing:
> memory.size()
[1] 1443.99
> model <- blackboost(formula, data = mydata[mydata$var == 1,c(dv,ivs)],tree_control=ctree_control(maxdepth = 4))
…a bunch of packages are loaded…
> memory.size()
[1] 4431.85
> print(object.size(model),units="Mb")
25.7 Mb
> memory.profile()
       NULL      symbol    pairlist     closure environment     promise    language
          1       15895      826659       20395        4234       13694      248423
    special     builtin        char     logical     integer      double     complex
        174        1572     1197774       34286       84631       42071          28
  character         ...         any        list  expression    bytecode externalptr
     228592           1           0       79877           1       51276        2182
    weakref         raw          S4
        413         417        4385
mydata[mydata$var == 1,c(dv,ivs)] has 139593 rows and 75 columns, consisting mostly of factor variables plus some logical and numeric variables. formula is a formula object of the form “dv ~ var2 + var3 + … + var73”. dv is a string holding the name of the dependent variable, and ivs is a character vector holding the names of all independent variables var2 … var74.
Why is so much memory being allocated to R? How can I make R free up the extra memory? Any thoughts appreciated!
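One thing that may be relevant here: as far as I understand the documentation, object.size does not count the environment attached to a function. Below is a minimal, self-contained sketch of that effect. It does not use mboost at all; make_holder and big are made-up names for illustration.

```r
# Hypothetical helper: returns a closure that keeps a large vector alive
make_holder <- function() {
  big <- numeric(1e6)        # about 8 MB of doubles
  function() length(big)     # the closure's environment still references `big`
}

holder <- make_holder()

# Size reported for the closure itself (its environment is not included)
closure_size <- as.numeric(object.size(holder))

# Size of the vector that the closure keeps alive through its environment
captured_size <- as.numeric(object.size(environment(holder)$big))

print(object.size(holder), units = "Kb")
print(object.size(environment(holder)$big), units = "Mb")
```

The closure is reported as only a few kilobytes, while the vector it keeps alive is about 8 MB. If a model object holds references of this kind, object.size(model) could be far smaller than the memory R actually has to keep.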
I have talked to one of the package authors, who told me that much of the data associated with the model object is stored in environments, which explains why object.size does not reflect the full memory usage caused by the blackboost function. He also told me that the mboost package is aimed at flexibility rather than optimized for speed or memory efficiency, and that all trees are saved, and with them the data, which explains the large amount of data generated (I still find the scale remarkable). He recommended using the gbm package (with which I have not yet been able to replicate my results) or serializing the model, by doing something like this: