I’m running out of memory on a normal 8GB server working with a fairly


Asked: May 23, 2026 (2026-05-23T11:01:39+00:00)

I’m running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:

> dim(basetrainf) # this is a dataframe
[1] 58168   118

The only pre-modeling step I take that significantly increases memory consumption is converting a data frame to a model matrix. This is because caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can’t use that.)

> lsos()
                 Type      Size PrettySize   Rows Columns
mergem         matrix 879205616   838.5 Mb 115562     943
trainf     data.frame  80613120    76.9 Mb 106944     119
inttrainf      matrix  76642176    73.1 Mb    907   10387
mergef     data.frame  58264784    55.6 Mb 115562      75
dfbase     data.frame  48031936    45.8 Mb  54555     115
basetrainf data.frame  40369328    38.5 Mb  58168     118
df2        data.frame  34276128    32.7 Mb  54555     103
tf         data.frame  33182272    31.6 Mb  54555      98
m.gbm           train  20417696    19.5 Mb     16      NA
res.glmnet       list  14263256    13.6 Mb      4      NA

Also, since many R models don’t support example weights, I had to first oversample the minority class, doubling the size of my dataset (which is why trainf, mergef, and mergem have twice as many rows as basetrainf).

R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.
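As a rough check on the listing above, the size of a dense numeric matrix follows directly from its dimensions, at 8 bytes per double. The sketch below uses only base R; the helper name dense.bytes is hypothetical, and the small test matrix is a stand-in rather than the real data. The estimate for mergem comes out slightly below the 838.5 Mb reported by lsos(), with the difference presumably taken up by attributes such as dimnames.

```r
# Estimate the memory footprint of a dense numeric matrix from its dimensions.
# Each double takes 8 bytes; attributes such as dimnames add more on top.
# dense.bytes is a hypothetical helper, not part of any package.
dense.bytes = function(rows, cols) rows * cols * 8

# Dimensions of mergem as reported by lsos() above.
mergem.bytes = dense.bytes(115562, 943)
mergem.mb = mergem.bytes / 2^20
print(round(mergem.mb, 1))  # about 831.4

# Compare the estimate with the measured size of a small real matrix.
small = matrix(0, nrow=1000, ncol=50)
print(dense.bytes(1000, 50))
print(object.size(small))
```

Any step that copies this matrix, for example subsetting its rows inside a modeling call, needs roughly the same amount of memory again for each copy.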

The next thing I do is:

> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')

Bam – in a few seconds, the Linux out-of-memory killer kills rsession.

I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?

FWIW, I’ve never run into this problem with other ML software (Weka, Orange, etc.), even without pruning out any of my factors, perhaps because of both example weighting and “data frame” support, across all models.

Complete script follows:

library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)

response = 'class'

repr = 'dummy'
do.impute = F

xmode = function(xs) names(which.max(table(xs)))

read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')

  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names, na.strings='?', colClasses=classes, comment.char='')

  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair) !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]

  df
}

train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}

# From [...]
sparse.cor = function(x){
  memory.limit(size=10000)
  n [...]

[...]

print('remove factors with > 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x) > 200))
# logical indexing keeps all columns when no factor is flagged
mergef = mergef[, !badfactors]

print('remove near-zero variance predictors')
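nzv = nearZeroVar(mergef)
# guard against an empty result, which would otherwise drop every column
if (length(nzv) > 0) mergef = mergef[, -nzv]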

print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') sparse.model.matrix else model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}

print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
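highcor = findCorrelation(merge.cor, cutoff=.75)
# guard against an empty result, which would otherwise drop every column
if (length(highcor) > 0) mergem = mergem[, -highcor]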

print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}
res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')

1 Answer

Editorial Team · Answer 1 · 2026-05-23T11:01:39+00:00

Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.

Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can’t fit one directly, you won’t be able to use caret to fit many in one go.
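A minimal sketch of such a by-hand fit, assuming the randomForest package is installed. The data here is a small synthetic stand-in (x, y, and the sizes are made up for illustration), so swap in the real training rows to test actual memory use. keep.forest and importance are arguments of randomForest(); the values of ntree and mtry are arbitrary choices, not recommendations.

```r
library(randomForest)

# Synthetic stand-in data; replace x and y with the real training rows.
set.seed(1)
n = 500
x = matrix(rnorm(n * 20), nrow=n, ncol=20)
colnames(x) = paste0('v', seq_len(ncol(x)))
y = factor(ifelse(x[, 1] + rnorm(n) > 0, 'pos', 'neg'))

# Fit one forest directly, with a modest number of trees and no extras:
# keep.forest=FALSE discards the trees after fitting (so predict() on new
# data will not work), and importance=FALSE skips the permutation-based
# importance measures.
fit = randomForest(x, y, ntree=100, mtry=4,
                   keep.forest=FALSE, importance=FALSE)

print(fit)                           # out-of-bag error and confusion matrix
print(object.size(fit), units='Mb')  # size of the fitted object
```

If this single fit already exhausts memory on the real data, no amount of tuning in caret will help, and the data or the forest settings have to shrink first.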

At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting to prevent it. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest, as I mentioned earlier, but also the variable importance measures). Once you’ve determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.
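One way to cut down the work caret does is sketched below, assuming the caret and randomForest packages are installed. By default train() tries three values of mtry and evaluates each on 25 bootstrap resamples, so many forests are fitted. Here a single mtry value is supplied through tuneGrid, and out-of-bag error replaces resampling. The data is a synthetic stand-in and the values of mtry and ntree are arbitrary. The forest itself is left in place, since caret needs it for later predictions.

```r
library(caret)

# Synthetic stand-in data; replace x and y with the real training rows.
set.seed(1)
n = 300
x = matrix(rnorm(n * 10), nrow=n, ncol=10)
colnames(x) = paste0('v', seq_len(ncol(x)))
y = factor(ifelse(x[, 1] + rnorm(n) > 0, 'pos', 'neg'))

# Use the forest's out-of-bag error rather than bootstrap resampling,
# and evaluate a single mtry value instead of a grid of candidates.
ctrl = trainControl(method='oob')
grid = data.frame(mtry=3)

# Extra arguments such as ntree are passed through train() to randomForest().
m = train(x, y, method='rf', trControl=ctrl, tuneGrid=grid, ntree=100)

print(m$results)           # one row, for the single mtry value tried
print(m$finalModel$ntree)  # number of trees in the final forest
```

Once this runs within memory, the grid can be widened one value at a time to see where the limit lies.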
