Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6080137
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 23, 20262026-05-23T11:01:39+00:00 2026-05-23T11:01:39+00:00

I’m running out of memory on a normal 8GB server working with a fairly

  • 0

I’m running out of memory on a normal 8GB server working with a fairly small dataset in a machine learning context:

> dim(basetrainf) # this is a dataframe
[1] 58168   118

The only pre-modeling step I take which significantly increases memory consumption is convert a data frame to a model matrix. This is since caret, cor, etc. only work with (model) matrices. Even after removing factors with many levels, the matrix (mergem below) is fairly large. (sparse.model.matrix/Matrix is poorly supported in general, so I can’t use that.)

> lsos()
                 Type      Size PrettySize   Rows Columns
mergem         matrix 879205616   838.5 Mb 115562     943
trainf     data.frame  80613120    76.9 Mb 106944     119
inttrainf      matrix  76642176    73.1 Mb    907   10387
mergef     data.frame  58264784    55.6 Mb 115562      75
dfbase     data.frame  48031936    45.8 Mb  54555     115
basetrainf data.frame  40369328    38.5 Mb  58168     118
df2        data.frame  34276128    32.7 Mb  54555     103
tf         data.frame  33182272    31.6 Mb  54555      98
m.gbm           train  20417696    19.5 Mb     16      NA
res.glmnet       list  14263256    13.6 Mb      4      NA

Also, since many R models don’t support example weights, I had to first oversample the minority class, doubling the size of my dataset (why trainf, mergef, mergem have twice as many rows as basetrainf).

R is at this point using 1.7GB of memory, bringing my total memory usage up to 4.3GB out of 7.7GB.

The next thing I do is:

> m = train(mergem[mergef$istrain,], mergef[mergef$istrain,response], method='rf')

Bam – in a few seconds, the Linux out-of-memory killer kills rsession.

I can sample my data, undersample instead of oversample, etc., but these are non-ideal. What (else) should I do (differently), short of rewriting caret and the various model packages I intend to use?

FWIW, I’ve never run into this problem with other ML software (Weka, Orange, etc.), even without pruning out any of my factors, perhaps because of both example weighting and “data frame” support, across all models.

Complete script follows:

library(caret)
library(Matrix)
library(doMC)
registerDoMC(2)

response = 'class'

repr = 'dummy'
do.impute = F

xmode = function(xs) names(which.max(table(xs)))

read.orng = function(path) {
  # read header
  hdr = strsplit(readLines(path, n=1), '\t')
  pairs = sapply(hdr, function(field) strsplit(field, '#'))
  names = sapply(pairs, function(pair) pair[2])
  classes = sapply(pairs, function(pair)
    if (grepl('C', pair[1])) 'numeric' else 'factor')

  # read data
  dfbase = read.table(path, header=T, sep='\t', quote='', col.names=names, na.strings='?', colClasses=classes, comment.char='')

  # switch response, remove meta columns
  df = dfbase[sapply(pairs, function(pair) !grepl('m', pair[1]) && pair[2] != 'class' || pair[2] == response)]

  df
}

train.and.test = function(x, y, trains, method) {
  m = train(x[trains,], y[trains,], method=method)
  ps = extractPrediction(list(m), testX=x[!trains,], testY=y[!trains,])
  perf = postResample(ps$pred, ps$obs)
  list(m=m, ps=ps, perf=perf)
}

# From 
sparse.cor = function(x){
  memory.limit(size=10000)
  n 200 levels')
badfactors = sapply(mergef, function(x)
  is.factor(x) && (nlevels(x)  200))
mergef = mergef[, -which(badfactors)]

print('remove near-zero variance predictors')
mergef = mergef[, -nearZeroVar(mergef)]

print('create model matrix, making everything numeric')
if (repr == 'dummy') {
  dummies = dummyVars(as.formula(paste(response, '~ .')), mergef)
  mergem = predict(dummies, newdata=mergef)
} else {
  mat = if (repr == 'sparse') model.matrix else sparse.model.matrix
  mergem = mat(as.formula(paste(response, '~ .')), data=mergef)
  # remove intercept column
  mergem = mergem[, -1]
}

print('remove high-correlation predictors')
merge.cor = (if (repr == 'sparse') sparse.cor else cor)(mergem)
mergem = mergem[, -findCorrelation(merge.cor, cutoff=.75)]

print('try a couple of different methods')
do.method = function(method) {
  train.and.test(mergem, mergef[response], mergef$istrain, method)
}
res.gbm = do.method('gbm')
res.glmnet = do.method('glmnet')
res.rf = do.method('parRF')
  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-23T11:01:39+00:00Added an answer on May 23, 2026 at 11:01 am

    Check that the underlying randomForest code is not storing the forest of trees. Perhaps reduce the tuneLength so that fewer values of mtry are being tried.

    Also, I would probably just fit a single random forest by hand to see if I could fit such a model on my machine. If you can’t fit one directly, you won’t be able to use caret to fit many in one go.

    At this point I think you need to work out what is causing the memory to balloon and how you might control the model fitting so it doesn’t balloon out of control. So work out how caret is calling randomForest() and what options it is using. You might be able to turn some of those off (like storing the forest I mentioned earlier, but also the variable importance measures). Once you’ve determined the optimal value for mtry, you can then try to fit the model with all the extras you might want to help interpret the fit.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

link Im having trouble converting the html entites into html characters, (&# 8217;) i
I used javascript for loading a picture on my website depending on which small
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I have a French site that I want to parse, but am running into
I'm parsing an RSS feed that has an ’ in it. SimpleXML turns this
I am currently running into a problem where an element is coming back from
I'm trying to decode HTML entries from here NYTimes.com and I cannot figure out
I'm working with an upstream system that sometimes sends me text destined for HTML/XML
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I have just tried to save a simple *.rtf file with some websites and

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.