I am using random forests in a big data problem, which has a very

Question

Asked: May 27, 2026 (2026-05-27T19:50:58+00:00)

I am using random forests on a big data problem with a heavily imbalanced response class. I read the documentation and found the following parameters:

strata 

sampsize

The documentation for these parameters is sparse (or I didn't have the luck to find it), and I really don't understand how to use them. I am using the following code:

randomForest(x=predictors, 
             y=response, 
             data=train.data, 
             mtry=lista.params[1], 
             ntree=lista.params[2], 
             na.action=na.omit, 
             nodesize=lista.params[3], 
             maxnodes=lista.params[4],
             sampsize=c(250000,2000), 
             do.trace=100, 
             importance=TRUE)

The response is a class with two possible values; the first appears far more frequently than the second (10000:1 or more).

lista.params is a list holding the different parameter values (duh! I know…).

Well, the question (again) is: how can I use the 'strata' parameter? Am I using sampsize correctly?

And finally, sometimes I get the following error:

Error in randomForest.default(x = predictors, y = response, data = train.data,  :
  Still have fewer than two classes in the in-bag sample after 10 attempts.

Sorry if I am asking so many (and maybe stupid) questions…

1 Answer

Editorial Team · Answer 1 · 2026-05-27T19:50:59+00:00

You should try sampling methods that reduce the degree of imbalance from 1:10,000 down to 1:100 or 1:10. You should also reduce the size of the trees that are generated. (For now I am repeating these recommendations from memory, but I will see if I can track down more authority than my spongy cortex.)
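As a rough illustration of the first suggestion, here is a minimal base R sketch that keeps every rare case and draws a fixed number of common cases per rare case. The toy data, the class labels "common" and "rare", and the helper name downsample_majority are all made up for the example; they are not from the question.

```r
# Hypothetical toy data: a two-level factor response with a rare class.
set.seed(42)
n <- 100000
toy_response <- factor(ifelse(runif(n) < 0.001, "rare", "common"),
                       levels = c("common", "rare"))
toy_predictors <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Hypothetical helper: keep all minority rows and at most `ratio`
# majority rows per minority row. Returns row indices.
downsample_majority <- function(y, minority, ratio = 10) {
  minority_idx <- which(y == minority)
  majority_idx <- which(y != minority)
  n_keep <- min(length(majority_idx), ratio * length(minority_idx))
  # sample.int() avoids the sample(x) surprise when x has length one
  keep_majority <- majority_idx[sample.int(length(majority_idx), n_keep)]
  sort(c(minority_idx, keep_majority))
}

idx <- downsample_majority(toy_response, minority = "rare", ratio = 10)
balanced_y <- toy_response[idx]
balanced_x <- toy_predictors[idx, , drop = FALSE]

# Class counts after downsampling: roughly 10 common cases per rare case.
table(balanced_y)
```

Downsampling before fitting throws away majority rows for every tree, so it is a blunter tool than per-tree stratified sampling; it is shown here only because it is the simplest way to see the target ratio.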

One way of reducing the size of the trees is to set "nodesize" larger. With that degree of imbalance you might need a really large node size, say 5,000 to 10,000. Here's a thread on r-help:
https://stat.ethz.ch/pipermail/r-help/2011-September/289288.html
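To see the effect of nodesize on tree size, the sketch below fits two small forests on made-up data and compares the average number of terminal nodes using treesize() from the randomForest package. The data and the value nodesize = 500 are assumptions chosen for a 5,000-row toy set, not a recommendation for the asker's data.

```r
library(randomForest)

# Hypothetical toy data, not the asker's data set.
set.seed(7)
n <- 5000
toy_x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy_y <- factor(ifelse(toy_x$x1 + rnorm(n) > 2, "rare", "common"),
                levels = c("common", "rare"))

# Same data and number of trees; only nodesize differs.
fit_default <- randomForest(x = toy_x, y = toy_y, ntree = 25)
fit_large   <- randomForest(x = toy_x, y = toy_y, ntree = 25, nodesize = 500)

# treesize() returns the number of terminal nodes of each tree.
mean_default <- mean(treesize(fit_default))
mean_large   <- mean(treesize(fit_large))
c(default = mean_default, large_nodesize = mean_large)
```

The forest with the larger nodesize should report far fewer terminal nodes per tree, which is the "smaller trees" effect described above.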

As the question currently stands you have sampsize=c(250000,2000), whereas I would have thought that something like sampsize=c(8000,2000) was more in line with my suggestions. I suspect you are creating samples that contain none of the group that is sampled with only 2000.
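For what it's worth, here is a minimal sketch of how strata and sampsize can be combined, on simulated data rather than the asker's. The class labels, the 4:1 per-tree ratio (mirroring c(8000,2000) at a smaller scale), and nodesize = 20 are all assumptions for illustration. As far as I understand the package, the elements of sampsize are matched to the levels of the strata factor in order, and each element must not exceed the number of cases in its stratum.

```r
library(randomForest)

# Simulated two-class data with a 99:1 imbalance (hypothetical).
set.seed(1)
n_common <- 19800
n_rare <- 200
sim_y <- factor(rep(c("common", "rare"), times = c(n_common, n_rare)),
                levels = c("common", "rare"))
sim_x <- data.frame(x1 = rnorm(n_common + n_rare) + 1.5 * (sim_y == "rare"),
                    x2 = rnorm(n_common + n_rare))

# Each tree is grown on 400 "common" and 100 "rare" cases.
# sampsize follows the order of levels(sim_y): common first, then rare.
fit <- randomForest(x = sim_x, y = sim_y,
                    strata = sim_y,
                    sampsize = c(400, 100),
                    nodesize = 20,   # assumed value, tune for real data
                    ntree = 50)

# Out-of-bag confusion matrix with per-class error rates.
fit$confusion
```

Because every tree is guaranteed cases from both strata, this kind of call should not hit the "fewer than two classes in the in-bag sample" situation, as long as both sampsize entries are positive and no larger than their class counts.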
