I am using the randomForest package in R (R version 2.13.1, randomForest version 4.6-2)

Question


Asked: June 3, 2026


I am using the randomForest package in R (R version 2.13.1, randomForest version 4.6-2) for regression and noticed a significant bias in my results: the prediction error depends on the value of the response variable. High values are underpredicted and low values are overpredicted. At first I suspected this was a consequence of my data, but the following simple example shows that this is inherent to the random forest algorithm:

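library(randomForest)

n = 50
x1 = seq(1, n)                                 # informative predictor: 1, 2, ..., n
x2 = rep(1, n)                                 # constant predictor, carries no information
predictors = data.frame(x1=x1, x2=x2)
response = x2 + x1                             # exactly linear in x1
rf = randomForest(x=predictors, y=response)    # default mtry and nodesize
plot(x1, response)
lines(x1, predict(rf, predictors), col="red")  # forest predictions on the training data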

No doubt tree methods have their limitations when it comes to linearity, but even the simplest regression tree, e.g. tree() in R, does not exhibit this bias. I can’t imagine that the community is unaware of this, but I haven’t found any mention of it. How is it generally corrected for? Thanks for any comments.

EDIT: The example for this question is flawed. Please see “RandomForest for regression in R – response distribution dependent bias” on Stack Exchange for an improved treatment: https://stats.stackexchange.com/questions/28732/randomforest-for-regression-in-r-response-distribution-dependent-bias


1 Answer

Editorial Team · Answer 1 · 2026-06-03T09:38:36+00:00

What you’ve discovered isn’t an inherent bias in random forests, but simply a failure to properly adjust the tuning parameters on the model.

Using your example data:

rf = randomForest(x=predictors, y=response, mtry=2, nodesize=1)
plot(x1, response)
lines(x1, predict(rf, predictors), col="red")


For your real data the improvement is unlikely to be so stark, of course, and I’d bet you’ll get more mileage out of nodesize than mtry there (in this example, mtry did most of the work).
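To see how much each parameter contributes on the toy data, one option is to change them one at a time and compare the training error. The following is only a rough sketch: the seed, the helper rmse, and the list name fits are illustrative choices, not part of the original example, and the exact numbers will vary with the seed.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only so repeated runs give the same numbers

# Same toy data as in the question: x2 is constant, so only x1 is informative.
n <- 50
predictors <- data.frame(x1 = seq_len(n), x2 = rep(1, n))
response <- predictors$x1 + predictors$x2

# Hypothetical helper: root mean squared error on the training data.
rmse <- function(fit) sqrt(mean((predict(fit, predictors) - response)^2))

# Fit with the defaults, with each parameter changed alone, and with both changed.
fits <- list(
  default       = randomForest(x = predictors, y = response),
  mtry_only     = randomForest(x = predictors, y = response, mtry = 2),
  nodesize_only = randomForest(x = predictors, y = response, nodesize = 1),
  both          = randomForest(x = predictors, y = response, mtry = 2, nodesize = 1)
)

errors <- sapply(fits, rmse)
print(round(errors, 2))
```

Comparing the mtry_only and nodesize_only entries against default gives a quick read on which setting matters more for a given data set.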

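Regular trees didn’t exhibit this “bias” because, by default, they search over all variables for the best split. randomForest’s regression default is mtry = max(floor(p/3), 1), which is 1 for the two predictors here, so each split considers a single randomly chosen variable, and when that variable is the constant x2 there is nothing to split on.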
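The effect of restricting the split search can be checked directly. The sketch below assumes the documented regression default mtry = max(floor(p/3), 1) and uses treesize() from randomForest to count terminal nodes per tree. The variable names and the seed are illustrative only.

```r
library(randomForest)

set.seed(42)  # arbitrary seed for repeatability

# Toy data from the question: one informative and one constant predictor.
n <- 50
predictors <- data.frame(x1 = seq_len(n), x2 = rep(1, n))
response <- predictors$x1 + predictors$x2

# Regression default for mtry: a third of the predictors, but at least one.
p <- ncol(predictors)
default_mtry <- max(floor(p / 3), 1)
print(default_mtry)

# One forest with the defaults, one that searches all variables at every split.
rf_default <- randomForest(x = predictors, y = response)
rf_all     <- randomForest(x = predictors, y = response, mtry = p, nodesize = 1)

# Average number of terminal nodes per tree in each forest.
print(c(default  = mean(treesize(rf_default)),
        all_vars = mean(treesize(rf_all))))
```

If the default trees come out much smaller on average, that is consistent with splits being abandoned whenever only the constant x2 is offered as a candidate.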
