I’m trying to use randomForest to predict sales. I have three predictors, one of which is a factor variable, storeId. I know that there are levels in the test set that are NOT in the training set. I want predictions only for the levels present in the training set, but predict keeps failing on the new factor levels.
Here’s what I’ve tried so far:
require(randomForest)
train <- data.frame(sales = runif(10)*1000, storeId = factor(seq(1,10,1)), dat1 =runif(10), dat2 = runif(10)*10)
test <- data.frame(storeId = factor(seq(2,11,1)), dat1 =runif(10), dat2 = runif(10)*10)
> train
sales storeId dat1 dat2
1 414.7791 1 0.7830092 7.178577
2 719.5965 2 0.9512138 6.153049
3 887.3197 3 0.6879827 5.413556
4 706.5828 4 0.4486214 4.955400
5 326.8189 5 0.0944885 6.900802
6 840.5920 6 0.1917165 8.044636
7 936.2206 7 0.2173074 4.835064
8 244.6947 8 0.6526765 6.516790
9 818.8747 9 0.3317644 9.651675
10 631.6104 10 0.6998037 8.443972
> test
storeId dat1 dat2
1 2 0.7513645 3.442052
2 3 0.2862487 3.196189
3 4 0.4971865 6.074281
4 5 0.8631945 8.766129
5 6 0.3848105 5.001426
6 7 0.9032262 7.018274
7 8 0.1560501 4.523618
8 9 0.3461597 5.551672
9 10 0.1318464 3.092640
10 11 0.6587270 1.348623
> RF1 <- randomForest(train[,c("storeId","dat1","dat2")], train$sales, do.trace=TRUE,
+ importance=TRUE, ntree=5, keep.forest=TRUE)
| Out-of-bag |
Tree | MSE %Var(y) |
1 | 2.915e+05 544.44 |
2 | 1.825e+05 340.84 |
3 | 2.1e+05 392.19 |
4 | 1.914e+05 357.38 |
5 | 1.809e+05 337.78 |
> pred <- predict(RF1, test)
Error in predict.randomForest(RF1, test) :
New factor levels not present in the training data
This part makes sense.
So I try this:
> test2 <- test[test$storeId != 11,]
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
New factor levels not present in the training data
So I check the levels:
> levels(test2$storeId)
[1] "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
And the “11” level is still in there.
Next I try this:
> test2$storeId <- as.numeric(as.character(test2$storeId))
> test2$storeId <- factor(test2$storeId)
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
Type of predictors in new data do not match that of the training data.
despite the fact that things look ok here:
> levels(test2$storeId)
[1] "2" "3" "4" "5" "6" "7" "8" "9" "10"
Any suggestions for getting it to predict on just stores without the “11” level?
EDIT:
> test2$storeId <- as.factor(as.character(test2$storeId))
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
Type of predictors in new data do not match that of the training data.
>
> test2$storeId <- drop.levels(test2$storeId)
> pred <- predict(RF1, test2)
Error in predict.randomForest(RF1, test2) :
Type of predictors in new data do not match that of the training data.
> str(train)
'data.frame': 10 obs. of 4 variables:
$ sales : num 800 679 589 812 384 ...
$ storeId: Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10
$ dat1 : num 0.5148 0.5567 0.9871 0.0071 0.736 ...
$ dat2 : num 8.501 2.994 2.948 0.519 1.746 ...
> str(test)
'data.frame': 10 obs. of 3 variables:
$ storeId: Factor w/ 10 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9 10
$ dat1 : num 0.0975 0.7435 0.7055 0.2085 0.2944 ...
$ dat2 : num 5.96 6.84 3.96 8.93 8.62 ...
> str(test2)
'data.frame': 9 obs. of 3 variables:
$ storeId: Factor w/ 9 levels "2","3","4","5",..: 1 2 3 4 5 6 7 8 9
$ dat1 : num 0.0975 0.7435 0.7055 0.2085 0.2944 ...
$ dat2 : num 5.96 6.84 3.96 8.93 8.62 ...
You cannot run predict on a randomForest model with newdata whose factor levels differ from the levels the model was trained on. The levels of test$storeId run from “2” to “11”, while those of train$storeId run from “1” to “10”. So even after you drop level “11” from the test data, level “1” is still missing and predict.randomForest fails. The factor in the test data needs exactly the same set of levels as the factor in the training data.
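One way to line the levels up is sketched below. This is an illustration rather than the only fix: it drops the rows whose store was never seen in training, then rebuilds storeId using the training levels. The names seen and test2 and the set.seed call are my own choices for the example, and the predict call is guarded so the snippet still runs where randomForest is not installed.

```r
# Rebuild example data with the same shape as in the question.
set.seed(1)  # assumption: seed added only to make the sketch reproducible
train <- data.frame(sales   = runif(10) * 1000,
                    storeId = factor(1:10),
                    dat1    = runif(10),
                    dat2    = runif(10) * 10)
test <- data.frame(storeId = factor(2:11),
                   dat1    = runif(10),
                   dat2    = runif(10) * 10)

# Keep only the rows whose store was present in the training data.
seen  <- as.character(test$storeId) %in% levels(train$storeId)
test2 <- test[seen, ]

# Re-create the factor with exactly the training levels, in the same order.
# Level "1" has no rows in test2, but it must still exist as a level.
test2$storeId <- factor(as.character(test2$storeId),
                        levels = levels(train$storeId))

# Check that both factors now share one set of levels.
identical(levels(test2$storeId), levels(train$storeId))

# Fit and predict only if the package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  RF1 <- randomForest::randomForest(train[, c("storeId", "dat1", "dat2")],
                                    train$sales, ntree = 5)
  pred <- predict(RF1, test2)
}
```

The important step is passing levels = levels(train$storeId) to factor(). Functions such as droplevels() only remove unused levels, so they can never add the missing level “1” back.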