I am using randomForest package in R platform to build a binary classifier. There

Question

Asked: June 16, 2026

I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.

I also have about 5 additional variables with missing information. Each of these variables takes the value 1 or 0: 1 means that something is present, while 0 means that it is not known whether it is present or absent. It is widely known that each of these variables would be the most important variable for classification when its value is 1 (it increases the reliability of the classification and makes it more likely that the sample lies in the positive class), but that it is useless when its value is 0. Only 5% of the rows have the value 1, so each variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect them to be highly useful for 15-25% of the data I have.

Is there a way to make use of the available data while ignoring the missing/unknown values within a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forests or R. Approaches that use other machine learning techniques or other platforms are also most welcome.
Thank you for your time.
Regards

1 Answer

Editorial Team · Answer 1 · 2026-06-16T04:37:23+00:00

I can see at least the following approaches. Personally, I prefer the third option.

1) Discard the extra columns

You can choose to discard those 5 extra columns. Obviously this is not optimal, but its performance gives you a baseline against which to compare the options below.

2) Use the data as it is

In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying “if I’m not sure whether something is present or absent, I’ll treat it as absent”. I know this is obvious, but if you haven’t tried this, you should, to compare it to option 1.
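To compare options 1 and 2 fairly, the same rows can be fed to the same classifier with and without the extra columns. The following is a minimal Python sketch of that setup, not the asker's actual pipeline: the column names in `FLAG_COLS`, the dict-per-row layout, and the `accuracy` helper are all hypothetical choices made for illustration.

```python
# Hypothetical names for the 5 "present / unknown" columns.
FLAG_COLS = ["flag1", "flag2", "flag3", "flag4", "flag5"]

def option1_features(row):
    """Option 1: drop the 5 extra columns entirely."""
    return {k: v for k, v in row.items() if k not in FLAG_COLS}

def option2_features(row):
    """Option 2: keep every column, treating 0 as 'absent'."""
    return dict(row)

def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    hits = sum(1 for r, y in zip(rows, labels) if predict(r) == y)
    return hits / len(rows)
```

Training the same kind of model on `option1_features` and on `option2_features` and scoring both with `accuracy` on held-out rows would show how much the extra columns contribute.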

3) Use separate classifiers

If around 95% of each of those 5 columns is zero, and the columns are roughly independent of each other, then 0.95^5 = 77.38% of the data (roughly 23,200 rows) has zeroes in ALL of those columns. You can train one classifier on those 23,200 rows after removing the 5 columns (since those columns are constant across these rows, they have zero predictive utility there anyway). You can then train a separate classifier on the remaining rows (roughly 6,800), each of which has at least one of those columns set to 1. For these rows, you leave the data as it is.
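The arithmetic behind this split can be checked in a few lines. This sketch assumes exactly the figures from the question (30,000 rows, 5 columns, 5% ones per column) and exact independence between the columns, which real data will only approximate.

```python
n_rows = 30_000   # total rows, from the question
p_one = 0.05      # share of rows with value 1 in any one column
n_flags = 5       # number of extra columns

# Probability that all 5 independent columns are 0 for a given row.
frac_all_zero = (1 - p_one) ** n_flags

rows_all_zero = round(n_rows * frac_all_zero)   # rows for the first classifier
rows_with_flag = n_rows - rows_all_zero         # rows for the second classifier

print(f"all zero: {frac_all_zero:.2%} of rows, about {rows_all_zero}")
print(f"at least one 1: {1 - frac_all_zero:.2%} of rows, about {rows_with_flag}")
```

The share of rows with at least one 1 comes out at about 22.6%, which falls inside the 15-25% range the question expects.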

Then, for a test point, if all those columns are zero you use the first classifier; otherwise you use the second.
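The train-then-route logic of option 3 can be sketched as follows. This is an illustration in Python under stated assumptions, not a definitive implementation: `FLAG_COLS` holds hypothetical column names, rows are plain dicts, and `MajorityClass` is a deliberately trivial stand-in model that would be replaced by a real classifier (for example a random forest) exposing the same `fit`/`predict` shape.

```python
from collections import Counter

# Hypothetical names for the 5 "present / unknown" columns.
FLAG_COLS = ["flag1", "flag2", "flag3", "flag4", "flag5"]

class MajorityClass:
    """Stand-in model: always predicts the most common training label."""
    def fit(self, rows, labels):
        # Assumes at least one training row; raises IndexError otherwise.
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.label_

def all_flags_zero(row):
    return all(row[c] == 0 for c in FLAG_COLS)

def drop_flags(row):
    return {k: v for k, v in row.items() if k not in FLAG_COLS}

def fit_split(rows, labels, make_model=MajorityClass):
    """Train one model on all-zero rows (flags removed), one on the rest."""
    zero = [(drop_flags(r), y) for r, y in zip(rows, labels) if all_flags_zero(r)]
    flagged = [(r, y) for r, y in zip(rows, labels) if not all_flags_zero(r)]
    model_zero = make_model().fit([r for r, _ in zero], [y for _, y in zero])
    model_flag = make_model().fit([r for r, _ in flagged], [y for _, y in flagged])
    return model_zero, model_flag

def predict_split(models, row):
    """Route a test row to the model trained on rows of the same kind."""
    model_zero, model_flag = models
    if all_flags_zero(row):
        return model_zero.predict(drop_flags(row))
    return model_flag.predict(row)
```

Keeping the routing test (`all_flags_zero`) in one function means training and prediction cannot disagree about which classifier a row belongs to.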

Other tips

If the 15 “normal” variables are not binary, make sure you use a classifier which can handle variables on different scales. If you’re not sure, rescale the 15 “normal” variables to lie in the interval [0,1]; you probably won’t lose anything by doing this.
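Rescaling to [0,1] is usually done with min-max scaling. A minimal Python sketch follows; the function names are hypothetical, and it assumes the minimum and maximum are taken from the training data and then reused for test points, so test values outside the training range can fall slightly outside [0,1].

```python
def fit_min_max(train_values):
    """Return the (min, max) of one variable, computed on training data."""
    return min(train_values), max(train_values)

def apply_min_max(value, lo, hi):
    """Map value into [0, 1] using the training min and max."""
    if hi == lo:
        # Constant column: no spread to rescale, so map everything to 0.
        return 0.0
    return (value - lo) / (hi - lo)

# Usage on one variable:
lo, hi = fit_min_max([10, 20, 30])
scaled = [apply_min_max(v, lo, hi) for v in [10, 20, 30]]
```

Tree-based models such as random forests split on thresholds and are generally insensitive to this kind of rescaling, so the step matters mostly if a different kind of classifier is tried.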
