I am using the randomForest package in R to build a binary classifier. There are about 30,000 rows, with 14,000 in the positive class and 16,000 in the negative class. I have 15 variables that are known to be important for classification.
I have some additional variables (about 5) which have missing information. These variables take the values 1 or 0. A 1 means that something is present, but a 0 means that it is not known whether it is present or absent. It is widely known that these variables would be the most important variables for classification if the value is 1 (they increase the reliability of classification, and it is more likely that the sample lies in the positive class), but useless if the value is 0. Only 5% of the rows have the value 1 in any given variable, so one variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect that they will be highly useful for 15-25% of the data I have.
Is there a way to make use of the available data but ignore the missing/unknown data within a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forest or R. If this is possible using other machine learning techniques or on other platforms, those are also most welcome.
Thank you for your time.
Regards
I can see at least the following approaches. Personally, I prefer the third option.
1) Discard the extra columns
You can choose to discard those 5 extra columns. Obviously this is not optimal, but it is good to know how this option performs, as a baseline to compare the following options against.
2) Use the data as it is
In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying “if I’m not sure whether something is present or absent, I’ll treat it as absent”. I know this is obvious, but if you haven’t tried this, you should, to compare it to option 1.
3) Use separate classifiers
If around 95% of each of those 5 columns is zeroes, and the columns are roughly independent of each other, then 0.95^5 = 77.38% of the data (roughly 23200 rows) has zeroes in ALL of those columns. You can train a classifier on those 23200 rows, removing the 5 columns which are all zeroes (since those columns are constant across these rows, they have zero predictive utility anyway). You can then train a separate classifier on the remaining rows (about 22.6% of the data, roughly 6800 rows), which have at least one of those columns set to 1. For these rows, you leave the data as it is.
Then, for your test point, if all those columns are zeroes you use the first classifier, otherwise you use the second.
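The routing described in option 3 can be sketched as follows. This is only an illustration in plain Python, not a tested recipe: the `extra` column indices, the helper names, and the `MajorityClassifier` stub are all assumptions made for the example. The stub stands in for whatever real model you use (randomForest or anything else with a fit/predict pair), and the sketch assumes both subsets of the training data are non-empty.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical placeholder model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def all_unknown(row, extra):
    """True when every sparse column in this row is 0 (presence unknown)."""
    return all(row[i] == 0 for i in extra)


def drop_extra(row, extra):
    """Copy of the row without the sparse columns."""
    skip = set(extra)
    return [v for i, v in enumerate(row) if i not in skip]


def fit_split(X, y, extra, make_model=MajorityClassifier):
    """Train one model on all-zero rows (sparse columns removed) and one on the rest."""
    zero = [(drop_extra(r, extra), t) for r, t in zip(X, y) if all_unknown(r, extra)]
    flagged = [(r, t) for r, t in zip(X, y) if not all_unknown(r, extra)]
    model_zero = make_model().fit([r for r, _ in zero], [t for _, t in zero])
    model_flag = make_model().fit([r for r, _ in flagged], [t for _, t in flagged])
    return model_zero, model_flag


def predict_split(models, X, extra):
    """Send each row to the model that matches its pattern of sparse columns."""
    model_zero, model_flag = models
    out = []
    for r in X:
        if all_unknown(r, extra):
            out.append(model_zero.predict([drop_extra(r, extra)])[0])
        else:
            out.append(model_flag.predict([r])[0])
    return out


# Expected share of rows with all five sparse columns at 0, assuming independence.
share_all_zero = 0.95 ** 5
```

To try this with a real classifier, pass a different `make_model` factory whose objects expose the same `fit` and `predict` methods. Comparing the combined accuracy against options 1 and 2 on held-out data will show whether the split is worth the extra complexity.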
Other tips
If the 15 “normal” variables are not binary, make sure you use a classifier that can handle variables on different scales. Tree-based methods such as random forest split on one variable at a time, so rescaling a variable does not affect them. If you’re not sure about your classifier, normalize the 15 “normal” variables to lie in the interval [0,1]; you probably won’t lose anything by doing this.
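For the normalization tip, a minimal min-max scaling sketch in plain Python could look like the following. The function names and the list-of-lists data layout are assumptions for illustration. The bounds are taken from the training rows only and then reused for test rows, so a test value outside the training range can fall slightly outside [0,1].

```python
def fit_min_max(rows, columns):
    """Record the training minimum and maximum of each listed column."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}


def apply_min_max(rows, bounds):
    """Return copies of the rows with the listed columns rescaled using the stored bounds."""
    scaled = []
    for r in rows:
        new = list(r)
        for c, (lo, hi) in bounds.items():
            # A constant column has hi == lo; map it to 0.0 to avoid dividing by zero.
            new[c] = (r[c] - lo) / (hi - lo) if hi > lo else 0.0
        scaled.append(new)
    return scaled
```

Only the non-binary columns need to be listed in `columns`; the 0/1 columns already lie in the target interval.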