Let's say I have the following data in R.
training = factor(c(1,1,3,2,1,3,2,34,67,34))
test = factor(c(1,1,2,30,65,30))
(My data is much more complicated; this is a simplification.)
I want to check whether the levels in the test set exist in the training set and, if not, replace them with the nearest value in the training set.
For example, the levels 30 and 65 in the test set do not exist in the training set, so I want to replace them with 34 and 67 respectively.
So far I have written the following code:
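replacefactor <- function(dat, new_factor, near_factor) {
  # Add the replacement level if the factor does not have it yet
  if (!(near_factor %in% levels(dat))) {
    levels(dat) <- c(levels(dat), near_factor)
  }
  dat[dat == new_factor] <- near_factor
  # Drop the now-unused level and return the result
  factor(dat)
}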
test <- replacefactor(test,30,34)
test <- replacefactor(test,65,67)
It works, but I need to specify the levels by hand, which is not practical given the size of my data.
I am not sure how to find the nearest value in the training set.
Once I can do that, I could use a for loop to automate the replacement.
First, get the levels that aren’t matched:
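The code for this step is not shown here, so the following is only a sketch of one way to do it, using the example data from the question. The name `unmatched` is an assumption made for illustration.

```r
training <- factor(c(1, 1, 3, 2, 1, 3, 2, 34, 67, 34))
test <- factor(c(1, 1, 2, 30, 65, 30))

# Levels that occur in test but not in training (compared as character labels)
unmatched <- setdiff(levels(test), levels(training))
print(unmatched)
```

`setdiff()` compares the level labels as strings, so this also works for factors whose levels are not numbers.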
Then write a function that runs along them and finds the nearest match:
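A possible version of such a function is sketched below. The name `nearest_level` and its arguments are hypothetical, and it assumes that every level label can be converted to a number, with "nearest" meaning the smallest absolute difference.

```r
# For each value in new_levels, return the closest value in known_levels.
# Assumption: all labels are numeric strings such as "30" or "67".
nearest_level <- function(new_levels, known_levels) {
  known <- as.numeric(known_levels)
  vapply(
    as.numeric(new_levels),
    function(v) known[which.min(abs(known - v))],
    numeric(1)
  )
}

# Example call with the levels from the question
print(nearest_level(c("30", "65"), c("1", "2", "3", "34", "67")))
```

When a value is equally close to two known levels, `which.min()` returns the first minimum, so the level that comes first in `known_levels` is chosen.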
The result can then be assigned to the corresponding levels:
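As a sketch of the assignment step, the snippet below repeats the example data so that it runs on its own. It again assumes numeric level labels, and the names `train_vals`, `test_vals` and `nearest` are made up for illustration.

```r
training <- factor(c(1, 1, 3, 2, 1, 3, 2, 34, 67, 34))
test <- factor(c(1, 1, 2, 30, 65, 30))

train_vals <- as.numeric(levels(training))
test_vals <- as.numeric(levels(test))

# For every test level, pick the training level at the smallest absolute distance.
# Levels that already exist in training map to themselves (distance 0).
nearest <- sapply(test_vals, function(v) train_vals[which.min(abs(train_vals - v))])

# Relabel the test levels; duplicated labels are merged by `levels<-`
levels(test) <- as.character(nearest)

# Give test the same level set as training
test <- factor(test, levels = levels(training))
print(test)
```

Relabelling the levels rather than the individual observations avoids the loop over values, since each distinct level is handled once regardless of how many rows the data has.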