I’m trying to do some data manipulation in R. I have two data frames: one is training data, the other is test data. All the data is categorical and stored as factor variables.
There are some NA’s in the data and I’m trying to convert them to “-1”. When I do it for the training data, things go fine, but not for the test data.
Something changes the values during a loop I run but I can’t figure out what.
Here’s the before:
> class(catTrain1[,"Cat_111"])
[1] "factor"
> class(catTest1[,"Cat_111"])
[1] "factor"
> table(catTrain1[,"Cat_111"])
  1   2
726  25
> table(catTest1[,"Cat_111"])
  0   1   2
  1 503  15
Here’s the loop:
> for(i in 1:ncol(catTrain1)){
+ catTrain1[,i] <- as.factor(as.character(ifelse(is.na(catTrain1[,i]), "-1", catTrain1[,i])))
+ }
> for(i in 1:ncol(catTest1)){
+ catTest1[,i] <- as.factor(as.character(ifelse(is.na(catTest1[,i]), "-1", catTest1[,i])))
+ }
Here’s the after:
> table(catTrain1[,"Cat_111"])
  1   2
726  25
> table(catTest1[,"Cat_111"])
  1   2   3
  1 503  15
I’ve seen the shift up by one with character -> numeric conversions but I can’t figure out why this is happening, especially for just one of the dataframes / loops.
Any suggestions?
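The column names in your first set of calls to table are the levels of the factor. In the second set of calls to table, the column names are the level indexes. ifelse is pulling the indexes, not the levels. The training column only looks unaffected because its levels (1, 2) happen to be the same as their indexes, while the test column's levels (0, 1, 2) have indexes 1, 2, 3. In your loops, move the as.character inward so that it wraps the final catTest1[,i] and catTrain1[,i].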
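
To see this on a small scale, here is a minimal sketch. The vectors x and y, the toy data frame df, and the helper name na_to_minus1 are all made up for illustration; they only mimic the level sets shown in the question, not the real data.

```r
# Made-up stand-ins for the question's columns (not the real data).
x <- factor(c("0", "1", "1", "2", NA))  # levels like catTest1[,"Cat_111"]
y <- factor(c("1", "2", "2", NA))       # levels like catTrain1[,"Cat_111"]

# Original pattern: ifelse() drops the factor class, so the labels are
# replaced by the underlying integer codes.
broken_x <- as.character(ifelse(is.na(x), "-1", x))
broken_y <- as.character(ifelse(is.na(y), "-1", y))
print(broken_x)  # every label is shifted to its level index
print(broken_y)  # looks fine only because the labels equal the indexes

# Fix: convert to character before ifelse() sees the values.
# na_to_minus1 is a hypothetical helper name.
na_to_minus1 <- function(col) {
  as.factor(ifelse(is.na(col), "-1", as.character(col)))
}
fixed_x <- na_to_minus1(x)
print(as.character(fixed_x))  # original labels kept, NA replaced by "-1"

# The same helper applied to every column of a toy data frame.
df <- data.frame(a = x, b = factor(c("u", NA, "v", "v", "u")))
df[] <- lapply(df, na_to_minus1)
```

Assuming every column really is a factor, as stated in the question, something like catTest1[] <- lapply(catTest1, na_to_minus1) could probably stand in for the loop, and the same for catTrain1.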