A surprising behavior of the otherwise useful data.frame in R comes from its storing character columns as factors. This causes many problems if it is overlooked. For example, consider the following code:
foo=data.frame(name=c("c","a"),value=1:2)
#   name value
# 1    c     1
# 2    a     2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
#   [,1] [,2]
# a    1    4
# b    2    5
# c    3    6
What would you then expect bar[foo$name,] to return? Intuitively, it should return the rows of bar named by foo$name, that is, rows 'c' and 'a'. But the result is different:
bar[foo$name,]
#   [,1] [,2]
# b    2    5
# a    1    4
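The reason is that foo$name is not a character vector but a factor, which is stored internally as integer codes, and indexing with a factor uses those codes rather than the labels.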
foo$name
# [1] c a
# Levels: a c
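To see why the factor picks the wrong rows, it can help to look at the integer codes behind it. The following is a minimal sketch, assuming the column was created as a factor; it passes stringsAsFactors = TRUE explicitly because R 4.0.0 and later no longer convert strings to factors by default. The names codes, by_code and by_label are illustrative only.

```r
# Recreate the objects from the question. stringsAsFactors = TRUE is passed
# explicitly (an assumption) so the factor column appears on any R version.
foo <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = TRUE)
bar <- matrix(1:6, nrow = 3)
rownames(bar) <- c("a", "b", "c")

# The levels are sorted; each element stores an index into the levels.
levels(foo$name)               # "a" "c"
codes <- as.integer(foo$name)  # 2 1: "c" is level 2, "a" is level 1

# Indexing with the factor uses the integer codes, i.e. rows 2 and 1 ...
by_code <- bar[foo$name, ]
# ... whereas indexing with the labels selects the rows named "c" and "a".
by_label <- bar[as.character(foo$name), ]
```

Row 2 of bar happens to be 'b', which is why a row that foo never mentions shows up in the result.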
To get the expected behavior, I manually convert the column to a character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
#   [,1] [,2]
# c    3    6
# a    1    4
But the problem is that it is easy to forget this conversion and end up with hidden bugs in our code. Is there a better solution?
This is a feature, and R is working as documented. It can be dealt with generally in a few ways:
1. Use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame.
2. If you detest this behaviour, set the option globally via options(stringsAsFactors = FALSE).
3. (As noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem, as the object is (in this case) still a character vector and so should work as before.
You can check what the default for stringsAsFactors is in the currently running R process via getOption("stringsAsFactors").
The issue is slightly wider than data.frame() in scope, as it also affects read.table(). In that function, as well as the two options above, you can also tell R what the classes of all the variables are via the argument colClasses, and R will respect that, e.g.
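The per-call approaches described in this answer might look roughly like the sketch below. It is an illustration rather than the answer's original code: the names foo_chr, foo_asis, csv_text and tmp are made up for the example, and the small CSV input is invented data. The global option is left out because it changes session-wide behaviour.

```r
bar <- matrix(1:6, nrow = 3)
rownames(bar) <- c("a", "b", "c")

# Keep strings as character when building the data frame.
foo_chr <- data.frame(name = c("c", "a"), value = 1:2,
                      stringsAsFactors = FALSE)
class(foo_chr$name)          # "character"
bar[foo_chr$name, ]          # selects the rows named "c" and "a"

# Protect the column with I(); it gains the class "AsIs" and is not
# converted to a factor even when stringsAsFactors = TRUE.
foo_asis <- data.frame(name = I(c("c", "a")), value = 1:2,
                       stringsAsFactors = TRUE)
class(foo_asis$name)         # "AsIs"
bar[foo_asis$name, ]         # still indexes by label

# read.table(): state the column classes up front via colClasses.
csv_text <- "name,value\nc,1\na,2"   # hypothetical input
tmp <- read.table(text = csv_text, header = TRUE, sep = ",",
                  colClasses = c("character", "integer"))
```

Passing colClasses also spares read.table() from guessing the column types, which can matter for large files.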