I am using something like this to filter my data frame:
d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])
When I print d1, it works as expected. However, when I type d1$ColA, it still prints all the levels from the original data frame.
> print(d1)
ColA ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2
> print(d1$ColA)
Levels: ColACat1 ColACat2
Maybe this is expected, but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there any way I can filter the data frame and get only the records that match the filter? I want d1 to not know about the existence of data.
As you allude to, the default behavior in R (in versions before 4.0.0) is to treat character columns in data frames as a special data type, called a factor. This is a feature, not a bug, but like any useful feature, if you’re not expecting it and don’t know how to use it properly, it can be quite confusing. Factors are meant to represent categorical (rather than numerical, or quantitative) variables, which come up often in statistics.
The subsetting operations you used do in fact work normally: they return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.
This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren’t present. In statistics, one often wants to track the presence of ‘missing’ levels of categorical variables, which is why subsetting keeps them.
I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE is the default in R versions before 4.0.0, so sharing your code with someone else may be risky unless you preface every single script with a call to options.)
A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions into a small wrapper function, and use that wrapper to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.
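Something along these lines should work as a minimal sketch. The helper name subset_drop and the sample data are made up for illustration (the column names and values just mirror the question), and stringsAsFactors = TRUE is set explicitly so the example behaves the same way on newer versions of R:

```r
# Illustrative data only; the values mirror the question.
# stringsAsFactors = TRUE is explicit so this also holds on R >= 4.0.0.
data <- data.frame(
  ColA = c("ColACat1", "ColACat1", "ColACat2"),
  ColB = c("ColBCat2", "ColBCat2", "ColBCat1"),
  stringsAsFactors = TRUE
)

# Plain subsetting returns the right rows but keeps every original level.
d1 <- data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ]
levels(d1$ColA)  # "ColACat1" "ColACat2"

# Hypothetical helper (the name is an assumption): pass all arguments
# through to subset(), then drop factor levels that no longer occur.
subset_drop <- function(...) {
  droplevels(subset(...))
}

d2 <- subset_drop(data, ColA == "ColACat1" & ColB == "ColBCat2")
levels(d2$ColA)  # "ColACat1"
```

If you already have a filtered data frame like d1, calling droplevels(d1) directly should give the same result without the wrapper.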