I have a 2.5 GB dataset, which is quite large for my 4 GB of memory. I wonder whether converting character variables to factors will save space and processing time.
I would imagine that, internally, factors are stored as integers with a lookup table for the levels, but I am not sure how it actually works.
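Converting to a factor won’t save much space, because R already stores each distinct string only once in a global cache (a hash table); a character vector just holds references to the cached strings. See section 1.10, “The CHARSXP cache”, of R Internals.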
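If you would rather measure than take this on faith, here is a minimal sketch. The data is made up (three hypothetical strings repeated a million times), and the sizes you see will depend on your R build and on how many distinct values your real columns have, so treat it as a way to check your own case rather than a definitive answer.

```r
# Hypothetical data: one million values drawn from three distinct strings
x <- sample(c("alpha", "beta", "gamma"), 1e6, replace = TRUE)
f <- factor(x)

# A factor is an integer vector of codes plus a character vector of levels
typeof(f)    # "integer"
levels(f)    # the distinct strings, sorted

# Compare the in-memory footprint of the two representations
print(object.size(x), units = "MB")
print(object.size(f), units = "MB")
```

Running the same comparison on one of your actual columns tells you whether the conversion is worth it for your data.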
Converting to a factor may improve processing time if your code would otherwise have to do the conversion itself (running a regression, classification, etc.), but it won’t help if you’re doing string manipulation, because the factor would have to be converted back to character first. So it really depends on what you’re doing.
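A rough way to see both cases is to time them. This is only a sketch on made-up data: `table()` stands in for factor-based work (it builds a factor from its input), and `toupper()` stands in for string manipulation. Timings vary by machine, so none are quoted here.

```r
# Hypothetical data: one million values drawn from three distinct strings
x <- sample(c("alpha", "beta", "gamma"), 1e6, replace = TRUE)
f <- factor(x)

# Factor-based work: table() must first build a factor from x,
# whereas f is already a factor
system.time(tab_x <- table(x))
system.time(tab_f <- table(f))

# String work: the factor has to be converted back to character first
system.time(up_x <- toupper(x))
system.time(up_f <- toupper(as.character(f)))
```

Both pairs produce the same results; only the time spent getting there differs.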