I’m trying to do some machine learning that involves a lot of factor-type variables (words, descriptions, times, basically non-numeric data). I usually rely on randomForest, but it doesn’t work with factors that have more than 32 levels.
Can anyone suggest some good alternatives?
Tree methods struggle with high-cardinality factors when used directly, because the number of possible binary splits of an unordered factor grows exponentially with the number of levels (2^(c-1) - 1 for c levels). With words, this is typically addressed by creating an indicator variable for each word (of the description, etc.). That way a split uses one word at a time (yes/no) instead of searching over all possible combinations of levels. In general you can always expand levels into indicators, and some models, such as glm, do that implicitly. The same approach is used in ML for handling text with other methods such as SVM. So the answer may be that you need to think about your input data structure more than about the method. Alternatively, if the levels have some kind of order, you can treat the factor as ordered, so there are only c-1 possible splits.
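To make the indicator idea concrete, here is a minimal sketch in plain Python (chosen only so it runs without extra packages; in R, `model.matrix()` performs a similar expansion for a factor column). The toy descriptions and the helper names `word_indicators`, `unordered_splits`, and `ordered_splits` are made up for illustration and are not part of any library.

```python
# Hypothetical toy data: each entry is a free-text description.
descriptions = [
    "red wool scarf",
    "blue wool hat",
    "red cotton hat",
]


def word_indicators(texts):
    """Return (vocabulary, rows); rows[i][j] is 1 if word j occurs in text i."""
    vocabulary = sorted({word for text in texts for word in text.lower().split()})
    rows = []
    for text in texts:
        present = set(text.lower().split())
        rows.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, rows


def unordered_splits(c):
    """Distinct binary partitions of c unordered levels: 2^(c-1) - 1."""
    return 2 ** (c - 1) - 1


def ordered_splits(c):
    """Cut points available when the c levels have a natural order: c - 1."""
    return c - 1


vocabulary, rows = word_indicators(descriptions)
print(vocabulary)
for text, row in zip(descriptions, rows):
    print(row, text)

# Compare the search space for a 33-level factor, unordered vs. ordered.
print(unordered_splits(33), ordered_splits(33))
```

Each column of `rows` is a yes/no variable that a tree can split on directly, so the number of levels in the original text column no longer matters to the learner. For a real corpus you would probably want a sparse representation rather than nested lists.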