I’m experimenting with a document classification task, and an SVM works well so far on TF*IDF feature vectors. I want to incorporate some new features that are not term-frequency based (e.g. document length) and see whether they improve classification performance. I have the following questions:
- can I simply concatenate the new features with the old term frequency based features and train an SVM on this heterogeneous feature space?
- if not, is Multiple Kernel Learning the way to go, i.e. training a kernel on each feature subspace and combining them by linear interpolation? (we still don’t have MKL implemented in scikit-learn, right?)
- or shall I turn to alternative learners that handle heterogeneous features well, such as MaxEnt and decision trees?
Thank you in advance for your kind advice!
Since you tagged this with scikit-learn: yes, you can, and you can use FeatureUnion to do it for you.
Linear SVMs are the standard model for this task. Kernel methods are too slow for real-world text classification (except maybe with training algorithms like LaSVM, but that’s not implemented in scikit-learn).
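A minimal sketch of the FeatureUnion approach, under some assumptions of mine: the toy documents and labels are made up, `doc_length` is a hypothetical helper that uses character count as the "document length" feature, and wrapping it in FunctionTransformer is just one convenient way to plug a hand-written feature into the union.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.svm import LinearSVC


def doc_length(docs):
    """Hypothetical helper: one column with the character length of each document."""
    return np.array([[len(d)] for d in docs], dtype=float)


# Concatenate the TF-IDF block with the (scaled) length column.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", make_pipeline(FunctionTransformer(doc_length), MinMaxScaler())),
])

clf = make_pipeline(features, LinearSVC())

# Made-up toy data, only to show the shapes involved.
docs = [
    "cheap pills buy now",
    "meeting moved to friday afternoon, please confirm",
    "buy cheap watches now",
    "please review the attached quarterly report before the meeting",
]
labels = [1, 0, 1, 0]

clf.fit(docs, labels)
predictions = clf.predict(docs)

# The combined matrix has one extra column compared to TF-IDF alone.
X = clf[:-1].transform(docs)
n_terms = len(clf[0].transformer_list[0][1].vocabulary_)
```

To see whether the new feature helps, you could cross-validate this pipeline against one that contains only the TF-IDF step and compare the scores.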
SVMs handle heterogeneous features just as well as MaxEnt/logistic regression. In both cases, you really must input scaled data, e.g. with MinMaxScaler. Note that scikit-learn’s TfidfTransformer produces normalized vectors by default, so you don’t need to scale its output, just the other features.
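If you want to check the scaling point yourself, here is a rough sketch. The toy documents are made up, and character count again stands in for whatever extra feature you add. It computes the L2 norm of each TF-IDF row (which should be 1 with the default `norm="l2"`) and rescales the length column to [0, 1].

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import MinMaxScaler

# Made-up toy documents; every one contains at least one token.
docs = [
    "short note",
    "a somewhat longer document about support vector machines",
    "text classification with linear models and tf idf features works well",
]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)  # norm="l2" is the default

# L2 norm of each row of the sparse TF-IDF matrix.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# The raw length feature lives on a very different scale, so rescale it.
lengths = np.array([[len(d)] for d in docs], dtype=float)
scaled_lengths = MinMaxScaler().fit_transform(lengths)
```

After scaling, the length column is on the same order of magnitude as the TF-IDF entries, so it neither dominates nor vanishes in the linear model.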