In NLTK, using a naive Bayes classifier, I know from examples that it's very simple to use a “bag of words” approach and look for unigrams, bigrams, or both. Could you do the same using two completely different sets of features?
For instance, could I use unigrams and length of the training set (I know this has been mentioned once on here)? Of more interest to me would be something like word bigrams together with “bigrams” or other combinations of the POS tags that appear in the document.
Is this beyond the power of the basic NLTK classifier?
Thanks
Alex
NLTK classifiers can work with any key-value dictionary. I use {"word": True} for text classification, but you could also use {"contains(word)": 1} to achieve the same effect. You can also combine many features together, so you could have {"word": True, "something something": 1, "something else": "a"}. What matters most is that your features are consistent, so you always have the same kind of keys and a fixed set of possible values. Numeric values can be used, but the classifier isn't smart about them: it will treat numbers as discrete values, so that 99 and 100 are just as different as 1 and 100. If you want numbers to be handled in a smarter way, then I recommend using scikit-learn classifiers.
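
To make that concrete, here is a rough sketch of one way to put unigrams, word bigrams, POS-tag bigrams and a length feature into a single dictionary. The function names, the key formats such as "pos_bigram(DT NN)", and the length cutoffs are all made up for illustration, and the input is assumed to be already POS-tagged as a list of (word, tag) pairs. Bucketing the length is just one possible workaround for the numeric issue, not something the classifier requires.

```python
def bucket_length(n_tokens):
    """Map a token count to a coarse label (cutoffs are arbitrary assumptions)."""
    if n_tokens < 5:
        return "short"
    if n_tokens < 20:
        return "medium"
    return "long"


def extract_features(tagged_tokens):
    """Build one feature dict from a list of (word, POS tag) pairs."""
    words = [word.lower() for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    features = {}
    # Unigram features.
    for word in words:
        features["word(%s)" % word] = True
    # Word bigram features.
    for first, second in zip(words, words[1:]):
        features["bigram(%s %s)" % (first, second)] = True
    # POS-tag bigram features, kept apart from word bigrams by the key prefix.
    for first, second in zip(tags, tags[1:]):
        features["pos_bigram(%s %s)" % (first, second)] = True
    # A small fixed set of values instead of a raw number.
    features["length"] = bucket_length(len(words))
    return features


def train(labeled_docs):
    """Train on a list of (tagged_tokens, label) pairs."""
    # Imported here so the feature code above can be tried without NLTK installed.
    from nltk import NaiveBayesClassifier
    featuresets = [(extract_features(doc), label) for doc, label in labeled_docs]
    return NaiveBayesClassifier.train(featuresets)
```

You would then call classifier.classify(extract_features(new_doc)) on a tagged document, using the same extract_features function so the keys stay consistent between training and classification.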