I have a list of strings which are all verbs. I need to get the word frequencies for each verb, but I want to count verbs such as "want", "wants", "wanting" and "wanted" as one verb. Formally, a "verb" is defined as a set of 4 words that are of the form {X, Xs, Xed, Xing} or of the form {X, Xes, Xed, Xing}, where X is the verb. How would I go about extracting verbs from the list such that I get "X" and a count of how many times the stem appears? I figured I could somehow use regex, but I'm new to regex and I am totally lost.
There is a library called nltk with a very large collection of functions for text processing. One subset of those functions is the stemmers, which do just what you want (using algorithms and code developed by people with a lot of experience in the area). The Porter stemming algorithm is one of them.

You could use a stemmer in conjunction with a `defaultdict` to count how often each stem occurs (note: in Python 2.7+, a `Counter` would be equally useful, if not better).

One thing to note: the stemmers aren't perfect. For instance, the Porter stemmer leaves an irregular form such as "ran" unchanged rather than reducing it to "run", so it is counted as a separate verb. However, hopefully it will get you close to what you want.
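The approach described above can be sketched roughly as follows. This is a minimal illustration, not the original answer's code: it assumes nltk is installed, the helper name `count_verb_stems` and the sample list `verbs` are hypothetical, and it uses a `Counter` rather than a `defaultdict`.

```python
from collections import Counter

from nltk.stem import PorterStemmer


def count_verb_stems(words):
    """Return a Counter mapping each Porter stem to how often it occurs.

    Hypothetical helper for illustration only.
    """
    stemmer = PorterStemmer()
    # Reduce every word to its stem, then tally identical stems together.
    return Counter(stemmer.stem(word) for word in words)


# Hypothetical sample input; any list of verb strings would do.
verbs = [
    "want", "wants", "wanting", "wanted",
    "watch", "watches", "watched",
    "run", "running", "ran",
]

counts = count_verb_stems(verbs)
for stem, n in counts.most_common():
    print(stem, n)
```

With this sample, the four forms of "want" are tallied under a single stem, while "ran" ends up with its own entry because suffix stripping cannot relate it to "run". If irregular forms matter, a lemmatizer may be worth looking at instead of a stemmer.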