I need to get the most popular ngrams from a text. Ngram length must be from 1 to 5 words.
I know how to get bigrams and trigrams. For example:
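import nltk

# `words` is the tokenized text; `filter_stops` is a predicate returning True for words to drop
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # discard bigrams seen fewer than 3 times
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)  # top 20 bigrams by PMI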
However, I found out that scikit-learn can get ngrams of various lengths. For example, I can get ngrams with lengths from 1 to 5.
v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))
But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need a FreqList of these collocations/ngrams.
Can I do that with nltk/scikit? I need to get combinations of ngrams with various lengths from one text.
For example, with NLTK bigrams and trigrams there are many situations in which my trigrams include my bigrams, or my trigrams are part of bigger 4-grams. For example:
bigrams: hello my
trigrams: hello my name
I know how to exclude bigrams from trigrams, but I need a better solution.
update
Since scikit-learn 0.14 the format has changed to:
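A minimal sketch of the newer call, assuming the change referred to here is the `ngram_range=(min_n, max_n)` argument of `CountVectorizer`. The sample string and the names `analyze` and `sample_ngrams` are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 5) asks for every word n-gram of length 1 through 5
vectorizer = CountVectorizer(ngram_range=(1, 5))

# build_analyzer() returns the callable that turns one string into its n-grams
analyze = vectorizer.build_analyzer()

# Illustrative input: four tokens, so there is no 5-gram to extract here
sample_ngrams = analyze("hello my name is")
print(sample_ngrams)
```

With the default settings the analyzer lowercases the text and keeps only tokens of two or more characters, so very short words never appear in any n-gram.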
Full example:
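The sketch below is one way such a full example could look, not necessarily the exact code originally posted. It assumes two short sample documents adapted from the question, and the variable names (`c_vec`, `totals`, `freq_list`) are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two short sample documents, adapted from the question (illustrative input)
test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

c_vec = CountVectorizer(ngram_range=(1, 5))

# fit_transform() takes an iterable of strings and returns a sparse document-term matrix
counts = c_vec.fit_transform([test_str1, test_str2])

# vocabulary_ maps each n-gram to its column index; it only exists after fitting
vocab = c_vec.vocabulary_

# Sum over the document axis to get one total count per n-gram
totals = counts.toarray().sum(axis=0)

# Frequency list: highest count first, ties broken alphabetically
freq_list = sorted(
    ((int(totals[i]), ngram) for ngram, i in vocab.items()),
    key=lambda pair: (-pair[0], pair[1]),
)

# Print the 10 most frequent n-grams
for count, ngram in freq_list[:10]:
    print(count, ngram)
```

Slicing `freq_list` gives the N most frequent n-grams. Calling `toarray()` is fine for small inputs like this one, but for a large corpus it may be preferable to sum the sparse matrix directly.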
Note that in the output the word I is removed not because it’s a stopword (it’s not) but because of its length: https://stackoverflow.com/a/20743758/
This should/could be much simpler these days, imo. You can try things like textacy, but that can come with its own complications sometimes, like initializing a Doc, which doesn’t currently work with v.0.6.2 as shown in their docs. If Doc initialization worked as promised, in theory the following would work (but it doesn’t):
old answer
WordNGramAnalyzer is indeed deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies is now combined in sklearn.feature_extraction.text.CountVectorizer. You can create all n-grams ranging from 1 to 5 as follows:
More examples and information can be found in scikit-learn’s documentation about text feature extraction.