The Archive Base

Editorial Team
Asked: June 9, 2026

I need to get the most popular n-grams from a text. The n-gram length must be from 1 to 5 words.

I know how to get bigrams and trigrams. For example:

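import nltk

# words is a list of tokens; filter_stops returns True for words to drop
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)  # top 20 bigrams by PMI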

However, I found out that scikit-learn can produce n-grams of various lengths, for example from 1 to 5 words.

v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))

But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need a frequency list (FreqList) of these collocations/n-grams.

Can I do that with NLTK or scikit-learn? I need to get combinations of n-grams of various lengths from one text.

For example, with NLTK bigrams and trigrams there are many cases in which my trigrams include my bigrams, or my trigrams are part of bigger 4-grams. For example:

bigrams: hello my
trigrams: hello my name

I know how to exclude bigrams from trigrams, but I need a better solution.

1 Answer

  1. Editorial Team
    Added an answer on June 9, 2026 at 2:02 am

    update

    Since scikit-learn 0.14 the format has changed to:

    n_grams = CountVectorizer(ngram_range=(1, 5))
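
As a quick check of what the range means, here is a minimal sketch using the "hello my name" example from the question. It assumes a scikit-learn version that accepts ngram_range (0.14 or later, per the note above); the variable names vectorizer and ngram_vocab are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(min_n, max_n) is inclusive on both ends.
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorizer.fit(["hello my name"])

# vocabulary_ maps each n-gram to its column index; the keys are the n-grams.
# Sort by number of words first, then alphabetically, for readable output.
ngram_vocab = sorted(vectorizer.vocabulary_, key=lambda ng: (len(ng.split()), ng))
print(ngram_vocab)
# → ['hello', 'my', 'name', 'hello my', 'my name', 'hello my name']
```

So a single vectorizer yields the unigrams, bigrams and trigrams together, which is why no separate finder per n-gram length is needed.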
    

    Full example:

    test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
    test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."
    
    from sklearn.feature_extraction.text import CountVectorizer
    
    c_vec = CountVectorizer(ngram_range=(1, 5))
    
    # input to fit_transform() should be an iterable with strings
    ngrams = c_vec.fit_transform([test_str1, test_str2])
    
    # needs to happen after fit_transform()
    vocab = c_vec.vocabulary_
    
    count_values = ngrams.toarray().sum(axis=0)
    
    # output n-grams
    for ng_count, ng_text in sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True):
        print(ng_count, ng_text)
    

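    which outputs the following under Python 2 (Python 3 prints each line as "3 to" rather than as a tuple). Note that the word "I" is dropped not because it is a stopword (it is not) but because of its length, since the default tokenizer only keeps tokens of two or more characters: https://stackoverflow.com/a/20743758/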

    > (3, u'to')
    > (3, u'from')
    > (2, u'ngrams')
    > (2, u'need')
    > (1, u'words')
    > (1, u'trigrams but need better solutions')
    > (1, u'trigrams but need better')
    ...
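
If one-character tokens such as "I", "1" and "5" should be kept, one option is to override the token_pattern parameter of CountVectorizer. The sketch below is an assumption about what you might want, not part of the original answer; the pattern \w+ and the names default_vec and short_token_vec are chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["I need to get most popular ngrams from text."]

# The default pattern r"(?u)\b\w\w+\b" only keeps tokens of two or more characters.
default_vec = CountVectorizer(ngram_range=(1, 5)).fit(text)

# Assumed alternative: \w+ also keeps one-character tokens (lowercased to "i").
short_token_vec = CountVectorizer(
    ngram_range=(1, 5), token_pattern=r"(?u)\b\w+\b"
).fit(text)

print("i" in default_vec.vocabulary_)           # → False
print("i need" in short_token_vec.vocabulary_)  # → True
```

Keeping short tokens changes which n-grams exist at all, so counts from the two vectorizers are not directly comparable.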
    

    This should/could be much simpler these days, imo. You can try libraries like textacy, but they can come with their own complications, such as initializing a Doc, which currently does not work with v.0.6.2 the way their docs show. If Doc initialization worked as promised, in theory the following would work (but it doesn't):

    test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
    test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."
    
    import textacy
    
    # some version of the following line
    doc = textacy.Doc([test_str1, test_str2])
    
    ngrams = doc.to_bag_of_terms(ngrams={1, 5}, as_strings=True)
    print(ngrams)
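
If an extra dependency is not wanted, the same kind of frequency list can be sketched with the standard library alone. This is a rough illustration, not a replacement for CountVectorizer: the helper name count_ngrams and its simple \w+ tokenizer are assumptions, and n-grams are allowed to span sentence boundaries.

```python
import re
from collections import Counter


def count_ngrams(text, min_n=1, max_n=5):
    """Count all word n-grams with min_n <= n <= max_n in one text."""
    # Simple tokenizer (an assumption): lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
freq_list = count_ngrams(test_str1)

# most_common(N) gives the N best n-grams by raw frequency.
for ngram, count in freq_list.most_common(5):
    print(count, ngram)
```

The Counter doubles as the frequency list asked for in the question, and most_common(N) gives the N best n-grams by count (not by an association measure such as PMI).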
    

    old answer

    WordNGramAnalyzer has indeed been deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies is now combined in sklearn.feature_extraction.text.CountVectorizer. In releases before 0.14, you can create all n-grams from 1 to 5 words as follows:

    n_grams = CountVectorizer(min_n=1, max_n=5)
    

    More examples and information can be found in scikit-learn’s documentation about text feature extraction.
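
On the overlap raised in the question (a bigram such as "hello my" that only ever occurs inside the trigram "hello my name"), one possible post-processing step is to drop every n-gram that is contained in a longer n-gram with the same count. This is a sketch under that assumption, not something scikit-learn or NLTK provides; drop_subsumed and the example counts are hypothetical.

```python
def drop_subsumed(counts):
    """Drop every n-gram that occurs inside a longer n-gram with the same count."""
    kept = {}
    for ngram, count in counts.items():
        # Pad with spaces so that matching happens on whole words only.
        padded = f" {ngram} "
        subsumed = any(
            other != ngram and other_count == count and padded in f" {other} "
            for other, other_count in counts.items()
        )
        if not subsumed:
            kept[ngram] = count
    return kept


# Hypothetical counts for illustration only.
example_counts = {"hello": 5, "hello my": 3, "my name": 3, "hello my name": 3}
print(drop_subsumed(example_counts))
# → {'hello': 5, 'hello my name': 3}
```

The comparison is quadratic in the number of n-grams, so for a large vocabulary it may be worth filtering by minimum frequency first. The input can be any mapping from n-gram to count, for example one built from vocabulary_ and the column sums in the full example above.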

© 2021 The Archive Base. All Rights Reserved