Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8475893
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 10, 20262026-06-10T18:01:17+00:00 2026-06-10T18:01:17+00:00

This is a question about linear regression with ngrams, using Tf-IDF (term frequency –

  • 0

This is a question about linear regression with ngrams, using Tf-IDF (term frequency – inverse document frequency). To do this, I am using numpy sparse matrices and sklearn for linear regression.

I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.

When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than when I create a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigram, quadgram, quintgrams, etc.), the less accurate the regression prediction.

Is this common? How is this possible? I would have thought that the more features, the better.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-10T18:01:18+00:00Added an answer on June 10, 2026 at 6:01 pm

    It’s not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.

    I’m not sure what kind of variable you’re trying to predict, and I’ve never done regression on text, but here’s some comparable results from literature to get you thinking:

    • In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate “new” but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can’t remember which chapter and I don’t have my copy handy).
    • In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel’s feature space, and {1,2,3}-grams of that of the cubic kernel.

    Exactly what is happening depends on your training set; it might simply be too small.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have seen this question about deploying to WebSphere using the WAS ant tasks.
This question is about the efficiency of a linear search vs. the efficiency of
This is a beginner question on regularization with regression. Most information about Elastic Net
This recent question about sorting randomly using C# got me thinking about the way
This question about Timers for windows services got me thinking: Say I have (and
Followed this question about delayed_job and monit Its working on my development machine. But
Note: I originally asked this question about an hour ago but only recently realized
I've this question about pass some instances by ref or not: here is my
Follow up to this question about GNU make : I've got a directory, flac
This is sort of the Java analogue of this question about C# . Suppose

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.