Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8510075
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 11, 20262026-06-11T03:36:24+00:00 2026-06-11T03:36:24+00:00

I’m trying to use sk-learn’s RandomForestClassifier for a binary classification task (positive and negative

  • 0

I’m trying to use sk-learn’s RandomForestClassifier for a binary classification task (positive and negative examples). My training data contains 1.177.245 examples with 40 features, in SVM-light format (sparse vectors) which I load using sklearn.dataset’s load_svmlight_file. It produces a sparse matrix of ‘feature values’ (1.177.245 * 40) and one array of ‘target classes’ (1s and 0s, 1.177.245 of them). I don’t know whether this is worrysome, but the trainingdata has 3552 positives and the rest are all negative.

As the sk-learn’s RFC doesn’t accept sparse matrices, I convert the sparse matrix to a dense array (if I’m saying that right? Lots of 0s for absent features) using .toarray(). I print the matrix before and after converting to arrays and that seems to be going all right.

When I initiate the classifier and start fitting it to the data, it takes this long:

[Parallel(n_jobs=40)]: Done   1 out of  40 | elapsed: 24.7min remaining: 963.3min
[Parallel(n_jobs=40)]: Done  40 out of  40 | elapsed: 27.2min finished

(is that output right? Those 963 minutes take about 2 and a half…)

I then dump it using joblib.dump.
When I re-load it:

RandomForestClassifier: RandomForestClassifier(bootstrap=True, compute_importances=True,
        criterion=gini, max_depth=None, max_features=auto,
        min_density=0.1, min_samples_leaf=1, min_samples_split=1,
        n_estimators=1500, n_jobs=40, oob_score=False,
        random_state=<mtrand.RandomState object at 0x2b2d076fa300>,
        verbose=1)

And test it on real trainingdata (consisting out of 750.709 examples, exact same format as training data) I get “unexpected” results. To be exact; only one of the examples in the testingdata is classified as true. When I train on half the initial trainingdata and test on the other half, I get no positives at all.

Now I have no reason to believe anything is wrong with what’s happening, it’s just that I get weird results, and furthermore I think it’s all done awfully quick. It’s probably impossible to make a comparison, but training a RFClassifier on the same data using rt-rank (also with 1500 iterations, but with half the cores) takes over 12 hours…

Can anyone enlighten me whether I have any reason to believe something is not working the way it’s supposed to? Could it be the ratio of positives to negatives in the training data? Cheers.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-11T03:36:26+00:00Added an answer on June 11, 2026 at 3:36 am

    Indeed this dataset is very very imbalanced. I would advise you to subsample the negative examples (e.g. pick n_positive_samples of them at random) or to oversample the positive example (the latter is more expensive and but might yield better models).

    Also are you sure that all your features are numerical features (larger values means something in real life)? If some of them are categorical integer markers, those feature should be exploded as one-of-k boolean encodings instead as scikit-learn implementation of random forest s cannot directly deal with categorical data.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I am trying to understand how to use SyndicationItem to display feed which is
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka
Basically, what I'm trying to create is a page of div tags, each has
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have a string like this: La Torre Eiffel paragonata all&#8217;Everest What PHP function
I am trying to render a haml file in a javascript response like so:
I want use html5's new tag to play a wav file (currently only supported
I'm parsing an RSS feed that has an &#8217; in it. SimpleXML turns this
I'm trying to select an H1 element which is the second-child in its group
We are using XSLT to translate a RIXML file to XML. Our RIXML contains

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.