Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7863401
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 2, 20262026-06-02T23:28:14+00:00 2026-06-02T23:28:14+00:00

I need to train a regression model over a large set of training examples,

  • 0

I need to train a regression model over a large set of training
examples, with the potential to incorporate arbitrary features. What
learning algorithms should I consider and why?

A quick summary of the problem:

  • Approximately 5 million training examples
  • Adding training examples at a rate of 2-4 million per year
  • Training examples currently contain 10 features each
  • Aproximately 400k populated features (out of a much larger total feature space)
  • Additional features added over time
  • Retraining or adapting the model (at least) daily to incorporate new examples
  • Optimization criteria: minimum squared percentage error
  • Output: a single real-valued number

I have some experience training log-linear models on similarly-sized classification problems (using SVMs, Averaged and Voted Perceptrons, etc.) The ability to add arbitrary features is important, but in this instance, training time is valuable as well.

For instance, my one experiment thus far with SVMLight took several weeks to converge on a subset of this data. We could parallelize across a multicore machine or (possibly) a cluster, but we need to train models in minutes. Online training would be even better.

I trained an Averaged Perceptron model successfully (and quickly). However, to my knowledge, the AP isn’t normally applied to regression. Does the AP offer any convergence guarantees for a regression model? Is there another formal reason it shouldn’t be applicable? Or is that a reasonable match for my requirements?

What other options should I research? An SVM would probably offer superior accuracy, but quadratic training time isn’t acceptable. If linear-time SVM algorithms are accessible, that could work well.

Potential pluses:

  • Online training
  • Open-source implementation available (ideally in Java). We can roll our own implementation if necessary, but I’ll avoid that if possible.

Thanks for your input.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-02T23:28:15+00:00Added an answer on June 2, 2026 at 11:28 pm

    This is the classic problem with large scale SVM. An SVM model would need to be retrained if new features are added, and if new data is added if you are not using an online svm. Some options:

    Practical Options (off the shelf):

    LIBLINEAR – If you can do Linear SVM there are some algorithms which take advantage of the linear kernel to provide better than quadratic training time. Check out LIBLINEAR which is from the same research group as libsvm. They just added regression in Version 1.91 released yesterday. http://www.csie.ntu.edu.tw/~cjlin/liblinear/

    Oracle ODM – Oracle has SVM available in their ODM package. They take a practical approach to basically provide ‘good enough’ SVM without paying the computational cost of finding a truly optimal solution. They use some sampling and model selection techniques – you can find info about that here: http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/overview/support-vector-machines-paper-1205-129825.pdf

    SHOGUN – The SHOGUN Machine Learning Toolbox is designed for large-scale learning, they interface with a number of SVM implementations as well as other methods. I’ve never used it but it might be worth a look: http://www.shogun-toolbox.org

    Kernel-machines.org has a list of software packages: http://www.kernel-machines.org/software

    Other SVM research

    If you are looking to roll your own, there are a number of techniques to try to scale SVM up to large datasets that have been published in research papers, but the code is not necessarily available, useable or maintained as the above examples. They claim good results, but each has its own set of drawbacks. Many involve doing some level of data selection. For example, several research papers use linear time clustering algorithms to cluster the data and train successive SVM models based on the clusters, in order to build the model without using all the data. Core Vector Machines claim a linear training time, but there is some criticism on whether their accuracy is as high as they claim. Numerous papers use different heuristic algorithms to attempt to select the most likely support vector candidates. Many of these are for classification, but could probably be adapted to regression. If you would like more info on some of this research I can add some references.

    Tools for exploring algorithms

    You are probably already be familiar with these, but I figured I’d throw it in here just in case:

    There are other algorithms that have good runtime on large datasets, but whether they will work well is hard to tell, it depends on the makeup of your data. Since run-time is important I would start with the simpler models and work up to the more complex. ANN, Decision Tree regression, Bayesian methods, Locally Weighted Linear Regression, or a hybrid approach such as model trees, which is a decision tree whose leaf nodes are linear models, can all be done more quickly than SVM on large data sets and may produce good results.

    WEKA – Weka is a good tool for exploring your options. I would use WEKA to try subsets of your data in different algorithms. The source code is open and in java if you chose something you could adapt it to your needs. http://www.cs.waikato.ac.nz/ml/weka/

    R – The R Programming Language also implements many algorithms and is similar to programming in Matlab. http://www.r-project.org/

    I wouldn’t recommend using WEKA or R non a large-scale data set, but they are useful tools for trying to narrow down what algorithms may work well for you.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I need to train a network to multiply or add 2 inputs, but it
I need help modeling a train system (like a subway system) in SimPy, the
I'm using Sql Server 2008 analysis services to create/train a mining model (naive bayes
I need to train a neural network with 2-4 hidden layers, not sure yet
I have someone I need to train in Java. That person is well versed
I think it would be fun to model a top view of a train
I need to download jasper reports professional trail version, can anyone help?
Need to insert selected text on the page into textarea. There must be some
Need to call a filter function on some options based on a radio selected
need to add a 2nd css stylesheet to a page. do i add a

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.