The Archive Base


The Archive Base Latest Questions

Editorial Team
Asked: June 7, 2026 (2026-06-07T08:28:04+00:00)

I have programmed a Neural Network in Java and am now working on the back-propagation algorithm.

I’ve read that batch updates of the weights give a more stable gradient search than online weight updates.

As a test I’ve created a time series function of 100 points, such that x = [0..99] and y = f(x). I’ve created a Neural Network with one input and one output and 2 hidden layers with 10 neurons for testing. What I am struggling with is the learning rate of the back-propagation algorithm when tackling this problem.

I have 100 input points so when I calculate the weight change dw_{ij} for each node it is actually a sum:

dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}

where p = 100 in this case.

Now the weight updates become really huge and therefore my error E bounces around such that it is hard to find a minimum. The only way I got some proper behaviour was when I set the learning rate y to something like 0.7 / p^2.

Is there some general rule for setting the learning rate based on the number of samples?
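To make the scaling in the sum above concrete, here is a minimal sketch. It assumes a one-weight linear model y = w * x with squared error, which is far simpler than the network described, and inputs scaled to [0, 1); the class and method names are hypothetical. It only illustrates that a summed batch gradient is p times the mean gradient, so a rate that suits the mean has to be divided by p when the sum is used.

```java
public class BatchGradientScale {
    // Gradient of 0.5 * (w * x_i - y_i)^2 with respect to w, summed over all samples.
    static double summedGradient(double w, double[] x, double[] y) {
        double g = 0.0;
        for (int i = 0; i < x.length; i++) {
            g += (w * x[i] - y[i]) * x[i];
        }
        return g;
    }

    // The same gradient divided by the number of samples p.
    static double meanGradient(double w, double[] x, double[] y) {
        return summedGradient(w, x, y) / x.length;
    }

    public static void main(String[] args) {
        int p = 100;
        double[] x = new double[p];
        double[] y = new double[p];
        for (int i = 0; i < p; i++) {
            x[i] = i / (double) p; // assumption: inputs scaled to [0, 1)
            y[i] = 2.0 * x[i];     // assumption: target function f(x) = 2x
        }
        double w = 0.0;
        System.out.println("summed gradient: " + summedGradient(w, x, y));
        System.out.println("mean gradient:   " + meanGradient(w, x, y));
    }
}
```

Unscaled inputs such as x = 0..99 enlarge the gradient further, so input scaling may matter as much as the sample count here.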


1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 8:28 am (2026-06-07T08:28:09+00:00)

    http://francky.me/faqai.php#otherFAQs :

    Subject: What learning rate should be used for backprop?

    In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate
    makes the weights and objective function diverge, so there is no learning at all. If the objective function is
    quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and
    Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs
    with hidden units, the optimal learning rate often changes dramatically during the training process, since
    the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a
    tedious process requiring much trial and error. For some examples of how the choice of learning rate and
    momentum interact with numerical condition in some very simple networks, see
    ftp://ftp.sas.com/pub/neural/illcond/illcond.html
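As a rough illustration of the quadratic case, the sketch below runs plain gradient descent on the one-dimensional objective f(w) = 0.5 * c * w^2, where the curvature c plays the role of the Hessian. The class name, the curvature value and the three rates are assumptions chosen for the example, not taken from the FAQ. Each step multiplies w by (1 - rate * c), so the iteration shrinks only when the rate is below 2 / c.

```java
public class QuadraticLearningRate {
    // Gradient descent on f(w) = 0.5 * curvature * w^2, whose gradient is curvature * w.
    static double descend(double curvature, double learningRate, double w0, int steps) {
        double w = w0;
        for (int k = 0; k < steps; k++) {
            double gradient = curvature * w;
            w -= learningRate * gradient;
        }
        return w;
    }

    public static void main(String[] args) {
        double curvature = 10.0; // assumption: stability limit is 2 / 10 = 0.2
        double[] rates = {0.001, 0.1, 0.25};
        for (double rate : rates) {
            double w = descend(curvature, rate, 1.0, 50);
            System.out.println("rate " + rate + " -> w after 50 steps: " + w);
        }
    }
}
```

With these values the smallest rate creeps toward zero slowly, the middle one lands on the minimum, and the largest one exceeds 2 / c and diverges.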

    With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use
    standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist
    (see Quickprop and RPROP under “What is backprop?” and the numerous training algorithms mentioned
    under “What are conjugate gradients, Levenberg-Marquardt, etc.?”).

    Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as
    standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of
    the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a
    large step size; this happens when you initialize a network with small random weights. In other regions of
    the weight space, the gradient is small and you need a small step size; this happens when you are close to a
    local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms
    try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute
    the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The
    great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the
    magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also
    second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
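The idea of a step size that ignores the gradient's magnitude can be sketched as follows. This is a simplified sign-based update in the spirit of RPROP, without weight backtracking, and not a reference implementation. The constants are the commonly cited defaults, and the class name, the initial step of 0.1 and the test objective are assumptions made for the example.

```java
public class RpropSketch {
    static final double INCREASE = 1.2;  // growth factor while the gradient sign is stable
    static final double DECREASE = 0.5;  // shrink factor after a sign change
    static final double STEP_MIN = 1e-6;
    static final double STEP_MAX = 50.0;

    // One batch update; weights, prevGradient and stepSizes are modified in place.
    static void update(double[] weights, double[] gradient,
                       double[] prevGradient, double[] stepSizes) {
        for (int i = 0; i < weights.length; i++) {
            double g = gradient[i];
            double signProduct = g * prevGradient[i];
            if (signProduct > 0) {
                stepSizes[i] = Math.min(stepSizes[i] * INCREASE, STEP_MAX);
            } else if (signProduct < 0) {
                stepSizes[i] = Math.max(stepSizes[i] * DECREASE, STEP_MIN);
                g = 0.0; // the minimum was overshot: skip the move for this weight
            }
            // Only the sign of the gradient is used, never its magnitude.
            weights[i] -= Math.signum(g) * stepSizes[i];
            prevGradient[i] = g;
        }
    }

    // Minimizes f(w) = sum_i curvature[i] * w[i]^2 from the given starting point.
    static double[] minimize(double[] start, double[] curvature, int iterations) {
        double[] w = start.clone();
        double[] prev = new double[w.length];
        double[] steps = new double[w.length];
        java.util.Arrays.fill(steps, 0.1); // assumption: initial step size
        double[] grad = new double[w.length];
        for (int k = 0; k < iterations; k++) {
            for (int i = 0; i < w.length; i++) {
                grad[i] = 2.0 * curvature[i] * w[i];
            }
            update(w, grad, prev, steps);
        }
        return w;
    }

    public static void main(String[] args) {
        double[] w = minimize(new double[] {5.0, -3.0}, new double[] {1.0, 1000.0}, 300);
        System.out.println(java.util.Arrays.toString(w));
    }
}
```

The two curvatures differ by a factor of 1000, yet both weights take steps of the same size, because the gradient contributes only its sign.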

    With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the
    learning rate during training. Various proposals have appeared in the NN literature, but most of them don’t
    work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who
    unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and
    Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate.
    There is also a variant of stochastic approximation called “iterate averaging” or “Polyak averaging”
    (Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running
    average of the weight values. I have no personal experience with these methods; if you have any solid
    evidence that these or other methods of automatically setting the learning rate and/or momentum in
    incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer
    (saswss@unx.sas.com).
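Keeping a running average of the weight values can be sketched like this. The example assumes a one-dimensional objective 0.5 * (w - 1)^2 with Gaussian noise added to its gradient and a constant learning rate; the class name, objective, noise model and seed are all assumptions for illustration. It shows the mechanics only and does not demonstrate the optimal convergence rates mentioned above.

```java
import java.util.Random;

public class PolyakAveraging {
    // Incremental mean: given the average of the first t - 1 values, returns the average of t values.
    static double updateAverage(double average, double value, int t) {
        return average + (value - average) / t;
    }

    // Noisy incremental descent on 0.5 * (w - 1)^2; returns {last iterate, averaged iterate}.
    static double[] run(int steps, double learningRate, long seed) {
        Random rng = new Random(seed);
        double w = 5.0;
        double average = 0.0;
        for (int t = 1; t <= steps; t++) {
            double noisyGradient = (w - 1.0) + rng.nextGaussian();
            w -= learningRate * noisyGradient;
            average = updateAverage(average, w, t);
        }
        return new double[] {w, average};
    }

    public static void main(String[] args) {
        double[] result = run(10000, 0.05, 42L);
        System.out.println("last iterate:     " + result[0]);
        System.out.println("averaged iterate: " + result[1]);
    }
}
```

The last iterate keeps jittering around the minimum at w = 1, while the average smooths that noise out over the run.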

    References:

    • Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
      Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
    • Darken, C. and Moody, J. (1992), “Towards faster stochastic gradient
      search,” in Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds.),
      Advances in Neural Information Processing Systems 4, San Mateo, CA:
      Morgan Kaufmann Publishers, pp. 1009-1016.
    • Kushner, H.J. and Yin, G. (1997), Stochastic Approximation Algorithms
      and Applications, NY: Springer-Verlag.
    • LeCun, Y., Simard, P.Y., and Pearlmutter, B. (1993), “Automatic
      learning rate maximization by online estimation of the Hessian’s
      eigenvectors,” in Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.),
      Advances in Neural Information Processing Systems 5, San Mateo, CA:
      Morgan Kaufmann, pp. 156-163.
    • Orr, G.B. and Leen, T.K. (1997), “Using curvature information for fast
      stochastic search,” in Mozer, M.C., Jordan, M.I., and Petsche, T.
      (eds.), Advances in Neural Information Processing Systems 9,
      Cambridge, MA: The MIT Press, pp. 606-612.

    Credits:

    • Archive-name: ai-faq/neural-nets/part1
    • Last-modified: 2002-05-17
    • URL: ftp://ftp.sas.com/pub/neural/FAQ.html
    • Maintainer: saswss@unx.sas.com (Warren S. Sarle)
    • Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC, USA.

© 2021 The Archive Base. All Rights Reserved