I am reading this document, and they stated that the weight adjustment formula is


Editorial Team

Asked: May 31, 2026 (2026-05-31T13:52:41+00:00)

I am reading this document, and they stated that the weight adjustment formula is this:

new weight = old weight + learning rate * delta * df(e)/de * input

The df(e)/de part is the derivative of the activation function, which is usually a sigmoid function like tanh.

What is this actually for?
Why are we even multiplying with that?
Why isn’t just learning rate * delta * input enough?

This question came after this one and is closely related to it: Why must a nonlinear activation function be used in a backpropagation neural network?.


1 Answer

Editorial Team · Answer 1 · 2026-05-31T13:52:42+00:00

Training a neural network just means finding values for every cell in the weight matrices (of which there are two for a NN having one hidden layer) such that the squared differences between the observed and predicted data are minimized. In practice, the individual weights comprising the two weight matrices are adjusted with each iteration (their initial values are often set to random values). Adjusting the weights after every single training pattern is called the online mode, as opposed to the batch mode, where the weight changes are accumulated over a full pass through the training set and applied once.
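The difference between the two update schedules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single tanh unit rather than a full network, a squared-error loss, and made-up data; the helper names (`predict`, `gradient`, `train_online`, `train_batch`, `total_error`) are hypothetical and not from the document being discussed.

```python
import math

def predict(w, x):
    """Output of a single tanh unit with weights w (hypothetical helper)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def gradient(w, x, t):
    """Gradient of 0.5 * (t - y)**2 with respect to each weight."""
    y = predict(w, x)
    common = -(t - y) * (1.0 - y * y)  # tanh'(e) = 1 - tanh(e)**2
    return [common * xi for xi in x]

def train_online(w, data, lr):
    """Online mode: update the weights after every single pattern."""
    w = list(w)
    for x, t in data:
        g = gradient(w, x, t)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def train_batch(w, data, lr):
    """Batch mode: sum the gradients over all patterns, then update once."""
    total = [0.0] * len(w)
    for x, t in data:
        g = gradient(w, x, t)
        total = [ti + gi for ti, gi in zip(total, g)]
    return [wi - lr * ti for wi, ti in zip(w, total)]

def total_error(w, data):
    """Sum of squared errors over the whole (made-up) data set."""
    return sum(0.5 * (t - predict(w, x)) ** 2 for x, t in data)

# Made-up patterns: (inputs, target). The first input acts as a bias term.
data = [([1.0, 0.5], 0.8), ([1.0, -1.0], -0.5), ([1.0, 2.0], 0.9)]
w0 = [0.1, -0.2]

w_online = train_online(w0, data, lr=0.1)
w_batch = train_batch(w0, data, lr=0.1)

print("start :", total_error(w0, data))
print("online:", total_error(w_online, data))
print("batch :", total_error(w_batch, data))
```

Both schedules lower the total error after one pass here, but they end up at slightly different weights, because the online version already uses the updated weights when it processes the second and third patterns.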

But how should the weights be adjusted–i.e., which direction +/-? And by how much?

That’s where the derivative comes in. A large value for the derivative will result in a large adjustment to the corresponding weight. This makes sense because a steep slope usually means the error can still be reduced substantially by changing that weight, whereas a slope near zero means you are close to a minimum (or on a flat region). Put another way, weights are adjusted at each iteration in the direction of steepest descent, i.e., opposite to the gradient, on the cost function’s surface defined by the total error (observed versus predicted).
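For a single output unit, the chain rule shows where the df(e)/de factor in the question's formula comes from. The derivation below is a sketch under assumptions not stated in the question: a squared-error cost E, an output y = f(e) with e the weighted sum of the inputs x_i, a target t, delta = t - y, and a learning rate written as \eta.

```latex
E = \tfrac{1}{2}(t - y)^2, \qquad y = f(e), \qquad e = \sum_i w_i x_i

\frac{\partial E}{\partial w_i}
  = \frac{\partial E}{\partial y} \cdot \frac{dy}{de} \cdot \frac{\partial e}{\partial w_i}
  = -(t - y) \, f'(e) \, x_i

w_i \leftarrow w_i - \eta \, \frac{\partial E}{\partial w_i}
  = w_i + \eta \, \delta \, f'(e) \, x_i
```

Under these assumptions, the factor f'(e) is simply the middle link of the chain rule: the weight affects the error only through the activation function, so the slope of that function at e has to appear in the gradient. Leaving it out gives learning rate * delta * input, which is the gradient only when f is the identity (a linear unit).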

After the error on each pattern is computed (subtracting the actual value of the response variable or output vector from the value predicted by the NN during that iteration), each weight in the weight matrices is adjusted in proportion to the calculated error gradient.
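One way to check that the df(e)/de factor really belongs in the error gradient is to compare the formula against a finite-difference estimate. The sketch below assumes a single tanh unit, a squared error of 0.5 * (t - y)**2, and delta = t - y; the weights, inputs, and function names are made up for illustration.

```python
import math

def forward(w, x):
    """Return the weighted sum e and the tanh output y."""
    e = sum(wi * xi for wi, xi in zip(w, x))
    return e, math.tanh(e)

def squared_error(w, x, t):
    _, y = forward(w, x)
    return 0.5 * (t - y) ** 2

def analytic_gradient(w, x, t):
    """dE/dw_i = -(t - y) * f'(e) * x_i, with f = tanh."""
    _, y = forward(w, x)
    delta = t - y
    df_de = 1.0 - y * y  # derivative of tanh at e
    return [-delta * df_de * xi for xi in x]

def gradient_without_derivative(w, x, t):
    """The same expression with the f'(e) factor dropped."""
    _, y = forward(w, x)
    return [-(t - y) * xi for xi in x]

def numeric_gradient(w, x, t, h=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += h
        w_minus = list(w)
        w_minus[i] -= h
        diff = squared_error(w_plus, x, t) - squared_error(w_minus, x, t)
        grads.append(diff / (2.0 * h))
    return grads

# Made-up values chosen so that the unit is saturated (e = 2.45).
w = [0.5, -1.5, 2.0]
x = [1.0, 0.3, 1.2]
t = 0.2

analytic = analytic_gradient(w, x, t)
numeric = numeric_gradient(w, x, t)
naive = gradient_without_derivative(w, x, t)

print("with f'(e)   :", analytic)
print("finite diff  :", numeric)
print("without f'(e):", naive)
```

With the factor included, the formula agrees with the numerical estimate. Without it, the sign is still right for this unit (tanh has a positive slope everywhere), but the magnitude is far too large because the unit is saturated and the error barely responds to the weights there.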

Because the error calculation begins at the end of the NN (i.e., at the output layer by subtracting observed from predicted) and proceeds to the front, it is called backprop.

More generally, the derivative (or gradient for multivariable problems) is used by the optimization technique (for backprop, plain gradient descent is the usual choice, with conjugate gradient as a common alternative) to locate minima of the objective (aka loss) function.

It works this way:

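The first derivative of a function at a point is the slope of the line tangent to the curve at that point. Where the first derivative is 0, the tangent line is horizontal.

So if you are walking around a 3D surface defined by the objective function and you walk downhill to a point where slope = 0, then you are at the bottom: you have found a minimum (whether global or local) for the function. A slope of 0 can also occur at a maximum or a saddle point, which is why it matters that you arrived by going downhill.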

But the first derivative is more important than that. It also tells you if you are going in the right direction to reach the function minimum.

It’s easy to see why this is so if you think about what happens to the slope of the tangent line as the point on the curve/surface is moved down toward the function minimum.

Near a minimum, the magnitude of the slope (hence of the derivative of the function at that point) gradually shrinks toward zero. In other words, to minimize a function, follow the derivative: step in the direction opposite to its sign, and if the function value keeps decreasing then you are moving in the correct direction.
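A one-dimensional example makes this concrete. The sketch below assumes a made-up objective f(x) = (x - 3)**2 with its minimum at x = 3, a fixed learning rate, and hypothetical function names; it is not the cost function of any particular network.

```python
def f(x):
    """Made-up objective with a single minimum at x = 3."""
    return (x - 3.0) ** 2

def df(x):
    """First derivative of f: the slope of the tangent line at x."""
    return 2.0 * (x - 3.0)

def descend(x, lr=0.1, steps=25):
    """Repeatedly step against the slope and record (x, slope) pairs."""
    history = []
    for _ in range(steps):
        slope = df(x)
        history.append((x, slope))
        x = x - lr * slope  # move opposite to the sign of the derivative
    return x, history

x_final, history = descend(x=0.0)

for x, slope in history[::5]:
    print(f"x = {x:.4f}  slope = {slope:.4f}  f(x) = {f(x):.4f}")
print("final x:", x_final)
```

Starting to the left of the minimum, the slope is negative, so each step moves x to the right. The slope's magnitude shrinks at every step, so the steps get smaller as x approaches 3.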
