According to Wikipedia, with the delta rule we adjust the weight by:
dw = alpha * (ti - yi) * g'(hj) * xi
where alpha = learning constant, ti = true answer, yi = perceptron’s guess, g' = the derivative of the activation function g with respect to the weighted sum of the perceptron’s inputs (hj), and xi = input.
The part that I don’t understand in this formula is the multiplication by the derivative g'. Let g = sign(x) (the sign of the weighted sum). Then g' is always 0, and so dw = 0. However, in code examples I saw on the internet, the writers just omitted the g' and used the formula:
dw = alpha * (ti - yi) * (hj) * xi
I will be glad to read a proper explanation!
Thank you in advance.
You’re correct that if you use a step function for your activation function g, the gradient is always zero (except at 0, where it is undefined), so the delta rule (aka gradient descent) just does nothing (dw = 0). This is why a step-function perceptron doesn’t work well with gradient descent. 🙂
For a linear perceptron, you’d have g'(x) = 1, so dw = alpha * (t_i - y_i) * x_i.
You’ve seen code that uses dw = alpha * (t_i - y_i) * h_j * x_i. We can reverse-engineer what’s going on here: apparently g'(h_j) = h_j, which, remembering our calculus, means we must have g(x) = x^2/2 + constant. So, taken literally, the code sample you found uses a quadratic activation function. (If the extra factor were the neuron’s output y_j = g(h_j) rather than the weighted sum h_j, we would have g' = g, and the activation would be exponential, g(x) = e^x.)
Either way, this must mean that the neuron outputs are bounded below: on [a, infinity) for g(x) = x^2/2 + a, or on (0, infinity) for g(x) = e^x. I haven’t run into this before, though I see some references online to exponential activations. Logistic or tanh activations are more common for bounded outputs (either classification or regression with known bounds).
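
To make the comparison concrete, here is a minimal Python sketch of a single delta-rule update for one neuron. It is not taken from Wikipedia or from the code you found. The function names (`delta_rule_dw`, `perceptron_dw`) and the sample numbers are made up for illustration. The last variant is the classic perceptron learning rule, which drops the g' factor entirely. I am only assuming that this is what some of the code examples you saw do.

```python
def weighted_sum(w, x):
    """h = sum_i w_i * x_i, the weighted sum of the neuron's inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))


def delta_rule_dw(alpha, t, w, x, g, g_prime):
    """Delta rule: dw_i = alpha * (t - y) * g'(h) * x_i, with y = g(h)."""
    h = weighted_sum(w, x)
    y = g(h)
    return [alpha * (t - y) * g_prime(h) * xi for xi in x]


def sign(h):
    return 1.0 if h >= 0 else -1.0


def sign_prime(h):
    # Zero everywhere except h = 0, where it is undefined; 0 is used there too.
    return 0.0


def linear(h):
    return h


def linear_prime(h):
    return 1.0


def quadratic(h):
    return 0.5 * h * h


def quadratic_prime(h):
    # This is the only case where g'(h) = h, matching the h_j factor in the question.
    return h


def perceptron_dw(alpha, t, w, x):
    """Classic perceptron learning rule: step output, no g' factor at all."""
    y = sign(weighted_sum(w, x))
    return [alpha * (t - y) * xi for xi in x]


# Hypothetical sample values, chosen only for illustration.
alpha, t = 0.1, 1.0
w = [0.5, -1.0]
x = [1.0, 2.0]  # weighted sum h = 0.5 - 2.0 = -1.5

dw_step = delta_rule_dw(alpha, t, w, x, sign, sign_prime)
dw_linear = delta_rule_dw(alpha, t, w, x, linear, linear_prime)
dw_quadratic = delta_rule_dw(alpha, t, w, x, quadratic, quadratic_prime)
dw_perceptron = perceptron_dw(alpha, t, w, x)

print("step + delta rule:     ", dw_step)  # all zeros: no learning
print("linear + delta rule:   ", dw_linear)
print("quadratic + delta rule:", dw_quadratic)
print("perceptron rule:       ", dw_perceptron)
```

With the step activation the update is zero even though the guess is wrong, while the other three variants move the weights. The perceptron rule still learns with a step output because it is not computed as a gradient through g.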