I understand what Gradient Descent does. Basically it tries to move towards the local optimal solution by slowly moving down the curve. I am trying to understand what is the actual difference between the plain gradient descent and the Newton’s method?
From Wikipedia, I read this short line "Newton’s method uses curvature information to take a more direct route." What does this intuitively mean?
At a local minimum (or maximum)
x, the derivative of the target functionfvanishes:f'(x) = 0(assuming sufficient smoothness off).Gradient descent tries to find such a minimum
xby using information from the first derivative off: It simply follows the steepest descent from the current point. This is like rolling a ball down the graph offuntil it comes to rest (while neglecting inertia).Newton’s method tries to find a point
xsatisfyingf'(x) = 0by approximatingf'with a linear functiongand then solving for the root of that function explicitely (this is called Newton’s root-finding method). The root ofgis not necessarily the root off', but it is under many circumstances a good guess (the Wikipedia article on Newton’s method for root finding has more information on convergence criteria). While approximatingf', Newton’s method makes use off''(the curvature off). This means it has higher requirements on the smoothness off, but it also means that (by using more information) it often converges faster.