Thursday, September 18, 2014

Neural Network Optimization Algorithm Visualisation



Optimization is a mathematical discipline that determines the “best” solution in a quantitatively well-defined sense. Mathematical optimization of processes governed by partial differential equations has seen considerable progress in the past decade, and optimization has since been applied across a wide variety of disciplines, e.g., science, engineering, mathematics, economics, and even commerce. Optimization theory provides algorithms for solving well-structured optimization problems, together with the analysis of those algorithms. A typical optimization problem consists of an objective function that is to be minimized or maximized subject to given constraints. Optimization algorithms in machine learning (especially in neural networks) aim to minimize an objective function (generally called the loss or cost function), which intuitively measures the difference between the predicted outputs and the expected values.
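As a concrete illustration (a minimal sketch, not code from this post), the snippet below minimizes a mean-squared-error loss for a linear model with plain gradient descent; the data, learning rate, and step count are all illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                          # parameters to optimize
lr = 0.1                                 # learning rate (step size)
for step in range(200):
    residual = X @ w - y                 # predicted minus expected values
    loss = np.mean(residual ** 2)        # the objective (loss) function
    grad = 2.0 * X.T @ residual / len(y) # gradient of the loss w.r.t. w
    w -= lr * grad                       # move against the gradient

print(w)  # approaches true_w as the loss is driven down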

Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar, parameterized objective function requiring maximization or minimization with respect to its parameters. Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space. Several optimization algorithms based on gradient descent exist in the literature; the ones visualized below include SGD, Momentum, Nesterov Accelerated Gradient (NAG), Adagrad, Adadelta, and RMSProp.

(further reading: https://medium.com/analytics-vidhya/a-complete-guide-to-adam-and-rmsprop-optimizer-75f4502d83be, Feb 2021)
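To make the basic update concrete, here is a hedged sketch of a plain mini-batch SGD loop; the variants listed above differ mainly in how they transform the gradient before applying it. The grad_fn argument is a hypothetical helper supplied by the caller, and the batch size, learning rate, and epoch count are illustrative.

import numpy as np

def sgd_step(w, grad, lr=0.01):
    # vanilla SGD: move the parameters against the gradient
    return w - lr * grad

def train(w, X, y, grad_fn, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # the gradient is estimated on a random mini-batch,
            # which is what makes the method "stochastic"
            w = sgd_step(w, grad_fn(w, X[batch], y[batch]), lr)
    return w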



Visualizing Optimization Algorithms (algos)


Algos without scaling based on gradient information really struggle to break symmetry here: SGD gets nowhere, and Nesterov Accelerated Gradient (NAG) / Momentum exhibit oscillations until they build up velocity in the optimization direction.

Algos that scale step size based on the gradient quickly break symmetry and begin descent.
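A hedged toy sketch of that difference (the surface and hyperparameters are illustrative, not the ones behind the animation): on an objective whose gradient in one direction is ten-thousand times smaller, a fixed-step method barely moves in that direction, while Adagrad's per-parameter scaling still takes useful steps.

import numpy as np

def grad(p):
    x, y = p
    # illustrative objective f(x, y) = x**2 + 1e-4 * y**2
    return np.array([2.0 * x, 2e-4 * y])

p_sgd = np.array([1.0, 1.0])
p_ada = np.array([1.0, 1.0])
g2_sum = np.zeros(2)
lr, eps = 0.1, 1e-8

for _ in range(100):
    p_sgd = p_sgd - lr * grad(p_sgd)                  # one step size for all dims
    g = grad(p_ada)
    g2_sum += g ** 2                                  # accumulated squared gradients
    p_ada = p_ada - lr * g / (np.sqrt(g2_sum) + eps)  # per-parameter scaled step

print("SGD    :", p_sgd)   # the y coordinate has barely moved
print("Adagrad:", p_ada)   # the y coordinate has made real progress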




Due to the large initial gradient, velocity-based techniques shoot off and bounce around; Adagrad almost goes unstable for the same reason.

Algos that scale gradients/step sizes, like Adadelta and RMSProp, proceed more like accelerated SGD and handle large gradients with more stability.
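A sketch of the RMSProp update that gives this stability (the hyperparameters shown are common defaults, not values from the animations): the step is divided by a running root mean square of recent gradients, so even a huge initial gradient produces a step on the order of the learning rate.

import numpy as np

def rmsprop_step(w, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    # exponentially decayed average of squared gradients
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(avg_sq) + eps)   # normalized step
    return w, avg_sq

w = np.array([0.0])
avg_sq = np.zeros(1)
for g in [1000.0, 900.0, 800.0]:                  # artificially large gradients
    w, avg_sq = rmsprop_step(w, np.array([g]), avg_sq)
    print(w)                                      # each step stays near lr in size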


Behavior around a saddle point.

NAG/Momentum again like to explore around, almost taking a different path. 

Adadelta/Adagrad/RMSProp proceed like accelerated SGD.
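A hedged sketch of the Momentum and NAG updates on a toy saddle f(x, y) = x**2 - y**2 (the starting point and hyperparameters are illustrative): the accumulated velocity is what eventually carries both methods off the flat direction and away from the saddle.

import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])      # gradient of x**2 - y**2

def momentum_step(p, v, lr=0.01, mu=0.9):
    v = mu * v - lr * grad(p)                  # accumulate velocity
    return p + v, v

def nag_step(p, v, lr=0.01, mu=0.9):
    v = mu * v - lr * grad(p + mu * v)         # gradient at the look-ahead point
    return p + v, v

p_m = np.array([1.0, 1e-3]); v_m = np.zeros(2)
p_n = np.array([1.0, 1e-3]); v_n = np.zeros(2)
for _ in range(200):
    p_m, v_m = momentum_step(p_m, v_m)
    p_n, v_n = nag_step(p_n, v_n)

print(p_m, p_n)   # x has decayed toward 0; y has grown, i.e. both escaped the saddle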
Reference: