Second order optimization methods and Neural Networks
Published:
Optimization of deep Neural Networks is often done using Gradient-based methods such as mini-batch gradient descent and its extensions such as Momentm, RMSprop, and Adam. Second order optimization methods such as Newton, BFGS, etc are widely used in different areas of statsitics and Machine Learning. Why are these methods are not popular in deep learning?
Hessian Matrix
One of the fundamental building blocks of the second order optimization methods is the Hessian matrix. In an N dimensional parameter space, the Hessian matrix is going to be an NxN dimensional matrix whose i and j elements are given by the i and j elements of the cost function with respect to the i-th and j-th parameters respectively.
Parameter update in second-order methods
The parameter update step inside many second order methods involves computing the Hessian matrix, taking its inverse and then multiplying the inverse Hessian matrix by the Jacobian matrix, an N dimensional vector containing the partial derivatives of the cost function with respect to parameters.
In a typical deep network, the number of parameters, denoted here by N, can be as large as millions. We know that the computational complexity of computing an inverse of an NxN matrix is N3 (or in the best case scenario N log N). Therfore, taking the inverse of a Hessian matrix in optimizing the parameters of a deep network scales as N3 where O(N) is 106.
In the limit where the dimensional of the parameter space is large (a typical deep neural net), an optimization step that involves taking the inverse of the Hessian matrix can be very expensive. For that reason, it may not be practical to use methods such as Newton for optimizing deep neural nets.