Second-order optimization methods and Neural Networks

Optimization of deep Neural Networks is often done using Gradient-based methods such as mini-batch gradient descent and its extensions, such as Momentum, RMSprop, and Adam. Second-order optimization methods such as Newton's method, BFGS, etc. are widely used in different areas of statistics and Machine Learning. Why are these methods not popular in deep learning?

Hessian Matrix

One of the fundamental building blocks of second-order optimization methods is the Hessian matrix. In an N-dimensional parameter space, the Hessian is an NxN matrix whose (i, j) element is the second partial derivative of the cost function with respect to the i-th and j-th parameters.
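
To make the definition concrete, here is a minimal sketch that approximates the Hessian of a toy quadratic cost using central finite differences. The helper name hessian_fd and the example cost are illustrative choices of mine, not part of any particular library.

```python
import numpy as np

def hessian_fd(cost, theta, eps=1e-4):
    """Approximate H[i, j] = d^2 cost / (d theta_i d theta_j) at theta
    using central finite differences (requires O(N^2) cost evaluations)."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (cost(theta + e_i + e_j) - cost(theta + e_i - e_j)
                       - cost(theta - e_i + e_j) + cost(theta - e_i - e_j)) / (4 * eps**2)
    return H

# For the quadratic cost 0.5 * theta^T A theta (A symmetric), the Hessian is A itself.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
cost = lambda theta: 0.5 * theta @ A @ theta
print(hessian_fd(cost, np.array([1.0, -1.0])))  # ~[[3, 1], [1, 2]]
```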

Parameter update in second-order methods

The parameter update step inside many second-order methods involves computing the Hessian matrix, taking its inverse, and then multiplying the inverse Hessian by the gradient of the cost function, an N-dimensional vector containing the partial derivatives of the cost with respect to the parameters.
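
A minimal sketch of one such update (a plain Newton step) might look like the following; grad and hess are placeholders for whatever routines supply the derivatives, and the example solves the linear system rather than forming the explicit inverse, which is the usual practice but has the same cubic cost.

```python
import numpy as np

def newton_step(theta, grad, hess):
    """One Newton update: theta <- theta - H(theta)^{-1} g(theta).
    Solving H x = g is cheaper and more stable than explicitly inverting H,
    but still scales as O(N^3) for a dense N x N Hessian."""
    g = grad(theta)   # N-dimensional gradient vector
    H = hess(theta)   # N x N Hessian matrix
    return theta - np.linalg.solve(H, g)

# Example on the quadratic cost 0.5 * theta^T A theta, whose minimum is at 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda theta: A @ theta
hess = lambda theta: A
print(newton_step(np.array([1.0, -1.0]), grad, hess))  # jumps to ~[0, 0] in one step
```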

In a typical deep network, the number of parameters, denoted here by N, can be in the millions. The computational complexity of inverting an NxN matrix is O(N^3) with standard algorithms (roughly O(N^2.4) for the asymptotically fastest known ones). Therefore, taking the inverse of the Hessian when optimizing the parameters of a deep network scales as N^3, with N on the order of 10^6.
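
As a rough, machine-dependent illustration of that cubic growth, the snippet below times NumPy's dense matrix inverse at a few sizes; doubling N should multiply the wall time by roughly a factor of eight.

```python
import time
import numpy as np

# Time dense matrix inversion at a few sizes; expect roughly cubic growth.
for n in (500, 1000, 2000):
    A = np.random.randn(n, n) + n * np.eye(n)   # well-conditioned test matrix
    t0 = time.perf_counter()
    np.linalg.inv(A)
    print(f"N = {n:5d}: {time.perf_counter() - t0:.3f} s")
```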

In the limit where the dimensionality of the parameter space is large (a typical deep neural net), an optimization step that involves taking the inverse of the Hessian matrix can be very expensive. For that reason, it may not be practical to use methods such as Newton's method for optimizing deep neural nets.