This article covers the content discussed in the Vanishing and Exploding Gradients module of the Deep Learning course offered on the website: https://padhai.onefourthlabs.in
Taking a closer look at the derivative wrt W:
As discussed in the previous article, we know that the derivative of the loss function (at time step t) wrt W can be written as below:
Let’s say we are computing the loss at the 4th time step. Then we need to compute the derivative of the loss (at the 4th time step) wrt W, and as discussed in the previous article, there are multiple paths from L4 to W, so we need to sum the derivatives along all the possible paths from L4 to W.
The highlighted portion in the below image shows the derivative of the loss (at time step t) wrt the hidden state s at the same time step t.
The part in the blue box sums the derivatives along all the possible paths from the hidden state (at time step t) to W. We have t such paths, so the summation goes from k = 1 to k = t, and for every value of k we have the derivative of st wrt sk followed by the derivative of sk wrt W.
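The formula itself appears as an image in the original; a LaTeX reconstruction of the standard backpropagation-through-time expression it describes is:

```latex
\frac{\partial L_t}{\partial W}
  = \frac{\partial L_t}{\partial s_t}
    \sum_{k=1}^{t}
    \frac{\partial s_t}{\partial s_k}
    \frac{\partial s_k}{\partial W}
```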
Now taking a closer look at the derivative of st wrt sk: we can write this derivative as the following
Let’s say we want to compute the derivative of s4 wrt s1, so in this case, we have t = 4 and k = 1
We can expand it using the chain rule: s1 leads to s2, s2 leads to s3, and s3 leads to s4, so reading the chain in the backward direction we can write the derivative of s4 wrt s1 as the following:
The number of factors in this chain is the difference between the two indices, i.e. t - k (here 4 - 1 = 3).
So, the term in the image below
is actually a product of (t - k) terms
And this product can be written compactly as
So, if we have t = 20 and k = 1, then we would have the product of 19 such terms.
The formula for the derivative of st wrt sk is given as(in compact form):
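Since the compact form is shown as an image, here is a LaTeX reconstruction of it:

```latex
\frac{\partial s_t}{\partial s_k}
  = \frac{\partial s_t}{\partial s_{t-1}} \,
    \frac{\partial s_{t-1}}{\partial s_{t-2}} \cdots
    \frac{\partial s_{k+1}}{\partial s_k}
  = \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}}
```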
Let’s take a look at one term from the above formula:
And we also have the equations for pre-activation and the activation as the following:
Now based on the relations in Fig 1 and Fig 2, we can write the value of the derivative depicted in Fig 1 as the following:
aj and sj would be represented as:
Every element of sj depends only on the corresponding element of aj, not on all the elements of aj (since the sigmoid is applied element-wise); for example, sj1 depends only on aj1.
Let’s take one example:
Let z be equal to x² + y²
Now we collectively store x and y in theta, so z becomes a function of theta
Now if we take the derivative of z wrt theta, that would give us two terms (so we say this derivative is 2-dimensional, since theta is 2-dimensional):
So, the quantity whose derivative we are computing(z in this case) is a scalar and the quantity wrt which we are computing the derivative(theta in this case) is a vector, then the derivative is going to be a vector.
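As a quick sanity check of this scalar-wrt-vector case, here is a small sketch in numpy using the same z = x² + y² (the finite-difference check is my addition, not from the course):

```python
import numpy as np

# z = x^2 + y^2, with x and y stored together in theta = [x, y].
def z(theta):
    x, y = theta
    return x**2 + y**2

# The derivative of a scalar wrt a 2-dimensional vector is a 2-dimensional vector.
def grad_z(theta):
    x, y = theta
    return np.array([2 * x, 2 * y])  # [dz/dx, dz/dy]

theta = np.array([3.0, 4.0])

# Verify the analytic gradient against central finite differences.
eps = 1e-6
num_grad = np.array([
    (z(theta + eps * e) - z(theta - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
print(grad_z(theta))  # [6. 8.]
```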
Now let’s say we have two values z1 and z2, both depending on x and y, we collectively store z1, z2 as z and x, y as theta
If we compute the derivative of z wrt theta that would be:
In short, if we take the derivative of a 2-dimensional quantity (z) wrt a 2-dimensional quantity (theta), the derivative consists of the derivative of every element of the numerator wrt every element of the denominator, so in this case our derivative would be of size 2 x 2.
So, the derivative of a vector with respect to a vector is going to be a matrix and the elements of the matrix are going to be all these pair-wise derivatives.
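The article leaves z1 and z2 unspecified, so here is a sketch with a hypothetical pair, z1 = x² + y² and z2 = x·y, showing that the vector-wrt-vector derivative is the 2 x 2 matrix of pairwise derivatives:

```python
import numpy as np

# Hypothetical choice (the article does not fix z1 and z2):
# z1 = x^2 + y^2 and z2 = x * y, with theta = [x, y].
def z(theta):
    x, y = theta
    return np.array([x**2 + y**2, x * y])

# The derivative of a 2-d vector wrt a 2-d vector is a 2 x 2 matrix:
# row i holds the derivative of z_i wrt every element of theta.
def jacobian_z(theta):
    x, y = theta
    return np.array([[2 * x, 2 * y],
                     [y,     x]])

theta = np.array([3.0, 4.0])
print(jacobian_z(theta))        # the matrix of all pairwise derivatives
print(jacobian_z(theta).shape)  # (2, 2)
```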
Having built this intuition, we will now use this concept in our problem of computing the derivative of sj wrt aj.
Now, in this case, sj is a d-dimensional vector and aj is also a d-dimensional vector.
Based on the above concept, we can say that the derivative of sj wrt aj is going to be a d x d matrix, and its elements would be the derivative of every element of sj wrt every element of aj:
Now let’s take sj2, its formula would be:
So, if we take the derivative of sj2 with respect to anything except aj2, then the derivative is going to be 0.
So, taking the above point into consideration, the derivative would look like
This is a diagonal matrix with all off-diagonal elements equal to 0.
So, we can write it like the below:
which conveys that it is a diagonal matrix whose diagonal entries are sigma-dash(aj), where sigma-dash(aj) is the collection of the terms sigma-dash(aj1), sigma-dash(aj2), and so on (sigma-dash represents the derivative of the sigmoid).
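This diagonal structure is easy to verify numerically; a small sketch comparing the analytic diagonal Jacobian of an element-wise sigmoid against finite differences:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d = 4
rng = np.random.default_rng(0)
a = rng.normal(size=d)

# Analytic Jacobian of the element-wise sigmoid: a diagonal matrix whose
# diagonal entries are sigma'(a_i) = sigma(a_i) * (1 - sigma(a_i)).
J = np.diag(sigmoid(a) * (1 - sigmoid(a)))

# Numerical Jacobian for comparison: perturb each a_i in turn.
eps = 1e-6
J_num = np.array([
    (sigmoid(a + eps * e) - sigmoid(a - eps * e)) / (2 * eps)
    for e in np.eye(d)
]).T
print(np.allclose(J, J_num, atol=1e-6))  # True: all off-diagonal entries are 0
```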
So, the first part(red box in the above image) is computed and it is equal to a diagonal matrix as derived above. Now let’s look at the second part
The relation between aj and s(j-1) is given as
In the above relation, the term containing xj has no dependency on s(j-1), and b has no dependency on s(j-1) either.
So, the derivative of aj wrt s(j-1) would be W.
You can also look at it this way: aj is a d-dimensional vector and s(j-1) is a d-dimensional vector, so the derivative of aj wrt s(j-1) is going to be a d x d matrix, and W indeed is a d x d matrix.
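A quick numerical check of this step (hypothetical d = 3 and random W, U, b, just for illustration): since aj is linear in s(j-1), its Jacobian wrt s(j-1) matches W exactly.

```python
import numpy as np

d = 3
rng = np.random.default_rng(1)
W = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
b = rng.normal(size=d)
x_j = rng.normal(size=d)
s_prev = rng.normal(size=d)  # stands for s(j-1)

def a(s):
    # Pre-activation: a_j = W s(j-1) + U x_j + b
    return W @ s + U @ x_j + b

# Numerical Jacobian of a_j wrt s(j-1); because a_j is linear in s(j-1),
# it should equal W up to floating-point error.
eps = 1e-6
J = np.array([
    (a(s_prev + eps * e) - a(s_prev - eps * e)) / (2 * eps)
    for e in np.eye(d)
]).T
print(np.allclose(J, W, atol=1e-5))  # True, and J is a d x d matrix
```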
The above term is just one of the parts that comes into the picture when we compute the derivative of the loss function wrt W. If this term is very large, the update will be very large, and if it is very small, the update will be very small. Neither case is desirable: if the update is tiny, we barely move from the current value of W, and if it is huge, we suddenly jump a very large distance from the current weights.
So, we want to bound the magnitude of this derivative of sj wrt s(j-1), as the parameter update depends on it.
Let’s compute the magnitude of the underlined part in the above image. aj is a d-dimensional vector and we take its element-wise derivative, so the magnitude of the matrix depends on the magnitude of its individual elements, and those elements are derivatives of the sigmoid (or tanh) function.
Now the derivative of this sigmoid of aj can be written as:
Now the maximum value of the quantity in the above image (the derivative of the sigmoid) can be found by differentiating it and equating the result to 0.
The derivative of the logistic function is sigmoid(aj) * (1 - sigmoid(aj)). Writing s for sigmoid(aj), this is s - s², which is maximised when 1 - 2s = 0, i.e. s = 1/2; substituting back gives (1/2 - 1/4) = 1/4. So for the logistic function the derivative is bounded by 1/4.
In the case of tanh non-linearity, we have
Again we maximise the quantity: writing s for tanh(aj), it is 1 - s², whose derivative wrt s is (-2s); equating that to 0 gives s = 0, and substituting back gives the maximum value (1 - 0) = 1.
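Both maxima can also be checked numerically by evaluating the derivatives on a grid instead of solving analytically:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-10.0, 10.0, 100001)

sig_deriv = sigmoid(a) * (1 - sigmoid(a))  # derivative of the logistic function
tanh_deriv = 1 - np.tanh(a) ** 2           # derivative of tanh

print(sig_deriv.max())   # approximately 0.25, attained near a = 0
print(tanh_deriv.max())  # approximately 1.0, attained near a = 0
```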
Since the elements of the matrix are bounded, the magnitude of the diagonal matrix itself is also bounded.
Now the equation below
could be re-written as
W is the weight matrix that we initialize at the start (we won’t initialize it to infinite values), so its magnitude lies within some limit (it is bounded) and remains bounded even after the updates. Let this bound be lambda; then the overall equation can be written as:
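In LaTeX, the per-step bound being described (with gamma bounding the norm of the diagonal matrix, 1/4 for logistic and 1 for tanh, and lambda bounding the norm of W) is:

```latex
\left\| \frac{\partial s_j}{\partial s_{j-1}} \right\|
  = \left\| \operatorname{diag}\!\left(\sigma'(a_j)\right) W \right\|
  \le \gamma \lambda
```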
Our original derivative has the derivative of st wrt to sk and in that, we came up to the part of derivative of sj wrt s(j-1)
So, our original derivative is
And this derivative of st wrt sk is the product of many such per-step derivatives (the right-hand side of the above equation). Each of those factors is bounded, and we know that its maximum value could be (gamma * lambda).
So, the overall value of the derivative of st wrt sk is going to be less than or equal to the value in the red box (in the below image).
So the overall value of the derivative of st wrt sk is a product of many terms, all of which lie within some limit. If these terms are small, the product is going to be even smaller, and if these terms are even moderately large, the product of so many of them makes the overall value very large. For example, if each of these values is 1/2 and there are 10 such terms, the final value is (0.5 raised to the power 10, roughly 0.001); if instead each term is 2, the final value is (2 raised to the power 10 = 1024).
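A two-line numerical illustration of that example:

```python
# Ten bounded factors multiplied together, as in the example above.
small = 0.5 ** 10  # each factor 1/2 -> the product vanishes towards 0
large = 2.0 ** 10  # each factor 2   -> the product explodes
print(small, large)  # 0.0009765625 1024.0
```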
So, the overall point is:
While computing the derivative of the loss function wrt W, we have one term corresponding to the derivative of st wrt sk. On analysing it further, we saw that this derivative is actually a product of many terms, so we analysed one such term and arrived at the conclusion that each of them is bounded, with maximum value (gamma * lambda). Therefore, with many such terms, the upper limit (maximum value) is (gamma * lambda) raised to the power (t - k).
So, if (gamma * lambda) lies in the range 0 to 1, then raising it to a very high power diminishes the final value towards 0, which results in the problem of vanishing gradients; on the other hand, if (gamma * lambda) is greater than 1, the overall quantity becomes very large, which results in exploding gradients. This is one of the problems associated with training RNNs.
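To see the vanishing case end to end, here is a small sketch (hypothetical dimensions and random weights, not from the course) that multiplies the per-step Jacobians diag(sigma'(aj)) @ W of a sigmoid RNN and tracks the norm of the accumulated product:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d = 10
rng = np.random.default_rng(42)
W = 0.5 * rng.normal(size=(d, d)) / np.sqrt(d)  # modestly scaled weights

s = rng.normal(size=d)
J = np.eye(d)  # accumulates the derivative of s_t wrt s_1
norms = []
for _ in range(20):
    a = W @ s                 # inputs and bias omitted for simplicity
    s = sigmoid(a)
    # One factor of the product: diag(sigma'(a_j)) @ W
    step_jac = np.diag(sigmoid(a) * (1 - sigmoid(a))) @ W
    J = step_jac @ J
    norms.append(np.linalg.norm(J))

print(norms[0], norms[-1])  # the norm collapses towards 0 as t - k grows
```

With weights scaled this way each factor has norm well below 1, so twenty factors drive the gradient norm to essentially zero; scaling W up instead would make the same product explode.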
All the images used in this article are taken from the content covered in the Vanishing and Exploding Gradients module of the Deep Learning course on the site: padhai.onefourthlabs.in