
Vanishing and Exploding Gradients

Parveen Khurana
13 min read · Jul 16, 2019


In the last article, we discussed recurrent neural networks and the 6 jars of machine learning in the context of recurrent neural networks.

In this article, we touch upon the problem of vanishing and exploding gradients, which sets the context for the use of LSTMs and GRUs.

Taking a closer look at the derivative with respect to the weight matrix (W)

In sequence-based problems (say, sequence labeling), the model predicts an output distribution at every time step. Given the true distribution, the loss at that time step can be computed using the cross-entropy loss, and the total loss for a training instance is then the sum of the losses over all time steps.
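As a rough formulation (the per-step loss symbol Lₜ, the sequence length T, and the class index c are notation assumed here, not from the original):

$$L(\theta) = \sum_{t=1}^{T} L_t(\theta), \qquad L_t(\theta) = -\sum_{c} y_{t,c} \, \log \hat{y}_{t,c}$$

where $y_t$ is the true distribution and $\hat{y}_t$ is the predicted distribution at time step $t$.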

As discussed in this article, the derivative of the loss function (at time step “t”) with respect to “W” can be written as below:
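A reconstruction of that expression, assuming the standard RNN recurrence $s_k = \sigma(W s_{k-1} + U x_k + b)$ (this recurrence and the symbols $s$, $U$, $x$, $b$ are assumptions here, introduced only to state the formula):

$$\frac{\partial L_t(\theta)}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t(\theta)}{\partial s_t} \, \frac{\partial s_t}{\partial s_k} \, \frac{\partial s_k}{\partial W}$$

where $s_k$ is the hidden state at time step $k$, and $\partial s_t / \partial s_k$ itself expands into a product of Jacobians over the intermediate time steps, $\prod_{j=k+1}^{t} \partial s_j / \partial s_{j-1}$.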

Let’s say the loss is to be computed at the 4ᵗʰ time step (L₄). There might be multiple paths that lead from L₄ to W, and we would need to sum up the derivative values along all these possible paths (article).
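To make the path-summing concrete for t = 4 (using the same assumed notation as above), the sum unrolls as:

$$\frac{\partial L_4}{\partial W} = \frac{\partial L_4}{\partial s_4}\frac{\partial s_4}{\partial s_1}\frac{\partial s_1}{\partial W} + \frac{\partial L_4}{\partial s_4}\frac{\partial s_4}{\partial s_2}\frac{\partial s_2}{\partial W} + \frac{\partial L_4}{\partial s_4}\frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial W} + \frac{\partial L_4}{\partial s_4}\frac{\partial s_4}{\partial W}$$

one term per path from L₄ back to W: through s₁, s₂, s₃, or directly through s₄.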
