Vanishing and Exploding Gradients
In the last article, we discussed recurrent neural networks and the 6 jars of machine learning in the context of recurrent neural networks.
In this article, we touch upon the problem of vanishing and exploding gradients, which sets the context for the use of LSTMs and GRUs.
Taking a closer look at the derivative with respect to the weight matrix (W)
In sequence-based problems (say, sequence labeling), the model predicts an output distribution at every time step. Given the true distribution, the loss at that step can be computed using the cross-entropy loss, and the total loss for a training instance is the sum of the losses over all time steps.
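Written out explicitly, with ŷ_t denoting the predicted distribution, y_t the true (one-hot) distribution at time step t, and T the sequence length (notation assumed here for illustration, not taken from the original figures), this gives:

L_t(\theta) = -\sum_{c} y_{t,c}\,\log \hat{y}_{t,c}
\qquad\text{and}\qquad
L(\theta) = \sum_{t=1}^{T} L_t(\theta)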
As discussed in this article, the derivative of the loss function at time step "t" with respect to "W" can be written as a sum over all the time steps up to "t", since "W" influences every hidden state along the way.
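In standard back-propagation-through-time notation, with s_t denoting the hidden state at time step t (notation assumed here, following common convention rather than the article's own figures), the derivative has the form:

\frac{\partial L_t(\theta)}{\partial W}
= \frac{\partial L_t(\theta)}{\partial s_t}
\sum_{k=1}^{t} \frac{\partial s_t}{\partial s_k}\,\frac{\partial s_k}{\partial W}

where ∂s_k/∂W is the direct (single-step) derivative of s_k with respect to W, and the factor ∂s_t/∂s_k is a product of Jacobians over the intermediate steps:

\frac{\partial s_t}{\partial s_k}
= \prod_{j=k}^{t-1} \frac{\partial s_{j+1}}{\partial s_j}

It is this long product of Jacobians that can shrink towards zero (vanishing gradients) or blow up (exploding gradients) as the gap t - k grows, which is exactly the problem that motivates LSTMs and GRUs.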
Let’s say the loss is to be computed at the 4ᵗʰ time step (L₄). There are multiple paths leading from L₄ back to W, and we need to sum up the derivative values along all of these possible paths (article).
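As a concrete illustration (using the same assumed notation as above), the t = 4 case expands into one term per path:

\frac{\partial L_4(\theta)}{\partial W}
= \frac{\partial L_4(\theta)}{\partial s_4}
\left[
\frac{\partial s_4}{\partial s_1}\frac{\partial s_1}{\partial W}
+ \frac{\partial s_4}{\partial s_2}\frac{\partial s_2}{\partial W}
+ \frac{\partial s_4}{\partial s_3}\frac{\partial s_3}{\partial W}
+ \frac{\partial s_4}{\partial W}
\right]

Each bracketed term corresponds to one path: W acts directly at some time step k, and its effect then propagates forward through the chain of hidden states s_k → s_{k+1} → ... → s_4 via the Jacobian product ∂s_4/∂s_k.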