The whiteboard analogy to deal with vanishing and exploding gradients

Parveen Khurana
Mar 20, 2022


In the previous article, we briefly touched upon the problem of vanishing and exploding gradients, especially when dealing with longer sequences. In this article, we discuss a strategy termed “selectively read, selectively write, selectively forget” that helps deal with this problem.

In a recurrent neural network, especially one with a long chain between the input and the output, suppose there is a high loss at the 4ᵗʰ time step. Say this high loss value arises because s₁ was not computed properly, and because of that, s₂, s₃, and s₄ were not computed properly either; and s₁ was not computed correctly because the corresponding weight matrix W on which it depends was not correct.

Now this feedback needs to go back to W: the loss is very high because s₁ is not good enough, because W is not configured properly, and hence W should change. This information has to flow through a long chain from the loss to s₄ to s₃ to s₂ to s₁ to W.

And if there happens to be a long chain between the loss value and s₁, then there could be a problem of vanishing or exploding gradients: the gradient value is a product of many terms, and if all those terms are small in magnitude, the overall product (the gradient) vanishes, whereas if all the terms are large in magnitude, the product becomes huge and the gradient explodes. To solve this problem, the LSTM comes into the picture.
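As a minimal numeric sketch of this effect (the magnitudes 0.5 and 1.5 and the chain length of 50 are chosen purely for illustration), a product of many terms below 1 collapses towards zero, while a product of terms above 1 blows up:

```python
# A gradient that is a product of many terms either vanishes
# or explodes depending on the magnitude of those terms.
small, large, steps = 0.5, 1.5, 50

vanishing = small ** steps   # ~8.9e-16: the gradient all but disappears
exploding = large ** steps   # ~6.4e+08: the gradient blows up

print(f"product of {steps} terms of magnitude {small}: {vanishing:.2e}")
print(f"product of {steps} terms of magnitude {large}: {exploding:.2e}")
```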

Dealing with Longer Sequences

With RNNs, the output at every time step depends on the previous inputs as well as the current one, and the way this works is by maintaining a state vector sₜ that is a function of the current input and the previous state sₜ₋₁.

At every time step, the information stored in the hidden state sₜ is updated based on the current time step’s input, the previous time step’s hidden state sₜ₋₁, and the bias term; the information in the current time step’s hidden state is therefore a morphed version of the information in the previous time step’s hidden state.
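A minimal sketch of this update, assuming a vanilla RNN with input weights U, recurrent weights W, and bias b (these names and the toy dimensions are illustrative, not from the original article):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, b):
    # One vanilla-RNN update: the new state is a function of the
    # current input x_t and the previous state s_prev.
    return np.tanh(U @ x_t + W @ s_prev + b)

# Toy sizes: 4-dimensional inputs, 8-dimensional state vector.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(8, 4))
W = rng.normal(scale=0.1, size=(8, 8))
b = np.zeros(8)

s = np.zeros(8)              # initial state s0
x = rng.normal(size=4)       # input at the current time step
s = rnn_step(x, s, U, W, b)  # s1 = f(x1, s0)
```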

Now if there is a very long sequence of, say, 15–20 words, then by the time the model reaches the last word, the information from the first hidden state would have been completely morphed, and in some cases this might affect the result drastically.

So, this is the problem with using RNNs for longer sequences (a typical sentence on Wikipedia is 20–25 words, so what was learned at time step 1 will have vanished, or rather been morphed, by the time the model reaches time step 25). And if the parameters are updated using backpropagation through time, this also causes the problem of vanishing/exploding gradients.
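A small sketch of this morphing effect (random weights and toy dimensions, chosen only for illustration): after 25 updates, the state bears little resemblance to the state right after the first input:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(8, 4))
W = rng.normal(scale=0.5, size=(8, 8))
b = np.zeros(8)

s = np.tanh(U @ rng.normal(size=4) + b)   # state after the first input
s1 = s.copy()

for t in range(24):                        # 24 more time steps
    s = np.tanh(U @ rng.normal(size=4) + W @ s + b)

# Cosine similarity between the state at step 25 and at step 1:
# a value near 0 means the step-1 information has been morphed away.
cos = s @ s1 / (np.linalg.norm(s) * np.linalg.norm(s1))
print(f"cosine similarity between s1 and s25: {cos:.3f}")
```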

The whiteboard analogy

The above-described scenario can be understood through the following whiteboard analogy.

So, assume there is a whiteboard of fixed width and height, and information is written on it every 10–15 seconds or so. After a while, this whiteboard would become so messy that it would be very difficult to figure out what information was written at time step 1 when, say, the current time step is 100.

The same problem persists in recurrent neural networks, especially when the input is quite a long sequence: the state vector s is also of fixed size (d-dimensional, just as the whiteboard is of fixed size), and this state vector is recomputed at every time step.

After a while, it becomes very difficult to know what the first input contributed to the current state vector sₜ, and this becomes even more challenging in the case of longer sequences; ideally, the expectation is that each time step’s input contributes significantly to the output.

The intuition for the solution to this also comes from the whiteboard analogy.

Suppose there is a fixed-size whiteboard and the intent is to solve something on it. Typically, the strategy of selectively read, selectively write, and selectively forget is followed:

For example, let’s say there are some initial values for a given set of variables a, b, c, and d (the walkthrough below uses a = 1 and d = 11), and the intent is to compute the quantity ad + ac(bd + a) and note down the steps taken to compute this output (to add a twist, let’s say there is a constraint that only 3 statements can be written on the whiteboard at a time):

The way to proceed with this computation is to build up the following intermediate results:

1. ac
2. bd
3. bd + a
4. ac(bd + a)
5. ad
6. ad + ac(bd + a)

So, there are a total of 6 steps, but all 6 steps cannot be written on the whiteboard at the same time, as the whiteboard has a small, fixed size.

The best way is to start by computing the 3 values below (the whiteboard can accommodate only 3 statements at a time; note that “ac” and “bd” are computed and only their final values are noted on the whiteboard):

- ac
- bd
- bd + a

Now the next computation is ac(bd + a), which means “bd” is not needed on the whiteboard anymore (it has already been used for computing “bd + a”), so “bd” is erased from the whiteboard; this resonates with point 3 of the strategy, i.e., selectively forget some content.

And the value of ac(bd + a) is computed using the values of “ac” and “bd + a”, which resonates with point 2 of the strategy, i.e., selectively read.

Now the whiteboard is filled up again, and the next value needed is “ad”; the values of “ac” and “bd + a” are not needed anymore, so either one or both of these two statements can be removed from the whiteboard.

The value of “ad” is written directly as 11 (and not as “1 * 11 = 11”, which would have taken 2 steps); this corresponds to point 1 of the strategy, selectively write.

And computing the value of “ad” just takes the values of “a” and “d” from the board, which again corresponds to selective read.

So, again we selectively erase/forget the value of “ac” from the board and then selectively write the required value, which is “ad + ac(bd + a)”.
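A minimal sketch of this whiteboard discipline in code (the capacity of 3 statements and the values a = 1, d = 11 come from the walkthrough above; b = 3 and c = 5 are placeholder values chosen only for illustration):

```python
# A fixed-capacity "whiteboard" that selectively writes, reads,
# and forgets to compute ad + ac(bd + a) using only 3 slots.
a, b, c, d = 1, 3, 5, 11   # a and d as in the walkthrough; b, c illustrative

CAPACITY = 3
board = {}                 # at most CAPACITY statements at a time

def write(name, value):
    # Selectively write: note only the final value, never the working.
    assert len(board) < CAPACITY, "board is full - forget something first"
    board[name] = value

def forget(name):
    # Selectively forget: erase a statement that is no longer needed.
    del board[name]

write("ac", a * c)                                   # step 1
write("bd", b * d)                                   # step 2
write("bd + a", board["bd"] + a)                     # step 3: read bd
forget("bd")                                         # bd no longer needed
write("ac(bd + a)", board["ac"] * board["bd + a"])   # step 4: read ac, bd + a
forget("bd + a")
write("ad", a * d)                                   # step 5: written directly as 11
forget("ac")
write("ad + ac(bd + a)", board["ad"] + board["ac(bd + a)"])  # step 6
print(board["ad + ac(bd + a)"])                      # 181 with these values
```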

This is how we typically deal with the short memory, or the small, fixed size, of a whiteboard that can only store so much information.

In the next article, we touch upon how to use this strategy in the case of recurrent neural networks.

References: PadhAI
