# The whiteboard analogy to deal with vanishing and exploding gradients

In the previous article, we briefly touched upon the problem of **vanishing and exploding gradients**, especially when dealing with longer sequences. In this article, we discuss one strategy, termed *selectively read, selectively write and selectively forget*, that helps deal with this problem.

In a recurrent neural network, especially one with a long chain between the input and the output, suppose there is a loss at the **4ᵗʰ time step**, and say this *high loss value* is because *s₁* was not computed properly, because of which *s₂*, *s₃*, and *s₄* were not computed properly either; and *s₁* was not computed correctly because the corresponding *W* on which it depends was not correct.

Now this feedback needs to go back to *W*: the *loss is very high* because *s₁ is not good enough*, hence *W is not configured properly* and *W should change*. This *information has to flow through a long chain* from the loss to *s₄*, to *s₃*, to *s₂*, to *s₁*, and finally to *W*.

And because there is a *long chain* between the *loss value and s₁*, there can be a problem of *vanishing and exploding gradients*: the gradient is a product of many terms, and if all those terms are small in magnitude the overall product (the gradient) vanishes, whereas if all of them are large in magnitude the product blows up and the gradient explodes. *To solve this problem, LSTM comes into the picture*.
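The effect of that long chain can be sketched numerically. In this toy sketch a single scalar stands in for the magnitude of each per-time-step gradient factor (in a real network each factor would be a Jacobian ∂sₜ/∂sₜ₋₁); the factor values 0.5 and 1.5 are illustrative, not from the article:

```python
def chain_gradient(factor, steps):
    """Multiply one per-time-step gradient factor `steps` times.

    A scalar stand-in for the product of Jacobians that arises in
    backpropagation through time: small factors make the product
    vanish, large factors make it explode.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

print(chain_gradient(0.5, 20))  # vanishes: 0.5**20, roughly 9.5e-07
print(chain_gradient(1.5, 20))  # explodes: 1.5**20, roughly 3325
```

Twenty steps is already enough for the two products to differ by about ten orders of magnitude, which is why feedback from the loss struggles to reach *W* through a long chain.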

## Dealing with Longer Sequences

With RNNs, the output at every time step depends on the previous inputs, and this works by maintaining a **state vector** (*sₜ*), which is a *function of* the *current input* and the *previous state* (*sₜ₋₁*).

At **every time step**, the information stored in the hidden state (*sₜ*) is updated based on the *current step's input*, the *previous time step's hidden state* (*sₜ₋₁*), and the *bias term*; therefore, the *information in the current time step's hidden state gets morphed* compared to the information in the previous time step's hidden state.
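That update can be sketched in a few lines. The tanh nonlinearity and the names `W` (recurrent weights), `U` (input weights), and `b` (bias) follow the usual simple-RNN convention; the dimensions and random inputs below are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
d, input_dim = 4, 3                      # illustrative sizes
W = rng.standard_normal((d, d))          # recurrent weights
U = rng.standard_normal((d, input_dim))  # input weights
b = rng.standard_normal(d)               # bias term

def rnn_step(s_prev, x_t):
    """s_t = tanh(W s_{t-1} + U x_t + b): the new hidden state is a
    function of the previous state and the current input."""
    return np.tanh(W @ s_prev + U @ x_t + b)

s = np.zeros(d)  # initial state s_0
for x_t in rng.standard_normal((20, input_dim)):  # a 20-step sequence
    s = rnn_step(s, x_t)  # the same fixed-size vector is overwritten
```

Note that `s` is the same *d*-dimensional vector throughout: every step overwrites it, which is exactly why the contribution of early inputs gets morphed away.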

Now if there is a very *long sequence of, say, 15–20 words*, then by the time the *model reaches the last word*, the *information from the first hidden state will have been completely morphed*, and in some cases this can affect the result drastically.

So, this is the problem with using RNNs for longer sequences (a typical sentence on Wikipedia is 20–25 words, so what was learned at time step 1 will have vanished, or rather been morphed, by the time the model reaches time step 25). And if the *parameters are updated using backpropagation through time*, this also causes the *problem of vanishing/exploding gradients*.

## The whiteboard analogy

The scenario described above can be understood through the following whiteboard analogy.

So, assume there is a whiteboard (of fixed width and height), and information is written on it, say, every 10–15 seconds. After a while, this whiteboard becomes so messy that it is very difficult to figure out what was written at, say, time step 1 when the current time step is 100.

The same problem persists in recurrent neural networks, especially when the input is quite a long sequence: the state vector (*sₜ*) is also of fixed size (*d-dimensional*, just as the whiteboard is of fixed size), and this state vector is recomputed at every time step.

After a while, it becomes very difficult to know what the first input contributed to this state vector (*sₜ*), and this becomes even more challenging in the case of longer sequences. Ideally, the expectation is that each time step's input contributes significantly to the output.

The intuition for the solution to this also comes from the whiteboard analogy

Suppose there is a fixed-size whiteboard and the intent is to solve something on it. Typically, the strategy of ** selectively read, write and forget** is followed:

For example, say a given set of variables has some initial values, and the intent is to compute the quantity *ad + ac(bd + a)* and note down the steps taken to compute this output. To add a twist, let's say there is a constraint that only 3 statements can be written on the whiteboard at a time:

The way to proceed with this computation is the following:

So, there are a total of 6 steps, but all 6 cannot be written on the whiteboard at the same time, as the whiteboard has a small, fixed size.

The best way is to first compute the 3 values mentioned below (as the whiteboard can accommodate only 3 statements at a time; note that "*ac*" and "*bd*" are computed and only their final values are noted on the whiteboard):

Now the next computation is "*ac(bd + a)*", which means "*bd*" is not needed on the whiteboard anymore: it has already been used for computing "*bd + a*", so "*bd*" is erased from the whiteboard. This resonates with point 3 of the strategy, which is to **selectively erase** some content.

And the value of "*ac(bd + a)*" is computed using the values of "*ac*" and "*bd + a*", which resonates with point 2 of the strategy, i.e. **selectively read**.

Now the whiteboard is filled up again, and the next value needed is "*ad*". The values of "*ac*" and "*bd + a*" are not needed anymore, so either one or both of these statements can be removed from the whiteboard.

The value of "*ad*" is written directly as 11 (and not as 1 × 11 = 11, which would have taken 2 steps), corresponding to point 1 of the strategy: **selectively write**.

And computing the value of "*ad*" just takes the values of "*a*" and "*d*" from the board, which corresponds to **selectively read**.

So, again we **selectively erase/forget** the value of “*ac*” from the board and then **selectively write** the required value which is “*ad + ac(bd + a)*”
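The walkthrough above can be sketched as a tiny simulation: a "whiteboard" holding at most three named values, with selective write, read, and erase operations. The article implies *a = 1* and *d = 11* (since *ad* is written directly as 11); the values of *b* and *c* are not given, so the ones below are illustrative:

```python
def write(board, name, value):
    """Selectively write: store only the final value (max 3 slots)."""
    assert len(board) < 3, "whiteboard is full; erase something first"
    board[name] = value

def erase(board, name):
    """Selectively forget: remove a value no longer needed."""
    del board[name]

a, d = 1, 11  # implied by the article ("ad" written directly as 11)
b, c = 2, 3   # illustrative; not given in the article

board = {}
write(board, "ac", a * c)                     # step 1
write(board, "bd", b * d)                     # step 2
write(board, "bd+a", board["bd"] + a)         # step 3: selectively read bd
erase(board, "bd")                            # bd is no longer needed
write(board, "ac(bd+a)", board["ac"] * board["bd+a"])  # step 4
erase(board, "ac")                            # free slots for the rest
erase(board, "bd+a")
write(board, "ad", a * d)                     # step 5: written directly
write(board, "result", board["ad"] + board["ac(bd+a)"])  # step 6
```

The `assert` in `write` enforces the 3-statement constraint, so the erase calls are not optional: the computation only fits on the board because intermediate values are selectively forgotten.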

This is how we typically deal with the limited memory (the small, fixed dimension) of a whiteboard, which can only store so much information.

In the next article, we touch upon how to use this strategy in the case of recurrent neural networks.

References: PadhAI