Learning Algorithm - Recurrent Neural Networks(RNNs)

Parveen Khurana
8 min readFeb 13, 2022

In the last article, we discussed the loss function jar for Sequence-based problems. In this article, details about the learning algorithm are discussed:

Learning Algorithm

One question at this point would be “can the gradient descent algorithm be used in the context of a recurrent neural network?

The standard recipe that is in place is the following:

In this sample network, there are just “two parameters: w, and b” but in the case of a “recurrent neural network”, there would be a “bunch of parameters” for example, “W” which consists of “w₁₁, w₁₂, …” all the way up to the size of the w, similarly there is U, V, b, c(bias terms)

There would be bias terms as well in addition to “w”, “v”

The learning algorithm recipe is going to be the same as in the case of logistic regression, the only difference being that instead of updating two parameters (as in Logistic Regression), all parameters needs to be updated

And for the sample network (discussed above), the loss function was a function of W, b; now the “loss function will depend” on “W, U, V, b, c”

Total No. of Parameters

Let’s start with “U

Utakes in “n-dimensional input” (‘n X 1’), and convert it to “d-dimensional output” (‘d X 1’), so the dimension of U would be ‘d X n

Input dimension
Dimension of U

A “d x n” input matrix multiplied with “n x 1” would give a “d x 1” dimensional output

W takes this ‘d’ dimension vector (“d x 1”) and gives back a ‘d’ dimension vector (“d x 1”) (output of recurrent connection) which means the size of “W is going to be ‘d X d

Dimension of W

Now if there are ‘k’ classes, then “V takes in ‘d X 1’ input and gives back a ‘k X 1’ dimensional output, so the dimension of V would be ‘k X d’:

Dimension of V

b is going to be the same size as the hidden vector i.e ‘d X 1

And “c is going to be the same size as the number of classes so essentially its dimension would be ‘k X 1

So, the total number of parameters = “d x n + d x d + k x d + d x 1 + k x 1”

And “for each iteration, all these parameters need to be updated

In the forward pass, sᵢ, (y_hat ᵢ) would be computed leveraging which the loss would be computed using which all the parameters would be updated

And this is what the “learning algorithm” would look like for sequence-based problem

And in practice, instead of gradient descent, any other approach (Adam,…) could also be used as the learning algorithm.

Learning Algorithm — Derivatives w.r.t. V

Let’s briefly look at how the derivatives would be computed for backpropagation steps as part of the learning algorithm

In the case of a simple feed-forward network with just one hidden layer, the network would look like:

Here the hidden representation is denoted by “h” and the weight matrix at 1ˢᵗ and 2ⁿᵈ layer is denoted a “w” and “v” (and pre-activation is denoted by “a” with respective layer subscript)

And leveraging the output representation/distribution, the loss value would be computed as:

And the derivatives are computed using the loss function, in this case, the derivative with respect to “v” would be computed using the chain rule and would be given as:

Now “loss value is a function” of “y_hat”, so it is easy to compute its derivative, and “y_hat” in turn depends on “a₂” which itself is a function of the “v” matrix and the activation value (“h₁”)from the previous step

Let’s see how the same idea translates to a recurrent neural network

The loss function is just the “average of loss value” over “all of the training instances” where “for each instance” the “loss value is accumulated for all the time steps”:

Let’s say the loss value be “L₁, L₂, …, L₅” at 1ˢᵗ time step, 2ⁿᵈ time step, and so on till 5ᵗʰ time step in the above case/figure so the “total loss L would be equal to (L₁+ L₂ + L₃ + L₄ + L₅)

And the derivative of “Loss(L)” with respect to “V would be the summation of the derivative of individual loss (loss value at each time step with respect to “V”)

Let’s consider an individual loss term say the “partial derivative of L₁ with respect to V

It would be computed from the following block (first block in the network)

And this seems exactly like a feed-forward neural network so the process /formula applies (as discussed above for a feed-forward network) here as well

a₂ is the above image refers to the pre-activation at layer 2

In this manner, all these loss terms (L₁, L₂, …, L₅) could be computed independently and summed up to get the overall loss

Since, the entire back-propagation, derivatives depends on the loss value — the “loss could be easily computed” given the “true distribution and the predicted distribution” using a “cross-entropy loss function

Learning Algorithm — Derivatives w.r.t. W

This is what the network looks like where the “model predicts an output at each time step” and “loss is also computed at each time step” using the cross-entropy loss function

And the “total loss value would just be the summation of loss at individual time steps” and this could then be “averaged over the training instances

And for the purpose of back-propagation, the “derivative of loss with respect to w” is to be computed

This derivative can be written as the sum of the derivative of individual loss value (at each time step) with respect to “w”

In this training instance, there are a total of 5-time steps, but let’s compute the derivative of L₄ with respect to “w”

Again, the chain rule of derivatives would come into the picture, and here first the derivative of L₄ is computed with respect to s₄ (red bounding box in the snippet above) and then the derivative of s₄ with respect to “w”

And here comes the trickiest part, now s₄ clearly depends on W but it also depends on s₃ which in turn itself depends on W and s₂ and now this s₂ also depends on W and s₁ and this chain/recurrent connection would continue like this

So, L₄ depends on W via s₄, s₃, s₂, s₁

In short, the derivative of L₄ with respect to W would be the sum of the derivative of all the paths that lead from L₄ to W

And in this case, the number of paths is more because “w” is a recurrent weight and is shared across multiple time steps and each subsequent time step depends on the previous time step

Now from s₄, we have multiple paths that lead to w” so we need to consider all these paths

And the same equation after short-circuiting some of the derivatives could be written as the following:

And the same could be written more compactly as:

Here 4 represent the number of time steps

This is the same as the chain rule only difference being the chain appears to be a bit complicated because of the recurrent connection

Now the derivative of sₜ (tᵗʰ time step) with respect to sₖ would be given as

So, the overall derivative value of loss function at time step 4 (L₄) with respect to w is given as:

And in general, the derivative of tᵗʰ time step with respect to w can be represented using the following formula

And this algorithm is termed as Backpropagation through time (as the loss value or rather the derivative value is summed up across all the time steps)

References: PadhAI