Loss Function — Recurrent Neural Networks (RNNs)

Parveen Khurana
Feb 10, 2022


In the last article, we discussed the Data Modeling jar for sequence labeling and sequence classification problems. In this article, we touch upon the loss function for these sequence-based problems.

Loss Function

Sequence Classification Task

Let’s say there are 2 classes, positive and negative, and in this case the actual label happens to be positive, which means the entire probability mass is on the positive class/label in the true distribution:

True distribution

Now “y_hat” would be computed using the model equation for Recurrent Neural Networks (RNNs)

And let’s assume that the model predicts the following distribution for this case:

Predicted distribution

As it’s a classification problem and there are two probability distributions, the Cross-Entropy Loss is used to compute the loss value:

For one training example, the cross-entropy loss is:

L = −Σ_c y_c · log(ŷ_c)

Note that it is the predicted value ‘ŷ’ (‘y_hat’) inside the logarithm, not ‘y’ — the loss is the negative logarithm of the predicted probability corresponding to the true class.

The cross-entropy loss value is a negative summation, over all possible values the random variable can take (in this case, the possible values are 0 and 1), of the product of the true probability and the logarithm of the predicted probability.

The true output has a peculiar form in this scenario: one of the classes has 0 probability mass. Since each term in the summation is a product with the true probability, only one term remains, and that term corresponds to the true class (which carries the entire probability mass of 1 in the true distribution).

One way of looking at this is that minimizing the loss maximizes the predicted probability of the correct class.
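A small numeric sketch of this: with a one-hot true distribution, every term except the true-class one vanishes, and the loss collapses to the negative log of the predicted probability of the true class (the probability values below are made up for illustration).

```python
import math

def cross_entropy(true_dist, pred_dist):
    """Cross-entropy between a true and a predicted distribution."""
    # Skip zero-probability terms: 0 * log(p) contributes nothing.
    return -sum(t * math.log(p) for t, p in zip(true_dist, pred_dist) if t > 0)

true_dist = [1.0, 0.0]   # one-hot: all mass on the positive class
pred_dist = [0.7, 0.3]   # hypothetical model output

loss = cross_entropy(true_dist, pred_dist)
# Only the true-class term survives, so loss equals -log(0.7)
```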

And this loss value is for one training example. The total loss would be the average of this quantity over all the training examples:

The total loss over N training examples is:

L = −(1/N) · Σᵢ log(ŷ_ic)

(again, it is the predicted value ‘ŷ’ inside the logarithm, not ‘y’)

Here, ‘i’ is the index of the training example and ‘c’ in ‘y_ic’ is the index of the correct class for that example.
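The averaging over examples can be sketched as follows (the prediction values and labels are hypothetical):

```python
import math

def total_loss(pred_probs, true_classes):
    """Average negative log-probability of the true class over all examples."""
    n = len(pred_probs)
    # pred_probs[i][c] is the predicted probability of class c for example i
    return -sum(math.log(pred_probs[i][c]) for i, c in enumerate(true_classes)) / n

preds = [[0.7, 0.3], [0.2, 0.8], [0.9, 0.1]]  # 3 examples, 2 classes
labels = [0, 1, 0]                            # index of the correct class
loss = total_loss(preds, labels)
```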

Sequence Labeling Task

Here, the model makes a prediction at every time step, which means there is a true distribution and a predicted distribution at each time step.

Therefore, the loss value is computed at each time step for all the training examples and accumulated in one variable for the overall loss.

The overall loss is:

L = −(1/N) · Σᵢ Σⱼ log(ŷ_ijc)

(note that it is the predicted value ‘ŷ’ inside the logarithm, not ‘y’)

For the ‘iᵗʰ’ training example and the ‘jᵗʰ’ time step, the logarithm of the predicted probability at the index corresponding to the true class is computed, and the average is taken over the training examples.

Think of it like this: for every training example, the model makes T predictions, and the loss value is summed up across these T predictions; each of the individual losses is the cross-entropy, or negative log-likelihood, for the corresponding time step.


The overall loss could also be averaged over the number of time steps, depending on the requirement.

In a simpler form, say the loss value at time step 1 is L₁, at time step 2 it is L₂, and so on; the overall loss value for a given training example is then the summation of the loss values computed at each time step: L = L₁ + L₂ + … + L_T.
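The per-example summation L₁ + L₂ + … + L_T can be sketched like this (the per-step probabilities and true classes below are made up for illustration):

```python
import math

def sequence_loss(pred_probs_per_step, true_classes_per_step):
    """Sum of per-time-step losses L_1 + L_2 + ... + L_T for one example."""
    # Each L_j is the negative log-probability of the true class at step j.
    return sum(-math.log(step[c])
               for step, c in zip(pred_probs_per_step, true_classes_per_step))

preds = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]  # T = 3 time steps, 2 classes
labels = [0, 1, 0]                            # true class at each step
loss = sequence_loss(preds, labels)
```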

In practice, the loss value is computed with just a call to a pre-defined function in a framework like PyTorch or TensorFlow.
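For instance, in PyTorch both cases reduce to a call to `nn.CrossEntropyLoss`, which takes raw scores (logits) rather than probabilities; the tensors below are illustrative, and this is a sketch rather than the article's original code.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # averages over the batch by default

# Sequence classification: one prediction per example, shape (batch, classes).
logits = torch.tensor([[2.0, 0.5],   # scores for [positive, negative]
                       [0.1, 1.5]])
targets = torch.tensor([0, 1])       # index of the true class per example
loss = criterion(logits, targets)

# Sequence labeling: one prediction per time step, shape (batch, classes, T).
seq_logits = torch.randn(2, 2, 5)           # 2 examples, 2 classes, 5 steps
seq_targets = torch.randint(0, 2, (2, 5))   # true class at each time step
seq_loss = criterion(seq_logits, seq_targets)
```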

This article is more about the behind-the-scenes details of the loss value computation. Once the loss value is in place, the learning algorithm can be used to find the best parameters for the model.

References: PadhAI