# Loss Function — Recurrent Neural Networks (RNNs)

In the last article, we discussed the Data Modeling jar for sequence labeling and sequence classification problems. In this article, we touch upon the loss function for sequence-based problems.

## Loss Function

Let’s say there are two classes, positive and negative, and in this case (snippet below) the actual label happens to be positive, which means the entire probability mass of the true distribution is on the positive class/label.

Now “y_hat”, the predicted distribution, would be computed using the model equations of the Recurrent Neural Network (RNN).

And let’s assume that the model predicts the following distribution for this case:
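As a concrete stand-in for the distributions shown in the (missing) snippet, here is a minimal sketch; the predicted probabilities are illustrative values, not taken from the article:

```python
# True distribution: entire probability mass on the positive class (one-hot)
y_true = [1.0, 0.0]   # order: [positive, negative]

# Hypothetical predicted distribution produced by the RNN (e.g. via softmax)
y_hat = [0.7, 0.3]    # illustrative values only

# Both are valid probability distributions over the two classes
print(sum(y_true), sum(y_hat))
```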

As this is a classification problem and there are two probability distributions (true and predicted), the Cross-Entropy Loss is used to compute the loss value. Please note that it should be ‘y_hat’ instead of ‘y’ in the second line: the loss is the negative logarithm of the predicted probability assigned to the true class.

The cross-entropy loss value is a negative summation, over all possible values the random variable can take (in this case, the two classes), of the product of the true probability and the logarithm of the predicted probability.
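Written out for a single example, with true distribution y and predicted distribution y_hat over C classes, the summation just described is:

```latex
\mathcal{L} = -\sum_{k=1}^{C} y_k \, \log \hat{y}_k
```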

The true distribution has a peculiar form in this scenario: one of the classes carries zero probability mass. Since each term of the summation is a product with the true probability, essentially only one term remains, and that term corresponds to the true class (which carries the entire probability mass of 1 in the true distribution).
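A quick numeric check of this collapse, using the same illustrative probabilities as above (not values from the article):

```python
import math

y_true = [1.0, 0.0]   # one-hot true distribution, positive class first
y_hat = [0.7, 0.3]    # hypothetical predicted distribution

# Full cross-entropy summation over both classes
full = -sum(t * math.log(p) for t, p in zip(y_true, y_hat))

# Only the true-class term survives, because the other term is multiplied by 0
single = -math.log(y_hat[0])

print(full, single)  # the two values are identical
```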

One way of looking at this is that minimizing the loss maximizes the predicted probability of the correct class.

And this loss value is for one training example. The total loss would be the average of this quantity over all the training examples:
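The averaged quantity referenced above (the original image is not reproduced here) can be written as follows, where N is the number of training examples:

```latex
\mathcal{L}_{\text{total}} = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{y}_{ic}
```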

In the image above, ‘i’ is the index of the training example and ‘c’ in ‘y_ic’ is the index of the correct class for that example.

In sequence labeling problems, the model makes a prediction at every time step, which means there is a true distribution and a predicted distribution at each time step.

Therefore, the loss value is computed at each time step, for all the training examples, and accumulated in one variable for the overall loss.

For the ‘iᵗʰ’ training example and the ‘jᵗʰ’ time step, the negative logarithm of the predicted probability (at the index corresponding to the true class) is computed, and the average is taken over all examples and time steps.

Think of it this way: for every training example, the model makes “T” predictions, and the loss value is summed up across these “T” predictions, where each individual loss is the cross-entropy, or negative log-likelihood, for the corresponding time step.
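The accumulation over examples and time steps can be sketched as a small helper; the function name, data layout, and example values are illustrative assumptions, not from the article:

```python
import math

def sequence_loss(y_hat_seqs, true_classes):
    """Average cross-entropy over all examples and all time steps.

    y_hat_seqs:   list of N sequences; each sequence holds T predicted
                  distributions (one list of class probabilities per step).
    true_classes: list of N sequences of true class indices (length T each).
    """
    total, count = 0.0, 0
    for y_hat_seq, cls_seq in zip(y_hat_seqs, true_classes):
        for y_hat, c in zip(y_hat_seq, cls_seq):
            total += -math.log(y_hat[c])  # negative log-likelihood at one step
            count += 1
    return total / count

# One training example with T = 2 time steps (illustrative probabilities)
preds = [[[0.7, 0.3], [0.2, 0.8]]]
labels = [[0, 1]]
print(sequence_loss(preds, labels))
```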