Loss Function — Recurrent Neural Networks (RNNs)
In the last article, we discussed the Data Modeling jar for sequence labeling and sequence classification problems. In this article, we touch upon the loss function for sequence-based problems.
Loss Function
Sequence Classification Task
Let’s say there are two classes, positive and negative, and the actual label for a given example happens to be positive, which means the entire probability mass of the true distribution is on the positive class: y = [1, 0] over (positive, negative)
Now the predicted distribution “ŷ” (y_hat) would be computed using the model equation for Recurrent Neural Networks (RNNs)
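The model equation itself is not reproduced here; a standard simple-RNN formulation (notation assumed) updates a hidden state at each time step and produces the prediction from the final state:

```latex
s_t = \sigma(U x_t + W s_{t-1} + b), \qquad \hat{y} = \mathrm{softmax}(V s_T + c)
```

where x_t is the input at time step t, s_t is the hidden state, U, W, V are weight matrices, b and c are biases, and T is the last time step.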
And let’s assume that the model predicts the following distribution for this case, say ŷ = [0.7, 0.3] (illustrative numbers)
As it’s a classification problem and there are two probability distributions to compare (true and predicted), the cross-entropy loss is used to compute the loss value:
L = −Σ_c y_c · log(ŷ_c). That is, the cross-entropy loss is a negative summation, over all possible values the random variable can take (in this case 0 and 1), of the product of the true probability and the logarithm of the predicted probability
And the true distribution has a peculiar form in this scenario: it is one-hot, so one of the classes has zero probability mass. Since every term of the summation is a product with the true probability, only one term survives, the one corresponding to the true class (which carries the entire probability mass of 1): L = −log(ŷ_c)
One way of looking at this: minimizing the loss amounts to maximizing the predicted probability of the correct class
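A minimal sketch of this computation, using the illustrative distributions assumed above:

```python
import math

y_true = [1.0, 0.0]   # one-hot true distribution: entire mass on the positive class
y_hat  = [0.7, 0.3]   # illustrative predicted distribution

# Full cross-entropy summation over both classes
loss = -sum(t * math.log(p) for t, p in zip(y_true, y_hat))

# Because y_true is one-hot, only the true-class term survives
assert abs(loss - (-math.log(y_hat[0]))) < 1e-12
print(loss)  # ≈ 0.357
```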
And this loss value is for one training example. The total loss would be the average of this quantity over all the N training examples: L = −(1/N) · Σᵢ log(ŷ_ic)
In the expression above, ‘i’ is the index of the training example and ‘c’ in ‘ŷ_ic’ is the corresponding index of the correct class
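As a concrete sketch, here is how that average over training examples could be computed in PyTorch for a sequence classification setup (all sizes below are illustrative assumptions, not from the article):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions for the sketch)
batch_size, seq_len, input_dim, hidden_dim, num_classes = 4, 10, 8, 16, 2

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_classes)
loss_fn = nn.CrossEntropyLoss()  # averages over the batch by default

x = torch.randn(batch_size, seq_len, input_dim)        # input sequences
labels = torch.randint(0, num_classes, (batch_size,))  # one true label per sequence

_, h_last = rnn(x)               # hidden state after the last time step: (1, batch, hidden)
logits = classifier(h_last[-1])  # one prediction per sequence: (batch, num_classes)
loss = loss_fn(logits, labels)   # average cross-entropy over the N training examples
```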
Sequence Labeling Task
Here, at every time step the model makes a prediction, which means there is a true distribution and a predicted distribution at each time step
Therefore, the loss value is computed at each time step for all the training examples and accumulated in one variable for the overall loss
For the ‘iᵗʰ’ training example and the ‘jᵗʰ’ time step, the logarithm of the predicted probability at the index corresponding to the true class is computed; these terms are summed over the time steps and averaged over the training examples: L = −(1/N) · Σᵢ Σⱼ log(ŷ_ijc)
Think of it like this: for every training example, the model makes “T” predictions, the loss value is summed up across these T predictions, and each individual loss is the cross-entropy (or negative log-likelihood) at the corresponding time step
The overall loss could also be averaged over the number of time steps, as per the requirement
In a simpler form: say the loss value at time step 1 is L₁, at time step 2 it is L₂, and so on. The overall loss value for the given training example would be the summation of the loss values computed at each time step: L = Σₜ Lₜ over t = 1, …, T
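A sketch of this accumulation over time steps in PyTorch, under assumed shapes (a batch of sequences with one true label per time step):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions for the sketch)
batch_size, seq_len, input_dim, hidden_dim, num_classes = 4, 10, 8, 16, 5

rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
tagger = nn.Linear(hidden_dim, num_classes)

x = torch.randn(batch_size, seq_len, input_dim)                # input sequences
labels = torch.randint(0, num_classes, (batch_size, seq_len))  # one true label per time step

outputs, _ = rnn(x)       # hidden state at every time step: (batch, T, hidden)
logits = tagger(outputs)  # a prediction at every time step: (batch, T, num_classes)

# Accumulate L1 + L2 + ... + LT explicitly, then average over the batch
loss = sum(
    F.cross_entropy(logits[:, t, :], labels[:, t], reduction="sum")
    for t in range(seq_len)
) / batch_size
```

Equivalently, flattening the batch and time dimensions into one axis gives the same value in a single call: F.cross_entropy(logits.reshape(-1, num_classes), labels.reshape(-1), reduction="sum") / batch_size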
In practice, the loss value would be computed with just a call to a pre-defined function in a framework like PyTorch or TensorFlow
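For instance, with PyTorch’s built-in loss a single call does the job; note that nn.CrossEntropyLoss expects raw logits, since it applies log-softmax internally (the tensors below are placeholders):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 2)            # raw model scores for 4 examples, 2 classes
targets = torch.tensor([1, 0, 0, 1])  # true class indices

loss = nn.CrossEntropyLoss()(logits, targets)  # log-softmax + negative log-likelihood in one call
```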
This article is more about the behind-the-scenes details of computing the loss value. Once the loss value is in place, a learning algorithm can be used to find the best parameters for the model
References: PadhAI