This article covers the content discussed in the Sigmoid Neuron and Cross-Entropy module of the Deep Learning course and all the images are taken from the same module.
The setup is as follows: we are given an image and we know its true label, i.e., whether it contains text or not. In the case below, the image contains text, so all the probability mass is on the random variable taking the value Text (1) and there is 0 probability mass on it taking the value No Text (0).
Of course, in practice we don't know this true distribution, so we approximate it using the sigmoid function. When we pass the input x to the sigmoid neuron, we get an output of, say, 0.7, which we can again interpret as a probability distribution: the probability of the image containing text is 0.7 and the probability of it not containing text is 0.3.
Earlier, we computed the difference between these two distributions using the squared error loss. Now we have a better metric, one grounded in probability theory: the KL Divergence between the two distributions.
So, instead of minimizing the squared error loss, we are now interested in minimizing the KL Divergence, and this minimization is with respect to the parameters of the model (w, b).
The term highlighted (in yellow) in the image below depends on w and b through the sigmoid equation, while the term underlined in blue does not depend on w and b.
So, our goal is to minimize a quantity with respect to the parameters w, b. Since the blue underlined part in the above image does not depend on w, b, we can treat it as a constant. Our task therefore reduces to minimizing just the first term, i.e., the Cross-Entropy, with respect to the parameters w, b.
The value of the Cross-Entropy in the above case turns out to be -log(0.7), which is the same as the -log of y_hat for the true class. (The true class in this case was 1, i.e., the image contains text, and the y_hat corresponding to this true class is 0.7.)
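For reference, the decomposition described here can be written out explicitly. This is a sketch in notation assumed for this note (the original figure is not reproduced): y is the true distribution, \hat{y} is the predicted distribution, and i ranges over the two outcomes.

```latex
\begin{aligned}
KL(y \,\|\, \hat{y}) &= \sum_i y_i \log \frac{y_i}{\hat{y}_i} \\
&= \underbrace{-\sum_i y_i \log \hat{y}_i}_{\text{cross-entropy, depends on } w,\, b}
\;-\; \underbrace{\Big(-\sum_i y_i \log y_i\Big)}_{\text{entropy of } y,\ \text{constant in } w,\, b}
\end{aligned}
```

Only \hat{y} is produced by the sigmoid neuron, so only the first term changes when w and b change.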
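The arithmetic can be checked with a few lines of Python. This is a minimal sketch, assuming natural logarithms and the two-outcome distributions from the example; the helper names `cross_entropy` and `entropy` are introduced here for illustration only.

```python
import math

def cross_entropy(p_true, p_pred):
    """-sum_i p_true[i] * log(p_pred[i]); outcomes with zero true mass are skipped."""
    return -sum(p * math.log(q) for p, q in zip(p_true, p_pred) if p > 0)

def entropy(p_true):
    """-sum_i p_true[i] * log(p_true[i]), using the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in p_true if p > 0)

y_true = [1.0, 0.0]   # all mass on "Text"
y_pred = [0.7, 0.3]   # sigmoid output interpreted as a distribution

ce = cross_entropy(y_true, y_pred)
kl = ce - entropy(y_true)   # KL divergence = cross-entropy - entropy

print(ce, -math.log(0.7), kl)
```

Because the true distribution puts all its mass on one outcome, its entropy is 0, so the KL Divergence and the Cross-Entropy coincide here and both equal -log(0.7).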
Using Cross-Entropy with Sigmoid Neuron
When the true output is 1, the loss function boils down to -log(y_hat).
And when the true output is 0, the loss function is -log(1 - y_hat).
This is simply because, in the two-term form -[y log(y_hat) + (1 - y) log(1 - y_hat)], one of the terms gets multiplied by 0 and vanishes, so what remains is the other term.
A more simplified way of writing Cross-Entropy Loss function:
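Written out for a single data point, the two-term form and the two cases fit together as follows. This is a sketch assuming y \in \{0, 1\} is the true label and \hat{y} is the sigmoid output; the original figure may lay it out differently.

```latex
L(w, b) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]
=
\begin{cases}
-\log \hat{y} & \text{if } y = 1 \\
-\log (1 - \hat{y}) & \text{if } y = 0
\end{cases}
```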
Learning Algorithm for Cross-Entropy Loss function
As is clear from the output values, we are dealing with a classification problem (the possible outputs are 0 and 1). Using the Cross-Entropy Loss only makes sense in the classification case, because that is when we interpret the output as a probability. We compute the output for each of the 5 data points and use it to compute the loss function as:
Once we have the loss, we can compute the delta terms (highlighted in the image below), update the parameters, and repeat this over multiple passes over the data:
We compute these δ (delta) terms by taking the partial derivatives of the loss function with respect to w and b.
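For the five data points, the loss can be written as a sum of the per-point cross-entropy terms. This is a sketch assuming the terms are summed; averaging over the 5 points would only rescale the loss by a constant and would not change where the minimum lies.

```latex
L(w, b) = -\sum_{i=1}^{5} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\right],
\qquad
\hat{y}_i = \frac{1}{1 + e^{-(w x_i + b)}}
```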
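The update step can be summarized as below. This is a sketch of plain gradient descent; the learning rate \eta and the iteration index t are notation assumed here and are not named in the text above.

```latex
\begin{aligned}
\Delta w_t &= \left.\frac{\partial L}{\partial w}\right|_{w_t,\, b_t},
&\qquad
\Delta b_t &= \left.\frac{\partial L}{\partial b}\right|_{w_t,\, b_t} \\
w_{t+1} &= w_t - \eta\, \Delta w_t,
&\qquad
b_{t+1} &= b_t - \eta\, \Delta b_t
\end{aligned}
```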
Computing partial derivatives with cross-entropy loss:
In general, we can write the loss function as:
We can then compute the partial derivative of the loss function with respect to w using the chain rule as:
And similarly, the partial derivative of the loss function with respect to b would be:
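The derivation can be worked through step by step for a single data point (x, y). The sketch below assumes a scalar input x and natural logarithms; for several data points, the gradients are summed over the points.

```latex
\begin{aligned}
\hat{y} &= \frac{1}{1 + e^{-(wx + b)}}, \qquad
L = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right] \\[4pt]
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
 = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \\[4pt]
\frac{\partial \hat{y}}{\partial w} &= \hat{y}\,(1 - \hat{y})\, x, \qquad
\frac{\partial \hat{y}}{\partial b} = \hat{y}\,(1 - \hat{y}) \\[4pt]
\frac{\partial L}{\partial w} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = (\hat{y} - y)\, x \\[4pt]
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b} = \hat{y} - y
\end{aligned}
```

The factor \hat{y}(1 - \hat{y}) from the sigmoid derivative cancels against the denominator of \partial L / \partial \hat{y}, which is why the final expressions are so simple.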
Changes in the Code for Cross-Entropy Loss function:
Our code with the squared error loss function is as below:
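The original listing is not reproduced here, so the following is only a minimal sketch of what squared-error code for a sigmoid neuron might look like. The data values, the starting point, the learning rate `eta`, and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical toy data (assumed for illustration): inputs and real-valued targets.
X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(w, b, x):
    """Sigmoid neuron output for a scalar input x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    """Squared error loss, summed over the data."""
    return sum(0.5 * (f(w, b, x) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b, x, y):
    """Derivative of 0.5 * (f - y)^2 with respect to w for one data point."""
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    """Derivative of 0.5 * (f - y)^2 with respect to b for one data point."""
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def do_gradient_descent(w=-2.0, b=-2.0, eta=1.0, max_epochs=1000):
    """Plain batch gradient descent: accumulate gradients, then update w and b."""
    for _ in range(max_epochs):
        dw = sum(grad_w(w, b, x, y) for x, y in zip(X, Y))
        db = sum(grad_b(w, b, x, y) for x, y in zip(X, Y))
        w, b = w - eta * dw, b - eta * db
    return w, b
```

Calling `do_gradient_descent()` returns the learned `(w, b)`; the gradient-descent loop itself stays the same when the loss is swapped, and only `error`, `grad_w`, and `grad_b` need to change.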
Now three functions change: the first is the error function, and the other two are the functions that compute the derivative of the loss function with respect to w and b.
In the error function, earlier we were computing the squared error loss which is now changed to cross-entropy loss:
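A cross-entropy version of the error function might look like the sketch below. The five data points and the names `X`, `Y`, `f`, and `error` are assumptions for illustration, natural logarithms are used, and the loss is summed over the data points.

```python
import math

# Hypothetical toy data with binary labels (assumed for illustration).
X = [0.5, 1.0, 1.5, 2.0, 2.5]
Y = [0, 0, 1, 1, 1]

def f(w, b, x):
    """Sigmoid neuron output for a scalar input x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    """Cross-entropy loss, summed over the data."""
    err = 0.0
    for x, y in zip(X, Y):
        fx = f(w, b, x)
        # For y in {0, 1}, exactly one of the two terms is non-zero.
        err += -(y * math.log(fx) + (1 - y) * math.log(1 - fx))
    return err

print(error(0.0, 0.0))
```

At w = 0 and b = 0 every prediction is 0.5, so each point contributes log 2 and the total is 5 × log 2, roughly 3.47. Note that this sketch does not guard against `f` returning exactly 0 or 1, where the logarithm is undefined.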
Other two functions:
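With the cross-entropy loss, the two gradient functions simplify because the sigmoid derivative cancels out. The sketch below assumes a scalar input and the same hypothetical names `f`, `grad_w`, and `grad_b` as in the illustrative squared-error sketch.

```python
import math

def f(w, b, x):
    """Sigmoid neuron output for a scalar input x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    """dL/dw for one data point under cross-entropy loss: (y_hat - y) * x."""
    return (f(w, b, x) - y) * x

def grad_b(w, b, x, y):
    """dL/db for one data point under cross-entropy loss: (y_hat - y)."""
    return f(w, b, x) - y
```

Compared with the squared-error versions, the only difference is that the extra factor fx * (1 - fx) is gone, so the gradient does not shrink toward zero when the neuron is confidently wrong.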