This article covers the content discussed in the Optimization Algorithms(Part 2) module of the Deep Learning course and all the images are taken from the same module.
In this article, we would try to answer the below question and we would also look at some other algorithms apart from the Gradient Descent for the update rule.
The idea of a stochastic and mini-batch gradient descent:
Python code for the Gradient Descent algorithm that we have been using so far in the previous articles is as follows:
As is clear from the red-underlined part in the above image, we are making multiple passes over the data and for each iteration/pass, we are looking at all the data points(blue underlined part) and accumulating the gradients for all the points and once we have done one pass over the data(which is also termed as one epoch), we are updating the weights.
Our overall loss value is going to be the sum of the loss value over all the data points and hence the derivative of the overall loss can also be written as the sum of the derivative of the loss over all the data points:
We are updating the parameters once in one pass over the entire data, now say if we have one million data points, then first we are going to compute the partial derivative of the parameters for each of the one million points, accumulate all those in a gradient and then we are going to make one update of the parameters. This looks like computationally expensive.
We could choose to do an approximation over here. Instead of looking at all the data points, we look at a few data points(we will shuffle the data randomly) and we could look say first ‘B’ data points and using that, we compute the partial derivative for each of those ‘B’ data points and accumulate all of those in a gradient and make an update to the parameters.
So, whatever was the total derivative, we have replaced it with fewer points i.e with the sum of derivatives of the ‘B’ points that we are considering at a time:
So, in one pass over the data, we would see many such batches(each of size ‘B’, the last one could be of less than ‘B’) for example for a total of ‘M’ data points and a batch size of ‘B’, we would have nearly M/B batches and we update the weights this many times in just one epoch(one pass over the entire dataset). So, this is termed as mini-batch Gradient Descent as we are looking at mini-batches of data.
We could take this idea to the extreme and compute the derivative for each of the data points and update the parameters for each of the data points. In other words, our batch size is 1. This is termed as Stochastic Gradient Descent.
The advantage of this approach is that we are able to make many updates in one epoch, we would be able to move/converge faster than the vanilla Gradient Descent. The downside is that whether we’re using mini-batch or stochastic, we are doing an approximation, we are not actually computing the true derivative because the true loss is the total loss over all the data points and the true gradient would be the sum of the partial derivatives with respect to all the data points.
Error Surface for Stochastic GD:
Code for GD:
Code for Stochastic GD:
For Vanilla GD, we have the error surface as:
As is clear from the above image, the error curve is a smooth curve with not any oscillations.
For stochastic GD, we have the error surface as:
Now we see a lot of oscillations here(and these oscillations are different from the oscillations in case of Momentum, NAG GD).
Error Surface for Mini-batch GD:
We have set the mini-batch size as 10. We are going over the data, computing the partial derivatives and accumulating them and also keeping a count of the number of data points that we have seen and as soon as we have seen the number of data points equivalent to the mini-batch size, we make an update.
Here, the approximation would be better than stochastic gradient descent as we are looking at a batch of data points.
Once again we can see that there are these oscillations that never occurred in Vanilla GD or in full Batch GD but are occurring both in stochastic and mini-batch GD.
As we increase batch size, we can see some stability in the oscillations.
In practice, we choose batch size like 32, 64 or 128 depending on the amount of data that we have. And that also makes sure that approximations are better than just using stochastic GD.
Epochs and Steps:
The difference between Batch, Mini-batch, stochastic GD is the amount of data that is taken into account when computing the derivatives.
In case of full batch GD, we look at all the data, we compute the partial derivative of the loss function for every data point, then we take the sum and make the updates.
In the case of Stochastic GD, we look at one data point, computes the derivative and updates the parameters and in case of mini-batch, we look at the data in batches of size ‘B’.
We could also have the stochastic and mini-batch for Momentum based and NAG GD.
Everything in the code remains the same except we are making the updates for every data point that we see.
Error surface for the stochastic version of momentum and NAG looks like:
Oscillations are smaller in the stochastic version of Momentum and NAG GD compared to Vanilla GD as there is this history component in this case which helps to be on the track to an extent.
Batch size is a hyper-parameter and in practice, we use mini-batch GD with a batch size of 32, 64 or 128.
Why do we need an adaptive learning rate?
Let’s say we have a sigmoid neuron that takes in many inputs.
So, we have the inputs and their corresponding weights and the sigmoid equation looks like:
The update rule for each of the weights is going to be:
the update is proportional to the derivative and we have already discussed that if the derivatives are small then the weights are not going to change that much.
We have discussed this in an earlier article that the derivative of the loss function with respect to weight is proportional to the input:
The updates are proportional to derivatives and the derivative depends on the input.
In real-world data, many features are sparse i.e they take on the value 0 for most of the training inputs that we have for example if we have one feature say ‘isActorMattDamon’ for a movie classification task(whether we like it or not), say we have data for 1000 movies or so and out of those 1000, Matt Damon would have done 30–40 movies at max. So, that means out of the 1000 entries, the value of the feature ‘isActorMattDamon’ is going to be 0 for about 960 data points. The implication of this is brought out in the below image:
The derivative is dependent on the input ‘x’ and the input would be 0 for most of the data points in the scenario discussed above. That means the corresponding derivative is also going to be 0 and if the derivative is going to be 0 then the weight update is going to be 0 and if the weights are not changing then we have a problem as the parameter and apparently the loss value would remain the same.
So, we want to ensure that in the case where we have sparse features just like in the above scenario where the input is ‘on’(has a value 1) only 30–40 times out of 1000 times, then we need to make sure that for those 30–40 data points the derivative must get boosted with a larger learning rate so that the weights are changed accordingly. And we want a larger learning rate as the weights are not getting updated for all of the data points and when they are getting updated, we are ensuring they get changed by a large amount.
In other words, we want the learning rate to be adaptive to the features that we are dealing with, when we are dealing with sparse features the learning rate should be high and on the other hand, if we are dealing with dense features(features is on most of the times) then we want the learning rate to be small.
The intuition behind Adagrad is that we should try to decay the learning rate in proportion to the update history of the parameter. So, if we have a parameter which is getting updated very frequently(because that input is ‘on’ most of the times, hence the derivative is non-zero most of the times hence the weight is getting updated most of the times) then for such parameters we want to have a smaller learning rate and in contrast, for the parameters which is ‘off’ most of the times, then the derivative is going to be 0 most of the times and for such features, whenever it is ‘on’, we want the update to be faster, so we want to have a higher learning rate for such features.
Adagrad helps to have this adaptive learning rate by dividing the learning rate by the history of the updates of the feature.
If we focus on the yellow highlighted terms in the below image, it is just the update as for the vanilla GD, we are dividing the learning rate eta with some quantity(in red in the below image)
And we are hoping that by doing this we need to ensure two things: one is that for the features which have received a lot of updates, the denominator term(in red) in the above image should be high so that the effective learning rate becomes small and we want this denominator to be small if we have received very few updates in the past so that the overall learning rate is boosted.
The denominator term ‘vt’ is the running sum of the squares of the derivatives:
Say we are going over the data points one at a time and for the first 10 data points, the feature value was 1, in that case, the derivative would not be zero for any of the data points and consequently, we have some high value for ‘vt’ as we are taking the square of the derivative in the equation for vt. So, for the 11th data point, as the denominator is large now, the effective learning rate(in green in the below image) would become small now which is exactly what we want for a dense feature.
In the case of a sparse feature, most of the derivatives would be 0 and say for k’th data point that derivative is non-zero(and the derivative is 0 for all (k-1) data points), so here accumulated ‘vt’ would be 0 as all the derivatives were 0 till this k’th data point and the learning rate, when divided by this small history, would become very large as compared to what it was for the dense features.
The black curve in the above image corresponds to GD, the red one corresponds to Momentum based GD and the blue one corresponds to NAG GD.
We started with the point marked in the below image and we should have reached the minima which is at the center of the valley
Our b value must increase up to the black circled point in the below image and w must increase up to the blue circled point(overall they both are represented by the red circled point in the below image)
Ideally, both the parameters should have changed parallelly and followed the path in yellow in the below image to reach the minima:
Instead what we see is that initially the ‘w’ is not changing at all and only ‘b’ is changing and has changed from -2 to 0 whereas ‘w’ is still almost the same.
This is for the case where ‘w’ is associated with a sparse feature and ‘b’ is always going to have some value so we can consider it as a dense feature and that’s why it was able to move where ever it had to move. Now at this point, we have reached a good value for ‘b’, there is nothing we can do by changing ‘b’ anymore. And the only way to reach the minima is to change the value of ‘w’.
And whenever we see the value of data point corresponding to feature associated with ‘w’, there would be a change and if we train for a large number of epochs, eventually ‘w’ would change from a value of -2 to value 0 where the overall loss would be minimal.
So, for the initial 30–40 epochs, the only ‘b’ is changing here with almost no change in the value of ‘w’ and thereafter a large no. of epochs are still required to achieve the optimum value of ‘w’. This is the problem with the other algorithms that we discussed in previous articles.
Now let’s see what the moment looks like with Adagrad:
The green curve represents the moment for Adagrad. Now we can see that both the parameters are changing simultaneously which is the result of an Adaptive Learning Rate.
Limitation of Adagrad:
In the case of Momentum Based GD and NAG, after some 100 iterations we were able to reach the minima even though for the first 30–40 iterations, they were just moving in the direction of ‘b’, they were not making any changes to ‘w’ but what happens in the case of Adagrad is that since the learning rate is decaying aggressively for the frequent parameter, as ‘b’ gets updated frequently, its history component increases by a lot because of which the effective learning rate could go close to 0.
So, now what happens is, after a while when we have made a lot of updates to ‘b’, we are not able to make any more updates to ‘b’, so even if we run this algorithm for 100- 200 iterations, it just reaches near the minima but it is not able to reach the exact minima and to reach the minima, it has to move in the direction of ‘b’, it has to move up and it's not able to move in the direction of ‘b’ as the effective learning rate for ‘b’ has become very close to 0 and hence no updates are happening to ‘b’.
We wanted to increase the learning rate for ‘w’, blow up its learning rate which has happened but in the process, we have also happened to decrease the learning rate for ‘b’ and after a while, it does not converge.
RMSProp is another algorithm which operates on the below intuition:
So, the history component is different in RMSProp compared to Adagrad.
In Adagrad, we are adding the square of every derivative whereas, in RMSProp, we are again taking the exponentially decaying sum of the square of the derivatives.
‘v4’ in case of Adagrad would look like:
whereas in case of RMSProp, it would look like:
We can see that the ‘v4’ would be much smaller in the case of RMSProp, even though we are accumulating the history of gradients, we are not allowing it to blow up as we are multiplying it by decay ratios.
There would still be a relative difference between the value of ‘vt’ for dense features and for sparse features.
The brown curve represents the moment for RMSProp and we can see that it eventually reaches the minima as opposed to the Adagrad(green curve).
We have seen Momentum based GD and RMSProp GD and both of these were using a history component, while Momentum based GD uses history of the gradient whereas RMSProp uses a history of the square of the gradients and we have seen in momentum-based GD that having this cumulative history actually helps in moving fast out of the flat surfaces.
In momentum-based GD, the history component is used to compute the current update whereas, in RMSProp, history is used to adjust the learning rate. So, Adam algorithm combines these two ideas:
Adam combines the ideas of the two algorithms and ensures that the learning rate is adaptive and also ensures that the second term in the update is not the current derivative but a history of derivatives so that we can move fast out of flat regions.
In practice, Adam also does bias correction(to ensure that the gradients are more stable at the beginning) where instead of using ‘mt’ and ‘vt’, we have:
So, first, it computes ‘mt’ and ‘vt’ using the below equation and then divide by the quantity(does the bias correction) as in the above image:
The Cyan colored curve corresponds to Adam.
As this algorithm has some part/features of momentum, so it behaves just like that in the sense that it overshoots the minima, makes some oscillations and reaches the valley.
We have seen six different algorithms:
with the last three having an adaptive learning rate. And we have seen three different strategies which can be used in combination with any of the Algorithms
In practice, Adam with Mini-Batch is the most popular algorithm. The advantage of this combination is that it does not depend too much on the initial learning rate because anyways learning rate is going to adapt and because of this other property of having the momentum, it kinds of combines the advantage of Momentum based GD and RMSProp.