Optimization Algorithms — Part 2
This article covers the content discussed in the Optimization Algorithms (Part 2) module of the Deep Learning course, and all the images are taken from that module.
In this article, we will try to answer the question below, and we will also look at some algorithms other than Gradient Descent for the update rule.
The idea of stochastic and mini-batch gradient descent:
The Python code for the Gradient Descent algorithm that we have been using in the previous articles is as follows:
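The code itself appears as an image in the original article. A minimal sketch of that batch gradient descent loop, assuming the sigmoid-neuron setup with squared-error loss and the grad_w/grad_b helpers from the earlier parts of this series (the toy data and initial values here are illustrative, not the module's exact numbers), might look like this:

```python
import numpy as np

# Toy training data (illustrative values).
X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

def f(w, b, x):
    # Sigmoid neuron output.
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    # Gradient of the squared-error loss w.r.t. w for one data point.
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    # Gradient of the squared-error loss w.r.t. b for one data point.
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def do_gradient_descent(max_epochs=1000, eta=1.0):
    w, b = -2.0, -2.0
    for epoch in range(max_epochs):      # multiple passes over the data
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):           # look at every data point ...
            dw += grad_w(w, b, x, y)     # ... and accumulate its gradient
            db += grad_b(w, b, x, y)
        w = w - eta * dw                 # update only once per epoch,
        b = b - eta * db                 # after the full pass over the data
    return w, b

w, b = do_gradient_descent()
```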
As the red-underlined part in the above image shows, we make multiple passes over the data. In each pass, we look at all the data points (the blue-underlined part) and accumulate the gradients over all of them; only once we have completed one full pass over the data (also termed one epoch) do we update the weights.
Our overall loss is the sum of the losses over all the data points, and hence the derivative of the overall loss can also be written as the sum of the derivatives of the per-point losses:
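The equation was shown as an image in the original; spelled out (where $\mathcal{L}_i(w, b)$ denotes the loss on the $i$-th data point and $N$ the number of points, a notation assumed here), it reads:

$$
\mathcal{L}(w, b) = \sum_{i=1}^{N} \mathcal{L}_i(w, b)
\quad\Longrightarrow\quad
\nabla \mathcal{L}(w, b) = \sum_{i=1}^{N} \nabla \mathcal{L}_i(w, b)
$$

This additivity is exactly what the inner loop in the code above exploits: each term of the sum is computed per data point and accumulated before the single update at the end of the epoch.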