This article covers the content discussed in the Sigmoid Neuron module of the Deep Learning course and all the images are taken from the same module.
So far, we have discussed the 4 jars of ML with respect to the Sigmoid Neuron Model. In this article, we will discuss the Learning Algorithm and Evaluation Metric for Sigmoid Neuron.
Sigmoid: Learning Algorithm
Intro to Learning Algorithm:
Let’s say the below is the input data
And our model’s equation is:
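Written out, and assuming the standard one-input sigmoid neuron used throughout this series (the symbol \hat{y} for the predicted output is a notational choice here), the model reads:

```latex
\hat{y} = f(x) = \frac{1}{1 + e^{-(w x + b)}}
```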
The parameters of the model are w and b. By changing their values we get different sigmoid functions. For example, the plot below shows a wide variety of sigmoid functions, each with a different combination of w and b.
So, by changing w and b we change two things: the slope of the sigmoid and its position along the x-axis (how the sigmoid plot shifts as the parameter values change is discussed in this article).
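To make the effect of w and b concrete, here is a minimal sketch. The function names `sigmoid`, `midpoint` and `slope_at_midpoint` are illustrative and not taken from the course code. It uses two facts about the sigmoid: the output is 0.5 where w*x + b = 0, and the slope at that point is w/4.

```python
import math

def sigmoid(x, w, b):
    """Output of a one-input sigmoid neuron."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def midpoint(w, b):
    """x at which the output is exactly 0.5, i.e. where w*x + b = 0 (needs w != 0)."""
    return -b / w

def slope_at_midpoint(w):
    """Slope of the curve at its midpoint: w * 0.5 * (1 - 0.5) = w / 4."""
    return w / 4.0

# Larger w gives a steeper curve; more negative b (with w > 0) moves it to the right.
for w, b in [(1.0, 0.0), (3.0, 0.0), (3.0, -6.0)]:
    print(f"w={w}, b={b}: midpoint x={midpoint(w, b)}, slope={slope_at_midpoint(w)}")
```

So w controls how sharp the transition from 0 to 1 is, while the ratio -b/w controls where that transition happens.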
The objective of learning is to find parameter values such that, when we plug in an input, the predicted output is very close to the true output, i.e. the loss is minimized.
If we plot the below input data, it looks like:
So, we want values of w and b such that the sigmoid passes through all of the blue cross points in the above image.
We start with some random values for w and b, go over all the training points, compute the loss, and update w and b so that the loss decreases. We continue this process until we get the desired accuracy.
Learning by Guessing:
We will look at a particular training point, compute the predicted output for that point and, based on that, decide how to update the parameters (increase or decrease them) so that the sigmoid with the new parameters is closer to the true sigmoid (the true output).
We start with w, b as 0.
For w and b as 0, the function would reduce to the value 0.5:
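As a quick check, assuming the standard sigmoid form of the model, substituting w = 0 and b = 0 removes the dependence on x altogether:

```latex
\hat{y} = \frac{1}{1 + e^{-(0 \cdot x + 0)}} = \frac{1}{1 + e^{0}} = \frac{1}{2}
```

So the prediction is 0.5 for every input, which is a flat horizontal line.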
With w and b as 0, the predicted output is nowhere close to the true output, so we try increasing w (keeping b constant) and look at the output:
Now we can see that at least the slope of the above sigmoid is somewhat better: in the ideal scenario, where the plot passes through all the data points, we need a positive slope:
So, we know that increasing w is a step in the right direction, so we increase w a little bit more and see the plot:
Let’s keep the value of ‘w’ as 3 for now and experiment with b.
We want the sigmoid plot(with ‘w’ as 3) to move towards the right and we know that to shift it towards the right, we need to decrease the value of b. So, we have the following:
We are very close to the solution at this point, now we again see if changing w helps us here:
But decreasing w from 3 to 2 while keeping b the same overshoots the points, so let’s see if it helps to keep w the same and decrease b:
And now we are able to exactly fit the points.
We were able to plot this 1-dimensional data and see for ourselves how changing the parameter values affects the plot. In practice, the data would be high dimensional, and we would not be able to plot it and get insights from it that easily.
We started with the parameter values as 0:
In the next iteration, we changed w to 1 and kept b as 0:
And we kept doing this till we got the exact plot.
The only problem with this algorithm is that the update values (δw and δb) come from guesses. We can’t do that in real-world problems, as it might take a very long time to converge to a solution:
So, what we need is a more principled approach guided by the loss function, so that we update the parameters in the right direction and the loss value decreases with every update.
With random guesswork, the loss may decrease at one iteration and increase at the very next one, because we are just trying out different parameter combinations: a change in the parameter values changes the predicted output, which in turn changes the loss value.
We can also plot out the loss value for different values of w and b:
Let’s see what we were doing in the guesswork algorithm:
When w = 1 and b = 0, the loss changes:
w = 2, b = 0
And in the guesswork algorithm we were randomly changing the values of w and b and were randomly moving on the error surface:
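The error surface can be sketched numerically by evaluating the loss at many (w, b) pairs. The data points below are hypothetical stand-ins, since the article's actual data appears only in its figures. The loss is assumed to be half the summed squared error. The names `loss`, `guess_losses` and `grid` are illustrative.

```python
import math

# Hypothetical toy data, NOT the data from the article's figures.
X = [0.5, 2.5]
Y = [0.2, 0.9]

def sigmoid(x, w, b):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b):
    """Assumed loss: half the summed squared error over the toy data."""
    return sum(0.5 * (sigmoid(x, w, b) - y) ** 2 for x, y in zip(X, Y))

# Loss at the first few guesses from the walkthrough: (w, b) = (0, 0), (1, 0), (2, 0).
guess_losses = [(w, b, loss(w, b)) for w, b in [(0, 0), (1, 0), (2, 0)]]

# A coarse grid over (w, b); each entry is one point on the error surface.
grid = [(w, b, loss(w, b)) for w in range(-6, 7) for b in range(-6, 7)]
best_w, best_b, best_loss = min(grid, key=lambda t: t[2])

for w, b, value in guess_losses:
    print(f"guess w={w}, b={b}: loss={value:.4f}")
print(f"lowest loss on the grid at w={best_w}, b={best_b}: {best_loss:.4f}")
```

Each guess in the guesswork algorithm is simply a jump from one point of this surface to another, with no guarantee that the new point is lower.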
In practice, we want to start at a random point (in yellow in the image below) and move to the point with the minimum loss (in pink in the image below). We don’t want random movements in which we increase or decrease w and b and the loss goes up and down on the error surface. Instead, every move on the error surface should decrease the value of the loss function, which was not the case with the random guess algorithm.
So, our main aim as of now is that we need a principled way of changing w and b based on the loss function.
We can consider the parameters as a vector θ, and then our goal is to find the optimal value of θ for which the loss is minimized. The way we do that is to start with some random value for θ and change it iteratively.
In every iteration, we update θ as shown below, and we keep updating until we are satisfied with the loss value:
Geometrically, we have the θ and the change in θ as a vector which we can represent as:
The new value for θ can be represented as:
One thing to note is that this is a large change in θ. So, instead of updating by the full delta theta, we can update θ by a smaller quantity that points in the same direction as delta theta.
And in this case, the new value for θ would look like:
The quantity highlighted in yellow in the above image is known as the learning rate and usually, it is very small. We want to learn the parameters but at a small rate and would not like to change the parameters drastically.
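In symbols, the scaled update described above can be written as follows. The symbol η for the learning rate is an assumed notation, since the original equation is only visible in the figure:

```latex
\theta_{\text{new}} = \theta + \eta \, \Delta\theta, \qquad \theta = \begin{bmatrix} w \\ b \end{bmatrix}, \qquad 0 < \eta \ll 1
```

Because η is a positive scalar, the step keeps the direction of Δθ and only shrinks its length.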
What to do to update the parameters?
So, we have the data given to us and the model that we have chosen.
The way learning proceeds is that we compute y_hat (the predicted output) for each of the data points, and once we have the predicted output for each point, we can compute the loss value as:
Once we have the overall loss value, we can also compute the average loss value. Based on the loss value, we want to know the right step (the direction and the magnitude of the change) for updating the parameters; the mathematical details are discussed in this article. Frameworks like PyTorch and TensorFlow have functions that automatically compute the delta values for the parameters. Once we update the parameters, we go through the entire data again, compute the predicted output, compute the loss and update the parameters. We repeat this loop until we are satisfied, which can mean any of the following: we have achieved a good accuracy, the loss is less than a pre-defined threshold, we have iterated a pre-fixed number of times, or the parameter values no longer change much between two consecutive iterations.
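The stopping criteria can be sketched as a small generic loop. This is only an illustration under assumptions: `run_until_converged`, `step_fn`, `loss_fn` and the tolerance values are hypothetical names and defaults, not part of the course code. `step_fn` stands for whatever rule produces the updated parameters.

```python
def run_until_converged(theta, step_fn, loss_fn,
                        max_epochs=1000, loss_tol=1e-3, param_tol=1e-6):
    """Repeat parameter updates until one of three stopping criteria fires.

    theta   : list of parameter values
    step_fn : callable mapping the current parameters to the updated ones
    loss_fn : callable mapping parameters to a loss value
    """
    for epoch in range(1, max_epochs + 1):
        new_theta = step_fn(theta)
        # Largest change in any single parameter during this update.
        change = max(abs(n - o) for n, o in zip(new_theta, theta))
        theta = new_theta
        if loss_fn(theta) < loss_tol:
            return theta, epoch, "loss below threshold"
        if change < param_tol:
            return theta, epoch, "parameters stopped changing"
    return theta, max_epochs, "epoch budget exhausted"

# Usage on a toy one-parameter problem: minimize (t - 3)^2 with a fixed step size.
toy_loss = lambda t: (t[0] - 3.0) ** 2
toy_step = lambda t: [t[0] - 0.1 * 2.0 * (t[0] - 3.0)]
theta, epochs, reason = run_until_converged([0.0], toy_step, toy_loss)
print(theta, epochs, reason)
```

An accuracy-based criterion would fit the same pattern, with the loss check replaced by an accuracy check.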
As discussed in the article conveying mathematical details, we can write the partial derivatives with respect to w and b for the 5 data points that we have in this case as:
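For reference, here is one consistent way to write these derivatives. It assumes the squared error loss carries a factor of 1/2; without that factor every derivative picks up an extra factor of 2. The notation \hat{y}_i for the prediction on the i'th point is a choice made here:

```latex
\mathcal{L}(w, b) = \frac{1}{2} \sum_{i=1}^{5} (\hat{y}_i - y_i)^2,
\qquad \hat{y}_i = \frac{1}{1 + e^{-(w x_i + b)}}

\frac{\partial \mathcal{L}}{\partial w} = \sum_{i=1}^{5} (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)\, x_i

\frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{5} (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)
```

The factor \hat{y}_i (1 - \hat{y}_i) is the derivative of the sigmoid with respect to its argument, and the trailing x_i comes from differentiating w x_i + b with respect to w.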
Writing the Code:
We have two inputs (data points) represented in the form of an array.
We approximate the relationship between the input and the output using a Sigmoid function.
Then we have a function that takes in the input and the parameter values and returns the value of the sigmoid function at that input.
We have another function to compute the error. This function takes the parameter values as input, goes over the data points, computes the predicted output and the squared error loss for each data point, adds each loss to a running total and, at the end, returns that total.
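A minimal sketch of the two functions described above. The data values and the names `f` and `error` are assumptions, since the original code is only available as an image. The factor 0.5 in the loss is also an assumption.

```python
import math

# Hypothetical data: two inputs and their true outputs (not the article's values).
X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(w, b, x):
    """Value of the sigmoid function at input x for parameters w and b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    """Go over the data points and accumulate the squared error loss."""
    total = 0.0
    for x, y in zip(X, Y):
        fx = f(w, b, x)                # predicted output for this point
        total += 0.5 * (fx - y) ** 2   # squared error for this point (assumed 1/2 factor)
    return total

print(error(0.0, 0.0))
```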
We can compute the derivative with respect to w and b as:
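One possible way to code these derivatives for a single data point, assuming the loss for that point is 0.5 * (f(x) - y)^2. The names `grad_w` and `grad_b` are illustrative.

```python
import math

def f(w, b, x):
    """Sigmoid of the linear term w*x + b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    """Derivative of 0.5 * (f(x) - y)^2 with respect to w for one point."""
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    """Derivative of 0.5 * (f(x) - y)^2 with respect to b for one point."""
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

# Example: gradients at an arbitrary (hypothetical) point and parameter setting.
print(grad_w(1.0, -1.0, 2.5, 0.9), grad_b(1.0, -1.0, 2.5, 0.9))
```

The only difference between the two is the trailing factor x, because the linear term w*x + b changes by x per unit of w but by 1 per unit of b.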
Now we look at the main function in which we iteratively go through the data and update the parameters:
We start with random values for w and b, choose a value for the learning rate and set the number of epochs (the number of times to iterate over the data). In each pass, we compute the derivatives and update the weights.
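Putting the pieces together, the main loop might look like the sketch below. The data, the learning rate `eta` and the epoch count are assumptions. The default starting point (w = 0, b = -8) is the one mentioned in the text for the plots. The function name `do_gradient_descent` is illustrative.

```python
import math

# Hypothetical data, not the values from the article's code screenshot.
X = [0.5, 2.5]
Y = [0.2, 0.9]

def f(w, b, x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def error(w, b):
    return sum(0.5 * (f(w, b, x) - y) ** 2 for x, y in zip(X, Y))

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def do_gradient_descent(w=0.0, b=-8.0, eta=1.0, max_epochs=1000):
    """Full-batch gradient descent; returns final parameters and the loss history."""
    history = [error(w, b)]
    for _ in range(max_epochs):
        # Sum the per-point derivatives over the whole data set.
        dw = sum(grad_w(w, b, x, y) for x, y in zip(X, Y))
        db = sum(grad_b(w, b, x, y) for x, y in zip(X, Y))
        # Move against the gradient, scaled by the learning rate.
        w -= eta * dw
        b -= eta * db
        history.append(error(w, b))
    return w, b, history

w, b, history = do_gradient_descent()
print(f"w={w:.3f}, b={b:.3f}, first loss={history[0]:.4f}, last loss={history[-1]:.4f}")
```

Keeping the loss after every epoch in `history` makes it easy to plot the loss and check that it goes down over the passes.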
And we can plot this out to see for ourselves how the loss is changing for different parameter values:
To start with, we have ‘w’ as 0 and b as -8
After a few iterations, the situation is like:
The error is constantly decreasing (not jumping up and down the way it did in the random guess algorithm). It never increases at any point, and after a few iterations we reach the dark blue region where the error is close to 0.
Dealing with more than 2 parameters
Let’s take the scenario where we have more than 2 parameters, that is, more than one input:
Let’s say we are trying to predict the output PoorCare using the 5 input features that we have in the above image:
The data matrix is usually denoted by X and the output vector by Y.
Each row in the input and output data matrix/vector refers to one data point.
The number in yellow in the below image:
we can write it as x₁₂ meaning the 2nd input feature for the first data point/row.
In general, we can write it as ‘xᵢⱼ’ where ‘i’ refers to the row index and ‘j’ refers to the column index of the matrix.
So, for the i’th data point, we compute the weighted sum as:
And then we can pass this sum as the input to the sigmoid function.
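A minimal sketch of this two-step computation for one data point with several features. The name `sigmoid_neuron` and the example numbers are made up for illustration.

```python
import math

def sigmoid_neuron(x, w, b):
    """Prediction for one data point.

    x : list of feature values for the data point (x_i1, x_i2, ...)
    w : list of weights, one per feature
    b : bias term
    """
    # Weighted sum of the inputs plus the bias.
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    # Pass the sum through the sigmoid.
    return 1.0 / (1.0 + math.exp(-z))

# Example with 5 hypothetical features and weights.
x_i = [1.0, 0.0, 2.0, 0.5, 3.0]
w = [0.2, -0.4, 0.1, 0.0, -0.1]
print(sigmoid_neuron(x_i, w, b=0.3))
```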
So, we have a total of 6 parameters in this case: 5 weight terms and 1 bias term. And the way we are going to update all of them is by using the same gradient descent algorithm.
If we have one input and one weight parameter, then the derivative of the loss function with respect to that weight is given as:
Now using the same derivation that we have done to compute the derivative with respect to parameter w, we can show that the derivative of the loss function with respect to first weight/parameter is given as:
So, this value depends on the first column for the i’th input and we are going to sum it over all the inputs.
Similarly, we can write:
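In general form, and assuming the same squared error loss with a factor of 1/2 over m data points (m and \hat{y}_i are notation introduced here), the derivative with respect to the j'th weight and the bias can be written as:

```latex
\frac{\partial \mathcal{L}}{\partial w_j} = \sum_{i=1}^{m} (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)\, x_{ij},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{m} (\hat{y}_i - y_i)\, \hat{y}_i (1 - \hat{y}_i)
```

The only term that changes from one weight to the next is x_{ij}, the j'th feature of the i'th data point.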
The code changes in the same way:
Earlier we had the function as:
And now we have:
And there would be one change in computing the sigmoid value for the given input.
Earlier we had:
Now we have:
There will be changes in the gradient descent function as well: instead of updating one weight, we now have to update all the weights.
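One way the multi-weight update could look, as a sketch under the same assumptions as before (squared error loss with a factor of 1/2). The names `predict` and `gradient_step` and the tiny data set are hypothetical.

```python
import math

def predict(x, w, b):
    """Sigmoid of the weighted sum of the features plus the bias."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(X, Y, w, b, eta):
    """One full-batch gradient descent update of all weights and the bias."""
    dw = [0.0] * len(w)
    db = 0.0
    for x, y in zip(X, Y):
        y_hat = predict(x, w, b)
        # Part of the derivative shared by every weight and the bias.
        common = (y_hat - y) * y_hat * (1 - y_hat)
        for j, x_j in enumerate(x):
            dw[j] += common * x_j   # weight j sees feature j of this point
        db += common
    new_w = [w_j - eta * dw_j for w_j, dw_j in zip(w, dw)]
    new_b = b - eta * db
    return new_w, new_b

# Usage on a tiny hypothetical data set with two features.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [1.0, 0.0]
w, b = gradient_step(X, Y, w=[0.0, 0.0], b=0.0, eta=1.0)
print(w, b)
```

The loop over j is the only structural change from the single-weight version: every weight gets its own accumulated derivative.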
So, we have the data given to us. The task is to predict some value between 0 and 1, because we are trying to regress a probability. The model we chose is the Sigmoid model, and the loss function is the squared error loss. Now we will see how to evaluate the model on the test data.
We can pass the test data through our model to get the predicted output, and then compare the predicted output with the true output.
We can compute the Root Mean Squared Error(RMSE) which is typically used for regression problems.
The smaller the RMSE the better.
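RMSE is the square root of the mean of the squared differences between true and predicted outputs. A minimal sketch, where the function name `rmse` and the sample values are illustrative:

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error between true and predicted outputs."""
    n = len(y_true)
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / n)

# Hypothetical test outputs and model predictions.
print(rmse([0.2, 0.9, 0.5], [0.25, 0.8, 0.55]))
```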
If we are doing a classification task and have to give a discrete output, either 0 or 1, we can choose a threshold value and map the predicted output to either 0 or 1.
We will take any value less than 0.5 as 0 and any value greater than 0.5 as 1.
And we can compute the accuracy using the standard formula that we have:
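Thresholding and accuracy (correct predictions divided by total predictions) can be sketched as follows. The function names are illustrative. Mapping a prediction of exactly 0.5 to class 1 is an assumption, since the text does not say how that boundary case is handled.

```python
def to_label(p, threshold=0.5):
    """Map a predicted probability to 0 or 1.

    Assumption: a value exactly equal to the threshold is mapped to 1.
    """
    return 1 if p >= threshold else 0

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of data points whose thresholded prediction matches the true label."""
    preds = [to_label(p, threshold) for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, preds) if t == p)
    return correct / len(y_true)

# Hypothetical true labels and predicted probabilities.
print(accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```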
Data: We are dealing with real-valued inputs.
Task: We can do the Classification as well as Regression Task using the Sigmoid Neuron.
Model: Our approximation of the relationship between the input and the output is a non-linear function, which gives us a graded output that we can express as a probability.
Loss: We are still using the Squared Error loss function.
Learning Algorithm: We used the Gradient Descent algorithm, which ensures that every parameter update points in the right direction.
Evaluation: We can use the RMSE metric for the regression problem and we can use the Accuracy metric for classification tasks.