Feedforward Neural Networks — Part 2

This article covers the content discussed in the Feedforward Neural Networks module of the Deep Learning course and all the images are taken from the same module.

In the previous article, we discussed the Data, Tasks, and Model jars of ML with respect to feedforward neural networks: we looked at how to work out the dimensions of the different weight matrices and how to compute the output. In this article, we look at how to decide the output layer of the network, how we learn the weights of the network, and how to evaluate the network.

How to decide the Output Layer?

The two main tasks that we deal with in most problems are Classification and Regression.

Classification

Consider multi-class classification. We are given an image as input, pass it through all the layers, and finally predict an output. The true output in this case is a probability distribution where the entire mass is focused on one outcome. The network will also produce a probability distribution: in a 4-class case it will predict 4 values such that they sum to 1 and each value is greater than or equal to 0.

Regression

The other problem we can look at is one where we are given some features of a movie and we are interested in predicting multiple metrics: say the IMDB rating, critic rating, RT rating, and Box Office collection. As all of these are real numbers, this is a regression problem, and here too we are trying to regress 4 values at the output. In general, we have 'k' outputs in the regression case, where we are trying to regress 'k' values, and in the classification case the output could be 'k' values, where we are predicting the probabilities corresponding to each of the 'k' classes.

Let’s see the Output Layer for the Classification problem:

The true output is a probability distribution, and we want the predicted output to be a probability distribution as well. We have passed the input up to the last layer: in the network, we have computed up to h2, so using h2 and the weights of the next layer, W3, we can compute a3.

So, we have the pre-activation values for all the neurons in the output layer. In this case, we want a function that can be applied to the 4 values we have such that we get 4 probability outputs that sum to 1, each greater than 0.

Let's assume we have some a3 values for this case.

If we divide each entry by the sum of all the entries (which is just normalization), each of the resulting entries is greater than 0 and they all sum to 1.

So, this is one way of converting the output to a probability distribution, but there is a problem with it. Consider the a3 values for a case where one of them is negative.

a3 is computed as a3 = W3·h2 + b3, and since the weights, biases, and inputs are all real numbers, any entry of a3 can be positive or negative. Normalizing such values would produce something that looks like a probability distribution but has a negative entry.

The sum of all the values would still be 1, but one of the entries would be negative, which is not acceptable: the values represent probabilities, and by definition a probability cannot be less than 0.
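
To make the failure concrete, here is a minimal NumPy sketch (the a3 values are made up for illustration) showing that plain normalization can yield a negative "probability":

```python
import numpy as np

a3 = np.array([3.0, -1.0, 0.5, 1.5])  # hypothetical pre-activations, one negative

naive = a3 / a3.sum()  # plain normalization
print(naive)           # [ 0.75 -0.25  0.125  0.375]: sums to 1, but one entry is negative
```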

Another way to convert the output into a probability distribution is to use the Softmax function.

The exponential function produces a positive output even when its input is negative.

Suppose we have a vector h; then we can compute softmax(h) entry-wise as:

softmax(h)_i = e^(h_i) / Σ_j e^(h_j)

Since we are using the exponential function in softmax, both the numerator and the denominator are guaranteed to be positive, and the values still sum to 1.

So, by using the Softmax function, the predicted output is guaranteed to be a probability distribution.
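
As a quick sanity check, here is a minimal NumPy sketch of softmax (the input values are made up; subtracting the max before exponentiating is a standard numerical-stability trick, not something the module requires):

```python
import numpy as np

def softmax(h):
    """Convert a vector of real numbers into a probability distribution."""
    e = np.exp(h - h.max())  # subtract max for numerical stability
    return e / e.sum()

a3 = np.array([3.0, -1.0, 0.5, 1.5])   # same hypothetical pre-activations as before
y_hat = softmax(a3)
print(y_hat, y_hat.sum())               # all entries > 0, and they sum to 1
```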

How to choose the right network configuration?

Let's say we have two input features and the final output is a function of both inputs.

We know that we can use a DNN to deal with this kind of data, but we don't know which configuration of the DNN will work for it.

Another 2D case is one where the data is not linearly separable.

Even for 3 inputs, we could plot the data and see whether there is some non-linearity.

So, up to 3 dimensions we can simply plot the data and see for ourselves whether it is linearly separable, but in higher dimensions (which is the case for most real-world problems), we cannot just plot the data and visualize it.

In practice, we try out neural networks of several different configurations.

Many configurations are possible, and the way we do it in practice is to try some of them based on intuition about how many neurons to use in each layer (which will be discussed in a different article) and then plot the loss for each configuration.

And now based on the loss value, we can select the appropriate model for the task at hand.

The loss value depends on several hyperparameters: the number of layers, the number of neurons in each layer, the learning rate, the batch size, the optimization algorithm, and so on. These parameters affect the final output of the model, which in turn changes the loss value, so we need to try out different configurations using different values for these parameters. This is known as hyperparameter tuning.
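
As an illustration of such a sweep, here is a minimal sketch; build_and_train is a hypothetical helper (stubbed out here so the sketch runs), and the configurations are made up:

```python
import random

def build_and_train(cfg):
    """Hypothetical helper: in a real run this would build a network from
    cfg, train it with gradient descent, and return the validation loss.
    Stubbed with a random value here so the sketch runs."""
    return random.random()

# A few made-up configurations to compare.
configs = [
    {"hidden_layers": [4], "lr": 0.1},
    {"hidden_layers": [8, 4], "lr": 0.1},
    {"hidden_layers": [16, 8], "lr": 0.01},
]

losses = [build_and_train(cfg) for cfg in configs]
best = min(range(len(configs)), key=lambda i: losses[i])
print("best configuration:", configs[best], "loss:", losses[best])
```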

Loss Function for Binary Classification:

Our network for this case looks as follows.

We have some inputs, then the hidden layer, and finally a2, which aggregates all of its inputs, followed by the final output.

We could either have two neurons in the output layer, apply the softmax function to them, and decide the final output from the softmax's output, or we could have just one neuron in the output layer. That neuron would be a sigmoid neuron, so its value lies between 0 and 1, and we can treat this value as the probability of the output being class A (say, out of two classes A and B). The probability of the output belonging to class B follows immediately, since the two values must sum to 1.

So, we can simply apply the sigmoid (logistic) function to a2, and it gives an output which we can treat as the probability of belonging to class A out of the two classes A and B.
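
A minimal NumPy sketch of this step (the a2 value is made up):

```python
import numpy as np

def sigmoid(a):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

a2 = 0.4           # hypothetical pre-activation of the output neuron
p_A = sigmoid(a2)  # probability of class A
p_B = 1.0 - p_A    # probability of class B; the two sum to 1
print(p_A, p_B)
```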

Let's say we are given the values of all the network parameters.

Now we want to compute the loss for a particular input (for which we already know the true output). We pass this input through all the layers of the network and compute the final output.

As this is a classification problem, we can treat the outcome as a probability distribution, and we can use the Cross-Entropy Loss to compute the loss value.

And using the loss value we can quantify how well the network is doing on the input.

Let’s take another example:

In this case, the loss value is greater than in the previous case: here the true output is 0 while the predicted value is 0.7152, which is close to 1, so it is logical that the loss is greater than in the previous example.
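
A minimal sketch of this comparison, using the cross-entropy formula for a single binary prediction (0.7152 is the prediction from the example above; the contrasting true label is added for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Cross-entropy loss for a single binary prediction."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# The example from the text: true output 0, predicted 0.7152.
print(binary_cross_entropy(0, 0.7152))  # ~1.256

# For contrast: had the true output been 1, the same prediction
# would have given a much smaller loss.
print(binary_cross_entropy(1, 0.7152))  # ~0.335
```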

Loss function for multi-class classification:

Our network, in this case, is as follows.

In this case, we have 3 neurons in the output layer on which we can apply the softmax function which would give us the probability distribution.

Let's say the parameters have some particular values at a given iteration.

Suppose we are given an input whose true output puts all the probability mass on the second class.

In practice, we would be given the output simply as "class B", which we have converted into the probability distribution above.

We pass the input through all the layers of the network and compute all the intermediate values as well as the final value.

So, again, both the true output and the predicted output are probability distributions, and we can use the Cross-Entropy formula.

The probability mass on classes A and C is 0, so in effect the loss value reduces to the negative log of the probability predicted for the true class B.

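A minimal sketch of the multi-class cross-entropy computation (the predicted distribution is made up; the true distribution puts all mass on class B, as above):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy between a true and a predicted distribution."""
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])    # all probability mass on class B
y_pred = np.array([0.2, 0.5, 0.3])    # hypothetical network output
print(cross_entropy(y_true, y_pred))  # = -log(0.5) ~ 0.693
```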

So, in essence, for a given input we know how to compute the forward pass, or forward propagation (computing all the intermediate values as well as the final value), and, given the true output and the predicted output, we know how to compute the loss value.
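
Putting the pieces together, here is a minimal sketch of a forward pass for a network with two hidden layers and a softmax output (all shapes, values, and the choice of sigmoid activations in the hidden layers are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def forward(x, params):
    """Compute h1, h2, a3 and the predicted distribution for input x."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = sigmoid(W1 @ x + b1)   # first hidden layer
    h2 = sigmoid(W2 @ h1 + b2)  # second hidden layer
    a3 = W3 @ h2 + b3           # pre-activation of the output layer
    return softmax(a3)          # predicted probability distribution

# Hypothetical shapes: 4 inputs, two hidden layers of 3 neurons, 3 classes.
rng = np.random.default_rng(0)
params = (rng.normal(size=(3, 4)), np.zeros(3),
          rng.normal(size=(3, 3)), np.zeros(3),
          rng.normal(size=(3, 3)), np.zeros(3))
print(forward(rng.normal(size=4), params))
```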

All of the above assumes that the weight values are provided to us. Let's see how the weights are computed by the learning algorithm.

Learning Algorithm (Non-Mathy Version):

The general recipe for learning the parameters of the model is:

We initialize the parameters randomly and compute the predicted output, using which we then compute the loss value. We can feed this loss value to a framework like PyTorch or TensorFlow, and the framework will give us back the amounts by which each parameter needs to change.

Earlier we had just two parameters, w and b. Now we have many more weights and biases, across all the layers. And earlier the loss function depended on w and b, but in the case of a DNN the loss function depends on all the parameters (all the weights and all the biases).

And we can use the same recipe for the learning algorithm in the case of a DNN as in the case of the sigmoid neuron:

We initialize all the weights and biases in the network with some random values. Using this parameter configuration, we iterate over all the training data given to us and compute the final output for each training input. Once we have the predicted outputs, we can compute the loss value using the Cross-Entropy formula, and once we have the loss value, we can feed it to a framework that tells us how to update all of the parameters so that the overall loss decreases. These updates are based on the partial derivatives of the loss function with respect to each parameter, and the way to compute them is discussed in another article.
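
A minimal sketch of this recipe in PyTorch, where loss.backward() computes all the partial derivatives for us (the network shape, data, and hyperparameters are made up):

```python
import torch
import torch.nn as nn

# Hypothetical data: 100 examples, 4 features, 3 classes.
X = torch.randn(100, 4)
y = torch.randint(0, 3, (100,))

# A small feedforward network; CrossEntropyLoss applies softmax internally.
model = nn.Sequential(nn.Linear(4, 8), nn.Sigmoid(), nn.Linear(8, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()     # reset gradients from the previous step
    y_hat = model(X)          # forward pass
    loss = loss_fn(y_hat, y)  # cross-entropy loss
    loss.backward()           # framework computes all partial derivatives
    optimizer.step()          # update every weight and bias
```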

Evaluation:

We are given some data with true labels and we can pass each data point through the model and compute the predicted output for each data point.

We can compute the accuracy as the number of correct predictions divided by the total number of predictions.

The only thing to note is how we obtain the predicted output as 0 or 1. In the case of binary classification, where we have a sigmoid neuron at the output layer, we can use a threshold. For example, say the sigmoid's output is 0.6 and the threshold is 0.5: anything greater than 0.5 is treated as 1 and anything less than or equal to 0.5 is treated as 0. Another way of looking at this: a sigmoid output of 0.6 can be written as the probability distribution [0.4, 0.6], and the label we assign as the predicted output is the index of the maximum value in that distribution.
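
A minimal NumPy sketch of both decision rules and the accuracy computation (the predictions and labels are made up):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
p      = np.array([0.6, 0.2, 0.9, 0.4, 0.7])  # hypothetical sigmoid outputs

# Thresholding at 0.5 ...
y_pred = (p > 0.5).astype(int)

# ... is the same as taking the argmax of the distribution [1 - p, p].
assert np.array_equal(y_pred, np.argmax(np.stack([1 - p, p], axis=1), axis=1))

accuracy = (y_pred == y_true).mean()  # fraction of correct predictions
print(accuracy)                       # 0.6 here
```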

And the same thing applies to multi-class classification: the predicted label is the index of the maximum value in the predicted distribution.

In the case of multi-class classification, we can also look at per-class accuracy. For example: of all the images of 1's in the test set, how did the model do on those? We check the output for just those images and compute the accuracy using the same formula as above.

For example, for the 1's images, suppose we have a total of 2 images, of which one has been classified correctly as 1, giving an accuracy of 1/2, i.e. 50%.

In this way, we can compute the class-wise accuracy for all the classes, which sometimes helps in analysis. For example, in the case above, the overall accuracy is 60% but the accuracy on 1's images is 50%; say for class 9 it is 80%. So on average the model is at 60%, but there are some classes on which it is doing pretty well and some on which it is doing poorly. Based on this analysis, we could do something special for the weak class, like data augmentation, where we supply more images of the 1's class.
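
A minimal NumPy sketch of per-class accuracy (the labels are made up):

```python
import numpy as np

y_true = np.array([0, 1, 1, 2, 2, 2, 0, 1, 2, 0])
y_pred = np.array([0, 1, 0, 2, 2, 1, 0, 1, 2, 1])

for c in np.unique(y_true):
    mask = (y_true == c)  # examples whose true class is c
    acc = (y_pred[mask] == y_true[mask]).mean()
    print(f"class {c}: {acc:.0%} of {mask.sum()} examples")
```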

Summary:

Data: Inputs are going to be real numbers.

Task: We are dealing with binary classification, multi-class classification, and regression tasks using a DNN, of which the classification part is covered in this article.

Model: We are dealing with a non-linear model, which lets us handle data that is not linearly separable.

Loss function: We use the cross-entropy loss function for classification tasks; we compute the cross-entropy loss for each data point and then average over them.

Learning Algorithm: The same recipe of gradient descent is used for training a DNN.

Evaluation: We use the accuracy metric. Sometimes it also makes sense to compute per-class accuracy in the case of multi-class classification tasks.
