This article covers the content discussed in the Object Detection module of the Deep Learning course offered on the website: https://padhai.onefourthlabs.in
In this article, we will discuss the task of Object Detection using the YOLO (You Only Look Once) framework/algorithm.
YOLO looks at the task of Object Detection differently from the RCNN algorithm, which is discussed in a separate article: it avoids the Region Proposal phase.
YOLO takes the image and divides it into equal-sized grids (cells).
For each grid, we first compute the confidence that the grid lies at the center of or inside an object, i.e. that there is an object around it. If there is an object, the next question is how to adjust this grid to fit the true box (again using the same four values w, h, x, y).
The true box would look something like this:
So, once again we need to predict the displacement from the coordinates that we have (the red box in the above image represents the true region and the blue box represents the proposed region). Starting from the blue box, we need to find how much the width and the height should change so that the entire dog (the object/entity) is inside the box both horizontally and vertically, and also how much the origin should move, i.e. what the new coordinates x, y of the origin should be.
So, this is again a combination of a classification problem and a regression problem. First, we predict whether this grid/box lies at the center of or inside an object, i.e. whether there is an object around it. If yes, we predict the four values of the box. Then, supposing we are dealing with 20 different classes, we also do a classification (softmax) over these 20 classes. We know the true label, which in this case is ‘dog’, and the model predicts something, so the difference between the true label and the predicted label becomes part of the loss function.
The true label can be represented as a probability distribution where all the probability mass is on the true class (dog in this case), and the output is also a predicted probability distribution.
So, we have 49 grids (the input image is divided into 7 X 7 grids), and for each grid we first compute the confidence value, then the four values corresponding to the box, and then the k values (the probabilities of the k classes). So, for every grid, we predict (k + 5) values.
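As a minimal sketch of this layout, the snippet below counts the predicted values for a 7 X 7 grid with k = 20 and splits one cell's vector into its three parts. The ordering of the four box values (w, h, x, y) and the helper name `split_cell` are assumptions made for illustration.

```python
S = 7    # the image is divided into S x S grids, as in the text
k = 20   # number of classes, as in the 20-class example

values_per_cell = 1 + 4 + k              # confidence, box (4 values), k class probabilities
total_values = S * S * values_per_cell   # everything the network has to output

# Placeholder prediction: one list of (k + 5) numbers per grid.
prediction = [[[0.0] * values_per_cell for _ in range(S)] for _ in range(S)]


def split_cell(cell):
    """Split one grid's (k + 5) values into confidence, box and class probabilities."""
    confidence = cell[0]
    box = cell[1:5]          # assumed order: w, h, x, y
    class_probs = cell[5:]
    return confidence, box, class_probs


confidence, box, class_probs = split_cell(prediction[3][4])
print(values_per_cell, total_values, len(box), len(class_probs))
```

With these numbers, each grid carries 25 values and the whole output has 1225 values.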
So, that’s how YOLO looks at the task of Object Detection. There are no multiple region proposals; we just take equal-sized grids and then try to stretch each of them to fit the right box.
The training data in this case looks like:
So, these are the true boxes and labels given to us for this image. We know that there are 4 objects in this case.
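To connect the true boxes to the grids, one possible sketch is to find the grid that contains the center of each true box. The function name `responsible_cell`, the (x, y, w, h) box format with (x, y) as one corner, the pixel values, and the choice of measuring rows along the y axis are all assumptions for illustration; they are not taken from the course material.

```python
def responsible_cell(box, image_w, image_h, S=7):
    """Return (row, col) of the grid that contains the center of a true box.

    box is (x, y, w, h) in pixels, with (x, y) one corner of the box
    (an assumed convention).
    """
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2                 # center of the true box
    col = min(int(cx / image_w * S), S - 1)       # clamp boxes touching the right edge
    row = min(int(cy / image_h * S), S - 1)       # clamp boxes touching the far edge
    return row, col


# Hypothetical true boxes in a 448 x 448 image.
print(responsible_cell((100, 150, 200, 120), 448, 448))
print(responsible_cell((384, 384, 64, 64), 448, 448))
```

Grids that are not returned for any true box would then get a true confidence of 0.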
Now for the box in yellow in the below image:
We should be able to predict that c = 0, i.e. we have zero confidence that this grid lies inside an object or that there is an object around it, and once c is 0 we don’t care about the remaining values.
For the yellow box in the below image, we should be able to say with high confidence that this grid lies inside an object. We then need to stretch the coordinates to give the exact box, and the model should also be able to predict that the class is a cycle.
So, in total, we need to predict 49*(1 + 4 + k) values at the output.
Let’s look at the cell/grid in blue in the below image:
We look at the 15 values (assuming k = 10, where k is the number of output classes) of this grid. The first entry gives the confidence that this grid lies inside an object or has an object around it. The next four values can be used to get the refined box, and in the image the thickness of this box is directly proportional to the confidence:
Let’s look at another cell/grid:
There is no object around this cell, so:
For the shaded green grid, since the confidence is low, we are not interested in calculating the remaining values or the label of the box.
Similarly, we do this for all the 49 boxes/grids, and most of these grids would have very low confidence.
We retain only those boxes which have high confidence, and for each of those boxes we retain the label which has the maximum probability.
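This filtering step can be sketched as follows. The threshold of 0.5, the class names, the tiny 2 X 2 grid and all the numbers are hypothetical; the text only says that high-confidence boxes are kept.

```python
def decode(prediction, class_names, threshold=0.5):
    """Keep high-confidence grids and attach the most probable label to each."""
    detections = []
    for row in prediction:
        for cell in row:
            confidence, box, class_probs = cell[0], cell[1:5], cell[5:]
            if confidence < threshold:
                continue  # low confidence: ignore the remaining values
            best = max(range(len(class_probs)), key=lambda i: class_probs[i])
            detections.append((confidence, tuple(box), class_names[best]))
    return detections


class_names = ["dog", "cycle", "car"]  # hypothetical classes, k = 3
prediction = [  # a tiny 2 x 2 grid instead of 7 x 7, to keep the example short
    [[0.05, 0, 0, 0, 0, 0.3, 0.3, 0.4], [0.9, 0.5, 0.6, 0.1, 0.2, 0.8, 0.1, 0.1]],
    [[0.10, 0, 0, 0, 0, 0.2, 0.5, 0.3], [0.7, 0.3, 0.4, 0.6, 0.5, 0.1, 0.85, 0.05]],
]
detections = decode(prediction, class_names)
for detection in detections:
    print(detection)
```

Only the two grids with confidence 0.9 and 0.7 survive, labelled 'dog' and 'cycle' respectively.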
So, we have an image as the input, which we pass through a large Convolutional Network with many convolutional and max-pooling layers to get some final volume. This is followed by a few fully connected layers and then the output layer, which is of size 49 X (5 + k), where ‘k’ is the number of classes that we need to predict.
The network has these parameters: the convolutional filters, the parameters of the fully connected layers, and the parameters of the final output layer. To train this network we use the Backpropagation algorithm, which means we first need to compute the Loss Function.
So, the different things that we are trying to predict are: the confidence, which is a value between 0 and 1; the 4 values corresponding to the width, the height and the coordinates of one corner of the bounding box, which we regress and want to be as close as possible to the true values; and the probability distribution over the ‘k’ classes that we have.
So, these are the 6 quantities (1 corresponding to the confidence c_hat in the above image, 4 corresponding to the box dimensions and position, and 1 corresponding to the probability distribution vector) that we are predicting. For each of these quantities, we need a Loss Function.
And the Loss Function is going to use just two kinds of loss: the Cross-Entropy Loss and the Squared Error Loss, so the overall loss is a combination of these two.
Consider the output values for this cell (red boundary in the above image):
We want the confidence for this cell to be 1, we want w and h to be the following values:
And we want the x, y coordinates to be the coordinates of the lower-left corner of the bounding box.
And we want the probability distribution to have the entire probability mass on the label ‘dog’ and everything else to be 0.
So, our true output would look like:
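In code, such a true output vector could be assembled roughly as below. The helper `make_target`, the class list and the box numbers are hypothetical; the layout (confidence first, then w, h, x, y, then the one-hot label) follows the order used earlier in the article.

```python
def make_target(box, class_index, k):
    """Build the (k + 5) true values for a grid that contains an object."""
    w, h, x, y = box
    one_hot = [0.0] * k
    one_hot[class_index] = 1.0      # all probability mass on the true class
    return [1.0, w, h, x, y] + one_hot


classes = ["cat", "dog", "cycle"]   # hypothetical classes, k = 3
target = make_target((0.6, 0.8, 0.2, 0.1), classes.index("dog"), len(classes))
print(target)  # [1.0, 0.6, 0.8, 0.2, 0.1, 0.0, 1.0, 0.0]
```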
This output vector has some classification outputs and some regression outputs, so the loss has to account for both.
We are computing this for one particular cell; the same applies to the remaining 48 cells/grids as well.
We want the confidence to be 1 and the model predicts the confidence as c_hat. We want to minimize the difference between the two (in other words, the loss should be minimized when c_hat is 1), so the loss function is this Squared Error Loss:
We could also represent the confidence as a probability distribution vector. For the true distribution, the entire mass is on yes (corresponding to the case when the cell/grid lies at the center of an object or there is an object around this grid). The model predicts some value for the confidence, let’s say 0.7, which can also be represented as a probability distribution, and we can then compute the Cross-Entropy Loss:
So, we can use either the Squared Error Loss or the Cross-Entropy Loss for the confidence prediction; both are okay in this case.
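A small sketch of the two options for the confidence term, using the value c_hat = 0.7 from the example above. The function names and the small `eps` added inside the logarithm (to avoid log(0)) are assumptions for illustration.

```python
import math


def squared_error(c, c_hat):
    """Squared Error Loss between the true and the predicted confidence."""
    return (c - c_hat) ** 2


def cross_entropy(c, c_hat, eps=1e-12):
    """Cross-Entropy Loss between [c, 1 - c] and [c_hat, 1 - c_hat]."""
    return -(c * math.log(c_hat + eps) + (1 - c) * math.log(1 - c_hat + eps))


c, c_hat = 1.0, 0.7
print(round(squared_error(c, c_hat), 4))   # 0.09
print(round(cross_entropy(c, c_hat), 4))   # 0.3567
```

Both values shrink towards 0 as c_hat approaches the true confidence of 1.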
The second loss term is going to be:
We want the difference between w and w_hat to be minimum.
At a high level, we want w_hat to be very close to w, so we could simply minimize the difference between w and w_hat. Alternatively, we could do what we did in RCNN, where we look at the displacement: we know the true displacement and the predicted displacement, and we try to minimize the difference between them.
The authors of the YOLO paper directly minimized the difference between w and w_hat rather than predicting the displacement, and they found that using the square roots of the values instead of the actual values works well.
Similarly, we would have one term for the height:
One more term for the x coordinate:
And one more term for the y coordinate:
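The four box terms can be sketched as below. This sketch assumes, as in the original YOLO paper, that the square root is applied to the width and height while x and y use plain squared differences; it also assumes the widths and heights are non-negative. The function name and the numbers are hypothetical.

```python
import math


def box_loss(true_box, pred_box):
    """Sum of the four squared-error terms for (w, h, x, y)."""
    w, h, x, y = true_box
    w_hat, h_hat, x_hat, y_hat = pred_box
    return (
        (math.sqrt(w) - math.sqrt(w_hat)) ** 2    # width term, on square roots
        + (math.sqrt(h) - math.sqrt(h_hat)) ** 2  # height term, on square roots
        + (x - x_hat) ** 2                        # x coordinate term
        + (y - y_hat) ** 2                        # y coordinate term
    )


true_box = (0.64, 0.36, 0.5, 0.5)   # hypothetical true w, h, x, y
pred_box = (0.49, 0.25, 0.4, 0.6)   # hypothetical predicted values
print(round(box_loss(true_box, pred_box), 4))  # 0.04
```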
There would be one more loss term, and that would be the Cross-Entropy Loss for the label. We think of the true label as a one-hot vector with all the probability on the true class, and the output is also going to be a probability distribution, so we compute the Cross-Entropy Loss.
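As an illustration of this label term, the sketch below computes the Cross-Entropy Loss between a one-hot true distribution and a predicted distribution. The three classes, the predicted probabilities and the `eps` guard against log(0) are hypothetical.

```python
import math


def class_cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Cross-Entropy Loss between the true and the predicted class distributions."""
    return -sum(p * math.log(q + eps) for p, q in zip(true_dist, pred_dist))


true_dist = [0.0, 1.0, 0.0]   # one-hot: all the mass on the true class
pred_dist = [0.2, 0.7, 0.1]   # hypothetical softmax output of the model
print(round(class_cross_entropy(true_dist, pred_dist), 4))  # 0.3567
```

Because the true distribution is one-hot, only the predicted probability of the true class contributes to the loss.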
So, the overall loss would be the summation of all these 6 loss values over all the training examples:
We have already discussed the parameters of the network: it’s a large CNN with some fully connected layers at the output, so we backpropagate the loss through all the weights. That’s how the training is going to be.
For this box, the confidence would be very low, close to 0, and once we know that the confidence is close to 0, we don’t predict the remaining values.
So, what we predict here does not even matter: if this grid is not at the center of the true box, then we don’t know what the true w, h, x and y are, or what the true class is. So, in this case, the only quantity that we care about is the confidence. We want the true confidence to be 0, so we can minimize the squared error between 0 (the true confidence) and whatever the value of c_hat is.
And this quantity would be 0 only when c_hat = 0.
So, this is what happens when the grid does not lie inside any object: we just need to predict a confidence as close to 0 as possible. That is the only loss we can backpropagate, because computing any of the remaining losses needs both the true value and the predicted value, and for the remaining quantities we don’t know the true value. So we only backpropagate the loss for the confidence.
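Putting the two cases together, a rough sketch of the loss for one image might look like this. It assumes each grid's vector is laid out as [c, w, h, x, y, class probabilities], sums the terms without any weighting, and applies square roots to w and h only (as in the original YOLO paper); the function names and numbers are hypothetical.

```python
import math


def cell_loss(target, pred, eps=1e-12):
    """Loss for one grid; both vectors are laid out as [c, w, h, x, y, p_1..p_k]."""
    c, c_hat = target[0], pred[0]
    loss = (c - c_hat) ** 2              # confidence term, always used
    if c == 0:
        return loss                      # no object: the other true values are unknown
    w, h, x, y = target[1:5]
    w_hat, h_hat, x_hat, y_hat = pred[1:5]
    loss += (math.sqrt(w) - math.sqrt(w_hat)) ** 2
    loss += (math.sqrt(h) - math.sqrt(h_hat)) ** 2
    loss += (x - x_hat) ** 2 + (y - y_hat) ** 2
    loss += -sum(p * math.log(q + eps) for p, q in zip(target[5:], pred[5:]))
    return loss


def image_loss(targets, preds):
    """Sum the per-grid losses over all grids of one image."""
    return sum(cell_loss(t, p) for t, p in zip(targets, preds))


object_target = [1.0, 0.64, 0.36, 0.5, 0.5, 0.0, 1.0, 0.0]
object_pred = [0.7, 0.49, 0.25, 0.4, 0.6, 0.2, 0.7, 0.1]
empty_target = [0.0] * 8                                  # only the confidence is meaningful
empty_pred = [0.2, 0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.4]     # box and class values are ignored

print(round(cell_loss(object_target, object_pred), 4))
print(round(cell_loss(empty_target, empty_pred), 4))
print(round(image_loss([object_target, empty_target], [object_pred, empty_pred]), 4))
```

For the grid without an object, changing the predicted box or class values leaves the loss unchanged, which matches the point that only the confidence loss is backpropagated there.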
So, given an image, we want to find the bounding boxes around the objects in the image as well as the class of the object that lies inside each bounding box.
YOLO just considers grids in the input image, and for every grid it finds out whether the grid lies inside an object; if so, it predicts the coordinates of the true box and a probability distribution over all the possible classes/outputs.
All the images used in this article are taken from the content covered in the Vanishing and Exploding Gradients and LSTMs module of the Deep Learning Course on the site: padhai.onefourthlabs.in