We have seen that any large-scale rule-based system typically faces four challenges:
i.) Lots of data
ii.) A lot of rules, which are hard to write down
iii.) Rules that are sometimes inexpressible: we can’t even write them as a Boolean condition
iv.) Rules that are unknown
To overcome these limitations, we look at Machine Learning.
The idea here is this: instead of having a human who carries the rules (of an expert-based system) in their head and tries to express those rules as a program, can we take the human out of the loop and introduce a machine that directly accesses the data and learns the program (function) on its own?
So, instead of a human coding out the function/program, the machine has to somehow figure out this function, and not only that, it has to figure out its parameters as well.
So, this is what a typical machine learning setup looks like. We are given a lot of data, and our decision relies on some function of this data (all the rows, all the features). We just don’t know what this function is. In the rule-based case we were able to write the function as a series of if-else conditions and give certain weights to certain inputs (for example, whether Java is given more weightage than C). Here we know neither the rules nor the weightage assigned to the different parameters.
What we can tell the machine is that we think there is some relation between the input and the output, and that we think the relation comes from a particular family of functions (the linear family, the polynomial family, some non-linear family, etc.). We then tell the machine: look at the inputs, look at the outputs, try out these functions with different parameters, and figure out which function, with which parameters, best explains the relationship between the input and the output.
So, this is the task we are outsourcing to the machine. Once the machine gives us that function, we just need to code it up, and for any new input we can derive the output by plugging the input into that function. That’s what Machine Learning enables us to do.
So, now the crux is: how do we define these functions? We can’t just try arbitrary functions, so what is a good set of functions to try? And once a function is chosen, how do we estimate the values of its parameters?
Six Elements of Machine Learning:
i.) Data
ii.) Task
iii.) Model
iv.) Loss Function
v.) Learning Algorithm
vi.) Evaluation
Data is the fuel of Machine Learning. Today we have data everywhere: social media platforms like Facebook have text, video and audio data; product websites like Amazon have structured data in the form of tables, along with reviews, product pictures and videos describing how to use the product; and a site like gaana.com has a lot of audio and speech data. So, all forms of data are available on the internet. But just because so much data is available does not mean we can use machine learning; to do machine learning, we need a specific type of data. Let’s look at what type of data we need through one application:
The application is this: we are given a medical image, and we want to train a machine to find out whether there is an anomaly in that image or not.
The data here has two parts: the input (x), which is the image, and the output (y), which tells whether the input has a certain characteristic, in this case whether the image contains an anomaly (a brain tumor, say, or any other medical anomaly). So, we need data of the type (x, y), where x is the input and y is the output. Only if we have abundant data of the form (x, y) can we train a machine to learn a mapping between x and y, so that when we are given a new image we are able to predict the outcome for it. If only x is provided, there is not much that we can do.
And if the input contains an anomaly, we would also like to locate the exact position of the anomaly, for example:
In this case y would not be just 0 or 1; it would be the coordinates of the bottom-left and top-right corners of a bounding box. For this, we need samples (images as the input) for which we have the coordinates of the bounding box where the anomaly lies, so that we can train the machine accordingly. The machine’s job is to find the relationship between the input and the coordinates of the bounding box, which are the output in this case.
The second thing we need to understand is that the data should be in a machine-readable format. In the case of an image this is not very difficult: the image (medical scan) could be a 30x30 image, which means it has 900 pixels, and each pixel holds the intensity value of that particular cell (a color image would hold three values per pixel, one each for R, G and B). So we have 900 values, which we lay out as a single array. That is our input, and what we are trying to learn is a relationship which takes us from this input of 900 numbers to the output number.
So, there should be two things as far as data is concerned:
i.) Data should be in the form of (x, y).
ii.) It should be in a machine-readable format(All the data is encoded as numbers).
And usually the data is high dimensional: in this case we have a 30x30 image, or 900 pixels, which means we have 900 numbers describing each data point/sample.
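To make the "machine-readable" requirement concrete, here is a minimal sketch of turning a 30x30 image into 900 numbers. The random pixel values and the label are made-up placeholders, not real medical data, and the image is assumed to be grayscale (one value per pixel).

```python
import numpy as np

# Hypothetical 30x30 grayscale scan: random intensities stand in for real pixels.
rng = np.random.default_rng(seed=0)
image = rng.integers(low=0, high=256, size=(30, 30))

# Flatten the 2-D grid into a single array of 900 numbers; this is the input x.
x = image.reshape(-1)

# Made-up label for this sample: 1 = anomaly, 0 = no anomaly.
y = 1

print(x.shape)  # (900,)
```

Each (x, y) pair built this way is one training sample; a data set is just many such pairs.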
Let’s take another example where we have a document and we would like to do sentiment analysis on that document to know whether the reviewer is talking something positive about the product or something negative about the product:
So, here again, we need data in the form of (x, y), where x is the document and y is the label marked by a human. If we have a lot of such data, we can build a machine learning agent that learns the relation between x and y, such that given a new document as input at test time, the agent is able to output y using the function it has learned.
Secondly, in this case, as we have text data, we should have a way to represent this data in some form of numerical quantity because the function that we are going to learn is of the form
y = f(x)
where x and y are both going to be numerical quantities: a set of real numbers, a single real number, or whatever, but they are going to be numbers.
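One simple way to turn text into numbers is to count how often each word appears. This is only an illustrative choice (the course may use a different representation), and the reviews, labels and helper names below are made up for the sketch.

```python
def build_vocabulary(documents):
    """Map every distinct lowercase word to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_counts(document, vocabulary):
    """Represent a document as word counts over the vocabulary."""
    counts = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            counts[vocabulary[word]] += 1
    return counts


# Made-up reviews with human-assigned labels (1 = positive, 0 = negative).
reviews = ["great phone great battery", "bad battery"]
labels = [1, 0]

vocabulary = build_vocabulary(reviews)
features = [to_counts(review, vocabulary) for review in reviews]

print(vocabulary)  # {'bad': 0, 'battery': 1, 'great': 2, 'phone': 3}
print(features)    # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Now every document is a list of numbers x, and its label is a number y, which is the form y = f(x) needs.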
And the data need not be just text or images; it could be a structured table. For example, it could be the past record of an employee: their appraisal ratings over the years, how many teams they worked with, how many projects they worked on, the client satisfaction for every module they delivered, their salary at different levels, how many promotions they got, etc. All of these can be represented as structured data.
We could have text data; the product reviews discussed above are an example.
Then we have image data, where we take the pixel values to represent it as numeric data. Even a video can be represented as numeric data, as a video is just a collection of images/pictures.
And lastly, we have speech data, which we could represent in terms of amplitudes, frequencies and so on.
Now the question is where do I get the data from?
Data Curation: We have to curate this data.
There are lots of open data set repositories available online:
There are also online platforms where we can get data labeled: for example, we could upload an image and ask workers from all around the world to label it or draw a bounding box where a word appears.
For the signboard translation problem, one way to create data is to generate it: we could create empty signboards (various templates made with an image editing tool), take various names in Hindi, and then use an image editing tool to paste one on top of the other. The net effect is that we get a lot of signboard data.
We could also get a lot of data on Wikidata.
A task is nothing but a crisp definition of what your input is and what your output is.
Once we have the data, what do we do with the data?
Below is a typical example of the data. We have various things over here, it is a product page from Amazon. Let’s look at all the data that we have here.
We have image data, we have structured data in the product specifications, we have unstructured data in the reviews, and we have the product description given by the vendor, which is mostly unstructured but may have some structure. Then we have information about related products which are similar to this product. Then we have FAQs and other questions asked by users, along with their answers.
So, we have a lot of data and we are not sure what to do with it. The first thing we need to figure out as machine learning practitioners is: when data is thrown at us, what are the different tasks that we can do with this data? For the above case, we could have many different tasks. One task could be:
Given the data in the red box in the above image as the input, which is some unstructured information about the product provided by the vendor, can we train a machine to fill in a table (green box in the above image), which would be structured data? This is one task that could be accomplished given an abundance of examples like the one above.
Another task could be:
We could take the red highlighted boxes in the above image, which are the reviews and the specifications of the product; a combination of these two is now our x. For existing products we also have the FAQs that are important for the product, and these are our y. So, given the reviews and specifications of a new product, the task is to come up with the FAQs for that product.
Yet another task we could have is:
Given the information about the reviews, product specification, and FAQs, can we answer any question that a random user is asking?
Another task is recommendation: we have information about the user and their purchase history, we know which product page the user is currently looking at and what the attributes of that product are, and based on all of this we could recommend new products to the user.
Let’s take another example:
Here again, we have been provided with a lot of data. Several different tasks could be defined over here as well like:
To identify people from the image:
To identify activities from an image:
To identify location/place from a given image:
Another task would be: based on the information in a Facebook post, we could recommend similar posts or similar activities.
Now, the tasks could be of various categories. The first important category of tasks is the Supervised set of tasks.
We can build a classifier that learns the relationship between x and y and the output of the classifier is 0 or 1 depending on whether the given input image contains text or it does not contain text.
Another category of task is Regression:
Another category of tasks comes under Unsupervised Tasks and one particular task here is clustering:
Another task is Generation:
So far we have discussed the Data and Task jars.
Data could be anything. For example, let’s say we have an image as the input and the output could be one of the 5 classes mentioned below:
Now, we know that there is some true relationship that exists between the input and the output. The problem is that in most cases we don’t know what this function is. What we typically do in machine learning is look at a lot of such data where we have x and y, and in all of these cases we assume that there is some relation between x and y.
Since we don’t know what the true relation between the input and the output is, we come up with some function which we believe best approximates the relation between x and y.
Let’s understand this in much more detail.
We take one-dimensional data and plot it out
True relation between x and y is:
But we do not know the true function; we only have the data, and using the data we need to come up with a function. So, we start with a very simple function, say a linear function, y = mx + c.
The parameters m and c are learned using all the data that we have. We could learn these two parameters even if we had just 2 data points, but the problem here is that no matter how we adjust m and c, we cannot come up with a line that passes through all of these points.
So, a linear function is not the best function for this data. We can try a polynomial function of degree 2, y = ax^2 + bx + c, and now our job is to learn the parameters a, b and c from the data.
Here also we see that no matter what values of a, b and c we try, we do not get a situation where for every x the output of the function (green curve) is the same as the red point corresponding to that x. So, this function is also not complex enough to capture the relationship between x and y.
The machine tries to find the values of the parameters efficiently, in such a way that the error is minimized, meaning the predictions should be as close to the true outputs as possible.
So, we try again with a degree 3 polynomial.
And once again the same story repeats: we try to adjust the values of a, b, c and d such that the green curve is as close to the red dots as possible.
And we keep trying higher degree polynomials until we reach a degree 25 polynomial, which comes very close to the red points (not exactly overlapping, but still very close) and captures the relationship between x and y.
So, these functions that we try are known as models, since they explain, or try to model, the relationship between y and x.
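The "try a more complex model and see if the fit improves" loop can be sketched with NumPy. Everything about the data here is an assumption: a noisy sine curve stands in for the unknown true relation, and `np.polyfit` is used to find the least-squares parameters for each degree directly rather than by trial and error.

```python
import numpy as np

# Made-up 1-D data: a noisy sine curve stands in for the unknown true relation.
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)


def fit_and_score(degree):
    """Fit a polynomial of the given degree and return its squared-error loss."""
    coefficients = np.polyfit(x, y, deg=degree)  # learned parameters
    predictions = np.polyval(coefficients, x)
    return float(np.sum((y - predictions) ** 2))


# Start simple and move to more complex models, as in the text.
losses = {degree: fit_and_score(degree) for degree in (1, 2, 3, 5)}
for degree, loss in losses.items():
    print(f"degree {degree}: squared error = {loss:.3f}")
```

On data like this the error on the fitted points shrinks as the degree grows, which is the behaviour described above for the green curve moving closer to the red dots.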
And in most real-world problems the data would be high dimensional, meaning we could not just plot it out and visualize which function would be the best fit for the data set.
One question that arises now is: why not try a very complex model from the very beginning?
The reason is this: suppose the true relation between x and y is very close to a line.
And suppose we try to approximate it using a very complex function, say a 100-degree polynomial.
In this case, we could say that the machine should be able to find parameters such that all of them except m and c are zero; that is, we could still start with a very complex function and the machine could learn to ignore the extra terms and come up with a very simple function. But if we try this, we get stuck, because it is very difficult for the machine to learn that all the other 99 parameters should be 0. Although it seems easy for a human being, the machine has to search over infinitely many possible values and find this one peculiar case where all the parameters are 0 except m and c. So, in practice, it would be very difficult for the machine to do that.
How do we know which model is better?
Let’s say we have the following data:
And the following is the true relation between x and y which is something very close to the sine function along with some noise.
Someone approximates this function as
Now let’s say another person comes up with the below function:
And a third person comes up with the following function:
All these 3 functions seem to be very close to the sine function. Now the question is which of these functions is better?
If we try to visually inspect, it's not very clear as all the curves are very close to each other and to the true curve. So, instead of looking at the curves, we could look at some numbers and decide which function is better.
For each input, we know the true output, and we also know the value each of the three functions gives for that input. So, we sum up the squares of the differences between the true output and the predicted value over all the data points. (We take the square so that positive errors do not cancel out negative errors; there is also a calculus-based reason for this.)
In a similar way, we could compute this error value for all the three functions that we approximated.
And now based on this error value we could say that the first approximation was better than the other two. So, a loss function helps us decide how good or bad our model is or how good or bad our parameters are. And it also helps to decide the better model among a given set of models.
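Here is a small sketch of using the squared-error loss to pick between models. The data points and the three candidate functions are made up for illustration; they are not the three approximations from the course.

```python
import math


def squared_error(model, inputs, targets):
    """Sum of squared differences between true and predicted outputs."""
    return sum((y - model(x)) ** 2 for x, y in zip(inputs, targets))


# Made-up data sampled from a sine curve (no noise, to keep the example small).
inputs = [0.0, 0.5, 1.0, 1.5, 2.0]
targets = [math.sin(x) for x in inputs]

# Three hypothetical candidate functions proposed by three different people.
candidates = {
    "line": lambda x: 0.5 * x,
    "cubic": lambda x: x - x ** 3 / 6,
    "constant": lambda x: 0.5,
}

losses = {name: squared_error(f, inputs, targets) for name, f in candidates.items()}
best = min(losses, key=losses.get)

for name, loss in losses.items():
    print(f"{name}: {loss:.4f}")
print("best model:", best)  # best model: cubic
```

The model with the smallest loss is the one we prefer, without having to eyeball the curves.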
We have been provided with the Data and the task as well:
We have also proposed a model as the following
Let’s say somehow we also got the parameters of this model as:
Then we can compute the error value of this model using the below:
Till now we have the Data, Task, Model, and Loss function. We don’t have a way to learn the parameters of the model.
The Learning Algorithm helps us learn the parameters of the model.
We can think of parameter estimation as a search problem. To start off, we could just assume that the values of the parameters lie between -20 and 20.
We could then fix the values of a and b both at -20 and search across c (try out different values from -20 to 20): for each value we plug a, b and c into the model equation, compute the predicted outputs, and from there compute the loss, to see which value of c gives the best result. Then we move b to its next value and repeat the process, and so on. In other words, over this entire space of -20 to 20 for a, b and c, we compute the loss for every combination by plugging the values into the model and then into the Loss function equation, keeping track of the minimum loss as we go.
At the end of this search process, we will get some value of a, b and c for which the loss is minimized. And we will just output those values. And that would be the parameter configuration for which the loss is minimized.
So, this is a very brute force approach. In practice, it would not be feasible because we would not have just 3 parameters, we would have thousands of them. So, what we need is an efficient way of computing these parameters.
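The brute-force search can be sketched in a few lines. The model form below (a quadratic in x) and the data are assumptions made for the example, and only integer parameter values are tried.

```python
import itertools


def model(x, a, b, c):
    """Assumed model for illustration: y = a*x^2 + b*x + c."""
    return a * x ** 2 + b * x + c


def loss(a, b, c, inputs, targets):
    """Squared-error loss of the model on the data."""
    return sum((y - model(x, a, b, c)) ** 2 for x, y in zip(inputs, targets))


# Made-up data generated from a = 2, b = -3, c = 5, so the answer is known.
inputs = [-2, -1, 0, 1, 2]
targets = [model(x, 2, -3, 5) for x in inputs]

# Try every integer combination in [-20, 20] and keep the one with minimum loss.
grid = range(-20, 21)
best_params = min(
    itertools.product(grid, repeat=3),
    key=lambda params: loss(*params, inputs, targets),
)

print(best_params)  # (2, -3, 5)
```

Even this tiny grid has 41 x 41 x 41 = 68,921 combinations, and every extra parameter multiplies that by another 41, which is why the approach stops being feasible very quickly.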
So, now this is converted to an optimization problem. We need to find values of a, b and c for which the Loss value is minimized. So, we have a function(Loss function) which depends on the values of a, b and c and we want to find the values of a, b, and c such that the value of that function is minimized.
And the way to compute the parameters which minimize the Loss function comes from Calculus: we take the partial derivative of the Loss function with respect to each parameter and equate it to 0 to get the parameter values for which the function’s value is minimum.
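For the simplest case, a line y = mx + c with the squared-error loss, setting the two partial derivatives to 0 can be solved by hand. The sketch below uses that standard closed-form result; the function name and the data are made up for the example, and models with more parameters generally need an iterative method instead.

```python
def fit_line(inputs, targets):
    """Least-squares line y = m*x + c, from setting dL/dm = 0 and dL/dc = 0."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(targets) / n
    # dL/dc = 0 gives c = mean_y - m * mean_x.
    # Substituting that into dL/dm = 0 gives the ratio below for m.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, targets))
    sum_xx = sum((x - mean_x) ** 2 for x in inputs)
    m = sum_xy / sum_xx
    c = mean_y - m * mean_x
    return m, c


# Made-up data lying exactly on y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]

m, c = fit_line(inputs, targets)
print(m, c)  # 2.0 1.0
```

No search over a grid is needed here: the parameters come straight out of the data.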
How do we evaluate our model?
Let’s say we are building an image classifier. We pass the data through the model, predict the output, and we also know the true values/labels.
And we can just compute the accuracy of the model: for each instance, we compare the output predicted by the model with the true output. So, in the above image, out of the total 7 cases provided, the model predicted the correct output in 4 cases. The accuracy, in this case, would be 4/7, or about 57%.
So, this is one way of evaluating the models. In many practical situations, we might look at the top three outputs given by the model and we are okay as long as one of those three matches the true output. And this is termed as Top-3 accuracy or in general Top-k accuracy if we are looking at the top k outputs.
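Both measures are easy to compute by hand. In the sketch below the labels and the ranked model outputs are made up, chosen so that 4 of the 7 top predictions are correct, as in the example above.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of instances where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def top_k_accuracy(true_labels, ranked_predictions, k):
    """Fraction of instances whose true label is among the top k outputs.

    ranked_predictions[i] lists labels from most to least likely.
    """
    hits = sum(t in ranked[:k] for t, ranked in zip(true_labels, ranked_predictions))
    return hits / len(true_labels)


# Made-up true labels and ranked model outputs for 7 images.
true_labels = ["cat", "dog", "cat", "bird", "dog", "cat", "bird"]
ranked_predictions = [
    ["cat", "dog", "bird"],
    ["dog", "cat", "bird"],
    ["dog", "cat", "bird"],
    ["bird", "cat", "dog"],
    ["cat", "bird", "dog"],
    ["cat", "bird", "dog"],
    ["dog", "cat", "bird"],
]
top_predictions = [ranked[0] for ranked in ranked_predictions]

print(accuracy(true_labels, top_predictions))                # 4 of 7 correct
print(top_k_accuracy(true_labels, ranked_predictions, k=2))  # 5 of 7 correct
```

Top-k accuracy can only go up as k grows, since a larger k gives the model more chances to include the true label.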
Now the question arises: how is this different from the Loss function?
Accuracy is much more interpretable. For example, if we say accuracy is 60%, that means out of 100 cases the machine would predict 60 of them correctly, whereas if we are told that the Squared Error loss is 0.4, it’s hard for us to judge how good or bad our model is.
Here is also some practical intuition for why we should have one measure as the Loss function and different measures for evaluating the model:
Think of a practical application where we are building an autonomous car, and one module decides whether to apply the brake or not depending on whether there is an obstacle (like a dog).
So, the model’s job is to apply the brake whenever there is an obstacle, and if there is no dog (let’s say the dog is the only obstacle on the road), the car should just keep going. The standard evaluation metrics for this case are precision and recall, and we can understand them using a Venn diagram.
In the below image, the highlighted(circled) portion indicates all the cases when there was an obstacle on the road.
And the highlighted portion(in black) in the below image represents all the cases when the model decided to apply the brake.
In a perfect world, we would like the two circles to be perfectly overlapping meaning every time there was a dog we applied the brake, and whenever we applied the brake there was some obstacle. We never applied the brake when there was no obstacle or dog.
The brake circle represents the total number of times we applied the brake, and the intersection region represents the number of times we correctly applied it. So, precision is the size of this intersection (the times we correctly applied the brake) divided by the total number of times we applied the brake.
And recall in this case is the number of times there was a dog and we applied the brake (the intersection) divided by the total number of times the dog was there. So, precision is very intuitive: of all the times you took an action, how many times were you correct?
And recall is: of all the times you were supposed to take an action, how many times did you actually take it?
So, precision and recall are what we could use as the Evaluation metrics.
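The two circles of the Venn diagram translate directly into counts. The eight driving situations below are made up for the sketch, and the function assumes the brake was applied at least once and an obstacle appeared at least once (otherwise the divisions are undefined).

```python
def precision_recall(obstacle_present, brake_applied):
    """Precision and recall of the braking decisions."""
    # Intersection of the two circles: an obstacle was there AND we braked.
    both = sum(o and b for o, b in zip(obstacle_present, brake_applied))
    precision = both / sum(brake_applied)    # of all brakes, how many were right
    recall = both / sum(obstacle_present)    # of all obstacles, how many we braked for
    return precision, recall


# Made-up log of 8 situations on the road.
obstacle_present = [True, True, True, True, False, False, False, False]
brake_applied = [True, True, True, False, True, True, False, False]

precision, recall = precision_recall(obstacle_present, brake_applied)
print(precision, recall)  # 0.6 0.75
```

Here the car braked 5 times, 3 of them correctly (precision 3/5), and there were 4 obstacles, 3 of which it braked for (recall 3/4).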
But if we were to train this model, we could use a Loss function which tries to maximize the distance between the car and the obstacle(we don’t want to apply the brake very close to the dog, there might be a possibility of hurting the dog or some unknown event there). So, that’s how we would like to train our model to not just learn when to apply the brake but to also maximize the distance between the car and obstacle.
So, as is clear from this example the Loss function and Evaluation metrics are very different.
All the work till now, be it the Data, the Task definition, or training the model so that the Loss is minimized, was on the training data. Once the model is trained on the training data, it needs to perform well on the test set as well. So, the evaluation is never on the training data; it’s always on the test data.
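Keeping the two sets apart is just a matter of splitting the (x, y) pairs before training. The helper below is a hypothetical sketch; the 80/20 split, the fixed seed and the toy data are arbitrary choices for the example.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction of them for evaluation."""
    shuffled = samples[:]                   # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Made-up (x, y) pairs.
samples = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(samples)

print(len(train_set), len(test_set))  # 8 2
```

The model's parameters are learned from `train_set` only, and the evaluation metric is computed on `test_set`, which the model has never seen.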
So, with this, we complete the jars of Machine Learning.
This article covers the content covered in the Expert Systems — 6 Jars module of the Deep Learning course and all the images are taken from the same module.