Uses of Histograms in Machine Learning

In the last few articles, we discussed histograms in great detail, and it is clear by now that a histogram reveals many trends and interesting patterns in the data. Histograms are also used in Machine Learning for various purposes, and in this article we discuss those uses. So, let’s get started.

1. Identifying Discriminatory features

Let’s say we are trying to build a machine learning system that takes a lot of information about a patient, such as Age, Height, Weight, Cholesterol, Sugar Level, and so on, and decides whether the patient has a health risk or not.

If we go to a Doctor, the Doctor will not ask for so many readings. The Doctor might ask only for the few that matter; for example, Cholesterol may be important in deciding whether a patient has a heart risk or some other health condition.

So, a Doctor might know which factors (attributes) are important to look for, but a Machine Learning system does not know this in advance. The features that help us identify the health risk in this case, or in general the features that help us understand the output as a function of the input, are termed Discriminatory features.

The data set would look like the one below, where each column is one attribute and each row represents the data for one patient. These are past records, so we know whether each patient had a health condition or not (let’s say we got this data from a hospital we collaborated with).

Now we want to understand which features (attributes) we should really look at or, in other words, which features really matter in deciding if someone has a health risk or not.

One thing we could do is split this data set into two sets, one for the patients having a risk (past records) and the other for the patients not having a risk. For each of these sets, we could individually draw the histogram or frequency polygon of an attribute, like the one depicted below:

From the above plot, we can say that for people who don’t have a risk, the max heart rate appears to be on the lower side, and for people who are at risk, the max heart rate seems to be at the higher end.

Just like we have the frequency polygon drawn for the Heart Rate attribute in the above plot, we could draw it for any other attribute as well, such as Cholesterol or Sugar. If that attribute really matters in predicting the output, then the histogram or frequency polygon would look like the above plot: for one set, say the no-risk patients, the values cluster at one end, and for the other set, the values cluster at the other end. The corresponding feature would then be a good input feature for the machine learning system.
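The split-and-compare idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient records are synthetic, and the `histogram` helper, the bin range, and the number of bins are made up for the example rather than taken from any real data set.

```python
import random

def histogram(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so v == hi lands in the last bin and outliers stay in range.
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return counts

# Synthetic, made-up patient records: (max_heart_rate, has_risk).
rng = random.Random(0)
records = [(rng.gauss(130, 10), False) for _ in range(500)]
records += [(rng.gauss(165, 10), True) for _ in range(500)]

# Split the data set by the known outcome.
risk = [hr for hr, has_risk in records if has_risk]
no_risk = [hr for hr, has_risk in records if not has_risk]

# Use the same bin edges for both groups so the bars are comparable.
lo, hi, n_bins = 90, 210, 12
risk_hist = histogram(risk, lo, hi, n_bins)
no_risk_hist = histogram(no_risk, lo, hi, n_bins)

for i in range(n_bins):
    left = lo + i * (hi - lo) / n_bins
    print(f"{left:5.0f}+  no risk: {no_risk_hist[i]:4d}   risk: {risk_hist[i]:4d}")
```

Sharing the bin edges between the two groups is the important design choice here: if each group were binned separately, the bars would not line up and the two shapes could not be compared directly.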

If we plot the frequency polygons for the Height attribute, say the plot looks like this:

It is not necessarily true that people who have a health risk are taller or shorter than those who do not. It might hold in some cases, maybe for arthritis or something of that sort, where people of a certain height might be more susceptible, but for a heart condition it might not really matter. In that scenario, if we see a histogram like the one above, where there is no clear distinction between the trends for the two sets and they look more or less similar, it means Height may not be a good feature for helping the Machine Learning algorithm discriminate patients with risk from patients with no risk.
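One way to put a number on "the two histograms look similar" is histogram intersection: normalize both histograms and sum the bin-wise minimum. This measure is not part of the discussion above; it is swapped in here as one possible illustration, and the data, the helper names, and the bin settings are all assumptions.

```python
import random

def normalized_histogram(values, lo, hi, n_bins):
    """Fraction of values in each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def overlap(group_a, group_b, lo, hi, n_bins=12):
    """Histogram intersection: 1.0 means identical shapes, 0.0 means no shared bins."""
    p = normalized_histogram(group_a, lo, hi, n_bins)
    q = normalized_histogram(group_b, lo, hi, n_bins)
    return sum(min(pi, qi) for pi, qi in zip(p, q))

rng = random.Random(1)
# Made-up data: heart rate differs between the groups, height does not.
hr_no_risk = [rng.gauss(130, 10) for _ in range(500)]
hr_risk = [rng.gauss(165, 10) for _ in range(500)]
height_no_risk = [rng.gauss(168, 8) for _ in range(500)]
height_risk = [rng.gauss(168, 8) for _ in range(500)]

hr_overlap = overlap(hr_no_risk, hr_risk, 90, 210)
height_overlap = overlap(height_no_risk, height_risk, 140, 200)
print(f"heart rate overlap: {hr_overlap:.2f}")
print(f"height overlap:     {height_overlap:.2f}")
```

A low overlap suggests the feature separates the two groups, as with Heart Rate, while an overlap close to 1 suggests it does not, as with Height in this example. The value depends on the chosen bins, so it is a rough guide rather than a definitive test.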

2. Analysing output scores

Histograms also help in analyzing the output scores of a Machine Learning system. Nowadays everybody wants to build a Chatbot and, unfortunately, despite the hype around chatbots, they are nowhere close to satisfactory. So, if we ask a Chatbot “What’s the temperature outside?”, we want it to say something like “It is very hot, the temperature is around 33 degrees Celsius” and not give back some random response.

If we chat with a chatbot beyond 2–3 turns, we start getting random responses. Chatbots are not able to maintain context for longer than 3 or 4 turns, and they are not really at a level where they can replace humans.

Now suppose we want to deploy a Chatbot in a call center environment, and we know that it handles the first 2–3 turns okay but, beyond that, it may or may not handle the conversation properly.

So, what would be useful there is a Machine Learning system that looks at the answer given by the Chatbot and decides whether the Bot’s response is good or not. If it is good, the Bot continues talking to the customer; if the ML system says it is a bad response, a human should take over instead of letting the Bot continue, because it has lost context and has not been able to reply correctly.

Suppose someone develops such a system. Now the question is: how do we check whether it is good or not?

We take some bad responses and some good responses and look at the score that the Machine Learning system assigns to each. Let’s assume that the algorithm assigns a score from 1 to 5; this is what we expect the histogram to ideally look like:

For all the bad responses, we want the score to be less than or equal to 2, whereas for all the good responses, we want the score to be greater than 3, with most of them getting a high score. This is the ideal situation that we expect from a Machine Learning system.

Let’s say we get the below type of histogram:

The red bars in the above histogram are for the bad responses and the green bars are for the good responses. From the plot, we can say that the algorithm that produced it does not have good discriminatory power: some good responses got a score of 2 or below, which means the algorithm is classifying good responses as bad, and some bad responses got a very high score in the range of 4 to 5. If we look closely, the tallest bars for both the good and the bad responses are in the middle score range of 2.5 to 3.5, which is like saying “okay” to everything, and that is what the algorithm is doing.
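The same reading of the plot can be done numerically. The sketch below builds one score histogram per label and measures how many responses land on the wrong side. The (score, label) pairs are hypothetical, and whole-number scores from 1 to 5 are assumed for simplicity, even though a real scorer may output fractional values.

```python
from collections import Counter

# Hypothetical (score, label) pairs; scores are on the 1-to-5 scale assumed in the text.
scored = [
    (1, "bad"), (2, "bad"), (3, "bad"), (3, "bad"), (3, "bad"), (5, "bad"),
    (2, "good"), (3, "good"), (3, "good"), (3, "good"), (4, "good"), (5, "good"),
]

def score_histogram(pairs, label):
    """Count how many responses with the given label received each score 1..5."""
    counts = Counter(score for score, lab in pairs if lab == label)
    return [counts.get(s, 0) for s in range(1, 6)]

good_hist = score_histogram(scored, "good")
bad_hist = score_histogram(scored, "bad")

# Thresholds follow the ideal picture: bad responses score <= 2, good ones > 3.
good_scored_low = sum(good_hist[:2]) / sum(good_hist)  # good responses with score 1 or 2
bad_scored_high = sum(bad_hist[3:]) / sum(bad_hist)    # bad responses with score 4 or 5

print("score:", list(range(1, 6)))
print("good: ", good_hist)
print("bad:  ", bad_hist)
print(f"good responses scored <= 2: {good_scored_low:.0%}")
print(f"bad responses scored  >  3: {bad_scored_high:.0%}")
```

In this made-up sample, both histograms peak at the middle score of 3, which mirrors the "okay for everything" behaviour described above.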

So, this is a very interesting way of analyzing the scores that a machine learning system generates for the different kinds of inputs we have.

References: PadhAI