Histogram
In the previous article, we saw why the histogram is the preferred way of representing quantitative data and how to choose a suitable bin size. In this article, we look at how to answer questions about percentages or proportions, such as “In what percentage of matches did Sachin score less than 10 runs?”. As with frequency tables and frequency plots, such questions are hard to answer directly from a histogram.
Below is a histogram of the runs scored by Sachin Tendulkar in all the ODIs he played, using a bin size of 10.
Suppose we want to answer a question about percentages, such as “In what percentage of matches did Sachin score less than 10 runs?”.
A histogram of raw counts does not answer this directly, just as frequency tables and frequency plots did not. The solution we used for frequency charts applies here too: the Relative Frequency Histogram.
To plot it, we compute the frequency of each class interval (the sum of the frequencies of all the values in the interval) and the total frequency (across all bins), and then divide the frequency of each interval by the total frequency.
Once the relative frequencies are computed, we can plot the histogram again. The x-axis does not change: it has the same class intervals as before. The heights of the bars are now proportional to the relative frequencies (fractions from 0 to 1, or percentages) rather than the actual frequencies.
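As a minimal sketch of the computation described above, here is how the relative frequencies could be obtained in plain Python. The scores are made-up values for illustration, not Sachin's actual data, and the function name relative_frequencies is our own choice. We assume scores are non-negative integers.

```python
def relative_frequencies(scores, bin_size=10):
    """Return one fraction per class interval [0, 10), [10, 20), ..."""
    n_bins = max(scores) // bin_size + 1
    counts = [0] * n_bins
    for s in scores:
        counts[s // bin_size] += 1          # frequency of each class interval
    total = len(scores)                     # total frequency across all bins
    return [c / total for c in counts]      # divide each count by the total


# Hypothetical scores, NOT real match data.
scores = [0, 4, 7, 12, 18, 25, 33, 41, 58, 102]
rel = relative_frequencies(scores)
print(rel[0])  # fraction of innings with a score from 0 to 9
```

The bar heights of the relative frequency histogram are exactly the values in rel, and they always add up to 1.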
Now it’s easy for us to answer questions like “In what percentage of matches did he score less than 10?”. That’s what relative frequency histograms are used for. They also make it easier to answer questions that compare two sets of data.
Here we have one data set for Sachin Tendulkar and another for Ricky Ponting, plotted using absolute counts. The y-axis shows the number of matches (frequency) in which the score fell in the corresponding x-axis interval. It looks like Sachin had more low scores (if we count a low score as one from 0 to 20).
But Sachin also played a lot more ODIs than Ricky Ponting. So comparing the absolute frequencies is not really fair, and that’s the problem with any absolute quantity.
Sachin scored 0 to 10 in roughly 140 matches. How good or bad this number is depends on the total number of matches he played.
If 140 is out of, say, 1000 matches, then it’s not very bad, whereas 140 out of 200 matches would not reflect a good performance. So, the total number of matches, and hence the percentage, becomes important.
Hence, when we compare datasets, it is important to compare the relative values and not the absolute values.
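The arithmetic behind this point is simple enough to check directly. The sketch below uses the two hypothetical totals from the text (1000 and 200 matches), which are illustrative numbers and not Sachin's real match count. The helper name share is our own.

```python
def share(count, total):
    """Fraction of matches that `count` represents out of `total`."""
    return count / total


low_scores = 140
for total_matches in (1000, 200):
    # The same absolute count means very different things for different totals.
    print(f"{low_scores} of {total_matches} matches = {share(low_scores, total_matches):.0%}")
```

The same count of 140 corresponds to 14% in one case and 70% in the other, which is why the relative value is the fair basis for comparison.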
Here are the relative frequency histograms for the above data.
From this relative frequency plot, we can now see that there is not much difference between Sachin and Ponting in the percentage of matches in which they scored between 0 and 20.
It looks like both Sachin and Ponting have about 40% of their scores below 20. We know that Sachin’s average was around 44. If we bring Virat Kohli’s data into the picture, we know that his ODI average is around 59.
So, the question is what the trend would look like for Virat. Because his overall average is higher than Sachin’s, does he have more peaks in the center, say in the range 40 to 60, and very few counts in the initial region of 0 to 20? If not, what does the histogram of Virat’s scores look like?
And here is the answer to this question:
From the above plot, we can see that Virat also has a lot of scores in the 0 to 20 range, about 37%. So it looks like even these batsmen are vulnerable early in their innings: it is easier to get them out while their score is between 0 and 20, and once they cross that score, they tend to score much better.
If we look at Virat’s plot, his scores are spread fairly evenly across the higher intervals, e.g. 60 to 70 and 70 to 80: these bars have almost equal height. He has a lot of high scores, which is why his average is a bit higher, but he also has a lot of low scores (not very different from Sachin, even though they played in different eras). So, a relative frequency histogram immediately brings out the comparison between different datasets and helps us quickly understand the trends across datasets or categories.
The following is the procedure for drawing a relative frequency histogram:
What if we want to compare more players, that is, multiple histograms?
One option is to draw a separate histogram for each player, as depicted below, and compare them to test some hypotheses.
Again, the same interesting pattern repeats here: all 4 players have a high proportion of low scores (0 to 20).
We can plot individual histograms and compare them, but as the number of players increases, this becomes cumbersome and hard to visualize.
The other option is to draw all the histograms in one plot:
Here we have different colored bars corresponding to different players. Because the bars are overlaid, they sit at different depths: the cyan (light blue) bars appear on top and hide the bars behind them. This makes it hard to understand what is happening and very difficult to distinguish between the players.
The other option is to have something similar to a grouped bar chart.
So, we have a different colored bar for every player, and for each score interval, we plot the bars for all of the players side by side. It seems okay, but the problem is that we have lost the sense of individual trends. Earlier it was easy to see the trend in Sachin’s scores, but in the above plot we struggle to identify the individual patterns. So, this may be a reasonable way of comparing multiple histograms, but it’s still not very easy to read.
The best option is to use the Frequency Polygon.
In the plot below, we have the histogram for Sachin with a frequency polygon drawn on top of it: we draw a dot at the midpoint of each interval, at the top of its bar, and connect the dots across all the intervals. We still get to see the overall trend, and for each interval we can still read the value by looking at the y-axis value of the dot on top of the bar.
So, this is what the frequency polygon looks like.
Procedure to draw frequency polygon:
Ideally, we should start the plot from the value 5 (the midpoint of the 0 to 10 interval), but we have connected it to 0 as well (plot below).
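The points of a frequency polygon can be computed as in the rough sketch below. The scores are made-up values, and the function name frequency_polygon_points and its anchor_at_zero option are our own. We also assume that "connected to 0" means adding an extra point at the origin (0, 0); the original plot may have anchored the line differently.

```python
def frequency_polygon_points(scores, bin_size=10, anchor_at_zero=True):
    """Return (x, y) points: one per class interval, placed at its midpoint."""
    n_bins = max(scores) // bin_size + 1
    counts = [0] * n_bins
    for s in scores:
        counts[s // bin_size] += 1
    # x is the midpoint of each interval (5, 15, 25, ...), y is its frequency.
    points = [(i * bin_size + bin_size / 2, c) for i, c in enumerate(counts)]
    if anchor_at_zero:
        # ASSUMPTION: the line is additionally tied down to the origin.
        points.insert(0, (0, 0))
    return points


# Hypothetical scores, NOT real match data.
scores = [0, 4, 7, 12, 18, 25]
points = frequency_polygon_points(scores)
print(points)  # [(0, 0), (5.0, 3), (15.0, 2), (25.0, 1)]
```

Connecting these points in order with straight lines gives the polygon; passing anchor_at_zero=False starts the line at the first midpoint instead.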
This frequency polygon is in some sense much neater than the histogram: we no longer see the big bars, but we still get all the information the histogram gives us. We see the overall trend, and we can read the value for each class interval.
So, in terms of what we get from this plot, it is not much different from what a histogram conveys, and the frequency polygon occupies less space than a histogram.
This allows us to plot multiple frequency polygons at the same time. Now we have the frequency polygons for all these players, and we can see individual trends as well as compare multiple players at once.
Because of the difference in the total number of matches played, it seems like Virat Kohli was dismissed for a score of 0 to 10 only about 60 times, compared to about 140 times for Sachin Tendulkar. But that’s not the right comparison; we should plot a relative frequency polygon. Here is what the relative frequency polygon plot looks like:
From this plot, we can see that the share of low scores lies in the range of 27 to 33% for all 5 players. For Virat Kohli (the green line), the curve has moved up in the high-scoring ranges: for scores above 60, the green line is above the other 4 lines. That means he has a higher percentage of high scores than the other 4 players, and hence his average is higher.
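To make several players comparable on one plot, each player's counts can be divided by that player's own number of matches while all players share the same class intervals. The sketch below illustrates this with two hypothetical players "A" and "B" and made-up scores; the function name relative_polygons is our own.

```python
def relative_polygons(players, bin_size=10):
    """Map each player name to a list of relative frequencies on shared bins."""
    # Use the highest score across ALL players so every line has the same x-axis.
    n_bins = max(max(scores) for scores in players.values()) // bin_size + 1
    result = {}
    for name, scores in players.items():
        counts = [0] * n_bins
        for s in scores:
            counts[s // bin_size] += 1
        # Normalise by this player's own number of matches.
        result[name] = [c / len(scores) for c in counts]
    return result


# Hypothetical players and scores, NOT real match data.
players = {"A": [0, 5, 12, 30], "B": [3, 45]}
rel = relative_polygons(players)
print(rel["A"][0], rel["B"][0])  # 0.5 0.5
```

Player A has two scores in the 0 to 10 interval and player B only one, yet both have the same share (0.5) because A played twice as many matches. This is the effect the relative frequency polygon is meant to expose.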
We can draw frequency polygons for continuous quantitative data as well. Here is an example from the agriculture domain.
A frequency polygon helps us understand trends and compare different datasets or categories.
To answer questions such as “In how many matches did Sachin score less than a specified value, say 60?”, the Cumulative Frequency Polygon is the way to go.
In a Cumulative Frequency Polygon, the value for each class interval is its own frequency plus the sum of the frequencies of all the class intervals before it.
Now it becomes very easy to answer questions like in how many matches the player scored less than 50, less than 60, and so on. We just look at the relevant interval and read the point corresponding to it. For example, for the number of matches in which he scored less than 80, we look at the interval 70 to 80; the value for that interval in the cumulative frequency polygon is about 370, so that’s the answer to the question.
The y-axis value for the last interval is just the size of the dataset, which in this case is the total number of ODIs Sachin played.
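The running sum behind a cumulative frequency polygon can be sketched with itertools.accumulate from the standard library. The scores below are made-up values, not Sachin's data, and the function name cumulative_counts is our own.

```python
from itertools import accumulate


def cumulative_counts(scores, bin_size=10):
    """cum[i] = number of scores below the upper edge of class interval i."""
    n_bins = max(scores) // bin_size + 1
    counts = [0] * n_bins
    for s in scores:
        counts[s // bin_size] += 1
    # Each entry adds the frequencies of all the intervals before it.
    return list(accumulate(counts))


# Hypothetical scores, NOT real match data.
scores = [0, 4, 7, 12, 18, 25, 33, 41, 58, 102]
cum = cumulative_counts(scores)
below_80 = cum[80 // 10 - 1]   # read the value at the 70 to 80 interval
print(below_80)                # 9
print(cum[-1] == len(scores))  # True: the last value is the dataset size
```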
And if the question is in what percentage of matches he scored less than, say, 60, then as always we switch to a plot of relative values, in this case the Cumulative Relative Frequency Polygon.
And to compare multiple sets of data, we can plot multiple cumulative relative frequency polygons.
Here once again an interesting pattern emerges for Virat Kohli, whose curve is the bottom-most one in the above figure. If we look at the class interval 40 to 50, we can see that Virat scored less than 50 runs in about 60% of his matches. For the other players this percentage is higher. The topmost curve is Sehwag’s: he scored less than 50 runs in about 80% of the matches he played, which means he reached high scores less consistently.
So, Virat scored 50 or more runs in about 40% of the matches he played, whereas Sehwag did so in only about 20% of his matches. This sort of trend becomes very easy to read from a cumulative relative frequency polygon.
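Reading "less than 50" off a cumulative relative frequency polygon, and taking its complement to get "50 or more", can be sketched as follows. The scores are made-up values and do not belong to any real player; the function name cumulative_relative is our own.

```python
from itertools import accumulate


def cumulative_relative(scores, bin_size=10):
    """cum_rel[i] = fraction of scores below the upper edge of class interval i."""
    n_bins = max(scores) // bin_size + 1
    counts = [0] * n_bins
    for s in scores:
        counts[s // bin_size] += 1
    total = len(scores)
    return [c / total for c in accumulate(counts)]


# Hypothetical scores, NOT real match data.
scores = [0, 4, 7, 12, 18, 25, 33, 41, 58, 102]
cum_rel = cumulative_relative(scores)
below_50 = cum_rel[50 // 10 - 1]   # value at the 40 to 50 interval
at_least_50 = 1 - below_50         # complement: scores of 50 or more
print(f"below 50: {below_50:.0%}, 50 or more: {at_least_50:.0%}")
```

Because the curve ends at 1 (100%), the share of scores at or above any threshold is simply one minus the value read from the curve.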
In this article, we saw that relative frequency plots are useful for answering questions about percentages and proportions and for comparing different sets of data. We also saw how frequency polygons, relative frequency polygons, and cumulative relative frequency polygons help us see trends across different categories and multiple datasets, where a histogram would have presented a very complicated picture.
References: PadhAI