In the previous article, we discussed the relationship between measures of centrality for different types of distributions. In this article, we discuss how to compute the mean, median, and mode from a histogram.
Say we are provided with just the histogram and not the dataset, and we are asked to compute the values of these measures of centrality within some approximate range of the actual values.
As a first step, we convert the histogram into a frequency table. For example, for the interval 0–10 we read a frequency of roughly 125 from the histogram, for the interval 10–20 the frequency is roughly 58, and so on for the other intervals.
Let’s first see how to compute the ‘median’ value:
1. First, we compute the cumulative frequency: for every interval, we add the frequency of that interval and of all the intervals before it. For the very first interval, the cumulative frequency is the same as the frequency of the interval. The cumulative value at the last interval then covers all the data points, so it tells us the total number of data points in the dataset.
2. Now that we have the number of data points, we can find the location of the median. In this case there are 452 points, which is an even number, so the median locations are the two central ones: 452/2 = 226 and 226 + 1 = 227.
3. Once we have the central locations, we locate the interval containing them (locations 226 and 227 in this case). This is the interval whose cumulative frequency is greater than or equal to the central locations, while the cumulative frequency of the interval before it is less than them.
We can see that the interval 20–30 contains the points at locations 186 to 229 (the points up to location 185 fall in the intervals covering 0–20), and hence the center of the data (locations 226 and 227) lies in the interval 20–30.
So, we have located the interval in which the median lies, but we do not know the actual values of the 44 points (as per the frequency) that lie in this interval.
We can therefore only compute the median approximately: we take the median to be the mid-point of this interval, i.e. 25 (the average of the interval endpoints 20 and 30).
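The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `estimate_median` and the coarse three-bin grouping are assumptions made for brevity. The counts come from the numbers quoted in the text (185 points up to 20, 44 points in 20–30, 452 points in total).

```python
from bisect import bisect_left
from itertools import accumulate


def estimate_median(edges, freqs):
    """Estimate the median from bin edges and per-bin frequencies (hypothetical helper)."""
    # Step 1: cumulative frequency; cum[i] counts the points in bins 0..i.
    cum = list(accumulate(freqs))
    n = cum[-1]  # the last cumulative value is the total number of points

    # Step 2: 1-based central locations (they coincide when n is odd).
    lo_loc = (n + 1) // 2
    hi_loc = n // 2 + 1

    def midpoint_of_bin_holding(loc):
        # Step 3: first bin whose cumulative frequency reaches `loc`.
        i = bisect_left(cum, loc)
        return (edges[i] + edges[i + 1]) / 2

    # Average the two mid-points in case the central locations straddle two bins.
    return (midpoint_of_bin_holding(lo_loc) + midpoint_of_bin_holding(hi_loc)) / 2


# Coarse grouping (assumption): 0-20, 20-30, 30-200 with counts 185, 44, 223.
edges = [0, 20, 30, 200]
freqs = [185, 44, 223]
print(estimate_median(edges, freqs))  # 25.0
```

With the full frequency table, the same function would be called with all the bin edges and all the frequencies; the result for this example stays at 25 because locations 226 and 227 both fall in the 20–30 bin.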
Let’s look at the intuition behind this approach:
For this particular case, we have the true data and if we compute the median value, we get the value as 28. Here is the same data as a histogram:
We know what the cumulative frequency looks like up to this interval, and that the median lies at locations 226 and 227.
So, the situation is as follows:
There are 185 data points in the interval 0–20, 44 data points (including the median) in the interval 20–30, and the remaining 223 data points in the interval 30–200.
For this dataset, we know exactly which 44 elements belong to the interval 20–30. Since this interval ends at location 229, we can zoom into these 44 elements, read the values at locations 226 and 227, and compute the median.
Now, these 44 points could have been very different (remember, we are trying to estimate the median from a histogram and are not given the actual data points). If the 44 points were instead the data points below, the median would be 26 (the average of 25 and 27).
So, the median value varies depending on the actual data. In the absence of the data, the best we can do is to guess that the median is at the center of this interval, i.e. 25. In some cases we will over-estimate and in others under-estimate, but as long as the median lies in this interval the estimate cannot be off by more than half the interval width, so we will still be close to the true median.
And since the true median, in this case, is 28 and we estimated it as 25, that means we made an error of ~10% in this case.
Some questions are still unanswered. In the above case the intervals are of size 10, which is why we knew that by taking the mid-point we can be wrong by at most about 5 units. If the intervals are bigger, what impact does that have on the error? To understand this, let’s look at another example from the agriculture domain.
Here, on the x-axis we have the total yield in units, and on the y-axis we have the number of farms whose yield falls in each interval. This is a right-skewed distribution, and the bin size is 10000.
We start with the standard procedure of computing the cumulative frequency; since there are 60 bins in this case, we plot it rather than tabulate it.
So, every point on this chart corresponds to the interval below it on the x-axis and not only includes the frequency of that interval but all the intervals before it. So, the last element tells us the total number of points in the data.
Once we have the number of data points, we can compute the central location and we can also compute the median value from the true data.
We have the central location as 806, so we need to look for the interval such that its cumulative frequency is greater than 806 and the cumulative frequency of the interval before this interval is less than 806. From the chart, we can say that interval happens to be 90000–100000.
Now our estimation of the median would be the mid-point of this interval which comes out as 95000.
The true median is 96080 and the estimate is 95000, so in absolute terms we have made an error of ~1000 units, but in relative terms the error is only ~1%.
So, if the histogram intervals are bigger, we make larger errors in absolute terms, but the relative error can still be small, as it is here.
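To make the comparison between the two examples concrete, here is a small sketch that computes both kinds of error from the true and estimated medians quoted above. The helper names `absolute_error` and `relative_error` are illustrative, not from any library.

```python
def absolute_error(estimate, truth):
    """Size of the miss, in the units of the data."""
    return abs(estimate - truth)


def relative_error(estimate, truth):
    """Size of the miss as a fraction of the true value."""
    return abs(estimate - truth) / abs(truth)


# (estimated median, true median) for the two examples in the text.
examples = {
    "bin size 10": (25, 28),
    "bin size 10000": (95000, 96080),
}

for label, (estimate, truth) in examples.items():
    print(label, absolute_error(estimate, truth), f"{relative_error(estimate, truth):.1%}")
# bin size 10 3 10.7%
# bin size 10000 1080 1.1%
```

The absolute error grows with the bin size, but because the values themselves are much larger in the agriculture data, the relative error is smaller there.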
Computing mean value from a histogram
Given the histogram, we can prepare a frequency table
Here is the procedure to compute the mean value:
1. We compute the mid-point of each interval.
2. We then multiply the mid-point of each interval by the frequency of the interval.
3. We sum up the products of mid-point and frequency (as computed in step 2) over all the intervals. In this case, the sum comes out as 18880.
4. To compute the mean value, we divide this sum by the total number of data points.
Applying this procedure to the given histogram, we get an estimated mean of 18880 / 452 ≈ 41.77. The mean computed from the data (say the data is available) is 40.76, so the error is about 2.5%.
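The procedure above can be sketched as follows. The function name `estimate_mean` and the small four-bin frequency table are made up for illustration (the article's full table is not reproduced here); only the final line uses the article's own numbers, the sum 18880 and the 452 data points.

```python
def estimate_mean(edges, freqs):
    """Estimate the mean from bin edges and per-bin frequencies (hypothetical helper)."""
    # Step 1: mid-point of each interval.
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    # Steps 2 and 3: multiply each mid-point by its frequency and add up.
    weighted_sum = sum(m * f for m, f in zip(midpoints, freqs))
    # Step 4: divide by the total number of data points.
    return weighted_sum / sum(freqs)


# Hypothetical frequency table, NOT the article's data.
edges = [0, 10, 20, 30, 40]
freqs = [2, 5, 8, 5]
print(estimate_mean(edges, freqs))  # 23.0

# The article's own totals: sum of products 18880 over 452 points.
print(round(18880 / 452, 2))  # 41.77
```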
The intuition behind this procedure
To compute the mean value, we sum up all the data points and divide this sum by the number of data points.
Now, we can write the sum of all elements as the (sum of elements in the 1st interval + sum of all the elements in the 2nd interval + …… + sum of all the elements in the last interval)
Let’s take one of the intervals and zoom into it. From the histogram we know how many elements the interval contains, but not their individual values, so we cannot compute their sum directly.
To deal with this problem, we assume that each element is equal to the mid-point of the interval. There will be some elements that we over-estimate and some that we under-estimate. These over-estimates roughly balance out the under-estimates, so we are not far off when computing the sum of all the elements in the interval.
Let’s say we have the below 10 points as the actual data for the 8th interval (70–80), so the true sum is 741. If we approximate all 10 elements as 75, the estimated sum is 750 (75 * 10), which is not far off from the true sum for this interval.
The same argument holds for all the other intervals as well, so it holds for the overall sum too. Hence the total estimated sum is not going to be very different from the true sum.
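The argument can be written compactly as a formula. The notation is an assumption introduced here, not taken from the article: there are $k$ intervals $I_1, \dots, I_k$, interval $I_i$ has frequency $f_i$ and mid-point $m_i$, and $N$ is the total number of data points.

```latex
\bar{x}
  = \frac{1}{N} \sum_{j=1}^{N} x_j
  = \frac{1}{N} \sum_{i=1}^{k} \sum_{x_j \in I_i} x_j
  \approx \frac{1}{N} \sum_{i=1}^{k} f_i \, m_i ,
\qquad
N = \sum_{i=1}^{k} f_i .
```

The only approximation is replacing the inner sum over an interval by $f_i \, m_i$; in the example above that means replacing the true sum 741 by $10 \times 75 = 750$.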
And here also we have the question: what if the class intervals are bigger?
Taking the same agricultural data, we can compute the true mean and estimate the mean using the procedure we discussed. The error in absolute terms is very high, but in relative terms it is < 1%.
So, if the class intervals are bigger, we would still make a very small error in relative terms.
Note: we have taken the mid-point of an interval as the average of the values at its two ends; for example, for the interval 70–80 we compute the mid-point as (70 + 80) / 2 = 75. Generally the right-hand endpoint is not included in the interval, so for integer-valued data the actual mid-point would be (70 + 79) / 2 = 74.5.
Computing mode value from a histogram
Let’s say we have been given the below histogram and we have to find the mode value
As per the definition, ‘mode’ is the most frequent value in the dataset.
From the histogram, we can say that 60–70 is the most frequent interval, but it could be the case that the most frequent value is contained in some other interval.
Consider the smallest bar, which corresponds to the interval 0–10. The frequency of this interval is 2, which means it contains only 2 elements. Suppose both of these elements have the value 2.
All the other intervals have a higher frequency than the 0–10 interval, but suppose none of the values in these other intervals repeats, so each appears only once in the dataset. Then the value 2 is the only value that repeats, which makes it the mode, and it lies in the interval with the smallest bar.
So, the takeaway is that if the class interval or bin size is greater than 1, it is not possible to reliably estimate the mode from the histogram, because we could have this kind of distribution where the shortest bar actually contains the mode value.
If the bin size is equal to 1, then it is trivial to compute the mode: we can just look at the heights of all the bars, and the tallest bar corresponds to the mode value.
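The situation described above, where the shortest bar hides the mode, is easy to reproduce. The dataset below is invented purely for illustration: two copies of the value 2 in the 0–10 bin, and only unique values in every other bin.

```python
from collections import Counter

# Hypothetical data: only the value 2 repeats; every other value is unique.
data = [2, 2, 51, 54, 58, 60, 62, 63, 65, 67, 69, 71, 74, 77]
bin_size = 10

# Histogram: count the points per bin, keyed by the left edge of the bin.
bin_counts = Counter((x // bin_size) * bin_size for x in data)
tallest_bin_start, tallest_count = bin_counts.most_common(1)[0]

# True mode: the single most frequent value in the raw data.
mode_value, mode_count = Counter(data).most_common(1)[0]

print(f"tallest bar: {tallest_bin_start}-{tallest_bin_start + bin_size}")  # tallest bar: 60-70
print(f"mode: {mode_value}")  # mode: 2
```

Here the 0–10 bar has height 2, the smallest in the histogram, yet it contains the mode, while the tallest bar (60–70) contains only values that occur once.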