In the previous article, we discussed what the term ‘percentile’ means and how to compute the frequently used percentiles. In this article, we discuss why we need measures of spread when we already have measures of centrality.
Let’s start with the question “Why do we need measures of spread?”
To answer this question, let’s take two samples and compute their respective ‘mean’ and ‘median’ values:
The ‘mean’ and ‘median’ values are the same in both cases. In the first sample, however, the data points lie close to the mean (the data ranges from 81 to 85 and the mean is 83), so we say that the variability of the data is very low. In the second sample, some data points lie very far from the mean, so we say that the variability of the data is high.
If we summarize the data using the ‘mean’ and ‘median’, then we don’t get to know about the variations in the data points.
Measures of centrality don’t tell us anything about the variability or spread in the data and that’s why we need measures of spread.
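To make this concrete, here is a minimal sketch in Python. The two samples are made-up numbers, not the data from the example above; they are chosen only so that both have a mean and a median of 83 while being spread very differently.

```python
from statistics import mean, median

# Hypothetical samples (NOT the article's data): same mean and median,
# very different spread.
sample_1 = [81, 82, 83, 84, 85]
sample_2 = [31, 57, 83, 109, 135]

for name, sample in [("sample_1", sample_1), ("sample_2", sample_2)]:
    # Both lines report the same mean and median, so these two summaries
    # alone cannot tell the samples apart.
    print(name, "mean:", mean(sample), "median:", median(sample))
```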
Let’s start with the first measure of spread, which is the Range.
Range tells us the difference between the maximum value in the data and the minimum value in the data.
The range makes it clear that the two samples above are not similar even though their measures of centrality are the same: the second sample has a much higher spread (the difference between its two extreme values is 104).
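The range is simple enough to compute by hand, but a small sketch may help. The helper name `value_range` and both samples are assumptions made for illustration; the second sample is constructed to have the same range of 104 mentioned above, but it is not the original data.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

# Hypothetical samples with the same mean (83) but different spreads.
sample_1 = [81, 82, 83, 84, 85]
sample_2 = [31, 57, 83, 109, 135]

print(value_range(sample_1))  # 4
print(value_range(sample_2))  # 104
```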
Let’s take another example. This time we use data from the agriculture domain that represents the yield of wheat from several farms:
We see that most of the values lie roughly between 40 and 60, but one outlier value (which could be from a very large farm or a collection of farms under one name) pulls the range up to a very high number.
That means the ‘range’ is sensitive to outliers, just like the ‘mean’, and this can give a misleading picture of the spread. Although it tells us that there are high values in the dataset, as in the above case, it does not distinguish between outliers and the other data points.
To overcome this sensitivity to outliers, we use the ‘Inter-quartile range’ (IQR), which is the difference between the 75th percentile and the 25th percentile. It also gives us an idea of the spread of the data, but it is not sensitive to outliers (in the image below, the outlier value 633 falls in the last quartile, and that part does not come into the picture when computing the interquartile range).
And here is the computation for IQR for this example:
The IQR comes out to 13.6, which means that the middle 50% of the data lies within an interval of width 13.6.
As mentioned already, the IQR is not sensitive to outliers. Let’s verify this by computing the IQR after dropping the outlier value:
Earlier the IQR was 13.6; with the outlier dropped it is 13.4. So the presence of the outlier changes the interquartile range very little.
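The same comparison can be sketched in Python. The yields below are invented for illustration (only the outlier value 633 is borrowed from the example), so the numbers printed will not match 13.6 and 13.4. The sketch also assumes the "inclusive" percentile convention of `statistics.quantiles`; other conventions give slightly different quartiles.

```python
from statistics import quantiles

def value_range(values):
    return max(values) - min(values)

def iqr(values):
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    # method="inclusive" is one of several percentile conventions.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical wheat yields with a single outlier (633.0).
yields = [41.2, 44.5, 47.0, 49.8, 52.3, 55.1, 58.6, 633.0]
without_outlier = [y for y in yields if y != 633.0]

# The range collapses once the outlier is removed; the IQR barely moves.
print("range:", value_range(yields), "->", value_range(without_outlier))
print("IQR:  ", iqr(yields), "->", iqr(without_outlier))
```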
Let’s look at another measure of spread which is Variance.
The question we are interested in is “how far are the values from the typical value (the mean) in the dataset?” In other words, how much variation is there in the data: are the values close to the mean, or are they spread far from it?
One solution that might come to mind is to compute the deviation from the mean for every data element and add these up to get an overall deviation around the ‘mean’, which would tell us how different the values are from the mean. The catch is that the ‘mean’ is the center of gravity of the data and, as discussed in this article, the sum of deviations around the mean is 0.
For both datasets in our case, the sum of deviations around the ‘mean’ is 0 (in fact, it is 0 for any dataset).
The sum comes out as 0 because the deviations on one end of the seesaw (in the image below) cancel out the deviations on the other end. We do not want this to happen: both ends tell us about the spread of the data, so we want to count them rather than cancel them.
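A quick check of this cancellation, using a made-up sample (an assumption for illustration, not the article's data). With non-integer data the sum may come out as a tiny floating-point number rather than exactly 0.

```python
from statistics import mean

# Hypothetical sample with mean 83.
sample = [31, 57, 83, 109, 135]

m = mean(sample)
deviations = [x - m for x in sample]

print(deviations)       # negative and positive deviations
print(sum(deviations))  # they cancel out
```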
There are a couple of solutions here. One is to ignore the sign of the negative deviations and take the modulus (the absolute value of the deviation). The other is to square each deviation. When we sum the squared deviations over all the data points and take the average, the quantity we get is called the Variance.
And here are the formulas for the variance of a sample and of a population (the two differ slightly, and the reason lies in probability theory):
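For reference, the standard textbook forms are written out below in LaTeX. The notation is the usual convention and is assumed here rather than taken from the article: N and \mu are the population size and mean, n and \bar{x} are the sample size and mean. The "little change" is the n - 1 in the denominator of the sample variance.

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```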
Let’s compute the variance for the two datasets we are working with. We know that the ‘mean’ value is the same for both, so let’s see what the ‘variance’ tells us in this case:
First, we compute the ‘mean’, then the deviation of each data point from the ‘mean’, and then the square of each deviation.
We can see that the variance is very high for the second sample and that’s exactly what we wanted. We wanted a measure of spread which helps us to distinguish between these two samples even though their measures of centrality are the same.
Note that the variance is not measured in the same unit as the original data: it is in squared units.
For example, the variance of the second dataset above comes out as 1127, which is far outside the range of the data. So the variance does not give us a direct intuition of what is happening in the data. Hence we have another measure of spread, the ‘Standard Deviation’, which is the square root of the ‘Variance’.
Now that we have taken the square root, the units of standard deviation are the same as the units of data.
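A short sketch of the difference in scale, again on a made-up sample rather than the article's data: the variance is in squared units, while the standard deviation is back on the scale of the data.

```python
from statistics import stdev, variance

# Hypothetical sample; its values span 31 to 135.
sample = [31, 57, 83, 109, 135]

print("variance:", variance(sample))        # squared units, far outside the data's range
print("standard deviation:", stdev(sample)) # same units as the data, roughly 41
```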
Here are the standard notations and formulas for the commonly used measures:
Why do we square the deviations?
Let’s discuss at a high level why we choose the square value instead of the absolute value of the deviations.
Here is a plot of both functions (absolute and square). The square function is smooth, whereas the absolute function is ‘V’ shaped and not smooth, because it is not differentiable at x - x_bar = 0.
The square function also magnifies large deviations and suppresses small ones (those smaller than 1). We want large deviations to show up as a large variance, because they tell us that some values are very far from the average value in the data. When a value deviates only a little from the mean, we are not concerned, and the square function suppresses its contribution.
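The magnifying and suppressing effect can be seen on a handful of made-up deviations (hypothetical values, chosen only to include both small and large ones).

```python
# Hypothetical deviations from the mean, in the data's own units.
deviations = [-0.5, 0.5, -1, 2, -10]

for d in deviations:
    print(f"deviation={d:>5}  absolute={abs(d):>5}  squared={d ** 2:>6}")

abs_total = sum(abs(d) for d in deviations)
sq_total = sum(d ** 2 for d in deviations)
largest = max(deviations, key=abs)

# The largest deviation dominates the total far more after squaring.
print("share of largest deviation (absolute):", abs(largest) / abs_total)
print("share of largest deviation (squared): ", largest ** 2 / sq_total)
```

Deviations smaller than 1 shrink when squared, while the deviation of 10 grows to 100 and accounts for most of the squared total.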
What does variance tell us about the data?
Just as the mean is described as the center of gravity, the variance is a measure of consistency. Let’s understand what this means with an example:
Let’s compare the runs scored by two players in the Ashes series:
The mean value suggests that Alastair Cook performed better than Joe Root, but we can observe that Root was more consistent throughout the series.
Let’s compute the variance value for both the data sets
We can see that the variance is fairly large for Cook, while Root has a lower variance (and hence a lower standard deviation), which ties back to our observation that Root was more consistent than Cook.
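The same idea in code, as a sketch. The scores below are invented for two hypothetical players and are not the actual Ashes figures: player A has the higher average, player B the lower variance.

```python
from statistics import mean, stdev, variance

# Hypothetical runs per innings (NOT real match data).
player_a = [5, 120, 10, 95, 0, 130]   # higher average, erratic
player_b = [48, 55, 52, 60, 45, 58]   # lower average, consistent

for name, runs in [("player_a", player_a), ("player_b", player_b)]:
    print(name, "mean:", mean(runs),
          "variance:", variance(runs),
          "std dev:", round(stdev(runs), 1))
```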
Variance plays a very important role in manufacturing industries, where products are expected to be consistent from one unit to the next.
Effect of transformations on measures of spread
Let’s take the same example that we used while discussing the effect of transformations on percentiles.
Say we have a dataset of temperature values in Fahrenheit and we transform (scale and shift) it to Celsius.
Let’s look at the effect on the ‘Range’:
We see that the new range is just a scaled version of the old range. No shifting happens, because the shift cancels out when we subtract the minimum from the maximum.
When the data is transformed, the percentile values also get transformed (scaled and shifted) in the same manner. With this in mind, let’s look at how the IQR changes under the data transformation:
So, the new IQR is also a scaled version of the old IQR.
Let’s look at how the variance gets transformed:
So, the variance gets scaled by the square of the factor by which the data is scaled.
Using this, we can see that the standard deviation gets scaled by the same factor as the data elements.
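These scaling rules can be checked numerically. The sketch below assumes the usual Fahrenheit-to-Celsius conversion, C = (F - 32) * 5/9, so the scale factor is 5/9 and the shift is the part involving 32. The temperatures are made-up values, and the IQR uses the "inclusive" convention of `statistics.quantiles`.

```python
import math
from statistics import quantiles, stdev, variance

def to_celsius(f):
    return (f - 32) * 5 / 9

def value_range(values):
    return max(values) - min(values)

def iqr(values):
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical temperatures in Fahrenheit.
fahrenheit = [50.0, 59.0, 68.0, 77.0, 86.0, 95.0]
celsius = [to_celsius(f) for f in fahrenheit]
scale = 5 / 9

# Each pair should agree: the shift drops out, only the scale factor remains.
print("range:   ", value_range(celsius), scale * value_range(fahrenheit))
print("IQR:     ", iqr(celsius), scale * iqr(fahrenheit))
print("variance:", variance(celsius), scale ** 2 * variance(fahrenheit))
print("std dev: ", stdev(celsius), scale * stdev(fahrenheit))
```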
Here is a quick summary of how the various measures change under transformations of the data: