In the last article, we discussed the measures of centrality. In this article, let’s look at some of their peculiar characteristics and at how sensitive they are to outliers.
The mean is often described as the center of gravity of the data. To understand why, let’s first be clear on what the term ‘deviation’ means.
Deviation: the deviation of a point in the dataset is the difference between that point and the mean value.
Let’s express this in the notation we have been using so far.
We have ‘n’ data points x₁, x₂, … all the way up to xₙ, and the mean value is represented by x_bar.
The deviation of the iᵗʰ data point is the difference between that point and the mean, i.e. xᵢ − x_bar.
If we calculate the ‘deviation’ for each data point in the dataset and add them all up, the result is 0. In other words, the sum of the ‘deviations’ of all points from the mean is 0.
We can compute the sum of deviations as follows:
If we re-arrange the terms, we can write it as:
Now, ‘n’ times ‘x_bar’ is simply the sum of the ‘n’ data points in the dataset, as per the formula for the mean value, so subtracting it from the sum of the data points leaves 0.
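Written out in full, the steps described above look like this. The notation is the same as in the text, with \bar{x} standing for x_bar:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})
  &= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} \\
  &= \sum_{i=1}^{n} x_i - n\,\bar{x} \\
  &= \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i
     \qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
  &= 0
\end{align*}
```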
Let’s look at the physical interpretation of this result
Let’s understand this better with the help of the image below:
The white line at the bottom is the number line, and the squares above it are the data. There is one square above 1, which means one data point has the value 1. Similarly, one data point has the value 2, four data points have the value 3, two data points have the value 4, and so on, up to three data points with the value 10.
Think of these squares as weights placed on the number line, and think of the number line as a see-saw. If the fulcrum of the see-saw is at 4, the weights on the right-hand side outweigh those on the left, so the see-saw tilts towards the right.
If we place the fulcrum at 8, there are many more weights on the left side, so the see-saw tilts towards the left.
However, if we place the fulcrum exactly at the mean, which happens to be 6 for this data, the weights on the left-hand side balance the weights on the right-hand side and the see-saw stays level.
That’s what ‘center of gravity’ means: the deviations on the left side cancel out the deviations on the right side, so the see-saw balances at that particular point.
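The see-saw picture can be checked numerically with a small sketch. The data below is hypothetical (it is not the data from the image) and is chosen only so that its mean is 6; the helper name `net_deviation` is also just an illustrative choice. Each point counts as one unit of weight, so the sign of the summed distances from the fulcrum tells us which way the see-saw would tilt.

```python
# Hypothetical weights on the number line (not the data from the image).
weights = [1, 2, 3, 3, 9, 10, 10, 10]


def net_deviation(points, fulcrum):
    """Sum of the signed distances of every point from the fulcrum."""
    return sum(x - fulcrum for x in points)


mean_value = sum(weights) / len(weights)  # 48 / 8 = 6.0

for fulcrum in (4, mean_value, 8):
    total = net_deviation(weights, fulcrum)
    if total > 0:
        verdict = "tilts right"
    elif total < 0:
        verdict = "tilts left"
    else:
        verdict = "balanced"
    print(f"fulcrum at {fulcrum}: net deviation = {total} -> {verdict}")
```

For this data the net deviation is positive with the fulcrum at 4, negative at 8, and exactly zero at the mean, which matches the three see-saw positions described above.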
Sensitivity of the Measures of Centrality to Outliers
Here we have the histogram for the runs scored by Sachin in ODIs, and we can see that the histogram is right-skewed: most of the scores are clustered on the left, with a long tail stretching out to the right.
We define an outlier as any value which is far off from the other values in the data.
In this case, a score of 200 is an outlier as it is very far away from the typical values in the dataset which are clustered towards the left-hand side of the plot.
To explain the sensitivity of the ‘mean’ to outliers, let’s take the scores of Alastair Cook and Joe Root from the Ashes 2017–18 series.
Suppose we have to decide, based on these scores, which player performed better in the series. Alastair Cook got a double century whereas Joe Root did not get even a century, so at first glance it looks like Alastair Cook was the better performer.
If we compute the ‘mean’ and judge performance on that basis, it again looks like Alastair Cook did slightly better than Joe Root.
If we compute the ‘median’ of the runs scored, we get 14 for Alastair Cook. That means in about half of his innings in the series he scored 14 or fewer, and in the other half he scored 14 or more. So, half of the time he did not perform that well (a score of 14 is not considered a good score in a Test match). The median for Joe Root is 51, which means that in about half of the innings he played, he scored 51 or more.
So, the ‘median’ suggests that Joe Root performed well and Alastair Cook performed poorly, whereas the ‘mean’ tells the story the other way around.
The reason is that the score of 244, which is very far from all the other scores Alastair Cook made in the series, pulls the ‘mean’ up.
So, our overall verdict would be that Root performed better than Cook: he was more consistent across the series, whereas Cook had one very high score and little else that was good.
Now let’s drop the outlier value and compute the ‘mean’ and ‘median’ again for Alastair Cook:
We see that the new ‘median’ is close to the old ‘median’, whereas the ‘mean’ has changed a lot. This shows that the ‘median’ is not very sensitive to outliers, but the ‘mean’ is highly sensitive to them.
Computing the ‘mean’ with the outliers included can therefore convey a misleading picture from a performance point of view.
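The same effect can be reproduced with a few lines of Python. This is only a sketch: the scores below are hypothetical, not the real series data, and simply contain one value that is far from the rest. `mean` and `median` come from Python’s standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical innings scores with one very large value (not the real series data).
scores = [3, 7, 10, 14, 18, 25, 244]

# Drop the single largest score, treating it as the outlier.
without_outlier = sorted(scores)[:-1]

print(f"with outlier:    mean = {mean(scores):.2f}, median = {median(scores)}")
print(f"without outlier: mean = {mean(without_outlier):.2f}, median = {median(without_outlier)}")
```

With these made-up numbers the mean falls from about 45.9 to about 12.8 once the outlier is removed, while the median only moves from 14 to 12.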
Let’s use the same data to understand the concept of the ‘trimmed mean’:
If we drop the extreme values (we need to drop the same number of elements from both ends) and compute the ‘mean’ of the remaining elements, we get the following:
The ‘trimmed mean’ is close to the original ‘mean’ for Joe Root, which suggests there were no outliers in his scores. For Alastair Cook, the ‘trimmed mean’ is very far from the original ‘mean’, which suggests there were some outliers.
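A trimmed mean is straightforward to sketch by hand. The function name `trimmed_mean` and its parameter `k` (how many values to drop from each end) are illustrative choices, and both score lists are hypothetical rather than the real series data.

```python
from statistics import mean


def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and the k largest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


# Hypothetical scores (not the real series data).
with_outlier = [3, 7, 10, 14, 18, 25, 244]
no_outlier = [9, 14, 20, 45, 51, 60, 74]

for label, data in (("with outlier", with_outlier), ("no outlier", no_outlier)):
    print(f"{label}: mean = {mean(data):.2f}, trimmed mean = {trimmed_mean(data):.2f}")
```

For the list with the outlier, the trimmed mean (14.8) is far below the plain mean (about 45.9). For the list without one, the two values are nearly the same (38 versus 39). A large gap between the mean and the trimmed mean is therefore a hint that extreme values are present.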
Here is another example, which shows the salaries of graduate students:
Two students got a very high package, whereas the rest of the students got somewhere around 14–15 lakhs. If we compute the average salary over all the data points, we get a very high value, so here too the ‘median’ makes much more sense. If we compute the ‘trimmed mean’ by dropping the two extreme values, we get a value close to the median.
Let’s talk about the ‘mode’ as well:
Clearly, the ‘mode’ is generally not sensitive to outliers, as the ‘mode’ is the most frequent value in the dataset. The only scenario in which it is sensitive is when the most frequent value also happens to be an outlier; in that case, dropping the outliers changes the ‘mode’.
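Both cases can be illustrated with a short sketch. The data is hypothetical, and `multimode` from the standard `statistics` module (available from Python 3.8) is used because it returns every most-frequent value as a list.

```python
from statistics import multimode

# Hypothetical data: 15 is the most frequent value, 90 is a lone outlier.
typical = [14, 15, 15, 15, 16, 90]
typical_trimmed = [x for x in typical if x != 90]

# Hypothetical edge case: the far-off value 90 is also the most frequent one.
edge_case = [14, 15, 15, 16, 90, 90, 90]
edge_case_trimmed = [x for x in edge_case if x != 90]

print("typical:  ", multimode(typical), "->", multimode(typical_trimmed))
print("edge case:", multimode(edge_case), "->", multimode(edge_case_trimmed))
```

In the first dataset the mode stays at 15 whether or not the outlier is kept. In the second, the mode is 90 until the outlier is dropped, after which it becomes 15.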