Measures of Centrality for different types of distributions
In the previous article, we discussed the measures of centrality and their characteristics. In this article, we discuss the relationship between different measures of centrality for different types of distributions. Let’s get started.
When discussing histograms, we talked about different types of distributions: left-skewed, right-skewed, symmetric, and so on. We are interested in the relationship between the mean, median, and mode for each of these types.
Let’s start with the perfectly symmetric distribution. This is what a perfectly symmetric distribution looks like:
Informally, we can also define it as something of the following form:
In the above case, the central location x is 11, which is the tallest bar. For every value (x-i) in the dataset there is a matching value (x+i). Taking i as 1, (x-i) is 10 and (x+i) is 12, so the bars for 10 and 12 have exactly the same height, because the dataset contains as many 10’s as 12’s.
Similarly, the number of 9’s is exactly the same as the number of 13’s in the dataset.
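The informal definition above can be checked mechanically. The following is a minimal sketch, assuming a small made-up dataset centered at 11 (only the center comes from the text; the counts and the helper name `is_symmetric_about` are illustrative):

```python
from collections import Counter


def is_symmetric_about(data, center):
    """Return True if every value (center - i) occurs as often as (center + i)."""
    counts = Counter(data)
    # The mirror image of `value` around `center` is 2 * center - value.
    # Counter returns 0 for values that never occur.
    return all(counts[value] == counts[2 * center - value] for value in counts)


# Hypothetical data: one 9, two 10's, three 11's, two 12's, one 13.
data = [9, 10, 10, 11, 11, 11, 12, 12, 13]

print(is_symmetric_about(data, 11))         # symmetric around 11
print(is_symmetric_about(data + [13], 11))  # an extra 13 breaks the symmetry
```

The first call reports a symmetric dataset; adding one extra 13 makes the count of 13’s differ from the count of 9’s, so the second call does not.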
In such symmetric distributions with a single mode, the mean is equal to the median, which in turn is equal to the mode.
The ‘mode’ is the most frequent value, so it corresponds to the tallest bar, which is the central bar in uni-modal symmetric distributions (symmetric distributions with a single mode).
The ‘median’ also corresponds to the tallest bar: by the definition of a symmetric distribution, there is an equal number of elements on either side of the central value. The tallest bar is therefore at the center of the dataset, which is exactly the ‘median’ as per its definition.
The ‘mean’ also coincides with the tallest bar: the deviations from the center are positive for values on one side and negative for values on the other, and because the distribution is symmetric they cancel each other out exactly. Let’s understand this with the help of a toy dataset.
Here the central value is 4, there are three elements with a value of 3 (4-1) and three elements with a value of 5 (4+1), so this is a symmetric dataset. When we compute the mean, the deviations on the right-hand side of the mean cancel out the deviations on the left-hand side.
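To make the cancellation concrete, here is a small sketch using Python’s `statistics` module. The dataset is an assumption: the text only fixes the three 3’s and three 5’s around a center of 4, so the number of 4’s is chosen for illustration.

```python
import statistics

# Hypothetical toy dataset: three 3's, four 4's (assumed), three 5's.
toy = [3, 3, 3, 4, 4, 4, 4, 5, 5, 5]

mean = statistics.mean(toy)
median = statistics.median(toy)
modes = statistics.multimode(toy)

# Total deviation from the mean on each side of it.
left_deviation = sum(x - mean for x in toy if x < mean)
right_deviation = sum(x - mean for x in toy if x > mean)

print(mean, median, modes)
print(left_deviation, right_deviation, left_deviation + right_deviation)
```

The left-side and right-side deviations have the same size and opposite signs, so they add up to zero, and the mean, median, and mode all land on 4.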
Let’s take another example with more data points.
We also know that the ‘mean’ is the center of gravity: in the histogram below, the seesaw is balanced only if we place the fulcrum at the central location, which is also the ‘median’. Hence, for uni-modal symmetric distributions, the mean is equal to the median, which is equal to the mode.
In this next case, the values 3 and 7 are the ‘modes’. Since the distribution is symmetric, the mean is again the same as the median: the central location is 5, there is an identical pattern of values on either side, and the seesaw is balanced only if we place the fulcrum there. The ‘mode’, however, is different, because there are two ‘modes’, one on either side of the central location.
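A quick sketch of such a bi-modal symmetric case, assuming a made-up dataset whose modes are 3 and 7 and whose center is 5 (the individual counts are illustrative, not taken from the original figure):

```python
import statistics

# Hypothetical symmetric dataset around 5 with peaks at 3 and 7.
bimodal = [1, 2, 3, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9]

mean = statistics.mean(bimodal)
median = statistics.median(bimodal)
modes = statistics.multimode(bimodal)  # returns every most-frequent value

print(mean, median, modes)
```

The mean and median agree at the center, while `multimode` reports the two peaks on either side of it.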
Other multi-modal distributions
In the first figure at the top, we have a bi-modal distribution in which the two ‘modes’ sit at the center of the data and the number of data points is even. In this case, the ‘mean’ is equal to the ‘median’, which is the average of 4 and 5, i.e. 4.5. We can visualize this: if we place the fulcrum at 4.5 the seesaw is balanced, but if we place it at 4 or at 5 it is not.
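The seesaw picture can be sketched as a sum of signed distances from the fulcrum: the seesaw balances when that sum is zero. The dataset and the helper name `net_torque` below are assumptions chosen to match the description (modes 4 and 5, an even number of points).

```python
import statistics


def net_torque(data, fulcrum):
    """Sum of signed distances from the fulcrum; zero means the seesaw balances."""
    return sum(x - fulcrum for x in data)


# Hypothetical dataset: modes 4 and 5 at the center, eight points in total.
data = [3, 4, 4, 4, 5, 5, 5, 6]

print(statistics.mean(data), statistics.median(data), statistics.multimode(data))
for fulcrum in (4, 4.5, 5):
    print(fulcrum, net_torque(data, fulcrum))
```

Only the fulcrum at 4.5 gives a net torque of zero; at 4 the right side is heavier (positive sum) and at 5 the left side is heavier (negative sum).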
In the second figure, we have four modes, and here again neither the ‘mean’ nor the ‘median’ is equal to a ‘mode’. Since it is a symmetric multi-modal distribution, the ‘mean’ is still equal to the ‘median’.
So, that’s about symmetric distributions. The key takeaway is that for perfectly symmetric distributions the ‘mean’ is equal to the ‘median’, and for uni-modal symmetric distributions the ‘mean’ is equal to the ‘median’, which is equal to the ‘mode’.
Left-skewed distributions have a long tail towards the left. They are also called negatively skewed because the tail stretches towards the negative side of the number line.
Right-skewed distributions have a long tail towards the right. They are also called positively skewed because the tail stretches towards the positive side of the number line.
We are interested in the relationship between the mean, median, and mode for skewed distributions.
Let’s start with the left-skewed distribution. Here is what the distribution looks like:
On the horizontal axis, we have the units (in this case, the quantity of cereal produced on a farm), and the y-axis shows the number of farms that produce that quantity. Most farms produce large quantities of cereal, and only a few produce small quantities, so the tail is towards the left.
We know that the ‘mean’ is the center of gravity: the seesaw is balanced if we place the fulcrum at the mean. For a left-skewed distribution the mean therefore sits towards the right, near the tallest bars, although the tail pulls it somewhat to the left of them.
The ‘mode’ corresponds to the tallest bar, which for left-skewed distributions is towards the right-hand side, and it turns out that the ‘median’ lies between the mean and the mode. The median is the point that divides the data into two equal halves, and since there are very few elements in the tail, the median is also towards the right side.
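As a rough illustration of this ordering, here is a sketch with a made-up left-skewed dataset in which the bar heights grow from left to right (the values are hypothetical, not the farm data from the figure):

```python
import statistics

# Hypothetical left-skewed data: one 1, two 2's, three 3's, four 4's, five 5's.
left_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

mean = statistics.mean(left_skewed)
median = statistics.median(left_skewed)
mode = statistics.multimode(left_skewed)[0]  # single mode in this dataset

print(round(mean, 3), median, mode)
print(mean < median < mode)
```

For this dataset the mean falls below the median, and the median falls below the mode, which matches the ordering described above.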
Right skewed distributions
So, this is what the histogram looks like
There are many farms that produce a smaller quantity of wheat and there are few farms which are producing high quantities of wheat, so the tail of the data is towards the right.
Since the ‘mean’ is the center of gravity, it should be towards the left, where we have the tall bars.
In this case, we have few elements towards the right and many towards the left. Since the ‘median’ is the point that divides the data into two halves, it should also be somewhere in the left portion of the histogram.
And the ‘mode’ is, of course, the leftmost point, which corresponds to the tallest bar.
This general trend (mean less than median less than mode for left-skewed distributions, and the reverse order for right-skewed ones) is almost always true, especially for distributions that are perfectly left-skewed or right-skewed (the tallest bar is towards the right in left-skewed distributions and towards the left in right-skewed ones).
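The mirror-image case can be sketched the same way, assuming a made-up right-skewed dataset whose bar heights shrink from left to right (again hypothetical values, not the wheat data from the figure):

```python
import statistics

# Hypothetical right-skewed data: five 1's, four 2's, three 3's, two 4's, one 5.
right_skewed = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]

mean = statistics.mean(right_skewed)
median = statistics.median(right_skewed)
mode = statistics.multimode(right_skewed)[0]  # single mode in this dataset

print(mode, median, round(mean, 3))
print(mode < median < mean)
```

Here the order is reversed relative to the left-skewed case: the mode is smallest, the median is in the middle, and the mean is pulled furthest towards the tail.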
There are cases where this relationship might not hold. Here is one example of a left-skewed distribution (the definition only says where the tail is; it does not say where the tallest bar, i.e. the mode, of the distribution will be):
In this case, the ‘mode’ is not the rightmost element. The distribution has a long tail towards the left, which makes it left-skewed, but it also has a heavy tail on the right side (the long tail is what defines the skew of the distribution).
This kind of left-skewed distribution with a heavy right tail is where the general trend gets violated.
Again, the violation does not always happen. Here we have another left-skewed distribution (long left tail) that also has a heavy right tail; it classifies as left-skewed as per the definition, and here the general trend holds:
So, if we have a perfectly left-skewed distribution, where the heights of the bars increase as we go from left to right, the general trend holds: the mean is less than the median.
There are some peculiar left-skewed distributions that have a heavy tail as well. In that scenario, we don’t know the exact relationship between the ‘mean’ and the ‘median’: there could be distributions where the ‘mean’ is less than the ‘median’, and there could be cases where the ‘mean’ is greater than the ‘median’.
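One way to see such a violation numerically is sketched below. Two things here are assumptions: the dataset is invented, and "left-skewed" is measured with the moment-based skewness coefficient (negative means left-skewed), which stands in for the visual "long left tail" definition used in the text.

```python
import statistics


def skewness(data):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


# Hypothetical data: a single far-left value (long tail), the mode at 4,
# and a heavy block of values just to the right of the mode.
peculiar = [1] + [4] * 10 + [5] * 9

mean = statistics.mean(peculiar)
median = statistics.median(peculiar)
mode = statistics.multimode(peculiar)[0]

print(round(skewness(peculiar), 3))  # negative, so left-skewed by this measure
print(mean, median, mode)
print(mean > median)                 # the usual mean < median ordering fails
```

The skewness coefficient is negative because of the far-left value, yet the heavy block to the right of the mode lifts the mean above the median.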
This is again a left-skewed distribution, as the tail is towards the left, but since it is a bi-modal distribution, the general trend fails here.
We have similar cases for right-skewed distributions where the general trend gets violated.
Even if the tail is very long (for a uni-modal distribution), we could say that the general trend will still be valid. Here is a left-skewed distribution in which the mean is less than the median:
If the tail becomes longer, we might think that the relationship between mean and median could change: as we add more elements towards the left, the median shifts towards the left, so it might cross the mean. This won’t happen, because adding more elements towards the left also reduces the mean, which shifts further towards the left as well.
So, if we have a perfectly left-skewed distribution, where the tallest bar is the rightmost bar, there are no heavy tails, and the distribution is uni-modal, then the general trend is not violated if we add more elements to the tail.
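The tail-lengthening argument can be tried out on a small example. This is only a check on one hypothetical dataset, not a proof; the base data and the helper name `extend_tail` are assumptions made for illustration.

```python
import statistics

# Hypothetical perfectly left-skewed base data (bar heights grow left to right).
base = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]


def extend_tail(data, extra):
    """Return a copy of data with `extra` new values, each one step further left."""
    lowest = min(data)
    return data + [lowest - step for step in range(1, extra + 1)]


results = []
for extra in range(7):
    sample = extend_tail(base, extra)
    results.append((extra, statistics.mean(sample), statistics.median(sample)))

for extra, mean, median in results:
    print(extra, round(mean, 3), median, mean < median)
```

In every step both the mean and the median move left (or stay put), and for this dataset the mean stays below the median throughout.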