In the previous article, we discussed the different types of data and different types of plots. In this article, we will discuss the next topic in descriptive statistics which is the measures of spread and centrality. We will start with the question “Why do we need measures of spread and centrality?”, then we will discuss the different measures of centrality.
So, let’s start with the first question in the objective which is “Why do we need measures of centrality and spread?”.
In real-world applications, we have access to large amounts of data. Even if it’s structured data in the form of a table, as in the image below, we could have a very large number of records (in the order of millions or more). This could be data about patients, where the government has collected the records of millions of patients across different hospitals; or an insurance company may have data about its customers, the government may have data about citizens, or a company may have data about its employees.
We have established that it’s hard to make sense of large amounts of data in tabular format, and to overcome that limitation we draw plots, which help us understand the data better. But in some situations, we may want to go beyond plots and have an even more succinct summary of the data: we might want to describe the data by a single number, or by just 3 to 4 numbers, which give us an idea of what the data looks like.
Now, tying this back to things that we have already seen, we have defined the terms ‘population’ and ‘sample’.
‘Population’ is the entire data, or the entire collection of objects under study. Often we don’t have access to the entire ‘population’, and hence we work with a sub-group of the population, which is called a ‘sample’. Any numeric property of the population that we are interested in is called a ‘parameter’. Since we usually cannot compute it from the entire ‘population’, we compute the same property from a ‘sample’, in which case it is called a ‘statistic’, and we use this ‘statistic’ as an estimate of the ‘parameter’ of the ‘population’.
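To make the parameter/statistic distinction concrete, here is a minimal sketch. The population is made up (random integers, not real data), and the names `population`, `sample`, `mu` and `x_bar` are chosen only for illustration. It computes the mean once over the whole population (the parameter) and once over a random sample (the statistic).

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 made-up values (not real data).
population = [random.randint(0, 200) for _ in range(100_000)]

# Parameter: a numeric property computed over the whole population.
mu = statistics.mean(population)

# Statistic: the same property computed from a sample of the population.
sample = random.sample(population, 500)
x_bar = statistics.mean(sample)

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
```

The two printed numbers should be close but not identical, which is the sense in which the statistic is only an estimate of the parameter.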
So, we will discuss some standard statistics, which are known as summary statistics. That’s the objective of this article.
These summary statistics are used for quantitative data, that is, columns in the data which contain numbers (these could be continuous or discrete, it doesn’t matter). If we have quantitative data, there are several things we could do with it: we could compute the measures of centrality (mean, median and mode), then we have percentiles (where we could compute different things like quartiles, quintiles, deciles or any other percentiles), and then we have measures of spread like the range, the inter-quartile range and the standard deviation.
So, the goal of the summary statistics is to summarize the overall data to a single number or to a bunch of numbers.
Measures of Centrality
By a measure of centrality, we mean that we are interested in the answer to the question: “What is the typical value of an attribute in our dataset?”
To answer this question, let’s take a look at the data of the ODI matches played by Sachin Tendulkar.
Consider the attribute ‘Runs scored’, which is a quantitative attribute. We are typically interested in “how many runs would Sachin typically score in a match?”. We know that the range is very wide: he has very low scores such as 0 and a high score of 200, and these are the extremes. What we are interested in is how much we can bank on him, that is, what typical score he would make, and as you would have guessed, it’s like asking about the ‘Average Score’ that he will make.
Similarly, we could be interested in the number of balls he would typically face in a given match. So, these are the kinds of questions we are interested in.
So, we are looking at the ‘Runs scored’ column and we are interested in the mean/average value of this column.
Before we proceed further, let’s fix some notation here.
When we are looking at data points, we will typically say that there are ‘n’ data points and, as discussed earlier, we will typically be dealing with a ‘sample’ and not the entire ‘population’. So, these are the ‘n’ data points in our sample and we call them x₁, x₂, … all the way up to xₙ.
So, in this case, we have this data, we have the runs scored for all the 452 ODIs he played.
So, x₁ which corresponds to the first row in the table represents the runs scored in the first ODI, then x₂ represents the runs scored in 2nd ODI, and so on all the way up to x₄₅₂ which represents the runs scored in the 452ⁿᵈ ODI.
So, this is how we will refer to the data. If we compute the mean from a ‘sample’, we denote it by x̄ (read as ‘x bar’), and if we compute the mean from the ‘population’, we denote it by the symbol μ (mu).
So, in this case, although we have the data for all his ODIs, there could be other matches for which we don’t have the data; for example, he could have played in ‘World XI Vs Rest of XI’ matches or some matches in county cricket. So we are still going to consider these 452 data points as a ‘sample’, as we don’t have data for the other matches that he played.
Now, we can compute the ‘mean’ value for this sample using the following formula:
‘mean’ : we compute the sum of all the elements in the data and divide it by the total number of elements in the data, i.e. x̄ = (x₁ + x₂ + … + xₙ) / n
So, using this formula with n = 452, we compute the ‘mean’ in this case as x̄ = (x₁ + x₂ + … + x₄₅₂) / 452:
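As a small illustration of the formula, the sketch below computes the mean of a short list of runs. The values in `runs` are made up for the example and are not Tendulkar's actual scores; the helper name `mean` is also just an illustrative choice.

```python
# Hypothetical runs scored in 8 innings (made-up values, not real data).
runs = [0, 12, 45, 7, 100, 33, 2, 61]


def mean(values):
    """Sum of all elements divided by the number of elements."""
    if not values:
        raise ValueError("the mean of an empty dataset is undefined")
    return sum(values) / len(values)


x_bar = mean(runs)  # (0 + 12 + 45 + 7 + 100 + 33 + 2 + 61) / 8
print(x_bar)        # 32.5
```

Python's standard library offers the same computation as `statistics.mean`.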
This number is different from the batting average reported on cricket websites, because the batting average leaves the innings in which he was not dismissed out of the denominator: the runs scored in those innings are counted in the numerator, but the innings themselves are not counted in the denominator. That is fine for the batting average, which is a statistic specific to cricket, but if we are computing the ‘mean’ value from a sample, we need to take all the data points into account, both in the numerator and in the denominator.
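The difference between the two numbers can be sketched as follows. The innings below are hypothetical (made-up runs and dismissal flags, not real records); each entry is a pair of runs scored and whether the batsman was dismissed.

```python
# Hypothetical innings as (runs, was_dismissed) pairs; made-up values.
innings = [
    (0, True), (12, True), (45, False), (7, True),
    (100, False), (33, True), (2, True), (61, True),
]

total_runs = sum(runs for runs, _ in innings)
dismissals = sum(1 for _, was_dismissed in innings if was_dismissed)

# Sample mean: every innings counts in the numerator and the denominator.
sample_mean = total_runs / len(innings)

# Batting average: all runs count in the numerator, but only the innings
# that ended in a dismissal count in the denominator.
# (It is undefined if the batsman was never dismissed.)
batting_average = total_runs / dismissals

print(f"sample mean:     {sample_mean:.2f}")      # 32.50
print(f"batting average: {batting_average:.2f}")  # 43.33
```

Because the denominator is smaller, the batting average is never lower than the sample mean of the same innings.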
Let’s look at the next measure of centrality which is ‘median’ and to understand what the ‘median’ value represents, let’s take the runs scored by Shikhar Dhawan in T-20 international matches:
‘median’ : is the value that appears at the center of the data when the data is sorted.
So, to compute the ‘median’ value, the first thing we do is sort the data, and then we look at the value at the central location.
The center location has an equal number of elements on either side, and hence it is the mid-point of the sorted data:
So, the ‘median’ in this case would be 23 i.e the value at the center location.
Let’s look at how we compute the ‘median’ when we have an even number of data points and cannot simply pick a single center location.
So, to understand this, let’s look at Shikhar’s scores in 50 T20 matches:
Now the data actually has two midpoints:
If we look at the values 15 and 16 together and consider them as one element, we have an equal number of elements on either side of it. To compute the median, we just take the average of these two values: (15 + 16) / 2 = 15.5.
So, in summary, when ‘n’ is odd then the median value is simply the value at the ‘(n+1)/2’ location in the sorted data.
When ‘n’ is even, the median value is the average of the values at the location ‘n/2’ and ‘(n/2) + 1’
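The two rules above can be sketched in a few lines. The score lists are made up for illustration (they are not Dhawan's actual scores), and the function name `median` is just a convenient choice. Note that the locations in the text are 1-based, while Python lists are 0-based.

```python
def median(values):
    """Median of a dataset: the central value of the sorted data."""
    if not values:
        raise ValueError("the median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based location (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based locations n / 2 and (n / 2) + 1
    # are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical scores (made-up values).
odd_scores = [23, 5, 41, 1, 30]   # sorted: 1, 5, 23, 30, 41
even_scores = [16, 2, 40, 15]     # sorted: 2, 15, 16, 40

print(median(odd_scores))   # 23
print(median(even_scores))  # 15.5
```

The standard library function `statistics.median` follows the same two rules.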
Let’s discuss the third measure of centrality which is ‘mode’:
Once again we look at the 59 sorted scores for Shikhar Dhawan and the ‘mode’ represents the most frequently occurring value in the data.
mode : the most frequently occurring value in the dataset
So, for this data set containing 59 data points, the value 1 appears more often than any other value and hence it is the ‘mode’. This is termed a unimodal distribution, where just a single value is the mode.
There could be some special cases where we have multi-modal or bi-modal distributions:
In this dataset, we see that the value 5 and the value 15 both appear 5 times, and this is more than the number of occurrences of any other value in the data set. So, there is a tie here: two values occur the maximum number of times, so there are two modes in this case, 5 and 15. This is known as a bi-modal distribution.
We could also have a multi-modal distribution where we have more than 2 values as the mode.
There could also be a situation where we do not have any mode in the data. This would happen if every value appears exactly once in the data, so that no element stands out with a maximum frequency.
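The three cases discussed above (a single mode, tied modes, and no mode) can be sketched with a frequency count. The datasets are made up for illustration, and the helper `modes` is a hypothetical name; it follows this article's convention that a dataset in which every value appears once has no mode.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when every value appears exactly once,
    following the convention that such data has no mode.
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)


# Hypothetical datasets (made-up values).
unimodal = [1, 1, 1, 5, 9, 23]
bimodal = [5, 5, 5, 15, 15, 15, 7, 30]
no_mode = [3, 8, 13, 21]

print(modes(unimodal))  # [1]
print(modes(bimodal))   # [5, 15]
print(modes(no_mode))   # []
```

The standard library's `statistics.multimode` handles ties in the same way, but in the last case it returns all the values rather than an empty list, so the 'no mode' convention has to be applied separately if it is wanted.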
In summary, we have looked at three measures of centrality: mean, median, mode.
In this article, we discussed why we need measures of centrality and what the different measures of centrality are. In the next article, we will discuss the characteristics of these measures of centrality.