Frequently used Percentiles and the effect of transformations on percentiles
In the previous article, we discussed what the term ‘percentile’ means and how to compute the pᵗʰ percentile value. In this article, let’s discuss the frequently used percentile values, how to compute the percentile rank of a value in a dataset, and the effect of transformations on percentiles.
Let’s start with quartiles. Say we have a dataset containing 20 elements; we sort the data and divide it into four equal parts.
So, the first 25% of the data is in the first part, the next 25% is in the next part, and so on.
The location that separates the first part from the second part is called the 25th percentile, or ‘Q1’: 25% of the data elements are less than the value at this location.
Similarly, the 50th percentile is termed ‘Q2’: 50% of the data items are less than this value.
Likewise, the 75th percentile is termed ‘Q3’.
So, quartiles divide the data into four equal parts. Let’s take a look at the computation of percentiles.
Let’s look at Shikhar Dhawan’s T20I scores in 50 matches, sorted in ascending order.
Let’s compute the locations of the different percentile values using the approach we discussed in the previous article:
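As a rough sketch of this computation: the code below assumes the location rule from the previous article is L = p × (n + 1) / 100, with the integer part of L picking a sorted position and the fractional part interpolating toward the next value. That rule, the function name `percentile`, and the scores are assumptions made for illustration; the scores are made up and are not Shikhar Dhawan’s actual data.

```python
def percentile(data, p):
    """Return the p-th percentile, assuming the location rule L = p * (n + 1) / 100."""
    values = sorted(data)
    n = len(values)
    location = p * (n + 1) / 100   # 1-based position in the sorted data
    i = int(location)              # integer part of the location
    f = location - i               # fractional part of the location
    if i < 1:                      # location falls before the first value
        return values[0]
    if i >= n:                     # location falls on or beyond the last value
        return values[-1]
    # Start at the i-th sorted value and move a fraction f toward the next one.
    return values[i - 1] + f * (values[i] - values[i - 1])


# Made-up scores for illustration only.
scores = [2, 5, 8, 11, 13, 16, 23, 30, 41, 52, 60]

q1, q2, q3 = (percentile(scores, p) for p in (25, 50, 75))
print(q1, q2, q3)  # 8.0 16.0 41.0
```

With 11 values, the three locations come out as 3, 6 and 9, so Q1, Q2 and Q3 are simply the 3rd, 6th and 9th sorted values.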
Now the interesting thing is that the median is the same as ‘Q2’. Let’s look at the formulas for the median and Q2.
By definition, both the median and Q2 represent the value such that 50% of the values in the dataset are less than it. Although the formulas for the two look different at first glance, they are actually the same. Let’s discuss it for both cases (when ’n’ is odd and when ’n’ is even):
We can see that the formula for Q2 is the same as the formula for the median when ’n’ is odd.
Let’s consider the case when ’n’ is even.
After simplifying the arithmetic, we see that the final formula comes out the same as the formula for the median.
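To see why, assume the location rule L = p × (n + 1) / 100, so that Q2 sits at location (n + 1) / 2. When ’n’ is odd, this is a whole number, which is exactly the middle position. When ’n’ is even, the location is n/2 + 0.5, so Q2 lies halfway between the two middle values, which is their average. Here is a small check of both cases; the helper `q2` and both datasets are made up for illustration, and the reference value comes from Python’s `statistics.median`.

```python
import statistics


def q2(data):
    """50th percentile under the assumed location rule L = 50 * (n + 1) / 100."""
    values = sorted(data)
    n = len(values)
    location = 50 * (n + 1) / 100
    i = int(location)              # integer part of the location
    f = location - i               # fractional part: 0 for odd n, 0.5 for even n
    if i >= n:                     # happens only for a single-element dataset
        return values[-1]
    return values[i - 1] + f * (values[i] - values[i - 1])


odd_data = [7, 1, 4, 9, 3]        # n = 5
even_data = [7, 1, 4, 9, 3, 10]   # n = 6

print(q2(odd_data), statistics.median(odd_data))    # 4.0 4
print(q2(even_data), statistics.median(even_data))  # 5.5 5.5
```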
Quintiles follow the same concept as quartiles; the only difference is that quintiles divide the data into 5 equal parts.
Here is an example of how to compute P3, the third quintile (the 60th percentile), in this case:
Deciles divide the data into 10 equal parts.
Again, the computation is simple. Say we want to compute the value of D3 (the 30th percentile); we can do it like this:
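Quartiles, quintiles and deciles differ only in how many equal parts we ask for, so one helper can cover all three. The sketch below again assumes the location rule L = p × (n + 1) / 100, which for the jᵗʰ of k cut points becomes j × (n + 1) / k. The function name `cut_points` and the dataset are made up for illustration.

```python
def cut_points(data, k):
    """Values that split the sorted data into k equal parts (k - 1 cut points)."""
    values = sorted(data)
    n = len(values)
    points = []
    for j in range(1, k):
        # Same as p * (n + 1) / 100 with p = 100 * j / k.
        location = j * (n + 1) / k
        i = int(location)          # integer part of the location
        f = location - i           # fractional part of the location
        if i < 1:
            points.append(values[0])
        elif i >= n:
            points.append(values[-1])
        else:
            points.append(values[i - 1] + f * (values[i] - values[i - 1]))
    return points


# Made-up data for illustration only.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

quintiles = cut_points(data, 5)
deciles = cut_points(data, 10)
print(quintiles[2])  # third quintile (60th percentile): 60.0
print(deciles[2])    # D3 (30th percentile): 30.0
```

Calling `cut_points(data, 4)` gives the three quartiles in the same way.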
Percentile rank of a value in the dataset
Going back to the university test example, say we have the scores of 25 students, and one of the students scored 44. We want to know how this student performed compared to the other students; in other words, we want the percentile rank of the student who scored 44.
The percentile rank of a data point is the percentage of values that are less than or equal to it.
In this case, we are interested in knowing what percentage of values are less than or equal to 44 and that would be the percentile rank of score 44.
Here is the formula for computing the percentile rank:
percentile rank = ((cₛ + 0.5 × fₛ) / n) × 100
’n’ represents the total number of data points.
Here is the dataset in sorted order, color-coded according to the terms of the formula:
So, cₛ represents the number of values less than the value we are interested in (44 in this case), and fₛ represents the number of data points equal to that value. The intuition behind multiplying fₛ by 0.5 is this: say there are 3 data points with the value 44. It is not clear whether we are computing the percentile rank of the first 44, the second, or the third, so we compute the percentile rank of the middle of these tied data points, counting half of them (half of 3 in this case) as lying below the score.
The percentile rank of the score 44 is 28, meaning roughly 28% of the data points are less than or equal to 44. We can say that this student did not perform that well, as roughly 72% of the students scored better.
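The percentile rank computation described above (count the values below the score, add half of the values tied with it, and divide by ’n’) can be sketched as follows. The function name `percentile_rank` and the ten scores are made up for illustration; they are not the 25 scores from the example, so the result differs from 28.

```python
def percentile_rank(data, score):
    """Percentile rank of `score`: values below it plus half of the ties, as a percentage."""
    n = len(data)
    c_s = sum(1 for x in data if x < score)    # number of values less than the score
    f_s = sum(1 for x in data if x == score)   # number of values equal to the score
    return 100 * (c_s + 0.5 * f_s) / n


# Made-up scores for illustration only; 44 appears three times.
scores = [35, 40, 44, 44, 44, 50, 55, 60, 70, 80]

print(percentile_rank(scores, 44))  # 35.0
```

Here cₛ = 2 and fₛ = 3, so the rank is (2 + 1.5) / 10 × 100 = 35.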
Let’s take another example:
Here is the dataset of Shikhar Dhawan’s T20I scores; we are interested in knowing the percentile rank of 32.
Effect of transformations on Percentiles
Let’s say we have a dataset of temperatures in Fahrenheit and we convert it into Celsius (this is what we mean by a transformation).
The temperature is converted to Celsius using the relation C = (5/9) × (F − 32), which is a scale-and-shift transformation of the form a × x + c, with a = 5/9 and c = −160/9.
We are interested in computing the pᵗʰ percentile in this new transformed dataset to see the effect of transformations on the percentile values.
First, we compute the location of the pᵗʰ percentile using the standard formula. Notice that the formula does not contain ‘a’ or ‘c’, which were used to transform the dataset. This means the location of the pᵗʰ percentile does not change when the dataset is transformed.
Since the location has not changed, the integer and fractional parts of the location of the pᵗʰ percentile also remain the same.
The sorted values at those positions are simply transformed by ‘a’ and ‘c’ (assuming ‘a’ is positive, so the sorted order is preserved). This means that if we scale and shift the data, the percentile value also gets scaled and shifted by the same transformation.
So, coming back to our example where we have converted the temperature values, we can compute the new percentile values from the old ones.
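Here is a small check of this property on the temperature example. It assumes the location rule L = p × (n + 1) / 100 for the percentile and a positive scale factor, so the sorted order of the data is preserved. The temperatures and the helper names `percentile` and `to_celsius` are made up for illustration.

```python
import math


def percentile(data, p):
    """Return the p-th percentile, assuming the location rule L = p * (n + 1) / 100."""
    values = sorted(data)
    n = len(values)
    location = p * (n + 1) / 100
    i = int(location)              # integer part of the location
    f = location - i               # fractional part of the location
    if i < 1:
        return values[0]
    if i >= n:
        return values[-1]
    return values[i - 1] + f * (values[i] - values[i - 1])


def to_celsius(f):
    """Scale-and-shift transformation a * x + c with a = 5/9 and c = -160/9."""
    return (f - 32) * 5 / 9


# Made-up temperatures in Fahrenheit, for illustration only.
fahrenheit = [50, 59, 68, 77, 86, 95, 104]
celsius = [to_celsius(t) for t in fahrenheit]

p = 40
direct = percentile(celsius, p)                    # percentile of the transformed data
via_old = to_celsius(percentile(fahrenheit, p))    # transform the old percentile
print(math.isclose(direct, via_old))  # True
```

The comparison uses `math.isclose` rather than `==` because the two routes can differ by floating-point rounding.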
In summary, in the last few articles, we have discussed what the term percentile means, how to compute the pᵗʰ percentile, the frequently used percentiles, how the median is the same as the 50th percentile, how to compute the percentile rank of a value, and the effect of transformations on percentiles.