In the previous article, we discussed the measures of centrality, how to compute them, and the relationship between different measures for different types of distributions. In this article, we discuss what percentiles are and how to compute the pᵗʰ percentile value in a given dataset.
So, let’s start with the question: “What are percentiles?”
Here is some intuition for percentiles. Suppose someone scores 45 out of 100 on a test. How would we rate the performance?
One way to judge it would be to call it a bad performance because the person scored less than 50%. But that would not be the right way: in most competitive exams, the highest score is around 60–70% of the maximum total score, so we cannot decide on the performance based on a single score alone.
So, what we look at in such situations is not just the score but how it compares with other scores. Say a person scored 45 out of 100 on a test and only 2 students scored more than 45; then how would we rate the performance? Now the score of 45 looks good on a comparative basis.
So, we are now judging the score in terms of percentiles instead of absolute numbers.
Let’s understand what ‘percentile’ means using an example. Say a university conducts a written test and sets the criterion that students who score more than the 70th percentile will be called for interviews.
Say the following are the scores of 25 students:
To compute the 70th percentile, we sort the data (in this case, the test scores) and look for a value such that 70% of the data elements are less than this value. That’s the formal definition of a percentile.
pᵗʰ percentile: the value such that p percent of the data elements in the dataset are less than it.
Here is the procedure to compute the percentile value:
- Sort the data
- Compute the location of the pᵗʰ percentile (orange box in the above image); the location is given by the formula in the image below:
Here, in the formula, ‘p’ is the percentile we are interested in computing and ‘n’ represents the total number of data points.
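The formula image is not reproduced in this text, so what follows is a reconstruction rather than a quotation. The locations quoted in this article (18.2 for p = 70 and 20.8 for p = 80, both with n = 25) are consistent with the common (n + 1) rule; the symbol $L_p$ for the location is a label introduced here:

$$
L_p = \frac{p}{100}\,(n + 1)
$$

For the example, $L_{70} = \frac{70}{100} \times (25 + 1) = 18.2$.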
We get the location of the 70th percentile as 18.2. The positions of the elements run from 1, 2, … all the way up to 25, so let’s try to understand what the location 18.2 means.
We have 25 elements in the dataset as follows, along with their locations:
Intuitively, 18.2 lies between locations 18 and 19, and it is closer to location 18.
The 70th percentile value, in this case, is going to be greater than 56 (the value at location 18) and less than 59 (the value at location 19), and it will be closer to 56 because location 18.2 is closer to location 18.
So, we start moving from the 18th location towards the 19th. The difference between the values at these two locations is 3 units (59 − 56), and we are interested in 0.2 of those 3 units, because that is where the position of the 70th percentile lies in this case (18 + 0.2). That gives 56 + 0.2 × 3 = 56.6.
So, 56.6 is the value such that 70% of the values are less than it.
The intuition behind the procedure for computing percentiles:
The intuition for this formula for the location of the pᵗʰ percentile is straightforward: we are looking for the location such that p% of the data elements in the dataset are less than the value at that location.
So, when the data is sorted, the data points are numbered (their locations) from ‘1’ to ‘n’ (the total number of data points), and we are interested in, say, 70% of ‘n’ (for the 70th percentile).
Then we take the integer part and the fractional part of this location (the one that corresponds to the pᵗʰ percentile).
The percentile value would be given as:
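Since the formula image is not reproduced here, the following is a reconstruction that matches the worked example. The notation is assumed: $i$ is the integer part of the location, $f$ is its fractional part, and $Y_i$ is the value at location $i$ in the sorted data.

$$
P_p = Y_i + f\,\bigl(Y_{i+1} - Y_i\bigr)
$$

With $i = 18$, $f = 0.2$, $Y_{18} = 56$ and $Y_{19} = 59$, this gives $P_{70} = 56 + 0.2 \times (59 - 56) = 56.6$.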
Let’s understand this formula better with the help of a number line. The location of the 70ᵗʰ percentile is 18.2, which lies between locations 18 and 19, and we can divide the interval between 18 and 19 into 10 equally spaced parts of 0.1 each.
Similarly, the values at locations 18 and 19 are 56 and 59 respectively, and we can again divide the interval from 56 to 59 into 10 equally spaced parts (the difference between the two values is 3 units, so each of the 10 parts spans 0.3 units). Location 18.2 is two parts past 18, so the value is two parts past 56: 56 + 2 × 0.3 = 56.6.
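The number-line picture can be sketched in a few lines of Python. This is a minimal illustration that uses only the numbers quoted above (locations 18 and 19, values 56 and 59); the variable names are chosen here and do not come from the article.

```python
# Endpoints taken from the worked example: locations 18 and 19 hold 56 and 59.
lower_location = 18
lower_value, upper_value = 56, 59
parts = 10  # split each interval into 10 equal parts

# Pair every tick on the location line with the matching tick on the value line.
ticks = []
for k in range(parts + 1):
    location = round(lower_location + k / parts, 1)
    value = round(lower_value + k * (upper_value - lower_value) / parts, 1)
    ticks.append((location, value))

for location, value in ticks:
    print(location, value)
```

The third pair in `ticks` matches location 18.2 with the value 56.6, which is the 70th percentile computed earlier.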
With this procedure for computing the percentile value, we can answer the original question: the scores in pink in the image below are those of the students who qualify for the interview process based on the 70ᵗʰ percentile score.
Similarly, if the criterion is changed to select students who scored more than the 80th percentile, we can follow the same procedure and get the results.
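The whole procedure (sort, locate, interpolate, short-list) can be sketched as below. Several things here are assumptions rather than the article's own material: the 25 scores are hypothetical, because the original data table is not reproduced, and they are chosen so that locations 18 and 19 hold 56 and 59 as in the worked example. The (n + 1) location rule is inferred from the quoted locations 18.2 and 20.8, and the handling of locations outside 1..n is an addition that the article does not discuss. The function name `percentile_value` is also made up for this sketch.

```python
def percentile_value(data, p):
    """Sketch of the p-th percentile via the (n + 1) location rule and linear interpolation."""
    if not data:
        raise ValueError("data must not be empty")
    ordered = sorted(data)             # step 1: sort the data
    n = len(ordered)
    location = p * (n + 1) / 100       # step 2: 1-based location of the percentile
    i = int(location)                  # integer part of the location
    f = location - i                   # fractional part of the location
    # Assumption (not covered in the article): clamp locations outside 1..n.
    if i < 1:
        return float(ordered[0])
    if i >= n:
        return float(ordered[-1])
    lower, upper = ordered[i - 1], ordered[i]   # values at locations i and i + 1
    return lower + f * (upper - lower)          # step 3: interpolate


# Hypothetical scores of 25 students (NOT the article's data).
scores = [45, 21, 59, 36, 85, 30, 52, 66, 24, 41, 74, 48, 27,
          56, 38, 79, 32, 61, 43, 70, 35, 54, 40, 50, 46]

cutoff_70 = percentile_value(scores, 70)
cutoff_80 = percentile_value(scores, 80)

# Students scoring strictly more than the cut-off are short-listed.
shortlisted_70 = [s for s in sorted(scores) if s > cutoff_70]
shortlisted_80 = [s for s in sorted(scores) if s > cutoff_80]

print(round(cutoff_70, 2), shortlisted_70)
print(round(cutoff_80, 2), shortlisted_80)
```

With these hypothetical scores the 70th percentile cut-off is 56.6 and seven students are short-listed; raising the bar to the 80th percentile moves the cut-off to 65 and leaves the five students at locations 21 to 25.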
The question is: why did we compute the exact score? Couldn’t we just say, from the location of the 80th percentile (20.8), that the scores at locations 21, 22, 23, 24, and 25 will be short-listed for interviews?
One reason is that releasing the cut-off score lets the other students see by how much they missed it.
Let’s also consider a special case in which the location of the pᵗʰ percentile coincides with the position of a data element (meaning the fractional part of the location is 0), so that the percentile value is one of the data elements itself.
So, the point being made here is that the percentile value can be a value in the dataset itself.
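As an illustration chosen here (it is not taken from the original figure), take n = 25 and p = 50. Assume the (n + 1) location rule that is consistent with the locations quoted earlier, and write $Y_i$ for the value at location $i$ in the sorted data:

$$
L_{50} = \frac{50}{100} \times (25 + 1) = 13, \qquad P_{50} = Y_{13} + 0 \times (Y_{14} - Y_{13}) = Y_{13}
$$

The fractional part of the location is 0, so the 50th percentile is exactly the 13th sorted value, which is also the median of the 25 scores.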
In the next article, we discuss two alternative approaches to compute the percentile value.