In the last article, we looked at the histogram as a way of describing and understanding the trends in quantitative data. Let’s look at another kind of plot for describing quantitative data: the stem and leaf plot. It is an efficient way of describing small to medium data sets.
Let’s say we have the data for the runs scored by Sachin in his last 30 ODIs.
A stem and leaf plot represents every number/data item in the data (the score in each match, in this case) in two parts: one is called the stem and the other is called the leaf.
Let’s take one number from the above dataset, say 69. In a stem and leaf plot, we typically take the last digit as the leaf and the remaining digit(s) as the stem, so 69 has the stem 6 and the leaf 9. Let’s take other values from the dataset as well:
So, this is what the data would look like in a stem and leaf plot. The entire dataset in this tabular format would be represented as:
Let’s understand this using the row having the stem 1. This row has 1 as the stem and 0458 as the leaf, which means that in the interval of 10 to 20 there are the values 10, 14, 15 and 18 (i.e. the stem joined with each individual digit from the leaf). That’s what this table captures.
At this point, it might be obvious that this is just like an inverted way of showing a Histogram.
The first row essentially represents the interval 0 to 10, and the values in that interval are the scores 2, 2, 3, 4, 6, 7 and 8. Similarly, the third row is the interval 20 to 30, and it has the values 22, 24, 27 and 28. So, that’s how we interpret this tabular format.
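To make the construction concrete, here is a minimal Python sketch of the same idea. It uses only the scores quoted in the text so far (not the full 30-match dataset), and the function name `stem_and_leaf` is just an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all digits but the last)
    and collect the last digits as leaves, in sorted order."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 69 -> stem 6, leaf 9
        rows[stem].append(leaf)
    return dict(rows)

# Only the scores quoted in the text, not the complete dataset.
scores = [2, 2, 3, 4, 6, 7, 8, 10, 14, 15, 18, 22, 24, 27, 28, 69]
plot = stem_and_leaf(scores)

# Print every stem in the range, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Printing the empty stems as well keeps the rows evenly spaced, which is what lets the table be read like a histogram.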
We could also use a stem and leaf plot for continuous data, where we have fractional numbers. The convention varies from place to place: many packages simply round off each number, so 58.9 becomes 59. Those packages effectively convert the continuous data to discrete data, and with the discrete data we can draw the stem and leaf plot as discussed above.
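As a rough sketch of that rounding step in Python: only 58.9 comes from the text, the other readings are made up for illustration, and rounding halves upward is an assumption (packages differ, and Python’s built-in `round()` sends exact ties to the nearest even integer instead).

```python
import math

def round_half_up(x):
    """Round a non-negative number to the nearest integer, with .5 going up.
    This tie rule is an assumption; real packages may round differently."""
    return math.floor(x + 0.5)

# 58.9 is from the text; 61.2 and 47.5 are hypothetical readings.
readings = [58.9, 61.2, 47.5]
rounded = [round_half_up(x) for x in readings]

# Once rounded, each value splits into (stem, leaf) exactly as before.
pairs = [divmod(value, 10) for value in rounded]
for reading, (stem, leaf) in zip(readings, pairs):
    print(f"{reading} -> {stem} | {leaf}")
```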
So, this tabular format actually looks like a histogram in the sense that it conveys the same information. For example, the interval 20 to 30 contains 1 value, the interval 30 to 40 contains 2 values, the interval 40 to 50 contains no values in the dataset, and so on.
Now one question is: what if we have larger values? In both the examples that we took, we had only 2 digit values, so it was an easy decision to use one digit as the stem and the other as the leaf. But what if we have larger values, like in the below image?
If we have bigger values, like 6 digit values, let’s look at what happens if we draw the stem and leaf plot as before, where we keep the last digit for the leaf and the remaining digits for the stem.
This does not look very interesting, as we can’t see any patterns here. If we look at it as an inverted histogram, the numbers on the x-axis are 17952, 18059, and so on, and each of these numbers has only one element inside it, so all the bars are of size 1. As we know, such narrow class intervals typically do not help in getting interesting patterns from the data.
So, what we do in this case is choose the last four digits as the leaf and the first two digits as the stem.
Now it looks like in the interval of 170000 to 180000 (1.7 lakh to 1.8 lakh) there is one value, with the leaf 9523, so that’s 179523. In the interval of 1.9 lakh to 2 lakh there are two values, i.e. 193750 and 198738. So, that’s how we read this table.
So, if we have larger values, we need to decide what kind of class interval makes sense (how many digits to consider as the stem and how many as the leaf).
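The effect of that choice can be sketched in a few lines of Python. The three values below are reconstructed from the stems and leaves quoted in the text (stem 17 with leaf 9523, stem 19 with leaves 3750 and 8738); the helper name `split_value` and its `leaf_digits` parameter are illustrative assumptions.

```python
def split_value(value, leaf_digits):
    """Split a non-negative integer into (stem, leaf),
    keeping `leaf_digits` digits in the leaf."""
    stem, leaf = divmod(value, 10 ** leaf_digits)
    # Zero-pad the leaf so that e.g. 190042 keeps the leaf "0042".
    return stem, str(leaf).zfill(leaf_digits)

# Reconstructed from the stem/leaf pairs quoted in the text.
values = [179523, 193750, 198738]

for leaf_digits in (1, 4):
    rows = {}
    for value in values:
        stem, leaf = split_value(value, leaf_digits)
        rows.setdefault(stem, []).append(leaf)
    print(f"leaf_digits={leaf_digits}: {len(rows)} rows")
    for stem, leaves in sorted(rows.items()):
        print(f"  {stem} | {' '.join(leaves)}")
```

With a 1 digit leaf every value lands in its own row, while a 4 digit leaf groups the two values that share the stem 19.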
The key thing to note here is that every row is one class interval, and we write the value(s) falling in that class interval in the leaf part. There are multiple conventions for this. One convention, when the values have a large number of digits (like the 6 digit data in the above example), is to first decide what is okay for the stem (a 1 digit stem or a 2 digit stem) and put that in the stem column. Then, instead of writing out the entire remaining value in the leaf, we keep just one digit there. So, in the below image, the first row keeps only the digit 9, and it is understood that this is some value in the 9000s; similarly, the 3rd row keeps the digits 3 and 8, which are understood to be values in the 3000s and the 8000s. At the same time, it does no harm to keep the full value. It depends on whether the plot is becoming cumbersome: if it is, we retain just one digit, with the interpretation that whatever digit is there is multiplied by 1000, because we are curtailing the last three digits of the number.
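Here is a small sketch of this one-digit-leaf convention in Python. The values are reconstructed from the digits quoted in the text, and the function name `stem_and_truncated_leaf` is hypothetical; the leaf is truncated (not rounded), matching the "multiply by 1000" reading above.

```python
def stem_and_truncated_leaf(value, leaf_digits=4):
    """Return (stem, first leaf digit), dropping the remaining leaf digits."""
    stem, leaf = divmod(value, 10 ** leaf_digits)
    leading_digit = leaf // 10 ** (leaf_digits - 1)  # 9523 -> 9
    return stem, leading_digit

# Reconstructed from the stems and leaf digits quoted in the text.
values = [179523, 193750, 198738]
truncated = [stem_and_truncated_leaf(v) for v in values]

for value, (stem, digit) in zip(values, truncated):
    # Reading the row back: the digit stands for digit * 1000.
    approx = stem * 10_000 + digit * 1_000
    print(f"{stem} | {digit}   (reads as about {approx}, actual {value})")
```

The price of the shorter leaf is visible in the printout: 179523 can only be read back as "about 179000".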
The key thing here is that instead of just drawing bars, we are actually writing down the values, so we are having more details here as compared to the histogram.
What if a row has many values?
Sometimes a single stem, say 43 as depicted in the below image, has a large number of values: 012, 019, all the way up to 472, and then 537, 551, 577, all the way up to 991. In such cases, we divide this row into two parts. We write the stem 43 twice: the first row contains only the leaves whose first digit is 0 to 4, and the second row contains the leaves whose first digit is 5 to 9.
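A minimal Python sketch of this split, using only the leaves named in the text (the image has more leaves in between) and a hypothetical helper `split_row`:

```python
def split_row(stem, leaves):
    """Split one crowded row into two rows with the same stem:
    leaves starting with 0-4 first, leaves starting with 5-9 second."""
    low = [leaf for leaf in leaves if leaf[0] in "01234"]
    high = [leaf for leaf in leaves if leaf[0] in "56789"]
    return [(stem, low), (stem, high)]

# Only the leaves quoted in the text; kept as strings to preserve leading zeros.
leaves_43 = ["012", "019", "472", "537", "551", "577", "991"]
split = split_row(43, leaves_43)

for stem, leaves in split:
    print(f"{stem} | {' '.join(leaves)}")
```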
Stem Plot vs Histogram
Here are the stem and leaf plot and the histogram for the runs scored by Sachin in 30 ODIs. In the histogram, we have the intervals 0 to 10, 10 to 20, and so on. Similarly, in the stem and leaf plot, we have the stems 0, 1, 2, and so on, where we know that the stem 2 corresponds to the interval 20 to 30.
Now if we think of the stem and leaf plot as an inverted histogram, with a bar drawn over each row, it will look like:
This has the added advantage that we are now also zooming into the bar. In the histogram, we don’t know what the individual values are, but here we have access to the individual values as well. That’s how a stem and leaf plot is different from a histogram: it has more details. At the same time, whatever patterns we could discover from a histogram (in the above case, that there are a lot of values in the earlier intervals and very few values in the later class intervals) can also be understood from a stem and leaf plot.
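The correspondence can be checked with a short Python sketch: the height of each histogram bar is simply the number of leaves in the matching row. The scores below are only the ones quoted earlier in the article, not the full 30-match dataset.

```python
from collections import Counter

# Only the scores quoted in the article, not the complete dataset.
scores = [2, 2, 3, 4, 6, 7, 8, 10, 14, 15, 18, 22, 24, 27, 28, 69]

# A histogram with class width 10 counts the values sharing a stem.
bar_heights = Counter(score // 10 for score in scores)

for stem in range(0, max(bar_heights) + 1):
    low, high = stem * 10, stem * 10 + 10
    # A Counter returns 0 for stems that have no values.
    print(f"{low:>2} to {high:<3} {'#' * bar_heights[stem]}")
```

The bars carry the counts only; the stem and leaf rows carry the same counts plus the digits themselves.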
A stem and leaf plot is not preferred for large datasets. Here is another dataset where we have a large number of values. Here we can’t really make sense of all the zoomed-in values that we have; we end up treating each row as a long bar and are not really able to see all the values inside it.
When we have really large datasets, a stem and leaf plot is not preferred; we would rather have a histogram.
The advantage of having a stem and leaf plot is that we have these values inside which makes it very easy to spot certain patterns within each class interval.
For example, if we look at the class interval 60 to 70 here, we see that a lot of the values there are 64, so the value 64 appears far more frequently than any other value in that row, such as 61, 65, 67 or 69.
So, patterns like this, about what is happening inside a class interval, become obvious in a stem and leaf plot.
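The same check can be done programmatically by counting the leaves in a row. In the sketch below, the values 61, 64, 65, 67 and 69 come from the text, but how often each appears is an assumption, since the image is not reproduced here.

```python
from collections import Counter

# Values from the text; the number of repeats of 64 is hypothetical.
row_60s = [61, 64, 64, 64, 65, 67, 69]

leaf_counts = Counter(value % 10 for value in row_60s)
most_common_leaf, times = leaf_counts.most_common(1)[0]

print(f"Most frequent leaf in the 60s row: {most_common_leaf} ({times} times)")
```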
In the above image, we can also see that the row corresponding to 80 has all numbers which are multiples of 2.
Similarly, if we look at the rows corresponding to the intervals 20 to 30 and 40 to 50, we see that the difference between consecutive values is 3. So, we have the value 41, then 44, then 47, and the same is true for the row with the stem 2, i.e. the interval 20 to 30.
So, such patterns become clear in a stem and leaf plot because we have a zoomed-in version of the data: within the bars, we also see the values, without missing any of the other details that we see in a histogram. The overall trend is still visible here. The only issue is that we can do this only for small datasets; for large datasets, it’s just too much data, and we won’t be able to make sense of the zoomed-in information.
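Both kinds of pattern (a constant gap, or all values being even) are easy to test for once the row’s values are written out. In this Python sketch the row 41, 44, 47 is from the text, the 80s row is made-up illustrative data, and `common_step` is a hypothetical helper name.

```python
def common_step(values):
    """Return the gap between consecutive sorted values if it is the
    same everywhere, otherwise None."""
    ordered = sorted(values)
    steps = {b - a for a, b in zip(ordered, ordered[1:])}
    return steps.pop() if len(steps) == 1 else None

row_40s = [41, 44, 47]      # from the text
row_80s = [80, 82, 84, 88]  # hypothetical values, all multiples of 2

step_40s = common_step(row_40s)
step_80s = common_step(row_80s)
all_even_80s = all(value % 2 == 0 for value in row_80s)

print(f"40s row: constant step = {step_40s}")
print(f"80s row: constant step = {step_80s}, all even = {all_even_80s}")
```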
Then, we also have back to back stem and leaf plots, which allow us to compare two datasets. Say we want to compare the ODI scores of Sachin with his Test scores. The sample data looks like the below:
Here, once again we see that there are a lot of values in the lower range, so there are a lot of values below 20.
So, a back to back stem and leaf plot helps us to compare two datasets.
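As a closing sketch, a back to back plot can be built in Python by sharing one column of stems, with one dataset’s leaves growing to the left and the other’s to the right. Both score lists below are hypothetical sample data (the article’s actual ODI and Test samples are in the image), and the function names are illustrative.

```python
from collections import defaultdict

def rows_by_stem(values):
    """Map each stem (value // 10) to its sorted list of leaf digits."""
    rows = defaultdict(list)
    for value in sorted(values):
        rows[value // 10].append(value % 10)
    return rows

def back_to_back(left, right):
    """Return the lines of a back to back plot: left leaves | stem | right leaves."""
    left_rows, right_rows = rows_by_stem(left), rows_by_stem(right)
    lines = []
    for stem in sorted(set(left_rows) | set(right_rows)):
        # Left leaves are reversed so the smallest digit sits next to the stem.
        left_leaves = "".join(str(d) for d in reversed(left_rows.get(stem, [])))
        right_leaves = "".join(str(d) for d in right_rows.get(stem, []))
        lines.append(f"{left_leaves:>8} | {stem} | {right_leaves}")
    return lines

# Hypothetical sample scores, for illustration only.
odi_scores = [2, 14, 18, 27, 69]
test_scores = [4, 10, 15, 53]

lines = back_to_back(odi_scores, test_scores)
print("\n".join(lines))
```

Reversing the left-hand leaves is a common convention so that both sides read outward from the shared stem.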