Distribution of a categorical variable
So far we have looked at the distribution of data using both a distribution plot and a box plot. In both cases we were looking not at the values themselves but at their frequencies (how often a certain value occurs), and in both cases the variable was continuous.
The diamonds dataset (available through the ‘seaborn’ package) also contains some categorical attributes, as a snapshot of the dataset shows.
Attributes like ‘cut’, ‘color’, and ‘clarity’ seem to have a finite number of values. So, in this article, we discuss how to plot the distribution of a discrete or a categorical variable.
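One way to check which attributes are categorical is to count the distinct values in each non-numeric column. The sketch below is a rough illustration only: the small DataFrame is a hand-typed stand-in shaped like the diamonds data (an assumption made so the example runs without downloading anything). With network access, `sns.load_dataset("diamonds")` would load the full dataset instead.

```python
import pandas as pd

# Hand-typed stand-in shaped like the diamonds data (illustrative rows only).
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31, 0.24],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "color": ["E", "E", "E", "I", "J", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2", "VVS2"],
    "price": [326, 326, 327, 334, 335, 336],
})

# Keep only the non-numeric columns and count their distinct values.
categorical = df.select_dtypes(exclude="number")
unique_counts = categorical.nunique()
print(unique_counts)
```

A small number of distinct values relative to the number of rows suggests that a column is a good candidate for a bar plot.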
A bar plot is one way to see the distribution of a categorical variable.
First, we group the dataset by the attribute of interest, say ‘cut’, and then apply the ‘.count()’ method to the result.
This command splits the data by the possible values of ‘cut’ and, for each unique value, counts the non-null values in each of the other columns of the dataset.
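The grouping step might look like the following sketch. The DataFrame is again a hand-typed stand-in for the diamonds data, and one ‘price’ entry is deliberately left missing (an assumption for illustration) to show that ‘.count()’ skips nulls.

```python
import pandas as pd

# Hand-typed stand-in for the diamonds data; one price is missing on purpose.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "color": ["E", "E", "E", "I", "J", "J"],
    "price": [326, None, 327, 334, 335, 336],
})

# Split by 'cut', then count the non-null entries in every other column.
counts = df.groupby("cut").count()
print(counts)
```

Here the ‘Premium’ group has two rows, but its ‘price’ count is one because the missing price is not counted.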
If we want the count for one specific column rather than for all columns, we can select that column by name after grouping. For example, selecting ‘cut’ gives, for each value of ‘cut’, the number of rows with a non-null ‘cut’ value.
This command returns a series whose index holds the distinct values of ‘cut’ and whose values are the corresponding frequencies.
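A minimal sketch of the single-column count, again on a hand-typed stand-in rather than the real diamonds data:

```python
import pandas as pd

# Hand-typed stand-in for the diamonds data (illustrative rows only).
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "price": [326, 326, 327, 334, 335, 336],
})

# Group by 'cut' and count only the 'cut' column: the result is a Series.
cut_counts = df.groupby("cut")["cut"].count()
print(cut_counts)
```

The index of `cut_counts` holds the category labels and the values hold how often each label occurs.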
We can then use the index of this series for the x-axis of a bar plot and its values for the y-axis.
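Plotting the series could be sketched as below, using matplotlib's `bar` as one possible choice (the original plotting call is not shown, so this is an assumption). The data is a hand-typed stand-in for the diamonds dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hand-typed stand-in for the diamonds data (illustrative rows only).
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
})
cut_counts = df.groupby("cut")["cut"].count()

# Category labels go on the x-axis, their frequencies on the y-axis.
fig, ax = plt.subplots()
ax.bar(cut_counts.index, cut_counts.values)
ax.set_xlabel("cut")
ax.set_ylabel("count")
```

Calling `plt.show()` afterwards displays the figure in an interactive session.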
The ‘Ideal’ cut is the most frequent, and ‘Fair’ and ‘Good’ are significantly less frequent than the other three (‘Very Good’, ‘Premium’, and ‘Ideal’).
We can repeat this process for the other categorical attributes as well.
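Repeating the count for several attributes can be sketched with a simple loop. The `counts_by_column` dictionary is a hypothetical helper name introduced here, and the data is a hand-typed stand-in for the diamonds dataset.

```python
import pandas as pd

# Hand-typed stand-in for the diamonds data (illustrative rows only).
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "color": ["E", "E", "E", "I", "J", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2", "VVS2"],
})

# Build one frequency Series per categorical attribute.
counts_by_column = {}
for column in ["cut", "color", "clarity"]:
    counts_by_column[column] = df.groupby(column)[column].count()
    print(counts_by_column[column], end="\n\n")
```

Each Series in the dictionary can then be passed to a bar plot in the same way as the ‘cut’ counts.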
For ‘clarity’, some values are very infrequent, while ‘SI1’ is quite common and ‘VS2’ is also common.
For ‘color’, the distribution appears to be closer to a uniform distribution.
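One rough way to judge how close a categorical distribution is to uniform is to compare each category's share with the share it would have if all categories were equally likely, that is 1/k for k categories. The sketch below uses a hand-typed stand-in for the ‘color’ column, so its numbers say nothing about the real diamonds data.

```python
import pandas as pd

# Hand-typed stand-in for the 'color' column (illustrative values only).
df = pd.DataFrame({"color": ["E", "E", "E", "I", "J", "J"]})

color_counts = df.groupby("color")["color"].count()

# Share of each category, and the share each would have if uniform.
proportions = color_counts / color_counts.sum()
uniform_share = 1 / len(color_counts)

# Absolute gap between each observed share and the uniform share.
deviation = (proportions - uniform_share).abs()
print(deviation)
```

Small deviations across all categories would support the reading that the distribution is close to uniform.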
So, this is how we plot the distribution of a categorical variable. We leverage the split-apply principle: first we group the data by a categorical attribute, and then we apply the count() method to the grouped dataset. Once the result is available as a series object, we put its index on the x-axis and the corresponding frequency values on the y-axis. The index should contain a finite, reasonably small set of values; otherwise the bar plot would not make much sense.