# Distribution of a categorical variable


So far we have looked at the distribution of data with both a distribution plot and a box plot. In those plots we do not look at the numbers themselves but at their frequency (how often a certain value is present). In both cases, however, the variables were continuous.

The diamonds dataset (available with the ‘**seaborn**’ package) also has some categorical attributes. Below is a snapshot of this dataset.

Attributes like ‘**cut**’, ‘**color**’, and ‘**clarity**’ seem to have a finite number of values. So, in this article, we discuss **how to plot the distribution of a discrete or a categorical variable**.

**A bar plot is one way to see the distribution of a categorical variable**:

First, we group the dataset by the attribute of interest, say ‘**cut**’, and then apply the ‘.**count()**’ method to the grouped data.

This command splits the data by the possible values of the ‘**cut**’ attribute and then, for each unique value of ‘**cut**’, counts the non-null values of each of the other attributes in the dataset.
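The grouping step can be sketched as follows. This is a minimal illustration, not the original command: the small `diamonds` frame and its values are made up so the example runs on its own, and it only assumes that pandas is installed.

```python
import pandas as pd

# Tiny stand-in for the diamonds data; the values are invented for illustration.
# With seaborn installed, sns.load_dataset("diamonds") would load the real dataset.
diamonds = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Premium", "Good", "Ideal", "Fair"],
    "color": ["E", "E", "I", "J", "H", "E"],
    "price": [326, 340, 334, None, 335, 337],  # one missing price on purpose
})

# Split the rows by each unique value of 'cut', then count the non-null
# entries of every other column within each group.
counts_per_column = diamonds.groupby("cut").count()
print(counts_per_column)
```

Because the single ‘Good’ row has a missing price, its count in the `price` column is 0 while its count in the `color` column is 1, which shows that `count()` works column by column and ignores nulls.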

If we want the count for a specific column rather than for all columns, we can select that column after grouping. For example, selecting ‘**cut**’ itself counts the data items in each group that have a non-null value for the ‘**cut**’ attribute.

This command returns a Series whose index holds the distinct values of ‘**cut**’ and whose values are the corresponding frequencies.
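A sketch of the single-column count, again on a made-up stand-in frame rather than the real dataset:

```python
import pandas as pd

# Invented stand-in rows; only the 'cut' column matters here.
diamonds = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Premium", "Good", "Ideal", "Fair"],
    "price": [326, 340, 334, 327, 335, 337],
})

# Group by 'cut' and count only the 'cut' column: the result is a Series
# indexed by the distinct cut values, holding how often each one occurs.
cut_counts = diamonds.groupby("cut")["cut"].count()
print(cut_counts)
```

As an aside, `diamonds["cut"].value_counts()` should give the same frequencies in one call, sorted from most to least frequent.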

We can take the **index of this series** for the **x-axis** and its **values** for the **y-axis**.
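One way to draw this with matplotlib is sketched below. The frequencies in `cut_counts` are invented and stand in for the Series produced by the grouped count; the backend line is only there so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up frequencies standing in for the result of the grouped count.
cut_counts = pd.Series({"Fair": 1, "Good": 1, "Ideal": 3, "Premium": 1}, name="cut")

fig, ax = plt.subplots()
# Index (category labels) on the x-axis, frequencies on the y-axis.
bars = ax.bar(cut_counts.index, cut_counts.values)
ax.set_xlabel("cut")
ax.set_ylabel("count")
# In an interactive session, plt.show() would display the figure.
```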

The ‘**Ideal**’ cut is the most dominant one, while ‘**Fair**’ and ‘**Good**’ are significantly less frequent than the remaining two cuts, ‘**Premium**’ and ‘**Very Good**’.

We can repeat this process for the other categorical attributes as well.
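To avoid repeating the same steps by hand, they can be wrapped in a small helper. The function name `plot_category_counts` and the sample rows are hypothetical, chosen only for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows for two categorical attributes.
diamonds = pd.DataFrame({
    "clarity": ["SI1", "VS2", "SI1", "I1", "SI1", "VS2"],
    "color": ["E", "F", "E", "G", "F", "G"],
})

def plot_category_counts(df, column, ax):
    """Hypothetical helper: bar plot of how often each value of `column` occurs."""
    counts = df.groupby(column)[column].count()
    ax.bar(counts.index, counts.values)
    ax.set_title(column)
    ax.set_ylabel("count")
    return counts

# One subplot per categorical attribute.
columns = ["clarity", "color"]
fig, axes = plt.subplots(1, len(columns), figsize=(8, 3))
results = {col: plot_category_counts(diamonds, col, ax)
           for col, ax in zip(columns, axes)}
```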

For ‘**clarity**’, some values are very infrequent, while the ‘**SI1**’ category is quite popular and ‘**VS2**’ is popular as well.

With respect to the ‘**color**’ attribute, the distribution seems to be closer to a uniform distribution.

So, this is how we plot the distribution of a categorical variable. We use the split-apply-combine principle: first group the data by a categorical attribute, then apply the count() method to the grouped data. Once the result is available as a Series object, we put its index on the x-axis (the number of distinct values should be small, otherwise the bars would not make much sense) and the corresponding frequencies on the y-axis.

References: PadhAI