Distribution of Data — Histogram

Parveen Khurana
6 min read · Jan 9, 2021

In the last article, we looked at how to represent the dataset in a tabular format and how to style the tables to get insights at a high level.

In this article, we look at how to understand the distribution of a column, i.e., how the different values of a particular attribute are distributed.

Here we take a sample and plot its distribution, that is, how the data in that sample is spread across values. Several kinds of plots serve this purpose; we start with the histogram.

Histogram

Let’s first create some random data: 1000 data items drawn from a normal distribution.
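A minimal sketch of this step, assuming NumPy is used to draw the sample (the exact call and the seed are assumptions, chosen here only for reproducibility):

```python
import numpy as np

# Seeded generator so the sample is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# 1000 draws from the standard normal distribution (mean 0, standard deviation 1).
x = rng.normal(size=1000)

print(x.shape)  # (1000,)
```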

To plot the distribution of this data, i.e., to see how many data points fall in a given range of values, we can use ‘distplot()’ from seaborn and pass in the input data variable ‘x’. (Note that seaborn 0.11 and later deprecate ‘distplot()’ in favor of ‘histplot()’ and ‘displot()’.)

So, this one line of code plots the distribution of the data. In a notebook it also prints a little text just above the plot. That text is the representation of the object returned by the plotting call, not a warning, and we can suppress it by ending the line with a semicolon.

This is a histogram: it shows the frequency of different input values (x-axis), normalized so that the total area of the bars is 1. The plot has two parts: the bars, and a smooth curve fitted to the histogram (a kernel density estimate, or KDE). We can hide this smooth curve by setting the additional argument ‘kde’ to False.
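To make the normalization concrete, here is a small check with NumPy (used only for illustration; seaborn does the equivalent internally when it draws a density histogram). Bar height times bar width, summed over all bins, comes out to 1:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1000)

# density=True rescales the bar heights so that sum(height * width) equals 1.
heights, edges = np.histogram(x, bins="auto", density=True)
widths = np.diff(edges)
area = np.sum(heights * widths)

print(round(area, 6))  # 1.0
```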

We can also set another argument, ‘rug’, to True. This places a tick mark on the x-axis for each data point, which is a good way to spot outliers and to see the density in different regions.

We can set color codes in seaborn as well, which can make the plot easier to read.

We can also specify the number of bins. For example, if the ‘bins’ argument is set to 20, the plot will have 20 equal-width bins.

If only the ‘kde’ plot is required, we can call the ‘kdeplot()’ method.

The area under this curve is approximately 1.

An additional argument, ‘shade’ (named ‘fill’ in newer seaborn versions), can be set to True to shade the area under the curve. This is very helpful when there are multiple ‘kde’ plots on the same graph.

Let’s create a second sample of 1000 data items, this time from a uniform distribution, and plot it together with the first one.

The blue curve is for ‘x’ and the other one is for ‘y’. Since ‘y’ comes from a uniform distribution, its curve is roughly flat on top: every value in the range is roughly equally likely. The KDE smooths the edges, so instead of a vertical jump there is a sharp rise and fall at either end.

Let’s load in a dataset from the ‘seaborn’ repository and plot the distribution of attributes of interest

Each record represents one diamond and its respective attributes

‘cut’, ‘color’, and ‘clarity’ seem to be categorical values, and the remaining attributes seem to be numerical quantities.

We can check out the ‘.info()’ method’s summary for this dataset to ensure the data types for respective attributes are correct and to get a sense of non-null values

We can now take individual attributes and plot their distribution

This plot shows the carat values on the x-axis and the corresponding normalized frequency (the histogram) on the y-axis. It appears to be a multi-modal distribution, as there are multiple peaks.

The price seems to have a peak towards the left and it slowly tapers out as we move towards the right. So, this is a right-skewed distribution with a long tail towards larger values

The distribution of ‘x’ seems to lie mostly between 4 and 8. It has a large number of values at the lower end and several peaks in between.

One question at this point is why seaborn plots the x-axis from 0 to 10+ when the data lies between 4 and 8. One way to find out is to set ‘rug’ to True, which draws a tick mark on the x-axis for each data item.

So, there is a data item whose ‘x’ value is 0, and it appears to be an outlier as no other data item is close to it. Between 4 and 8, the ticks merge into a solid bar, which reflects the large number of data items there.
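The same observation can be cross-checked numerically. The sketch below uses a hypothetical helper, ‘summarize_range’, that reports the minimum, the maximum, and the number of zero entries in a column:

```python
import pandas as pd


def summarize_range(df, column):
    """Return the min, max, and number of zero entries for a numeric column."""
    values = df[column]
    return {
        "min": values.min(),
        "max": values.max(),
        "zeros": int((values == 0).sum()),
    }


# Tiny made-up frame just to show the output shape; it is not the diamonds data.
toy = pd.DataFrame({"x": [0.0, 4.2, 5.1, 7.9]})
print(summarize_range(toy, "x"))
```

Calling ‘summarize_range(d, "x")’ on the diamonds dataframe would show whether the axis range is being stretched by zero-valued entries.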

One way to avoid plotting too many data points is to plot a sample instead: take, say, 1000 items out of the ~54k in total. [d.sample(1000) returns a dataframe with 1000 randomly sampled rows]

There are no outliers in this one, and we are able to see some of the tick marks on the x-axis. We can play around with the number of bins as well to understand the distribution better

We can plot multiple attributes together

The blue and the orange curves almost overlap, so ‘x’ and ‘y’ seem to be very similarly distributed.

In this article, we discussed distplot(), which helps us understand the distribution of an attribute. For example, for the diamonds dataset, the distplot for price suggests that it has a long tail. The plot also helps us spot outliers and see whether multiple attributes have the same sort of distribution. For example, from the last plot, we can say that the attributes ‘x’ and ‘y’ in the diamonds dataset have a similar distribution.

References: PadhAI
