Distribution of Data — Histogram
In the last article, we looked at how to represent the dataset in a tabular format and how to style the tables to get insights at a high level.
In this article, we look at how to understand the distribution of a column, i.e., how the different values of a particular attribute are distributed.
Here we are interested in taking a sample and plotting its distribution, that is, how the data is spread across values in a given sample. Several kinds of plots serve this purpose; we start with the histogram.
Histogram
Let’s first create some random data: 1000 values drawn from a normal distribution.
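A minimal sketch of this step, assuming NumPy’s ‘default_rng’ generator. The seed value and the choice of a standard normal (mean 0, standard deviation 1) are assumptions, since the text only says ‘normal distribution’.

```python
import numpy as np

# The seed is arbitrary; it only makes the sample reproducible.
rng = np.random.default_rng(seed=42)

# 1000 draws from a normal distribution with mean 0 and standard deviation 1.
x = rng.normal(loc=0.0, scale=1.0, size=1000)

print(x[:5])              # first few values
print(x.mean(), x.std())  # should be close to 0 and 1 respectively
```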
To plot the distribution of this data, i.e., to see how many data points fall in a given range of values, we can use ‘distplot()’ from seaborn and pass in the input data variable ‘x’.
This one line of code plots the distribution of the data. In a notebook it also prints a little bit of text just above the plot, which is the representation of the object the function returns rather than a warning. We can ignore it, or suppress it by ending the line with a semicolon.
The result is a histogram: it shows the frequency of the different input values (x-axis), and it is normalized, meaning the total area of the bars is 1. The plot has two parts: the bars, and a smooth curve (a kernel density estimate, or KDE) fitted to the data. We can turn off the smooth curve by passing the additional argument ‘kde=False’.
We can also set the argument ‘rug’ to True. This places a tick mark on the x-axis for each data point, which is a good way to spot outliers and to see the density in different regions.
We can set the color codes in seaborn as well, which makes the plot easier to read.
We can specify the number of bins as well. For example, if the ‘bins’ argument is set to 20, the plot will have 20 equal-width bins.
If only the KDE curve is required, we can call the ‘kdeplot()’ function.
The area under this curve would add up to 1 approximately.
We can also pass ‘shade=True’ (named ‘fill’ in newer seaborn releases), which shades the area under the curve. This is very helpful when there are multiple KDE plots on the same graph.
Let’s create a second sample, ‘y’, of 1000 data items from a uniform distribution and plot it alongside ‘x’.
The blue curve is for ‘x’ and the other one is for ‘y’. Since ‘y’ comes from a uniform distribution, its curve is roughly flat on top: each value in the range is roughly equally likely. The KDE smooths the sharp edges of the distribution, so the curve rises and falls steeply at either end instead of jumping vertically.
Let’s load a dataset from the seaborn repository, the diamonds dataset, and plot the distribution of the attributes of interest.
Each record represents one diamond and its respective attributes
‘cut’, ‘color’, and ‘clarity’ seem to be categorical values, and the remaining attributes seem to be numerical quantities.
We can check out the ‘.info()’ method’s summary for this dataset to ensure the data types for respective attributes are correct and to get a sense of non-null values
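Alongside ‘.info()’, the split between categorical and numerical attributes can be read off programmatically. The helper below is a hypothetical convenience function, not part of pandas or seaborn. It treats every non-numeric column as categorical, which is an assumption.

```python
import pandas as pd


def split_columns_by_kind(df: pd.DataFrame):
    """Return (categorical, numeric) column names of a dataframe.

    Hypothetical helper: any column whose dtype is not numeric is
    counted as categorical.
    """
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return categorical, numeric
```

With the real data, this would be used as ‘d = sns.load_dataset("diamonds")’ (which downloads the file on first use), then ‘d.info()’ for the summary and ‘split_columns_by_kind(d)’ for the two lists.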
We can now take individual attributes and plot their distribution
This plot shows the different carat values on the x-axis and the corresponding frequency (the height of the histogram bars) on the y-axis. This seems to be a multi-modal distribution, as there are multiple peaks.
The price seems to have a peak towards the left and it slowly tapers out as we move towards the right. So, this is a right-skewed distribution with a long tail towards larger values
The values of ‘x’ seem to lie mostly between 4 and 8, with a large number of values at the low end of that range and multiple peaks in between.
One question at this point: why is seaborn drawing the x-axis from 0 to 10+ when the data lies between 4 and 8? One way to find out is to set ‘rug’ to True, which plots a tick mark on the x-axis for each data item.
So, there is a data item whose ‘x’ value is 0, and it appears to be an outlier, as no other data item is close to it. Between 4 and 8 the tick marks merge into a solid bar, which reflects the large number of data items in that range.
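What the rug marks suggest visually can be confirmed with a quick numerical check. The helper below is hypothetical and only reports the smallest value, the largest value, and how many entries are exactly 0 in a column.

```python
import pandas as pd


def value_range_and_zero_count(df: pd.DataFrame, column: str):
    """Return (min, max, number of zeros) for one numeric column.

    Hypothetical helper for confirming an outlier at 0.
    """
    col = df[column]
    return col.min(), col.max(), int((col == 0).sum())
```

For the diamonds dataframe ‘d’ this would be called as ‘value_range_and_zero_count(d, "x")’. A minimum of 0 explains why the x-axis starts there.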
One way to avoid plotting too many data points is to plot a sample, say 1000 items out of the total of ~54k data items: ‘d.sample(1000)’ returns a dataframe with 1000 randomly sampled rows.
There are no outliers in this one, and we are able to see some of the tick marks on the x-axis. We can play around with the number of bins as well to understand the distribution better
We can plot multiple attributes together
The blue and the orange curves almost overlap, so ‘x’ and ‘y’ seem to be very similarly distributed.
In this article, we discussed ‘distplot()’, which helps us understand the distribution of an attribute. For example, for the diamonds dataset, the distplot for price shows that it has a long tail. The plot also helps to spot outliers and to check whether multiple attributes have the same sort of distribution: from the last plot, we can say that the attributes ‘x’ and ‘y’ in the diamonds dataset have a similar distribution.
References: PadhAI