Box Plot

Parveen Khurana
5 min read · Jan 9, 2021

In the last article, we discussed the histogram, which helps us understand the distribution of an attribute. In this article, let’s look at another way of plotting the distribution of data: the box plot.

To start with, let’s take a sample of size 1000 from a normal distribution.
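A minimal sketch of drawing such a sample with NumPy. The seed, the variable name, and the choice of a standard normal (mean 0, standard deviation 1) are assumptions made for illustration:

```python
import numpy as np

# The seed is arbitrary; it only makes the sample reproducible.
rng = np.random.default_rng(seed=0)

# 1000 draws from a normal distribution with mean 0 and standard deviation 1.
normal_sample = rng.normal(loc=0.0, scale=1.0, size=1000)

print(normal_sample.shape)
print(np.median(normal_sample))
```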

Now we can simply use the ‘seaborn’ package to draw a box plot of this sample.
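One way to draw it, assuming a sample generated as above (the seed and variable names are illustrative). In a notebook the figure renders on its own; in a script you would also call matplotlib's `plt.show()`:

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed
normal_sample = rng.normal(size=1000)

# Passing the sample as `x` draws a single horizontal box.
ax = sns.boxplot(x=normal_sample)
ax.set_xlabel("value")
```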

The box depicts the median (the line inside the box) and the inter-quartile range, or IQR (the edges of the box sit at the first and third quartiles). On either end of the box are the whiskers. By default, seaborn extends each whisker to the farthest data point that lies within 1.5 * IQR of the box’s edge on that side, and the data points beyond the whiskers are drawn individually and termed outliers.
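To make these quantities concrete, here is a rough sketch that computes them by hand with NumPy. It follows the default 1.5 * IQR rule as we understand it (whiskers end at the most extreme data points inside the fences); the variable names are our own, and the plotting library's exact percentile method may differ slightly:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed
sample = rng.normal(size=1000)

# Quartiles and inter-quartile range (the box).
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything beyond the fences is drawn as an outlier.
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, median, q3)
print(lower_whisker, upper_whisker, len(outliers))
```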

So, for this sample, the median is close to 0, which is what we would expect for a normal distribution centered at 0. The quartiles and whiskers on either side of the median are roughly symmetric, which means there is no bias between positive and negative values, as we would expect for a normal distribution.

Let’s have a box plot for a uniform distribution:
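A sketch along the same lines as before, assuming a sample of 1000 values drawn uniformly between 0 and 1 (the sample size and seed are our assumptions):

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed

# 1000 draws, uniform on the interval [0, 1).
uniform_sample = rng.uniform(low=0.0, high=1.0, size=1000)

ax = sns.boxplot(x=uniform_sample)
```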

For the uniform distribution, the median is close to 0.5 and the box is almost symmetric around it. The whiskers go almost all the way from 0 to 1, and there are no outliers: the IQR is about 0.5, so a point would have to lie more than 0.75 beyond the box, well outside the 0 to 1 range of the distribution, to be marked as one.
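The absence of outliers can be checked numerically. This is a small sketch using the default 1.5 * IQR rule, with illustrative names and seed:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed
uniform_sample = rng.uniform(low=0.0, high=1.0, size=1000)

q1, q3 = np.percentile(uniform_sample, [25, 75])
iqr = q3 - q1

# With q1 near 0.25 and q3 near 0.75, the fences land near -0.5 and 1.5.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# No value of a [0, 1) sample can fall outside fences that wide.
n_outliers = int(np.sum((uniform_sample < lower_fence) | (uniform_sample > upper_fence)))
print(lower_fence, upper_fence, n_outliers)
```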

A box plot is intended to show where the data is concentrated. The box (the blue rectangle in the plot) covers the middle 50% of the data items, and the remaining 50% lie outside it, 25% on each side. How far the whiskers reach from the box depends on the specified multiple of the inter-quartile range (we can change this), and any points beyond the whiskers are shown as outliers.

For example, let’s change the maximum distance between the edge of the box and the corresponding whisker.
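A sketch using seaborn's ‘whis’ argument on the uniform sample (the seed and sample size are assumptions):

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed
uniform_sample = rng.uniform(low=0.0, high=1.0, size=1000)

# whis=0.2 limits each whisker to 0.2 * IQR beyond the box edge,
# so many more points are drawn as outliers than with the default 1.5.
ax = sns.boxplot(x=uniform_sample, whis=0.2)
```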

In the above plot, ‘whis’ is specified as 0.2, which means each whisker reaches at most 0.2 * (inter-quartile range) beyond the corresponding edge of the box, and every point farther out is drawn as an outlier.

We can change the whisker distance for the normal distribution as well.
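The same idea applied to a normal sample. The value 0.2 is reused here only for illustration; the original plot may have used a different one:

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed
normal_sample = rng.normal(size=1000)

# Short whiskers on a bell-shaped sample push both tails into the outlier markers.
ax = sns.boxplot(x=normal_sample, whis=0.2)
```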

We can see that there are many outliers in this case and most of them overlap. One way to reduce the overlap is to specify the ‘fliersize’ argument, which sets the size of the outlier markers.
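For instance, with a smaller marker size (the value 2 is an arbitrary choice for this sketch):

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed
normal_sample = rng.normal(size=1000)

# fliersize shrinks the outlier markers so neighbouring points overlap less.
ax = sns.boxplot(x=normal_sample, whis=0.2, fliersize=2)
```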

Now the dots are much smaller, which might help in distinguishing the data points.

We can orient the box plot vertically as well by specifying the ‘orient’ argument.
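A sketch of one way to do this. Here the sample is passed through ‘data’ rather than ‘x’, since with only ‘x’ given some seaborn versions ignore a vertical ‘orient’; passing the sample as ‘y’ instead is an alternative that should give the same picture:

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed
normal_sample = rng.normal(size=1000)

# orient="v" asks for a vertical box: values run along the y-axis.
ax = sns.boxplot(data=normal_sample, orient="v")
```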

Let’s take the diamonds dataset again (available through seaborn) and plot the price as a box plot.

And below is the distribution plot of the price attribute

We see that the distribution has a long tail towards the right, whereas most of the data items are concentrated in the lower price range (say ≤ 5000). The same distribution takes the following form in a box plot.
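A sketch of how the price box plot could be produced. Note that `sns.load_dataset` fetches the data from seaborn's online example repository and caches it, so the first call needs network access:

```python
import seaborn as sns

# Downloads (and caches) the example "diamonds" dataset as a pandas DataFrame.
diamonds = sns.load_dataset("diamonds")

# Horizontal box plot of the price column.
ax = sns.boxplot(x=diamonds["price"])
```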

The median is closer to the left side of the box, the left whisker is close to the box’s left edge, the right whisker is very far away, and there are many outliers on the right-hand side.

The attribute ‘x’ of this dataset has a multi-modal distribution. Here is the box plot for the same.

We get an idea of the median, we see that there are few outliers on the left and many on the right, and that the left whisker is close to the box’s left edge whereas the right whisker is comparatively far from the right edge. But we don’t get a sense of where the typical modes are. A ‘kdeplot’ or ‘distplot’ is far more informative for that aspect (in recent seaborn releases ‘distplot’ is deprecated in favor of ‘histplot’ and ‘displot’).

This ‘distplot’ is far more informative than the box plot for understanding the different mode values.
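To see why the box plot hides the modes, here is a small sketch on synthetic data. The two-cluster sample is a made-up stand-in, not the ‘x’ attribute itself. The quartile summary gives no hint of two peaks, and the median even lands in a region with almost no data, while simple histogram counts expose the gap between the clusters:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed

# Hypothetical bimodal sample: two well-separated clusters of 500 points each.
bimodal = np.concatenate([
    rng.normal(loc=-3.0, scale=0.5, size=500),
    rng.normal(loc=3.0, scale=0.5, size=500),
])

# What the box shows: three numbers, none of which reveals two peaks.
q1, median, q3 = np.percentile(bimodal, [25, 50, 75])
print(q1, median, q3)

# What a histogram shows: counts per unit-wide bin from -6 to 6.
counts, edges = np.histogram(bimodal, bins=12, range=(-6.0, 6.0))
print(counts)
```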

In essence, the box plot helps us understand the spread around the median: the width of the box shows this, and a broader distribution has a broader box. We can look at the positions of the two whiskers to see whether the data is symmetric around the median or skewed. We can also see the outliers, and we can adjust the maximum distance between the box’s edge and the whisker to decide what we call an outlier.

References: PadhAI
