Box Plot
In the last article, we discussed the histogram plot, which helps us understand the distribution of an attribute. In this article, let’s look at another way of plotting the distribution of data: the box plot.
Let’s take a sample of size 1000 from a normal distribution to start with
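Here is a minimal sketch of drawing such a sample with NumPy. The article does not show which generator or parameters were used, so the standard normal (mean 0, standard deviation 1) and the seed below are assumptions.

```python
import numpy as np

# Assumption: a standard normal (mean 0, standard deviation 1). The seed is
# arbitrary and only makes the run repeatable.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

print(sample.shape)       # (1000,)
print(np.median(sample))  # close to 0 for a standard normal
```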
And now we can simply leverage the ‘seaborn’ package to draw a nice boxplot.
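The original plotting call is not shown, so the following is only a sketch of how the boxplot could be drawn. The sample parameters, the seed, and the axis label are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)
sample = rng.normal(size=1000)       # standard normal sample (assumption)

fig, ax = plt.subplots()
# Passing the values as `x` draws a single horizontal box.
sns.boxplot(x=sample, ax=ax)
ax.set_xlabel("value")
```

In a script, calling `plt.show()` afterwards displays the figure; in a notebook the plot renders on its own.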
The box depicts the median (the black line in the middle of the box) and the interquartile range (the box’s boundaries, which sit at the first and third quartiles). On either end of the box are the whiskers. By default, seaborn lets each whisker reach at most 1.5 * (interquartile range) beyond the box’s boundary on that side, and draws it out to the most extreme data point within that distance. Data points that lie outside the whiskers are termed outliers.
So, for this sample, the median is close to 0, which is what we would expect for a normal distribution centred at 0. The quartiles and whiskers on either side of the median line are roughly symmetric, which means there is no bias between positive and negative values, as expected for normally distributed data.
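To make the whisker rule concrete, here is an illustrative sketch that computes the same quantities by hand. `box_stats` is a hypothetical helper written for this article, not part of seaborn, and it assumes quartiles computed with NumPy's default linear interpolation and a non-negative `whis`.

```python
import numpy as np


def box_stats(values, whis=1.5):
    """Hypothetical helper: box edges, whisker ends, and outliers of 1-D data."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1

    # Fences: the farthest each whisker is allowed to reach (assumes whis >= 0).
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr

    # Each whisker stops at the most extreme data point inside its fence,
    # or at the box edge if no data point lies between the edge and the fence.
    whisker_low = min(values[values >= lower_fence].min(), q1)
    whisker_high = max(values[values <= upper_fence].max(), q3)

    # Anything beyond the fences is drawn as an individual outlier point.
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_low": float(whisker_low),
        "whisker_high": float(whisker_high),
        "outliers": outliers.tolist(),
    }


stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

For this toy list the quartiles are 3.25, 5.5 and 7.75, so the fences sit at -3.5 and 14.5, the whiskers end at 1 and 9, and 100 is the only outlier.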
Let’s have a box plot for a uniform distribution:
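A sketch of the uniform case, assuming a sample of 1000 values drawn between 0 and 1 (the sample size and the seed are assumptions; the article only states the range).

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)
sample = rng.uniform(low=0.0, high=1.0, size=1000)

fig, ax = plt.subplots()
sns.boxplot(x=sample, ax=ax)
ax.set_xlabel("value")
```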
Since this is a uniform distribution between 0 and 1, the median is close to 0.5 and the interquartile range is almost symmetric around it. The whiskers go all the way from 0 to 1, and because every value lies within the whiskers’ reach, there are no outliers.
A boxplot is intended to show where the data is accumulated. The box (the blue rectangle in the above image) spans the middle 50% of the data items (and their range), and the remaining 50% lie outside the box. How far the whiskers extend from the box depends on the specified multiple of the interquartile range (we can change this), and any points beyond the whiskers are drawn as outliers.
For example, let’s change the distance between the boundary of the box and the corresponding whisker.
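A sketch of shortening the whiskers through seaborn's `whis` argument, here on a uniform sample with `whis=0.2`. The sample, the seed, and the hand-computed count of flagged points are assumptions added for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)
sample = rng.uniform(low=0.0, high=1.0, size=1000)

fig, ax = plt.subplots()
# whis=0.2: each whisker may reach at most 0.2 * IQR beyond the box edge.
sns.boxplot(x=sample, whis=0.2, ax=ax)

# Count how many points now fall beyond the shortened whiskers.
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
n_flagged = int(np.sum((sample < q1 - 0.2 * iqr) | (sample > q3 + 0.2 * iqr)))
print(n_flagged)
```

With whiskers this short, a sizeable share of a uniform sample is drawn as outliers even though nothing about the data itself has changed.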
In the above plot, ‘whis’ is specified as 0.2, which means each whisker can extend at most 0.2 * (interquartile range) from the corresponding boundary of the box.
We can change the whisker distance from the box’s boundary for normal distribution as well
We can see that there are many outliers in this case, and most of them overlap. One way to reduce this overlap is to specify the ‘fliersize’ argument, which controls the size of the outlier markers.
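A sketch comparing the default marker size with a smaller one. The values `whis=0.2` and `fliersize=2` are assumptions, since the article does not state which values were used for the normal sample.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)
sample = rng.normal(size=1000)       # standard normal sample (assumption)

fig, (ax_default, ax_small) = plt.subplots(2, 1, sharex=True)

# Short whiskers flag many points; default marker size.
sns.boxplot(x=sample, whis=0.2, ax=ax_default)
ax_default.set_title("default fliersize")

# Same plot with smaller outlier markers (fliersize=2 is an assumed value).
sns.boxplot(x=sample, whis=0.2, fliersize=2, ax=ax_small)
ax_small.set_title("fliersize=2")

fig.tight_layout()
```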
Now the dots are much smaller, which might help in distinguishing individual data points.
We can orient the boxplot vertically as well by specifying the ‘orient’ argument
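A sketch of a vertical boxplot. How the original passed the sample is not shown; passing it through `data=` together with `orient="v"` is an assumption here, and passing the values as `y=sample` is another way to get a vertical box.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)
sample = rng.normal(size=1000)       # standard normal sample (assumption)

fig, ax = plt.subplots()
# A flat array passed as `data` is treated as one box; orient="v" keeps
# the values on the vertical axis.
sns.boxplot(data=sample, orient="v", ax=ax)
ax.set_ylabel("value")
```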
Let’s take the diamonds dataset again (available through seaborn) and plot the price as a boxplot.
And below is the distribution plot of the price attribute
We see that the distribution has a long tail to the right, whereas most of the data items are accumulated in the lower price range (say ≤ 5000). The same distribution takes the following form in a boxplot.
The median is closer to the left side of the box, the left whisker sits close to the box’s left boundary, and the right whisker is very far away. There are also a great many outliers towards the right-hand side.
The attribute ‘x’ of this dataset has a multi-modal distribution. Here is the boxplot for the same
We get an idea of the median value. There are few outliers on the left side and many on the right, and the left whisker is closer to the box’s left boundary, whereas the right whisker is comparatively far from the box’s right boundary. But we don’t get a sense of the typical modes. A ‘kdeplot’ or ‘distplot’ is far more informative for figuring out that aspect.
This ‘distplot’ is far more informative than the boxplot to understand the different mode values.
In essence, the box plot helps us understand the spread around the median: the width of the box shows this, and a broader distribution has a broader box. We can look at the positions of both whiskers to see whether they are symmetric around the median or whether the distribution is skewed. We can also see the outliers, and we can play with the distance between the box’s boundary and the whisker position to decide what we call an outlier.
References: PadhAI