The joint distribution of two variables

So far, we have discussed how to visualize and understand the distribution of a single attribute. In this article, we discuss the joint distribution of two variables.

A joint distribution helps us understand how two variables are related. If we have two variables, ‘x’ and ‘y’, we can plot a KDE for each, but those two plots would not tell us, for instance, what the distribution of ‘y’ looks like when ‘x’ is small, or more generally how the two attributes are correlated.

Let’s first define two independent variables, both normally distributed, and create a dataframe from them.
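A minimal sketch of this step, assuming NumPy and pandas. The sample size of 1000, the seed, the standard normal parameters, and the names ‘rng’, ‘n’ and ‘df’ are illustrative choices, not values taken from the article.

```python
import numpy as np
import pandas as pd

# Assumed values: a fixed seed for reproducibility and 1000 samples.
rng = np.random.default_rng(0)
n = 1000

# Two variables drawn independently from a standard normal distribution.
x = rng.normal(loc=0.0, scale=1.0, size=n)
y = rng.normal(loc=0.0, scale=1.0, size=n)

# One row per data item, one column per variable.
df = pd.DataFrame({"x": x, "y": y})
print(df.head())
```

Because the two columns are drawn separately, their sample correlation should be close to zero, which can be checked with ‘df["x"].corr(df["y"])’.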

Now we can draw a joint plot with ‘sns.jointplot()’, passing in the ‘x’ and ‘y’ columns of the newly created dataframe.

Alternatively, we can pass the column names ‘x’ and ‘y’ and supply the dataframe itself as the value of the ‘data’ argument.

What we get is a 2D scatter plot in which each dot corresponds to one row (data item) of the dataframe. The figure also includes two histograms. The one at the top shows the distribution of the attribute on the x-axis, that is, how the data is spread as we vary ‘x’. The one to the right of the scatter plot shows the distribution of the attribute on the y-axis.

So the figure plots the two variables jointly. Since the two variables are unrelated in this case, there is no clear trend: the points form a roughly circular cloud in the middle.

We can also draw a KDE version of the plot by setting the ‘kind’ argument to ‘kde’.

Instead of histograms, the margins now show smooth density curves, and the scatter plot is replaced by a shaded contour plot: the darker the color, the denser the region.

Let’s take another example in which the two variables are correlated.

Now the circular shape is replaced by an elongated, elliptical one. Both ‘x’ and ‘y’ still look normally distributed in the marginal plots, but the joint distribution in the middle shows a clear linear relationship: as we increase ‘x’ and move to the right, the typical value of ‘y’ shifts along with it.

Let’s look at the joint distribution of the ‘carat’ and ‘price’ attributes from the diamonds dataset (one of the example datasets available through the ‘seaborn’ package).

First of all, along the price axis, there are some diamonds at higher prices but far fewer than at the low end, and the ‘carat’ axis likewise has a long tail towards the right.

The KDE misses some information in the top-right of the plot because it does not draw individual points. It is like plotting a KDE for a single distribution without the rug: we cannot see where the outliers are, and the same is true here.

Most of the data items are in the dark blue region. As the carat size increases, the distribution of price shifts upward, so there is a positive correlation between ‘price’ and ‘carat’.

Still, this does not give us much more insight, because the sparse data points in the top-right are lost when the KDE draws continuous contours. To see them, we can draw a scatter plot of a sample of the data points.

The above plot is based on a sample of 500 data items and shows the histograms for ‘price’ and ‘carat’ along with the scatter plot of the two attributes.

There are a lot of data points towards the bottom left of the plot, and as the carat size increases the price tends to increase, which suggests a positive correlation between the two. Farther out, at larger carat sizes, the correlation is somewhat weaker (as highlighted by the circled data items in the image below, where a diamond of 2 carats has a lower price than a diamond of 1 carat).

In a similar manner, we can plot the joint distribution of any two attributes to explore how they are related. Below is a joint plot of the attributes ‘x’ and ‘price’ from the diamonds dataset.

Again, there are many points in the bottom left of the plot, which means many diamonds are small (‘x’ denotes the length of a diamond). As ‘x’ increases, the price also increases, and the relationship between the two attributes appears to follow a roughly quadratic curve. Here is a KDE plot of the same attributes using a sample of 500 points.

Looking at the KDE, we can say that the two attributes are correlated and that the trend bends upward for higher values of ‘x’: larger diamonds are more expensive, and their price rises faster than in proportion to their size.

In this article, we discussed the joint distribution of two attributes and how to interpret a joint plot to understand the correlation (if any) between them.

References: PadhAI