How to describe relationships between variables?
In the last article, we discussed how a histogram could help us understand the data distribution for an attribute. In this article, we discuss how to understand the relationship between multiple attributes.
Many real-world datasets describe each object with several attributes. For example, in the sports domain, a Cricket innings has the Runs scored, Balls faced, Minutes played, Strike Rate, and Type of dismissal. Similarly, in agriculture, for every farm we have information about the State, District, Area, and Yield. In e-commerce, an object such as a shirt is described by attributes like Color, Pattern, Size, and so on.
When we have multiple attributes like this, we are often interested in describing the relationships between them, or we expect certain relationships to hold. For example, we expect the runs scored to be related to the balls faced or, more mathematically speaking, that the runs scored is a function of the balls faced. Similarly, we would expect the Total Yield to be a function of, or related to, the Total Area. In the e-commerce domain, if we are buying accessories like bags, the price might be proportional to the size.
So, there exist relationships between these different attributes in the data and we want to be able to visualize at least for a pair of attributes what these relationships are.
We could draw histograms, for example of the runs scored and of the balls faced by Sachin in ODIs.
While we can see the individual trend for each attribute, we can’t really figure out the relationship between runs scored and balls faced by looking at separate histograms.
So, we need a different tool/plot to visualize the relationship between different attributes in the data.
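To see why separate histograms cannot reveal the relationship, here is a minimal sketch. The innings values are made up for illustration (they are not real ODI data), and the Pearson correlation is used only as a convenient numeric summary of how strongly two columns move together; the article itself does not rely on it. Shuffling one column leaves its histogram unchanged but breaks the pairing between the two attributes.

```python
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up innings (illustrative values only, not real data).
balls_faced = [5, 12, 20, 34, 45, 60, 72, 88, 100, 120]
runs_scored = [2, 8, 15, 30, 38, 55, 61, 80, 95, 118]

# Shuffle the runs column: the values are the same, so its histogram is
# the same, but each score is now paired with a different innings.
rng = random.Random(0)
shuffled_runs = runs_scored[:]
rng.shuffle(shuffled_runs)

same_histogram = Counter(shuffled_runs) == Counter(runs_scored)
r_original = pearson(balls_faced, runs_scored)
r_shuffled = pearson(balls_faced, shuffled_runs)

print("Histogram of runs unchanged:", same_histogram)
print("Correlation with original pairing:", round(r_original, 3))
print("Correlation with shuffled pairing:", round(r_shuffled, 3))
```

Both versions of the data produce identical histograms, yet only the original pairing shows a strong relationship, which is exactly the information a histogram throws away.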
In particular, we want to answer questions like: how does the score change as the number of balls faced increases? When he has faced a lot of balls, does he start scoring even faster, so that his score grows more quickly than it did earlier? This is the kind of trend we want to figure out from the data.
For this, we use a scatter plot, which allows us to find relationships between variables.
On the x-axis, we have the attribute ‘balls faced’, which in this data set ranges from 0 to 160, and on the y-axis we have the runs scored, which ranges from 0 to 200. Each point corresponds to one innings and has an x-coordinate (balls faced) and a y-coordinate (runs scored).
From this plot, the relationship between the two attributes looks almost like a straight line: as the balls faced increase, the runs scored increase roughly linearly. Towards the end, where the number of balls faced is above 140, we see a jump and the relationship is no longer linear. This makes sense: once he has faced a lot of balls, the innings is coming to an end and he starts scoring much faster, so when he faces more than 140 balls he scores disproportionately more than he would after facing, say, 100 balls.
This is an interesting way of revealing the patterns between two attributes in a data set. Scatter plots are meant for quantitative variables, not qualitative ones. The example above uses two discrete quantitative variables, and a scatter plot can also be used for two continuous variables.
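A scatter plot like the one described above can be drawn in a few lines. The sketch below assumes matplotlib is installed (the article does not say which tool produced its figures). The data values and the helper name `plot_scatter` are made up for illustration.

```python
# Made-up innings (illustrative values only, not real data).
balls_faced = [5, 12, 20, 34, 45, 60, 72, 88, 100, 120, 145, 150]
runs_scored = [2, 8, 15, 30, 38, 55, 61, 80, 95, 118, 170, 186]

# Each innings becomes one point: (x, y) = (balls faced, runs scored).
points = list(zip(balls_faced, runs_scored))


def plot_scatter(xs, ys, x_label, y_label, path="scatter.png"):
    """Draw one point per (x, y) pair and save the figure to `path`."""
    # matplotlib is a third-party dependency; it is imported here so that
    # the data preparation above works even without it.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_scatter(balls_faced, runs_scored, "Balls faced", "Runs scored")` writes the figure to `scatter.png`. The last two made-up innings sit above the straight-line trend, mimicking the jump beyond 140 balls described above.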
So, here on the x-axis we have the total area of the farm and on the y-axis we have the total yield from the farm, and there is an interesting relationship. We generally expect that as the size of the farm grows, the production increases linearly, and that is exactly what happens in the initial range of area in the above plot. For very large farms, however, the yield can’t keep growing linearly; after some point, it flattens.
A very large farm does not mean that the production keeps on growing, because other factors also play a role. For example, enough water may not be available, so even though we have a large area, we might be limited by the amount of water, fertilizers, pesticides, and so on.
It’s also possible that a large farm can’t be monitored effectively, so it gets affected by pests and diseases, and the production is a bit less than what we would expect given the size of the farm.
So, these are again interesting patterns. What this data tells us is that some of the large farms are not using their full production capacity, which is not a good thing. Once a Data Scientist looks at this data, he/she can point out to the organization/government that these large farms are not being utilized effectively, and suggest investigating the cause: should the farms be split into smaller ones, do they need other provisions such as a guaranteed supply of inputs, or are they located in very remote areas?
So, patterns that are not visible from individual histograms become visible when we draw scatter plots.
We can have a scatter plot for one continuous and one discrete variable as well.
So, here is the scatter plot for runs scored and strike rate, and we see a very peculiar relationship: as the strike rate increases, the runs scored increase very dramatically in many cases. Of course, there are some cases where the strike rate was very high and the runs scored were very low. This means that, in the interest of scoring very fast, he might have got dismissed early, and hence the total runs he scored were low.
So, this is an interesting way of judging whether a player can play very fast and still score a lot of runs: in the interest of playing fast, he should not get out very early, which was typically the case for other players.
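The high-strike-rate, low-score points are easier to understand once we recall that a batting strike rate is the number of runs scored per 100 balls faced. A small sketch, with made-up innings used purely for illustration:

```python
def strike_rate(runs, balls):
    """Batting strike rate: runs scored per 100 balls faced."""
    if balls <= 0:
        raise ValueError("strike rate is undefined when no balls were faced")
    return 100.0 * runs / balls


# Made-up innings as (runs scored, balls faced); not real data.
innings = [(12, 6), (4, 3), (45, 60), (95, 100), (150, 125)]

for runs, balls in innings:
    rate = strike_rate(runs, balls)
    print(f"runs={runs:3d}  balls={balls:3d}  strike rate={rate:6.1f}")
```

A short innings such as 12 runs off 6 balls has a strike rate of 200 but contributes very few runs, which is the kind of point that appears at the bottom right of the scatter plot.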
Typical trends that we see in scatter plots
Before we check out the typical trends in scatter plots, let’s do a quick recap of functions.
The first function in the above plot is a line, the second is a parabolic (quadratic) function, and the third is an exponential function. The output of the exponential function, i.e. y, increases very rapidly as x increases; compare it with the line, where y increases in proportion to the increase in x. Such functions/trends typically appear in scatter plots.
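To make the comparison concrete, the three kinds of functions can be tabulated side by side. The slope, intercept, and coefficient below are arbitrary choices for illustration and are not taken from the article’s plot.

```python
import math


def line(x, slope=2.0, intercept=1.0):
    """Linear function: y grows by a fixed amount per unit of x."""
    return slope * x + intercept


def parabola(x, a=1.0):
    """Quadratic function: y grows with the square of x."""
    return a * x ** 2


def exponential(x):
    """Exponential function: y is multiplied by e per unit of x."""
    return math.exp(x)


xs = [0, 1, 2, 3, 4, 5]
for x in xs:
    print(f"x={x}  line={line(x):6.1f}  parabola={parabola(x):6.1f}  "
          f"exponential={exponential(x):8.1f}")

# Increase in y for each unit step in x.
line_steps = [line(b) - line(a) for a, b in zip(xs, xs[1:])]
exp_steps = [exponential(b) - exponential(a) for a, b in zip(xs, xs[1:])]
print("line steps:", line_steps)
print("exponential steps:", [round(s, 1) for s in exp_steps])
```

The line adds the same amount at every step, while each step of the exponential is larger than the one before, which is why it shoots up so quickly on a plot.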
Here are two scatter plots. The first depicts the relationship between balls faced and minutes spent at the crease, and we can see an almost linear relationship between these two attributes. The second shows the relationship between minutes at the crease and runs scored. This one is also almost linear, although at the end it picks up a bit: if the player has stayed at the crease for that long, she/he can probably see the ball very well and can start hitting much bigger shots, which is why she/he might score higher. Otherwise, the relationship is almost linear.
In many cases, we might also see a quadratic relationship, especially in real-world data involving physical quantities like density, pressure, or temperature, where a quadratic relationship may arise naturally between the variables.
We might also see an exponential trend in a scatter plot.
In some plots, we might see a mixed kind of relationship. For example, in the plot below there is a linearly decaying relationship, but after some point it decays exponentially instead: it no longer goes down linearly but approaches 0 slowly.
So, this is an example where we have a mix of linear and exponential relationships. Similarly, in the plot below, the curve grows fast initially, but at some point the growth slows down.
We see this kind of plot in machine learning, especially the last one. Think of the x-axis as the amount of training data and the y-axis as the model’s accuracy or performance. Initially, as the machine sees increasing amounts of data, the performance increases, but beyond a point adding more training data does not add much value, so the performance saturates, or flattens out. At that point the team can probably stop the data collection exercise, if one is ongoing or planned, based on the model’s accuracy.
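The saturation effect can be sketched numerically. The curve below is a hypothetical learning curve chosen only because it rises quickly and then levels off. Its formula, the `ceiling` and `scale` parameters, and the `min_gain` threshold are all assumptions made for illustration and do not come from a real model.

```python
import math


def accuracy(n_examples, ceiling=0.95, scale=2000.0):
    """Hypothetical learning curve that levels off at `ceiling`."""
    return ceiling * (1.0 - math.exp(-n_examples / scale))


# Training set sizes: 0, 1000, ..., 10000 examples.
sizes = [1000 * k for k in range(11)]
scores = [accuracy(n) for n in sizes]

# Gain in accuracy from each additional batch of 1000 examples.
gains = [after - before for before, after in zip(scores, scores[1:])]

# First training set size whose latest batch added less than `min_gain`.
min_gain = 0.01
stop_at = next((n for n, g in zip(sizes[1:], gains) if g < min_gain), None)

for n, score, gain in zip(sizes[1:], scores[1:], gains):
    print(f"n={n:5d}  accuracy={score:.3f}  gain={gain:.3f}")
print("Gain drops below", min_gain, "at n =", stop_at)
```

Each additional batch of data helps less than the previous one, so a team could use a rule of this kind, with a threshold that suits their problem, to decide when further data collection is no longer worthwhile.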
It is also quite possible that we are looking for a relationship between two attributes but no clear relationship exists between them.
In that case, we get a scatter plot where the points are all over the place.
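As a rough illustration of what "no relationship" looks like in numbers, the sketch below generates two made-up attributes independently of each other and compares the average y across bins of x. The data, the bin edges, and the helper `mean_y_per_bin` are all hypothetical. If there were a trend, the bin averages would rise or fall together with x.

```python
import random

rng = random.Random(42)
n = 2000

# Two made-up attributes drawn independently of each other.
xs = [rng.uniform(0, 100) for _ in range(n)]
ys = [rng.uniform(0, 100) for _ in range(n)]


def mean_y_per_bin(xs, ys, edges):
    """Average y of the points whose x falls in each [lo, hi) bin."""
    means = []
    for lo, hi in zip(edges, edges[1:]):
        bucket = [y for x, y in zip(xs, ys) if lo <= x < hi]
        means.append(sum(bucket) / len(bucket))
    return means


edges = [0, 25, 50, 75, 100]
bin_means = mean_y_per_bin(xs, ys, edges)
spread = max(bin_means) - min(bin_means)

for (lo, hi), mean_y in zip(zip(edges, edges[1:]), bin_means):
    print(f"x in [{lo:3d}, {hi:3d})  mean y = {mean_y:5.1f}")
print("Spread of bin averages:", round(spread, 1))
```

The bin averages stay close to each other across the whole range of x, which matches a scatter plot with no visible pattern.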