In the last article, we gave an overview of Statistics and saw how statistical knowledge is required when selecting a sample and when designing an experiment. In this article, we look at some other tasks typically involved in the Data Science pipeline and see how statistics is required for each of them.
How to Describe and Summarise data?
Let’s say a Data Scientist at Netflix is analyzing user data to spot trends: how many hours people spend on TV shows, how many hours on movies, whether, on average, people spend more hours on weekends than on weekdays, and so on. Suppose all this data is given to the Data Scientist in a tabular format (the image below shows only 3 columns, but in the real world there would be many more).
We are considering a small amount of data (just 4000 rows), yet even with these 4000 rows in a tabular format, it becomes very difficult to answer simple questions. Here are some of the questions that a Data Scientist may be interested in:
What is the distribution of users in terms of the number of hours they spend on TV shows? For example, if a large proportion of users lies in the 80–90 hrs. category, that means Netflix has good content to engage users; if most users are in the less-than-5-hrs range, that means the content needs to be revised or there is not enough content to keep people engaged. So, we are interested in knowing these trends.
It’s impossible for a Data Scientist to answer these questions just by looking at the raw data; that’s where we need good tools for describing and summarising data, and that’s where plotting comes in. So, here is the same data represented as a histogram.
Here we have divided the number of hours into bins and, for each bin, plotted the number of users who spent that much time on the platform. From this plot we can say that a large number of users spend about 50 hours per month on the platform, while very few users spend 100 hours and very few spend 5 hours (the data plotted here is random data, just to convey the concept). Plotting the data makes the trend clear: most people are clustered around the center, so it looks like there is enough content to keep people engaged for 1.5–2 hrs every day. So, by plotting the data we get a very quick summary that is not at all obvious from the raw table. Similarly, there are other kinds of plots that help us understand trends in data.
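The kind of histogram described above can be sketched with NumPy alone; the watch-hours data below is synthetic, generated just to mimic the shape described (most users clustered near 50 hours per month):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic monthly watch hours for 4000 users, clustered around 50
hours = rng.normal(loc=50, scale=15, size=4000).clip(0, 120)

# divide the hours into bins and count the users falling in each bin
counts, edges = np.histogram(hours, bins=12)
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:5.1f}-{hi:5.1f} hrs: {n} users")
```

Printed as text here; in practice the same bins and counts would be drawn as a bar chart, e.g. with `matplotlib.pyplot.hist`.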
Apart from describing data, we can compute summary statistics: mean, median, mode, variance, and standard deviation. A company like Netflix might be interested in a question like: how many hours do users in the 20–25 age group spend watching TV shows? It only has data on its current users, but the population of 20–25-year-olds is much larger than that. What about the people who are not on Netflix today? Netflix wants to target them, so it wants to make inferences about this larger population and not just about the sample of the population already on the platform. That is why we compute these statistics from a sample and use them to reason about the larger population.
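These summary statistics are one-liners with Python’s standard library; the watch-hours sample below is fictional:

```python
import statistics

# fictional monthly watch hours for a small sample of users
hours = [52, 47, 61, 55, 49, 50, 58, 47, 53, 60]

print("mean:  ", statistics.mean(hours))      # 53.2
print("median:", statistics.median(hours))    # 52.5
print("mode:  ", statistics.mode(hours))      # 47
print("var:   ", statistics.variance(hours))  # sample variance
print("stdev: ", statistics.stdev(hours))     # sample standard deviation
```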
Why do we need Probability theory?
The answer to this question ties back to the concept of Population and Sample.
We know that if we take a homogeneous sample, then it’s a bad sample, as it is not representative of the entire population; if we take a varied group with a good representation of society, then that’s a good sample.
What we mean by ‘chance’ is the probability of something happening. So, for a good sampling strategy, every individual in the population should have an equal probability of becoming a part of the sample. That’s where the knowledge of Probability theory enters Statistics, at a minimum for creating a good sample, but it goes much beyond that.
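Giving every individual an equal chance is exactly what simple random sampling does; here is a minimal stdlib sketch (the population labels are made up):

```python
import random

population = list(range(1, 11))  # 10 individuals, labelled 1 to 10

# random.sample draws without replacement, giving every individual
# an equal probability of being included in the sample
sample = random.sample(population, k=2)
print(sample)
```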
Let’s take an example where our population size is 10 and, for whatever reason, we can only take a sample of size 2 (say, beyond that it becomes too expensive). There are 10 ways of selecting the first member of the sample, which leaves 9 ways of selecting the second member; overall, that means there are 90 ways of selecting a sample of size 2 from a population of size 10 (counting the order of selection).
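The 10 × 9 = 90 count treats the order of selection as significant; `itertools.permutations` enumerates exactly these ordered selections:

```python
from itertools import permutations

population = range(10)

# ordered selections of 2 members: 10 choices for the first slot,
# then 9 remaining choices for the second
ordered_samples = list(permutations(population, 2))
print(len(ordered_samples))  # 90
```

If the order of selection is ignored, each pair is counted twice, leaving 90 / 2 = 45 distinct samples (`itertools.combinations` would enumerate those).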
Now the question is: if we observe some trend in this sample, what is the chance that we will observe a similar trend in other samples, or, in general, what is the chance that we observe the same trend in the entire population?
This is an important question because in statistics we are always working with samples that are much smaller than the population, and this introduces uncertainty, or variability: we could have drawn many different samples (in the case above, there are 90 different ways of selecting one), so if we observe something in a sample, what is the chance that we will observe it in the entire population? These are the questions that drive a statistician, and that is why we need the knowledge of probability theory.
Magnifying the problem a bit more, suppose we have a population of 10000 and we want to select a sample of size 100; there are about ‘6.5*10²⁴¹’ ways of doing that. Say we take one such sample and compute the mean of some quantity; we are interested in knowing the chance that the mean of the entire population is close to the mean of the selected sample. What is the chance that the statistic’s value is representative of the entire population? To understand this, we need to understand ‘chance’, for which we need to understand probability theory.
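The quoted figure matches choosing an unordered sample of size 100 from a population of 10000, which Python’s `math.comb` can verify exactly:

```python
import math

# number of distinct (unordered) samples of size 100 from a population of 10000
ways = math.comb(10_000, 100)
print(f"{ways:.1e}")  # about 6.5e+241
```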
How do we give guarantees for estimates made from the sample?
The next question is: how do we give guarantees for the estimates we make from a sample?
Let’s consider the same example, where our total population size is 10 and we take a sample of size 2. As discussed above, there are 90 ways in which we can draw the sample. Let’s say we take 5 of the 90 possible samples and look at the mean height for each of these 5 samples.
There could be great variation in the mean heights computed from different samples. In fact, we can say that the mean itself has a probability distribution.
So, if we take these 90 samples one by one, there could be 3–4 samples where the mean height is 175 cms; there are samples at the other extreme as well, for which the mean is as low as 130 cms; but for a large number of samples, the mean lies between, say, 150 and 154 cms.
And we have this uncertainty because we are computing this mean from a smaller sample as compared to the entire population. So, the mean itself has a distribution and if we know this distribution, then we can start asking questions of the following form:
What we mean by this is: say the mean computed from a sample comes out to be 150 cms; we are then interested in finding a range, say from 145 cms to 160 cms, such that we are 95% sure that the mean of the true population lies in this range.
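With the normal approximation, such a 95% interval is just the sample mean plus or minus about 1.96 standard errors; here is a stdlib sketch on fictional heights:

```python
import math
import statistics

# fictional heights (cm) from a single sample
heights = [150, 148, 153, 155, 147, 151, 149, 152, 154, 150]

mean = statistics.mean(heights)                           # point estimate
se = statistics.stdev(heights) / math.sqrt(len(heights))  # standard error of the mean

# ~95% interval estimate under the normal approximation
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"point estimate: {mean:.1f} cm")
print(f"95% interval:  ({lower:.1f}, {upper:.1f}) cm")
```

For a sample this small, a t-based multiplier would be more appropriate than 1.96; the normal value is used here only to keep the sketch short.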
Whenever we see opinion polls or other estimates reported in a newspaper, they are always reported like “party X will win ‘26 +- 5’ seats”: from their sample the value is 26, but they are 95% confident that the final number of seats will lie in the range of 21 to 31. That is the kind of question a statistician is often asked to answer.
Here again, the concept of probability becomes important, because different values of the mean have different chances of showing up in a sample; some mean values are more likely to show up than very extreme ones. The mean computed from a single sample is called a Point Estimate, and if we specify an interval (with 95% confidence) within which the true mean of the entire population lies, that interval is called an Interval Estimate.
We are interested in computing both point estimates and interval estimates. We compute a point estimate because we typically have access to only a single sample (collecting more samples is expensive). We compute an interval estimate because we know that a single sample may not be reliable: there is a chance that the true mean differs from the mean computed from this single sample. Hence we give a range and say that we are 95% sure that the true mean (of the entire population) lies within it.
What is a hypothesis and how do we test it?
Let’s say the domain of interest is Cricket and we are looking at a sample of Bumrah’s bowling speeds. Here is the sample for which the bowling speed has been recorded (fictional data):
Now looking at this data we hypothesize that the mean bowling speed is greater than 90mph. This is the hypothesis that we have and now we want to test this hypothesis.
We can’t just average the entries in the sample and check whether the hypothesis is true. This ties back to the concepts of Population and Sample: the Population is all the deliveries ever bowled by Bumrah, and the Sample is the 6 deliveries we are considering here. What if Bumrah just got lucky, and the sample we picked happens to give an average speed greater than 90 mph?
So, here again, the concept of chance, or uncertainty, enters all the estimates we make, because we are dealing with samples, not the entire population. We can state this hypothesis, but we need a rigorous way to argue whether it is true or not.
Different samples can have different means, now we must be able to make statements of the below forms:
Let’s understand what this statement means:
What we can understand from the above distribution is that even if the true mean speed of the population is around 85 mph, there is still, say, a 25% chance of drawing a sample whose mean speed is greater than 90 mph; that is, there would be some samples (about 25% of them) in which the sample mean exceeds 90 even though the true mean of the population is less than 90. So, here again, we run into the question of chance: this could have happened purely by chance. To understand it better, we need the knowledge of Probability theory, which is discussed in another article.
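The standard tool for this kind of question is a one-sample t-test. Below is a stdlib sketch on fictional speeds, testing the hypothesis that the mean speed exceeds 90 mph; the only external input is the tabulated critical value for 5 degrees of freedom:

```python
import math
import statistics

# fictional bowling speeds (mph) for a sample of 6 deliveries
speeds = [91.2, 88.5, 92.0, 90.4, 89.1, 93.3]

mean = statistics.mean(speeds)
se = statistics.stdev(speeds) / math.sqrt(len(speeds))

# t-statistic for H0: true mean = 90 mph, vs H1: true mean > 90 mph
t = (mean - 90) / se
print(f"sample mean = {mean:.2f} mph, t = {t:.2f}")

# one-sided 5% critical value for 5 degrees of freedom
t_crit = 2.015
print("reject H0" if t > t_crit else "cannot reject H0")
```

Note that here the sample mean is above 90 mph, yet t stays below the critical value, so this sample alone cannot rule out that the excess is due to chance.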
Let’s consider an example where we are dealing with two populations:
Now we have two populations: one is the population of all the farms which use fertilizer X and the other is the population of all the farms which use fertilizer Y. And the hypothesis we have essentially is that one fertilizer is better than the other. Here again, we want to make statements of the following form:
There might also be scenarios where we are dealing with multiple populations, with a hypothesis of the form that a particular combination is most effective for improving fitness. Here again, the concept of chance is needed to rigorously say whether the value given by a sample is representative of the entire population:
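For two populations, the usual tool is a two-sample t-test; the sketch below computes Welch’s t-statistic on fictional yields for the two fertilizers:

```python
import math
import statistics

# fictional yields (tonnes/hectare) from farms using each fertilizer
yield_x = [4.1, 3.8, 4.5, 4.2, 3.9, 4.4]
yield_y = [3.6, 3.9, 3.5, 3.7, 3.8, 3.4]

mean_x, mean_y = statistics.mean(yield_x), statistics.mean(yield_y)
var_x, var_y = statistics.variance(yield_x), statistics.variance(yield_y)

# Welch's t-statistic for H0: both fertilizers give the same mean yield
se = math.sqrt(var_x / len(yield_x) + var_y / len(yield_y))
t = (mean_x - mean_y) / se
print(f"mean X = {mean_x:.2f}, mean Y = {mean_y:.2f}, t = {t:.2f}")
```

A large value of t relative to the t-distribution’s critical value would suggest that the difference in means is not just sampling noise.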
How to model the relationship between variables?
So, coming back to the Doctor’s example, let’s say the Doctor is interested in knowing: “What is the relationship between the number of days of treatment and the cholesterol level?” The treatment in this case could be the consumption of walnuts or some drug. That’s where Statistical Modeling comes in:
We are assuming that the relationship is linear and based on this assumption we want to make some statements about the relationships between different attributes.
The first thing we need to do is estimate the parameters (m, c) from the data. We need to estimate them because there are many possibilities: any of the lines in the image below could depict the relationship between the number of days of treatment and the decrease in cholesterol, and we can see that none of the lines is perfect; there are points that don’t lie on any line. Given lots of data, we need ways of estimating these parameters so that, for most data points, the predicted values are close to the true values.
Once we have the parameters, we are back to the question of uncertainty: we estimated the parameters from a sample, not from the population, so there is some error in the estimation, and we are interested in making statements of the following form:
So, we want to give an interval around the estimated values saying that we are 95% sure that the true value of the parameter ‘m’ lies within this interval.
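Estimating m and c by least squares, and reading off the point estimates, can be sketched in a few lines (the treatment data below is fictional):

```python
import statistics

# fictional data: days of treatment vs. decrease in cholesterol (mg/dL)
days = [10, 20, 30, 40, 50, 60]
drop = [4, 9, 13, 22, 24, 31]

# least-squares estimates of slope m and intercept c for drop ≈ m*days + c
mean_d, mean_y = statistics.mean(days), statistics.mean(drop)
m = (sum((d - mean_d) * (y - mean_y) for d, y in zip(days, drop))
     / sum((d - mean_d) ** 2 for d in days))
c = mean_y - m * mean_d
print(f"m = {m:.3f}, c = {c:.3f}")
```

An interval estimate for m would then come from the standard error of the slope, just as with the mean earlier.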
How well does the model fit the data?
We hypothesize that all five forms of getting dismissed (as in the above image) are equally likely; that is, the model says that the probability of each of the five forms is the same. Now we want to test whether this hypothesis is right.
We take a sample, say the last 100 dismissals, and look at the frequency of each type of dismissal in this data. Here it turns out that all five are not equally likely (below image): some types of dismissal are more likely than others, at least in this particular sample. So, there is a clear difference between the model, which says equally likely, and the estimate from the data, which says not equally likely.
So, the question we are now interested in is: “Are the variations observed in the sample significant, or due to random chance?” We clearly see that in the selected sample the five dismissals are not equally likely, but is it the case that the model actually fits the population well and we are seeing this result only by random chance? How do we compare the probabilities given by the model with the probabilities estimated from the data, and make statements of the form: we see certain differences between the model and what we estimate from the data, but we are confident that these are due to random chance and our model is actually correct?
So, here again, the concept of chance and probability theory comes in, and the Chi-square test (to be discussed in another article) helps us determine the goodness of fit of the model.
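A chi-square goodness-of-fit statistic compares observed and expected counts directly; the dismissal counts below are fictional, and the 5% critical value for 4 degrees of freedom (9.488) is a standard table entry:

```python
# chi-square goodness of fit: are the five dismissal types equally likely?
observed = [35, 25, 15, 15, 10]   # fictional counts from the last 100 dismissals
expected = [100 / 5] * 5          # equal likelihood: 20 dismissals of each type

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-square statistic = {chi_sq:.2f}")  # 20.00

# 5% critical value for 5 - 1 = 4 degrees of freedom
if chi_sq > 9.488:
    print("deviation from the model is unlikely to be pure chance")
else:
    print("deviation could plausibly be random chance")
```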
In this article, we discussed at a high level why we need the knowledge of Statistics to deal with various tasks in Data Science.