In the last article, we looked at an overview of Data Science, and at multiple places in that overview we called out statistics. So, in this article, we will zoom into statistics and see why its concepts are important.
What is Statistics?
As discussed in the previous article, there are 5 tasks involved in Data Science. These are collecting, storing, processing, describing, and modeling the data.
Knowledge of statistics is required for collecting, processing, describing, and modeling the data.
So, for collecting data, if we have to design experiments, or if we have to venture out to collect data, then we need knowledge of statistics in the form of Randomized Control Experiments. Similarly, when processing the data, knowledge of statistics is required to standardize and normalize the data.
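To make the processing step a little more concrete, here is a minimal sketch of what standardizing and normalizing could look like. It assumes the common readings of the two terms (z-scores and min-max scaling); the function names and the mileage readings are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical mileage readings (km per litre), made up for illustration.
mileage = [14.0, 16.0, 18.0, 20.0, 22.0]

z_scores = standardize(mileage)      # centred on 0, spread of 1
scaled = min_max_normalize(mileage)  # smallest -> 0.0, largest -> 1.0
```

Both transformations keep the ordering of the readings and only change the scale on which they are expressed.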
Describing data is one major area of statistics, called Descriptive Statistics. Here we study different plots, for example the histogram: how to read it, what its significance is, how to say what percentage of the data lies in a certain range, and so on. Similarly, we have summary statistics like mean, median, mode, and variance, which also come under descriptive statistics.
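As a small illustration of these descriptive tools, the sketch below computes the summary statistics with Python's standard `statistics` module and answers the "what percentage of data lies in a certain range" question for one histogram bin. The scores and the `percent_in_range` helper are hypothetical.

```python
from statistics import mean, median, mode, pvariance

# Hypothetical exam scores, made up for illustration.
scores = [52, 61, 61, 67, 70, 74, 74, 74, 81, 90]

summary = {
    "mean": mean(scores),          # arithmetic average
    "median": median(scores),      # middle value of the sorted data
    "mode": mode(scores),          # most frequent value
    "variance": pvariance(scores), # average squared deviation from the mean
}


def percent_in_range(values, low, high):
    """Percentage of values v with low <= v < high (one histogram bin)."""
    count = sum(1 for v in values if low <= v < high)
    return 100 * count / len(values)


share_70s = percent_in_range(scores, 70, 80)  # scores in the 70-79 bin
```

Here `pvariance` treats the ten scores as the whole data set; if they were a sample from a larger group, `statistics.variance` would be the usual choice.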
And finally, in modeling, we have to decide about statistical modeling vs algorithmic modeling.
The purpose of modeling is to draw inferences about the data (to answer questions like what the underlying distribution of the data is, what the underlying relationships between the variables are, and so on). So, in statistics, we use the terminology that we need to collect data, describe data, and draw inferences from the data (and drawing inferences is largely what statistical modeling is about).
Statistics is the science of collecting data, describing data, and drawing inferences from data. This definition is incomplete without understanding key terms in statistics.
In Statistics, we are always interested in studying a large collection of people or objects. This could be the population of all the citizens in a state or country, or it could be a collection of objects: say we want to study the diameter (or its variance) of all the nuts and bolts manufactured by a company, or the average mileage of the cars produced in a factory, or different drugs and their effects, and so on (we collectively call all of these objects). So, that’s what we are interested in studying in statistics.
And we are interested in questions of the form: out of a ‘population’ of citizens (different types of citizens on the basis of education, age group, profession, state, and so on), what proportion of the citizens support candidate XYZ? This is what opinion poll agencies typically do. The challenge here is that it is infeasible or expensive to survey all citizens. Answering this question by surveying all citizens is as good as conducting the election itself; that’s expensive, and we don’t have the authority to do it. Still, as a news agency or an opinion poll agency, we are interested in predicting what proportion of citizens is in favor of a given candidate. So, this is one example of a question that we might want to answer about a population.
Another example: say we are interested in knowing the average mileage of all the cars manufactured in a particular factory (there could be different units within the same factory where the cars are manufactured). This is again an interesting question to answer, because once the car is marketed, we want to say that this is its average mileage, that it is better than our competitor’s, and so on. But we cannot test all the cars that are being manufactured; we cannot run all of them in the test environment, take readings from each, and compute an average, as this is again going to be expensive. We are interested in this entire population of cars, and we clearly see a challenge in studying it.
Another example: say there are different types of farms in different regions, growing different types of crops, using different types of irrigation, and so on. We want to know whether there is a lot of variation in the yield of paddy farms in a state: are there some regions with a very high yield per acre and others where the yield is low? Maybe we want to take some action based on that. Here the challenge is that we cannot survey all the farms, as we don’t have enough resources. So, again we have a very large ‘population’ that we would like to survey, but doing so is infeasible.
This is a very common problem in statistics, and the solution is to survey a few elements and draw inferences about all elements from this smaller group. That’s what opinion poll agencies do: they take a sample of the population from the constituency where the election is being conducted, ask for their opinions, and based on that make their projections that Party X is going to get so many seats, and so on. The same idea holds for computing the cars’ mileage and for estimating the yields of the farms.
We want to make statements about the ‘population’, but we are not studying the ‘population’ itself (because that is infeasible or expensive), so we study a ‘sample’, and from it we make inferences about the ‘population’.
We are typically interested in estimating some ‘parameter’ of the ‘population’. This parameter could be anything; in the examples discussed above, it was a proportion, an average, and a variance:
The proportion of citizens in favor of a particular candidate, average mileage of cars produced in a factory, variance in the yield of farms in a state, and so on.
So, these were the parameters of the population we discussed in the above cases.
We are going to take a small ‘sample’, and the same quantity computed from the sample is termed a statistic. The statistic serves as our estimate of the parameter.
So, these are the 4 most fundamental concepts in Statistics: population, parameter, sample, and statistic.
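The four terms can be seen side by side in a short simulation. This is only a sketch under made-up assumptions: a hypothetical population of 100,000 citizens of whom 40% support candidate XYZ, and a simple random sample of 1,000 of them.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Population: 100,000 hypothetical citizens, 1 = supports candidate XYZ.
# The 40% support level is made up for illustration.
population = [1] * 40_000 + [0] * 60_000

# Parameter: the true proportion, computed from the whole population.
# In a real survey this is exactly the number we cannot observe.
parameter = sum(population) / len(population)

# Sample: 1,000 citizens drawn at random, without replacement.
sample = rng.sample(population, 1_000)

# Statistic: the same quantity, computed from the sample only.
statistic = sum(sample) / len(sample)

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

The statistic will usually land close to the parameter but not exactly on it, and it changes from one random sample to the next.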
How to select a Sample?
The key idea is that we take a ‘sample’ of the population, study that and then draw inferences about the ‘population’. So, the question is how to select a ‘sample’?
Let’s answer this by considering a scenario where we take a homogeneous sample for a survey.
Say, for some reason, we select only university students as our sample. Let’s assume we are an opinion poll agency, and we randomly pick our sample from a nearby university and ask the students which candidate they prefer. The rationale behind choosing students could be our feeling that they are more willing to talk, so instead of going to different places (IT parks, railway stations, bus stops, etc.) and asking a more heterogeneous population, we decided to ask university students. The question is: is it a good sample or not?
If we take a homogeneous sample (only university students in this case), we are looking at a very specific age group (say 18 to 30). What about all the other people in the population? What about people who do not have a university-level education or who are not in white-collar jobs? Elderly citizens might have problems that university students do not relate to, and hence their opinions could be very different from those of university students. So, if we take a homogeneous sample, we are not capturing a good representation of the entire population, and that’s a problem.
The same thing holds for the case where we want to estimate the average mileage. We have multiple cars, all manufactured in the same factory, but there are 5 units in the factory. Let’s say we take all our cars from unit 1 (say the testing area is close to this unit) and report the average based on that.
Now, again, the problem is that maybe units 2, 3, 4, and 5 had slight differences in the way they manufacture things, or maybe the workers or supervisors there were different, and maybe something got missed there or something was done better there. Hence the cars from those units are a bit different from the sample that we have chosen. So, here as well, the ‘sample’ is not a good representative of the true ‘population’.
Similarly, for our last example, suppose we want to conduct a survey of farm yields. Ideally, we should go to all the villages and all the districts in the state, take farms from everywhere, and then estimate the yield. But since it is more convenient for the surveyor to reach nearby areas, maybe the surveyor only goes to nearby farms, talks to the farmers there, and draws conclusions from that. Again, this is wrong. Because these farms are near the city (say, the surveyor’s place), they might have better connectivity, access to better fertilizers or seeds, or a better water supply than farms in the more remote areas of the state, so this sample could have a much higher yield than other farms. It could also be the case that pollution and other factors in the city affect these farms, so that their yield is much lower than that of farms in remote areas, which are more protected from pollution and other environmental factors. It could go either way, and in any case, this sample is not a very good representative of the population.
A sample and the resulting statistic will be useful only if the sample is representative of the population.
In this situation, suppose we take only university students and find that 80% of them favor candidate XYZ. That won’t be a reflection of what is going to happen in the actual election, because there the votes of elderly citizens, of people who don’t have a university education, of people in different professions, and so on are all going to count, and we have not bothered to take their opinions. Just because candidate XYZ is popular with 80% of the university students, it does not follow that he is popular with everyone else; it is quite possible that his support in the entire population is just 30% or 40%. This kind of error is likely whenever the sample is not a good representative of the total population.
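The gap between the two kinds of sample can be sketched with a small simulation. All the numbers here are made up to mirror the example: students form 20% of a hypothetical electorate and support candidate XYZ at 80%, while everyone else supports him at 25%, which gives an overall support of 36%.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical electorate of 50,000 voters; every number is made up.
# Each voter is a (group, supports_xyz) pair.
students = [("student", True)] * 8_000 + [("student", False)] * 2_000
others = [("other", True)] * 10_000 + [("other", False)] * 30_000
population = students + others


def support_rate(voters):
    """Fraction of the given voters who support candidate XYZ."""
    return sum(1 for _, supports in voters if supports) / len(voters)


true_rate = support_rate(population)         # the parameter: 0.36

student_only = rng.sample(students, 500)     # homogeneous sample
random_sample = rng.sample(population, 500)  # drawn from everyone

biased_estimate = support_rate(student_only)
random_estimate = support_rate(random_sample)

print(f"true rate       = {true_rate:.2f}")
print(f"students only   = {biased_estimate:.2f}")
print(f"random sample   = {random_estimate:.2f}")
```

Both samples have the same size; only the way they were chosen differs, and that alone is enough to push the student-only estimate far from the true rate.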
Since all our studies are going to be based on the sample, it is very very important that the sample is a good representation of the entire population. And in another article, we will learn the different sampling techniques.
How to design an Experiment?
One of the tasks involved in Data Science is collecting data. When the data does not exist in a database and is not available online, we need to venture out and collect it. Here is a situation where we need to collect data:
Let’s say a doctor is interested in answering the following question: “Does eating 5–7 walnuts a day for 3 months help in reducing cholesterol level?” This data is not readily available, and there could be similar questions in other domains like agriculture, say “Does using fertilizer X for 5 months help in improving the yield?”. So, in such scenarios, we need to go out and collect some data.
One way of answering this problem is:
- We need to select a group of volunteers or subjects (a sample that is representative of the entire population).
- We will measure their cholesterol level today.
- We ask them to consume 5–7 walnuts every day for 3 months (let’s assume that all of them cooperate and do this for the next 3 months).
- We measure their cholesterol levels after 3 months.
We can then compare the group’s average cholesterol level before and after, and draw our conclusions. This procedure seems reasonable, but it is actually a very flawed way of conducting medical experiments, or any kind of experiment where we are trying to test the effect of a treatment.
What’s wrong with the above approach?
Let’s say some members of the sample, in addition to consuming walnuts, also take up some physical exercise, and we know that physical exercise can have an effect on the cholesterol level. If their cholesterol level decreases at the end of 3 months, they will pull down the overall mean of the sample. Now, how do we know whether the cholesterol level of these subjects decreased because of the walnuts or because of the physical exercise? How do we separate out the effects of the two?
Let’s take the other extreme as well: say some people in this sample started smoking, which leads to an increase in their cholesterol level. If that happens, can we conclude that eating walnuts leads to the increase? Or, if the cholesterol level does not change (the increase due to smoking cancels the decrease due to walnuts), do we conclude that walnuts did not help in reducing the cholesterol level?
So, the problem here is that we are not able to isolate the effects of confounding variables.
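A toy simulation shows how misleading the before/after comparison can be. In this sketch we deliberately assume that walnuts have no effect at all and that exercise lowers cholesterol by 20 mg/dL; these effect sizes, the baseline levels, and the share of volunteers who start exercising are all made up and are not medical facts.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Made-up effect sizes in mg/dL, for illustration only.
WALNUT_EFFECT = 0.0      # assume walnuts do nothing at all
EXERCISE_EFFECT = -20.0  # assume exercise lowers cholesterol

before, after = [], []
for _ in range(200):
    baseline = rng.gauss(220, 15)   # hypothetical starting level
    exercises = rng.random() < 0.5  # about half take up exercise
    change = WALNUT_EFFECT + (EXERCISE_EFFECT if exercises else 0.0)
    before.append(baseline)
    after.append(baseline + change + rng.gauss(0, 3))  # measurement noise

# The naive before/after comparison from the procedure above.
observed_drop = mean(before) - mean(after)
print(f"average drop after 3 months: {observed_drop:.1f} mg/dL")
```

The group average drops by roughly 10 mg/dL even though walnuts contribute nothing in this simulation, so the naive comparison would wrongly credit the walnuts.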
The solution lies in something known as ‘Randomized Control Experiments’. Here is a brief overview of Randomized Control Experiments:
We have a control group and a treatment group. To the control group we give no treatment, or we give a placebo treatment (pills which do not have any medical properties). We select these two groups in such a way that there are similar people in both: both groups contain people who exercise, people who smoke, and old and young people. In practice, this balance is achieved by assigning the subjects to the two groups at random, which is where the name comes from. In a way, we have nullified the effect of these other factors, so if after 3 months we see that one group does significantly better than the other, then we know that the only difference between them was walnuts versus no walnuts; in every other respect the groups were similar. This is known as setting up a Randomized Control Experiment.
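The effect of random assignment can also be sketched in a simulation. One common way to obtain two similar groups is to shuffle the volunteers and split them in half, which is what the code below assumes. The effect sizes (including a walnut effect of -8 mg/dL), the shares of exercisers and smokers, and the helper names are all hypothetical.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Made-up effect sizes in mg/dL, for illustration only.
WALNUT_EFFECT = -8.0
EXERCISE_EFFECT = -20.0
SMOKING_EFFECT = 15.0

# 1,000 hypothetical volunteers with two confounding traits.
volunteers = [
    {"exercises": rng.random() < 0.3, "smokes": rng.random() < 0.2}
    for _ in range(1_000)
]

# Random assignment: shuffle, then split in half.
shuffled = volunteers[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:500], shuffled[500:]


def share(group, trait):
    """Fraction of people in the group who have the given trait."""
    return sum(person[trait] for person in group) / len(group)


def cholesterol_change(person, gets_walnuts):
    """Simulated 3-month change in cholesterol for one person."""
    change = rng.gauss(0, 3)  # person-to-person noise
    if person["exercises"]:
        change += EXERCISE_EFFECT
    if person["smokes"]:
        change += SMOKING_EFFECT
    if gets_walnuts:
        change += WALNUT_EFFECT
    return change


treatment_change = mean([cholesterol_change(p, True) for p in treatment])
control_change = mean([cholesterol_change(p, False) for p in control])

# Confounders are present in both groups, so they cancel in the difference.
estimated_effect = treatment_change - control_change
print(f"estimated walnut effect: {estimated_effect:.1f} mg/dL")
```

Because exercisers and smokers end up in both groups in roughly equal shares, the difference between the two group averages comes out close to the walnut effect that was built into the simulation.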
It’s good to have knowledge of the above for designing experiments to collect the data and draw inferences from the data. And here again, the knowledge of statistics plays a significant role.
In this article, we touched on why statistics concepts are important in Data Science, how to select a sample, and how to design an experiment.