## Modelling Data

In the previous article, we discussed what Data Science is and the steps involved in a typical data science pipeline, and then zoomed into four of the five tasks in that pipeline. In this article, we discuss the fifth and most important task: modeling the data. So, let’s get started.

## Statistical Modelling

So far we have talked about collecting, storing, processing and describing data. Now comes the most important aspect of a data scientist’s job: modeling data, or coming up with a data model. To illustrate this, let’s say we are dealing with patient data where we have different readings for each patient, such as blood sugar level, cholesterol level and so on. Let’s say the blood sugar level is stored in a column named ‘x1’ and we have sorted the data in ascending order.

This data is clustered around a value close to 117, with a few much smaller values and a few much larger ones. In other words, most of the values lie close to the mean (let’s say the mean is somewhere around 117).

It turns out that for many physical quantities or biological readings, if we take a large enough sample and note down the readings, they tend to follow a normal distribution (the familiar bell-shaped curve). A normal distribution means most entries are around the mean, with a symmetric spread on both sides of it. So, one model of this data is that it follows a normal distribution.
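As a rough illustration of what "most entries are around the mean" looks like in numbers, here is a minimal sketch using only Python's standard library. The mean of 117 comes from the example above; the standard deviation of 15 and the sample size are made-up values, and the readings are simulated rather than real patient data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated blood sugar readings: mean 117 (from the text), spread 15 (assumed).
readings = sorted(random.gauss(117, 15) for _ in range(10_000))

mean = statistics.fmean(readings)
std = statistics.stdev(readings)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the mean and roughly 95% within two.
within_1 = sum(abs(x - mean) <= std for x in readings) / len(readings)
within_2 = sum(abs(x - mean) <= 2 * std for x in readings) / len(readings)

print(f"mean={mean:.1f}, std={std:.1f}")
print(f"within 1 std: {within_1:.1%}, within 2 std: {within_2:.1%}")
```

If real readings gave fractions far from these, that would be a hint that the normal model is a poor description of the data.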

So, coming up with such models of the data and making inferences on top of those models is a very important part of a Data Scientist’s job.

Let’s see why it is important to understand this concept with the help of an example:

Say there is a new drug in the market for a certain medical condition, and doctors are interested in knowing how effective it is.

We take a sample of patients, administer the drug to them, and want to know whether it is effective. This is where a Data Scientist steps in. Before giving the new drug, we note a biological parameter (say blood sugar level) for every patient. The Data Scientist could analyze the data and say that it follows a normal distribution, which means we can completely describe it by its mean and variance.

We administer the new drug for, say, 3 months and note down the readings again. Say they again follow a normal distribution, and now we can see that the mean has shifted towards the lower side.

Earlier, the mean was ‘150’ and now it is ‘130’, so one might conclude that the drug was effective. This is one way of concluding things, but it is the wrong way and not the statistical way. Here’s why:

The conclusion is drawn from a given sample of patients, and this sample does not cover all the patients who suffer from this medical condition. It could have happened, just by random chance, that there were 5–10 patients in the sample for whom the drug led to a drastic reduction of, say, 30–40 points, while for most of the patients the decrease was only 3–4 points and for some patients the level may even have increased, yet on average the mean still moved lower.

So, without knowing about the distribution of the data or making these assumptions, we cannot robustly argue that this drug is very effective, because the shift could be due to randomness in the small sample of patients we are looking at.

A Data Scientist should be able to make robust statements of the form that he/she is 99% sure that the drug is effective. Here 99% does not mean what it does in everyday life, as in “I’m 99% sure that this is going to happen”. We want a statistical guarantee, which means we need to quantify the uncertainty and state that the result holds with 99% confidence. How we do that will be covered in subsequent articles. The point to note is that we want to do a robust analysis of whatever data we have: we come up with very simple models and, based on them, make robust arguments about the data.
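To give a feel for what quantifying this can look like, here is a minimal sketch of one possible approach, a paired sign-flip permutation test. It is not necessarily the method the later articles use. The patient data is simulated, and the sample size, the spread, and the helper name `sign_flip_p_value` are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Simulated data (assumption): 40 patients, blood sugar before and after treatment.
before = [random.gauss(150, 12) for _ in range(40)]
after = [b - random.gauss(20, 10) for b in before]  # average drop of about 20 points

# Per-patient change; a negative value means the reading went down.
diffs = [a - b for a, b in zip(after, before)]
observed = statistics.fmean(diffs)

def sign_flip_p_value(diffs, observed, n_rounds=10_000):
    """One-sided p-value: if the drug had no effect, each change would be as
    likely positive as negative, so we flip signs at random and count how
    often the mean change is at least as low as the observed one."""
    hits = 0
    for _ in range(n_rounds):
        flipped = [d if random.random() < 0.5 else -d for d in diffs]
        if statistics.fmean(flipped) <= observed:
            hits += 1
    return (hits + 1) / (n_rounds + 1)

p_value = sign_flip_p_value(diffs, observed)
print(f"mean change = {observed:.1f}, p-value = {p_value:.4f}")
```

A small p-value (say below 0.01) means a drop this large would be very unlikely if the drug had no effect. That is the kind of quantified statement we are after, as opposed to simply comparing two means.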

In some situations, we might also be interested in knowing the underlying relationships in the data. Let’s say the new drug has been administered and the doctor is interested in the relationship between blood sugar level and the number of days of treatment.

Here again, the Data Scientist steps in and, let’s say, provides a linear model of the relationship between the blood sugar level and the number of days of treatment. In this scenario, the blood sugar level decreases as the number of days of treatment increases.

This is a very simple model, and that is the cornerstone of statistical modeling. We come up with very simple models to explain the relationships between variables in the data. The reason for proposing a simple model is that we can then do very robust statistical analysis on it; how we do that is covered in another article. A Data Scientist should be able to make the following type of statements based on statistical analysis:

The above-mentioned statement is a very robust statement, and a statement like this can help a doctor clearly decide whether this drug should be used or not. This kind of robust analysis is only possible when we have very simple models. The model used in the above scenario is:

y = mx + c

where ‘x’ is the number of days of treatment

and ‘y’ is the blood sugar level

A model is a formal mathematical way of writing the relationship between variables.
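To make the linear model concrete, here is a minimal sketch of estimating the slope ‘m’ and the intercept ‘c’ with ordinary least squares. The readings are hypothetical values invented for this example, and `fit_line` is an illustrative helper name, not something from a particular library.

```python
# Hypothetical readings: day of treatment (x) -> blood sugar level (y).
days = [0, 10, 20, 30, 40, 50, 60]
sugar = [150, 147, 143, 141, 137, 134, 130]

def fit_line(xs, ys):
    """Ordinary least squares estimates of slope m and intercept c in y = mx + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: chosen so the line passes through the point of means.
    c = y_mean - m * x_mean
    return m, c

m, c = fit_line(days, sugar)
print(f"y = {m:.3f} * x + {c:.2f}")
```

A negative slope here corresponds to the blood sugar level going down as the number of days of treatment goes up.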

The hypothesis in the above case was that the drug is effective. A more formal hypothesis would be that the average blood sugar level after the drug was lower than that before the drug.

Then we have goodness-of-fit tests. Once we have come up with a model, we are interested in knowing how well it fits the data and how to quantify that; this is what goodness-of-fit tests cover. All of these are statistically driven and will be discussed in subsequent articles.
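As a small preview, one common way to put a number on how well a line fits is the coefficient of determination, R². This is a measure of fit rather than a formal test, and it is not necessarily what the later articles use. In the sketch below the readings are hypothetical, and the slope and intercept are assumed values, roughly what a least-squares fit to these readings gives.

```python
# Hypothetical readings: day of treatment (x) -> blood sugar level (y).
days = [0, 10, 20, 30, 40, 50, 60]
sugar = [150, 147, 143, 141, 137, 134, 130]

def r_squared(xs, ys, m, c):
    """Fraction of the variation in y explained by the line y = m*x + c."""
    y_mean = sum(ys) / len(ys)
    # Variation left over after using the line to predict y.
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its own mean.
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Assumed slope and intercept for the line being evaluated.
r2 = r_squared(days, sugar, m=-0.33, c=150.14)
print(f"R^2 = {r2:.4f}")
```

A value close to 1 means the line explains almost all of the variation in the readings; a value close to 0 means it explains almost none.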

The other way of modeling the data is Algorithmic Modeling which is discussed below.

## Algorithmic Modelling or Machine Learning

In statistical modeling, we made very simple assumptions and came up with very simplified models of the data.

The advantage of doing that is that we can give statistical guarantees, which are useful in sensitive fields like Agriculture, Healthcare, Finance and so on. But because we wanted these statistical guarantees, we were limited in the models we could use. We cannot use very complex models, because the moment we do that, it becomes very hard to do this kind of statistical analysis on the model.

It’s possible in many real-world cases that the relationship is much more complex than a linear model, or that it depends on many factors which we are not considering in simple models. In such cases, we take an alternative approach, which is to build complex models, and that’s where we enter the domain of Machine Learning.

In the previous example, we were interested in knowing what happens to the blood sugar level after 30 days, or more generally how it varies with the number of days of treatment. Now we cannot rely on just one input ‘x’, the number of days.

We know that the blood sugar level depends not only on the number of days; it could also depend on the person’s age, weight, height, current blood pressure, family history, other diseases the person has been diagnosed with and so on. There are a lot of such inputs on which this output could depend. We can say that there is some relation between the inputs and the output, but we don’t know what that relationship is, and we are interested in coming up with a model for it. It should be clear by now that this cannot be a very simple model like a linear relationship; the actual relationship, in this case, would be very complex.

Machine Learning allows a large family of complex functions, and using it we can model relationships with those functions. The goal of Machine Learning is to learn the function ‘f’ using the data and the optimization techniques that we have. Once we have the function, we can plug in a new input value ‘x’ and get the output ‘y’. So, the focus now is on prediction. We are not interested in answers to questions like “How much does the blood sugar level depend on weight or age?”; we care that the final prediction is very close to the true answer.

Let’s say we are trying to predict a stock price, and the stock price depends on various factors: the number of seats that the ruling party has in the parliament, the number of days the US President spent with the Indian PM, the amount of foreign investment happening in India and so on. At the end of all of this, we are interested in knowing whether the stock is going up or not. We don’t care how these factors affect the stock price; all we care about is the final answer.

This is different from the situation in the medical domain, where we care about the reason: why is the blood sugar level decreasing, is it because of the weight of the person or the height of the person, and what are the underlying relations in the data? So, in applications where knowing the underlying relationships between different attributes is important, statistical models are at the foreground. In applications where we don’t really care why certain things are happening and only care about the final prediction, we take the algorithmic or machine learning approach.

Here is a quick summary of Statistical Models vs. Algorithmic Models:

1. In statistical modeling, we use very simple, intuitive models. In algorithmic modeling, we do not restrict ourselves to simple models; the model can be as complex as needed to capture the real relationship between the input and the output variable (or the predictor and the response variable).
2. Statistical modeling is more suited to low-dimensional data. Algorithmic modeling is typically used with high-dimensional data. For example, an image of size ‘300 X 300’ would have 90000 columns, and we want to make a decision based on that. There we can’t do this robust statistical analysis, so we rely on machine learning or algorithmic modeling.
3. In statistical modeling, the focus is on interpretability. We really want to know by how much the blood sugar level increases or decreases based on some other factor, such as the cholesterol level, the number of days the drug is given or the quantity of drug that was given. In algorithmic modeling, the focus is on prediction. We don’t really care about the underlying relations between different attributes; we just pass all the attributes to the model and ask whether the blood sugar level increases or decreases.
4. Statistical models are data-lean models. We don’t need large amounts of data to train them because they are simple; for a linear model we just need to estimate the slope and the intercept. Algorithmic models are data-hungry models. Because we are dealing with high-dimensional data, they need large amounts of data to learn the relations between the input and the output.

Typically, when we are doing Statistical Modeling we have simple models like Linear Regression, Logistic Regression and Linear Discriminant Analysis. These models also come under the Algorithmic Modeling bucket, but Algorithmic Modeling includes other models as well.

Linear Regression, Logistic Regression and Linear Discriminant Analysis are statistical models, and hence they are very amenable to robust statistical analysis. Models like Decision Trees, K-NNs, etc. are not simple models, but they also have the form

y = f(x)

and the crux of machine learning is to come up with different functions and then estimate these functions using data. We typically use complex models because the relationships between the different attributes are usually much more complex. Multilayered neural networks are useful when we have large amounts of high-dimensional data and we want to learn very complex relationships between the input and the output.
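To illustrate a model of the form y = f(x) that is not a simple formula, here is a minimal sketch of a K-NN predictor with several inputs. The patient records are hypothetical values invented for this example, and `knn_predict` is an illustrative helper rather than a library function. The features are left unscaled to keep the sketch short; in practice they would usually be put on comparable scales first.

```python
import math

# Hypothetical records: (days of treatment, age, weight in kg) -> blood sugar level.
train_x = [(10, 45, 70), (20, 50, 82), (30, 38, 65),
           (40, 60, 90), (50, 55, 77), (60, 42, 68)]
train_y = [148, 146, 138, 142, 135, 129]

def knn_predict(x, train_x, train_y, k=3):
    """Predict y = f(x) as the average output of the k closest training points."""
    # Sort the training records by Euclidean distance to the new input.
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], x),
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

new_patient = (35, 40, 66)
print(knn_predict(new_patient, train_x, train_y))
```

Here ‘f’ is never written down as an equation. The prediction comes directly from the stored data, which is why such a model can follow complex relationships but is harder to interpret than a slope and an intercept.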

Very simple models are typically used with narrow or low-dimensional data, models like KNNs and Decision Trees are used with medium-dimensional data, and Neural Networks really become effective when we have large amounts of high-dimensional data.

Here is one example illustrating the scale of data involved in Deep Learning:

Suppose we get an image as the input. An image is high-dimensional data: it would be, say, ‘256 X 256’ or ‘300 X 300’ pixels, which means we are looking at the order of ‘90000’ numbers/columns. Say we also have a lot of past cases/records of such data. So we have very high-dimensional data and lots of data entries, and this is a classic example where Deep Learning can really make a difference.
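The arithmetic behind the ‘90000’ figure can be checked with a short sketch. The image here is a hypothetical blank grayscale image stored as a list of pixel rows; a real one would be loaded from a file.

```python
# A hypothetical 300 x 300 grayscale image, stored as a list of pixel rows.
height, width = 300, 300
image = [[0 for _ in range(width)] for _ in range(height)]

# Flatten row by row: each pixel becomes one input column for the model.
features = [pixel for row in image for pixel in row]

print(len(features))  # 90000
```

A single record therefore already has 90000 inputs, compared with the one input (number of days) in the linear model earlier.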

For modeling data, which is the crux of the Data Scientist’s job, the following skills are required:

## Conclusion

In this article, we zoomed into the fifth and most important task in the Data Science pipeline: modeling data. We looked in detail at the statistical and the algorithmic ways of modeling data, the objectives behind each, how the two differ, and which one is more suitable based on the dimensionality and the amount of data.