In the previous article, we discussed how to apply systems thinking to data science and how to engineer data science systems. We then looked at a high-level overview of the components involved in the CRISP-DM process.
In this article, let’s zoom into each of the components involved in the CRISP-DM model.
1. Business Understanding
The first question really is “What are the business objectives?” What are we trying to optimize: revenue, customer satisfaction, the performance of my team, the health of my patients, and so on? These objectives need to be clearly established. Though this might seem trivial, it is often not captured in a systematic way, which leads to a lot of confusion later on.
The next question is “Can Data Science actually achieve these objectives?” There is a lot of hype around Data Science and AI, and some people believe Data Science is a solution for every problem. It turns out that in some situations, Data Science can do nothing more than reveal existing problems in the organization that need to be solved elsewhere. So, it is very important to ask, “Is Data Science the right way to address these objectives?”
The next question is “How do we define success metrics?” A Data Science project can take several months, so it is very important to define at the outset what it means to be successful. Only then can these metrics be made robust, and all the choices around modeling, data collection, and so on can be made with the aim of improving them.
The next question is “Are there ethical considerations in data usage?” Data is available in plenty, but some data carries privacy considerations and some data cannot legally be used for certain projects; all of these need to be carefully listed out.
The next step is to look outside and see what others have already done in this area. Data Science is becoming explosively popular across domains, so it pays to check whether existing work exists, perhaps in the research literature, or in the business consulting world, where a competitor or another industry altogether has applied data to solve such problems. In research terms, this is establishing the state of the art (SOTA), which is very important to do before starting a project.
These five steps are done iteratively within the business understanding phase; this is where Agile development comes in. Typically, a project starts with point 1, identifying the objectives, and then we slowly refine them: how to achieve them, whether Data Science is the right tool, how to define success metrics. These are refinements of the initial problem itself.
2. Data Understanding
The first question here is “What are the sources of the data?” Is the data available within the company or not? If it is within the company, which department has it? If it is outside the company, do we need to buy it from somebody, or is it available in the public domain? All of this goes into the first question, the sources of the data.
The next question is “Do we need to collect new data?” Collecting new data is often expensive and slow. In the medical domain, for example, when new drugs are tried out, pharma companies spend a lot of money collecting data, because what happens to a patient when a new drug is administered is very crucial information.
The third question is “What is the quantity and quality of the data available?” After collecting the data, we need to recognize that not all data is clean data. We need to do a quick check of its quality; data sometimes contains errors, and all of these need to be identified at this step.
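A quick quality check like this can be sketched in a few lines of pandas. The dataset below is a hypothetical toy example, invented purely for illustration; the column names and the kinds of errors shown (a missing value, an implausible age) are assumptions, not data from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with typical quality problems:
# a missing value in each column and one clearly impossible age.
df = pd.DataFrame({
    "age":            [34, 52, np.nan, 47, 250],   # 250 is clearly an error
    "blood_pressure": [120, np.nan, 135, 110, 128],
})

# Quick quality report: missing values per column and basic summary statistics.
missing = df.isna().sum()
print(missing)
print(df.describe())

# Flag rows with implausible ages so they can be reviewed, not silently dropped.
suspect = df[(df["age"] < 0) | (df["age"] > 120)]
print(suspect)
```

Even a check this simple surfaces the two most common problems, missing values and out-of-range entries, before any modeling begins.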
After this, the next question is “How do we represent different data items?” We might have a big table with, say, 30–40 columns, and it is often not clear what a particular column represents or what units its data is in. It is very important to establish this clearly; errors are made even in interpreting columns of data wrongly.
The final question is “Which data is relevant for the business objective?” Among the large set of data that we have, we need to ask which data sources or subsets are relevant for the objectives, and select accordingly.
3. Data Preparation
We need to prepare the data so that we can pass it to a statistician.
The first question here is “In what different formats is the data available?” Different organizations have different ways of storing data, so we first need to understand the formats in which it is available. This is often a major challenge, because not only different organizations but even different departments within an organization might store data in very different formats, and one needs to identify how to bring them together.
The next question to ask is “Is there a need for annotating data?” Annotating data means asking a human to go through the data and add labels to it; such annotation becomes very important for machine learning. It is typically a very expensive process that needs to be carefully designed around the objectives of the Data Science project.
The next question is “How can the data be extracted, transformed, and loaded (ETL)?” Data is available in different formats, so we need to write scripts to extract it, transform it to some standard form, and load it onto the machine on which we want to do the analyses. All three steps are necessary so that a Data Scientist can come in and start processing the data directly. Again, a lot of hacking and programming skills are required for this.
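A minimal sketch of such an ETL script is shown below. The two “departments” and their formats (one CSV source, one JSON source) and the target schema are hypothetical, chosen only to illustrate the extract, transform, and load steps; in-memory strings stand in for real files.

```python
import csv
import io
import json
import sqlite3

# --- Extract: two departments store the same data in different formats. ---
csv_source = io.StringIO("customer_id,amount\n1,100.5\n2,200.0\n")
json_source = io.StringIO('[{"customer_id": 3, "amount": 50.25}]')

rows = [{"customer_id": int(r["customer_id"]), "amount": float(r["amount"])}
        for r in csv.DictReader(csv_source)]
rows += json.load(json_source)

# --- Transform: bring everything to one standard schema (amounts as integer cents). ---
records = [(r["customer_id"], round(r["amount"] * 100)) for r in rows]

# --- Load: into a database that analysts can query directly. ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", records)
total = conn.execute("SELECT SUM(amount_cents) FROM transactions").fetchone()[0]
```

The key point is the standard schema in the middle: once both sources are mapped onto it, downstream analysis no longer cares where each row came from.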
The next question is “How do we standardize or normalize the data?” Standardizing means shifting and scaling the data so that it has zero mean and unit variance, while normalizing means rescaling the data to lie between 0 and 1, with the minimum and maximum mapped to 0 and 1.
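Both transformations are one-liners in NumPy; the array of values below is an arbitrary example.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

# Standardize: subtract the mean and divide by the standard deviation,
# so the result has zero mean and unit variance.
standardized = (x - x.mean()) / x.std()

# Normalize (min-max): map the minimum to 0 and the maximum to 1.
normalized = (x - x.min()) / (x.max() - x.min())
```

One caveat worth remembering: the mean, standard deviation, minimum, and maximum should be computed on the training data only and then reused on new data, otherwise information leaks from the test set into the model.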
The next question is “How do we store data efficiently for analysis?” Typically we have a large amount of data, and it may not fit into a single computer’s RAM, so we need mechanisms by which the data can be stored and streamed efficiently so that analysis can still be carried out on it.
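One common pattern for data that does not fit in RAM is to stream it in chunks and keep only running aggregates. The sketch below uses pandas’ `chunksize` option; a small in-memory string stands in for a file that would, in practice, be many gigabytes.

```python
import io

import pandas as pd

# A small in-memory sample stands in for a CSV too large to fit in RAM.
big_file = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

# Stream the file 100 rows at a time and keep only running aggregates,
# so memory use stays constant regardless of file size.
total, count = 0, 0
for chunk in pd.read_csv(big_file, chunksize=100):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
```

For repeated analysis, columnar on-disk formats or a database are usually better than re-parsing raw text, but the chunked pattern above is the simplest starting point.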
4. Data Modeling
Once we have the data all ready and prepared, a statistician can come in and do the data modeling itself. Even this can be broken down into multiple steps.
The first question here is “What assumptions are to be made for modeling?” Sometimes one assumes that the data is linear, or that two variables are independent. These assumptions need to be made explicit and specified, so that they can be compared with existing domain knowledge and then accepted.
The next step is to decide between statistical and algorithmic modeling. Statistical modeling is when we use a simple set of models but give very concrete statistical evidence for what we find, whereas algorithmic modeling is where one throws a set of more complex models at the problem and expects the machine to optimize them using algorithms. Which of the two makes sense depends on the business context.
The third step is an important one. At this point, someone has given some data to a data scientist who is expected to model it, but it is important to check whether the amount of clean data available is actually sufficient for modeling. It is perfectly all right at this point to conclude that we don’t have sufficient data and go back to data preparation, or maybe even further back.
Once we have sufficient data, the next question to ask is “Is the compute budget that we have sufficient for modeling?” Do we have enough compute resources within our organization to actually train and test these models? Sometimes at this point, organizations buy GPUs or cloud resources to ensure that there is sufficient computing power.
Once we have sufficient data and sufficient compute power and the modeling is complete, the question to ask is “Are the results statistically significant?” Before we go back to the business, it is important that we can argue mathematically that the model we have learned is actually statistically significant.
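As a concrete illustration of one such argument, consider a binary classifier evaluated on a held-out set. The numbers below (62 correct out of 100, against a 50% guessing baseline) are invented for the example; the exact one-sided binomial test is just one of several tests one might apply.

```python
from math import comb

# Suppose a classifier gets 62 of 100 held-out examples right, and the
# naive baseline (random guessing between two classes) would get 50%.
k, n, p0 = 62, 100, 0.5

# One-sided exact binomial test: the probability of seeing 62 or more
# correct answers if the model were no better than the baseline.
p_value = sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

significant = p_value < 0.05
```

Here the p-value comes out around 0.01, so 62/100 would be unlikely under pure guessing; with only, say, 55/100, the same test would not reach significance, which is exactly the kind of caveat worth reporting back to the business.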
5. Evaluation
Once the data scientist has finished working on the model and shown that it is statistically significant, it is passed on to evaluation.
The first thing we do in evaluation is to test the model against test data. It is common practice to hold out some part of the data as a test dataset that the modeling step never sees, and on that dataset we compute metrics such as accuracy.
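The hold-out idea can be sketched end to end in a few lines. Everything below is a toy: the synthetic one-feature dataset, the 75/25 split, and the deliberately trivial “model” (a threshold at the training mean) exist only to show that the test rows play no part in fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: one feature; the true label is 1 when the feature is above 0.
X = rng.normal(size=200)
y = (X > 0).astype(int)

# Hold out 25% of the data; the modeling step never sees it.
split = int(0.75 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Deliberately simple "model": predict 1 whenever the feature
# exceeds the training mean (fitted on training data only).
threshold = X_train.mean()
predictions = (X_test > threshold).astype(int)

# Accuracy measured only on the held-out rows.
accuracy = (predictions == y_test).mean()
```

In real projects one would shuffle before splitting (or use cross-validation) and pick metrics suited to the problem, but the discipline is the same: fit on the training portion, score on data the model has never seen.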
After this, we check that the business objectives we set out at the very beginning are actually met by the model we have. These two checks may not always coincide: the business objectives may be more abstract than the test metrics, and we would like to ensure that they too are being met.
The third question is “Does the model meet performance requirements?” By performance we mean that the model must be deployed in some context, and it must meet the requirements of that context.
The fourth question is “Is the model unbiased and robust?” It is entirely possible that the model has unintentional biases picked up from the dataset. For instance, suppose you are a bank using Data Science to decide whether or not to give a loan to a particular customer. The model could have unintentional biases around the gender or the educational qualification of the person, and so on. It is important to check whether such biases exist and to find ways to remove them. Related to this is robustness: the model is typically tested on a particular dataset, but what happens when there are small changes in the data? For instance, if a new drug has been tested on the age group of 40 to 60, what happens if we give the drug to a 38-year-old? This is the issue of robustness: does the model hold up under slight variations in the dataset? Again, this needs to be carefully tested.
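One simple first check for bias is to compare the model’s decisions across groups defined by a sensitive attribute. The loan decisions below are fabricated for illustration, and a gap in approval rates is only a signal to investigate, not proof of bias on its own.

```python
import pandas as pd

# Hypothetical loan decisions produced by a model, with a sensitive attribute.
decisions = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M", "F", "M"],
    "approved": [0,   1,   0,   1,   1,   1,   0,   0],
})

# Approval rate per group; a large gap between groups is a red flag
# that warrants a closer look at the data and the model.
rates = decisions.groupby("gender")["approved"].mean()
gap = rates.max() - rates.min()
```

In this toy data the approval rate is 25% for one group and 75% for the other, a gap that would certainly warrant investigation; more rigorous fairness audits also control for legitimate explanatory variables before drawing conclusions.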
The final step is to find ways to improve the model. At the evaluation stage we take stock of what we have done so far, and it is important to ask whether we can do things slightly differently to get better accuracy, meet the objectives better, and so on.
6. Deployment
This is the phase in which we have decided that our model meets all the requirements and is ready to be put to use in real life.
The first question here is “Where is the model to be deployed?” Is it going to run on my organization’s server, in the cloud, on users’ phones, or on an embedded device such as a drone? This needs to be understood first, to give context to the remaining steps.
The next question is “What is the hardware and software stack for deployment?” Depending on where the model is to be deployed, we might have very different constraints: something like a drone might support only C++, and the code needs to be written very efficiently, whereas something on the cloud, say on AWS, can be written in any language we want. So, finding out what this stack is becomes very important for deployment.
The next question is “Does it meet performance requirements?” On a drone, we may want models that use the battery sparingly, because the drone has limited energy available, whereas on the cloud we can afford lots of complex code. On the other hand, if you are a trading company, you need to ensure that the latency of the model is very low, because with large latency the market can change and you may miss the opportunity.
The fourth question is “Does it violate any privacy requirements?” Depending on where you deploy a model, you need to ensure that the privacy of the users of the model is not violated.
The fifth important question is “Does it meet users’ requirements?” The entire Data Science exercise has produced a final product that the user is using, so it is very important to run a user study: ask users whether they like it, and whether it actually improves their experience or their performance on a particular task.
We need to run all of this iteratively. The first point here is the notion of an MVP, a Minimum Viable Product: instead of trying to build a very complex thing, build something simple and then iterate, such that at each stage the simple thing you have built is a complete product. Data Science is an iterative process, and following the MVP method ensures that at least a simple solution is deployed.
The second thing is about revising expectations of success and of the value of Data Science. There is a lot of miscommunication about what Data Science can and cannot do, and these cycles must be used to level-set, so that the entire company has a clear understanding of what Data Science can actually deliver.
The third step is to understand how to upgrade existing human and hardware resources.
In this article, we zoomed into the components involved in the CRISP-DM model to understand each step better.