# Engineering Data Science Systems

We use the word “**system**” often in everyday life; for example, we say that we are doing something systematically. Let us try to understand formally what “**system**” means and, in particular, **what systems thinking is**.

When we say that we are doing something systematically, we mean that we are following a procedure: we look at the task in a structured way and then carry it out. That is our everyday understanding of the word “**system**”. Here is one potential definition:

As the diagram shows, even when we are building a small part of a large system, it is important to keep the larger context in mind; doing so is systems thinking. Let’s take an example. Suppose we are building a sorting algorithm that takes an array of numbers and sorts them in ascending order. Bubble sort is a very common way of doing this. However, the algorithm will run on hardware: processors have caches, which exist to make memory access efficient. It turns out that bubble sort, which sweeps over the whole array again and again, makes poor use of the cache once the array is larger than the cache, while another algorithm, quick-sort, is much faster when caching is taken into account. If we do systems thinking and ask what the broader context for this sorting algorithm is, we might realize that quick-sort is the algorithm to choose out of the two.
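To make the comparison concrete, here is a minimal sketch of the two algorithms in Python. Python cannot show cache behaviour directly, so the sketch counts comparisons as a rough stand-in for the work each algorithm does; the function names `bubble_sort` and `quick_sort` and the comparison counter are illustrative choices, not something prescribed by the discussion above.

```python
import random


def bubble_sort(values):
    """Return (sorted copy, number of comparisons) using bubble sort."""
    items = list(values)
    comparisons = 0
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        # Every pass sweeps the whole unsorted prefix again.
        for i in range(end):
            comparisons += 1
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break  # already sorted, no further passes needed
    return items, comparisons


def quick_sort(values, seed=0):
    """Return (sorted copy, number of comparisons) using in-place quick-sort."""
    items = list(values)
    rng = random.Random(seed)  # random pivots avoid the worst case on sorted input
    comparisons = 0

    def sort(lo, hi):
        nonlocal comparisons
        if lo >= hi:
            return
        # Move a random pivot to the end, then partition around it (Lomuto scheme).
        p = rng.randint(lo, hi)
        items[p], items[hi] = items[hi], items[p]
        pivot = items[hi]
        store = lo
        for i in range(lo, hi):
            comparisons += 1
            if items[i] < pivot:
                items[i], items[store] = items[store], items[i]
                store += 1
        items[store], items[hi] = items[hi], items[store]
        # Each recursive call works on a smaller, contiguous slice of the array.
        sort(lo, store - 1)
        sort(store + 1, hi)

    sort(0, len(items) - 1)
    return items, comparisons


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = [data_rng.randint(0, 10**6) for _ in range(1000)]
    _, bubble_count = bubble_sort(data)
    _, quick_count = quick_sort(data)
    print("bubble sort comparisons:", bubble_count)
    print("quick-sort comparisons: ", quick_count)
```

Both functions satisfy the same specification (a sorted array comes out), which is exactly why the choice between them has to come from the wider context: the size of the input and the hardware the code runs on.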

This is really the essence of systems thinking: find the bigger context, and then make the design choices for what we are building with that context in mind.

Let’s apply this to data science. If we are designing a system for data science, the first question is what the bigger, encompassing global view is. Clearly, data science involves data, but what else is there? How do we fill in the blanks, and which components go into engineering a bigger system in data science? Let’s answer these questions by understanding the system perspective of data science.

**System Perspective of Data Science**

Let’s see which roles need to come together for data science to work. The first is understanding the context that generates the data, in other words the domain knowledge. The next component is, of course, mathematical and statistical knowledge. The third piece is hacking skills. Under this broad term we include all the engineering skills, for example: how efficiently we store data, how we collect more data, how we move data efficiently between different regions, which format we choose, and how we make the modeling part faster. Data science sits in the middle of the Venn diagram below.

Typically, an organization has several job roles, not just the role of a Data Scientist. The picture below depicts the skill set typically required for a data science project. The bubbles intersect in different regions, and each intersection reveals a skill combination. The most obvious one is where all these skills come together, which is the bubble of Data Science.

There are other profiles as well:

**Data Analyst**: someone who understands the business context, can do statistical programming (say, in Excel or R), and can communicate the results very effectively. Data analysts are sometimes called consultants, and they play a very important role in the initial stages of a data science project.

**Research Engineer**: someone who is very good at programming and statistics and can also communicate results, but who does not necessarily know much about the business itself. Research engineers can accelerate innovation, for example by finding new models, finding different practices for storing data, or optimizing processes.

**Data Engineer**: someone who understands the business context and the programming environment. Data engineers play a very important role in ensuring that the IT system required for data science is managed effectively. For instance, a company might be getting data from different sources, and that data needs to be collected, stored, and cleaned up over time. Data engineers can play a significant role in that.

Data science is not one-dimensional. There is a broad context to it, and finding out what this context is and how its parts combine is an important step in completing the systems thinking for data science.
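One way to picture these intersections is to treat each role as a set of skills. The sketch below is only an illustrative reading of the descriptions above: the skill labels (`business`, `programming`, `statistics`, `communication`) and the helper names are assumptions made for this example, not a standard taxonomy.

```python
# Skill sets per role, as an illustrative reading of the role descriptions.
ROLE_SKILLS = {
    "Data Scientist": {"business", "programming", "statistics", "communication"},
    "Data Analyst": {"business", "statistics", "communication"},
    "Research Engineer": {"programming", "statistics", "communication"},
    "Data Engineer": {"business", "programming"},
}


def roles_covered_by(skills):
    """Return the roles whose required skills are all present in `skills`."""
    available = set(skills)
    return sorted(role for role, needed in ROLE_SKILLS.items() if needed <= available)


def missing_skills(skills, role):
    """Return the skills still needed to cover `role`."""
    return ROLE_SKILLS[role] - set(skills)


if __name__ == "__main__":
    mine = {"business", "programming"}
    print(roles_covered_by(mine))
    print(missing_skills(mine, "Data Scientist"))
```

Seen this way, the Data Scientist bubble is simply the region where all the skill sets overlap, and every other role is a partial combination of the same skills.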

**Engineering Systems for Data Science**

So far, we have seen what it means to do systems thinking for data science, and we have looked at a few roles that are common in data science, especially in practical industry settings. Let us now understand more about the components that form the engineering systems for data science.

Engineering Systems for Data Science involves two components:

- Process
- Programming

It is not enough to know a programming language and how to use it; we also need to know what the broader process is and how what we are doing fits into it. If we zoom into the process itself, there are typically two components:

- The process is a flow of steps: we do step 1, then step 2, and if something doesn’t go well we come back to step 1.
- It involves Agile improvement, where the idea is to build small things and keep improving them in loops.

These two ideas are the key to understanding the process component of the engineering systems for data science.

Let’s see an example process that one can follow for data science. This process is called **CRISP-DM** (Cross-Industry Standard Process for Data Mining). Let’s see what the steps are (the first part of the process is the flow of steps):

Data: Clearly, data science requires data, and data remains at the heart of this process. After that, the first important aspect is business understanding.

1. **Business Understanding**: One needs to understand the context one is working in and be able to specify the business problems one is trying to solve.

2. **Data Understanding**: Here one tries to understand what data we have, what it represents, whether it can meet our business objective, and so on. As depicted in the image below, the interaction between these two steps is shown by a bi-directional arrow.

It is at this stage that a business person says, “I have this problem to solve; how do we do it?”, and a data engineer responds by pointing to the data that is available and trying to justify that it can solve the problem. This exchange involves a lot of back and forth: the question gets refined and improved upon, and new data is searched for. So these two steps are iterative in themselves.

3. **Data Preparation**: At this step, the data we have is prepared so that it is ready for statistical analysis afterward.

4. **Data Modeling**: This is where the core statistical ideas come into play: how we model the data, how we analyze whether our hypothesis is correct, and so on. Here too there is a bi-directional arrow, between step 3 and step 4. Typically a model is tried out and we might find that it is not enough; maybe it needs more data, or a different kind of data, and we need to go back to data preparation.

5. **Evaluation**: After the modeling is done, we need to evaluate the model. There are multiple criteria, only one of which is the accuracy of the model. Depending on how the evaluation goes, we can either deploy the system for real use by the users (**step 6. Deployment**) or go back to the drawing board, so to speak, and ask questions such as “Should I look at different business problems?” or “Should I look at different ways of solving those problems?”.

So the idea is that we start with business understanding, move to data understanding, prepare the data, model the data (as the problem at hand requires), evaluate the model, and, if all goes well, deploy it. At times this entire process might not work, and we go back to the drawing board. It is important to see that the entire process is iterative.
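The flow described above, including the steps back, can be sketched as a small state machine. This is only one possible reading of the arrows described in the text: the names `Phase`, `TRANSITIONS`, `run_project`, and `toy_policy` are hypothetical and exist only for this illustration.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Allowed moves between phases, following the arrows described in the text.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def run_project(decide_next, max_steps=50):
    """Walk the phases, letting `decide_next(phase, history)` choose each move.

    Returns the list of phases visited, in order.
    """
    phase = Phase.BUSINESS_UNDERSTANDING
    history = [phase]
    for _ in range(max_steps):
        if phase is Phase.DEPLOYMENT:
            break
        nxt = decide_next(phase, history)
        if nxt not in TRANSITIONS[phase]:
            raise ValueError(f"cannot move from {phase.name} to {nxt.name}")
        phase = nxt
        history.append(phase)
    return history


def toy_policy(phase, history):
    """Toy decision rule: the first evaluation fails, the second one passes."""
    if phase is Phase.EVALUATION:
        if history.count(Phase.EVALUATION) >= 2:
            return Phase.DEPLOYMENT
        return Phase.BUSINESS_UNDERSTANDING  # back to the drawing board
    # In every other phase, simply move forward to the next step.
    return Phase(phase.value + 1)


if __name__ == "__main__":
    for visited in run_project(toy_policy):
        print(visited.name)
```

In a real project the decision at each phase comes from people and from evaluation results rather than from a fixed rule; the point of the sketch is only that the process is a loop with well-defined ways back, not a straight line.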

In this article, we discussed what systems thinking is and the components involved in the engineering systems for data science. We then looked at the steps of a standard process model for data science problems, CRISP-DM.

References: PadhAI