In this article, we discuss one of the most important tasks in the data science pipeline: data visualization (data viz.). We’ll look at what data viz. means, why it is needed, and what the common pitfalls are. So, let’s get started:
As discussed in this article, the data science process essentially consists of 5 steps.
Data Visualization is about describing data.
Let’s say we have a dataset of 1000 rows and 40 columns. If we extract a smaller set of numbers from it, we are compressing it, and we can do such compression by computing summary statistics.
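As a rough sketch of this idea, the snippet below builds a made-up table of 1000 rows and 40 columns and compresses each column into five statistics. The data, the `summarize` helper, and the choice of statistics are assumptions for illustration only, not part of any particular dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the made-up data is reproducible

# Hypothetical stand-in for the dataset in the text: 1000 rows, 40 columns.
N_ROWS, N_COLS = 1000, 40
rows = [[random.gauss(50, 10) for _ in range(N_COLS)] for _ in range(N_ROWS)]


def summarize(column):
    """Compress one column of numbers into a handful of statistics."""
    return {
        "mean": statistics.mean(column),
        "median": statistics.median(column),
        "stdev": statistics.stdev(column),
        "min": min(column),
        "max": max(column),
    }


# zip(*rows) transposes the row-major table into columns.
summary = [summarize(col) for col in zip(*rows)]

print("original values:", N_ROWS * N_COLS)                # 40000
print("summary values: ", sum(len(s) for s in summary))   # 200
```

The 40,000 original numbers are described by 200 summary numbers, which is the compression the paragraph above refers to.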
The other way of describing data is to visualize it, that is, to take all the numbers and show them graphically.
In Data Visualization, we try to encode the data visually or graphically.
So, if we have a file with many rows and columns and perhaps many such files, then we want to encode all such data in charts like bar charts, line charts, and so on.
Here is an example of such an encoding:
The problem with the above encoding is that we can’t easily decode it. We are more interested in visualizations, encodings, and representations that can be decoded by humans.
Let’s see the different visual elements (ones that humans can perceive) that we can use to encode the data:
For example, we can use length to represent numbers: a longer bar represents a larger number and a shorter bar a smaller one. We can use slope to represent positive and negative numbers and their magnitude, we can represent different numbers using different colors, and so on.
These elements are combined to create visualizations in a way that reduces perceptual error. Some visual elements are hard for us to perceive accurately, and since the goal of a visualization is to make it easy to get insights from the graphic, we want to keep that error small.
Here is an example. We have 2 bars, A and B, and we are using length to encode numbers. If we ask which bar is longer, A or B, and consequently which variable is larger, it might take a bit of time to come to a conclusion.
What this means is that the effort needed to decode the information in this chart is quite large, and we want to reduce this effort.
One way is to draw boxes of the same size around A and B. Now things become very clear: we can easily see that the bar for B is longer, and variable B is therefore larger.
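The effect of the boxes can be imitated in plain text. The sketch below is only an illustration under assumed values: the numbers for A and B, the offsets, and the `bar` and `framed_bar` helpers are all made up. It prints two bars at different starting positions, first without a frame and then inside frames of equal width.

```python
def bar(value, scale=1):
    """Encode a number as length: one '#' per `scale` units."""
    return "#" * round(value / scale)


def framed_bar(value, frame_width, scale=1):
    """Draw the same bar inside a frame of fixed width."""
    filled = bar(value, scale)
    if len(filled) > frame_width:
        raise ValueError("value does not fit in the frame")
    # ljust pads with spaces so every frame has the same total width.
    return "|" + filled.ljust(frame_width) + "|"


values = {"A": 27, "B": 29}   # made-up numbers that are close together
offsets = {"A": 0, "B": 6}    # bars start at different positions

print("Without frames:")
for name, value in values.items():
    print(name, " " * offsets[name] + bar(value))

print("With frames of equal size:")
for name, value in values.items():
    print(name, " " * offsets[name] + framed_bar(value, frame_width=32))
```

With the frames, the reader can compare the small empty gap at the end of each frame (5 characters for A, 3 for B) instead of two long bars that do not line up, which is a much easier judgment to make.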
So, our goal in visualization is to make the decoding easier. Visualization is defined as encoding information graphically, but the art of it lies in making sure the decoding is efficient for humans.
Let’s see why we need visualizations:
- To discover insights from the data. Let’s say we have a dataset loaded as a pandas DataFrame with about 54k entries. We could scroll through all the entries, computing the mean, median, and so on, to understand what is happening, but that’s a long process.
Instead of that, consider plotting something like this:
This is a pair plot. It takes each pair of numerical columns and plots one against the other, which helps to quickly get an idea of the relationships between different columns.
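A minimal sketch of such a plot, assuming pandas, NumPy, and matplotlib are installed, could look like the following. It uses `pandas.plotting.scatter_matrix` on a small made-up DataFrame; the column names and values are hypothetical and are not the dataset described above.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
n = 500  # small made-up sample, far fewer rows than the ~54k in the text

# Hypothetical columns: 'price' depends on 'weight', 'noise' is unrelated.
weight = rng.normal(loc=1.0, scale=0.3, size=n)
df = pd.DataFrame({
    "weight": weight,
    "price": 4000 * weight + rng.normal(scale=300, size=n),
    "noise": rng.normal(size=n),
})

# One scatter plot per pair of numeric columns, histograms on the diagonal.
axes = scatter_matrix(df, alpha=0.4, figsize=(7, 7), diagonal="hist")
plt.savefig("pair_plot.png", dpi=100)
```

In the saved image, the weight/price panel shows a clear upward trend while the panels involving `noise` look like shapeless clouds, which is the kind of relationship a pair plot lets us spot at a glance.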
The second reason to study data visualization is that it helps in communicating the insights effectively.
As discussed in the CRISP-DM model, the goal at the end of the process is to communicate the insights (what we found through the data science process) to others. This is directly linked to the business objectives laid out (the double arrow between communicating results and business objectives) and helps to decide on next steps, so it’s essential that the insights are communicated effectively.
The audience may not know about data viz., different kinds of graphs, and so on, so the ability to still communicate insights to them is very important.
So the two main objectives of data viz. are data exploration, to discover insights from data, and communication of those insights.
Let’s look at some of the pitfalls in Data viz.:
Believing that data viz. is not that important: the notion is that statistics, including complex inferential statistics, is what drives ML models and so on, so people think of data viz. as a soft skill that is not that important. That is definitely not the case.
The next pitfall in data viz. is trying to pack as much information as we can into one image.
The idea of the viz. is to make insight derivation efficient and not to pack information.
Data viz. is not about impressing people with cool graphical skills; the purpose is to convey insights as efficiently as possible.
Another pitfall is believing that aesthetic choices like color don’t matter.
Although the color legend is specified, it is hard for someone to decode the chart because the colors do not follow any logical order, so it becomes very hard to understand what the scale is.
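One common remedy, sketched here as an assumption rather than as the fix for the chart above, is a sequential color scale: larger values get a darker shade of a single hue, so the order of the colors mirrors the order of the numbers. The `sequential_color` helper and the two end colors below are made up for illustration.

```python
def sequential_color(value, vmin, vmax,
                     light=(222, 235, 247), dark=(8, 48, 107)):
    """Map a number to a hex color on a light-to-dark single-hue ramp."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    # Position of the value within [vmin, vmax], clamped to 0..1.
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    # Linear interpolation between the light and dark end colors.
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(light, dark)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


for v in [0, 25, 50, 75, 100]:
    print(v, sequential_color(v, 0, 100))
```

Here 0 maps to the light end (`#deebf7`) and 100 to the dark end (`#08306b`), so a reader can rank values by shade without consulting the legend for every color.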
Data viz. is an important part of the data science pipeline that is generally underrated. Done in the right manner, it can greatly reduce the effort needed to understand the data, and it helps in communicating results effectively to the audience.