In this article, we discuss one of the most important tasks in the data science pipeline: data visualization (data viz.). We’ll cover what data viz. means, why it is needed, and what the common pitfalls are. So, let’s get started.

As discussed in a previous article, the data science process consists of essentially five steps:

Image for post

Data Visualization is about describing data.

Image for post

Let’s say we have a dataset of 1000 rows and 40 columns. If we extract a small set of numbers from it, we are compressing it, and one way to do such compression is to compute summary statistics.
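As a rough illustration of this kind of compression, here is a minimal sketch using only Python's standard library. The data is synthetic and the column names (`col_0`, `col_1`, ...) are made up for the example; the choice of mean, median, and standard deviation is also just an assumption about which statistics one might compute.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N_ROWS, N_COLS = 1000, 40

# Synthetic stand-in for the dataset: 40 columns of 1000 random values each.
columns = {
    f"col_{i}": [random.gauss(50, 10) for _ in range(N_ROWS)]
    for i in range(N_COLS)
}

# Compress each column of 1000 numbers down to three summary statistics.
summary = {
    name: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }
    for name, values in columns.items()
}

raw_count = N_ROWS * N_COLS
summary_count = sum(len(stats) for stats in summary.values())
print(f"{raw_count} raw numbers described by {summary_count} statistics")
```

The 40,000 raw values are reduced to 120 numbers, which is easier to read but necessarily loses detail.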

The other way of describing data is to visualize it: take all the numbers and show them graphically.

In Data Visualization, we try to encode the data visually or graphically.

So, if we have a file with many rows and columns and perhaps many such files, then we want to encode all such data in charts like bar charts, line charts, and so on.

Image for post

Here is an example of such an encoding:

Image for post

The problem with the above encoding is that we can’t easily decode it. We are more interested in visualizations, encodings, and representations that humans can decode.

Image for post

Let’s see the different visual elements (ones that humans can perceive) that we can use to encode the data:

Image for post

For example, we can use length to represent numbers: a longer bar represents a larger number, and vice versa. We can use slope to represent positive and negative numbers and their magnitude, different colors to represent different numbers, and so on.

These elements are combined to create visualizations in a way that reduces perceptual error. Some visual elements are hard for us to perceive accurately, and since the goal of a visualization is to make it easy to get insights from the graphic, we want to keep that error low. Here is an example: we have two bars, A and B, and we use length to encode numbers. If we ask which bar is longer, A or B, and consequently which variable is larger, it might take a bit of time to come to a conclusion.

Image for post

This means that the effort needed to decode the information in this chart is quite large, and we want to reduce it.

One way is to draw boxes (of the same size) around A and B. Now things become clear: we can easily see that the bar for B is longer, and therefore variable B is larger.
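The effect of the boxes can be tried out with a small matplotlib sketch. The values for A and B, the offsets of the two bars, and the frame height are all made up for illustration; this is one possible way to draw it, not a reconstruction of the original figure.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to open a window
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Hypothetical values: B is only slightly larger than A.
values = {"A": 7.0, "B": 7.4}
# Different baselines make the raw bar lengths hard to compare by eye.
baselines = {"A": 0.0, "B": 1.5}
frame_height = 10.0  # every frame has the same size

fig, (bare, framed) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

for ax, with_frame in ((bare, False), (framed, True)):
    for x, name in enumerate(values):
        ax.bar(x, values[name], bottom=baselines[name],
               width=0.6, color="steelblue")
        if with_frame:
            # Same-size box around each bar: the empty gap at the top
            # of each box is now easy to compare between A and B.
            ax.add_patch(Rectangle((x - 0.3, baselines[name]), 0.6,
                                   frame_height, fill=False,
                                   edgecolor="black"))
    ax.set_xticks(range(len(values)))
    ax.set_xticklabels(list(values))
    ax.set_ylim(0, 12)
    ax.set_yticks([])  # hide the scale so only length carries information
    ax.set_title("with frames" if with_frame else "without frames")
```

Calling `fig.savefig("bars.png")` (the file name is arbitrary) writes both panels to disk. In the framed panel, the smaller empty gap above B is what makes the comparison quick.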

So, our goal in visualization is to make the decoding easier. Visualization is defined as encoding information graphically, but the art of it lies in making sure that humans can decode it efficiently.

Let’s see why we need visualizations:

The first reason is to discover insights from the data. Let’s say we have a dataset loaded as a pandas DataFrame with ~54k entries. Scrolling through all the entries and computing the mean, median, and so on to understand what is happening would take a long time.
Image for post

Instead of that, consider plotting something like this:

Image for post

This is a pair plot: it plots every pair of numerical columns against each other, which helps us quickly get an idea of the relationships between the columns.
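To make the idea concrete, here is a hand-rolled pair plot built with plain matplotlib. The three columns (`size`, `price`, `unrelated`) are synthetic and hypothetical, not the dataset shown above. Libraries such as seaborn offer this as a single call (`pairplot`); the loop below is only a sketch of what such a plot contains.

```python
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to open a window
import matplotlib.pyplot as plt

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in data with hypothetical column names.
n = 500
size = [random.uniform(0.2, 2.5) for _ in range(n)]
price = [4000 * s + random.gauss(0, 800) for s in size]  # depends on size
unrelated = [random.gauss(60, 1.5) for _ in range(n)]    # independent noise

data = {"size": size, "price": price, "unrelated": unrelated}
names = list(data)
k = len(names)

fig, axes = plt.subplots(k, k, figsize=(7, 7))
for row, y_name in enumerate(names):
    for col, x_name in enumerate(names):
        ax = axes[row][col]
        if row == col:
            # Diagonal: distribution of a single column.
            ax.hist(data[x_name], bins=20)
        else:
            # Off-diagonal: one column plotted against another.
            ax.scatter(data[x_name], data[y_name], s=4, alpha=0.5)
        if row == k - 1:
            ax.set_xlabel(x_name)
        if col == 0:
            ax.set_ylabel(y_name)
fig.tight_layout()
```

In the resulting grid, the `size` versus `price` panels show a clear upward trend, while the panels involving `unrelated` show a shapeless cloud. Spotting that from summary statistics alone would take more work.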

The second reason to study data visualization is that it helps in communicating the insights effectively.

Image for post

As discussed in the CRISP-DM model, the goal at the end of the process is to communicate the insights (what we found through the data science process) to others. This step is directly linked to the business objectives that were laid out (note the double arrow between communicating results and business objectives) and helps decide on the next steps, so it’s essential that the insights are communicated effectively.

Image for post

The audience may not know about data viz., different graph types, and so on, so the ability to communicate insights to them regardless is very important.

So the two main objectives of data viz. are data exploration, to discover insights from the data, and communication of those insights.

Let’s look at some of the pitfalls in Data viz.:

Believing that data viz. is not that important: the notion that statistics, including complex inferential statistics, are what drive ML models and so on, while data viz. is a soft skill that doesn’t matter much. That is definitely not the case.

Image for post

The next pitfall in data viz. is trying to pack as much information as we can into one image.

Image for post

The idea of a viz. is to make deriving insights efficient, not to pack in information.

Image for post

Data viz. is not about impressing people with cool graphical skills; the purpose is to convey insights as efficiently as possible.

Image for post

Another pitfall is believing that aesthetic choices like color don’t matter.

Image for post

Although a color legend is specified, it is hard to decode because the colors don’t follow any logical order, which makes it very hard to understand what the scale is.
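One common way to give colors a logical order is a sequential scale: keep a single hue and make larger values darker. The sketch below uses only Python's standard `colorsys` module. The helper name `sequential_color`, the hue, and the lightness range are illustrative choices, not something taken from the chart above.

```python
import colorsys

def sequential_color(value, vmin, vmax, hue=0.6):
    """Map a value to a hex color: same hue, darker means larger."""
    if vmax == vmin:
        t = 0.0  # avoid dividing by zero when all values are equal
    else:
        t = (value - vmin) / (vmax - vmin)  # normalize to [0, 1]
    t = min(max(t, 0.0), 1.0)               # clamp out-of-range values
    lightness = 0.9 - 0.6 * t               # 0.9 (light) down to 0.3 (dark)
    r, g, b = colorsys.hls_to_rgb(hue, lightness, 0.8)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255)
    )

values = [10, 20, 30, 40, 50]
palette = [sequential_color(v, min(values), max(values)) for v in values]
for v, color in zip(values, palette):
    print(v, color)
```

Because lightness falls steadily as the value grows, a reader can rank two colors without consulting the legend, which an arbitrary set of unrelated hues does not allow.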

Data viz. is an important part of the data science pipeline that is generally underrated. Done right, it can greatly reduce the effort needed to extract insights from data, and it helps in communicating results effectively to the audience.

References: PadhAI
