In the last article, we discussed how data visualization is an important part of the data science pipeline: done well, it can greatly reduce the overall effort and helps communicate insights to the audience. In this article, we discuss tabulation visualization, i.e., how to design and style table content.

We start by importing the relevant libraries.

‘matplotlib’ and ‘seaborn’ are two important libraries for data visualization. In particular, we use the ‘pyplot’ module from the ‘matplotlib’ library.
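As a sketch, the imports might look like this (seaborn is imported for completeness, following the article, though the tabulation examples below rely mainly on pandas):

```python
import pandas as pd               # dataframes and the df.style API
import matplotlib.pyplot as plt   # the pyplot module from matplotlib
import seaborn as sns             # statistical visualization built on matplotlib
```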

Under this format, we visualize the data through tables: rows and columns, with the cells filled in with values. In fact, in most research papers, results are presented in the form of tables, often with important values highlighted.

The dataset used in the code is fetched from the covid19india website API. It tells us the number of active, recovered, and deceased cases on a daily basis across Indian states. The data is in JSON format; below is a snippet of the dataset:

So, the data file contains three record categories for each date, with the numbers corresponding to the different states, all available in JSON format.

We store the data API link in a variable and then import the ‘request’ module from the ‘urllib’ library, which helps read data from URLs.

The ‘urlretrieve’ method is used: the API data link is passed as the first parameter, and the downloaded data is stored in a file on the system whose name is passed as the second parameter.
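A minimal sketch of the download step; the URL below is the states_daily endpoint the article refers to (the API has since been archived, so treat it as illustrative):

```python
from urllib.request import urlretrieve

# The covid19india states_daily API endpoint (now archived)
URL = "https://api.covid19india.org/states_daily.json"

def download_data(filename="states_daily.json"):
    # urlretrieve fetches the URL and writes the response body to
    # `filename`; it returns a (filename, headers) tuple
    return urlretrieve(URL, filename)
```

In a notebook, calling `download_data();` with a trailing semicolon hides the returned tuple.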

The output tuple returned by this call can be suppressed by putting a semicolon at the end of the notebook cell.

Now that we have this data available as a JSON file in the run environment, we can read it into a pandas dataframe using the ‘read_json()’ method.

If we print the dataframe, it seems to have just one column, which contains all the content within it.

This is clearly not a format we can use right away (we would need each state as a separate column, along with the numbers for a given date).

The way to deal with this is to load in the JSON data as a dictionary, perform some operations on top of it and then load it as pandas dataframe:

Firstly, we import the ‘json’ library; then we open the file and load the JSON data as a dictionary.

If we look at the structure of the JSON again, we find a top-level key called ‘states_daily’, and all the data is contained within this key.

So, we pull out the data corresponding to this key and store it back in the data variable.
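Sketching these two steps with a tiny synthetic file standing in for the real download (the field names mirror the states_daily records; the numbers are made up):

```python
import json

# Write a small stand-in file shaped like the real payload
sample = {"states_daily": [
    {"date": "14-Mar-20", "status": "Confirmed", "mh": "14", "dl": "7"},
    {"date": "14-Mar-20", "status": "Recovered", "mh": "0", "dl": "1"},
]}
with open("states_daily.json", "w") as f:
    json.dump(sample, f)

# Load the file as a dictionary, then keep only the list of records
# stored under the top-level 'states_daily' key
with open("states_daily.json") as f:
    data = json.load(f)
data = data["states_daily"]
```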

Now that we have the data in the form of a dictionary, we can load it into a dataframe using ‘pd.json_normalize()’, passing in the dictionary. If we check the dataframe after this, it has all the states as columns, plus a ‘date’ column and a ‘status’ column corresponding to the different categories (Confirmed, Recovered, Deceased) in the dataset.

The same code cell tells us that the dataframe has 321 rows, i.e., 107 days of data (three records per date, one for each unique status value), and 41 columns in total.
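The normalization step can be sketched as follows, again with made-up numbers in place of the real records:

```python
import pandas as pd

# Synthetic records shaped like the states_daily payload
records = [
    {"date": "14-Mar-20", "status": "Confirmed", "mh": "14", "dl": "7"},
    {"date": "14-Mar-20", "status": "Recovered", "mh": "0", "dl": "1"},
    {"date": "14-Mar-20", "status": "Deceased", "mh": "1", "dl": "0"},
]

# json_normalize flattens a list of dictionaries into one row per record
df = pd.json_normalize(records)
```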

Now that we have the data in a dataframe, we’ll see how to put it in tabular form with ‘date’ as the index, confining ourselves to the ‘Confirmed’ cases for the time being.

Firstly, we convert the ‘date’ column to datetime type instead of the default string type; then we restrict the data to only the rows with ‘Confirmed’ status.

After this, we can drop the ‘status’ column, as it now contains just one value, ‘Confirmed’, and then set the ‘date’ column as the index using the ‘set_index()’ method.
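The steps above can be sketched like this (synthetic rows, using the same date format as the dataset):

```python
import pandas as pd

# Stand-in for the normalized dataframe
df = pd.DataFrame({
    "date": ["14-Mar-20", "14-Mar-20", "15-Mar-20", "15-Mar-20"],
    "status": ["Confirmed", "Recovered", "Confirmed", "Recovered"],
    "mh": ["14", "0", "18", "0"],
    "dl": ["7", "1", "7", "2"],
})

# Parse dates, keep only the Confirmed rows, drop the now-constant
# 'status' column, and index the table by date
df["date"] = pd.to_datetime(df["date"], format="%d-%b-%y")
df = df[df["status"] == "Confirmed"]
df = df.drop(columns="status").set_index("date")
```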

Now if we check the dataframe, it’s in the required format: ‘date’ as the index, all states as columns, and the numbers reflecting the ‘Confirmed’ cases in the respective state for a given date.

We can use ‘df.info()’ to see the data types of the columns and the number of non-null values in each.

As the data was read from a JSON file, all of these columns have been assigned the ‘object’ dtype. Since the data reflects the number of cases (always an integer), we convert these columns to integer type. This is how it can be done for one column at a time:

To apply this to all the columns, we use the ‘.apply()’ method and pass it the function we want to apply, which in this case is ‘pd.to_numeric’.

Now if we run ‘df.info()’ again, it reflects an integer data type for all columns.
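Both conversions can be sketched as:

```python
import pandas as pd

# String-typed columns, as produced by reading the JSON
df = pd.DataFrame({"mh": ["14", "18"], "dl": ["7", "7"]})

# One column at a time...
df["mh"] = pd.to_numeric(df["mh"])

# ...or every column at once, by applying pd.to_numeric column-wise
df = df.apply(pd.to_numeric)
```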

Let’s take only a small portion of the data from the original dataset/dataframe.

Now, the entry point for styling the tabulation is ‘df.style’.

Let’s say we want to highlight negative values. We can define a function with a local variable named color that takes the value ‘red’ or ‘white’: the function takes in one value, and if that value is negative, it sets color to ‘red’; otherwise, it sets color to ‘white’.

Now we can use ‘df.style.applymap(function_name)’: this applies the function to each cell in the dataframe and uses the style string returned by the function to style that particular cell.

So, for this dataset, most of the cells have white text, but a couple of values appear in red, meaning they contain a negative value.

Since these values reflect Confirmed cases, which cannot be negative, we could drop the entire column, drop the rows corresponding to these entries, or set these entries to blank. In general, the decision depends on the business objective and the data availability at hand.

Let’s say in this case we drop this particular column.

We can also apply built-in functions: for example, ‘highlight_max(color=color_name)’ will highlight the maximum value in each column with the specified color.

For most of the states, the red is towards the bottom, meaning the maximum number of cases was reported in recent days. For a few states, for example ‘ld’, the value is the same across multiple dates, which is why multiple entries are highlighted.

This again helps inform the next steps on the data side; in this case, for example, we might decide to drop the states with 0 cases.

We can also chain commands together. For example, the line below highlights the maximum value in each column in red and the minimum value in green. This helps us better understand the situation: for the state of ‘ut’, the numbers go from large to small (the data is sorted in ascending order by date), so we can say the situation is improving, whereas for ‘wb’ the number was 370 earlier but then increased to 572, indicating a surge in cases.

The column titled ‘tt’ seems to represent the total value across all states, so we can delete this column.

Let’s write one more function that bolds the maximum value in a column:

This function takes in a column. The first line compares each entry in the column with the column’s maximum, giving a boolean result for each entry, stored as an array. A list comprehension over this array then returns the ‘bold’ style for the maximum value and an empty style for the rest.

‘.applymap()’ applies a function to every cell individually.

‘.apply()’ applies a function along an axis, i.e., to a whole column or row at a time.

We can again chain a couple of functions together.

So, this line of code shows the largest value in bold and the smallest with a green background.

Instead of finding the minimum value within a state, one might be interested in the minimum on a given date across all states; we can do that by specifying the axis as 1.

Similarly, on a given date, we could highlight the maximum value across all states.

So, with this, we have bold values for the column-wise maximum and a red highlight for the row-wise maximum.

State ‘mh’ seems to have the maximum value on any given day, except for a couple of records.

Instead of a single color, we could have a gradient as well; we specify the color choice by setting the ‘cmap’ argument.

This takes each column and colors it, going from light colors denoting small numbers to dark colors denoting large numbers.

We could also do this for each row: specifying the axis as 1 shows the gradient within each record/row.

Based on the initial analysis, we can focus on a few states by listing them under the ‘subset’ argument.

This shows the gradient only for the specified columns/states, and we can then analyze how the gradient changes across these states.

Let’s see how to draw bars:

This shows a little bar in each cell, where the length of the bar indicates the magnitude of the value relative to the other values in that column. For example, for the state ‘mh’, the bar length increases all the way down except for the last row, indicating that cases have risen continuously in the state.

We can specify the ‘subset’ argument here as well.

Another way to analyze a subset is to first filter the data to these states and then apply the style method.

This now becomes a bit easier to interpret.

We could also specify different colors for different columns, for example:

So, we have seen something interesting here: just in tabulating the data itself, we can color the cells, bold the values, draw bars whose length conveys the magnitude of a value, add gradients, and so on, all of which helps us understand the data better.

References: PadhAI