Styling Tabulation
In the last article, we discussed that Data Viz. is an important task of the data science pipeline and if done in the right manner, it could reduce the overall effort to a great extent and is very helpful in communicating the insights to the audience. In this article, we discuss the tabulation viz., how to design the table content.
We start by importing the relevant libraries
‘matplotlib’ and ‘seaborn’ are the two important libraries for data visualization. In particular, we use the ‘pyplot’ package from the ‘matplotlib’ library.
Tabulation
Under this format, we visualize the data through tables. So, tables of rows and columns and filling in the cells with values. In fact, in most research papers, results are presented in form of tables and typically some important values might be highlighted in the table and so on.
The dataset used in the code is leveraged from the covid 19 India website API and this data tells us the number of active, recovered, deceases cases on daily basis across Indian states. This data is in the JSON format and below is a snippet of the dataset:
So, the data file contains three records categories for each date and the numbers correspond to different states and this is available in the JSON format.
We store the data API link as a variable in the program and then import the ‘request’ package from the ‘urllib’ library that helps to read in data from URLs
‘urlretrieve’ method is used and the API datalink is passed as the first parameter and this data is then stored in a file on the system and the file name is passed as the second parameter to this method
The output that we get under cell 4 in the above image can be suppressed by putting a semi-colon at the end
Now that we have this data available as a JSON file on the run environment, we can read it into pandas dataframe using the ‘read_json()’ method.
If we print the dataframe, then it seems that it has just one column and this column contains all the content within it
And this is clearly not in the format that we can use right away(we would need each state as a different column along with the numbers for a given date).
The way to deal with this is to load in the JSON data as a dictionary, perform some operations on top of it and then load it as pandas dataframe:
Firstly we import the ‘json’ library then we open the file and load the JSON data as a dictionary
If we see the structure of the JSON again we find that there is this top-level key called ‘states_daily’ and all the data is contained within this key
So, we pull in the data corresponding to this key and store it back in the data variable
Now that we have the data in the form of a dictionary, we can load it to a dataframe using ‘json.normalize()’ and passing in the dictionary and if we check the dataframe after this, it will have all the states as columns, there is also a ‘date’ column and ‘status’ column corresponding to different categories(Confirmed, Recovered, Deceased) as in the dataset
The same code cell tells us that there are 321 rows meaning 107 days of data(3 records for each date corresponding to each unique status value) is recorded in this dataframe and in total there are 41 columns.
Now that we have the data in a dataframe, we’ll see how to have it in a tabular form where the index is given by ‘date’ and we’ll confine to the ‘Confirmed’ cases for the time being
Firstly, we convert the ‘date’ column to datetime type instead of the default string type, then we confine the data to have only the rows corresponding to the ‘Confirmed’ status
After this we could drop the status column as anyways it just contains one value i.e ‘Confirmed’ and then we set the index to be the ‘date’ column using the ‘set_index()’ method
And now if we check the dataframe, it’s in the required format with the ‘date’ as the index and all states available as the columns and the numbers reflect the ‘Confirmed’ cases in the respective state for a given date
We can use the ‘df.info()’ to see the data types of the columns, the number of non-null values in each column
As the data is read from a JSON file, all of these columns have been assigned a dtype as ‘object’, as the data reflects the number of cases(which would always be an integer), we convert these columns to have integer type. This is how it could be done for one column at a time:
To apply to all the columns, we use the ‘.apply()’ method and pass it the function we want to apply which in this case is ‘pd.to_numeric()’
And now we use the command ‘df.info()’, it will reflect integer data type for all columns
Styling Tabulation
Let’s take only a small portion of the data from the original dataset/dataframe.
Now the command to style the tabulation is ‘df.style’
Let’s say we want to highlight negative values — we can define a function that has a local variable named color that takes on value as ‘red’ or ‘white’. This function takes in one argument/value and if that value is negative it assigns color as ‘red’ otherwise it assigns the color as ‘white’.
Now we can use the ‘df.style.applymap(function_name)’ and this command applies this function to each cell in the dataframe, takes the value returned the function for the respective cell, and uses the returned value to style that particular cell
So, for this dataset for most of the cells, we have the text color as white, for a couple of values we got the text in red meaning they contain some negative value.
Since these values reflect the Confirmed cases that can not be negative, we could either drop this entire column, drop the rows corresponding to these entries, set these entries to blank. In general, the decision to take will be dependent on the business objective at hand, data availability at hand.
Let’s say in this case, we drop this particular column
We could also apply in-built functions as well: for example ‘highlight_max(color = color_name)’ will highlight the maximum value under each column with the specified color
For most of the states, the red color is towards the bottom meaning the maximum number of cases were reported in recent days, for a few states for example for ‘ld’, across multiple dates, the value is the same and that’s why multiple entries are highlighted.
This again helps to take the next steps in the data aspect for example in this case, we might decide to drop the states with 0 cases
We can also chain together commands for example the below code line will
The above line of code highlights the maximum value in each column as red and the minimum value as green. This helps to better understand the situation for example in the below image for the state of ‘ut’, it's going from a large number to a smaller number which is better(since data is sorted in ascending order based on date), we can say that the situation is improving whereas in the state of ‘wb’, we had the number 370 earlier but then it increased to 572 indicating the surge in the number of cases.
The column titled ‘tt’ seems to represent the total value across all states and as such we can delete this column
Let’s write one more function that highlights/bolds the maximum value in a column:
So, this function takes in a column and the first line in the code checks each entry in the column with the maximum value in the column and this will result in a boolean result corresponding to each entry/value and this is stored as an array, now we have a list comprehension on top of this list which specifies the style as ‘bold’ for maximum value in the array/column
.applymap() is applicable to all cells
.apply() is used to apply a function across some axis
We can again chain in together a couple of functions
So, this line of code reflects the largest in bold and the smallest in the green background
Instead of finding the minimum value in a state, one might be interested in finding the minimum on a given date across all states, we can do that by specifying the axis as 1
Similarly, on a given date we could highlight maximum value across all states
So, with this, we have the bold value for column-wise maximum and the red highlighted for row-wise maximum
State ‘mh’ seems to have a maximum value on any given day except for a couple of records.
Instead of a single color, we could have a gradient as well(we need to specify the color choice, this is done by setting the ‘cmap’ argument
This takes each column and color it going from light color denoting small numbers to dark colors denoting large numbers.
We could do this for each row, so we can specify the axis as 1 and this would show the gradient for each record/row
Based on the initial analysis, we can focus on a few states and we can specify as such under the ‘subset’ argument
This will show the gradient only for the specified columns/states and we could then analyze how the gradient is changing across these states
Let’s see how to draw bars:
This shows a little bar for each cell, where the length of the bar gives the idea of the magnitude/value relative to other values in that column for example for the state ‘mh’, the length of the bar seems to be increasing all the way down except for the last down which indicates that the cases have increased continuously in the state.
We can specify the subset over here as well
Another way to do the analysis on the subset is to first filter the data for these states and then apply the style method
This now becomes a bit easier to interpret.
We could even specify different colors for different columns for example
So, we have seen something interesting here, just in tabulating data itself, we can color the data cells, bold the values, draw bars where the bar length gives the idea of the magnitude of the value, have gradients, and so on all of which helps to understand the data in a better manner.
References: PadhAI