In the previous article, we discussed why we need pandas, looked at the Series object and some of the operations we can perform on it, and put it to use in a case study on the NIFTY time series data. In this article, let's discuss the DataFrame object of pandas, which is a generalization of the Series object and is where most of the action happens in pandas.
Let's first import the NumPy and pandas packages and then see how to create a DataFrame object:
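A minimal setup, using the conventional aliases:

```python
import numpy as np
import pandas as pd

# confirm the libraries are available
print(np.__version__, pd.__version__)
```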
Let's create the DataFrame from a 2D NumPy array. To create a 2D NumPy array, we use the 'np.random.randint()' function, passing in the range and the dimensions of the array.
So, we have a 2D array of shape (5, 3), and all the numbers lie between 0 and 9 (the upper bound of 'randint' is exclusive).
We can pass this array to 'pd.DataFrame()' and it will create a DataFrame object from the 2D array.
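The two steps above can be sketched like this:

```python
import numpy as np
import pandas as pd

# 5x3 array of random integers in the range [0, 10)
arr = np.random.randint(0, 10, size=(5, 3))

# build a DataFrame from the 2D array
df = pd.DataFrame(arr)
print(df)
```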
So, the DataFrame object contains a 2-dimensional arrangement of numbers, where each row has a name (its index label) and each column also has a name. Since we did not specify any names when creating this object, implicit integer values starting from 0 are used by default.
dataframe.values — gives back the underlying data as a NumPy array
dataframe.index — gives the row names
dataframe.columns — gives the column names
We can iterate over the value returned by the '.index' or '.columns' attribute; for example, we can do something like this:
As mentioned, the '.values' attribute gives us the data as a NumPy array, and we can simply index into this array.
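Putting these attributes together in one sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)))

print(df.values)    # the data as a NumPy ndarray
print(df.index)     # RangeIndex(start=0, stop=5, step=1)
print(df.columns)   # RangeIndex(start=0, stop=3, step=1)

# both .index and .columns are iterable
for col in df.columns:
    print(col)

# .values is a plain ndarray, so NumPy indexing works on it
print(df.values[0, 1])
```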
It is helpful to have meaningful names for the row and column indices.
We can access the values using the two ways of indexing (we talked about .loc and .iloc when performing the same operations on Series), so from the above table, if we want to access the value at R3 and C2, we can do it like this:
.loc refers to the explicit index and .iloc refers to the implicit index.
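A sketch of both access styles, assuming a table with row labels R1–R5 and column labels C1–C3 as described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

print(df.loc["R3", "C2"])   # explicit row/column labels
print(df.iloc[2, 1])        # implicit integer positions, same cell
```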
Similarly, we can do slicing, and it returns a DataFrame object.
We can use .loc[row_start:row_end, col_start:col_end] as well to do the slicing; the only difference is that the end label is also included in the result set.
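The difference between the two slicing conventions, with the same assumed labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

print(df.iloc[0:2, 0:2])             # end position excluded: rows R1-R2
print(df.loc["R1":"R2", "C1":"C2"])  # end label included: the same block
```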
We can also take slices that return a Series object: if we access the first row of the DataFrame, it contains three numbers, but each number has a label associated with it.
What this means is that we can think of a DataFrame as a collection of Series along either axis, meaning each row and each column is a Series in itself.
We can access the first column of the DataFrame as below, and the type of the result is again a Series in this case.
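Both slices, sketched with the same assumed labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3),
                  index=["R1", "R2", "R3", "R4", "R5"],
                  columns=["C1", "C2", "C3"])

first_row = df.loc["R1"]   # a Series labelled by the column names
first_col = df["C1"]       # a Series labelled by the row names
print(type(first_row), type(first_col))
```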
.shape — this attribute tells us the number of rows and columns
.T — attribute gives the transpose of the dataframe, it flips the rows and columns
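Both attributes in one quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
print(df.shape)    # number of rows and columns
print(df.T.shape)  # transpose: rows and columns flipped
```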
We can define a function that takes in the number of rows and columns and the maximum value the elements can take, and creates a DataFrame of that size filled with random values.
We can create the random array using 'np.random.randint()', specifying the maximum value and the size of the array as the number of rows and columns passed to the function.
Once we have the array created, we can use it to create a DataFrame object. Then, to create the row labels, we create a list of numbers from 1 to the number of rows and prepend each number with 'R', giving us values like 'R1', 'R2', and so on.
Similarly, we can create the column labels, specify that these labels are to be used for the DataFrame object, and return the DataFrame object.
And now we can call the function like this:
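A sketch of such a helper (the function name 'random_dataframe' is our own choice, not from the article):

```python
import numpy as np
import pandas as pd

def random_dataframe(n_rows, n_cols, max_val):
    """Build an n_rows x n_cols DataFrame of random integers in
    [0, max_val) with row labels R1..Rn and column labels C1..Cn."""
    data = np.random.randint(max_val, size=(n_rows, n_cols))
    rows = ["R" + str(i) for i in range(1, n_rows + 1)]
    cols = ["C" + str(i) for i in range(1, n_cols + 1)]
    return pd.DataFrame(data, index=rows, columns=cols)

df = random_dataframe(4, 3, 10)
print(df)
```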
Creating a dataframe using series object
We can also create a DataFrame object from multiple Series objects. Say we have two Series objects, one reflecting the mass of different planets and the other reflecting the diameter of those planets.
We have the same indices for the two Series objects; it's just that the 'diameter' object has one extra index, corresponding to 'Moon'.
Also, we can note that 'mass' contains 'float' values and 'diameter' contains 'int' values.
To create a DataFrame from Series objects, we call the 'pd.DataFrame()' function and pass in a dictionary, where the keys of the dictionary become the column names of the DataFrame and the value for each key is the corresponding Series object.
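A sketch of this construction. The numbers below are illustrative figures (mass in 10^24 kg, diameter in km) consistent with the description; the article's exact data may differ:

```python
import numpy as np
import pandas as pd

mass = pd.Series({"Mercury": 0.33, "Venus": 4.87, "Earth": 5.97,
                  "Mars": 0.642, "Jupiter": 1898.0, "Saturn": 568.0,
                  "Uranus": 86.8, "Neptune": 102.0, "Pluto": 0.0146})
diameter = pd.Series({"Mercury": 4879, "Venus": 12104, "Earth": 12756,
                      "Moon": 3475, "Mars": 6792, "Jupiter": 142984,
                      "Saturn": 120536, "Uranus": 51118,
                      "Neptune": 49528, "Pluto": 2370})

# dictionary keys become column names; the row index is the
# union of the two Series' indices
planets = pd.DataFrame({"Mass": mass, "Diameter": diameter})
print(planets)   # 'Moon' gets NaN under the Mass column
```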
We get a dataframe with two columns: mass and diameter and the indices contain the union of the indices of the two series objects.
‘Moon’ was available in diameter and not in ‘mass’ and we can see that we get a ‘NaN’ for ‘Moon’ under the mass column.
pd.DataFrame() is an overloaded function that can take different types of arguments; depending on what the arguments are, pandas does different things.
We can access any column of the dataframe using the square bracket notation and the value would be returned as a series object
We cannot do something like df['Earth'], as 'Earth' is not a key in the column space.
To get the mass of the Earth, we first specify the column name, i.e. 'Mass', which returns a Series object, and from this Series we pick the value corresponding to the key 'Earth'.
We can also use the shorthand notation
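These access patterns, sketched with a small two-planet version of the assumed data:

```python
import pandas as pd

planets = pd.DataFrame({"Mass": pd.Series({"Earth": 5.97, "Mars": 0.642}),
                        "Diameter": pd.Series({"Earth": 12756, "Mars": 6792})})

print(planets["Mass"])           # a column, returned as a Series
print(planets["Mass"]["Earth"])  # column first, then row key
print(planets.Mass["Earth"])     # dot-notation shorthand for the column
```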
We can create a new column in the DataFrame using the square bracket notation and assigning it a list of values, an array, or just a single value (which would then be broadcast to all the indices).
We can change the value of a specific entry as below:
To access a row as a Series, we can use the .loc functionality and pass in the row index.
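All three operations in one sketch (the 'Population' column and its value are invented for illustration):

```python
import pandas as pd

planets = pd.DataFrame({"Mass": pd.Series({"Earth": 5.97, "Mars": 0.642}),
                        "Diameter": pd.Series({"Earth": 12756, "Mars": 6792})})

planets["Population"] = 0.0                  # a single value is broadcast to every row
planets.loc["Earth", "Population"] = 7.8e9   # change one specific entry
earth = planets.loc["Earth"]                 # one row, returned as a Series
print(earth)
```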
In a similar manner, we can specify that we want all the rows and just the ‘Mass’ column, then we can do it this way:
To access a column, we can either use the square bracket notation or the dot notation; both work the same way.
If we do the same thing for the 'pop' column, we get a False.
The reason for this is that 'pop' is a built-in method available on DataFrame objects, and when we use the dot notation, it resolves to the method and not the column. This is one of the pitfalls of using the dot notation.
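The pitfall can be demonstrated like this:

```python
import pandas as pd

df = pd.DataFrame({"Mass": [1.0, 2.0], "pop": [10, 20]})

print(df.Mass.equals(df["Mass"]))  # True: dot notation finds the 'Mass' column
print(callable(df.pop))            # True: df.pop is DataFrame.pop, a method
print(df["pop"])                   # bracket notation always gets the column
```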
We can now create a function which, given a DataFrame, creates a new row with the row label 'Col_Mean' and values equal to the mean of the respective columns.
Let’s look at the individual parts to solve this task and then we can put it in a functional form.
We can add a new row by using the ‘.loc’ and specifying the row index and the default value
Now that we have the new row added to our dataframe, we need to update its corresponding columns with the mean value.
As discussed earlier, we can think of the dataframe as a composition of series objects and so if we access any of the columns, it would be of type ‘series’ and we can perform NumPy operations on top of it.
We have already added the row corresponding to 'Col_Mean', and it will affect the mean value, so we need to drop it first, then compute the mean and add it back with the correct values.
To delete a row, we can use the 'df.drop()' functionality and pass in the row index.
We can delete the columns in the same way, we just need to specify the value of the axis argument as 1
None of these changes actually take place in the original DataFrame; a new DataFrame is created for each operation. To make changes to the original one, we can specify the 'inplace' argument as True, or assign the new DataFrame back to the original variable.
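A sketch of both deletion forms and the inplace behavior:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["R1", "R2", "R3"], columns=["C1", "C2", "C3"])

no_row = df.drop("R1")           # returns a new DataFrame without row R1
no_col = df.drop("C2", axis=1)   # axis=1 drops a column instead
print(df.shape)                  # still (3, 3): the original is untouched

df.drop("R1", inplace=True)      # inplace=True mutates the original
print(df.shape)                  # now (2, 3)
```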
And now we can create a new row which reflects the mean value using the .loc functionality and passing in a list of values which corresponds to the mean value of the respective columns:
Now, to write this as a function, we need to be more generic; we can't just manually plug in the mean values of all the columns.
So, we iterate over the list of columns, compute the mean of each column, and store it in a list.
And we can now use this function to achieve the same results as before.
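A minimal sketch of such a function (the name 'add_mean_row' is our own choice):

```python
import pandas as pd

def add_mean_row(df, label="Col_Mean"):
    """Append a row holding the mean of every column."""
    df.loc[label] = [df[col].mean() for col in df.columns]
    return df

df = pd.DataFrame({"C1": [1, 2, 3], "C2": [4, 5, 6]})
add_mean_row(df)
print(df.loc["Col_Mean"])   # C1 -> 2.0, C2 -> 5.0
```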
It turns out that even this is not the most efficient way of doing it, because the intention of doing something across columns comes up all the time: we might want to find the minimum, maximum, or median value, and all of these operations require going through all the columns and aggregating the values.
To make this faster, we can simply use the '.mean()' method on the DataFrame object, which applies NumPy's efficient 'mean()' implementation to each column of the DataFrame.
Now, we can play around with the functions we have created: we can have a DataFrame object of any size filled with random values and compute the mean of all of its columns in one go:
‘df.mean()’ returns a series with the column names as indices and that’s why we were able to use it to assign the values to a new row in the function we defined earlier.
To compute the mean value across a row, for example the mean of all entries under the row index 'R1', we can still use 'df.mean()' but pass it the keyword argument 'axis' with a value of 1; this computes the mean for every row label and returns a Series object.
We can add this extra column to the dataframe as:
Similarly, we can quickly add in the column’s mean value using just one line of code
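Both one-liners, sketched on a small DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"C1": [1, 2, 3], "C2": [4, 5, 6]},
                  index=["R1", "R2", "R3"])

df["Row_Mean"] = df.mean(axis=1)   # axis=1: mean across each row
df.loc["Col_Mean"] = df.mean()     # default axis: mean of each column
print(df)
```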
df.median() — returns the median value for each column
df.min() — returns the minimum value for each column
df.max() — returns maximum value for each column
df.quantile(q) — returns the q'th quantile for each column, where q is a fraction between 0 and 1 (e.g. q=0.5 is the median)
Instead of computing these statistics one by one, we can use the ‘df.describe()’ which provides all these numbers in one go
For each column, it computes different quantities. This method helps to understand any new dataset in a quick manner. If we re-create the planets dataframe and use the ‘.describe()’ method, we can get some insights:
We see that the count for ‘Mass’ is 9 whereas it is 10 for the ‘Diameter’ column, that’s because the value of mass is missing for ‘Moon’.
Then, the mean value of the mass is 296 whereas the median (the 50th percentile) is 5.97; this means the data is skewed, with large outlier values to the right of the median.
So, 'describe()' helps us quickly check the data: whether there are any null values, whether the data is skewed, and the range of the values.
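The summary can be reproduced on any DataFrame; here is a sketch on a small subset of the assumed planet figures:

```python
import pandas as pd

df = pd.DataFrame({"Mass": [0.33, 4.87, 5.97, 1898.0],
                   "Diameter": [4879, 12104, 12756, 142984]})

summary = df.describe()
print(summary)   # count, mean, std, min, 25%, 50%, 75%, max per column
```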