Advanced Pandas — Continued
In the last article, we discussed the dataframe object of the pandas package, how to create a dataframe object from a series, dictionary; then we discussed that we can consider the dataframe a combination of multiple series objects, how to index a dataframe object, and how to slice a dataframe object.
In this article, we discuss how to load a dataset from seaborn package, how to deal with null, blank, missing values/data, how to iterate over a dataframe, how to filter a dataframe, how to update the data in a dataframe. So, let’s get started.
Loading data from Seaborn
We can load in the datasets available with the ‘seaborn’ package by using the ‘load_dataset()’ function passing in the name of the dataset and this functions loads in the dataset into a dataframe and return the dataframe.
When we start working with a new dataset, the first thing we could try is the ‘.info()’ method
It tells us the number of entries, data points(1035 in this case) and the list of columns along with the number of non-null values for each of the columns and the data type of the column.
df.head(n) — returns the first ’n’ rows of the dataframe
df.tail(n) — returns the last ’n’ rows of the dataframe
When working with a new dataset, we should also use the ‘.describe()’ method of the dataframe and it will return a statistical summary of all the numerical columns of the dataframe.
In our case, the column named ‘method’ was not numerical and as is clear from the result set, by default ‘.describe()’ just ignores the non-numerical column.
Now for the orbital period, the mean value is 2002 and the median is around 39, this points that there are some outliers. The same is true for the ‘distance’ attribute.
This is a quick way to get a sense of what the data looks like. One thing of importance is that there are many many rows of the data which seems to not have the complete data, there are missing values for ‘orbital period’, ‘mass’, ‘distance’(this is clear from the count value in the statistical summary).
One thing to deal with this problem is to drop the rows with missing values for any of the columns. There are multiple ways to do this, let’s solve it using two ways:
- Iterating over each row and if the value of any of the column is missing, we delete the row using the ‘.drop()’ function.
- Using the built-in functionality of pandas.
Dropping null values
As is clear from the result set of ‘planets.describe()’, there are many rows with missing values of either the ‘mass’ or ‘diameter’ or any other column.
We iterate over the row's index and for each row, we look at the columns one by one and for each combination of the index values(that is row index and column index), we check if the value at that index is null or not using the ‘pd.isnull()’ and if the value is null, we drop the row using the ‘df.drop()’ passing in the row index and specifying ‘inplace’ argument as True and we can specify ‘break’ in the code as soon as we drop the row because even if the value for the first column is missing, we want to drop the row and don’t want to iterate over the rest of the columns.
Here, we are iterating over the rows and for each row, we then iterate over the list of columns. We can iterate over both of them simultaneously using the ‘df.iterrows()’
It iterates through each row and gives the index of the row and the data for that row as a series object with column names as the index(of the series object)
Let’s write the code to drop any row with any missing values using the ‘df.iterrows()’
So, here we pass in the series object to ‘pd.isnull()’ and it returns a boolean value for each column/index in the series(we have re-run the part in the code which loads in the data so that it contains the missing values) and we can chain this call to the ‘pd.isnull()’ with the ‘.any()’ method and it will then return true if the value of any of the index is True.
So, the optimization here was that the first pattern of looping over the rows and the columns to essentially iterate cell wise is common and we switched to ‘.iterrows()’ to iterate row-wise and turns out that dropping the missing data is a common task in Pandas and we have the ‘.dropna()’ method in pandas which we can use to drop the rows with missing data.
We re-ran the code which reads in the data so that the dataframe contains some rows with missing values and then we can use the ‘.dropna()’
Querying from dataframe
Let’s look at how to filter the data from a dataframe based on some conditions say we want to see only those planets that are found in the 2010s and the method is either ‘Radial Velocity’ or ‘Transit’ and the distance is greater than the 75th percentile value of the distance attribute.
So, essentially we want to query a dataframe to look at the subset of the data by specifying some constraints on the columns and all those rows, entries which fulfill the constraints are part of the result set.
We iterate over the dataframe(row-wise using ‘.iterrows()’) and it will return the data of each row as a series, we access the column of interest using the square brackets notation and if it does not fulfills the constraint then we drop the row and continue with the next iteration.
If the first constraint is satisfied, we go to the next one check if that’s fulfilled or not and if not then we drop the row and continue with the next iteration. In a similar manner, we check any other constraints and take the decision if the row is to be kept or not.
There is an efficient way of querying the dataframe. We can do indexing on pandas using conditional checks on any combination of the columns.
We can directly access only that subset of the dataframe which satisfies the specified constraints
Manipulating the Data
Writing the full name of the method attribute seems a bit cumbersome, we want to modify this attribute so that it stores in the abbreviation(first character of each word in the method) of the method.
To get the unique values of an attribute, we can use the ‘.unique()’ method
Let’s solve this problem in chunks, to see how to convert a given string to its abbreviation.
The first thing we do is to split the string based on the white space which will give all the words as a list
We can then iterate over this list and for each element, we take the first character using the 0th index
And now we can join all the elements of this list using ‘.join()’ method
We just have 10 unique values for the method attribute and almost 1000 rows in the dataframe, we don’t want to do this computation again and again and would like to store this as a mapping, so we could store it in a dictionary wherein the key represents the original string and the value represents the abbreviation.
Let’s create a dictionary for this and iterate over the list of unique values
And now we iterate over the dataframe and create a new column named ‘short_method’ which refers to the abbreviation of the method’s attribute value.
So, this is one way of doing this operation and there is an efficient way of doing this operation instead of specifically iterating over the dataframe.
We want to take the ‘method’ attribute and apply some function on top of it and we want to do this efficiently, let’s say the function takes in a string, looks in the dictionary for this string and return the corresponding abbreviation.
df[‘col_name’].apply(function) — this applies the ‘function’ to value of the ‘col_name’ attribute for each row
So, we are getting the same results and here we are completely avoiding iterating over the dataframe ourselves.
Now that we know there are only 10 unique values of ‘method’, let us see how we can get the count of planets discovered for each method type.
To solve this task, what we essentially need to do is take the dataframe and divide it into smaller parts, each part corresponding to a unique ‘method’ name, the next thing to do is do some operation on each chunk and then we need to aggregate the results from each chunk. The general approach is:
- Split the dataframe into smaller chunks(in this case, we want to split the data into smaller chunks so that each chunk should have the same method name).
- Apply some function to each chunk(in this case, it is the ‘count’ function).
- Aggregate the results from each chunk into a single table.
Now one way of solving this task is to iterate over the unique values of the ‘method’ attribute and for each unique, we filter the data corresponding to the current value of ‘method’ and we then apply the ‘.count()’ on top of it to count the number of elements for all the columns names, so we can subset it to take the count of just one column and applying count on that means counting the non-null values for the column:
We have not yet aggregated the result, we could define a dictionary and store the count corresponding to each unique category/value of the ‘method’ attribute
Let’s see how we can achieve the same result using pandas efficient way of doing it
Here we use the ‘.groupby()’ method, it essentially splits the data with respect to some condition(here the condition is that the data should have the same value for ‘method’).
So, ‘df.groupby(‘method’)’ creates as many groups as there are unique values of the ‘method’ attribute.
After that, we subset only one attribute from it namely ‘method’ and on top of it we can the function of interest to us which in this case is ‘.count()’
The whole thing returns us a series object where we have the name/value of the method as the index and the corresponding count as its value.
We can use ‘groupby()’ on a combination of columns as well which then takes into account all those columns to create different chunks.
Similarly, we can apply some other function as well, for example, to compute the mean distance for each method type, we have:
Now that we know how to do some computation on chunks, let’s take the next task to find the fraction of planets that have been found in the last decade across each method type.
Let’s break this task into sub-problems and figure out a way to filter out those planets which have been discovered in the last decade and then we can split the data into chunks where each chunk represents each unique method type and then we apply the count function on it to count the entries for each chunk and finally we aggregate the results.
- Filter the dataframe based on a given condition
- Split the dataframe into small chunks
- Apply some function on each chunk
- Aggregate the results.
Let’s first filter the data to have only those planets which have been discovered in the last decade
We want to split this filtered dataset based on the ‘method’ attribute and then we want to take the count of entries
So, this gives the count of planets under each method that has been discovered in the last decade and we also have the overall count for each method:
Now the task is to find the fraction of planets that are found in the last decade. So, we have two ‘series’ objects
We can simply divide the two series objects
When we do this operation, it first aligns the indices and then does the operation point-wise.
In this article, we looked at how to load the data from seaborn package, how to deal with missing values/data, how to iterate over a dataframe, how to filter a dataframe, how to update the data in a dataframe.
References: PadhAI