Pandas
We are interested in Python packages that are relevant to Data Science work. We have already seen NumPy, which is used for numerical computing, in a previous article. Let us move on to the next important package, which is pandas.
Why Pandas?
We have already discussed in the previous articles that Data Science is mainly about working with high-dimensional arrays. This is the case across many domains: scientific data, financial data, any kind of relational data, multimedia data that we store and process, or the models trained using deep learning. In all of these domains, we have high-dimensional data which needs to be created, accessed in complex ways, and processed to compute aggregates like sum, mean, median and mode. We already know various data structures, like lists, tuples and dictionaries, which can represent these arrays and process them.
Performance is the main factor for us, and that is where NumPy arrays come in: they are implemented with lower-level details in mind, which leads to very efficient ways to store and process these arrays.
Apart from storing them, NumPy also allows us to index them and take slices of n-dimensional arrays; the slices can be conditional as well (say, picking only the even values).
Then there is the concept of broadcasting, which allows us to write compact code.
NumPy also provides a default implementation of some standard functions.
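As a quick recap of the NumPy features just listed, here is a minimal sketch. The array `arr` and all variable names are made up for illustration and do not come from the earlier articles.

```python
import numpy as np

# Hypothetical 3x4 array holding the numbers 0..11
arr = np.arange(12).reshape(3, 4)

# Slicing an n-dimensional array: first two rows, middle two columns
sub = arr[0:2, 1:3]

# Conditional slice: pick only the even values
evens = arr[arr % 2 == 0]

# Broadcasting: the scalar 10 is applied to every element
scaled = arr * 10

# Standard functions: overall sum and per-column mean
total = arr.sum()
col_means = arr.mean(axis=0)

print(sub)
print(evens)
print(scaled)
print(total, col_means)
```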
So, NumPy already does quite a bit for data science work, but it is still missing some features, and that is why we are moving on to pandas. Let’s look at those missing features.
Say we have a table like the one in the image below. It has multiple columns, and there are multiple entries for the same student id, one for each subject that the student took. There are marks for 10th and 12th; in some cases there are no marks for 12th, maybe because the student didn’t write that exam or because the data is missing.
To work with this type of dataset, the first problem is that a plain NumPy array gives us no direct way to attach labels to the data. If we store the table above in an array variable, nothing tells us that the 3rd column represents the 10th marks, whereas the table itself has headers that say so. We could record somewhere else that the 3rd column holds the 10th marks, but there is no direct way to attach this label to the data.
The second problem is that there are no pre-built ways to fill in missing values. Missing values are a common problem in data science, and they might be represented in many different ways, say ‘-’, ‘/’, ‘0’, ‘null’, ‘nan’ and others. Normalizing them and filling in appropriate values is an important task, and NumPy does not provide a ready-made function for it.
Next, there is no easy way to group data. Say we want to see the table with only the maths data: we can apply some indexing in NumPy or combine multiple NumPy arrays, but this is very cumbersome. We would like more efficient ways to group data, because we often process data in a grouped structure.
Finally, there is no way to pivot data. In the given data, each row holds one combination of student id and subject together with its marks; that is one way of representing the data. Another way would be to show, in a single row, all the marks that we have for a given student. There could be multiple representations, and the one we choose depends on how we want to use the data. Since at different times we might want to achieve different things with the same dataset, we want to move between these representations, and this transformation should be efficient. Again, NumPy does not provide any simple way of doing it.
There are a few other limitations of NumPy as well. To overcome them, ‘pandas’ is built on top of NumPy to provide these features for data science. So pandas inherits much of NumPy’s way of thinking, much of its syntax, its functionality and, importantly, its efficiency in dealing with numerical arrays, while adding many more features for dealing with relational data.
Series Objects
The basic data structure in pandas is the ‘series’. Let’s create a ‘series’ object:
First, we import NumPy and pandas along with their aliases/short-hands.
The way we used to create NumPy arrays was to first create a list and then transform it into a NumPy array. We will follow a similar approach here: we first create a list object and then convert it into a series object.
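The two steps just described might look like the sketch below. The seven numbers in the list are placeholder values chosen for illustration; `np` and `pd` are the conventional aliases.

```python
import numpy as np
import pandas as pd

# Hypothetical list of seven whole numbers
numbers = [10, 20, 30, 40, 50, 60, 70]

# Convert the list into a series object
s = pd.Series(numbers)
print(s)          # index on the left, values on the right, dtype at the end

# Changing a single entry to a decimal makes the whole series float64
s_float = pd.Series([10, 20.5, 30])
print(s_float)
```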
As is clear from the above output, the second column contains our numbers. The first column is the index of the ‘series’ object; it contains an implicitly created sequence of whole numbers, which is auto-created for us as an index for this data. At the end, we can see that the data type is ‘int64’, which means that each of the numbers in the second column has the data type ‘int64’. If we change even one number in the list to have a decimal point, the data type becomes ‘float64’ and the other entries are implicitly converted to have a decimal point as well.
So, all the values in a ‘series’ object have the same type.
.values: this attribute returns a NumPy array containing all the values that we have
.index: this attribute gives us the first column of the ‘series’ object, which is the index; in the below case it is a range starting at 0 and stopping at 7 with a step size of 1.
Both of these attributes return objects which we can iterate over:
We can even combine the two using the ‘zip()’ function:
We can access any element in the series object using its index.
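A short sketch of these attributes, assuming the same kind of seven-element series of placeholder numbers as before:

```python
import pandas as pd

# Hypothetical series with the default (implicit) index
s = pd.Series([10, 20, 30, 40, 50, 60, 70])

print(s.values)   # the underlying values as a NumPy array
print(s.index)    # a range starting at 0, stopping at 7, step 1

# Iterate over index and values together
for idx, val in zip(s.index, s.values):
    print(idx, val)

# Access a single element through its index
print(s[3])
```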
Let’s create another series object, but this time explicitly specify the index we want for the corresponding values:
This is the main purpose of using series objects: we can specify the index as string values instead of just having numbers.
We can access any value through its index, using either the square bracket notation or the dot (.) operator. The dot form only works when the label is a valid Python name that does not clash with an existing attribute or method of the series.
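As a small illustration, the sketch below builds a series with string labels. The subject names and marks are invented for the example.

```python
import pandas as pd

# Hypothetical marks, labelled by subject
marks = pd.Series([82, 91, 77], index=["maths", "physics", "chemistry"])
print(marks)

# Square bracket notation
print(marks["physics"])

# Dot notation (the label must be a valid Python name)
print(marks.physics)
```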
Let’s create a ‘series’ object from a NumPy array:
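We can even generate the index values randomly and pass them in when creating a series object.
The index does not strictly have to be unique: pandas accepts duplicate labels, and looking up a duplicated label returns every matching entry. Unique labels are still preferable, because each lookup then returns exactly one value.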
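A minimal sketch of both ideas follows. The array contents, the seed and the range of the random labels are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so that the example is repeatable
rng = np.random.default_rng(0)

# Series created from a NumPy array (default implicit index)
arr = rng.random(5)
s_arr = pd.Series(arr)
print(s_arr)

# Series with randomly generated integer labels as its index
rand_index = rng.integers(0, 100, size=5)
s_rand = pd.Series(arr, index=rand_index)
print(s_rand)

# is_unique reports whether all the labels are distinct
print(s_rand.index.is_unique)
```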
Let’s create a series object using a dictionary:
So, we first create the dictionary and pass it to ‘pd.Series()’, which automatically picks the index from the dictionary keys and the values from the dictionary values.
When creating a series object from a dictionary, we can also specify an explicit index. By default, a ‘series’ object picks up all the keys of the dictionary as its index, but when we pass an explicit index, the series retains only the specified keys. Any label in the explicit index that is not a key of the dictionary gets the value ‘NaN’.
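The sketch below shows the three cases. The dictionary holds approximate planetary masses in units of 10^24 kg; these numbers and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical dictionary: approximate masses in units of 10^24 kg
mass_dict = {"Mercury": 0.330, "Venus": 4.87, "Earth": 5.97}

# Keys become the index, values become the data
s_all = pd.Series(mass_dict)
print(s_all)

# An explicit index keeps only the listed keys, in the given order
s_some = pd.Series(mass_dict, index=["Earth", "Venus"])
print(s_some)

# A label that is not a key of the dictionary ends up as NaN
s_missing = pd.Series(mass_dict, index=["Earth", "Pluto"])
print(s_missing)
```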
We have seen that to access the element at any index, we use the square bracket notation for example:
series_object[index]
But there are really two different things here. One is the lookup of an exact value in the index. The other is positional access: we might still want to ask for the first element of the series, the second element, and so on, just as we do with lists. So we would like to access elements using positions 0, 1 and so on and, when required, to use the explicit index that was specified when creating the series.
So, there are two indices: the explicit index that we specify when creating a series object and a more implicit, positional one, and we would like to switch between the two. To resolve this, pandas provides two different indexers:
Let’s say we have the below series object:
So, there is a gap of 1 between the implicit index (automatically created, starting from 0) and the explicit index.
If we do s.loc[4], it looks up the value whose label is 4 in the explicit index of the series.
.loc[] — gives us the value as per the explicit index
.iloc[] — gives the value at the implicit location
s.iloc[4] looks for the value at the implicit location 4, where locations are counted in the standard programming way, i.e. starting from 0. So, for the above case, s.iloc[4] corresponds to value 3.
If we do s.iloc[0], we get the first value from the series object, and if we do s.loc[0], we get an error because 0 is not a key in the series object ‘s’.
Writing something like s[5] is not good practice, because a reader cannot easily tell whether it refers to the implicit index or to the explicit index; it is better to state the intent with .loc[] or .iloc[].
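The difference can be sketched with a small series whose explicit index starts at 1. The values below are hypothetical, chosen so that the explicit index is offset by 1 from the implicit one and so that position 4 holds the value 3.

```python
import pandas as pd

# Hypothetical series: explicit labels 1..7, implicit positions 0..6
s = pd.Series([7, 6, 5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5, 6, 7])

by_label = s.loc[4]       # entry whose explicit label is 4
by_position = s.iloc[4]   # entry at implicit position 4 (the fifth one)
first = s.iloc[0]         # first entry of the series

print(by_label, by_position, first)

# There is no label 0 in the explicit index, so .loc[0] raises KeyError
try:
    s.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(label_zero_found)
```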
Even if we have strings as the explicit indices, we can use .loc[] and it will search for that specific index.
We might still want the first element of the ‘mercury’ object. We can’t get that with .loc[], because .loc[] looks up labels rather than positions, but we can get it with .iloc[].
Similarly, we can specify the index as -1 to get the last element.
We can even do slices, for instance
mercury.iloc[0:2] — gives the values at implicit index 0 and 1.
The output of slicing a series is also of the type series.
Interestingly, we can also slice using the explicit indices, with one difference: both ends are included.
So, these are two ways of indexing a series object. This way of indexing will also be applicable in the dataframe objects.
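The lookups and slices discussed above can be sketched as follows. The contents of ‘mercury’ are not visible in the text, so the labels and values used here (a few approximate properties of the planet) are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical 'mercury' series with string labels
mercury = pd.Series(
    [0.330, 4879, 4222.6],
    index=["mass", "diameter", "daylength"],
)

print(mercury.loc["mass"])   # lookup by explicit label
print(mercury.iloc[0])       # first element by position
print(mercury.iloc[-1])      # last element by position

# Positional slice: the end position is excluded
pos_slice = mercury.iloc[0:2]

# Label slice: both ends are included
label_slice = mercury.loc["mass":"diameter"]

print(pos_slice)
print(label_slice)
print(type(pos_slice))       # slicing a series gives back a series
```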
Now, let’s talk about the operations we can perform on a ‘series’ object. For this, let’s work with a series object which contains the masses of various planets:
As already discussed, it is best practice to use .loc[] and .iloc[], especially when the explicit index consists of integer values.
And when slicing using the explicit index, both ends are included in the output:
We can create slices not just from indices but also from conditional statements. Let’s say we want those planets which are at least 100 units heavy: we simply use a comparison operator on the series object, and it returns a series of booleans, one for each entry.
We can then index the original object with this condition, and it will output only those entries for which the condition is True.
We can have multiple conditions as well. For instance, if we want all planets with a mass in the range 100–600 (both exclusive), we combine the conditions with the & operator, wrapping each condition in parentheses:
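A sketch of conditional selection follows. The ‘mass’ values are approximate planetary masses in units of 10^24 kg, filled in here as an assumption because the original data is not shown in the text.

```python
import pandas as pd

# Assumed illustrative data: approximate masses in units of 10^24 kg
mass = pd.Series(
    [0.330, 4.87, 5.97, 0.642, 1898, 568, 86.8, 102],
    index=["Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"],
)

# A comparison returns a boolean series with the same index
heavy_mask = mass >= 100
print(heavy_mask)

# Indexing with the boolean series keeps only the True entries
heavy = mass[heavy_mask]
print(heavy)

# Multiple conditions: combine with & and parenthesise each one
mid_range = mass[(mass > 100) & (mass < 600)]
print(mid_range)
```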
The other thing is that, although the ‘mass’ object has indices and we can think of it as a dictionary of key-value pairs, we can also look at it as a NumPy array. Just as with any NumPy array, any operation between the series object and a scalar value is broadcast to all the values in it.
So if we multiply this series object by 2, every value within the series object gets multiplied by 2.
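For instance, with a small ‘mass’ series (the values are assumed approximate masses in units of 10^24 kg, used only for illustration):

```python
import pandas as pd

# Assumed illustrative data
mass = pd.Series([0.330, 4.87, 5.97], index=["Mercury", "Venus", "Earth"])

# The scalar 2 is broadcast to every value; the index is preserved
doubled = mass * 2
print(doubled)
```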
We can also apply NumPy functions to the series object. For instance, to compute the mean of the values in the ‘mass’ object, we can simply use the ‘np.mean()’ function:
np.amin(object): gives the minimum value in the series object
np.amax(object): gives the maximum value in the series object
np.median(data): returns the median value in the data
We can see that the median is much lower than the mean. The reason is that there are outliers in the data (Jupiter is an outlier in that its mass is very high compared to the other planets). These heavy planets pull the mean up significantly, but they don’t affect the median.
We can use almost all NumPy functions on the series object; for example, we can take the exponential of each value, apply trigonometric functions to it, and so on.
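These aggregate and element-wise functions can be tried out as in the sketch below. The ‘mass’ values are assumed approximate planetary masses in units of 10^24 kg, and `np.min`/`np.max` are used as the plain-named equivalents of `np.amin`/`np.amax`.

```python
import numpy as np
import pandas as pd

# Assumed illustrative data: approximate masses in units of 10^24 kg
mass = pd.Series(
    [0.330, 4.87, 5.97, 0.642, 1898, 568, 86.8, 102],
    index=["Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"],
)

mean_mass = np.mean(mass)
median_mass = np.median(mass)
min_mass = np.min(mass)
max_mass = np.max(mass)

# Jupiter pulls the mean far above the median
print(mean_mass, median_mass, min_mass, max_mass)

# Element-wise functions return a series with the same index
sines = np.sin(mass)
print(sines)
```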
We can also add two series objects; wherever the index matches, the values get added up.
Let’s consider an example where the indices do not match. For that, first, create an object which contains only those planets which have a mass value greater than 100:
And we have ‘mass’ as:
If we add these two series objects, we get the following:
So, we get values only for those indices which are present in both series (with the value determined by the operation being performed between the two objects); for the remaining indices (the ones present in only one of the series objects), we get ‘NaN’.
This is a significantly different operation from adding NumPy arrays. NumPy arrays are implicitly indexed, so they can simply be added position by position, whereas for series, pandas first aligns the two objects on their index labels and then adds.
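The alignment behaviour can be reproduced with the sketch below. The ‘mass’ values are assumed approximate planetary masses in units of 10^24 kg, and the names `heavy` and `new_mass` are chosen for illustration.

```python
import pandas as pd

# Assumed illustrative data: approximate masses in units of 10^24 kg
mass = pd.Series(
    [0.330, 4.87, 5.97, 0.642, 1898, 568, 86.8, 102],
    index=["Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"],
)

# Only the planets with a mass greater than 100
heavy = mass[mass > 100]

# Labels present in both series are added; all the others become NaN
new_mass = mass + heavy
print(new_mass)
```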
pd.isnull(object): this function gives a boolean value for each entry (for series and dataframes) which tells us whether that entry is missing (‘NaN’, i.e. ‘Not a Number’).
So, using this result, we can filter the ‘new_mass’ object to show only those values which are not null.
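A minimal sketch of this filtering step, assuming a ‘new_mass’ series that contains some NaN entries (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical result of an addition with unmatched labels
new_mass = pd.Series(
    [np.nan, 3796.0, np.nan, 204.0, 1136.0],
    index=["Earth", "Jupiter", "Mars", "Neptune", "Saturn"],
)

# True wherever the entry is NaN
null_mask = pd.isnull(new_mass)
print(null_mask)

# Keep only the entries that are not null (~ inverts the boolean mask)
valid = new_mass[~null_mask]
print(valid)
```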
Now, let’s look at how to add something new to the series object. We have the ‘mass’ object as below, and say we want to add the mass of the ‘Moon’ to it:
We can add a new key-value pair using the same square bracket notation that we use to add a new key-value pair to a dictionary.
To drop a key-value pair from the series object, we can use the ‘.drop()’ method and pass it a list of the keys we want to drop. This returns a new series object and leaves the original unchanged.
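Both operations are sketched below. The masses, including the value used for the Moon, are assumed approximate figures in units of 10^24 kg, given only for illustration.

```python
import pandas as pd

# Assumed illustrative data
mass = pd.Series([0.330, 4.87, 5.97], index=["Mercury", "Venus", "Earth"])

# Add a new entry, exactly as we would add a key to a dictionary
mass["Moon"] = 0.0735
print(mass)

# drop() returns a new series; 'mass' itself still contains 'Moon'
without_moon = mass.drop(["Moon"])
print(without_moon)
```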
In this article, we have seen quite a few operations that we can perform on ‘series’ objects. In the next article, let’s use this knowledge to solve some tasks using pandas.
References: PadhAI