Nowadays, all of us have heard that “Data is the new oil and Data Science is the combustion engine that drives it”, that “Data Science is the most sought-after job of the twenty-first century”, or that “Data Science is the future”. In this article, we discuss what exactly Data Science is.
Before we start answering this question, it’s important to understand why there is so much confusion around something so popular, and why there is no clear understanding of what Data Science is.
One reason for this is that Data Science is an assortment of several tasks. The Data Science pipeline involves many different tasks, which is why it’s not clear who can call themselves a Data Scientist, and the importance of these tasks changes from application to application. In one environment or organization a particular task might be important, and in another application, environment, or organization some other task might matter more. Because the weight given to these tasks is so uneven, the confusion arises: if I’m doing only this task, or say these two tasks, is it still called Data Science or not? So, we try to clear up this confusion by asking: what are the different tasks involved in the Data Science pipeline, or the different tasks that a Data Scientist should know? …
As discussed in the previous article, data can be classified into two broad categories: Qualitative and Quantitative. In this article, we discuss how to Describe Qualitative Data.
The typical characteristic of Qualitative Data is that we have repeating values for a qualitative attribute.
So, if we take the color of a shirt, we will see the value ‘Green’ appear many times (say, in the database of available shirts on an e-commerce site), and the same holds for the other colors as well. This happens across different domains too. Say we are in the Agriculture domain and looking at the season attribute for all crops: there we have Kharif, Rabi, All-seasons, and so on. The season values keep repeating across crops because there are only a finite number of classes/values/seasons; if we have data about, say, 10,000 farms in the country, then each farm takes its value (of the season attribute) from the same fixed list of possible values. …
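This repetition is exactly what makes a qualitative attribute easy to summarize with counts. A minimal sketch with pandas, using a small hypothetical farm table (the crop and season values below are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical farm records: the 'season' attribute repeats because
# there are only a few possible values (Kharif, Rabi, All-seasons)
farms = pd.DataFrame({
    "crop":   ["Rice", "Wheat", "Maize", "Rice", "Sugarcane", "Wheat"],
    "season": ["Kharif", "Rabi", "Kharif", "Kharif", "All-seasons", "Rabi"],
})

# Count how often each value of the qualitative attribute appears
counts = farms["season"].value_counts()
print(counts)
```

`value_counts()` is usually the first step in describing qualitative data: it turns the repeating values into a frequency table.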
A swarm plot is another way of plotting the distribution of an attribute or the joint distribution of a couple of attributes.
Let’s use the ‘diamonds’ dataset which is pre-loaded in the ‘seaborn’ package and have a swarm plot of the ‘carat’ attribute for the first 1000 data items
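A minimal sketch of this plot, assuming the ‘diamonds’ dataset bundled with seaborn:

```python
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt

# Load the bundled diamonds dataset and take the first 1000 data items
diamonds = sns.load_dataset("diamonds")
subset = diamonds.head(1000)

# One dot per diamond, positioned along the carat axis
sns.swarmplot(x=subset["carat"])
plt.show()
```

With many data items the dots pile up vertically wherever a carat value is common, which is what lets us read off where the diamonds concentrate.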
The ‘x-axis’ depicts the carat and this plot helps us to understand at what carat are there more diamonds or how the diamond count varies with carat size. It plots one dot for each data item(diamond in this case).
We can say that there are more diamonds with a carat size of 0.8 than with a carat size of 0.6 …
So far we have discussed how to visualize and understand the distribution of a single attribute; in this article, we discuss the joint distribution of two variables.
A joint distribution helps us understand how two variables are related. If we have two variables ‘x’ and ‘y’, we can plot two separate KDEs, but we would not know, for instance, what the distribution of ‘y’ looks like when ‘x’ is small, or, more generally, the correlation between the two attributes.
Let’s first define two independent variables (both normally distributed)
And create a dataframe using these two variables
So far we have looked at the distribution of data with both a distribution plot and a box plot. In these plots we are not looking at the numbers themselves but at their frequency (how often a certain value occurs), and in both cases we looked at continuous variables.
If we look at the diamonds dataset (available with the ‘seaborn’ package), there are some categorical attributes as well; below is a snapshot of this dataset
Attributes like ‘cut’, ‘color’, and ‘clarity’ seem to have a finite number of values. …
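A quick way to confirm this is to count the distinct values of each of these attributes, a minimal sketch:

```python
import seaborn as sns

diamonds = sns.load_dataset("diamonds")
print(diamonds.head())  # snapshot of the dataset

# Categorical attributes take only a finite set of values
for col in ["cut", "color", "clarity"]:
    print(col, diamonds[col].nunique(), "distinct values")
```

Each of these columns has fewer than ten distinct values, while a continuous attribute like ‘carat’ has thousands, which is what marks them as categorical.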
In the last article, we discussed the histogram plot, which helps us understand the distribution of an attribute. In this article, let’s look at another way of plotting the distribution of data: the box plot.
Let’s take a sample of size 1000 from a normal distribution to start with
And now we can simply leverage the ‘seaborn’ package to plot out a nice boxplot
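Both steps together, as a minimal sketch (the seed is an arbitrary choice):

```python
import numpy as np
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt

# A sample of size 1000 from a standard normal distribution
sample = np.random.default_rng(42).normal(loc=0, scale=1, size=1000)

# A box plot summarizes the sample: median, quartiles, and outliers
sns.boxplot(x=sample)
plt.show()
```

For a standard normal sample, the box (the interquartile range) sits roughly between -0.67 and 0.67 with the median line near 0.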
In the last article, we looked at how to represent the dataset in a tabular format and how to style the tables to get insights at a high level.
In this article, we look at how to understand the distribution of a column, i.e. how the different values of a particular attribute are distributed.
Here we are interested in taking a sample and plotting its distribution, i.e. how the data is distributed within a given sample. Within this broad class of data-distribution plots, there are several options; we start with the histogram plot.
Let’s first create some random data from a ‘normal distribution’ that contains 1000 data…
In the last article, we discussed that data visualization is an important task in the data science pipeline: done in the right manner, it can greatly reduce the overall effort and is very helpful in communicating insights to the audience. In this article, we discuss tabular visualization, i.e. how to design the content of a table.
We start by importing the relevant libraries
‘matplotlib’ and ‘seaborn’ are the two important libraries for data visualization. In particular, we use the ‘pyplot’ package from the ‘matplotlib’ library.
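The imports described above would look like this (pandas is included here as well, since the tables are built from dataframes):

```python
import matplotlib.pyplot as plt  # the 'pyplot' package from matplotlib
import seaborn as sns            # statistical plots built on matplotlib
import pandas as pd              # tabular data structures
```

These three lines are the usual preamble for the rest of the examples in this series.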
Under this format, we visualize the data through tables: rows and columns, with the cells filled in with values. In fact, in most research papers, results are presented in the form of tables, typically with some important values highlighted, and so on. …
In this article, we discuss one of the most important tasks of the data science pipeline: data visualization. We’ll understand what is meant by data viz., why it is needed, and what the common pitfalls are. So, let’s get started:
As discussed in this article, there are essentially five steps that constitute the process of Data Science
Data Visualization is about describing data.
In the last few articles, we discussed the probability mass function (PMF) and its properties. Below is the PMF plot for a uniform distribution (a uniform distribution is one in which all outcomes are equally likely; one example is throwing a fair die).
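The PMF plot for the fair-die example can be sketched as a simple bar chart:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this in a notebook
import matplotlib.pyplot as plt

# A fair die: six equally likely outcomes, each with probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6

plt.bar(outcomes, pmf)
plt.xlabel("outcome")
plt.ylabel("P(X = x)")
plt.show()
```

All six bars have the same height (1/6 ≈ 0.167), which is exactly what "uniform" means, and the probabilities sum to 1 as a PMF must.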
Then we have the cumulative distribution function (CDF), which gives the probability of the random variable taking on a value less than or equal to a given value. For the example of throwing one die with equally likely outcomes, the probability of the random variable taking a value less than or equal to 1 (meaning the outcome is 1) is 0.167 (1/6); the probability of it taking a value less than or equal to 2 (the event that the outcome is 1 or 2, and since these are disjoint events we can sum their probabilities) is 0.167 + 0.167 ≈ 0.333 …
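The CDF is just the running sum of the PMF, which makes the die example a one-liner:

```python
from itertools import accumulate

# PMF of a fair die: six equally likely outcomes
pmf = [1 / 6] * 6

# CDF: F(k) = P(X <= k), the running sum of the PMF
cdf = list(accumulate(pmf))

print(round(cdf[0], 3))  # P(X <= 1) -> 0.167
print(round(cdf[1], 3))  # P(X <= 2) -> 0.333
```

Note that the CDF is non-decreasing and reaches 1 at the largest outcome (P(X <= 6) = 1), two properties every CDF must satisfy.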