Nowadays, most of us have heard claims such as “Data is the new oil and Data Science is the combustion engine that drives it”, “Data Science is the most sought-after job of the twenty-first century”, or “Data Science is the future”. In this article, we discuss what exactly Data Science is.

Before we answer this question, it’s important to understand why there is so much confusion around something so popular, and why there is no clear understanding of what data science is.

One reason for this is that it’s an assortment of several tasks; there are different tasks involved…

In the last article, we discussed how to create **NumPy** arrays. In this article, we discuss how to index and slice a **NumPy** array, i.e., how to access different parts of an **n-dimensional** array.

Say we have a 3D array:
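The article's own array is not reproduced in this excerpt, so here is a minimal sketch using an assumed array of shape `(2, 3, 4)` built with `np.arange`. The variable names are purely illustrative; the point is only to show how one index or slice per axis picks out different parts of a 3D array.

```python
import numpy as np

# Hypothetical 3D array of shape (2, 3, 4): 2 blocks, 3 rows, 4 columns.
# This is an assumed stand-in, not the array from the original article.
arr = np.arange(24).reshape(2, 3, 4)

single = arr[1, 2, 3]      # one element: block 1, row 2, column 3
block = arr[0]             # the first 2D block, shape (3, 4)
rows = arr[:, 1, :]        # row 1 of every block, shape (2, 4)
corner = arr[0, :2, -2:]   # first two rows, last two columns of block 0

print(single)
print(block.shape, rows.shape, corner.shape)
```

An integer index removes its axis from the result, while a slice (`:`) keeps it, which is why `block` is 2D and `single` is a scalar.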

In the last article, we introduced the **NumPy** package and discussed its advantages and the performance it offers. In this article, we discuss how to create a **NumPy** array.

Let’s visualize what a high-dimensional array looks like. In the picture below, we have an array that contains 4 numbers, so this is a **1-D array** of length 4.

We can use 1-D arrays to represent things like time-series data; for instance, the temperature recorded every hour forms a **1-D array**.
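As a small illustration of the hourly-temperature idea, the sketch below builds a 1-D array from a Python list. The readings are made-up values chosen only for the example.

```python
import numpy as np

# Hypothetical hourly temperature readings in degrees Celsius (made-up values).
temps = np.array([21.5, 22.0, 23.4, 25.1])

print(temps.ndim)   # number of dimensions of the array
print(temps.shape)  # size along each axis; a 1-D array has a single axis
```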

If we expand it by one dimension, it becomes a **2-D array**, where in the case below there are…

In the last article, we discussed how a histogram could help us understand the data distribution for an attribute. In this article, we discuss how to understand the relationship between multiple attributes.

This is common in many real-world datasets: for a given object, we have several attributes describing it. For example, in the **Sports domain**, in cricket, we have the **Runs scored**, **Balls faced**, **Minutes played**, **Strike Rate**, and **Type of dismissal**. …

In the last article, we looked at the histogram as a way of describing and understanding the trends in Quantitative Data. Let’s look at another kind of plot for describing **Quantitative Data**: the **Stem and Leaf plot**. It is an efficient way of describing small to medium-sized data sets.

Let’s say we have the data for the runs scored by Sachin in his last 30 ODIs.

**A Stem and Leaf plot represents every number/data item in the data (the score in each match, in this case) in two parts: one is called the stem and the other is called the leaf.**
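To make the stem/leaf split concrete, here is a minimal sketch that assumes non-negative integer scores, takes the last digit as the leaf, and takes the remaining leading digits as the stem. The helper name `stem_and_leaf` and the scores are hypothetical; they are not Sachin's actual figures.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each non-negative integer into a stem (leading digits) and a leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 45 -> stem 4, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical scores (made-up values for illustration only).
scores = [4, 12, 17, 23, 23, 38, 41, 45, 52]

for stem, leaves in stem_and_leaf(scores).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Note that this sketch simply omits stems that have no values; a hand-drawn plot would often list them with an empty row.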

…

In the last few articles, we have discussed the histogram in great detail, and it’s clear by now that a histogram reveals many trends and interesting patterns in the data. Histograms are also used in Machine Learning for various purposes, and in this article, we discuss those uses. So, let’s get started.

**Identifying Discriminatory features**

Let’s say we are trying to build a machine learning system that takes a lot of information about a patient, such as Age, Height, Weight, Cholesterol, Sugar Level, and so on, and this Machine…

The main purpose of drawing a histogram is to see whether any trends are visible in the data, and there are some standard trends that we should look out for. These standard trends are discussed in this article. So, let’s get started.

Here is the list of standard trends:

- How far are the values in the data spread out?
- Is the data density high in certain intervals? That is, if the data is divided into class intervals, is it the case that a few intervals have very tall bars and the…

In the previous article, we looked at how the histogram is the preferred way of representing Quantitative Data and how to select the right bin size. In this article, we look at how to answer questions on percentages/proportions, such as “**In what percentage of matches did Sachin score less than 10 runs?**”. As is the case with a frequency table or frequency plot, questions related to percentages are a bit hard to answer directly from the histogram.

Below is a histogram of the runs scored by Sachin Tendulkar in all the ODIs he played, using a bin size of 10.
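As a rough sketch of the kind of calculation involved, the snippet below bins a small set of scores with a bin size of 10 using `np.histogram` and converts the first bin's count into a percentage. The scores are made-up values, not the real ODI record, and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical scores (made-up values, not the real ODI record).
runs = np.array([0, 4, 8, 12, 15, 27, 33, 41, 56, 98])

# Bin edges 0, 10, ..., 100 give bins [0, 10), [10, 20), ...;
# NumPy treats the last bin as closed on both sides, [90, 100].
edges = np.arange(0, 101, 10)
counts, _ = np.histogram(runs, bins=edges)

# Share of matches that fall in the first bin, i.e. scores below 10.
pct_below_10 = 100 * counts[0] / counts.sum()
print(f"{pct_below_10:.1f}% of matches had a score below 10")
```

The raw bar heights only give counts, which is why the division by the total number of matches is needed before a percentage question can be answered.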

Suppose we want to…

Before we get started with this question, recall that the primary question guiding us when **describing Qualitative Data** is “**What is the frequency of different categories?**”, and that we plotted the frequencies using Frequency Tables, Frequency Plots, Relative Frequency Plots, and so on.
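For a quick reminder of what a frequency table and relative frequencies look like in practice, here is a minimal sketch using Python's `collections.Counter`. The list of shirt colors is a made-up example of a qualitative attribute.

```python
from collections import Counter

# Hypothetical qualitative attribute: shirt colors in a catalogue (made-up values).
colors = ["Green", "Blue", "Green", "Red", "Blue", "Green"]

# Frequency table: how many times each category appears.
freq = Counter(colors)

# Relative frequencies: each count divided by the total number of items.
total = sum(freq.values())
rel_freq = {color: count / total for color, count in freq.items()}

for color, count in freq.most_common():
    print(f"{color}: {count} ({rel_freq[color]:.0%})")
```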

As discussed in the previous article, data can be classified into two broad categories: **Qualitative** and **Quantitative**. In this article, we discuss how to **Describe Qualitative Data**.

The typical characteristic of **Qualitative** Data is that we have repeating values for a **qualitative** attribute.

So, if we take the **color of a shirt**, we will see that the value ‘**Green**’ appears many times (say, in the database of available shirts on an e-commerce site), and the same holds for the other colors. This is true across different domains as well: say we are talking about the Agriculture domain and looking at data about…