Data science, as the name suggests, deals with data, and data comes in several types. In this article, we look at how to sort data into those types. So, let’s get started.
Different Types of Data
When dealing with real-world data, we largely encounter two broad categories: ‘Qualitative’ and ‘Quantitative’ data. Within ‘Qualitative’ there are two sub-categories, ‘Nominal’ and ‘Ordinal’; similarly, within ‘Quantitative’ there are two sub-categories, ‘Discrete’ and ‘Continuous’.
Let’s take some examples of data and then look at each of these categories.
Here is an example from e-commerce: an e-catalog that holds information about shirts (the shirts are the objects we are interested in).
Each shirt can be described by various attributes: color, pattern, size, rating, price, and discount. Even in this small set of attributes there is variety: some are numbers, some are labels, and some are a special kind of label, like size, that has an increasing or decreasing order. All of these need to be sorted into different buckets, which we will do below, starting with the qualitative attributes.
Qualitative data is data that can be divided into classes, labels, or categories. Color takes one of ‘n’ categories; pattern likewise takes one of a fixed set of categories (we cannot have infinitely many patterns here). Similarly, sizes and ratings each belong to a fixed set of categories.
Within these, we again need to distinguish between nominal and ordinal. Let’s look at Color and Pattern. These are just classes, and the interesting thing is that there is no natural ordering in these attributes. We cannot say that Red is greater than Blue, which in turn is greater than Green, or that Plain is greater than Checkered, which is greater than Striped, in the numerical sense in which 5 is greater than 4. Because this kind of natural ordering is not possible for attributes like Color and Pattern, they are categorized as Nominal attributes.
Now compare this with the other kind of Qualitative attributes we have, Size and Rating. These are again fixed sets of labels, but there is a natural ordering among the categories. We know that Small is less than Medium, which in turn is less than Large. Similarly, for ratings, we know that Poor is less than Okay, which is less than Good, and so on. Qualitative attributes that have a natural ordering are classified as Ordinal attributes.
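The nominal/ordinal distinction can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names `SIZE_ORDER`, `COLORS`, `ordinal_less_than` and `same_category` are made up for this example, and the values are the shirt labels used above. The idea is that an ordinal attribute carries an explicit order we can compare against, while a nominal attribute only supports "same or different".

```python
# Hypothetical attribute values for shirts in an e-catalog.
SIZE_ORDER = ["Small", "Medium", "Large"]   # ordinal: list position encodes the order
COLORS = {"Red", "Blue", "Green"}           # nominal: a set, no order implied


def ordinal_less_than(a, b, order):
    """Return True if label `a` comes before label `b` in `order`."""
    return order.index(a) < order.index(b)


def same_category(a, b):
    """Equality is the only comparison that makes sense for nominal labels."""
    return a == b


print(ordinal_less_than("Small", "Large", SIZE_ORDER))             # True
print(same_category("Red", "Blue"))                                # False
# Sorting by position in SIZE_ORDER respects the natural order of sizes.
print(sorted(["Large", "Small", "Medium"], key=SIZE_ORDER.index))  # ['Small', 'Medium', 'Large']
```

Note that Python would happily sort the color strings alphabetically, but that order is an artifact of spelling, not a property of the attribute.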
Let’s look at some more examples from different domains. Say we are looking at the employee data of a company or an organization. Gender is a qualitative attribute because it takes one of a fixed number of categories (Male, Female, and Others), and there is no natural ordering: we cannot say that Male is less than or greater than Female. So Gender is a Nominal attribute. On the other hand, take Income Range: if employees are grouped into low, medium, and high income ranges, there is a natural ordering in these values, since low is less than medium, which in turn is less than high. So Income Range is an Ordinal attribute. Both attributes (Gender, Income Range) are classes/labels, but one has a natural ordering and the other does not.
Let’s look at the healthcare domain. A disease can be labeled communicable or non-communicable, and there is no natural ordering between the two, so this is a Nominal attribute. Health risk, on the other hand, is an Ordinal attribute, as there is a natural ordering in its values: low health risk, medium health risk, and so on.
Similarly, in the agriculture domain, crop type could be Kharif, Rabi, or All-season, and there is no natural ordering among them. Farm type, however, could be small, medium, or large, and there is a natural ordering here.
And similarly, if we look at Government, there is Nationality (Indian, Chinese, etc.), where there is no natural ordering. But if we look at opinion on a policy, say "I agree with the policy", "I’m neutral about the policy", and "I disagree with the policy", there is again an ordering: agreement is better than neutral, which in turn is better than disagreement.
Let’s look at Quantitative Data. As the name suggests, this is about quantities: we need to quantify something, so numbers come into the picture. The attributes below are again about shirts from an e-catalog, and all of them take numerical values. Price is a number, Number of buttons is a number, Days for Delivery is a number, and Discount is a number as well. At least some of these numerical attributes can take infinitely many possible values. Discount, for example, could be 0.01, 0.05, 0.1, 1, 1.2, or anything in between. It can take an infinite number of values, and that is the key thing to understand here.
As we can see, some of these numbers can be fractions and some cannot.
Among these attributes, Number of buttons and Days for Delivery happen to be non-fractional in this data (all values are integers). Data that can take only whole-number values, with no fractions, is known as discrete data.
Now compare this with Continuous data, where we do have fractions. Price could be, say, $23.99 or Rs. 525.50. Similarly, the discount could be a fractional number. Data that can also take fractional values is known as Continuous Quantitative Data.
Not every value of Price has to be fractional; as long as the attribute can take some fractional values, we call it a Continuous attribute.
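As a rough sketch of this rule, the check below flags a column as continuous if at least one observed value has a fractional part. The function name `looks_continuous` and the sample values are our own, purely for illustration. It is only a heuristic on a sample: a batch of prices that all happen to be whole numbers would be reported as discrete, so what really decides the type is what the attribute can take, not what one sample shows.

```python
def looks_continuous(values):
    """Heuristic: True if at least one observed value has a fractional part."""
    return any(not float(v).is_integer() for v in values)


# Hypothetical sample values for two shirt attributes.
prices = [23.99, 525.50, 40.0, 15.0]   # some prices are whole numbers, some are not
buttons = [6, 7, 8, 6]                 # counts, always whole numbers

print(looks_continuous(prices))        # True
print(looks_continuous(buttons))       # False
# A sample of whole-number prices fools the heuristic:
print(looks_continuous([40.0, 15.0]))  # False
```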
For example, in the Employee data we might have Gross Salary or Income Tax. Income tax could be fractional, as it is some percentage of gross salary, and gross salary itself might include decimal points, so both are continuous attributes. But the number of projects an employee is working on, or the number of family members an employee has, would be discrete.
Similarly, in the healthcare domain, cholesterol level or sugar level could be fractional numbers and are therefore Continuous attributes, whereas days of treatment or weeks of pregnancy are typically expressed as whole numbers and therefore fall under Discrete attributes.
In the agriculture domain, total yield or acreage could be fractional (say the area of a farm is 525.26 acres), but the number of farmers or the number of crops grown on a particular farm are discrete quantities.
Similarly, in the government domain, GDP and the GST and CGST rates could be fractional, whereas the number of citizens or the number of villages is a discrete quantity.
Ordinal(qualitative) vs Discrete(quantitative)
Here is one point of confusion. If we have Ordinal data such as the ratings Very Poor, Poor, Okay, Good, Very Good, we could just write these as the numbers 1, 2, 3, 4, 5, which are equivalent to the ratings. In fact, we could have rated things as 1 star, 2 stars, all the way up to 5 stars. So is this data Ordinal or Discrete?
The answer is Ordinal, and one subtle reason is that although the values are expressed as numbers, the notion of distance between them is not well defined. With ordinary numbers, we know that ‘2–1’ is 1, which is the same as ‘3–2’. With ratings, these distances are not clear. Very Poor is coded as 1, Poor as 2, and Okay as 3, so the numerical difference between Poor and Very Poor (‘2–1’) equals the numerical difference between Okay and Poor (‘3–2’). But in reality, Very Poor is very bad, and in our minds the gap between Okay and Poor is not going to be the same as the gap between Poor and Very Poor. The same goes for Good and Very Good: we would call something Very Good only if we are really happy with it, but we might call it Good even if it is only slightly better than Okay. That is why the difference is not well defined: although we are using numbers, the property that ‘4–3’ is the same as ‘3–2’ does not hold in many cases.
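One way to see this is to encode the same ratings with two different sets of numbers that both preserve the order. In the sketch below, `equal_steps` is the usual 1 to 5 coding, while `uneven_steps` is an arbitrary alternative we made up for illustration; the list of ratings is hypothetical too. A statistic that relies only on order (the median) lands on the same label either way, while a statistic that relies on distances (the mean) changes with the coding.

```python
from statistics import mean, median_low

RATING_ORDER = ["Very Poor", "Poor", "Okay", "Good", "Very Good"]

# Two codings of the same labels; both keep the order, only the gaps differ.
equal_steps = {label: i + 1 for i, label in enumerate(RATING_ORDER)}  # 1, 2, 3, 4, 5
uneven_steps = {"Very Poor": 1, "Poor": 4, "Okay": 5, "Good": 6, "Very Good": 10}

# Hypothetical ratings given to one shirt.
ratings = ["Poor", "Good", "Very Good", "Okay", "Very Poor", "Good", "Good"]


def median_label(labels, encoding):
    """Median rating, found via the codes and mapped back to its label."""
    decode = {code: label for label, code in encoding.items()}
    return decode[median_low([encoding[label] for label in labels])]


def mean_code(labels, encoding):
    """Mean of the numeric codes; its value depends on the chosen coding."""
    return mean([encoding[label] for label in labels])


print(median_label(ratings, equal_steps))    # Good
print(median_label(ratings, uneven_steps))   # Good
print(mean_code(ratings, equal_steps) == mean_code(ratings, uneven_steps))  # False
```

Since nothing in the data tells us which coding has the "right" gaps, the order-based summary is the safer one for ordinal attributes.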
Why bother about data types?
It turns out that the type of statistical analysis depends on the type of variable.
For example, if we are looking at Qualitative attributes, it does not make sense to ask the questions like the below ones:
It makes sense to ask the below type of questions for Qualitative Attributes
For Qualitative data, Regression Analysis (to be discussed in another article) does not make sense, as it involves two numeric attributes, whereas Analysis of Variance (ANOVA, also to be discussed in another article) does make sense for Qualitative attributes.
Similarly, if we have Quantitative Discrete attributes say we are talking about Number of Delivery Days, we could ask below type of questions:
Similarly, for Quantitative Continuous attributes, we could ask questions like:
Asking about frequency does not make sense here because these are fractional values: it is very unlikely that a weight such as 58.72 repeats many times. Such values are typically unique, appearing only once or twice in the data.
So, as is clear from the above discussion, knowing the data types helps us perform the correct analysis on the data.
In this article, we saw how data can be divided into two broad categories (each of which is further subdivided) based on the type of values it can take. We also looked at why it makes sense to know the data types of the different attributes before starting the analysis.
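The point about frequency can be checked with a small, hypothetical sample. The numbers below are invented for illustration (only the value 58.72 comes from the text above); `Counter` from Python's standard library simply counts how often each value occurs.

```python
from collections import Counter

# Hypothetical samples: a discrete attribute and a continuous one.
delivery_days = [2, 3, 3, 5, 3, 2, 7, 3]
weights = [58.72, 61.05, 58.70, 72.31, 66.48, 59.90, 80.12, 64.27]

day_counts = Counter(delivery_days)
weight_counts = Counter(weights)

# For the discrete attribute, "which value is most frequent?" has a useful answer.
print(day_counts.most_common(1))     # [(3, 4)]

# For the continuous attribute, every value occurs exactly once in this sample,
# so a frequency count tells us almost nothing.
print(max(weight_counts.values()))   # 1
```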