You are on page 1of 10

Introduction

Welcome to the session on ‘Univariate Analysis’. In the previous sessions, you learnt how to source
data and how to clean it for analysis.

In this session
As the term “univariate” suggests, this session deals with analyzing variables one at a time. It is
important to separately understand each variable before moving on to analyzing multiple variables
together.

The broad agenda for this session is as follows:


Metadata description
Data distribution plots
Summary metrics

People you will hear from in this session:


Subject Matter Expert:
S Anand
CEO, Gramener
Gramener is one of the most prominent data analytics and visualization companies in India. Anand,
currently the CEO, was previously the Chief Data Scientist at Gramener and has extensive experience
in management consulting and equity research.

Data Description
Given a data set, the first step is to understand what it contains. Information about a data set can be
gained simply by looking at its metadata. Metadata, in simple terms, is the data that describes the each
variable in detail. Information such as the size of the data set, how and when the data set was created,
what the rows and variables represent, etc. are captured in metadata.

Let us learn more about data description in the following video.

Video: - Metadata

Ordered and Unordered Categorical Variables


Categorical variables can be of two types - ordered categorical and unordered categorical. In unordered,
it is not possible to say that a certain category is 'more or less' or 'higher or lower' than others. For
example, color is such a variable (red is not greater or more than green etc.)
On the other hand, ordered categories have a notion of 'higher-lower', 'before-after', 'more-less' etc. For
e.g. the age-group variable having three values - child, adult and old is ordered categorical because an
old person is 'more aged' than an adult etc. In general, it is possible to define some kind of ordering.

In this lecture, you learnt about metadata description.

Types of Variables

You learnt the difference between ordered and unordered categorical variables -
 Ordered ones have some kind of ordering. Some examples are
o Salary = High-Medium-low
o Month = Jan-Feb-Mar etc.

 Unordered ones do not have the notion of high-low, more-less etc. Example:
o Type of loan taken by a person = home, personal, auto etc.
o Organization of a person = Sales, marketing, HR etc.

Apart from the two types of categorical variables, the other most common type is quantitativevariables.
These are simply numeric variables which can be added up, multiplied, divided etc. For example,
salary, number of bank accounts, runs scored by a batsman, the mileage of a car etc.

So far, we have discussed the following types of variables:


1. Categorical variables
o Unordered
o Ordered
2. Quantitative / numeric variables

Unordered Categorical Variables - Univariate Analysis


Let’s now move to the most interesting part of EDA — getting useful insights from the data. So far,
you have seen two types of variables - categorical (ordered / unordered) and quantitative (or numeric).
In this segment, you will learn how to conduct univariate analysis on unordered categorical variables.

But before studying that, think a bit about how you would conduct analysis on such a variable.

Video: - Approach to analyzing data


You saw how one can use plots to extract meaningful information from unordered categorical variables.
Compare the answer you had given to the question before the lecture - how would you approach of
analyzing unordered categorical variables change after studying this?

It is important to note that rank-frequency plots enable you to extract meaning even from seemingly
trivial unordered categorical variables such as country, name of an artist, name of a github user etc.

The objective here is not to put excessive focus on power laws or rank-frequency plots, but rather to
understand that non-trivial analysis is possible even on unordered categorical variables, and that plots
can help you out in that process.

Let us now see how a power law distribution is created in Excel.

Dataset: - employment-data.csv

Video: - employment-data-analysis

Why plotting on a log-log scale helps

The objective of using a log scale is to make the plot readable by changing the scale. For example, the
first ranked item had a frequency of 29000, the second ranked had 3500, the seventh had 700 and most
others had very low frequencies such as 100, 80, 21 etc. The range of frequencies is too large to fit on
the plot.

Plotting on a log scale compresses the values to a smaller scale which makes the plot easy to read.

This happens because log(x) is a much smaller number than x. For example, log(10) = 1, log(100) = 2,
log(1000) = 3 and so on. Thus, log(29000) is now approx. 4.5, log(3500) is approx. 3.5 and so on. What
was earlier varying from 29000 to 1 is now compressed between 4.5 and 0, making the values easier to
read on a plot.

To summarize, the major takeaways from this lecture are:


 Plots are immensely helpful in identifying hidden patterns in the data
 It is possible to extract meaningful insights from unordered categorical variables using rank-
frequency plots
 Rank-frequency plots of unordered categorical variables, when plotted on a log-log scale,
typically result in a power law distribution
Ordered Categorical Variables - Univariate Analysis
In the previous lecture, you saw how to conduct univariate analysis on unordered categorical variables.
Let's now see how to do the same on ordered categorical variables.

Before that, think about the type of analysis you would do on ordered categorical variables.

You may have had some experience taking power meter readings at least once. It would be interesting
to see how meter readings vary across households. Let us look at a case study that Anand did for a
power company.

Video: - Detecting Fraud

You saw an example of how a simple histogram can reveal interesting insights. This is an extremely
important takeaway — whenever you have a continuous or an ordered categorical variable, make sure
you plot a histogram or a bar chart and observe any unexpected trends in it.

Let’s look at a few more examples of univariate analysis revealing hidden patterns.

I’m sure you remember the sleepless nights exams gave you. For a student, the examiner is an
antagonist most of the times, which prevents you from getting the scores you deserve. You might also
have been intrigued by questions such as: how many students obtained marks similar to yours, how
many students were ahead, or how many lagged behind. And everyone has an opinion on when and
where grace marks are justified.

So, let’s use this typical student experience to help you learn univariate analysis. Let us now look at an
interesting analysis that Anand conducted on class X student marks across Tamilnadu.

Video: - TN Class X

Tendulkar Batting - Univariate Analysis

The dataset given below contains Sachin Tendulkar's ODI batting statistics from about 1989 to 2012.
Download the dataset given below to attempt the following questions.

The variables 'Runs' and 'X4s' respectively represent the number of runs scored and number of 4s hit
by him in each match. These are two ordered categorical variables you'll analyze in this exercise.
'Runs', which should ideally be a numeric variable, contains some entries such as DNB (Did Not Bat),
8* (Not out at 8 runs) etc. You'll need to clean these quality issues.

Similarly, in 'X4s', there are some missing entries represented by "-". You will need to clean them as
well.

Download the data set below to answer the following questions.

Dataset: - tendulkar_ODI_batting.csv

You have seen how to conduct univariate analysis on categorical variables. Next, you will learn to
conduct univariate analysis on quantitative (or numeric) variables.

Quantitative Variables - Univariate Analysis


You have seen how to conduct univariate analysis on categorical variables. Now, let's look at
quantitative or numeric variables.

Prerequisites
In this segment, Anand will take you through various summary metrics. In case you are not familiar
with the basics of descriptive statistics such as mean, median, mode, and standard deviation, you are
recommended to go through the study material provided on the resources page. Knowledge of these
concepts is essential for this topic and the forthcoming topics in other modules, so make sure you make
yourself familiar before moving on.

Let’s now learn how to analyze quantitative variables.

Video: - Quantitative Variables

Numeric and Ordered Categorical Variables


Anand mentioned that you can treat numeric variables as ordered categorical variables. For analysis,
you can deliberately convert numeric variables into ordered categorical, for example, if you have
incomes of a few thousand people ranging from $5,000 to $100,000, you can categorise them into bins
such as [5000, 10000], [10000,15000] and [15000, 20000].

This is called 'binning'.


Mean and median are single values that broadly give a representation of the entire data. As Anand
stated very clearly, it is very important to understand when to use these metrics to avoid doing
inaccurate analysis.

While mean gives an average of all the values, median gives a typical value that could be used to
represent the entire group. As a simple rule of thumb, always question someone if someone uses the
mean, since median is almost always a better measure of ‘representativeness’.

Let’s now look at some other summary descriptive statistics such as mode, interquartile distance,
standard deviation, etc.

Video: - Descriptive Data Summary

Standard deviation and interquartile difference are both used to represent the spread of the data.

Interquartile difference is a much better metric than standard deviation if there are outliers in the data.
This is because the standard deviation will be influenced by outliers while the interquartile difference
will simply ignore them.

You also saw how box plots are used to understand the spread of data.

Quantitative Variables - Summary Metrics


Let's take some time to understand why summary metrics such as mean and standard deviation are not
representative of the data in some cases.

Video: - Summary Metrics

News Popularity Data Set - Reading Box Plots

You saw that quartiles are a better measure of the spread than the standard deviation. A good way to
visualise quartiles is to use a box plot. The 25th percentile is represented by the bottom horizontal line
of the box, the 75th is the top line of the box, and the median is the line between them.

In the following questions, let us use the news popularity data set to understand the measures of the
deviation of the number of shares of a news article across the days of the week, weekends, and types of
articles.
You can access the data dictionary of the news popularity data set (to understand the variables in
detail) here.

You can download the data set below:

Dataset: - EDA News Popularity.csv

The box plots given below show the variation of the number of shares of a news article (on the y-axis)
against three variables — weekend (1 implies weekend), weekday (0 implies weekday), and the
channel type. Read the charts carefully and answer the questions that follow.

1. Shares vs Weekend

Weekend

2. Shares vs Weekday
Shares versus Weekday

3. Shares vs Channel type

Shares versus Channel Type

Recall some of the metrics you learnt in this session and answer the questions below.

Summary
In this session, you learnt about the various elements of univariate analysis such as data distribution
plots and summary metrics.

Let’s quickly recap what you learnt.


Video: - Summary

Let's summarise what you learnt:


1. Metadata description describes the data in a structured way. You should make it a habit of
creating a metadata description for whatever data set you are working on. Not only will it serve
as a reference point for you, it will also help other people understand the data better and save
time.
2. Distribution plots reveal interesting insights about the data. You can observe various visible
patterns in the plots and try to understand how they came to be.
3. Summary metrics are used to obtain a quantitative summary of the data. Not all metrics can be
used everywhere. Thus, it is important to understand the data and then choose what metric to
use to summarise the data.

Practice Questions
For this exercise, we will use the news popularity data set again which is a set of articles published by
the digital media website Mashable over a period of two years.

You can download the data set below.

Popularity Dataset
file_downloadDownload

You can access the data dictionary of the news popularity dataset (to understand the variables in
detail) here.

There are more than 60 attributes in the data set which describe the articles, such as the total number of
words, the number of positive and negative words, the length of the title, the article genre (e.g.
business, travel or entertainment) and, more importantly, the output variable, i.e. the number of shares.

In the following questions, you will attempt to understand the distribution of shares (the number of
times an article was shared on the internet) using measures of spread. Suppose you work for a content
marketing company as an analyst and want to understand the ‘distribution of shares’. Based on your
understanding, you will share with your peers some rough metrics such as the representative number of
shares, the spread of shares, etc.

The following functions and concepts will be helpful for the questions below:
The describe() function is used to obtain the percentile distribution in Python. You may learn more
about this function here.

The df[‘column_name’].mean() function is used to obtain mean of a particular column.

arrow_drop_downarrow_drop_upQuiz Summary
 Question 1
keyboard_arrow_rightIncorrect
 Question 2
keyboard_arrow_rightIncorrect
 Question 3
keyboard_arrow_rightCorrect
Outliers

Roughly speaking, outliers are abnormally large or small values. There is no one fixed rule in deciding
outliers. However, you would have noticed that going from the 99th to the 100th percentile, there is an
approximately 25x increase in the number of shares.

You may have also noted that in the lower quartiles, there is only about 5% increase in the number of
shares per percentile. This increases to about 10% per percentile in the higher quartiles and 20%
beyond the 95th percentile. In this case, some articles in the top quartiles are clearly outliers.

To classify some articles as outliers, you may need to consult your client or the business to understand
the reasons behind abnormal values. In some cases, they are justifiable, whereas in others they cannot
be explained and are thus labelled as outliers.

For example, let’s say that you decide to label all the articles beyond the 95th percentile as outliers
since you observe a roughly 20% increase in the number of shares beyond that. Thus, the last article
you include will be the one at the 95th percentile. You remove every article beyond that point from the
data set.

Here is a short explanation to remove outliers in Python.

After treating the outliers using this method, answer the following questions.

You might also like