Welcome to the session on ‘Univariate Analysis’. In the previous sessions, you learnt how to source
data and how to clean it for analysis.
In this session
As the term “univariate” suggests, this session deals with analyzing variables one at a time. It is
important to separately understand each variable before moving on to analyzing multiple variables
together.
Data Description
Given a data set, the first step is to understand what it contains. Information about a data set can be
gained simply by looking at its metadata. Metadata, in simple terms, is data that describes each
variable in detail. Information such as the size of the data set, how and when the data set was created,
what the rows and variables represent, etc. are captured in metadata.
Video: - Metadata
Types of Variables
You learnt the difference between ordered and unordered categorical variables:
Ordered variables have an inherent ordering. Some examples are:
o Salary = high-medium-low
o Month = Jan-Feb-Mar, etc.
Unordered variables have no notion of high-low or more-less. Some examples are:
o Type of loan taken by a person = home, personal, auto, etc.
o Department of a person = sales, marketing, HR, etc.
Apart from the two types of categorical variables, the other most common type is quantitative
variables. These are simply numeric variables that can be added up, multiplied, divided, etc. For
example: salary, number of bank accounts, runs scored by a batsman, the mileage of a car, etc.
But before studying that, think a bit about how you would conduct analysis on such a variable.
It is important to note that rank-frequency plots enable you to extract meaning even from seemingly
trivial unordered categorical variables such as country, name of an artist, name of a github user etc.
The objective here is not to put excessive focus on power laws or rank-frequency plots, but rather to
understand that non-trivial analysis is possible even on unordered categorical variables, and that plots
can help you out in that process.
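As a minimal sketch of this idea (using a made-up column, not the actual employment data), a rank-frequency table for an unordered categorical variable can be built with pandas:

```python
import pandas as pd

# Hypothetical unordered categorical variable, e.g. country of a user
countries = pd.Series(["IN", "US", "IN", "UK", "IN", "US", "DE", "IN", "US", "UK"])

# value_counts() sorts categories by frequency, which is exactly the
# rank-frequency view: rank 1 is the most frequent category, and so on
rank_freq = countries.value_counts()
print(rank_freq)
```

Plotting `rank_freq.values` against ranks 1, 2, 3, … gives the rank-frequency plot; power-law-like behaviour shows up as a roughly straight line when both axes are on a log scale.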
Dataset: - employment-data.csv
Video: - employment-data-analysis
The objective of using a log scale is to make the plot readable by changing the scale. For example, the
first-ranked item had a frequency of 29000, the second-ranked 3500, the seventh 700, and most others
had very low frequencies such as 100, 80, 21, etc. This range of frequencies is too large to fit on the
plot.
Plotting on a log scale compresses the values to a smaller scale which makes the plot easy to read.
This happens because log(x) is a much smaller number than x. For example, log(10) = 1, log(100) = 2,
log(1000) = 3 and so on. Thus, log(29000) is now approx. 4.5, log(3500) is approx. 3.5 and so on. What
was earlier varying from 29000 to 1 is now compressed between 4.5 and 0, making the values easier to
read on a plot.
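The compression described above can be checked directly (the frequencies here are the hypothetical ones quoted in the text):

```python
import numpy as np

# Hypothetical frequencies similar to those quoted above
freqs = np.array([29000, 3500, 700, 100, 80, 21])

# log10 collapses the 29000-to-21 range into roughly 4.46-to-1.32
log_freqs = np.log10(freqs)
print(np.round(log_freqs, 2))
```

In practice you rarely transform the data yourself: in matplotlib, calling `plt.yscale('log')` (or `plt.xscale('log')`) re-scales the axis while leaving the original values intact.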
Before that, think about the type of analysis you would do on ordered categorical variables.
You may have taken power meter readings at least once. It would be interesting to see how meter
readings vary across households. Let us look at a case study that Anand did for a power company.
You saw an example of how a simple histogram can reveal interesting insights. This is an extremely
important takeaway — whenever you have a continuous or an ordered categorical variable, make sure
you plot a histogram or a bar chart and observe any unexpected trends in it.
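As a quick sketch of what a histogram computes (with simulated readings, not the case-study data), NumPy can produce the bar heights a histogram plot would display:

```python
import numpy as np

# Hypothetical continuous variable, e.g. monthly power meter readings
rng = np.random.default_rng(42)
readings = rng.normal(loc=300, scale=50, size=1000)

# np.histogram computes the bar heights a histogram plot would show:
# counts per bin, plus the bin edges
counts, edges = np.histogram(readings, bins=10)
print(counts.sum())  # 1000 – every reading falls into exactly one bin
```

A call like `plt.hist(readings, bins=10)` draws the same bins; unexpected spikes or gaps in those bars are precisely the "unexpected trends" worth investigating.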
Let’s look at a few more examples of univariate analysis revealing hidden patterns.
I’m sure you remember the sleepless nights exams gave you. For a student, the examiner is an
antagonist most of the time, one who prevents you from getting the scores you deserve. You might also
have been intrigued by questions such as: how many students obtained marks similar to yours, how
many were ahead, and how many lagged behind. And everyone has an opinion on when and where
grace marks are justified.
So, let’s use this typical student experience to help you learn univariate analysis. Let us now look at an
interesting analysis that Anand conducted on Class X student marks across Tamil Nadu.
Video: - TN Class X
The dataset given below contains Sachin Tendulkar's ODI batting statistics from about 1989 to 2012.
Download the dataset given below to attempt the following questions.
The variables 'Runs' and 'X4s' respectively represent the number of runs scored and number of 4s hit
by him in each match. These are two ordered categorical variables you'll analyze in this exercise.
'Runs', which should ideally be a numeric variable, contains some entries such as DNB (Did Not Bat),
8* (Not out at 8 runs) etc. You'll need to clean these quality issues.
Similarly, in 'X4s', there are some missing entries represented by "-". You will need to clean them as
well.
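A minimal cleaning sketch in pandas, using a few hypothetical rows that mimic the quality issues described above (the real file is tendulkar_ODI_batting.csv):

```python
import pandas as pd

# Hypothetical rows mimicking the quality issues in the real dataset
df = pd.DataFrame({"Runs": ["10", "DNB", "8*", "45"],
                   "X4s":  ["1",  "0",   "-",  "5"]})

# Drop the 'Did Not Bat' rows, strip the not-out marker '*', convert to int
df = df[df["Runs"] != "DNB"].copy()
df["Runs"] = df["Runs"].str.rstrip("*").astype(int)

# Treat '-' in X4s as missing; errors="coerce" turns it into NaN
df["X4s"] = pd.to_numeric(df["X4s"], errors="coerce")
print(df["Runs"].tolist())  # [10, 8, 45]
```

Whether you drop the resulting NaN rows or keep them depends on the question being asked; for frequency plots of 'X4s', dropping them is usually fine.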
Dataset: - tendulkar_ODI_batting.csv
You have seen how to conduct univariate analysis on categorical variables. Next, you will learn to
conduct univariate analysis on quantitative (or numeric) variables.
Prerequisites
In this segment, Anand will take you through various summary metrics. In case you are not familiar
with the basics of descriptive statistics such as mean, median, mode, and standard deviation, you are
recommended to go through the study material provided on the resources page. Knowledge of these
concepts is essential for this topic and the forthcoming topics in other modules, so make sure you make
yourself familiar before moving on.
While the mean gives an average of all the values, the median gives a typical value that could be used
to represent the entire group. As a simple rule of thumb, question the use of the mean, since the median
is almost always a better measure of ‘representativeness’.
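A small made-up example shows why: one extreme value drags the mean far away from anything ‘typical’, while the median stays put.

```python
import statistics

# Hypothetical salaries (in lakhs); one extreme value skews the mean
salaries = [5, 6, 6, 7, 8, 100]

print(statistics.mean(salaries))    # 22 – pulled up by the single outlier
print(statistics.median(salaries))  # 6.5 – a far more 'typical' salary
```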
Let’s now look at some other summary descriptive statistics such as the mode, the interquartile range,
and the standard deviation.
The standard deviation and the interquartile range (IQR) are both used to represent the spread of the
data. The IQR is a much better metric than the standard deviation if there are outliers in the data. This
is because the standard deviation is influenced by outliers, whereas the IQR, which depends only on
the middle 50% of the values, is largely unaffected by them.
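This robustness is easy to demonstrate on made-up data: appending a single abnormal value inflates the standard deviation dramatically while barely moving the interquartile range.

```python
import numpy as np

# Hypothetical data, then the same data with one abnormal value appended
data = np.array([10, 12, 11, 13, 12, 11, 10, 13, 12, 11])
with_outlier = np.append(data, 100)

# The standard deviation blows up once the outlier is added...
print(np.std(data), np.std(with_outlier))

# ...but the IQR (75th percentile minus 25th) barely moves
q75, q25 = np.percentile(data, [75, 25])
o75, o25 = np.percentile(with_outlier, [75, 25])
print(q75 - q25, o75 - o25)
```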
You also saw how box plots are used to understand the spread of data.
You saw that quartiles are a better measure of the spread than the standard deviation. A good way to
visualise quartiles is to use a box plot. The 25th percentile is represented by the bottom horizontal line
of the box, the 75th is the top line of the box, and the median is the line between them.
In the following questions, let us use the news popularity data set to understand the spread of the
number of shares of a news article across the days of the week, weekends, and types of articles.
You can access the data dictionary of the news popularity data set (to understand the variables in
detail) here.
The box plots given below show the variation of the number of shares of a news article (on the y-axis)
against three variables — weekend (1 implies weekend), weekday (0 implies weekday), and the
channel type. Read the charts carefully and answer the questions that follow.
1. [Box plot: Shares vs Weekend]
2. [Box plot: Shares vs Weekday]
Recall some of the metrics you learnt in this session and answer the questions below.
Summary
In this session, you learnt about the various elements of univariate analysis such as data distribution
plots and summary metrics.
Practice Questions
For this exercise, we will use the news popularity data set again; it is a set of articles published by the
digital media website Mashable over a period of two years.
Popularity Dataset
You can access the data dictionary of the news popularity dataset (to understand the variables in
detail) here.
There are more than 60 attributes in the data set which describe the articles, such as the total number of
words, the number of positive and negative words, the length of the title, the article genre (e.g.
business, travel or entertainment) and, more importantly, the output variable, i.e. the number of shares.
In the following questions, you will attempt to understand the distribution of shares (the number of
times an article was shared on the internet) using measures of spread. Suppose you work for a content
marketing company as an analyst and want to understand the ‘distribution of shares’. Based on your
understanding, you will share with your peers some rough metrics such as the representative number of
shares, the spread of shares, etc.
The following functions and concepts will be helpful for the questions below:
The describe() function is used to obtain the percentile distribution in Python. You may learn more
about this function here.
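As a sketch of what `describe()` reports (on a hypothetical handful of values, not the real 'shares' column):

```python
import pandas as pd

# Hypothetical 'shares' values; the real column comes from the news
# popularity dataset
shares = pd.Series([120, 300, 450, 800, 1200, 5000, 120000])

# describe() reports the count, mean, std, min/max and quartiles
print(shares.describe())

# Extra percentiles via quantile() help reveal where the long tail begins
print(shares.quantile([0.95, 0.99]))
```

You can also pass custom percentiles directly, e.g. `shares.describe(percentiles=[0.9, 0.95, 0.99])`.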
Outliers
Roughly speaking, outliers are abnormally large or small values. There is no single fixed rule for
deciding what counts as an outlier. However, you would have noticed that, going from the 99th to the
100th percentile, there is an approximately 25x increase in the number of shares.
You may also have noted that, in the lower quartiles, there is only about a 5% increase in the number
of shares per percentile. This rises to about 10% per percentile in the higher quartiles and 20% beyond
the 95th percentile. In this case, some articles in the top quartiles are clearly outliers.
To classify some articles as outliers, you may need to consult your client or the business to understand
the reasons behind abnormal values. In some cases, they are justifiable, whereas in others they cannot
be explained and are thus labelled as outliers.
For example, let’s say that you decide to label all the articles beyond the 95th percentile as outliers
since you observe a roughly 20% increase in the number of shares beyond that. Thus, the last article
you include will be the one at the 95th percentile. You remove every article beyond that point from the
data set.
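The cutoff step can be sketched as follows (using a toy 1-to-100 series so the percentile is easy to verify, not the real shares column):

```python
import pandas as pd

# Hypothetical shares column (1..100, so the percentiles are easy to check)
shares = pd.Series(range(1, 101))

# Keep everything up to and including the 95th percentile
cutoff = shares.quantile(0.95)
cleaned = shares[shares <= cutoff]
print(cutoff, len(cleaned))
```

Note that `quantile()` interpolates between values by default, so on real data the cutoff need not equal any observed share count exactly.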
After treating the outliers using this method, answer the following questions.