
Applied Statistics for Decision-Making

The document discusses key concepts in statistics and data analysis. It introduces important terms like population, sample, parameter, statistic, outliers, and data distributions. It explains different measures of central tendency - mode, mean, and median. The mode is the most common value, the mean is the average, and the median divides the data set in half. The mean can be affected by outliers while the median and mode are not. Understanding these concepts is important for analyzing data sets and making decisions.

Uploaded by

devashreechande
Copyright
© © All Rights Reserved

APPLIED STATISTICS FOR DECISION-MAKING

The fascinating world of numbers…

Workbook Part 1
Prof. S. Arvind
NUMBERS ARE IMPORTANT, BUT…
Factfulness by Hans Rosling

I don’t love numbers. I am a huge, huge fan of data, but I don’t love it. It has its limits. I love data only
when it helps me to understand the reality behind the numbers, i.e., people’s lives. In my research, I
have needed the data to test my hypotheses, but the hypotheses themselves often emerged from
talking to, listening to, and observing people. Though we absolutely need numbers to understand the
world, we should be highly skeptical about conclusions derived purely from number crunching.

The prime minister of Mozambique from 1994 to 2004, Pascoal Mocumbi, visited Stockholm in 2002
and told me that his country was making great economic progress. I asked him how he knew that;
after all, the quality of the economic statistics in Mozambique was probably not very good. Had he
looked at GDP per capita?

“I do look at those figures,” he said, “but they are not so accurate. So, I have also made it a habit to
watch the marches on 1st May every year. They are a popular tradition in our country. And I look at
people’s feet and what kind of shoes they have. I know that people do their best to look good on that
day. I know that they cannot borrow their friend’s shoes, because their friend will be out marching, too.
So, I look. And I can see, if they walk barefoot, or if they have bad shoes or if they have good shoes.
And I can compare with what I saw last year.

“Also, when I travel across the country, I look at the construction going on. If the grass is growing over
new foundations, that is bad. But if they keep adding new bricks to the building, then I know people
have money to invest, not just to consume day-to-day.”

A wise prime minister looks at the numbers, but not only at the numbers.

And, of course, some of the most valued and important aspects of human development cannot be
measured in numbers at all. We can estimate suffering from disease using numbers. We can measure
improvements in material living conditions using numbers. But the end goal of economic growth is
individual freedom and culture, and these values are difficult to capture with numbers. The idea of
measuring human progress in numbers seems completely bizarre to many people. I often agree. The
numbers will never tell the full story on what life on Earth is all about.

The world cannot be understood without numbers. But the world cannot be understood with numbers
alone.
TERMS WE NEED TO KNOW…
HISTOGRAM
A bar chart where vertical bars represent the frequency or percentage of the class
(group or individual member).
The variable of interest is displayed along the X-axis.
Note: There should be no gaps between adjacent bars.

DATA DISTRIBUTION OR DISTRIBUTION


A graphical display of statistical data under study.
The variable of interest is displayed along the X-axis.
The frequency or percentage of different values of the variable is displayed along
the Y-axis.
Note: A histogram is one example of a data distribution.
ANALYSING DATA SETS
TERMS WE NEED TO KNOW…
STATISTICS
The branch of mathematics that deals with collecting, analysing,
presenting and interpreting data/information for decision-making.

POPULATION
The set of all the items or individuals of interest in a particular study.
Example: All the students who have graduated from our school since
inception.

SAMPLE
A collection of a portion of the population selected for analysis. Thus, a
sample is a sub-set of the population.
Example: 80 students selected at random from the 4,800 students who
graduated from our school since inception (population).

REPRESENTATIVE SAMPLE
A sample that is expected to be similar to the population in its
characteristics. Inferences based upon the sample are expected to be
reasonably accurate for the population as well.
PICKING SAMPLES AT RANDOM

2-DIGIT RANDOM
NUMBER TABLE

EXCEL FUNCTION:
RANDBETWEEN(lowest, highest)
Example: RANDBETWEEN(00,99)
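The same idea can be sketched outside Excel. The snippet below is a minimal Python illustration, assuming the 4,800 graduates are identified by hypothetical roll numbers 1 to 4800; the fixed seed is also an assumption, used only so that the draw can be repeated.

```python
import random

# Hypothetical roll numbers 1..4800 standing in for the population of graduates.
population = list(range(1, 4801))

rng = random.Random(42)  # fixed seed (an assumption) so the draw is repeatable

# Simple random sample of 80 students, drawn without replacement.
sample = rng.sample(population, k=80)

# Counterpart of RANDBETWEEN(0, 99): both end points are included.
two_digit = rng.randint(0, 99)

print(sorted(sample)[:5], two_digit)
```

Sampling without replacement ensures that no student is picked twice, which a repeated RANDBETWEEN call does not guarantee on its own.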
TERMS WE NEED TO KNOW…
PARAMETER (term starts with ‘P’)
A descriptive measure for a population.
Example: Average marks scored in the first quiz of this [entire] batch of
students.

STATISTIC (term starts with ‘S’)


A descriptive measure for a sample.
Example: Average marks scored in the first quiz of a sample from this
batch of students.
PARAMETER AND STATISTIC

Parameter: Average marks scored by all students of this batch.
Statistic: Average marks scored by a sample of students of this batch.
TERMS WE NEED TO KNOW…
OUTLIER
A data-point/observation whose value lies at an ‘abnormal’ distance from
the rest of the data-set.
Point to ponder over: What is meant by ‘abnormal distance’?
Example: Student marks in an exam that qualify for ‘A+’ or ‘F’ grades.

SKEWNESS
The asymmetry of a data distribution.
INDEX OF SKEWNESS
A measure of the degree of asymmetry of a data distribution.
IDENTIFYING
OUTLIERS

Abnormal
Distance?
UNDERSTANDING DISTRIBUTION
OF DATA
Properties to describe data distribution

1 Central Tendency
2 Dispersion
3 Skewness
CENTRAL TENDENCY

Measures of central tendency

1 Mode
2 Mean (arithmetic)
3 Median
CENTRAL TENDENCY - MODE
The Mode is the score or qualitative category that
occurs most frequently in the data set.

Example:
On studying trends over three years in the sales of
white shirts of sizes 35”, 37” 39” and 41”, a garment
manufacturer finds that 37” size is the fastest moving,
accounting for 43% of total sales.

THE MODE TELLS US WHAT IS MOST TYPICAL.


Examples:
1 Most families in Europe have four members.
2 Graduates’ favourite colour in 2021 is maroon.
Important for Marketing!
[Figure label: 63 years]
CENTRAL TENDENCY - MODE
NOTE
The Mode is usually arrived at by inspection, not
computation.

Example of grades in a class:


B, A, B, B, C, A, B, B, C, B, A, B, B

VALUE FREQUENCY
A 3
B 8
C 2
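Although the Mode is usually found by inspection, the frequency table above can also be tallied programmatically. A minimal Python sketch follows; the helper name `modes` is an illustrative choice, and it returns every value that ties for the highest frequency.

```python
from collections import Counter

grades = ["B", "A", "B", "B", "C", "A", "B", "B", "C", "B", "A", "B", "B"]

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

frequency = Counter(grades)   # tallies A: 3, B: 8, C: 2
print(frequency)
print(modes(grades))          # ['B']
```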
Quiz 1:
Can a data set have more than one Mode?
Quiz 2:
Can a data set have no Mode?

Every value
has the same
frequency!
Quiz 3:
Can the Mode be an extreme value?

Data Set
A+ A+ A+ A B+ B+ B C+ C+ C D

The concept of ‘Central Tendency’?


CENTRAL TENDENCY - MEAN
The (Arithmetic) Mean is the sum of scores divided by
the number of scores.

It usually represents the “average”.

Caution:
Check if the mean is to be computed without or with
weights (simple versus weighted mean).
If weights are assigned, we would be computing the
‘weighted mean’ or ‘weighted average’.
Note – Recent values are usually given more
importance than older ones.
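To make the simple versus weighted distinction concrete, here is a small Python sketch. The three monthly sales figures and the weights 1, 2, 3 (heavier for more recent months) are hypothetical values chosen only for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value x weight divided by the sum of weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

sales = [100, 120, 140]   # hypothetical: oldest month first
weights = [1, 2, 3]       # hypothetical: recent months count more

simple = sum(sales) / len(sales)          # 120.0
weighted = weighted_mean(sales, weights)  # (100 + 240 + 420) / 6

print(simple, round(weighted, 2))         # 120.0 126.67
```

Because the most recent value is the largest and carries the largest weight, the weighted mean lies above the simple mean.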
CENTRAL TENDENCY - MEAN
NOTE 1
A numerical data set has only one Mean.

NOTE 2
The Mean cannot be applied to qualitative data.
CENTRAL TENDENCY - MEDIAN
The Median is the single point in a distribution
that divides the data set into two groups of equal
frequency.

It represents the “middlemost” point of the distribution.

Caution:
The median can often be observed; sometimes it
has to be computed.
Median value is not affected by extreme values [outliers].
CENTRAL TENDENCY - MEDIAN
NOTE
In an ordered data set,

If n is odd, the median is the [(n + 1)/2]th score from either end of the line.

If n is even, the median is the midway point (mean) between the (n/2)th score and the [(n/2) + 1]th score from either end of the line.
CENTRAL TENDENCY - MEDIAN
Example: n = 7
Values – 8, 13, 15, 19, 26, 33, 37
Median is the [(n + 1)/2]th value: 4th value = 19

Example: n = 8
Values – 8, 13, 15, 19, 26, 33, 37, 43
Median is the mean of the middle two values
= Mean of 4th and 5th values = 22.5
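The odd and even cases can be checked with a short Python sketch. The function below is an illustrative implementation of the rule for ordered data, not a prescribed method.

```python
def median(values):
    """Median of a data set: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (n + 1)/2-th value
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of n/2-th and (n/2 + 1)-th

print(median([8, 13, 15, 19, 26, 33, 37]))        # 19
print(median([8, 13, 15, 19, 26, 33, 37, 43]))    # 22.5
```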
COMPARING MEAN & MEDIAN

Outliers, if present.

Mean divides the data set into two equal parts by Value: the deviations below the mean add up to the same total as the deviations above it.
Mean is affected by the presence of extreme values [outliers].

Median divides the data set into two equal parts by Frequency
[number of data points].
Median is unaffected by the presence of extreme values [outliers].
LOCATING CENTRAL TENDENCIES

A data distribution may have multiple Modes, but can have only one Mean and one Median.
MEAN, MEDIAN & MODE
MODE # Applies to both quantitative & qualitative data.
# A data set may contain no modes.
# A data set may contain multiple modes.
# The mode may be an extreme data point.
Use to determine the most ‘typical’ occurrence.
MEAN # Applies to quantitative data only.
# A data set can contain only one mean.
# Dependent on the value of each data point.
# The mean is affected by extreme values.
Divides the data set into two equal parts by Value.
MEDIAN # Applies to quantitative data (and to qualitative data that can be ranked).
# A data set can contain only one median.
# Depends on the rank order of the data points, not on the value of every data point.
# The median is not affected by extreme values.
Divides the data set into two equal parts by Number
of Values.
LOCATING CENTRAL TENDENCIES

For Symmetrical Distributions,


Mean = Median = Mode (unimodal cases)
LOCATING CENTRAL TENDENCIES

Significant tail on left Equal tails Significant tail on right


LOCATING CENTRAL TENDENCIES

Negatively or Left Skewed Distribution:


Mean is ‘pulled’ to the left by data points in the significant tail.
Median and Mode are unaffected by the values of the data points in the tail.
LOCATING CENTRAL TENDENCIES

Positively or Right Skewed Distribution:


Mean is ‘pulled’ to the right by data points in the significant tail.
Median and Mode are unaffected by the values of the data points in the tail.
LOCATING CENTRAL TENDENCIES

Mean & Median

What’s that?
PRACTICAL WORKING RULE
When dealing with a skewed quantitative
distribution, consider using Median instead
of Mean for the average.
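The working rule can be illustrated with a tiny Python sketch. The income figures below are hypothetical (say, in thousands per year); a single extreme value is added to create a long right tail.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]       # hypothetical, roughly symmetric
with_outlier = incomes + [1000]      # one extreme value in the right tail

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)            # 35 35
print(round(mean_after, 2), median_after)    # 195.83 36.5
```

One extreme value drags the mean from 35 to about 196, while the median barely moves, which is why the median is the safer "average" for skewed data.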
WHAT WOULD THE DATA DISTRIBUTION OF USAIN BOLT’S
100 m TIMINGS LOOK LIKE (LAST 50 RACES)?
THE 1% SUPER-RICH
In this case, are the values being considered or the number of values?
SAMPLE EXERCISE
30 applicants for a driving license scored the following marks in
the written test (in ascending order).

C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
R1 4 6 6 9 11 11 13 16 17 19
R2 23 27 28 31 32 35 37 37 42 43
R3 46 47 51 53 54 58 59 64 72 78

The driving school issues grades from A (highest) to F (lowest), where A and F grades are awarded to outliers on a relative scale.

Required of you:
1) Compute the Mode, Mean and Median of the data set.
2) Identify the outliers in the data set.
TERMS WE NEED TO KNOW…
Dispersion/Variation/Variability of a Data Distribution

[Figure: two histograms of the variable of interest (height of students). Panel “Degrees of variation”: heights spread across 5’ 0”, 5’ 3”, 5’ 6”, 5’ 9”, 6’ 0”, 6’ 3” and 6’ 6”. Panel “Zero variation”: a single bar at 5’ 9” with frequency = 16.]
TERMS WE NEED TO KNOW…
Dispersion/Variation/Variability of a Data Distribution

RANGE
Difference between the largest and smallest values in the data set.
Note: a) Considers only extreme values; ignores all others.
b) Is never negative (it is zero when all values are equal).
c) Simple to compute.

VARIANCE
Is the average (mean) of the squared deviations of observations from the
mean in the data set.
Note: a) The higher the variance, the more widely the data are dispersed around the mean.
b) If all observations in the data set have the same value, the
variance is zero.
c) Its unit of measure is the square of the data’s unit (e.g., marks squared), which is hard to interpret.

STANDARD DEVIATION
Is the square root of variance.
Note: a) Its unit of measure is the same as the average and deviation.
b) Considers all the data points in its computation.
NOTATIONS
For Mean and Standard Deviation

                      POPULATION [Greek]    SAMPLE [English]
Mean                  µ                     Xbar
Standard Deviation    σ                     S
Variance              σ²                    S²
FORMULAE FOR VARIANCE

For data relating to a population,

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

For data relating to a sample,

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

$x_i$ is the value of each data point in the data set.
$N$ is the population size.
$n$ is the sample size.
$\mu$ is the mean of the population (parameter).
$\bar{x}$ (Xbar) is the mean of the sample (statistic).
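The two variance formulae differ only in the divisor. The Python sketch below applies both to one small hypothetical data set so the effect of dividing by N versus (n – 1) can be seen side by side; the data values are made up for illustration.

```python
from math import sqrt

def population_variance(values):
    """Mean of squared deviations from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Squared deviations from the sample mean, divided by (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("a sample needs at least two data points")
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5

pop_var = population_variance(data)   # 32 / 8 = 4.0
samp_var = sample_variance(data)      # 32 / 7

print(pop_var, sqrt(pop_var))         # 4.0 2.0
print(round(samp_var, 3))             # 4.571
```

Treating the same eight values as a sample rather than a population raises the variance from 4.0 to about 4.571.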
WHY USE (n – 1) FOR SAMPLES?

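Bessel’s Correction
Deviations are measured from the sample mean (Xbar) rather than from the unknown population mean (µ), so the sum of squared deviations tends to understate the true spread. Dividing by (n – 1) instead of n compensates for this sampling error (the sample not accurately representing the population).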
DEGREES OF FREEDOM
The number of variables in a system that are free to
vary without violating any constraint.

In other words, how many variables does one need to specify in order to define the system?

To locate a point in space, df = 3


FORMULAE FOR STANDARD DEVIATION

To compute the standard deviation of a data set, first compute the variance.
Then compute the square root of variance.

For data relating to a population,

Standard deviation $\sigma = \sqrt{\sigma^2}$

For data relating to a sample,

Standard deviation $s = \sqrt{s^2}$

1 To compute Standard Deviation, we need to first compute the Variance.

2 Two or more Standard Deviations cannot be added to compute the combined value.
COEFFICIENT OF VARIATION
COV = SD/MEAN

Variance and Standard Deviation : Absolute measures of data dispersion

Example: Scores in two exams

                            Statistics           Research Methods
Standard deviation          16%                  13%
Mean                        76%                  58%
Coefficient of Variation    16%/76% = 0.211      13%/58% = 0.224

RELATIVE MEASURES ARE USUALLY MORE


MEANINGFUL THAN ABSOLUTE MEASURES.
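The comparison of the two exams can be reproduced with a few lines of Python. This is only a sketch; the function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative dispersion: standard deviation as a fraction of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean

cv_statistics = coefficient_of_variation(16, 76)
cv_research = coefficient_of_variation(13, 58)

print(round(cv_statistics, 3))   # 0.211
print(round(cv_research, 3))     # 0.224
```

Statistics has the larger standard deviation in absolute terms, yet Research Methods shows slightly more variation relative to its mean.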
COEFFICIENT OF VARIATION

EXAMPLE:
Estimated profit in a project [3 years] = $10,000,000
Standard Deviation of expected profit = $16,000

Coefficient of Variation = 0.16%

Significant?
COMBINING TWO DATA SETS

MEANS AND STANDARD DEVIATIONS


ARE NOT ADDITIVE!
FOR POPULATIONS

Population 1
Number of data points = N1
Mean = µ1
Standard Deviation = σ1

Population 2
Number of data points = N2
Mean = µ2
Standard Deviation = σ2

Combined Mean (weighted average)

$$\mu = \frac{N_1 \mu_1 + N_2 \mu_2}{N_1 + N_2}$$
FOR POPULATIONS
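Combined Variance (weighted average)

$$\sigma^2 = \frac{N_1 \sigma_1^2 + N_2 \sigma_2^2 + N_1 (\mu_1 - \mu)^2 + N_2 (\mu_2 - \mu)^2}{N_1 + N_2}$$

Re-written as:

$$\sigma^2 = \frac{N_1 \left[\sigma_1^2 + (\mu_1 - \mu)^2\right] + N_2 \left[\sigma_2^2 + (\mu_2 - \mu)^2\right]}{N_1 + N_2}$$

Here $\mu$ is the combined mean of the two populations.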
FOR SAMPLES
Sample 1
Number of data points = n1
Mean = X1bar
Standard Deviation = S1

Sample 2
Number of data points = n2
Mean = X2bar
Standard Deviation = S2

Combined Mean (weighted average)

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
FOR SAMPLES
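Combined Variance (weighted average)

$$s^2 = \frac{n_1 s_1^2 + n_2 s_2^2 + n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2}{n_1 + n_2}$$

Re-written as:

$$s^2 = \frac{n_1 \left[s_1^2 + (\bar{x}_1 - \bar{x})^2\right] + n_2 \left[s_2^2 + (\bar{x}_2 - \bar{x})^2\right]}{n_1 + n_2}$$

Here $\bar{x}$ (Xbar) is the combined mean of the two samples.

Note: this mirrors the population formula. If $s_1^2$ and $s_2^2$ were computed with the (n – 1) divisor, it is only an approximation; the exact variance of the pooled sample is

$$s^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 + n_1 (\bar{x}_1 - \bar{x})^2 + n_2 (\bar{x}_2 - \bar{x})^2}{n_1 + n_2 - 1}$$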
EXERCISE: COMBINING TWO DATA SETS
Students’ marks for the final exam of Marketing Management for the first
two batches of EMBA are shown below.

Required of you:
1 Compute the mean and standard deviation of the marks for the two
batches separately.
2 Compute the mean and standard deviation of the two batches combined.

BATCH 1 [N = 14]
73 67 68 74 70 72 73 72
75 73 75 69 69 56

BATCH 2 [N = 12]
74 50 72 69 71 72 72 68
72 70 70 68
EXERCISE: COMBINING TWO DATA SETS

BATCH 1 [N = 14]:
Total = 986 Mean = 70.43 Std. Dev. = 4.70

BATCH 2 [N = 12]:
Total = 828 Mean = 69.00 Std. Dev. = 5.99

COMBINED MEAN:
[986 + 828]/26 = 69.77

COMBINED VARIANCE:
{14 x [4.70² + (70.43 – 69.77)²] + 12 x [5.99² + (69.00 – 69.77)²]}/(14 + 12)
= 28.96

COMBINED STANDARD DEVIATION:
√28.96 = 5.38
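As a cross-check of this worked solution, the Python sketch below applies the combined-mean and combined-variance formulae for populations to the two batches and compares the result with the variance computed directly on all 26 marks. The function name `combine` is an illustrative choice. Because the code carries unrounded intermediate values, the combined variance comes out near 28.95 rather than the 28.96 obtained from rounded inputs.

```python
from math import sqrt
from statistics import fmean, pvariance

batch1 = [73, 67, 68, 74, 70, 72, 73, 72, 75, 73, 75, 69, 69, 56]
batch2 = [74, 50, 72, 69, 71, 72, 72, 68, 72, 70, 70, 68]

def combine(n1, mean1, var1, n2, mean2, var2):
    """Combined mean and variance of two populations from their summaries."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    var = (n1 * (var1 + (mean1 - mean) ** 2)
           + n2 * (var2 + (mean2 - mean) ** 2)) / n
    return mean, var

combined_mean, combined_var = combine(
    len(batch1), fmean(batch1), pvariance(batch1),
    len(batch2), fmean(batch2), pvariance(batch2),
)
direct_var = pvariance(batch1 + batch2)   # variance of all 26 marks at once

print(round(combined_mean, 2))            # 69.77
print(round(sqrt(combined_var), 2))       # 5.38
```

The formula and the direct computation agree, which confirms that the between-batch terms are what plain averaging of the two variances would miss.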
CLASSIFICATION OF DATA

DATA

QUALITATIVE DATA QUANTITATIVE DATA

CONTINUOUS DISCRETE
DATA DATA
CLASSIFICATION OF DATA
CONTINUOUS DATA
As the name suggests, with continuous data, the variable can take any
value in a defined range without any breaks. In other words, there are
an ‘infinite’ number of possible values the variable can take.

Continuous data should be measured (not counted). To specify the value of a variable using continuous data, we should ask the question,
“How much?”

Examples: Measurement of weight, height, temperature, purity

DISCRETE DATA
Discrete classes of data involve breaks between the classes. Usually, the
number of possible values the variable can take is limited and known.

Discrete data should be counted (not measured). To specify the value of a variable using discrete data, we should ask the question,
“How many?”

Examples: Number of students present in class, number of textbooks prescribed for a course, seats on a plane
EXERCISES ON CENTRAL TENDENCY AND DISPERSION OF DATA

Exercise 1
These data are a sample of the daily production rate of fiberglass boats from Hydrosport
Limited.

17 21 18 27 17 21 20 22 18 23

The company’s production manager feels that a standard deviation of more than three
boats per day indicates unacceptable production-rate variations. Should she be
concerned based on the above data?

Exercise 2
The reading readiness of pre-school children in two neighbourhoods was determined
through sampling and the data sets are displayed below.

a) Compute the mode, mean and median for each neighbourhood separately.
b) Compute the range and standard deviation for each neighbourhood separately.
c) Are the two distributions symmetrical or skewed?

Neighbourhood A
30 33 32 31 35 33
32 29 33 30 32 28
31 31 29 31 26 30
32 30 33 32 27 32

Neighbourhood B
29 32 28 29 29
30 31 26 30 28
28 29 29 34 30
29 27 30 31 35

Exercise 3
The head chef of The Flying Taco has just received two dozen tomatoes from her
supplier, but she isn’t ready to accept them. She knows from the invoice that the
average weight of a tomato is 7.5 ounces, but she insists that all be of uniform weight.
She will accept them only if the average weight is 7.5 ounces and the standard
deviation is less than 0.5 ounce.

Based on the weights of the tomatoes, determine what the head chef’s decision is.

6.3 7.2 7.3 8.1 7.8 6.8 7.5 7.8 7.2 7.5 8.1 8.2
8.0 7.4 7.6 7.7 7.6 7.4 7.5 8.4 7.4 7.6 6.2 7.4

RANGE AND STANDARD DEVIATION
When should one use each of these measures?

Exercise 4
A population with 20 numbers drawn at random from 0 to 99 has been tabulated below.

17 63 23 84 6 47 38 29 73 19
61 84 92 81 43 4 13 28 89 38

Part A
Compute the range and standard deviation of the population.

Part B
From the population, a random sample of five numbers is drawn, as tabulated below.

92 81 84 6 4

Compute the range and standard deviation of the sample.

Do you find the computed values strange?

CASE ANALYSIS
Life Expectancy in Top 10 Countries

RANK   COUNTRY   HEALTHY YEARS   YEARS IN ILL-HEALTH   LIFE EXPECTANCY
1 Singapore 73.62 10.11 83.73
2 Japan 73.16 10.78 83.94
3 Spain 72.62 10.35 82.97
4 Switzerland 71.93 11.25 83.18
5 Italy 71.75 10.59 82.34
6 France 71.71 10.63 82.34
7 Australia 71.53 10.99 82.52
8 Norway 71.49 10.61 82.10
9 Iceland 71.48 10.79 82.27
10 Israel 71.44 10.70 82.14

Source: Global Burden of Disease Study, Institute for Health Metrics and Evaluation,
University of Washington

Comment on the data presented above.


UNDERSTANDING THE WORLD AS FOUR LEVELS
Factfulness by Hans Rosling

I am often quite rude during my presentations when people from the “developed world” use the
term “developing world”.

Afterward, people ask me, “So, what should we call them instead?”

But, listen carefully. It’s the same misconception: we and them. What should “we” call “them”
instead?

What we should do is stop dividing the countries of the world into two groups. It doesn’t make
sense anymore. It doesn’t help us to understand the world in a practical way. It doesn’t help
businesses find opportunities, and it doesn’t help aid money to find the poorest people.

But we need to do some kind of sorting to make sense of the world. We can’t give up our old
labels and replace them with… nothing. What should we do?

One reason the old labels are so popular is that they are so simple. But they are wrong! So, to
replace them, I will now suggest an equally simple but more relevant and useful way of dividing
up the world. Instead of dividing the world into two groups, I will divide it into four income levels,
as described below.

Each figure in the chart represents 1 billion people, and the seven figures show how the current
world population is spread out across four income levels, expressed in terms of dollar income
per day. You can see that most people are living in the two middle levels, where people have
most of their basic human needs met.

Are you excited? You should be. Because the four income levels are the first, most important
part of your new fact-based framework. They are one of the simple thinking tools I promised
would help you to guess better about the world. So, I want to try to explain what life is like on
each of these four levels.

Think of the four income levels as the levels of a computer game. Everyone wants to move from
Level 1 to Level 2 and upward through the levels from there. Only, it’s a very strange computer
game, because Level 1 is the hardest. Let’s play.
LEVEL 1
You start on Level 1 with $1 per day. Your five children have to spend hours walking barefoot
with your single plastic bucket, back and forth, to fetch water from a dirty mud hole an hour’s
walk away. On their way home, they gather firewood, and you prepare the same grey porridge
that you have been eating at every meal, every day, for your whole life – except during the
months when the meagre soil yielded no crops and you went to bed hungry. One day, your
youngest daughter develops a nasty cough. Smoke from the indoor fire is weakening her lungs.
You cannot afford antibiotics, and one month later, she is dead. This is extreme poverty. Yet,
you keep struggling on. If you are lucky, and the yields are good, you can maybe sell some
surplus crops and manage to earn more than $2 a day, which would move you to the next level.
Good luck!
[Roughly 1 billion people live like this today.]

LEVEL 2
You’ve made it. In fact, you’ve quadrupled your income and now you earn $4 a day. Three extra
dollars every day. What are you going to do with all this money? Now you can buy food that you
didn’t grow yourself, and you can afford chickens, which means eggs. You save some money
and buy sandals for your children, and a bike, and more plastic buckets. Now it takes you only
half an hour to fetch water for the day. You buy a gas stove so your children can attend school
instead of gathering wood. When there’s power, they can do their homework under a bulb. But
the electricity is too unstable for a freezer. You save up for mattresses so you don’t have to
sleep on the mud floor. Life is much better now, but still very uncertain. A single illness and you
would have to sell most of your possessions to buy medicine. That would throw you back to
Level 1 again. Another three dollars a day would be good, but to experience really drastic
improvement, you need to quadruple again. If you can land a job in the local garment industry,
you will be the first member of your family to bring home a salary.
[Roughly 3 billion people live like this today.]

LEVEL 3
Wow! You did it! You work multiple jobs, 16 hours a day, seven days a week, and manage to
quadruple your income again, to $16 a day. Your savings are impressive and you install a cold-
water tap. No more fetching water. With a stable electric line, the kids’ homework improves and
you can buy a fridge that lets you store food and serve different dishes each day. You save to
buy a motorcycle, which means you can travel to a better-paying job at a factory in town.

Unfortunately, you crash on your way there one day, and you have to use money you had saved
for your children’s education to pay the medical bills. You recover, and thanks to your savings,
you are not thrown back a level. Two of your children start high school. If they manage to finish,
they will be able to get better-paying jobs than you have ever had. To celebrate, you take the
whole family on its first-ever vacation, one afternoon to the beach, just for fun.
[Roughly, 2 billion people live like this today.]

LEVEL 4
You have more than $32 a day. You are a rich consumer and three more dollars a day makes
very little difference to your everyday life. That’s why you think three dollars, which can change
the life of someone living in extreme poverty, is not a lot of money. You have more than 12
years of education and you have been on an airplane on vacation. You can eat out once a
month and you can buy a car. Of course, you have hot and cold water indoors.

But you know about this level already. Since you are reading this passage, I’m pretty sure you
live in Level 4. I don’t have to describe it for you to understand. The difficulty, when you have
always known this high level of income, is to understand the huge differences between the other
three levels. People on Level 4 must struggle hard not to misunderstand the reality of the other
6 billion people in the world.
[Roughly, 1 billion people live like this today.]

I’ve described the progress up the levels as if one person managed to move through several
levels. That is very unusual. Often, it takes several generations for a family to move from Level
1 to Level 4. I hope though that you now have a clear picture of the kinds of lives people live on
different levels; a sense that it is possible to move through the levels, both for individuals and for
countries; and above all the understanding, that there are not just two kinds of lives.

Human history started with everyone on Level 1. For more than 100,000 years, nobody made it
up the levels and most children didn’t survive to become parents. Just 200 years ago, 85% of
the world population was still on Level 1, in extreme poverty.

Today, the vast majority of people are spread out in the middle, across Levels 2 and 3, with the
same range of standards of living as people in Western Europe and North America in the 1950s.
And this has been the case for many years.

The gap instinct of classifying all data into two categories is very strong. The first time I lectured
to the staff of the World Bank was in 1999. I told them the labels “developing” and “developed”
were no longer valid. It took the World Bank 17 years and 14 more of my lectures before it
finally announced publicly that it was dropping these terms and would from now on divide the
world into four income groups. The UN and most other global organisations have still not made
this change.

QUARTILES, DATA SET SUMMARIES AND OUTLIERS

QUARTILES
Often, to understand and analyse a set of data, it is useful to divide the set into four equal parts so
that each part contains about 25% of the values. Each of the three dividing points (three dividers are
required for four parts) is called a ‘quartile’ and is defined as follows:

First Quartile, Q1 (25th percentile)


Divides the smallest 25% of the values from the other 75% that are larger.
Q1 = the value ranked (n + 1)/4 in the ordered data set

Second Quartile, Q2 (50th percentile) – also the median


Divides the data set so that 50% of the values are smaller than or equal to the median and 50% are
larger than or equal to the median.

Third Quartile, Q3 (75th percentile)


Divides the smallest 75% of the values from the largest 25%.
Q3 = the value ranked 3 x (n + 1)/4 in the ordered data set

INTERQUARTILE RANGE (MIDSPREAD)


The interquartile range or midspread is the difference between the third and first quartiles in a data
set. It measures the spread of the middle 50% of the values in the data set. Therefore, it is not
influenced by extreme values (which may be ‘outliers’).
Interquartile range [IQR] = Q3 – Q1

EXAMPLE
A data set consists of 21 recordings, which are shown in the table below. Compute the following:
1 The first, second and third quartiles
2 The interquartile range

8408 1374 1872 8879 2459 11413 608


14138 6452 1850 2818 1356 10498 7478
4019 4341 739 2127 3653 5794 8305

Solution
As the first step, we arrange the data set in ascending order as shown below.
608 739 1356 1374 1850 1872 2127
2459 2818 3653 4019 4341 5794 6452
7478 8305 8408 8879 10498 11413 14138

Median or Q2 is the 11th value: 4019


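Q1 is the value ranked (n + 1)/4 = 5.5, i.e., midway between the 5th and 6th values: (1850 + 1872)/2 = 1861
Q3 is the value ranked 3 x (n + 1)/4 = 16.5, i.e., midway between the 16th and 17th values: (8305 + 8408)/2 = 8356.50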
Interquartile range [IQR]: [8356.50 - 1861] = 6495.50
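The ranked-value rule can be sketched in Python as follows. One assumption goes beyond the text: when the rank is fractional, the sketch interpolates linearly between the two neighbouring values, which reproduces the midpoints used in this example (ranks 5.5 and 16.5) but is only one of several conventions in use.

```python
def ranked_value(ordered, rank):
    """Value at a 1-based, possibly fractional rank (linear interpolation)."""
    if rank < 1 or rank > len(ordered):
        raise ValueError("rank must lie between 1 and n")
    lower = int(rank)              # whole part of the rank
    fraction = rank - lower        # fractional part, e.g. 0.5
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

def quartiles(values):
    """Q1, Q2 and Q3 using ranks (n + 1)/4, (n + 1)/2 and 3(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    return tuple(ranked_value(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

recordings = [8408, 1374, 1872, 8879, 2459, 11413, 608,
              14138, 6452, 1850, 2818, 1356, 10498, 7478,
              4019, 4341, 739, 2127, 3653, 5794, 8305]

q1, q2, q3 = quartiles(recordings)
iqr = q3 - q1
print(q1, q2, q3, iqr)   # 1861.0 4019 8356.5 6495.5
```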

Page | 1
FIVE-NUMBER SUMMARY
It is often convenient to summarise a data set by specifying five numbers:
1 Smallest value
2 First quartile, Q1
3 Second quartile (median), Q2
4 Third quartile, Q3
5 Largest value

In the previous example, the five numbers (in sequence) are:


608, 1861, 4019, 8356.50, 14138

The five-number summary is useful in judging the shape of the data distribution, as described
below.

COMPARISON 1: Distance from Xsmallest to median versus distance from median to Xlargest
  Left-skewed distribution: Distance from Xsmallest to median is greater than distance from median to Xlargest
  Symmetric distribution: The two distances are the same
  Right-skewed distribution: Distance from Xsmallest to median is less than distance from median to Xlargest

COMPARISON 2: Distance from Xsmallest to Q1 versus distance from Q3 to Xlargest
  Left-skewed distribution: Distance from Xsmallest to Q1 is greater than distance from Q3 to Xlargest
  Symmetric distribution: The two distances are the same
  Right-skewed distribution: Distance from Xsmallest to Q1 is less than distance from Q3 to Xlargest

COMPARISON 3: Distance from Q1 to median versus distance from median to Q3
  Left-skewed distribution: Distance from Q1 to median is greater than distance from median to Q3
  Symmetric distribution: The two distances are the same
  Right-skewed distribution: Distance from Q1 to median is less than distance from median to Q3

In the previous example,

Comparison 1 [Entire range of data set]


[Median - Xsmallest] = [4019 – 608] = 3411
[Xlargest – Median] = [14138 – 4019] = 10119
This is right-skewed.

Comparison 2 [Tails]
[Q1 - Xsmallest] = [1861 – 608] = 1253
[Xlargest – Q3] = [14138 – 8356.50] = 5781.50
This is right-skewed.

Comparison 3 [Body]
[Median – Q1] = [4019 – 1861] = 2158
[Q3 – Median] = [8356.50 – 4019] = 4337.50
This is right-skewed.

Thus, from the five-number summary, we can judge the shape of the data set: is it symmetric or
skewed to the left or right.
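The three comparisons can be expressed as a short Python sketch. The function names and the dictionary keys are illustrative choices; note that the sketch reports "symmetric" only when the two distances are exactly equal, whereas with real data one would normally allow some tolerance.

```python
def compare(left_distance, right_distance):
    """Classify one comparison from the five-number summary."""
    if left_distance > right_distance:
        return "left-skewed"
    if left_distance < right_distance:
        return "right-skewed"
    return "symmetric"   # exact equality only; no tolerance applied

def shape_checks(smallest, q1, median, q3, largest):
    """Apply Comparisons 1 to 3 (entire range, tails, body)."""
    return {
        "entire range": compare(median - smallest, largest - median),
        "tails": compare(q1 - smallest, largest - q3),
        "body": compare(median - q1, q3 - median),
    }

checks = shape_checks(608, 1861, 4019, 8356.50, 14138)
print(checks)
```

For the example data set all three comparisons return "right-skewed", matching the conclusions reached by hand.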

Page | 2
IDENTIFYING OUTLIERS
An objective method of identifying outliers in a data set is to compute threshold values on either side
of the median that are called ‘limits’. Data values that lie outside these two limits (Lower Limit and
Upper Limit) are identified as outliers.

We use the following formulae to compute the limits:


Lower Limit = Q1 – 1.5 x [Interquartile Range] = Q1 – 1.5 x [Q3 – Q1]
Upper Limit = Q3 + 1.5 x [Interquartile Range] = Q3 + 1.5 x [Q3 – Q1]

In our example,
Lower Limit = 1861 – 1.5 x [8356.50 – 1861] = - 7882.25
Upper Limit = 8356.50 + 1.5 x [8356.50 – 1861] = + 18099.75

Since none of the data points lies outside this range, we conclude that the data set does not contain
outliers on either side of the median.
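The limit calculation lends itself to a small Python sketch. The function names are illustrative, and the quartile values are taken as given from the example rather than recomputed.

```python
def outlier_limits(q1, q3, k=1.5):
    """Lower and upper limits: Q1 - k x IQR and Q3 + k x IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Data points that fall outside the two limits."""
    lower, upper = outlier_limits(q1, q3)
    return [v for v in values if v < lower or v > upper]

recordings = [608, 739, 1356, 1374, 1850, 1872, 2127,
              2459, 2818, 3653, 4019, 4341, 5794, 6452,
              7478, 8305, 8408, 8879, 10498, 11413, 14138]

lower, upper = outlier_limits(1861, 8356.50)
outliers = find_outliers(recordings, 1861, 8356.50)

print(lower, upper)   # -7882.25 18099.75
print(outliers)       # []
```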

THE BOXPLOT
A boxplot provides a graphical representation of the data based on the five-number summary. It
looks like this.

The ‘box’ refers to the interquartile range (Q3 – Q1), which also contains the median Q2. The lines
on either side of the box are referred to as ‘whiskers’. The ‘box and whiskers plot’ may be drawn in
three different ways:

1 Until the extreme values (minimum and maximum) in the data set
2 Until the lower and upper limits as computed in the previous section (as shown above)
3 Until the extreme values (minimum and maximum) in the data set that lie within the lower and
upper limits computed in the previous section (as shown in the next page).

Outliers are simply data points that lie outside the lower and upper limits in the plot.

Page | 3
The shape of the data set can be determined from the box and whiskers plot, as discussed on page
2. Applying Comparison 2 from the table on page 2 to the plot below, we can infer that the data set
is skewed to the right.

Page | 4
EXERCISE 1 ON QUARTILES, OUTLIERS & BOX-AND-WHISKERS PLOT

Naples, Florida, hosts a marathon in January each year. The event attracts top runners
from across the United States. In January last year, 22 men and 31 women entered the
19-24 age class.
Finish times in minutes were recorded as shown in the table below (in order of finish).

FINISH MEN WOMEN


1 65.30 109.03
2 66.27 111.22
3 66.52 111.65
4 66.85 111.93
5 70.87 114.38
6 87.18 118.33
7 96.45 121.25
8 98.52 122.08
9 100.52 122.48
10 108.18 122.62
11 109.05 123.88
12 110.23 125.78
13 112.90 129.52
14 113.52 129.87
15 120.95 130.72
16 127.98 131.67
17 128.40 132.03
18 130.90 133.20
19 131.80 133.50
20 138.63 136.57
21 143.83 136.75
22 148.70 138.20
23 139.00
24 147.18
25 147.35
26 147.50
27 147.75
28 153.88
29 154.83
30 189.27
31 189.28

Required of you:
1 Prepare the five-number summary for men and women separately.
2 Draw the box-and-whiskers plots for men and women separately on a single graph
sheet. The end of the whiskers can be taken as the lower and upper limits for
identifying outliers.
3 Determine whether the two data sets (men and women) are symmetric or skewed.
4 Identify outliers, if any, in the two data sets.
Page | 5
EXERCISE 2 ON QUARTILES, OUTLIERS & BOX-AND-WHISKERS PLOT

ASSETS UNDER MANAGEMENT OF TOP 25 ASSET MANAGERS GLOBALLY

The world’s largest money managers ranked by total assets under management [AUM]
in billions of US$ as on 31st December 2020.

RANK COMPANY COUNTRY TOTAL AUM [$ in bn]

1 Blackrock US 7,318
2 Vanguard Group US 6,100
3 UBS Group Switzerland 3,518
4 Fidelity Investments US 3,319
5 State Street Global Advisors US 3,054
6 Allianz Group Germany 2,530
7 JP Morgan US 2,511
8 Goldman Sachs US 2,057
9 Bank of New York Mellon US 1,961
10 PIMCO US 1,920
11 Morgan Stanley US 1,901
12 Amundi France 1,791
13 Capital Group US 1,700
14 Prudential Financial US 1,605
15 Credit Suisse Switzerland 1,521
16 Franklin Resources US 1,428
17 Deutsche Bank Germany 1,368
18 Northern Trust US 1,258
19 Legal & General Trust UK 1,232
20 BNP Paribas France 1,221
21 Bank of America US 1,220
22 T. Rowe Price US 1,218
23 Invesco Limited US 1,145
24 TIAA US 1,143
25 Wellington Management Co. US 1,100

Required of you:
1 Prepare the five-number summary for this data set.
2 Draw the box-and-whiskers plot for the data set.
3 Determine whether the data set is symmetric or skewed.
4 Identify outliers, if any, in the data set.

Page | 6
