Applied Statistics for Decision-Making
Workbook Part 1
Prof. S. Arvind
NUMBERS ARE IMPORTANT, BUT…
Factfulness by Hans Rosling
I don’t love numbers. I am a huge, huge fan of data, but I don’t love it. It has its limits. I love data only
when it helps me to understand the reality behind the numbers, i.e., people’s lives. In my research, I
have needed the data to test my hypotheses, but the hypotheses themselves often emerged from
talking to, listening to, and observing people. Though we absolutely need numbers to understand the
world, we should be highly skeptical about conclusions derived purely from number crunching.
The prime minister of Mozambique from 1994 to 2004, Pascoal Mocumbi, visited Stockholm in 2002
and told me that his country was making great economic progress. I asked him how he knew that;
after all, the quality of the economic statistics in Mozambique was probably not very good. Had he
looked at GDP per capita?
“I do look at those figures,” he said, “but they are not so accurate. So, I have also made it a habit to
watch the marches on 1st May every year. They are a popular tradition in our country. And I look at
people’s feet and what kind of shoes they have. I know that people do their best to look good on that
day. I know that they cannot borrow their friend’s shoes, because their friend will be out marching, too.
So, I look. And I can see, if they walk barefoot, or if they have bad shoes or if they have good shoes.
And I can compare with what I saw last year.
“Also, when I travel across the country, I look at the construction going on. If the grass is growing over
new foundations, that is bad. But if they keep adding new bricks to the building, then I know people
have money to invest, not just to consume day-to-day.”
A wise prime minister looks at the numbers, but not only at the numbers.
And, of course, some of the most valued and important aspects of human development cannot be
measured in numbers at all. We can estimate suffering from disease using numbers. We can measure
improvements in material living conditions using numbers. But the end goal of economic growth is
individual freedom and culture, and these values are difficult to capture with numbers. The idea of
measuring human progress in numbers seems completely bizarre to many people. I often agree. The
numbers will never tell the full story on what life on Earth is all about.
The world cannot be understood without numbers. But the world cannot be understood with numbers
alone.
TERMS WE NEED TO KNOW…
HISTOGRAM
A bar chart where vertical bars represent the frequency or percentage of the class
(group or individual member).
The variable of interest is displayed along the X-axis.
Note: There should be no gaps between adjacent bars.
POPULATION
The set of all the items or individuals of interest in a particular study.
Example: All the students who have graduated from our school since
inception.
SAMPLE
A collection of a portion of the population selected for analysis. Thus, a
sample is a sub-set of the population.
Example: 80 students selected at random from the 4,800 students who
graduated from our school since inception (population).
REPRESENTATIVE SAMPLE
A sample that is expected to be similar to the population in its
characteristics. Inferences based upon the sample are expected to be
reasonably accurate for the population as well.
PICKING SAMPLES AT RANDOM
2-DIGIT RANDOM NUMBER TABLE
EXCEL FUNCTION:
RANDBETWEEN(lowest, highest)
Example: RANDBETWEEN(00,99)
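The same kind of draw can be made outside Excel. The sketch below is a minimal Python illustration, assuming the 4,800 graduates from the SAMPLE example are numbered 1 to 4,800. The numbering scheme, the seed and the function name `pick_random_sample` are assumptions made for this example, not part of the workbook.

```python
import random

def pick_random_sample(population_size, sample_size, seed=None):
    """Return sample_size distinct IDs drawn at random from 1..population_size."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Python counterpart of RANDBETWEEN(0, 99): one 2-digit random number
two_digit = random.randint(0, 99)

# 80 students chosen at random from 4,800 graduates (hypothetical ID numbering)
chosen = pick_random_sample(4800, 80, seed=1)
print(len(chosen), chosen[:5])
```

Because `rng.sample` draws without replacement, no student can be picked twice, which repeated calls to RANDBETWEEN do not guarantee.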
PARAMETER (term starts with ‘P’)
A descriptive measure for a population.
Example: Average marks scored in the first quiz of this [entire] batch of
students.
Parameter: Average marks scored by all students of this batch.
Statistic: Average marks scored by a sample of students of this batch.
OUTLIER
A data-point/observation whose value lies at an ‘abnormal’ distance from
the rest of the data-set.
Point to ponder over: What is meant by ‘abnormal distance’?
Example: Student marks in an exam that qualify for ‘A+’ or ‘F’ grades.
SKEWNESS
The asymmetry of a data distribution.
INDEX OF SKEWNESS
A measure of the degree of asymmetry of a data distribution.
IDENTIFYING OUTLIERS
Abnormal Distance?
UNDERSTANDING DISTRIBUTION OF DATA
Properties to describe data distribution
1 Central Tendency
2 Dispersion
3 Skewness
CENTRAL TENDENCY
1 Mode
2 Mean (arithmetic)
3 Median
CENTRAL TENDENCY - MODE
The Mode is the score or qualitative category that
occurs most frequently in the data set.
Example:
On studying trends over three years in the sales of
white shirts of sizes 35”, 37”, 39” and 41”, a garment
manufacturer finds that 37” size is the fastest moving,
accounting for 43% of total sales.
VALUE FREQUENCY
A 3
B 8
C 2
Quiz 1:
Can a data set have more than one Mode?
Quiz 2:
Can a data set have no Mode?
Every value has the same frequency!
Quiz 3:
Can the Mode be an extreme value?
Data Set
A+ A+ A+ A B+ B+ B C+ C+ C D
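The three quiz questions can be explored with a short Python sketch. The helper name `modes` and the two small numeric lists are hypothetical; only the grade list comes from the data set above.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

grades = ["A+", "A+", "A+", "A", "B+", "B+", "B", "C+", "C+", "C", "D"]

print(modes(grades))           # the mode here is the top grade, an extreme value
print(modes([1, 1, 2, 2, 3]))  # two values tie for the highest frequency
print(modes([4, 5, 6]))        # every value has the same frequency
```

When every value is returned, as in the last call, the data set is usually described as having no mode.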
Caution:
Check whether the mean is to be computed with or without
weights (simple versus weighted mean).
If weights are assigned, we would be computing the
‘weighted mean’ or ‘weighted average’.
Note – Recent values are usually given more
importance than older ones.
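A minimal sketch of the difference between the simple and the weighted mean, in Python. The sales figures and the weights are hypothetical, chosen only so that the most recent value carries the most weight.

```python
def simple_mean(values):
    """Every value counts equally."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Each value counts in proportion to its weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical monthly sales, oldest first; recent months get larger weights
sales = [100, 120, 140]
weights = [1, 2, 3]

print(simple_mean(sales))             # 120.0
print(weighted_mean(sales, weights))  # (100*1 + 120*2 + 140*3) / 6
```

The weighted mean is pulled towards 140 because the most recent month has the largest weight.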
CENTRAL TENDENCY - MEAN
NOTE 1
A numerical data set has only one Mean.
NOTE 2
The Mean cannot be applied to qualitative data.
CENTRAL TENDENCY - MEDIAN
The Median is the single point in a distribution
that divides the data set into two groups of equal
frequency.
Caution:
The median can often be observed; sometimes it
has to be computed.
Median value is not affected by extreme values [outliers].
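NOTE
In an ordered data set of n values, the median sits at position (n + 1)/2.
When n is even, this position falls between the two middle values, and the
median is their mean.
Example: n = 8
Values – 8, 13, 15, 19, 26, 33, 37, 43
Median is the mean of the middle two values
= Mean of 4th and 5th values = (19 + 26)/2 = 22.5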
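The rule for odd and even n can be sketched in a few lines of Python. The function name `median` is an illustrative choice; Python's own `statistics.median` gives the same result.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the median is observed
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: the median is computed

print(median([8, 13, 15, 19, 26, 33, 37, 43]))  # 22.5
```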
COMPARING MEAN & MEDIAN
Outliers, if present.
Mean divides the data set into two equal parts by Value.
Mean is affected by the presence of extreme values [outliers].
Median divides the data set into two equal parts by Frequency
[number of data points].
Median is unaffected by the presence of extreme values [outliers].
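To see the contrast in numbers, the sketch below reuses the eight values from the median example and appends one extreme value. The value 430 is hypothetical, added only to play the role of an outlier.

```python
import statistics

values = [8, 13, 15, 19, 26, 33, 37, 43]  # data from the median example
with_outlier = values + [430]             # hypothetical extreme value

for label, data in (("original", values), ("with outlier", with_outlier)):
    print(label, round(statistics.mean(data), 2), statistics.median(data))
```

The mean rises from 24.25 to about 69.33, while the median only moves from 22.5 to 26.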
LOCATING CENTRAL TENDENCIES
What’s that?
PRACTICAL WORKING RULE
When dealing with a skewed quantitative
distribution, consider using Median instead
of Mean for the average.
WHAT WOULD THE DATA DISTRIBUTION OF USAIN BOLT’S
100 m TIMINGS LOOK LIKE (LAST 50 RACES)?
THE 1% SUPER-RICH
In this case, are the values being considered or the number of values?
SAMPLE EXERCISE
30 applicants for a driving license scored the following marks in
the written test (in ascending order).
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
R1 4 6 6 9 11 11 13 16 17 19
R2 23 27 28 31 32 35 37 37 42 43
R3 46 47 51 53 54 58 59 64 72 78
Required of you:
1) Compute the Mode, Mean and Median of the data set.
2) Identify the outliers in the data set.
TERMS WE NEED TO KNOW…
Dispersion/Variation/Variability of a Data Distribution
[Figure: distributions of the variable of interest (height of students), on an axis from 5’ 0” to 6’ 6”, with 5’ 9” marked]
RANGE
Difference between the largest and smallest values in the data set.
Note: a) Considers only extreme values; ignores all others.
b) Is never negative; it is zero only when all values are equal.
c) Simple to compute.
VARIANCE
Is the average (mean) of the squared deviations of observations from the
mean in the data set.
Note: a) The higher the variance, the more widely the data are dispersed around the mean.
b) If all observations in the data set have the same value, the
variance is zero.
c) Its unit of measure is the square of the data’s unit (e.g., marks squared), which is hard to interpret.
STANDARD DEVIATION
Is the square root of variance.
Note: a) Its unit of measure is the same as that of the data, the mean and the deviations.
b) Considers all the data points in its computation.
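The three measures can be sketched directly from their definitions. The data set below is hypothetical, and the variance here divides by the number of observations, that is, it treats the data as a whole population.

```python
import math

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def population_variance(values):
    """Mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def population_std_dev(values):
    """Square root of the variance, in the same unit as the data."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

print(value_range(data))          # 7
print(population_variance(data))  # 4.0
print(population_std_dev(data))   # 2.0
print(population_variance([5, 5, 5]))  # 0.0 when all values are equal
```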
NOTATIONS
For Mean, Standard Deviation and Variance
                        Population   Sample
Number of data points   N            n
Mean                    µ            Xbar
Standard Deviation      σ            S
Variance                σ^2          S^2
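FORMULAE FOR VARIANCE
Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample: $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$
Bessel’s Correction
Dividing by (n – 1) instead of n in the sample formula, to compensate for
sampling errors (the sample not accurately representing the population).
Deviations measured from the sample’s own mean tend to understate the
spread around the population mean.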
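A minimal sketch of the correction in Python: the only difference between the two variances is the divisor. The data set is hypothetical; Python's `statistics.pvariance` (population) and `statistics.variance` (sample) are used only as a cross-check.

```python
import statistics

def variance(values, sample=True):
    """Variance with divisor n - 1 (sample) or n (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n  # Bessel's correction uses n - 1
    return squared_deviations / divisor

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

print(variance(data, sample=False), statistics.pvariance(data))  # divide by n
print(variance(data, sample=True), statistics.variance(data))    # divide by n - 1
```

The corrected (sample) variance is always the larger of the two, and the gap shrinks as n grows.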
DEGREES OF FREEDOM
The number of variables in a system that are free to
vary without violating any constraint.
EXAMPLE:
Estimated profit in a project [3 years] = $10,000,000
Standard Deviation of expected profit = $16,000
Significant?
COMBINING TWO DATA SETS
Population 1
Number of data points = N1
Mean = µ1
Standard Deviation = σ1
Population 2
Number of data points = N2
Mean = µ2
Standard Deviation = σ2
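Combined Mean:
$$\mu = \frac{N_1 \mu_1 + N_2 \mu_2}{N_1 + N_2}$$
FOR POPULATIONS
Combined Variance (weighted average)
$$\sigma^2 = \frac{(N_1 \sigma_1^2 + N_2 \sigma_2^2) + N_1 (\mu_1 - \mu)^2 + N_2 (\mu_2 - \mu)^2}{N_1 + N_2}$$
Re-written as:
$$\sigma^2 = \frac{N_1 [\sigma_1^2 + (\mu_1 - \mu)^2] + N_2 [\sigma_2^2 + (\mu_2 - \mu)^2]}{N_1 + N_2}$$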
FOR SAMPLES
Sample 1
Number of data points = n1
Mean = X1bar
Standard Deviation = S1
Sample 2
Number of data points = n2
Mean = X2bar
Standard Deviation = S2
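Combined Mean:
$$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$$
Combined Variance (weighted average)
$$S^2 = \frac{(n_1 S_1^2 + n_2 S_2^2) + n_1 (\bar{X}_1 - \bar{X})^2 + n_2 (\bar{X}_2 - \bar{X})^2}{n_1 + n_2}$$
Re-written as:
$$S^2 = \frac{n_1 [S_1^2 + (\bar{X}_1 - \bar{X})^2] + n_2 [S_2^2 + (\bar{X}_2 - \bar{X})^2]}{n_1 + n_2}$$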
EXERCISE: COMBINING TWO DATA SETS
Students’ marks for the final exam of Marketing Management for the first
two batches of EMBA are shown below.
Required of you:
1 Compute the mean and standard deviation of the marks for the two
batches separately.
2 Compute the mean and standard deviation of the two batches combined.
BATCH 1 [N = 14]
73 67 68 74 70 72 73 72
75 73 75 69 69 56
BATCH 2 [N = 12]
74 50 72 69 71 72 72 68
72 70 70 68
EXERCISE: COMBINING TWO DATA SETS
BATCH 1 [N = 14]:
Total = 986 Mean = 70.43 Std. Dev. = 4.70
BATCH 2 [N = 12]:
Total = 828 Mean = 69.00 Std. Dev. = 5.99
COMBINED MEAN:
[986 + 828]/26 = 69.77
COMBINED VARIANCE:
$$\frac{14 \times [4.70^2 + (70.43 - 69.77)^2] + 12 \times [5.99^2 + (69.00 - 69.77)^2]}{14 + 12} = 28.96$$
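The worked answer can be cross-checked with a short Python sketch. It applies the re-written population formula to the two batches and compares the result with the variance of all 26 marks taken together. The function name `combine` is an illustrative choice; the batches are treated as populations (divisor N), as in the figures above.

```python
import statistics

def combine(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and variance of two populations from their summaries."""
    total = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / total
    variance = (n1 * (sd1 ** 2 + (mean1 - mean) ** 2)
                + n2 * (sd2 ** 2 + (mean2 - mean) ** 2)) / total
    return mean, variance

batch1 = [73, 67, 68, 74, 70, 72, 73, 72, 75, 73, 75, 69, 69, 56]
batch2 = [74, 50, 72, 69, 71, 72, 72, 68, 72, 70, 70, 68]

mean, variance = combine(
    len(batch1), statistics.mean(batch1), statistics.pstdev(batch1),
    len(batch2), statistics.mean(batch2), statistics.pstdev(batch2),
)
pooled = statistics.pvariance(batch1 + batch2)  # variance of all 26 marks

print(round(mean, 2), round(variance, 2), round(pooled, 2))
```

With unrounded means and standard deviations both routes give about 28.95. The figure 28.96 above comes from carrying the rounded inputs 4.70, 5.99, 70.43 and 69.77 through the formula.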
CLASSIFICATION OF DATA
[Diagram: DATA branches into CONTINUOUS DATA and DISCRETE DATA]
CONTINUOUS DATA
As the name suggests, with continuous data, the variable can take any
value in a defined range without any breaks. In other words, there are
an ‘infinite’ number of possible values the variable can take.
DISCRETE DATA
Discrete classes of data involve breaks between the classes. Usually, the
number of possible values the variable can take is limited and known.
Exercise 1
These data are a sample of the daily production rate of fiberglass boats from Hydrosport
Limited.
17 21 18 27 17 21 20 22 18 23
The company’s production manager feels that a standard deviation of more than three
boats per day indicates unacceptable production-rate variations. Should she be
concerned based on the above data?
Exercise 2
The reading readiness of pre-school children in two neighbourhoods was determined
through sampling and the data sets are displayed below.
a) Compute the mode, mean and median for each neighbourhood separately.
b) Compute the range and standard deviation for each neighbourhood separately.
c) Are the two distributions symmetrical or skewed?
Neighbourhood A
30 33 32 31 35 33
32 29 33 30 32 28
31 31 29 31 26 30
32 30 33 32 27 32
Neighbourhood B
29 32 28 29 29
30 31 26 30 28
28 29 29 34 30
29 27 30 31 35
Exercise 3
The head chef of The Flying Taco has just received two dozen tomatoes from her
supplier, but she isn’t ready to accept them. She knows from the invoice that the
average weight of a tomato is 7.5 ounces, but she insists that all be of uniform weight.
She will accept them only if the average weight is 7.5 ounces and the standard
deviation is less than 0.5 ounce.
Based on the weights of the tomatoes, determine what the head chef’s decision is.
6.3 7.2 7.3 8.1 7.8 6.8 7.5 7.8 7.2 7.5 8.1 8.2
8.0 7.4 7.6 7.7 7.6 7.4 7.5 8.4 7.4 7.6 6.2 7.4
RANGE AND STANDARD DEVIATION
When should one use each of these measures?
Exercise 4
A population with 20 numbers drawn at random from 0 to 99 has been tabulated below.
17 63 23 84 6 47 38 29 73 19
61 84 92 81 43 4 13 28 89 38
Part A
Compute the range and standard deviation of the population.
Part B
From the population, a random sample of five numbers is drawn, as tabulated below.
92 81 84 6 4
CASE ANALYSIS
Life Expectancy in Top 10 Countries
Source: Global Burden of Disease Study, Institute for Health Metrics and Evaluation,
University of Washington
I am often quite rude during my presentations when people from the “developed world” use the
term “developing world”.
Afterward, people ask me, “So, what should we call them instead?”
But, listen carefully. It’s the same misconception: we and them. What should “we” call “them”
instead?
What we should do is stop dividing the countries of the world into two groups. It doesn’t make
sense anymore. It doesn’t help us to understand the world in a practical way. It doesn’t help
businesses find opportunities, and it doesn’t help aid money to find the poorest people.
But we need to do some kind of sorting to make sense of the world. We can’t give up our old
labels and replace them with… nothing. What should we do?
One reason the old labels are so popular is that they are so simple. But they are wrong! So, to
replace them, I will now suggest an equally simple but more relevant and useful way of dividing
up the world. Instead of dividing the world into two groups, I will divide it into four income levels,
as described below.
Each figure in the chart represents 1 billion people, and the seven figures show how the current
world population is spread out across four income levels, expressed in terms of dollar income
per day. You can see that most people are living in the two middle levels, where people have
most of their basic human needs met.
Are you excited? You should be. Because the four income levels are the first, most important
part of your new fact-based framework. They are one of the simple thinking tools I promised
would help you to guess better about the world. So, I want to try to explain what life is like on
each of these four levels.
Think of the four income levels as the levels of a computer game. Everyone wants to move from
Level 1 to Level 2 and upward through the levels from there. Only, it’s a very strange computer
game, because Level 1 is the hardest. Let’s play.
LEVEL 1
You start on Level 1 with $1 per day. Your five children have to spend hours walking barefoot
with your single plastic bucket, back and forth, to fetch water from a dirty mud hole an hour’s
walk away. On their way home, they gather firewood, and you prepare the same grey porridge
that you have been eating at every meal, every day, for your whole life – except during the
months when the meagre soil yielded no crops and you went to bed hungry. One day, your
youngest daughter develops a nasty cough. Smoke from the indoor fire is weakening her lungs.
You cannot afford antibiotics, and one month later, she is dead. This is extreme poverty. Yet,
you keep struggling on. If you are lucky, and the yields are good, you can maybe sell some
surplus crops and manage to earn more than $2 a day, which would move you to the next level.
Good luck!
[Roughly 1 billion people live like this today.]
LEVEL 2
You’ve made it. In fact, you’ve quadrupled your income and now you earn $4 a day. Three extra
dollars every day. What are you going to do with all this money? Now you can buy food that you
didn’t grow yourself, and you can afford chickens, which means eggs. You save some money
and buy sandals for your children, and a bike, and more plastic buckets. Now it takes you only
half an hour to fetch water for the day. You buy a gas stove so your children can attend school
instead of gathering wood. When there’s power, they can do their homework under a bulb. But
the electricity is too unstable for a freezer. You save up for mattresses so you don’t have to
sleep on the mud floor. Life is much better now, but still very uncertain. A single illness and you
would have to sell most of your possessions to buy medicine. That would throw you back to
Level 1 again. Another three dollars a day would be good, but to experience really drastic
improvement, you need to quadruple again. If you can land a job in the local garment industry,
you will be the first member of your family to bring home a salary.
[Roughly 3 billion people live like this today.]
LEVEL 3
Wow! You did it! You work multiple jobs, 16 hours a day, seven days a week, and manage to
quadruple your income again, to $16 a day. Your savings are impressive and you install a
cold-water tap. No more fetching water. With a stable electric line, the kids’ homework improves and
you can buy a fridge that lets you store food and serve different dishes each day. You save to
buy a motorcycle, which means you can travel to a better-paying job at a factory in town.
Unfortunately, you crash on your way there one day, and you have to use money you had saved
for your children’s education to pay the medical bills. You recover, and thanks to your savings,
you are not thrown back a level. Two of your children start high school. If they manage to finish,
they will be able to get better-paying jobs than you have ever had. To celebrate, you take the
whole family on its first-ever vacation, one afternoon to the beach, just for fun.
[Roughly 2 billion people live like this today.]
LEVEL 4
You have more than $32 a day. You are a rich consumer and three more dollars a day makes
very little difference to your everyday life. That’s why you think three dollars, which can change
the life of someone living in extreme poverty, is not a lot of money. You have more than 12
years of education and you have been on an airplane on vacation. You can eat out once a
month and you can buy a car. Of course, you have hot and cold water indoors.
But you know about this level already. Since you are reading this passage, I’m pretty sure you
live in Level 4. I don’t have to describe it for you to understand. The difficulty, when you have
always known this high level of income, is to understand the huge differences between the other
three levels. People on Level 4 must struggle hard not to misunderstand the reality of the other
6 billion people in the world.
[Roughly 1 billion people live like this today.]
I’ve described the progress up the levels as if one person managed to move through several
levels. That is very unusual. Often, it takes several generations for a family to move from Level
1 to Level 4. I hope though that you now have a clear picture of the kinds of lives people live on
different levels; a sense that it is possible to move through the levels, both for individuals and for
countries; and above all, the understanding that there are not just two kinds of lives.
Human history started with everyone on Level 1. For more than 100,000 years, nobody made it
up the levels and most children didn’t survive to become parents. Just 200 years ago, 85% of
the world population was still on Level 1, in extreme poverty.
Today, the vast majority of people are spread out in the middle, across Levels 2 and 3, with the
same range of standards of living as people in Western Europe and North America in the 1950s.
And this has been the case for many years.
The gap instinct of classifying all data into two categories is very strong. The first time I lectured
to the staff of the World Bank was in 1999. I told them the labels “developing” and “developed”
were no longer valid. It took the World Bank 17 years and 14 more of my lectures before it
finally announced publicly that it was dropping these terms and would from now on divide the
world into four income groups. The UN and most other global organisations have still not made
this change.
QUARTILES, DATA SET SUMMARIES AND OUTLIERS
QUARTILES
Often, to understand and analyse a set of data, it is useful to divide the set into four equal parts so
that each part contains about 25% of the values. Each of the three dividing points (three dividers are
required for four parts) is called a ‘quartile’ and is defined as follows:
EXAMPLE
A data set consists of 21 recordings, which are shown in the table below. Compute the following:
1 The first, second and third quartiles
2 The interquartile range
Solution
As the first step, we arrange the data set in ascending order as shown below.
608 739 1356 1374 1850 1872 2127
2459 2818 3653 4019 4341 5794 6452
7478 8305 8408 8879 10498 11413 14138
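A minimal sketch of the computation in Python. It assumes the quartile positions are taken at (n + 1)/4, (n + 1)/2 and 3(n + 1)/4, with the two neighbouring values averaged when a position falls halfway between them. That is what `statistics.quantiles` does with `method="exclusive"`; treating it as the workbook's rule is an assumption, although it does reproduce the quartiles 1861, 4019 and 8356.50 used in the sections that follow.

```python
import statistics

readings = [
    608, 739, 1356, 1374, 1850, 1872, 2127,
    2459, 2818, 3653, 4019, 4341, 5794, 6452,
    7478, 8305, 8408, 8879, 10498, 11413, 14138,
]

# n = 21, so the positions are 5.5, 11 and 16.5 in the ordered data
q1, q2, q3 = statistics.quantiles(readings, n=4, method="exclusive")
iqr = q3 - q1  # interquartile range

print(q1, q2, q3, iqr)  # 1861.0 4019.0 8356.5 6495.5
```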
Page | 1
FIVE-NUMBER SUMMARY
It is often convenient to summarise a data set by specifying five numbers:
1 Smallest value
2 First quartile, Q1
3 Second quartile (median), Q2
4 Third quartile, Q3
5 Largest value
The five-number summary is useful in judging the shape of the data distribution, as described
below.
Comparison 2 [Tails]
[Q1 - Xsmallest] = [1861 – 608] = 1253
[Xlargest – Q3] = [14138 – 8356.50] = 5781.50
This is right-skewed.
Comparison 3 [Body]
[Median – Q1] = [4019 – 1861] = 2158
[Q3 – Median] = [8356.50 – 4019] = 4337.50
This is right-skewed.
Thus, from the five-number summary, we can judge the shape of the data set: whether it is symmetric or
skewed to the left or right.
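Comparisons 2 and 3 can be sketched in Python as follows. The helper names `five_number_summary` and `shape` are hypothetical, and the quartiles rely on the (n + 1)-based position rule of `statistics.quantiles(..., method="exclusive")`, which is an assumption about the workbook's definition.

```python
import statistics

def five_number_summary(values):
    """Smallest value, Q1, median, Q3 and largest value."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)

def shape(left_distance, right_distance):
    """Compare the distance on the left with the distance on the right."""
    if right_distance > left_distance:
        return "right-skewed"
    if left_distance > right_distance:
        return "left-skewed"
    return "symmetric"

readings = [
    608, 739, 1356, 1374, 1850, 1872, 2127,
    2459, 2818, 3653, 4019, 4341, 5794, 6452,
    7478, 8305, 8408, 8879, 10498, 11413, 14138,
]

smallest, q1, median, q3, largest = five_number_summary(readings)
print("Tails:", shape(q1 - smallest, largest - q3))  # 1253 versus 5781.5
print("Body:", shape(median - q1, q3 - median))      # 2158 versus 4337.5
```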
Page | 2
IDENTIFYING OUTLIERS
An objective method of identifying outliers in a data set is to compute threshold values on either side
of the median that are called ‘limits’: the Lower Limit is Q1 – 1.5 x [Q3 – Q1] and the Upper Limit is
Q3 + 1.5 x [Q3 – Q1]. Data values that lie outside these two limits are identified as outliers.
In our example,
Lower Limit = 1861 – 1.5 x [8356.50 – 1861] = - 7882.25
Upper Limit = 8356.50 + 1.5 x [8356.50 – 1861] = + 18099.75
Since none of the data points lies outside this range, we conclude that the data set does not contain
outliers on either side of the median.
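The same check can be sketched in Python. The function names are illustrative, and the value 20000 in the last call is hypothetical, added only to show what a flagged outlier looks like.

```python
def outlier_limits(q1, q3, k=1.5):
    """Lower and upper limits: k interquartile ranges beyond Q1 and Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values lying outside the lower and upper limits."""
    lower, upper = outlier_limits(q1, q3)
    return [v for v in values if v < lower or v > upper]

readings = [
    608, 739, 1356, 1374, 1850, 1872, 2127,
    2459, 2818, 3653, 4019, 4341, 5794, 6452,
    7478, 8305, 8408, 8879, 10498, 11413, 14138,
]

print(outlier_limits(1861, 8356.50))                     # (-7882.25, 18099.75)
print(find_outliers(readings, 1861, 8356.50))            # []
print(find_outliers(readings + [20000], 1861, 8356.50))  # [20000]
```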
THE BOXPLOT
A boxplot provides a graphical representation of the data based on the five-number summary. It
looks like this.
The ‘box’ refers to the interquartile range (Q3 – Q1), which also contains the median Q2. The lines
on either side of the box are referred to as ‘whiskers’. The ‘box and whiskers plot’ may be drawn in
three different ways:
1 Until the extreme values (minimum and maximum) in the data set
2 Until the lower and upper limits as computed in the previous section (as shown above)
3 Until the extreme values (minimum and maximum) in the data set that lie within the lower and
upper limits computed in the previous section (as shown in the next page).
Outliers are simply data points that lie outside the lower and upper limits in the plot.
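The three ways of ending the whiskers can be compared with a small Python sketch. The style labels ("extremes", "limits", "extremes_within_limits") and the function name are hypothetical; they simply mirror options 1 to 3 in the list above. The value 20000 is a hypothetical outlier added to make the difference visible.

```python
def whisker_ends(values, lower_limit, upper_limit, style):
    """Return the (low, high) whisker ends for one of the three conventions."""
    if style == "extremes":                # option 1: minimum and maximum
        return min(values), max(values)
    if style == "limits":                  # option 2: the outlier limits
        return lower_limit, upper_limit
    if style == "extremes_within_limits":  # option 3: extremes inside the limits
        inside = [v for v in values if lower_limit <= v <= upper_limit]
        return min(inside), max(inside)
    raise ValueError(f"unknown style: {style}")

readings = [
    608, 739, 1356, 1374, 1850, 1872, 2127,
    2459, 2818, 3653, 4019, 4341, 5794, 6452,
    7478, 8305, 8408, 8879, 10498, 11413, 14138,
]
lower, upper = -7882.25, 18099.75  # limits from the example
data = readings + [20000]          # hypothetical outlier added

for style in ("extremes", "limits", "extremes_within_limits"):
    print(style, whisker_ends(data, lower, upper, style))
```

Only option 1 lets the whisker reach the outlier; options 2 and 3 leave it to be plotted as a separate point.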
Page | 3
The shape of the data set can be determined from the box and whiskers plot, as discussed on page
2. Applying Comparison 2 from the table on page 2 to the plot below, we can infer that the data set
is skewed to the right.
Page | 4
EXERCISE 1 ON QUARTILES, OUTLIERS & BOX-AND-WHISKERS PLOT
Naples, Florida, hosts a marathon in January each year. The event attracts top runners
from across the United States. In January last year, 22 men and 31 women entered the
19-24 age class.
Finish times in minutes were recorded as shown in the table below (in order of finish).
Required of you:
1 Prepare the five-number summary for men and women separately.
2 Draw the box-and-whiskers plots for men and women separately on a single graph
sheet. The end of the whiskers can be taken as the lower and upper limits for
identifying outliers.
3 Determine whether the two data sets (men and women) are symmetric or skewed.
4 Identify outliers, if any, in the two data sets.
Page | 5
EXERCISE 2 ON QUARTILES, OUTLIERS & BOX-AND-WHISKERS PLOT
The world’s largest money managers ranked by total assets under management [AUM]
in billions of US$ as on 31st December 2020.
Rank Money manager Country AUM (US$ billion)
1 Blackrock US 7,318
2 Vanguard Group US 6,100
3 UBS Group Switzerland 3,518
4 Fidelity Investments US 3,319
5 State Street Global Advisors US 3,054
6 Allianz Group Germany 2,530
7 JP Morgan US 2,511
8 Goldman Sachs US 2,057
9 Bank of New York Mellon US 1,961
10 PIMCO US 1,920
11 Morgan Stanley US 1,901
12 Amundi France 1,791
13 Capital Group US 1,700
14 Prudential Financial US 1,605
15 Credit Suisse Switzerland 1,521
16 Franklin Resources US 1,428
17 Deutsche Bank Germany 1,368
18 Northern Trust US 1,258
19 Legal & General Trust UK 1,232
20 BNP Paribas France 1,221
21 Bank of America US 1,220
22 T. Rowe Price US 1,218
23 Invesco Limited US 1,145
24 TIAA US 1,143
25 Wellington Management Co. US 1,100
Required of you:
1 Prepare the five-number summary for this data set.
2 Draw the box-and-whiskers plot for the data set.
3 Determine whether the data set is symmetric or skewed.
4 Identify outliers, if any, in the data set.
Page | 6