
Statistics for Business Students

A Guide to Using Excel & IBM SPSS Statistics

Glyn Davis

&

Branko Pecar

First edition, 2021

Statistics for Business Students; A Guide to Using Excel and IBM SPSS
Statistics

Copyright © 2021 by Branko Pecar & Glyn Davis

ISBN: 978-1-63795-762-2

All rights reserved. No part of this book shall be reproduced, stored in a retrieval
system, or transmitted by any means, electronic, mechanical, photocopying, recording,
photographing, or otherwise, without written permission from the authors. No patent
liability is assumed with respect to the use of the information contained herein.
Although every precaution has been taken in the preparation of this book, the publisher
and authors assume no responsibility for errors and omissions. Nor is any liability
assumed for damages resulting from the use of the information contained herein.

Published electronically as an Amazon Kindle Edition

First version: January 2021

Trademarks

All terms mentioned in this book that are known as trademarks or service marks have
been appropriately referenced. Use of a term in this book should not be regarded as
affecting the validity of any trademark or service mark.

Warning and disclaimer

Every effort has been made to make this book as complete and as accurate as possible,
but no warranty or fitness is implied. The information provided is on an “as is” basis.
The authors and the publisher shall have neither liability nor responsibility to any
person or entity with respect to any loss or damages arising from the information
contained in this book.

Workbooks and supporting material

All workbooks and other supporting materials are available for free download from
the website: https://www.stats-bus.co.uk. This includes Excel and IBM SPSS data and
output files.

Contact email address for the authors:

info@stats-bus.co.uk

Contents – Brief

Preface……………………………………………………………………………………………………..……………..13

Chapter 1, Data Visualisation……………………………………………………………………………………18

Chapter 2, Descriptive statistics…………………………………………………………………………….....95

Chapter 3, Probability distributions………………………………………………………………………..166

Chapter 4, Sampling distributions…………………………………………………………………….…….241

Chapter 5, Point and interval estimates……………………………………………………….………….289

Chapter 6, Hypothesis testing……………………………………………………………………..…………..334

Chapter 7, Parametric hypothesis tests…………………………………………………………..……….358

Chapter 8, Nonparametric tests………………………………………………………………………………417

Chapter 9, Linear correlation and regression analysis…………………………………………..…510

Chapter 10, Introduction to time series data, long-term forecasts and seasonality…….569

Chapter 11, Short and medium-term forecasts………………………………………………...……...644

Appendices…………………………………………………………………………………………………………....717

Index……………………………………………………………………………………………………………………..751

Contents
Preface .............................................................................................................................................................. 13
What features do we have?.................................................................................................................. 16
Greek alphabet letters used within this textbook ...................................................................... 16
Acknowledgements ................................................................................................................................ 17
Chapter 1 Data visualisation .................................................................................................................... 18
1.1 Introduction and learning objectives ....................................................................................... 18
1.2 What is a variable?........................................................................................................................... 19
Quantitative variables: interval and ratio scales .................................................................... 20
Qualitative variables: categorical data (nominal and ordinal scales) ............................ 21
Discrete and continuous data types ............................................................................................ 21
1.3 Tables ................................................................................................................................................... 23
Simple tables ........................................................................................................................................ 23
Frequency distribution..................................................................................................................... 25
Creating a crosstab table using Excel PivotTable .................................................................. 38
Summarising the principles of table construction ................................................................ 46
Check your understanding .............................................................................................................. 46
1.4 Graphs - visualising data ............................................................................................................... 47
Bar charts .............................................................................................................................................. 47
Creating a bar chart using Excel and SPSS ................................................................................ 50
Check your understanding .............................................................................................................. 58
Pie charts ............................................................................................................................................... 58
Check your understanding .............................................................................................................. 65
Histograms ............................................................................................................................................ 66
Creating a histogram using Excel and SPSS.............................................................................. 70
Check your understanding .............................................................................................................. 77
Scatter and time series plots ............................................................................................................... 78
Creating a scatter and time series plot using Excel and SPSS .......................................... 80
Check your understanding .............................................................................................................. 90
Chapter summary .................................................................................................................................... 91
Test your understanding ...................................................................................................................... 91
Want to learn more?............................................................................................................................... 94
Chapter 2 Descriptive statistics .............................................................................................................. 95
2.1 Introduction and learning objectives ....................................................................................... 95
Introduction.......................................................................................................................................... 95
Learning objectives ............................................................................................................................ 98

2.2 Measures of average for a set of numbers.............................................................................. 98
Mean, median and mode for a set of numbers ........................................................................ 99
Check your understanding ........................................................................................................... 108
2.3 Measures of dispersion for a set of numbers ..................................................................... 108
Percentiles and quartiles for a set of numbers .................................................................... 109
Check your understanding ........................................................................................................... 118
The range ............................................................................................................................................ 118
The interquartile range and semi-interquartile range ..................................................... 119
The standard deviation and variance ...................................................................................... 120
Check your understanding ........................................................................................................... 128
Interpretation of the standard deviation ............................................................................... 129
The coefficient of variation .......................................................................................................... 130
Check your understanding ........................................................................................................... 131
2.4 Measures of shape ........................................................................................................................ 131
Measuring skewness: distribution symmetry ........................................................................... 132
Pearson’s coefficient of skewness ............................................................................................. 133
Fisher–Pearson skewness coefficient ...................................................................................... 133
Check your understanding ........................................................................................................... 139
Measuring kurtosis: distribution outliers and peakedness ................................................. 140
Check your understanding ........................................................................................................... 146
Calculating a five-number summary............................................................................................. 146
To identify symmetry..................................................................................................................... 148
To identify outliers.......................................................................................................................... 149
Check your understanding ........................................................................................................... 152
Creating a box plot ............................................................................................................................... 152
To identify symmetry..................................................................................................................... 153
To identify outliers.......................................................................................................................... 153
Check your understanding ........................................................................................................... 158
2.5 Using the Excel Data Analysis menu ...................................................................................... 159
Check your understanding ........................................................................................................... 162
Chapter summary ................................................................................................................................. 163
Test your understanding ................................................................................................................... 164
Want to learn more?............................................................................................................................ 165
Chapter 3 Probability distributions ................................................................................................... 166
3.1 Introduction and learning objectives .................................................................................... 166
Learning objectives ......................................................................................................................... 166

3.2 What is probability? ..................................................................................................................... 167
Introduction to probability .......................................................................................................... 167
Relative frequency .......................................................................................................................... 169
Sample space ..................................................................................................................................... 170
Discrete and continuous random variables .......................................................................... 170
3.3 Continuous probability distributions.................................................................................... 171
Introduction....................................................................................................................................... 171
The normal distribution................................................................................................................ 172
Check your understanding ........................................................................................................... 178
The standard normal distribution (Z distribution) ............................................................ 178
Check your understanding ........................................................................................................... 190
Checking for normality....................................................................................................................... 190
Check your understanding ........................................................................................................... 197
Student’s t distribution ...................................................................................................................... 198
Check your understanding ........................................................................................................... 204
F distribution ......................................................................................................................................... 205
Check your understanding ........................................................................................................... 208
Chi-square distribution ...................................................................................................................... 208
Check your understanding ........................................................................................................... 211
3.4 Discrete probability distributions .......................................................................................... 211
Introduction....................................................................................................................................... 211
Check your understanding ........................................................................................................... 215
Binomial probability distribution .................................................................................................. 216
Normal approximation to the binomial distribution ......................................................... 228
Check your understanding ........................................................................................................... 228
Poisson probability distribution .................................................................................................... 229
Check your understanding ........................................................................................................... 237
Chapter summary ................................................................................................................................. 237
Test your understanding ................................................................................................................... 238
Want to learn more?............................................................................................................................ 240
Chapter 4 Sampling distributions ....................................................................................................... 241
4.1 Introduction and learning objectives .................................................................................... 241
Learning objectives ......................................................................................................................... 241
4.2 Introduction to sampling ........................................................................................................... 242
Types of sampling............................................................................................................................ 243
Types of errors ................................................................................................................................. 249

Check your understanding ........................................................................................................... 249
4.3 Sampling from a population ..................................................................................................... 250
Population versus sample ............................................................................................................ 250
Sampling distribution of the mean ........................................................................................... 259
Sampling from a normal population ........................................................................................ 264
Sampling from a non-normal population ............................................................................... 273
Sampling without replacement .................................................................................................. 276
Sampling distribution of the proportion ................................................................................ 282
Check your understanding ........................................................................................................... 285
Chapter summary ................................................................................................................................. 286
Test your understanding ................................................................................................................... 286
Want to learn more?............................................................................................................................ 288
Chapter 5 Point and interval estimates ............................................................................................ 289
5.1 Introduction and learning objectives .................................................................................... 289
Learning objectives ......................................................................................................................... 289
5.2 Point estimates .............................................................................................................................. 290
Point estimate of the population mean and variance ........................................................ 292
Point estimate of the population proportion and variance............................................. 303
Pooled estimates .............................................................................................................................. 308
Check your understanding ............................................................................................................ 308
5.3 Interval estimates ......................................................................................................................... 309
Interval estimate of the population mean where σ is not known and the sample is
smaller than 30 observations ..................................................................................................... 315
Interval estimate of a population proportion....................................................................... 323
Check your understanding ........................................................................................................... 325
5.4 Calculating sample sizes ............................................................................................................. 326
Check your understanding ........................................................................................................... 331
Chapter summary ................................................................................................................................. 331
Test your understanding ................................................................................................................... 332
Want to learn more?............................................................................................................................ 333
Chapter 6 Hypothesis testing ............................................................................................................... 334
6.1 Introduction and Learning Objectives .................................................................................. 334
Learning objectives ......................................................................................................................... 335
6.2 What is hypothesis testing? ...................................................................................................... 335
What are parametric and nonparametric statistical tests? ............................................. 335
Hypothesis statements H0 and H1 ............................................................................................. 337
One- and two-tailed tests ............................................................................................................. 338

One- and two-sample tests .......................................................................................................... 339
Independent and dependent samples/populations........................................................... 340
Sampling distributions from different population distributions .................................. 341
Sampling from a normal distribution, large sample and known σ (AAA) ................. 341
Sampling from a non-normal distribution, large sample size and known σ (BAA) ........ 342
Sampling from a normal distribution, small sample size and unknown σ (ABB) .. 342
Sampling from a normal distribution, large sample and unknown σ (AAB)............ 344
Sampling from a normal distribution, small sample and known σ (ABA) ................ 344
Sampling from a non-normal distribution, large sample and unknown σ (BAB) .. 344
Sampling from a non-normal distribution, small sample and known σ (BBA) ....... 344
Sampling from a non-normal distribution, small sample and unknown σ (BBB) .. 344
Check your understanding ........................................................................................................... 344
6.3 Introduction to hypothesis testing procedure................................................................... 345
Steps in hypothesis testing procedure .................................................................................... 345
How do we make decisions? ....................................................................................................... 348
Types of errors and statistical power ...................................................................................... 354
Check your understanding ........................................................................................................... 356
Chapter summary ................................................................................................................................. 356
Test your understanding ................................................................................................................... 357
Want to learn more?............................................................................................................................ 357
Chapter 7 Parametric hypothesis tests............................................................................................. 358
7.1 Introduction and Learning Objectives .................................................................................. 358
Learning objectives ......................................................................................................................... 359
7.2 One-sample hypothesis tests .................................................................................................... 359
One-sample z test for the population mean .......................................................................... 359
Check your understanding ........................................................................................................... 366
One-sample t test for the population mean ........................................................................... 366
Check your understanding ........................................................................................................... 376
One-sample z test for the population proportion ............................................................... 377
Check your understanding ........................................................................................................... 382
7.3 Two-sample hypothesis tests ................................................................................................... 383
Two-sample t test for the population mean: independent samples ............................ 383
Excel Data Analysis solutions ..................................................................................................... 399
Check your understanding ........................................................................................................... 401
Two-sample t-test for the population mean: dependent or paired samples............ 402

Check your understanding ........................................................................................................... 413
Chapter summary ................................................................................................................................. 414
Test your understanding ................................................................................................................... 415
Want to learn more?............................................................................................................................ 416
Chapter 8 Chi square and non-parametric hypothesis tests .................................................... 417
8.1 Introduction and learning objectives .................................................................................... 417
Learning objectives ......................................................................................................................... 419
8.2 Chi-square tests ............................................................................................................................. 419
Chi-square test of independence ............................................................................................... 420
How do you solve problems when you have raw data? ................................................... 432
Check your understanding ........................................................................................................... 439
Chi-square test for two proportions (independent samples) ........................................ 441
Check your understanding ........................................................................................................... 449
McNemar’s test for the difference between two proportions (dependent samples)
................................................................................................................................................................ 450
Check your understanding ........................................................................................................... 458
8.3 Nonparametric tests .................................................................................................................... 459
Sign test ............................................................................................................................................... 460
Check your understanding ........................................................................................................... 473
Wilcoxon signed-rank test for matched pairs ...................................................................... 473
Mann–Whitney U test for two independent samples ........................................................ 488
Check your understanding ........................................................................................................... 505
Chapter summary ................................................................................................................................. 505
Test your understanding ................................................................................................................... 506
Want to learn more?............................................................................................................................ 509
Chapter 9 Linear correlation and regression analysis ............................................................... 510
9.1 Introduction and chapter overview ....................................................................................... 510
Learning objectives ......................................................................................................................... 511
9.2 Introduction to linear correlation .......................................................................................... 511
9.3 Linear correlation analysis........................................................................................................ 513
Scatter plots ....................................................................................................................................... 514
Covariance .......................................................................................................................................... 518
Pearson’s correlation coefficient, r ........................................................................................... 523
The coefficient of determination, r2 or R-Squared ............................................................. 529
Spearman’s rank correlation coefficient, rs ........................................................................... 529
Check your understanding ........................................................................................................... 532
9.4 Introduction to linear regression ........................................................................................... 533

9.5 Linear regression .......................................................................................................................... 533
Fit line to a scatter plot.................................................................................................................. 538
Sum of squares defined ................................................................................................................. 543
Regression assumptions ............................................................................................................... 545
Test how well the model fits the data (Goodness-of-fit) .................................................. 547
Prediction interval for an estimate of Y .................................................................................. 554
Excel data analysis regression solution .................................................................................. 559
Regression and p-value explained ............................................................................................ 562
Check your understanding ........................................................................................................... 564
Chapter summary ................................................................................................................................. 564
Test your understanding ................................................................................................................... 565
Want to learn more?............................................................................................................................ 568
Chapter 10 Introduction to time series data, long-term forecasts and seasonality ........ 569
10.1 Introduction and chapter overview .................................................................................... 569
Learning objectives ......................................................................................................................... 570
10.2 Introduction to time series analysis ................................................................................... 570
Stationary and non-stationary data sets ................................................................................ 571
Seasonal time series ....................................................................................................................... 573
Check your understanding ........................................................................................................... 574
10.3 Trend extrapolation as long-term forecasting method ............................................... 575
A trend component ......................................................................................................................... 575
Fitting a trend to a time series ................................................................................................... 577
Using a trend chart function to forecast time series .......................................................... 580
Trend parameters and calculations.......................................................................................... 585
Check your understanding ........................................................................................................... 589
10.4 Error measurements ................................................................................................................. 589
Types of error statistics ................................................................................................................ 594
Check your understanding ........................................................................................................... 598
10.5 Prediction interval ..................................................................................................................... 599
Standard errors in time series .................................................................................................... 600
Check your understanding ........................................................................................................... 613
10.6 Seasonality and Decomposition in classical time series analysis ............................ 613
Cyclical component ......................................................................................................................... 619
Seasonal component....................................................................................................................... 625
Error measurement ........................................................................................................................ 634
Prediction interval .......................................................................................................................... 636

Check your understanding ........................................................................................................... 640
Chapter summary ................................................................................................................................. 641
Test your understanding ................................................................................................................... 642
Want to learn more?............................................................................................................................ 643
Chapter 11 Short and medium-term forecasts ................................................................................. 644
11.1 Introduction and chapter overview .................................................................................... 644
Learning objectives ......................................................................................................................... 645
11.2 Moving averages ......................................................................................................................... 646
Simple moving averages ............................................................................................................... 646
Short-term forecasting with moving averages .................................................................... 654
Mid-range forecasting with moving averages ...................................................................... 662
Check your understanding ........................................................................................................... 667
11.3 Introduction to exponential smoothing............................................................................. 667
Forecasting with exponential smoothing............................................................................... 670
Mid-range forecasting with exponential smoothing.......................................................... 684
Check your understanding ........................................................................................................... 690
11.4 Handling errors for the moving averages or exponential smoothing forecasts 690
Prediction interval for short and mid-term forecasts ....................................................... 694
Check your understanding ........................................................................................................... 697
11.5 Handling seasonality using exponential smoothing forecasting ............................. 698
Classical decomposition combined with exponential smoothing ................................ 698
Holt-Winters’ seasonal exponential smoothing .................................................................. 702
Check your understanding ........................................................................................................... 713
Chapter summary ................................................................................................................................. 714
Test your understanding ................................................................................................................... 715
Want to learn more?............................................................................................................................ 716
Appendices .................................................................................................................................................. 717
Appendix A Microsoft Excel Functions ........................................................................................ 717
Appendix B Areas of the standardised normal curve............................................................. 724
Appendix C Percentage points of the Student’s t distribution (5% and 1%)................ 725
Appendix D Percentage points of the chi-square distribution ........................................... 726
Appendix E Percentage points of the F distribution ............................................................... 727
Upper 5%................................................................................................................................................. 727
Upper 2.5% ............................................................................................................................................. 728
Upper 1%................................................................................................................................................. 729
Appendix F Binomial critical values.............................................................................................. 730

Appendix G Critical values of the Wilcoxon matched-pairs signed-ranks test............. 731
Appendix H Probabilities for the Mann–Whitney U test ....................................................... 732
Mann–Whitney p-values (n2 = 3) ................................................................................................... 732
Mann–Whitney p-values (n2 = 4) ................................................................................................... 732
Mann–Whitney p-values (n2 = 5) ................................................................................................... 732
Mann–Whitney p-values (n2 = 6) .................................................................................................. 733
Mann–Whitney p-values (n2 = 7) .................................................................................................. 733
Mann–Whitney p-values (n2 = 8) .................................................................................................. 734
Appendix I Statistical glossary ........................................................................................................ 735
Book index ................................................................................................................................................... 751

Preface
The way we teach and learn statistics has not changed since the days before computers
and apps. This cannot be right. Surely the way we use various computing platforms and
associated apps should have an impact on the way we teach statistics. Bringing the two
together is precisely what we have attempted in this textbook.

When it comes to statistics, universities all over the country usually rely on one of two
software platforms: Microsoft Excel or IBM SPSS. Other platforms are also used, but only
by a minority. Furthermore, once our students graduate, almost all of them are, without
a shadow of a doubt, expected to be proficient in Microsoft Excel, the platform of choice
for business, government, and non-profit organisations.

For this reason, we decided to build this textbook around solutions in both Microsoft
Excel and IBM SPSS. Every problem is first explained and then solved using Excel,
followed by the SPSS solution. This is the general approach we follow throughout this
textbook.

The second point we would like to emphasise is why one should bother to learn
statistics at all. If you set aside the technical and mathematical aspects of statistics and
think for a moment, you will realise that by learning statistics, you learn how to:

• Draw conclusions about the whole population based on limited data that you
were given.
• Assign a specific level of confidence to your conclusions.
• Describe and quantify relationships between different phenomena.
• Reduce uncertainty.
• Predict the future by understanding the past.

We are sure you will agree that these are very useful skills. When students are asked
whether they would like to have such skills, they all agree that they are desirable. Yet
when we try to teach them, many lose interest. Why?

This brings us back to the first point. The reason is that the way we teach our students
statistics has not changed for decades. Most statistics courses are structured and taught
as if every student were going to become a statistician. This could not be further from
the truth. They will be accomplished professionals in their chosen area of expertise, and
statistics will be just one of many tools that they need to use to do their jobs properly.

This is where we come to the third point of this preface. By using tools such as Excel
and SPSS, we eliminate the need to learn the technical and mathematical details of the
methodology that underpins statistics. We put the emphasis on:

• What is the problem you are trying to solve?
• Which method can be applied to provide a solution?
• How do you interpret the results that Excel or SPSS produce using this method?

Page | 13
Why do we think that this approach is the right one for business students? The most
fundamental aspect of running any business is decision making. To make a decision we
can follow chance, intuition, or rumour, or we can do something completely different
that is far more likely to lead to success. We can:

• Gain a better understanding of the issue by analysing it.
• Draw the relevant conclusions.
• Decide on how to implement the actions.

This is the essence of any good decision-making process and the foundation of business
management. What we are advocating is that, to make good decisions and manage a
business properly, we must first gain insight into a problem. How do we gain insight,
and what exactly is insight?

Insight is the ability to understand the inner nature of things and clearly discern cause
and effect in a specific context. You could gain insight by talking to experts, or to people
who have done something repeatedly or have been somewhere for a long time.
However, most of the time you do not have the luxury of access to such people, or they
simply do not exist in the field. The only alternative is to apply a methodology that will
effectively turn you into such an expert.

So, the fundamental question here is: what sort of methodology are we talking about?

What kind of ‘science’ can give us the insight into the problems we have never dealt
with before and make us into experts overnight? It sounds like a science fiction movie in
which the main character takes a pill that enables a much larger percentage of his brain
to engage and suddenly gains tremendous insight into virtually any problem. What we
are advocating here is not science fiction, but a simple matter of understanding
statistics. The methodology that will give you an insight into virtually any problem is the
most fundamental toolset from statistics.

Moreover, the methodology that statistics teaches us is universal. It is designed so that,
no matter what the problem, no matter what the context, we will always have a solid
foundation to make the right decision and defend it without bias or arbitrary spin. The
decision becomes easy to defend and can be scrutinised by anyone who wants to
challenge it. Ultimately, they will come to the same conclusion.

Being proficient in statistics will give you a competitive advantage and help you with
employability (i.e. easier to get a job) and transferability (i.e. an ability to move from
one field or industry to another). The important point is that being proficient in
statistics does not imply that you must be an expert in statistics. It is no different to
speaking a foreign language. If you are in a foreign country, it helps if you speak the
language. You do not have to be a linguist and understand the nuances of the
grammatical syntax of the language you are speaking. It suffices to be fluent in it. The
same applies to statistics. You do not have to know how certain equations were derived.
You just need to know how to apply them and how to interpret the results. This alone
will give you a distinct advantage and a head start over those of your colleagues who
have not bothered to equip themselves with this statistical methodology.

Page | 14
Regardless of what your core skills and profession are, a proficiency in statistical
methodology makes you a more insightful individual who will find employment more
easily, stay in the job with more respect and satisfaction, and be able to take advantage
of what other subject fields offer. Not a bad proposition. Therefore, we believe that it is
a good personal investment to learn the skill set found in statistics.

Our ambition was to create a textbook that is practical in nature and focuses on
applications rather than theory. To make it as useful as possible, we used two core
software tools (Excel and SPSS) that are widely used to teach statistics. Excel is very
transferable, as it is a widely used tool throughout a range of organisations, including
businesses, the public sector, and non-profit organisations. In addition, we offer a
wealth of online resources to help those who either struggle with the content or are
keen to learn more. Most of all, as you go through this textbook and learn one chapter
after another, you should feel good about yourself. You will have gained a small
competitive advantage that might make your career prospects brighter than you ever
imagined.

To support this textbook, we have created a companion website containing numerous
resources, depending on whether you are a student or a lecturer. The website is:

https://stats-bus.co.uk/

Once you land on the website, there are two possible options, depending on whether
you are a lecturer or a student.

Lecturers will have a password-protected part of the website that contains:

• All data files with solutions.
• PowerPoint slides for all the lectures.
• Instructor’s manual.
• Multiple choice questions.
• Exam questions.
• A PDF version of the textbook.

Students and any other readers will have free access to the part of the website that
contains:

• Student files (both Excel and SPSS).
• A ‘Want to learn more’ folder, with numerous files expanding on the current content.
• Learning resources, such as:
o Revision tips.
o Multiple choice questions.
• Other resources, such as:
o Introduction to Excel.
o Introduction to SPSS.
o Develop your mathematical skills.
o Factorial experiment workbook.
• Video files, i.e. YouTube-style videos, each up to 5 minutes long, explaining key points.
Page | 15
Lecturers will have to register to gain access to the protected part of the website, whilst
students and general readers do not have to register. However, you can register to
receive updates, corrections and any other improvements or additions to the textbook
and the website.

To contact the two authors of this textbook, just drop an email to:

info@stats-bus.co.uk

We will be glad to assist.

What features do we have?


Throughout the chapters various features are included, such as learning objectives,
examples, and solutions for Excel and SPSS. Every feature is clearly marked with a
symbol, or icon, with the meanings shown in Table 0.1.

Learning objectives for every section | Example | Excel Solution | SPSS Solution | Check your understanding
Chapter summary | Test your understanding | Want to learn more | Glossary | Index

Table 0.1 Book icons

Greek alphabet letters used within this textbook


A list of the Greek letters used in this book is provided in Table 0.2.

Name      Lowercase letter      Name      Lowercase letter
Alpha     α                     Mu        μ
Beta      β                     Pi        π
Chi       χ                     Rho       ρ
Lambda    λ                     Sigma     σ
Table 0.2 Greek letters

Acknowledgements
We are grateful to everyone who granted permission to reproduce copyrighted material
in this textbook.

Every effort has been made to trace the copyright holders and we apologise for any
unintentional omissions.

We would be pleased to include the appropriate acknowledgements in any subsequent
edition of this publication or at reprint stage.

The authors of this textbook would like to thank the following for their agreement to
use software screenshot images and data files:

• Microsoft Excel: Microsoft Excel screenshots, Excel function definitions, and
generation of critical tables used with permission from Microsoft®.
• IBM SPSS Statistics software (‘SPSS’): IBM® SPSS Statistics screenshots of
solutions used throughout all chapters. Reprint courtesy of International
Business Machines Corporation, © International Business Machines Corporation.
SPSS Inc. was acquired by IBM in October 2009.
• Example 8.2 data set: reproduced with permission from Dr Anne Johnston.

We owe a debt of gratitude to our colleague from down under, Anthony Hyden. Our
Australian friend, though not a statistician but a thoroughbred engineer, volunteered to
read the manuscript in its early stages and made numerous suggestions to improve the
readability of the text.

Glyn Davis, Associate Professor, Teesside University Business School, Teesside University

Dr Branko Pecar, Visiting Fellow, University of Gloucestershire

Chapter 1 Data visualisation
1.1 Introduction and learning objectives
In this chapter we shall look at methods to summarise data using tables and charts.

Learning Objectives

On completing this chapter, you will be able to:

1. Understand the different types of data variables that can be used to represent a
specific measurement.
2. Know how to present data in table form.
3. Present data in a variety of graphical forms.
4. Construct frequency distributions from raw data.
5. Distinguish between discrete and continuous data.
6. Construct histograms for equal class widths.
7. Solve problems using Microsoft Excel and IBM SPSS Statistics.

The display of various types of data or information in the form of tables, graphs and
diagrams is quite a common practice these days. Newspapers, magazines, television,
and social media all use these types of displays to try and convey information in an
easy-to-assimilate way. In a nutshell, these forms of display aim to summarise large sets
of raw data so that we can see the ‘behaviour’, or pattern, in the data.

This chapter and the next will introduce a variety of techniques for presenting raw data
in a form that makes sense to people, using both the Microsoft Excel and IBM SPSS
Statistics software packages.

Tables and graphs can be useful tools for helping people to make decisions. As well as
being able to identify clearly what the graph or table is telling us, it is important to
identify what parts of the story are missing. This can help the reader decide what other
information they need, or whether the argument should be rejected because the
supporting evidence is suspect. You will need to know how to critique the data and the
way they are presented. It is worth remembering that a table or graph can mislead,
either by leaving out important information or by being constructed in a way that
distorts relationships. If we have a choice, should we use a table or a
graph? We can use both, but a general guide is that:

1. Tables are generally best if you want to be able to look up specific information or
if the values must be reported precisely.
2. Graphs are best for illustrating trends, ratios and making comparisons.

Figures 1.1 and 1.2 provide examples of a graph and table published within an economic
report by the Office for National Statistics
(https://www.ons.gov.uk/businessindustryandtrade/business/activitysizeandlocation
/bulletins/ukbusinessactivitysizeandlocation/2018).

Figure 1.1 Number of value added tax and/or pay as you earn based businesses,
2013–2018 (Source: Office for National Statistics)

Figure 1.2 Number of value added tax and/or pay as you earn based businesses
by region (thousands), 2016–2018 (Source: Office for National Statistics)

London accounted for the largest number of businesses in March 2018, with 19% of the
UK total. The region with the next largest share of businesses was the South East, with
15.2%.

1.2 What is a variable?


We collect data either from published material (books, statistical bulletins, etc.) or by
conducting some form of survey. Regardless of the source of our data, we are usually
interested in one specific measured characteristic. A good example is the height of 1000
subjects. This measured characteristic, or attribute, which differs from subject to
subject, is called a variable. A variable is a symbolic name that has a value associated
with it. If we had 1000 subjects, as in our example, then there would be 1000 values
associated with this single variable. You can also think of a variable as a symbol that has
a series of datapoints associated with it.

These datapoints are also called observations. So, the 1000 values associated with a variable called height also represent 1000 observations. However, these observations (or datapoints) do not have to be numbers, which means that variables can be of different types. Variables are usually divided into quantitative (continuous or discrete variables) and qualitative (nominal or ordinal variables), as indicated in Figure 1.3.

Figure 1.3 Quantitative versus qualitative variables

Quantitative variables always consist of numerical data, for example, the average height
of a person or the time required to finish the 800 m race by an athlete. In later chapters,
we will use statistical techniques that will be dependent upon whether the data
measured are quantitative or qualitative. Qualitative (non-metric) variables describe
some quality of the item being measured without measuring it. For example, when
describing the colour of the sky or the finish position of an athlete in an 800 m race.

Let us look at an example. If a group of business students were asked to name their
favourite video game, then the variable would be qualitative. If the time spent playing a
game was measured, then the variable would be quantitative. Both quantitative and
qualitative variables can have data measured at different scales, as shown in Figure 1.3.
Let us explore different scales.

Quantitative variables: interval and ratio scales

If one unit on the scale represents the same magnitude of the characteristic being
measured across the whole range of the scale, then we call this an interval measurement
scale. For example, we can attempt to measure student stress levels on an interval scale.
In this case, a difference between a score of 5 and a score of 6 would represent the same
difference in anxiety as would a difference between a score of 9 and a score of 10.
However, interval scales do not have a ‘true’ zero point; and therefore, it is not possible
to make statements about how many times higher one score is than another. For the
stress measurement, it would not be valid to say that a person with a score of 6 was
twice as anxious as a person with a score of 3.

Page | 20
Ratio scales, on the other hand, are very similar to interval scales, except that they have
true zero points. For example, a height of 2 m is twice as much as 1 m. Interval and ratio
measurements are also called continuous variables. Table 1.1 summarises the different
measurement scales with examples provided of these different scales.

Qualitative variables: categorical data (nominal and ordinal scales)

Qualitative variables do not use numbers in a mathematical sense, but instead use labels to put data into categories.

If the scale used happens to be nominal, this means that the variable consists of groups or categories to which every observation is assigned. No quantitative information is conveyed, and no ordering of the observations is implied. Football club allegiance, sex or gender, degree type, and courses studied are all examples of nominal scales. To show how often each value on the scale occurs, we use frequency distributions.

If the categories of the data can be placed in order of size, then the data are classed as
ordinal. Measurements with ordinal scales are ordered in the sense that higher numbers
represent higher values. However, the intervals between the numbers are not
necessarily equal. For example, on a five-point rating scale measuring student
satisfaction, the difference between a rating of 1 (‘very poor’) and a rating of 2 (‘poor’)
may not be the same as the difference between a rating of 4 (‘good’) and a rating of 5
(‘very good’). The lowest point on the rating scale in the example was arbitrarily chosen
to be 1, and this scale does not have a ‘true’ zero point.

Measurement scale   Recognising a measurement scale

Interval data       1. Ordered, constant scale, with no natural zero, e.g. temperature, dates.
                    2. Differences make sense, but ratios do not, e.g. temperature differences.
Ratio data          1. Ordered, constant scale, with a natural zero, e.g. length, height, weight, and age.
Nominal data        1. Classification data, e.g. male or female, red or blue hat.
                    2. Arbitrary labels, e.g. m (male) or f (female), 0 or 1.
                    3. No ordering, e.g. it makes no sense to state that m > f.
Ordinal data        1. Ordered list, e.g. customer satisfaction scale of 1, 2, 3, 4, and 5.
                    2. Differences between values are not important, e.g. political parties can be
                       given labels far left, left, mid, right, far right, etc.
Table 1.1 Examples of measurement scales

A scale also implies that the data it measures can be either continuous or discrete. Let us
see what we mean by this.

Discrete and continuous data types

The data that a variable consists of can exist in two forms: discrete and continuous. Discrete data occur as integers (whole numbers), for example, 1, 2, 3, 4, 5, 6, etc. Continuous data can take any value and can be recorded to any level of accuracy; for example, the number of miles travelled could be 110.5 or 110.52 or 110.524, etc. Note that whether data are discrete or continuous does not depend upon how they are collected, but on how they occur. Thus height, distance and age are all examples of continuous data, although they may be presented as whole numbers. Most of the time, the scale of the data (both discrete and continuous) is very wide, and we want to put the data into groups, or classes. This means that every class needs boundaries.

Class limits are the extreme boundaries. When we start creating frequency distributions
(further below), the class limits are called the stated limits. Two common types are
illustrated in Table 1.2.

A B
5 - under 10 5 - 9
10 - under 15 10 - 14
Table 1.2 Examples of class limits

We must make sure that there are no gaps between classes. To help place data in their appropriate class, we use what are known as true limits, or mathematical limits. True or mathematical limits are determined depending on whether we are dealing with continuous or discrete data. Table 1.3 indicates how these limits may be defined.

                          MATHEMATICAL LIMIT
    STATED LIMIT          DISCRETE       CONTINUOUS
A   5 - under 10          5 - 9          5 - 9.999999'
    10 - under 15         10 - 14        10 - 14.999999'
B   5 - 9                 5 - 9          4.5 - 9.5
    10 - 14               10 - 14        9.5 - 14.5
Table 1.3 Example of mathematical limits

Why are true limits so important? Suppose the data are continuous. With stated limits in style A, a value of 9.9 would be placed in the class “5 – under 10”. With stated limits in style B, it would be placed in the class “10 – 14”, since 9.9 lies closer to 10 than to 9. This means that the class width is very important. Using the true or mathematical limits, the width of a class can be found. If CW is the class width, UCB the upper class boundary, and LCB the lower class boundary, then the class width is calculated using equation (1.1).

CW = UCB – LCB (1.1)

If, for example, the true limits are 0.5–1.5, 1.5–2.5, etc., then the class width is 1.5 – 0.5 =
1 or 2.5 – 1.5 = 1. Or, if the true limits are 399.5–419.5, 419.5–439.5, then the class
width is 419.5 – 399.5 = 20 or 439.5 – 419.5 = 20.

At the ends of a distribution, open-ended classes can be used as a catch-all for extreme values, for example: up to 40, 40–50, 50–60, …, 100 and over. Deciding how many classes to use is subjective and there are no strict rules about it. However, the following should be taken into consideration:

Page | 22
a. Use between 5 and 12 classes. The actual number will depend on the size of the
sample and minimising the loss of information.
b. Class widths are easier to handle if in multiples of 2, 5 or 10 units.
c. Although not always possible, try and keep classes at the same widths within a
distribution.

As a guide, equation (1.2) can be used to calculate the class width given the number of classes and the highest and lowest values:

Class width = (Highest value – Lowest value) / Number of classes     (1.2)

For example, if the highest value is 309.5, the lowest value is 189.5, and we want 6 classes of equal size, then from equation (1.2):

Class width = (309.5 – 189.5) / 6 = 120 / 6 = 20

With 6 classes and the given highest and lowest values, each class should be 20 units wide.
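
For readers who also work outside Excel and SPSS, the same calculation can be sketched in a few lines of Python. This is purely illustrative and not part of the textbook's Excel/SPSS workflow; the values are those from the worked example above.

```python
# Minimal illustrative sketch: applying equation (1.2) to find the class width,
# then listing the resulting class boundaries.
highest, lowest, n_classes = 309.5, 189.5, 6

class_width = (highest - lowest) / n_classes          # (309.5 - 189.5) / 6 = 20.0
boundaries = [lowest + i * class_width for i in range(n_classes + 1)]

print("Class width:", class_width)                    # 20.0
print("Class boundaries:", boundaries)                # 189.5, 209.5, ..., 309.5
```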

1.3 Tables
A table is the most basic arrangement of data into rows and columns. Apart from taking up less room than prose, a table enables figures to be located more quickly. It makes comparisons between different classes easy and may reveal patterns which cannot otherwise be deduced. The simplest form of table indicates the frequency of occurrence of objects within several defined categories.

Simple tables

Tables come in a variety of formats, from simple tables to frequency distributions. When
creating a table, the following principles should be followed:

a. When a secondary data source is used it is acknowledged.


b. The title of the table is given.
c. The total of the frequencies is given.
d. When percentages are used for frequencies this is indicated together with the
sample size, n.

Example 1.1

355 undergraduate business students were asked which of the school's master's degrees they planned to stay on and study; the results were as follows:

• 120 MSc International Management,


• 24 MSc Business Analysis,
• 80 MA Human Resource Management,
• 45 MSc Project Management,

Page | 23
• 86 MSc Digital Business.

We can put this information in table form, indicating the frequency within each category
either as a raw score or as a percentage of the total number of responses.

Number of undergraduate students entering differing master’s courses
(Source: School survey August 2020)

Course                              Frequency     Frequency %
MSc International Management       120           34%
MSc Business Analysis               24            7%
MA Human Resource Management        80            23%
MSc Project Management              45            13%
MSc Digital Business                86            24%
Total                              355           100%

Table 1.4 Type of master’s course chosen to study

Sometimes categories can be subdivided, and tables can be constructed to convey this
information together with the frequency of occurrence within the subcategories. For
example, Table 1.5 indicates the frequency of the number of hard disks sold in a shop or
online, with the sales split by month.

Example 1.2

Table 1.5 illustrates further subdivisions of categories.

Quick Computers Ltd Number of hard disks sold between shop and online
Month January March May July September November Total
Shop 23 56 123 158 134 182 11750
Online 64 145 423 400 350 409 7533
Total 87 201 546 558 484 591 19283
Table 1.5 Number of hard disks sold in shop vs. online

Another example of how categories may also be displayed is given in Table 1.6, showing
the time spent online for a sample of 560 adults.

Example 1.3

Table 1.6 shows the tabulated results from a survey undertaken to measure the time
spent online.

Page | 24
                                Young person            18 - 24      Over 24
                                (up to 18 years old)    years old    years old    Totals
Less than 15 hours per week      45                      50           18           113
15 - 30 hours per week           62                      82           54           198
More than 30 hours per week      81                     102           66           249
Totals                          188                     234          138           560
Table 1.6 Time spent online

Frequency distribution

We have already mentioned frequency distributions, so let us learn more about this important type of table.

When data are collected by survey, or by some other means, we initially have a set of unorganised raw data which, when viewed, conveys little information. A first step is to organise the set into a frequency distribution. By doing so, we create a table that groups similar values together and shows the frequency of occurrence of each value or group of values.

Example 1.4

Consider the number of telephone calls per day received by a mobile phone shop from customers enquiring about the new iPhone SE over a period of 92 days in the summer of 2020.

1 5 2 5 3
4 1 3 4 5
3 4 1 3 4
5 3 5 5 4
4 5 4 2 3
1 4 3 5 2
4 3 5 4 5
3 5 4 3 4
2 1 5 5 5
4 5 3 5 5
5 3 2 3 3
3 4 5 5 2
4 2 4 5
1 3 5 4
4 4 3 3
3 1 5 5
5 5 4 5
2 3 5 5
4 5 1 2
3 4 3 4
Table 1.7 Number of calls received over a period of 92 days

Page | 25
The frequency distribution shows on how many days we received 1 call per day, 2 calls per day, 3 calls per day, and so on. If you scan the data set you will find that the minimum and maximum number of calls per day are 1 and 5, respectively. We therefore wish to create a tally chart and then a frequency distribution for the number of calls per day.

Write down the range of values from lowest (1) to the highest (5) then go through the
data set recording each score in the table with a tally mark. It is a good idea to cross out
figures in the data set as you go through it to prevent double counting. Table 1.8
illustrates the frequency distribution for this data set.

Number of telephone
calls per day      Tally                               Frequency (f)
1                  |||| |||                            8
2                  |||| ||||                           9
3                  |||| |||| |||| |||| ||              22
4                  |||| |||| |||| |||| |||             23
5                  |||| |||| |||| |||| |||| ||||       30
Table 1.8 Frequency distribution

Excel solution

Figure 1.4 illustrates the data entered in cells C4:C95 (the first 20 observations are shown).

You will observe from the Excel solution that we calculate from the data set the
minimum and maximum values:

Minimum value of x = 1

Maximum value of x = 5

This enables the frequencies to be calculated and therefore the frequency distribution
identified for this data set (C4:C95) using the =COUNTIF() Excel function.

Page | 26
Figure 1.4 Excel solution

Observe that the Excel frequency distribution is the same as that given in Table 1.8.
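
If you prefer to check the result in Python rather than Excel, a frequency distribution can be built with collections.Counter, which plays the same role as the =COUNTIF() approach above. This is an illustrative sketch only; the list below holds just the first few observations from Table 1.7 and should be replaced with all 92 values.

```python
# Illustrative sketch (outside the textbook's Excel/SPSS workflow): building a
# frequency distribution with Python's collections.Counter.
from collections import Counter

# First few observations from Table 1.7; replace with the full 92-day data set.
calls_per_day = [1, 5, 2, 5, 3, 4, 1, 3, 4, 5]

frequency = Counter(calls_per_day)
for value in sorted(frequency):
    print(f"{value} call(s) per day occurred on {frequency[value]} day(s)")
```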

SPSS solution

Input data into SPSS

Figure 1.5 illustrates the first 20 values.

Figure 1.5 Example 1.4 SPSS data

We can create the frequency distribution using the SPSS Frequencies method
illustrated below.

Select Analyze > Descriptive Statistics > Frequencies.

Page | 27
Transfer Number_calls_day into the Variables(s) box.

Figure 1.6 SPSS frequencies menu

Click OK

SPSS output

Figure 1.7 SPSS solution

In this example, there were relatively few distinct values. However, had we increased our survey period to one year, the number of calls per day might have ranged between 0 and 100+.

Since our aim is to summarise information, we may then find it better to group the values into classes to form a grouped frequency distribution. The next example illustrates this point.

Example 1.5

Table 1.9 illustrates the distance travelled in miles by 100 FG delivery vans per day.

Page | 28
Data
499 526 456 545 426 561 501 495 579
501 489 528 505 522 566 497 523 526
553 451 509 507 582 574 466 514 554
455 457 537 493 528 459 482 472 472
452 481 440 474 483 582 477 594 496
506 460 556 486 484 535 528 517 500
533 482 400 526 555 534 450 552 469
526 470 588 485 434 469 499 471 519
492 484 541 496 500 461 529 488 545
514 501 492 480 534 534 476 513 496
Table 1.9 Distance travelled by delivery vans per day

This mass of data conveys little in terms of information. Because there are too many distinct values, putting the data into an ungrouped frequency distribution would not provide an adequate summary. Grouping the data, however, leads to Table 1.10.

Class Tally Frequency


400 - 449 |||| 4
450 - 499 |||| |||| |||| |||| |||| |||| |||| |||| |||| 44
500 - 549 |||| |||| |||| |||| |||| |||| |||| | 36
550- 599 |||| |||| |||| 15
600 - 649 | 1
Table 1.10 Grouped frequency distribution data for Example 1.5 data

For this example, we have created five classes with each class interval of the same value.

The lower and upper class boundaries are:

• 399.5
• 449.5
• 499.5
• 549.5
• 599.5
• 649.5

With corresponding class widths of 50:

• 449.5 – 399.5 = 50
• 499.5 – 449.5 = 50
• 549.5 – 499.5 = 50
• 599.5 – 549.5 = 50
• 649.5 – 599.5 = 50
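
As an optional aside, the same grouping can be reproduced in Python with NumPy, using the true class boundaries listed above. This is a sketch for illustration only; the list below contains just the first few distances from Table 1.9.

```python
# Illustrative sketch: grouping distances into the Example 1.5 classes with NumPy.
import numpy as np

# First few distances from Table 1.9; replace with the full 100-value data set.
distances = [499, 526, 456, 545, 426, 561, 501, 495, 579, 501]

bin_edges = np.arange(399.5, 650.5, 50)        # 399.5, 449.5, ..., 649.5
frequencies, _ = np.histogram(distances, bins=bin_edges)

for lower, upper, f in zip(bin_edges[:-1], bin_edges[1:], frequencies):
    print(f"{lower + 0.5:.0f} - {upper - 0.5:.0f}: {f}")
```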

Page | 29
Excel solution

Step 1 Input data into cells C6:C105 (first 20 data values illustrated)

Figure 1.8 Example 1.5 Excel dataset

Step 2 Use Excel Analysis ToolPak solution to create a frequency distribution.

Excel can construct grouped frequency distributions from raw data by using Analysis
ToolPak. Before we use this add-in, we must input the lower- and upper-class
boundaries into Excel.

Excel calls this the bin range.

In this example we have decided to create a bin range that is based upon equal class
widths. Therefore, the Excel bin range values will be:

• 399.5
• 449.5
• 499.5
• 549.5
• 599.5
• 649.5

We can now use Excel to calculate the grouped frequency distribution.

We put the bin range values in cells E6:E11 (with the label in cell E5) as illustrated in
Figure 1.9.

Page | 30
Figure 1.9 Excel bin range values

Now create the histogram.

Select Data.
Select Data Analysis menu.

Figure 1.10 Excel Data > Data Analysis menu

Click on Histogram.

Figure 1.11 Excel Data Analysis menu

Click OK.

Enter Input Range: C5:C105.


Enter Bin Range: E5:E11.
Click on Labels
Choose location of Output Range: G5.

Page | 31
Figure 1.12 Excel histogram

Click OK.

Excel will now print out the grouped frequency table (bin range and frequency of
occurrence) as presented in cells G5:H11.

Figure 1.13 Excel solution

The grouped frequency distribution would now be as illustrated in Table 1.11.

Bin Range Frequency


399.5 0
449.5 4
499.5 44
549.5 36
599.5 15
649.5 1
More 0
Table 1.11 Bin and frequency values

Page | 32
From this table we can now create the grouped frequency distribution as illustrated in
Table 1.12.

Class Frequency
400 - 449 4
450 - 499 44
500 - 549 36
550- 599 15
600 - 649 1
Table 1.12 Grouped frequency distribution

SPSS solution

Input data into SPSS (the first 20 values are illustrated in Figure 1.14).

We have called the SPSS column Distance_travelled.

Figure 1.14 Example 1.5 SPSS data

Transform the data into a grouped frequency distribution using SPSS Visual Binning

Select Transform > Visual Binning.

Transfer distance travelled into the Variables to Bin box (Figure 1.15).

Page | 33
Figure 1.15 SPSS Visual Binning menu

Click Continue: this gives you the Visual Binning dialog.

Type Distance_travelled_cat into the Binned Variable box.


Type Distance_travelled (Binned) into the Label box.

Figure 1.16 SPSS Visual Binning continued

Click the Excluded button.

Page | 34
Figure 1.17 SPSS Visual Binning Excluding end points

Click on Make Cutpoints button

Enter First cutpoint Location: 400


Enter Width = 50

Note that when you click out of the Width box, the Number of Cutpoints is calculated by SPSS and the value 5 is entered in the box, as illustrated in Figure 1.18.

Figure 1.18 Make Cutpoints

Click Apply

You will find that Figure 1.16 will now change such that the values are provided as
illustrated in Figure 1.19.

Page | 35
Figure 1.19 SPSS class Cutpoints

Now generate the Labels

Click on the Make Labels button

Figure 1.20 Class values and labels

Click OK

SPSS warns you that the automatically generated labels are no longer correct.

Figure 1.21 SPSS warning – creating a new variable in SPSS data file

Once you are finally satisfied with the binning, click OK.

This will create a new variable within your SPSS data file (the first 20 of the 100 values are shown in Figure 1.22).
Page | 36
Figure 1.22 SPSS data file with binned variable

Now, create the grouped frequency table

Select Analyze > Descriptive Statistics > Frequencies

Transfer Distance_travelled_cat into the Variable(s) box.

Figure 1.23 SPSS Frequencies menu

We can ask SPSS to create the histogram if required for this grouped frequency
distribution.

Click on Charts
Choose Histograms

Please note that histograms are discussed in more detail later in this chapter.
Page | 37
Click OK

SPSS output

Figure 1.24 SPSS solution

Observe that this result is the same as both the Excel and manual solutions.
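
For completeness, the effect of SPSS Visual Binning, creating a new categorical variable that records which class each observation falls into, can be sketched in Python with pandas.cut. This is illustrative only and uses just a few of the Table 1.9 distances; the class labels are the stated limits used above.

```python
# Illustrative sketch: a binned variable analogous to Distance_travelled_cat.
import pandas as pd

# A few distances from Table 1.9 for illustration; use the full data set in practice.
distances = pd.Series([499, 526, 456, 545, 426, 561, 501, 495, 579, 501])

boundaries = [399.5, 449.5, 499.5, 549.5, 599.5, 649.5]
labels = ["400-449", "450-499", "500-549", "550-599", "600-649"]

binned = pd.cut(distances, bins=boundaries, labels=labels)
print(binned.value_counts().sort_index())      # grouped frequency distribution
```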

Creating a crosstab table using Excel PivotTable

A cross tabulation (or crosstab) is a table that shows the relationship between variables; Excel's tool for building one is the PivotTable. A pivot table can organise and summarise large amounts of data. If you collect raw data and store the data in Excel, the convention is that every column represents a variable and every row represents an observation. You might like to see and present how different variables are related to each other and how different observations are distributed across these variables. To do this, you can create a pivot table. A pivot table can also filter the data to display just the details for areas of interest.

Once you have a pivot table, you can present the content in form of a chart. Details on
creating a pivot chart are set out later in this section. The source data can be:

1. An Excel worksheet, a database/list, or any range that has labelled columns.


2. A collection of ranges to be consolidated; the ranges must contain both labelled
rows and columns.
3. A database file created in an external application such as Access.

Page | 38
Note that the data in a pivot table cannot be changed as they are the summary of other
data. However, the data set itself (raw data from the spreadsheet) can be changed and
the pivot table recalculated thereafter. Formatting changes (bold, number formats, etc.)
can be made directly to the pivot table data. The general rule is that you need more than
two criteria of data to work with, otherwise you have nothing to pivot. Figure 1.25
depicts a typical pivot table where we have tabulated department with the product
required per trimester. Notice the black down-pointing arrows in the pivot table. In
Figure 1.25, the rows represent the department and the columns represent the product.

Figure 1.25 Example of an Excel pivot table

We could click on a department and view the number of a product type per trimester.
But Excel does most of the work for you and puts in those drop-down boxes as part of
the wizard. In the example, we can see that the total number of toners required across
all departments for all trimesters was 134.
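
Outside Excel and SPSS, the same kind of summary can be sketched with pandas.pivot_table. The data frame below is a small made-up illustration: the column names Department, Product and Number, and all the values, are assumed for the sketch and are not the Example 1.6 data set.

```python
# Illustrative sketch: a pivot table (crosstab) summing quantities by
# department (rows) and product (columns), with grand totals.
import pandas as pd

data = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR", "Finance", "Finance"],
    "Product":    ["Paper", "Toner", "Paper", "Toner", "Paper", "Toner"],
    "Number":     [10, 4, 7, 3, 12, 5],
})

pivot = pd.pivot_table(data, values="Number", index="Department",
                       columns="Product", aggfunc="sum", margins=True)
print(pivot)
```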

Example 1.6

This example consists of a set of data that has been collected to measure the
departmental product requirements (paper, toner) across the 3 trimesters.

Figure 1.26 Example 1.6 Excel dataset

Excel solution

Select Insert > PivotTable.

Page | 39
The PivotTable wizard will walk you through the process of creating an initial
PivotTable.

Select PivotTable as illustrated in Figure 1.27.

Figure 1.27 Selecting Excel PivotTable

In the Create PivotTable menu, input the cell range for the data table and where you want the PivotTable to appear.

Select a table: D4:G20.

Choose to insert the PivotTable in Existing Worksheet in cell D25.

Figure 1.28 illustrates the Create PivotTable menu.

Figure 1.28 Create Excel PivotTable menu

Click OK

Excel creates a blank PivotTable and the user must then drag and drop the various fields
from the items. The resulting report is displayed ‘on the fly’ as illustrated in Figure 1.29.

Page | 40
Figure 1.29 Excel dataset with blank PivotTable

The PivotTable (cells D25:F42) will be populated with data from the data table in cells D4:G20 once the PivotTable Fields list, located at the right-hand side of the worksheet, has been completed. For example, choose to select Department, Product, and Number budget as illustrated in Figure 1.30.

Figure 1.30 Excel PivotTable fields

Page | 41
1. From the Field List, drag the fields with the data you want to display in rows to
the area on the PivotTable diagram labelled Drop Row Fields Here or into the
Row Labels box.
2. Drag the fields with the data you want to display in columns to the area labelled
Drop Column Fields Here or into the Column Labels box.
3. Drag the fields that contain the data you want to summarize to the area labelled
Drop Value Fields Here or into the Values box. Excel assumes Sum as the
calculation method for numeric fields and Count for non-numeric fields.
4. If you drag more than one data field into rows or into columns, you can re-order
them by clicking and dragging the columns on the PivotTable itself or in the
boxes.
5. To rearrange the fields at any time, simply drag them from one area to another.
6. To remove a field, drag it out of the PivotTable report or untick it in the Field
List. Fields that you remove remain available in the field list.

Figure 1.31 Select PivotTable variable for table

The completed PivotTable for this problem is shown in Figure 1.32.

Figure 1.32 Excel PivotTable

SPSS solution

Input data into SPSS

Page | 42
Figure 1.33 SPSS data

Select Analyze > Tables > Custom Tables

Figure 1.34 SPSS Custom Table

Page | 43
Figure 1.35 SPSS Custom tables warning

Click OK.

Transfer Department from Variables to Rows box


Transfer Product from Variables to Columns box
Transfer Number from Variables to table cells labelled ‘nnnn’

Figure 1.36 Create table in SPSS Custom Tables

Change Mean to Sum

Page | 44
Figure 1.37 Change summary statistics data type

Click Apply to Selection button

Figure 1.38 Summary statistic now changed to Sum from Mean

Click OK

SPSS output

Page | 45
Figure 1.39 SPSS table solution

This solution agrees with the Excel solution in Figure 1.32.

Summarising the principles of table construction

From the above, we can conclude that when constructing tables good principles to be
adopted are as follows:

a. Aim at simplicity.
b. The table must have a comprehensive and explanatory title.
c. The source should be stated.
d. Units must be clearly stated.
e. The headings to columns and rows should be unambiguous.
f. Double counting should be avoided.
g. Totals should be shown where appropriate.
h. Percentages and ratios should be computed and shown where appropriate.
i. Overall, use your imagination and common sense.

Check your understanding

X1.1 Criticise Table 1.13.

Castings Weight of metal Foundry


Up to 4 ton 60 210
Up to 10 ton 100 640
All other weights 110 800
Other 20 85
Total 290 2000
Table 1.13 Weight of metal

X1.2 Table 1.14 shows the number of customers visited by a salesman over an 80-
week period. Use Excel to construct a grouped frequency distribution from the
data set and indicate both stated and mathematical limits (start at 50–54 with
class width of 5).

Page | 46
68 64 75 82 68 60 62 88 76 93 73 79 88 73 60 93
71 59 85 75 61 65 75 87 74 62 95 78 63 72 66 78
82 75 94 77 69 74 68 60 96 78 89 61 75 95 60 79
83 71 79 62 67 97 78 85 76 65 71 75 65 80 73 57
88 78 62 76 53 74 86 67 73 81 72 63 76 75 85 77
Table 1.14 Number of customers

1.4 Graphs - visualising data


Once the data have been tabulated, we might like to display them graphically, choosing a suitable graph or chart from a variety of visual display methods.

In this section we will explore bar charts, pie charts, histograms, frequency polygons,
scatter plots, and time series plots. The type of graph we will use to visualise the data
depends upon the type of variable we are dealing with within our data set. When
describing scales, we classified data as interval, ratio, nominal and ordinal. Different
types of data will use different graphs to visualise, as per Table 1.15.

Data type Which graph to use?


Interval and • Histogram and frequency polygon.
ratio • Cumulative frequency curve (or ogive).
• Scatter plots.
• Time series plots.
Nominal or • Bar chart.
category • Pie chart.
• Cross tab tables (or contingency tables).
Ordinal • Bar chart.
• Pie chart.
• Scatter plots.
Table 1.15 Deciding which graph type given data type

‘Graph’ and ‘chart’ are terms that are used here interchangeably to refer to any form of
graphical display.

Bar charts

Bar charts are very useful in providing a simple pictorial representation of several sets of data on one graph. They are used for categorical data, where each category is represented by a vertical (or horizontal) bar and the frequency is represented by the height (or length) of that bar. All bars should have equal width, and the distance between bars is kept constant. It is important that both axes (X and Y) are labelled and that the chart has an appropriate title. What each bar represents should be clearly stated within the chart.

Page | 47
Example 1.7

Figure 1.40 shows a component bar chart for the number of bus tickets sold for route A
and route B.

Figure 1.40 Number of bus tickets sold

Example 1.8

Consider the categorical data in Example 1.1 which represents the number of
undergraduate students choosing to enter master’s courses.

Excel can be used to create a bar chart to represent this data set. For each category, a vertical bar is drawn, with its height representing the number of students in that category (the frequency); all bars have the same width and are equally spaced. Each bar represents the number of students who would choose a course.

From the bar chart, you can easily detect the differences of frequency between the five
categories:

1. MSc International Management


2. MSc Business Analysis
3. MA Human Resource Management
4. MSc Project Management
5. MSc Digital Business

Figure 1.41 shows the bar chart for the proposed postgraduate taught course chosen by
the undergraduate students.

Page | 48
Figure 1.41 Proposed postgraduate taught course
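
As an optional aside, the same bar chart can be sketched in Python with matplotlib, using the Example 1.1 frequencies. This is illustrative only and is not part of the Excel/SPSS procedures described in this chapter.

```python
# Illustrative sketch: a bar chart of the Example 1.1 course frequencies.
import matplotlib.pyplot as plt

courses = ["MSc Int. Mgmt", "MSc Bus. Analysis", "MA HRM",
           "MSc Proj. Mgmt", "MSc Digital Bus."]
frequencies = [120, 24, 80, 45, 86]

plt.bar(courses, frequencies)
plt.ylabel("Number of students")
plt.title("Proposed postgraduate taught course")
plt.xticks(rotation=30, ha="right")            # tilt the long category labels
plt.tight_layout()
plt.show()
```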

Example 1.9

If you are interested in comparing totals, then a component bar chart (or stacked chart)
is constructed. Figure 1.42 shows a component bar chart for the sales of hard disks. In
this component bar chart, you can see the variation in total sales from month to month,
and the split between the shop sales and online sales per month.

Figure 1.42 Excel component bar chart

Page | 49
Creating a bar chart using Excel and SPSS

Excel and IBM SPSS Statistics enable you to easily tabulate and graph your raw data. To
illustrate the application of Excel and SPSS we shall use the following data example that
provides data on 400 employees working at a local engineering firm.

Example 1.10

Figures 1.43 and 1.44 represent the Excel and SPSS data views for the first 10 records
out of a total of 400 records (complete data sets are available via the book website). The
variables identified in the screenshot are:

1. ID – employee ID from 1 to 400


2. Gender – male or female
3. Employment category – job category (1 = trainee, 2 = junior manager, 3 =
manager)
4. Current salary (£)
5. Starting salary (£).

Figure 1.43 Example 1.10 Excel data view

Figure 1.44 Example 1.10 SPSS data view

Use Excel and SPSS to create a bar chart to represent employment category.

Page | 50
Excel solution

Step 1 Input data series

ID: cells B5:B404


Gender: C5:C404
Employment category: D5:D404
Current salary: E5:E404
Starting salary: F5:F404.

Figure 1.45 Example 1.10 Excel data set

Step 2 Create the frequency distribution

Figure 1.46 Creating Excel frequency distribution

Step 3 Create the bar chart

Highlight H10:I13 (includes labels in cells H10 and I11).

Page | 51
Figure 1.47 Highlight bar chart cell range

Step 4 Select Insert > Insert Column or Bar Chart.

Figure 1.48 Excel bar chart options

Choose first option

This will result in the graph illustrated in Figure 1.49

Figure 1.49 Excel bar chart

Step 5 Edit the chart

Page | 52
We can now edit the bar chart to remove the frequency (f) from the x-axis and
add titles.

Right-click on bar chart in Excel and choose Select Data.

Click on variable labelled ‘x’

Figure 1.50 Select Data Source and remove variable ‘x’

Choose Remove ‘x’ (Figure 1.51).

Figure 1.51 Variable ‘x’ removed

Click OK

Page | 53
Figure 1.52 Excel bar chart

Now, add bar chart and vertical axis titles.

Select in the far-left menu Add Chart Element.

Figure 1.53 Add Chart Element

Now click on the drop-down menu (Figure 1.54) and choose the option you would like
to modify.

Figure 1.54 Add Chart Element options

For this example, we have added a chart title, axis titles, and placed the frequency values
in each bar as illustrated in Figure 1.55.

Page | 54
Figure 1.55 Final Excel bar chart

Changing the column colours

To change the bar colours, select each bar in turn, right-click and select Format Data
Point > Solid Fill > choose colour. Repeat this for the other bars. When each bar has a
unique colour the chart legend will list each of the bar titles with their respective
colours as illustrated in Figure 1.56.

Figure 1.56 Modified Excel bar chart

SPSS solution

Figure 1.57 shows the SPSS data view of the first 10 records out of a total of 400 records
taken from the SPSS data file.

Page | 55
Figure 1.57 Example 1.10 SPSS data

The variables identified in the screenshot represent:

1. ID – employee ID from 1 to 400


2. Gender – male or female
3. EmpCategory – job category (1 = trainee, 2 = junior manager, 3 = manager)
4. CurrentSalary – current salary (£)
5. StartSalary – starting salary (£).

When SPSS carries out an analysis (calculating statistics or creating graphs) it will create a separate file, called an SPSS output file, to store the outcome of the analysis.

Create a bar chart for employment category.

Select Graphs > Legacy Dialogs > Bar (Figure 1.58).

Select Simple

Figure 1.58 SPSS Bar charts menu

Select Define

Click on Bars Represent N of cases and click on arrow to transfer to the Variable box.
Transfer Employment Category variable to the Category Axis box (Figure 1.59).

Page | 56
Figure 1.59 Define Simple Bar summaries for groups of cases

Click on Titles

Type in Bar chart for employment category (Figure 1.60).

Figure 1.60 Add title

Click Continue.

Click OK: the output is shown in Figure 1.61.

SPSS output

Page | 57
Figure 1.61 SPSS solution

Check your understanding

X1.3 Draw a suitable bar chart for the data in Table 1.16.

Industrial Sources for Consumption and Investment Demand (thousand


million)
Producing Industry Consumption Investment
Agriculture, mining 1.1 0.1
Metal manufacturers 2.0 2.7
Other manufacturing 6.8 0.3
Construction 0.9 2.7
Gas, electricity & water 1.2 0.2
Services 16.5 0.8
Total 28.5 7.8
Table 1.16 Consumption and investment demand

Pie charts

In a pie chart the relative frequencies are represented by slices of a circle. Each section
represents a category, and the area of a section represents the frequency or number of
objects within a category. They are particularly useful in showing relative proportions,
but their effectiveness tends to diminish for more than eight categories.

Example 1.11

Consider the proposed postgraduate course data from Example 1.1, illustrated in Table 1.17.

Page | 58
Number of undergraduate students entering differing master's courses
(Source: School survey August 2020)

Course                              Frequency
MSc International Management        120
MSc Business Analysis               24
MA Human Resource Management        80
MSc Project Management              45
MSc Digital Business                86
Total                               355
Table 1.17 Proposed postgraduate course

These data can then be represented by a pie chart.

Figure 1.62 represents a pie chart for proposed postgraduate course.

Figure 1.62 Pie chart for proposed postgraduate course

We can see that the different slices of the circle represent the different postgraduate course choices. To make a pie chart, you need to calculate the angle of the slice representing each course.

From the table, the total number of students is 120 + 24 + 80 + 45 + 86 = 355. Given that a circle contains 360°, we can calculate how many degrees represent each student: 360° corresponds to 355 students, so each student is represented by 360/355 degrees. Based upon this calculation we can now calculate the angle for each course category (see Table 1.18).

Course                              Frequency    How to calculate the angle    Angle (degrees, 1 d.p.)
MSc International Management        120          = 120 * (360/355)             121.7
MSc Business Analysis               24           = 24 * (360/355)              24.3
MA Human Resource Management        80           = 80 * (360/355)              81.1
MSc Project Management              45           = 45 * (360/355)              45.6
MSc Digital Business                86           = 86 * (360/355)              87.2
Total                               355                                        360
Table 1.18 Calculation of pie chart angles

The size of each slice (sector) depends on the angle at the centre of the circle which in
turn depends upon the number in the category the sector represents. Before drawing
the pie chart, you should always check that the angles you have calculated sum to 360°.
A pie chart may be constructed on a percentage basis, or the actual figures may be used.
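
The angle calculation in Table 1.18 is easy to reproduce programmatically. The sketch below is illustrative only (Python, not the Excel/SPSS workflow) and uses the frequencies from Table 1.17; it also checks that the angles sum to 360°.

```python
# Illustrative sketch: pie chart angles for Table 1.18.
frequencies = {
    "MSc International Management": 120,
    "MSc Business Analysis": 24,
    "MA Human Resource Management": 80,
    "MSc Project Management": 45,
    "MSc Digital Business": 86,
}

total = sum(frequencies.values())                       # 355 students
angles = {course: f * 360 / total for course, f in frequencies.items()}

for course, angle in angles.items():
    print(f"{course}: {angle:.1f} degrees")
print(f"Total: {sum(angles.values()):.1f} degrees")     # 360.0
```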

Creating a pie chart using Excel and SPSS

As with bar charts, Excel and IBM SPSS Statistics make it easy to tabulate and graph raw data. To illustrate this, we shall again use the data on 400 employees working at a local engineering firm.

Example 1.12

We will follow the same steps 1 and 2 as in Example 1.10 and use Figures 1.45 and 1.57
that represent the Excel and SPSS data views for the first 10 records out of a total of 400
records (complete data sets are available via the book website). Below we show how to
use Excel and SPSS to create a pie chart to represent employment category.

Excel solution

Step 1

Input data into Excel (first 10 values illustrated)

Page | 60
Figure 1.63 Data set

Step 2 Create the frequency distribution for this data set

Figure 1.64 Frequency distribution

Step 3 Create the pie chart.

In Figure 1.65 (identical to Figure 1.46), highlight H10:I13 (includes labels in


cells H10 and I11).

Figure 1.65 Calculation of frequencies using Excel

Select Insert > Insert Pie or Doughnut Chart

Page | 61
Figure 1.66 Choose type of pie chart

Choose the first option.

This will result in the graph illustrated in Figure 1.67.

Figure 1.67 Excel pie chart

Step 4 Edit the chart.

The chart can then be edited to improve its appearance, for example, to include a chart title, change slice colours, and change data label information using the method described for bar charts. Select the data and remove variable ‘x’. The final pie chart is illustrated in Figure 1.68.

Page | 62
Figure 1.68 Pie chart for employment category

SPSS solution

Enter data into SPSS

Figure 1.69 SPSS data

Create a pie chart for each employment category

Select Graphs – Legacy Dialogs – Pie

Figure 1.70 SPSS Pie chart menu

Page | 63
Click on Summaries for groups of cases (Figure 1.71).

Figure 1.71 SPSS Pie Charts menu

Click on Define

Transfer Employment Category to the Define Slices by box (Figure 1.72).

Figure 1.72 SPSS define pie summaries for groups of cases

Add the chart title

Click on Titles

Page | 64
Figure 1.73 Add chart title

Click Continue
Click OK

The output is shown in Figure 1.74.

Figure 1.74 SPSS pie chart

Double-click on the chart to edit.

Check your understanding

X1.4 Three thousand six hundred people who work in Bradford were asked about the
means of transport which they used for daily commuting. The data collected is
shown in Table 1.19. Construct a pie chart to represent this data.

Type of Transport Frequency of Response


Private Car 1800
Bus 900
Train 300
Other 600
Table 1.19 Type of transport

Page | 65
X1.5 The results of the voting in an election are as shown in Table 1.20. Show this
information on a pie diagram.

Mr P 2045 votes
Mr Q 4238 votes
Mrs R 8605 votes
Ms S 12012 votes
Table 1.20 Election results

Histograms

We have already covered the frequency distribution as a method of showing data in table form. This concept can now be extended to graphs. The method used to graph a grouped frequency distribution (or table) is to construct a histogram. A histogram looks like a bar chart, but the two are different and should not be confused with each other.

Histograms are constructed on the following principles:

a) The horizontal axis (x-axis) is a continuous scale.


b) Each class is represented by a vertical rectangle, the base of which extends from
one true limit to the next.
c) The area of the rectangle is proportional to the frequency of the class.

The last point is very important since it means that the area of the bar represents the frequency of each class, whereas in a bar chart the frequency is represented by the height of each bar. This implies that if we double the class width for one class compared to all the other classes, then we must halve the height of that bar relative to the others. In the special case where class widths are equal, the height of the bar can be taken as representative of the frequency of occurrence for that class. Note that either frequencies or relative frequencies can be used to construct a histogram; the shape of the histogram is the same in either case.
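
For readers working in Python, the same construction can be sketched with matplotlib by passing the true class boundaries as bin edges, so that the bars touch and each bar corresponds to one class. This is illustrative only; the list below holds just a few of the Table 1.9 distances.

```python
# Illustrative sketch: a histogram drawn from true class boundaries.
import matplotlib.pyplot as plt

# First few distances from Table 1.9; replace with the full data set.
distances = [499, 526, 456, 545, 426, 561, 501, 495, 579, 501]
boundaries = [399.5, 449.5, 499.5, 549.5, 599.5, 649.5]

plt.hist(distances, bins=boundaries, edgecolor="black")
plt.xlabel("Distance travelled (miles)")
plt.ylabel("Frequency")
plt.title("Histogram of distance travelled by delivery vans")
plt.show()
```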

Example 1.13

In Example 1.4 we listed the number of telephone calls per day enquiring about the new iPhone SE over a period of 92 days; the frequency distribution is reproduced in Table 1.21, where the number of calls per day is labelled ‘Day’. This variable is a discrete variable, and the histogram is constructed as illustrated in Table 1.22.

Day Frequency, f
1 8
2 9
3 22
4 23
5 30
Table 1.21 Example 1.13 frequency distribution

Page | 66
We can see from Table 1.22 that all the class widths have the same value 1 (constant,
class width = UCB – LCB). In this case the histogram can be constructed with the height
of the bar representing the frequency of occurrence.

Day LCB - UCB Class width Frequency, f


1 0.5 – 1.5 1 8
2 1.5 – 2.5 1 9
3 2.5 – 3.5 1 22
4 3.5 – 4.5 1 23
5 4.5 – 5.5 1 30
f= 92
Table 1.22 Creation of frequency distribution

To construct the histogram, we would plot frequency (y-axis, vertical) against Day (x-
axis), with the upper- and lower-class boundaries determining the class boundary
positions for each bar (see Figure 1.75).

Figure 1.75 Identifying histogram class boundaries

Figure 1.76 shows the completed histogram.

Page | 67
Figure 1.76 Histogram

We can use the histogram to see how the frequency varies from the lowest value of 1 telephone call per day to the highest value of 5. If we look at the histogram, we note that the frequency gradually increases as the number of calls per day increases. These ideas lead to the notions of average (central tendency) and data spread (dispersion), which will be explored in Chapter 2.

Example 1.14

Example 1.5 gave the distance travelled by 100 delivery vans. Figure 1.77 shows the data set and Table 1.23 shows the grouped frequency table.

Figure 1.77 Example 1.14 Excel data

Page | 68
From this data we can construct the grouped frequency distribution as illustrated in
Table 1.23.

Class Frequency
400 - 449 4
450 - 499 44
500 - 549 36
550- 599 15
600 - 649 1
Table 1.23 Grouped frequency distribution

The data variable ‘distance travelled’ is a grouped variable and the histogram is
constructed as illustrated in Table 1.24.

Class Lower Upper Class width Frequency


400 - 449 399.5 449.5 50 4
450 - 499 449.5 499.5 50 44
500 - 549 499.5 549.5 50 36
550- 599 549.5 599.5 50 15
600 - 649 599.5 649.5 50 1
Table 1.24 Calculation procedure to identify class limits for the histogram

We can see from the table that all the class widths have the same value, 50 (constant, class width = UCB – LCB). In this case the histogram can be constructed with the height of the bar representing the frequency of occurrence. To construct the histogram, we plot frequency (y-axis, vertical) against distance travelled (x-axis, horizontal), with the upper and lower class boundaries determining the class boundary positions for each bar (see Figure 1.78).

Figure 1.78 Identifying group frequency histogram class boundaries

Page | 69
Figure 1.79 shows the completed histogram.

Figure 1.79 Grouped frequency distribution histogram

We can use the histogram to see how the frequency changes as the distance travelled moves from the lowest class (400–449) to the highest class (600–649). If we look at the histogram we can note:

• Looking along the x-axis, the mileage is spread across the full range from 400 to 649 miles.
• The frequency rises from the 400–449 class to a maximum at the 450–499 class.
• The frequency then falls from the 450–499 class to a minimum at the 600–649 class.

Creating a histogram using Excel and SPSS

Excel solution

Step 1 Input data into cells A6:G19 as illustrated in Figure 1.80.

Page | 70
Figure 1.80 Example 1.14 Excel dataset and bin range

Step 2 Excel spreadsheet macro command – using Analysis ToolPak.

Select Data > Data Analysis > Histogram

Figure 1.81 Excel Data Analysis menu

Click OK

Input (data) Range: C5:C105

Input Bin Range: E14:E19

Choose location of Output Range: G13

Page | 71
Figure 1.82 Data Analysis Histogram menu

Click OK

Excel will now print out the grouped frequency table (bin range and frequency of
occurrence) as presented in cells G13:H20.

Figure 1.83 Example 1.14 Excel solution for bin range and frequency

We can now use Excel to generate the graphical version of this grouped frequency
distribution – the histogram for equal class widths.

Step 3 Create histogram

Input data series

Class: J15:J19 (label in J14)

Frequency: K15:K19 (label in K14)

Page | 72
Highlight cells J14:K19

Figure 1.84 Excel grouped frequency distribution

Step 4 Create column chart (Insert > Column > choose first option).

This will create the chart illustrated in Figure 1.85 with chart title and axis titles
updated.

Figure 1.85 Excel bar chart

Step 5 Transformation of the column chart to a histogram.

Select bars by clicking on them as illustrated in Figure 1.86.

Page | 73
Figure 1.86 Transforming bar chart to a histogram

Select any one of the bars, right-click and Select Format Data Series

Figure 1.87 Format bar chart data series

Reduce Gap Width to zero as illustrated in Figure 1.88.

Figure 1.88 Format data series

The final histogram is presented in Figure 1.89 after adding title, axis titles, and
removing the bar colour.

Page | 74
Figure 1.89 Excel grouped frequency histogram

These ideas will lead to the idea of average (central tendency) and data spread
(dispersion) which will be explored in Chapter 2.

SPSS solution

Input data into SPSS and calculate the bin values (see Example 1.5)– the first 10 values
are shown in Figure 1.90.

Figure 1.90 Example 1.14 SPSS dataset

Create the histogram.

Select Analyze > Descriptive Statistics > Frequencies

Transfer distance travelled into the Variable(s) box (Figure 1.91).

Page | 75
Figure 1.91 Transfer variable into Variable(s) box

Click on Charts.

Choose Histograms

Figure 1.92 Create histogram

Click Continue

Figure 1.93 SPSS frequencies menu

Page | 76
Click OK

SPSS output

Figure 1.94 SPSS solution

Figure 1.95 SPSS solution continued

Check your understanding

X1.6 Create a suitable histogram to represent the number of customers visited by a


salesman over an 80-week period (Table 1.25).

68 64 75 82 68 60 62 88 76 93 73 79 88 73 60 93
71 59 85 75 61 65 75 87 74 62 95 78 63 72 66 78
82 75 94 77 69 74 68 60 96 78 89 61 75 95 60 79
83 71 79 62 67 97 78 85 76 65 71 75 65 80 73 57
88 78 62 76 53 74 86 67 73 81 72 63 76 75 85 77
Table 1.25 Number of customers visited by salesman

Page | 77
X1.7 Create a suitable histogram to represent the spending (£s) on extra-curricular
activities for a random sample of university students during the ninth week of
the first term (Table 1.26).

16.91 9.65 22.68 12.45 18.24 11.79 6.48 12.93 7.25 13.02
8.10 3.25 9.00 9.90 12.87 17.50 10.05 27.43 16.01 6.63
14.73 8.59 6.50 20.35 8.84 13.45 18.75 24.10 13.57 9.18
9.50 7.14 10.41 12.80 32.09 6.74 11.38 17.95 7.25 4.32
8.31 6.50 13.80 9.87 6.29 14.59 19.25 5.74 4.95 15.90
Table 1.26 Extra-curricular spending

Scatter and time series plots


A scatter plot is a graph which helps us assess visually the form of relationship between
two variables. To illustrate the idea of a scatter plot, consider the following problem.

Example 1.15

A luxury car business is interested in the relationship between its sales and its advertising spend. The company has collected 12 months of sales and advertising data, which are presented in Table 1.27. How has the advertising spend affected the number of sales per month over the last 12 months?

Month Advertising spend (£) Number of sales


January 15712 17
February 53527 75
March 66528 84
April 31118 51
May 95460 118
June 69116 90
July 29335 61
August 96701 100
September 38706 54
October 60389 89
November 35783 58
December 47190 70
Table 1.27 12-month of sales and advertising data

Figure 1.96 shows the scatter plot. As can be seen, there would seem to be some form of
relationship as the number of sales increases as the advertising spend increases. The
data, in fact, would indicate a positive relationship.

Page | 78
Figure 1.96 Scatter plot for the number of sales against advertising spend

Care needs to be taken when using graphs to infer what this relationship may be. For
example, if we modify the y-axis scale then we have a very different picture of this
potential relationship. Figure 1.97 illustrates the effect on the graph of modifying the
vertical scale.

Figure 1.97 What happens if we change the size of the vertical axis?

We can see that the data points are now hovering above the x-axis, with the increase in
the vertical direction not as pronounced as in Figure 1.96. If we further increased the y-
axis scale, then this pattern would be diminished even further. This example illustrates
well how important the scale is when charting data.

Page | 79
A time series plot is concerned with data collected over time. It attempts to isolate and identify the influence of the various factors which contribute to changes in such a series over time. Examples of time series include imports and exports, sales, unemployment, and prices. If we can determine the main components which determine the value of, say, sales for a month, then we can project the series into the future to obtain a forecast.

Example 1.16

Consider the time series data table presented in Example 1.15 but this time we wish to
construct a scatterplot for the number of sales against time (months).

Figure 1.98 Time series plot

Figure 1.98 illustrates the up and down pattern, with the overall number of sales
oscillating between January and December. We shall explore these ideas of trend and
seasonal components in Chapters 9, 10, and 11.
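
Both charts can also be sketched in Python with matplotlib using the Table 1.27 data: a scatter plot of sales against advertising spend (as in Figure 1.96) and a time series plot of sales by month (as in Figure 1.98). This is an optional illustration, not part of the Excel/SPSS steps that follow.

```python
# Illustrative sketch: scatter plot and time series plot for Examples 1.15/1.16.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
advertising = [15712, 53527, 66528, 31118, 95460, 69116,
               29335, 96701, 38706, 60389, 35783, 47190]
sales = [17, 75, 84, 51, 118, 90, 61, 100, 54, 89, 58, 70]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(advertising, sales)                 # scatter plot (cf. Figure 1.96)
ax1.set_xlabel("Advertising spend (£)")
ax1.set_ylabel("Number of sales")

ax2.plot(months, sales, marker="o")             # time series plot (cf. Figure 1.98)
ax2.set_xlabel("Month")
ax2.set_ylabel("Number of sales")

plt.tight_layout()
plt.show()
```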

Creating scatter and time series plots using Excel and SPSS

Excel solution

Step 1 Input data series:

Month: B4: B16 (including data label)

Page | 80
Advertising spend: C4:C16 (including data label)
Number of sales: D4:D16 (including data label)

Figure 1.99 Example 1.15 data

Highlight C4:D16

Figure 1.100 Example 1.15 data set highlighted

Step 2 Select Insert > Scatter and choose option 1

Page | 81
Figure 1.101 Select Scatter plot type (option 1)

Figure 1.102 Excel scatter plot

Step 3 Edit the chart.

The chart can then be edited to improve its appearance, for example, to include
chart title, axis titles, removal of horizontal gridlines, and removal of the legend
as illustrated in Figure 1.103.

Page | 82
Figure 1.103 Excel scatter plot after update

We could now ask Excel to fit a straight line to this chart by right-clicking on a data point and choosing Add Trendline. We will look at fitting trend lines and curves to scatter and time series data charts in Chapters 9 to 11.

SPSS Solution

Two solutions: (a) scatterplot, (b) time series plot.

SPSS solution for a scatterplot

Input data into SPSS

Figure 1.104 Example 1.15 SPSS dataset

Graphs - Legacy Dialogs

Page | 83
Figure 1.105 Creating SPSS scatterplot

Select Scatter/Dot.

Choose Simple Scatter

Figure 1.106 Choose type of SPSS scatter plot

Choose Define.

Transfer Number_of_sales to the Y Axis box

Transfer Advertising_spend to the X Axis box.

Page | 84
Figure 1.107 Create SPSS simple scatter plot

Now click on Titles and add a title to your chart

Figure 1.108 Add title to scatter plot

Click Continue

Page | 85
Figure 1.109 Create SPSS simple scatter plot update

Select OK

SPSS output

Figure 1.110 SPSS solution

SPSS solution for a time series plot

Page | 86
Step 1 Given we have time series data then we need to define date and time.

Figure 1.111 SPSS data

Select Data > Define date and time

Figure 1.112 Define date and time in SPSS

Starting Jan 2019 to December 2019

Figure 1.113 Date and time added for the 12 months

Page | 87
Step 2 Create time series graph

Select Graphs > Legacy Dialogs > Scatter/Dot

Figure 1.114 Creating scatterplot (this will be a time series plot)

Figure 1.115 Scatter/Dot options

Choose Simple Scatter

Transfer Number_of_sales to the Y Axis: box

Transfer Month, period 12 [Month] to the X Axis: box

Page | 88
Figure 1.116 Simple Scatterplot

Click OK

Figure 1.117 Time series plot

Finally, edit the graph to give a final version as illustrated in Figure 1.118.

Page | 89
Figure 1.118 Time series plot for number of sales against time.

Check your understanding

X1.8 Obtain a scatter plot for the data in Table 1.28 and comment on whether there is
a link between road deaths and the number of vehicles on the road. Would you
expect this to be true? Provide reasons for your answer.

Countries Vehicles per 100 Road Deaths per


population 100,000
population
Great Britain 31 14
Belgium 32 30
Denmark 30 23
France 46 32
Germany 30 26
Irish Republic 19 20
Italy 35 21
Netherlands 40 23
Canada 46 30
U.S.A. 57 35
Table 1.28 Number of vehicles and road deaths

X1.9 Obtain a scatter plot for the data in Table 1.29, which represents the passenger miles flown by a UK-based airline (millions of passenger miles) during 2003–2004. Comment on the relationship between miles flown and quarter.

Year Quarter 1 Quarter 2 Quarter 3 Quarter 4


2003 98.9 191.0 287.4 123.2
2004 113.4 228.8 316.2 155.7
Table 1.29 Passenger miles

Page | 90
Chapter summary
The techniques described in this chapter are very useful for describing data using a variety of tables and graphs. They allow you to make sense of data by constructing visual representations of the numbers within the data set. Table 1.30 provides a summary of which table or graph to construct, given the data type.

Which table or chart to be applied

                                               Numerical data                       Categorical data
Tabulating data                                Frequency distribution.              Summary table.
                                               Cumulative frequency distribution.
Graphing data                                  Histogram.                           Bar chart.
                                               Frequency polygon.                   Pie chart.
Presenting a relationship between variables    Scatterplot.                         Contingency table.
                                               Time series graph.
Table 1.30 Which table or chart to use?

In the next chapter we will look at summarising data using measures of average,
dispersion, and shape.

Test your understanding


TU1.1 A small company is promoting healthy lifestyles with its workforce and provides
during the tea break a piece of fruit. Table 1.31 represents the fruit chosen by 36
workers.

Apple Plum Banana Apple Plum


Apple Peach Pear Apple Peach
Orange Apple Pear Orange Plum
Apple Pear Peach Apple Plum
Pear Peach Apple Orange Apple
Orange Apple Banana Orange Apple
Table 1.31 Choice of fruit

a. Construct a tally chart for the type of fruit chosen.


b. Clearly state the frequency of occurrence for each fruit.
c. Construct an appropriate bar chart for the type of fruit chosen.

TU1.2 The company described in TU1.1 is interested in the concept that worker
productivity is dependent upon happy workers. One of the criteria to measure
this is the number of pets that workers own. Table 1.32 represents the number
of pets owned by 36 of the workers.

1 4 3 2 0 2 1 3
1 2 0 1 2 2 0 3
4 2 1 0 2 1 2 1
4 3 2 1 0 3 4 2
Table 1.32 Number of pets

Page | 91
a. Construct a tally chart for the number of pets owned.
b. Clearly state the frequency of occurrence for the number of pets owned.
c. Construct an appropriate bar chart for the number of pets owned.

TU1.3 The monthly sales for a second-hand car dealership are provided in Table 1.33.

Month Frequency Month Frequency


January 16 July 30
February 24 August 29
March 22 September 24
April 26 October 20
May 27 November 15
June 31 December 10
Table 1.33 Monthly sales

a. Draw a line graph for the number of sales.


b. Use the graph to describe how the sales are varying per month.
c. What would you predict as sales for the following month?

TU1.4 Six hundred people are surveyed on the mode of transport they use to get to
work, as shown in Table 1.34. Construct a suitable pie chart to represent these
data.

Type of Frequency
transport
Train 90
Bus 120
Car 300
Cycle 30
Walk 60
Table 1.34 Mode of transport

TU1.5 State the class boundaries for a data set that varies between 11.6 and 97.8 when
we would like to have 9 classes.

TU1.6 The data in Table 1.35 show the annual sales for a business over a period of 11
years.

Year Sales Year Sales


2007 13 2013 20.5
2008 17 2014 20
2009 19 2015 19
2010 20 2016 17
2011 20.5 2017 143
2012 20.5
Table 1.35 Annual sales

Page | 92
a. Construct a time series plot for sales against time.
b. Use the time series plot to comment on how the annual sales have changed
from 2007 to 2017.

TU1.7 The data in Table 1.36 show the cost of electricity (£) during June 2018 for 50
one-bedroom flats.

96 171 202 178 147 102 153 197


157 185 90 116 172 111 148
141 149 206 175 123 128 144
95 163 150 154 130 143 187
108 119 183 151 114 135 191
129 139 109 130 127 166 168
158 149 167 165 82 137 213
Table 1.36 Cost of electricity (£’s)

a. Construct an appropriate grouped frequency distribution.


b. Using this frequency distribution, construct a histogram.
c. Around what amount does the June 2018 electricity cost appear to be
concentrated?

TU1.8 During a manufacturing process a sample of fifty 2 litre bottles of pop is
checked and the amount of pop (litres) is measured, as given in Table 1.37.

2.109 1.963 2.003 2.031 2.029 2.065 1.947 1.969


2.036 1.908 1.981 1.999 2.014 2.025 2.057
2.015 2.086 1.957 1.973 1.996 2.012 2.029
2.005 2.038 1.894 1.951 1.975 1.997 2.012
1.984 2.014 2.066 2.075 1.951 1.971 1.992
2.012 2.044 2.052 1.941 1.938 2.010 1.966
2.023 2.020 1.967 1.986 2.012 1.941 1.994
Table 1.37 Quantity of pop in each bottle (litres)

a. Construct an appropriate grouped frequency distribution.


b. Using this frequency distribution, construct a histogram.
c. Use this histogram to comment on whether the sample content concentrates
about specific values. Does this match the advertised amount of 2 litres per
bottle?

TU1.9 The sales of diesel cars over a period of 12 years in the UK are presented in Table
1.38.

Page | 93
Time point      New diesel car sales      Time point      New diesel car sales
March 2007 166667 March 2013 186667
March 2008 180000 March 2014 213333
March 2009 133333 March 2015 226667
March 2010 160000 March 2016 233333
March 2011 170000 March 2017 240000
March 2012 180000 March 2018 146667
Table 1.38 Sales of diesel cars

a. Construct a time series plot for new diesel car sales against time point.
b. What do you notice about the sales over time?
c. What happened to sales in the year from March 2017 to March 2018?
d. Based upon your business knowledge, why would we have this pattern?

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A1Wa Histograms with unequal class widths.


2. A1Wb Frequency polygon.
3. A1Wc Cumulative frequency curve.
4. A1Wd Superimposing two sets of data onto one graph.

Page | 94
Chapter 2 Descriptive statistics
2.1 Introduction and learning objectives
Introduction

A journey from data to knowledge takes you through different phases, but most of them
can be summarised by the following stages:

1. Acquire raw data – create or import data sets from sources.


2. Organise raw data – tabulate data or put them in a spreadsheet format.
3. Understand the meaning of the data – calculate statistics that describe data.
4. Present and visualise the data – plot or chart various aspects of the data set.
5. Make inferences and draw conclusions – use findings to understand how things
work in general, or how related phenomena might behave.

The first point about acquiring raw data is something that is usually covered in research
methods modules. The second point is the prerequisite that every student has already
learned through previous stages of education, and we have also covered some of the
specifics in Chapter 1. The third point, as well as partially the fourth point, will be a
subject of this chapter. The final point is covered in greater depth in the remainder of
this book.

In this chapter, we shall look at three key statistical measures that enable us to describe
a data set. They are the measures of central tendency, measures of dispersion and
measures of shape. The meaning of these three concepts is as follows:

1. The central tendency (or measure of average) is a single value that is defined as a
typical value for the whole data set. This defined typical value is usually the value
that best represents the whole set, and we can use an average value, for example.
However, there is more than one single measure of central tendency that can be
applied. The three most used are the mean (colloquially called an average), the
median and the mode. These can be calculated both for ungrouped data sets
(individual data values known) and for grouped data sets (data values within
class intervals).

2. The dispersion (or measure of spread) is the amount by which all the data values
are dispersed around the central tendency value. The more closely the data
surround the central tendency value, the smaller the dispersion and the more
representative that value is of the data set. Several
different measures of dispersion exist, and we will include the range,
interquartile range, semi-interquartile range, standard deviation, and variance.
These also can be calculated for both ungrouped and grouped data sets.

3. The shape of the distribution is the pattern that the data set will form. This shape
usually defines the peak and the symmetry of the pattern. Different shapes will
not only look different, but also will determine where different measures of
central tendency are placed and ordered. Shapes can be classified according to

Page | 95
whether the distribution is symmetric (or skewed) and whether there is
evidence that the shape is peaked. Skewness is defined as a measure of the lack
of symmetry in a distribution. Kurtosis is defined as a measure of the degree of
peakedness in the distribution.

Different students have different habits when it comes to preparing for exams. Some are
early morning birds, and some like to do revision in the evening, or even late into the
night. Let us assume that you fall in the latter category. A friend of yours asks you: ‘How
do you stay awake?’ To which the answer is that you drink lots of coffee. A friend wants
to know what you mean by ‘lots of coffee’. How many cups in particular? You answer:
‘Well, it really depends, but on average it must be 4 cups of coffee per night.’ You just
used a summary statistic, called an average. Rather than describing what happens every
night when you do the revisions, you used a single number to summarise and describe
your typical revision night. This is how the statistical measures that we will cover in this
chapter work. They provide an instant summary and description of a much broader
pattern. However, there are various summary statistics, so we need to understand all of
them to make sure we use them appropriately.

The above example is anecdotal. You might be working on a serious project and
handling lots of tables summarising data. Although tables, diagrams and graphs provide
easy-to-assimilate summaries of data, they only go part of the way in fully describing
data. Often a concise numerical description is preferable as it enables us to interpret the
significance of the data. Measures of average (or central tendency) attempt to quantify
what we mean by the 'typical' or 'average' value for a data set. The
concept of central tendency is extremely important, and it is encountered in daily life.
For example:

• What are the average CO2 emissions for a particular car compared to other
similar cars?
• What is the average starting salary for new graduates starting employment with
a large city bank?

Measures of dispersion such as the standard deviation, on the other hand, provide an
additional layer of information. Sometimes stating the average value is just not enough.
Imagine you work for a company with two factories, one in Italy and one in the UK. Your
factory in Italy might be producing a product that contains some impurities and they
amount on average to 5.3 particles per million (ppm).

Your UK factory’s average is 4.7 ppm. At first glance, the UK factory is making a purer
product. However, the standard deviation for the average in Italy is 0.9 ppm and the
standard deviation for the UK average is 1.8 ppm. What does this tell you? Suddenly you
realise that although the UK product has a lower average value, the factory in Italy has
better quality control and much less variation in its manufacturing process. Figures 2.1–
2.3 depict these statistics.

Page | 96
Figure 2.1 UK factory amount of impurities

Figure 2.2 Italian factory amount of impurities

Figure 2.3 Impurity comparison between UK and Italian factories

This simple example illustrates that measures of central tendency (averages) provide
only a partial picture. With the additional measure of dispersion (the standard
deviation) we were able to gain deeper understanding of how the two factories
performed. A final concept in analysing the data set is the shape of the distribution,
which can be measured using the concepts of skewness and kurtosis. You will find
similar applications for the measures of shape, as we go through this chapter.

Page | 97
Learning objectives

On completing this chapter, you will be able to:

1. Understand the concept of an average and be able to recognise different types of


averages for raw data and summary data: mean, mode and median
2. Understand the concept of dispersion and be able to recognise different types of
dispersions for raw data and summary data: range, interquartile range, semi-
interquartile range, standard deviation, and variance
3. Understand the idea of distribution shape and calculate a value for symmetry
and peakedness
4. Apply exploratory data analysis to a data set
5. Use Microsoft Excel and IBM SPSS Statistics to calculate data descriptors.

2.2 Measures of average for a set of numbers


We know that there are four measurement scales (or types of data): nominal, ordinal,
interval and ratio. These are simply ways to categorise different types of variables.
Table 2.1 provides a summary of which statistical measures to use for different types of
data.

Summary statistic to be applied

Data type            Average        Spread or dispersion
Nominal              Mode           N/A
Ordinal              Mode           Range
                     Median         Range, interquartile range
Ratio or interval    Mode           Range
                     Median         Range, interquartile range
                     Mean           Variance, standard deviation, skewness, kurtosis

Table 2.1 Which summary statistic to use?

Whenever you handle data, or even just look at a data set, you intuitively try to
summarise what you have in front of you. The summary statistics effectively give you
elementary descriptors that define the data set. Some of these descriptors, such as the
average value, you are already familiar with, but not necessarily with the others. By
understanding what data descriptors are available and how to use them, you will be
able to position your business arguments better and understand more clearly what has
been presented to you.

Imagine that you are doing price analysis and you are trying to define the competitor’s
average price in the market. If you use the mean as the average value, then your
competitor is operating at an average price of £75 per metre. If you use the median
value as an average, then their average price is £55 per metre. Which one represents
better what you have measured? Why should you use one and not the other to decide
about your pricing policy? The next section will provide answers to all these questions.

Page | 98
Mean, median and mode for a set of numbers

If the mean is calculated from the entire population of the data set, then it is called the
population mean. If we sample from this population and calculate the mean, then the
mean is called the sample mean. The population and sample mean are calculated using
the same formula:

Mean = Sum of data values / Total number of data values

For example, if DIVE Ltd were interested in the mean time for a consultant to travel by
train from London to Newcastle and if we assume that DIVE Ltd has gathered the time
(rounded to the nearest minute) for the last five trips (445, 415, 420, 435 and 405), then
the mean time would be:

Mean = (445 + 415 + 420 + 435 + 405) / 5 = 2120 / 5 = 424

The mean time to travel between London and Newcastle was calculated to be 424
minutes. We can see that the mean uses all the data values in the data set (445 + 415 +
… + 405) and provides an acceptable average if we do not have any values that can be
considered unusually large or small. If we added an extra value of one single trip that
took 900 minutes, then the new mean would be 503.3 minutes. This would not be
representative of the other data values in the data set, which range in value from 405 to
445. Such extreme values, called outliers, tend to skew the data distribution. In this case
we would use a different measure to calculate the value of central tendency.

An alternative method to calculate the average is the median. The median is literally the
‘middle’ number if you list the numbers in order of size. The median is not as
susceptible to extreme values as is the mean. Besides the mean and median, there is a
third method for determining the average, called the mode. The mode is defined as the
number that occurs most frequently in the data set. It can be used for both numerical
and categorical (or nominal) data variables. A major problem with the mode is that it is
possible to have more than one modal value representing the average for numerical
data variables.

Several examples are provided to demonstrate how these measures of central tendency
are calculated.

Example 2.1

Suppose the advertising spend by Rubber Duck Ltd is as illustrated in Table 2.2. We
can describe the overall advertising spend by calculating an 'average' value using the
mean, median, and mode.

Page | 99
ID    Month        2019-2020 advertising spend (£) by Rubber Duck Ltd
1 January 15712
2 February 53527
3 March 66528
4 April 31118
5 May 95460
6 June 15712
7 July 29335
8 August 96701
9 September 38706
10 October 60389
11 November 35783
12 December 47190
Table 2.2 Advertising spend (£)

The Mean

In general, the mean can be calculated using the formula:

Mean (X̄) = Sum of data values / Total number of data values = (∑ Xi) / N     (2.1)

Where 𝑥̅ (‘x-bar’) represents the mean value for the sample data, ∑𝑛𝑖=1 𝑥𝑖 represents the
sum of all the data values, and N represents the number of data values. If the data
represent the population of all data values, then the mean would represent the
population mean. Alternatively, if the data represent a sample from the population then
the mean would be called a sample mean.

For the advertising spend example above, the mean is calculated as:

x̄ = (∑ Xi) / N = (15712 + 53527 + ⋯ + 47190) / 12 = 586161 / 12 = 48846.75

The mean advertising spend is £48,846.75

The Median

The median is defined as the middle number when the data are arranged in order of
size. Consider the data in Table 2.2. (Note that these data are not yet ranked; before
the median can be found manually, they must first be put in order of size, as shown in
Table 2.3.)

With N = 12 values, the median lies midway between the 6th and 7th numbers in the
ordered list, so there are six numbers on either side of it. If the data set were much
larger, we would need to rely on a formula rather than visually

Page | 100
positioning the value. The position of any percentile within the ordered list of numbers
is given by equation (2.2):

Position of the Pth percentile = (P / 100) × (N + 1)     (2.2)

where P represents the percentile value and N represents the number of values in the
data set.

A percentile is a value on a scale of 100 that indicates the percentage of a distribution


that is equal to or below it. In our case, the median is the 50th percentile (p = 50):

Position of median = (50 / 100) × (12 + 1) = 6.5

Position of the median = 6.5th number from the data set which is listed in order of size
as illustrated in Table 2.3.

ID     Advertising spend in size order (ranked)     Ascending rank order
1 15712 1
6 15712 2
7 29335 3
4 31118 4
11 35783 5
9 38706 6
12 47190 7
2 53527 8
10 60389 9
3 66528 10
5 95460 11
8 96701 12
Table 2.3 Numbers listed in order of size

6th number = 38706
7th number = 47190

Now, use linear interpolation to calculate the 6.5th number:

6.5th number = 6th number + 0.5 × (7th number – 6th number)

6.5th number = 38706 + 0.5 × (47190 – 38706) = 42948

The median value is £42,948.

Page | 101
The Mode

The mode is defined as the number which occurs most frequently (the most ‘popular’
number). In our case only 15712 appears twice and it is, therefore, the most ‘popular’
value. Hence, the value of the mode is £15,712. If no number is repeated, then there is
no mode. As noted earlier, if two, three or more numbers are repeated an equal number
of times, then there is no single modal value (the data set is multimodal).

Excel solution

Figures 2.4 and 2.5 illustrate the Excel solution.

Figure 2.4 Calculate measures of average

Figure 2.5 Calculate measures of average (Excel formulae)

Page | 102
The above values imply that, depending on what measure we use, the average
advertising spend can be £48,846.75 (the mean), £42,948 (the median) or £15,712
(the mode). The choice of measure will depend on the type of numbers within the data
set and the context.

SPSS solution

Using the Table 2.2 data, we can extract the same statistics from SPSS in the following
manner.

Enter data into SPSS

Figure 2.6 Example 2.1 SPSS data set

With SPSS we have three methods that can be used to calculate descriptive statistics:

1. Frequencies
2. Descriptives
3. Explore

Method 1: Frequencies

Select Analyze > Descriptive Statistics > Frequencies.

Figure 2.7 SPSS frequencies menu

Transfer Advertising_spend to the Variable(s) box

Page | 103
Figure 2.8 SPSS Frequencies menu

Click on Statistics.

Under Central Tendency choose Mean, Median, and Mode (Figure 2.9).

Figure 2.9 SPSS frequencies statistics options

Click Continue.

Click OK

Page | 104
SPSS output

Figure 2.10 SPSS frequencies solution

The SPSS values for the mean (£48,846.75), median (£42,948) and mode (£15,712)
agree with the Excel solutions in Figure 2.5.

Method 2: Descriptives

Warning: This method provides the mean but not the median or mode.

Select Analyze > Descriptive Statistics > Descriptives

Transfer variable Advertising_spend to the Variable(s) box

Figure 2.11 SPSS descriptives menu

Click on Options and choose: Mean

Page | 105
Figure 2.12 SPSS descriptives options

Click on Continue

Click on OK

SPSS output

Figure 2.13 SPSS descriptives solution

The output gives a mean of £48,846.75, but this method does not give the median or
mode.

Method 3: Explore

Warning: Gives the mean and median but not the mode.

Select Analyze > Descriptive Statistics > Explore.

Transfer variable Advertising_spend to the Variable(s) box

Page | 106
Figure 2.14 SPSS explore menu

Click on Statistics and choose Descriptives

Figure 2.15 SPSS explore statistics options

Click Continue

Click OK

SPSS output

Page | 107
Figure 2.16 SPSS explore solution

The output gives a mean of £48,846.75 and a median of £42,948. This method does not
give the value of the mode but provides a series of other summary statistics that we will
explore within this and other chapters.
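If you also work in Python, the short sketch below (our own illustration, not part of the textbook's Excel or IBM SPSS workflow) reproduces the Example 2.1 statistics using the standard library's statistics module; the variable name advertising_spend is simply our label for the Table 2.2 values.

from statistics import mean, median, multimode

# Example 2.1 advertising spend (£) for Rubber Duck Ltd (Table 2.2)
advertising_spend = [15712, 53527, 66528, 31118, 95460, 15712,
                     29335, 96701, 38706, 60389, 35783, 47190]

print(mean(advertising_spend))       # 48846.75 - the mean
print(median(advertising_spend))     # 42948.0 - midway between the 6th and 7th ranked values
print(multimode(advertising_spend))  # [15712] - the only value that occurs twice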

Check your understanding

X2.1 In 12 consecutive innings a batsman's scores were: 6, 13, 16, 45, 93, 0, 62, 87, 136,
25, 14, 31. Find his mean score and the median.
X2.2 The following are the IQs of 12 people: 115, 89, 94, 107, 98, 87, 99, 120, 100, 94,
100, 99. It is claimed that 'the average person in the group has an IQ of over 100'.
Is this a reasonable assertion?
X2.3 A sample of six components was tested to destruction, to establish how long they
would last. The times to failure (in hours) during testing were 40, 44, 55, 55, 64,
69. Which would be the most appropriate average to describe the life of these
components? What are the consequences of your choice?
X2.4 Find the mean, median and mode of the following set of data: 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5.

2.3 Measures of dispersion for a set of numbers


In Section 2.2 we looked at the concept of central tendency, which provides a measure
of the middle value of a data set: mean, median and mode. As useful as these statistics
are, they only provide a partial description. A fuller description can be obtained by also
obtaining a measure of dispersion (otherwise known as measure of spread, or measure

Page | 108
of variation) of the distribution. Measures of dispersion indicate whether the values in
the group are distributed closely around an average, or whether they are more
dispersed. These measures are also particularly useful when we wish to compare
distributions. To illustrate this, consider the two hypothetical distributions presented in
Figure 2.17, which measure the daily value of sales made by two salespersons in their
respective sales areas over a period of one year. Suppose the means of the two
distributions, A and B, were 3500 and 5500, respectively. But as you can see, their
shapes are very different, with B being far more spread out.

Figure 2.17 Comparison of two distributions

What would you infer from the two distributions given about the two salespersons and
the areas that they work in? We can see that the distributions, A and B, have different
mean values, with distribution B being more spread out (or dispersed) than distribution
A. Furthermore, distribution A is taller than distribution B. In this section, we will
introduce the methods that can be used to put a number to this idea of dispersion. The
methods we will explore include the range, interquartile range, semi-interquartile
range, variance, standard deviation, and coefficient of variation. Dispersion is also called
variability, scatter, or spread. A proper description of a set of data should include both
characteristics: average and dispersion.

Percentiles and quartiles for a set of numbers

As we already learned, the median represents the middle value of the data set, which
corresponds to the 50th percentile (P = 50). This is also known as the second quartile.
A data set always needs to be ranked in order of size – only then can we use the
technique described below to calculate the values that would represent individual
percentile or quartile values.

Example 2.2

Page | 109
Reconsider the monthly advertising spend from Example 2.1 (Table 2.4). We will
demonstrate how to calculate percentile and quartile values for the first and third
quartiles.

ID    Month        2019-2020 advertising spend (£) by Rubber Duck Ltd
1 January 15712
2 February 53527
3 March 66528
4 April 31118
5 May 95460
6 June 15712
7 July 29335
8 August 96701
9 September 38706
10 October 60389
11 November 35783
12 December 47190
Table 2.4 Advertising spend (£)

To calculate the quartiles, as with the median, we need to list the numbers in order of
size, as illustrated in Table 2.5.

ID    Month        2019-2020 advertising spend (£)        ID    Advertising spend in size order (ranked)
1 January 15712 1 15712
2 February 53527 6 15712
3 March 66528 7 29335
4 April 31118 4 31118
5 May 95460 11 35783
6 June 15712 9 38706
7 July 29335 12 47190
8 August 96701 2 53527
9 September 38706 10 60389
10 October 60389 3 66528
11 November 35783 5 95460
12 December 47190 8 96701
Table 2.5 Advertising data and data listed in order of size

Page | 110
First quartile, Q1

The first quartile corresponds to the 25th percentile and the position of this value
within the ordered data set is given by equation (2.2):

Position of 25th percentile = (P / 100) × (N + 1)

P = 25, N = 12

Position of 25th percentile = (25 / 100) × (12 + 1) = 3.25th number

We therefore take the 25th percentile to be the number that is one quarter the
distance between the 3rd and 4th numbers. To solve this problem, we use linear
interpolation:

3rd number = 29335

4th number =31118

3.25th number = 3rd number + 0.25*(4th number – 3rd number)

3.25th number = 29335 + 0.25*(31118 – 29335) = 29780.75

The first quartile advertising spend is £29780.75. This means that 25% of the data have
a value that is equal to or less than £29780.75.

Third quartile, Q3

The third quartile corresponds to the 75th percentile, and the position of this value
within the ordered data set is also given by equation (2.2):

P = 75, N = 12

Position of 75th percentile = (75 / 100) × (12 + 1) = 9.75th number

We therefore take the 75th percentile to be the number that is three quarters the
distance between the 9th and 10th numbers. To solve this problem, we use linear
interpolation:

9th number = 60389

10th number =66528

9.75th number = 9th number + 0.75*(10th number – 9th number)

9.75th number = 60389 + 0.75*(66528 – 60389) = 64993.25

Page | 111
The third quartile advertising spend is £64993.25. This means that 75% of the data
have a value that is equal to or less than £64993.25.

Excel solution

Figures 2.18 to 2.20 illustrate the Excel solution.

Figure 2.18 Example data

Figure 2.19 Excel formula solution

Page | 112
Figure 2.20 Excel function solution

Note that in cells M5:M7 we use dedicated Excel functions.

From Excel, we observe:

1. 25th percentile = first quartile = £29780.75


2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25

These results agree with the manual results.

SPSS solution

Figure 2.21 SPSS data

Using the Explore method

Select Analyze > Descriptive Statistics

Figure 2.22 SPSS Descriptives menu

Page | 113
Select Explore

Transfer Advertising_spend to the Variable(s) box

Figure 2.23 SPSS Explore menu

Click on Statistics and choose Percentiles

Figure 2.24 SPSS explore statistics options

Click Continue

Click OK

SPSS output

Figure 2.25 SPSS explore solution

Page | 114
From SPSS, we observe:

1. 25th percentile = first quartile = £29780.75


2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25

These results agree with the Excel Quartile.EXC function results.

Example 2.3

How would we calculate the value of the 20th percentile?

ID    Month        2019-2020 advertising spend (£)        ID    Advertising spend in size order (ranked)
1 January 15712 1 15712
2 February 53527 6 15712
3 March 66528 7 29335
4 April 31118 4 31118
5 May 95460 11 35783
6 June 15712 9 38706
7 July 29335 12 47190
8 August 96701 2 53527
9 September 38706 10 60389
10 October 60389 3 66528
11 November 35783 5 95460
12 December 47190 8 96701
Table 2.6 Advertising data and data listed in order of size

P = 20, N = 12
Position of 20th percentile = (20 / 100) × (12 + 1) = 2.6th number

We therefore take the 20th percentile to be the number that is 0.6 the distance
between the 2nd and 3rd numbers. To solve this problem, we use linear
interpolation:

2nd number = 15712

3rd number =29335

2.6th number = 2nd number + 0.6 * (3rd number – 2nd number)

Page | 115
2.6th number = 15712 + 0.6 * (29335 – 15712) = 23885.80

The 20th percentile advertising spend is £23885.80

This means that 20% of the data have a value that is equal to or less than £23885.80.

Note: If you wanted to calculate the 56th percentile then use the manual method to show
the 56th percentile value is £48964.36.

Excel solution

Figure 2.26 Example 2.3 Excel solution

From Excel, we observe:

1. 25th percentile = first quartile = £29780.75


2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25
4. 20th percentile = £23885.80
5. 56th percentile = £48964.36

These results agree with the manual results.

SPSS solution

Select Analyze > Descriptive Statistics > Frequencies.

Transfer Advertising_spend to the Variable(s) box

Page | 116
Figure 2.27 SPSS frequencies menu

Click on Statistics.
Click on Percentiles
Type 20 into the percentiles box.
Click on Add
Type 56 into the percentiles box
Click on Add

Figure 2.28 SPSS frequencies statistics options

Click Continue

Click OK

SPSS output

Page | 117
Figure 2.29 SPSS frequencies solution

From SPSS, we observe:

1. 25th percentile = first quartile = £29780.75


2. 50th percentile = second quartile = median = £42948.00
3. 75th percentile = third quartile = £64993.25
4. 20th percentile = £23885.80
5. 56th percentile = £48964.36

These results agree with the manual and Excel results.
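For readers who prefer to check these figures in Python, the sketch below (an assumed illustration, not taken from the textbook) implements the (N + 1) position rule of equation (2.2) with linear interpolation; the helper name percentile_exc is hypothetical and simply echoes the PERCENTILE.EXC-style results quoted above.

def percentile_exc(data, p):
    """Pth percentile using position = (P / 100) * (N + 1) and linear interpolation."""
    ranked = sorted(data)
    n = len(ranked)
    position = (p / 100) * (n + 1)        # e.g. 3.25 for P = 25 and N = 12
    lower = int(position)                 # rank of the lower of the two neighbouring values
    fraction = position - lower           # how far between the two ranked values we are
    if lower < 1:
        return ranked[0]
    if lower >= n:
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])

advertising_spend = [15712, 53527, 66528, 31118, 95460, 15712,
                     29335, 96701, 38706, 60389, 35783, 47190]

print(percentile_exc(advertising_spend, 25))   # 29780.75
print(percentile_exc(advertising_spend, 50))   # 42948.0
print(percentile_exc(advertising_spend, 75))   # 64993.25
print(percentile_exc(advertising_spend, 20))   # 23885.8
print(percentile_exc(advertising_spend, 56))   # approximately 48964.36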

Check your understanding

X2.5 A class of 20 students had the following scores on their most recent test: 75, 77,
78, 78, 80, 81, 81, 82, 83, 84, 84, 84, 85, 87, 87, 88, 88, 88, 89, 90. Calculate: (a)
the mean, (b) the median, (c) the first quartile, (d) the third quartile, (e) the 20th
percentile. (f) What percentile does the value 85 represent?

The range

The range is one of the simpler measures of dispersion. It indicates the 'length' of a
distribution. It is determined by finding the difference between the lowest and highest
values in a distribution. A formula for calculating the range, depending on the type of
data, is defined by equation (2.3) or (2.4):

RANGE (ungrouped data) = Maximum value – Minimum value (2.3)

RANGE (grouped data) = UCB Highest Class – LCB Lowest Class (2.4)

Where UCB represents the upper-class boundary and LCB represents the lower-class
boundary.

Page | 118
Example 2.4

We can use the data from Example 2.1 and take a look at Table 2.6. In this table the data
are ordered in ascending order. The minimum value is 15712 and the maximum value is
96701. According to equation (2.3), the range is calculated as:

RANGE = 96701 – 15712 = 80989

If, for example, you had data for another similar company's advertising expenses and
their range turned out to be 42357, then you would be able to conclude that this other
company has a much narrower range over which its advertising expenses are spread.

The interquartile range and semi-interquartile range

The interquartile range (IQR) represents the difference between the third and first
quartiles and can be used to provide a measure of spread within a data set which
includes extreme data values. The interquartile range is little affected by extreme data
values in the data set and is a good measure of spread for skewed distributions. The
interquartile range is defined by equation (2.5):

Interquartile range, IQR = Q3 – Q1 (2.5)

The semi-interquartile range (SIQR) is another measure of spread and is computed as
half of the interquartile range, as shown in equation (2.6):

SIQR = (Q3 – Q1) / 2     (2.6)

Example 2.5

We will again use the data from Example 2.1 and Example 2.2. The interquartile range
(IQR) and semi-interquartile range (SIQR) are calculated, using equations (2.5) and
(2.6) as:

IQR = 64993.25 – 29780.75 = 35212.5

SIQR = (64993.25 – 29780.75) / 2 = 17606.25

The IQR of 35212.5 can be interpreted as follows: the middle 50% of the data
(advertising expenses), those lying between the first and the third quartiles (or between
the 25th and 75th percentiles), span a range of 35212.5. The SIQR, on the other hand,
splits this middle 50% of the values exactly in half. This means that if our data were
evenly spread, the full range would be about 70425, i.e. 4 × 17606.25 = 70425. As we
can see from Example 2.4, the full range is 80989, which implies that the values are not
evenly distributed and that some minor extremes may be present in our data set.

As both IQR and SIQR focus on the middle half of all the values in the dataset, they are
much less influenced by the extreme values.
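As a quick check, the range, IQR and SIQR from Examples 2.4 and 2.5 can also be reproduced with the short Python sketch below (our own hedged illustration; it repeats a simplified version of the hypothetical percentile_exc() helper from the earlier sketch so that it runs on its own).

advertising_spend = [15712, 53527, 66528, 31118, 95460, 15712,
                     29335, 96701, 38706, 60389, 35783, 47190]

def percentile_exc(data, p):
    # (N + 1) position rule with linear interpolation, as in equation (2.2)
    ranked = sorted(data)
    pos = (p / 100) * (len(ranked) + 1)
    lo = int(pos)
    frac = pos - lo
    return ranked[lo - 1] + frac * (ranked[lo] - ranked[lo - 1])

data_range = max(advertising_spend) - min(advertising_spend)   # 80989, equation (2.3)
q1 = percentile_exc(advertising_spend, 25)                     # 29780.75
q3 = percentile_exc(advertising_spend, 75)                     # 64993.25
iqr = q3 - q1                                                  # 35212.5, equation (2.5)
siqr = iqr / 2                                                 # 17606.25, equation (2.6)
print(data_range, iqr, siqr)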

Page | 119
The standard deviation and variance

The standard deviation is the measure of spread most used in statistics when the mean
is used to calculate central tendency. The goal of the standard deviation is to summarise
the spread of a data set (i.e. in general how far each data point is from the mean).
Imagine you have calculated differences from the mean for each data value (𝑥 − 𝑥̅ ). If
you did this, then some values would come up as positive and some as negative
numbers. If you were then to sum all these differences, then you would find that
∑(𝑥 − 𝑥̅ ) = 0, i.e. the positive and negative values would cancel out. To avoid this
problem, you need to square each individual difference before carrying out the
summation. The benefits of squaring include:

• Squaring always gives a positive value, so the sum will not be zero.
• Squaring emphasises larger values – a feature that can be good or bad (for
example, think of the effect of outliers).

The value calculated in this way, or the statistic, shows the average of squared
differences, which is known as the variance (VAR(X)). Variance is defined by equation
(2.7), where N is the total population size:

VAR(X) = ∑(X − X̄)² / N     (2.7)

By algebraic manipulation we can also rearrange equation (2.7) to give equation (2.8):

VAR(X) = (∑X² / N) − X̄²     (2.8)

Squaring, however, does have a problem as a measure of spread, and that is that the
units are all squared, whereas we might prefer the spread to be in the same units as the
original data. Hence, the square root allows us to return to the original units to give the
standard deviation, SD(X), as illustrated in equation (2.9):

SD(X) = √VAR(X)     (2.9)

If we substitute equation (2.7) into (2.9), we get the standard deviation as in equation
(2.10):

SD(X) = √( ∑(X − X̄)² / N )     (2.10)

Variance describes how much the data values are scattered around their mean value.
You can also say that it shows how tightly the data values are grouped around the mean.
This leads to the conclusion that the smaller the variance, the more representative the
mean value is. We will see later that the variance is also very useful as a comparison
measure between two data sets. Because the variance is based on squared values
(squared differences from the mean), it does not have the same dimension as the data
set, or the mean. In other words, if the data values are

Page | 120
percentages, inches, degrees Celsius, or any other unit, the variance is not expressed in
the same values, because it is expressed in squared units. Standard deviation is used to
bring the variance into the same units of measure as the data set. Standard deviation is
the square root of the variance value, as shown in equation (2.9).

Example 2.6

The mean, variance and standard deviation can be calculated for the Example 2.1 data
set using equations (2.1), (2.8) and (2.10):

Mean

X̄ = ∑Xi / N

Variance

VAR(X) = (∑X² / N) − X̄²

Standard deviation

SD(X) = √( ∑(X − X̄)² / N )

To calculate the mean, variance, and standard deviation we need to calculate:

a) The number of data values, N.
b) The sum of the data values, ∑X.
c) From a) and b), calculate the mean.
d) The sum of the squared data values, ∑X².
e) From a), c) and d), calculate the variance and the standard deviation.

2019-2020 advertising spend (£) by Rubber Duck Ltd        X²
15712 246866944
53527 2865139729
66528 4425974784
31118 968329924
95460 9112611600
15712 246866944
29335 860542225
96701 9351083401
38706 1498154436
60389 3646831321
35783 1280423089

Page | 121
47190 2226896100
Table 2.7

From the table, we can show:

a) Count the number of data values, N = 12
b) Sum the data values, ∑X = 586161
c) Sum the squared data values, ∑X² = 36729720497

Mean

X̄ = ∑Xi / N

X̄ = (15712 + 53527 + ⋯ + 47190) / 12

X̄ = 48846.75

Variance

VAR(X) = (∑X² / N) − X̄²

VAR(X) = (36729720497 / 12) − 48846.75²

VAR(X) = 674805055.85

Standard deviation

SD(X) = √VAR(X)

SD(X) = 25977.01

Population data set

If the data set is the complete population then the same equations as (2.7)–(2.10) are
used, except that we change the notation. For the population variance we use the symbol
σ², for the population standard deviation we use σ, and μ is the symbol for the
population mean. Equations (2.7)–(2.10) are then rewritten as:

σ² = (∑X² / N) − μ²

or

σ² = ∑(X − μ)² / N
Page | 122
Therefore:

σ = √VAR(X) = √σ²

The Excel functions to calculate variance and standard deviation assuming we are using
the population data are: =VAR.P() and =STDEV.P() respectively. It should be noted that
VAR.P and STDEV.P are newer versions of the Excel functions VARP and STDEVP.

Sample from a population

If the data set is a sample from the population then the sample variance (s2) and sample
standard deviation (s) are given by equations (2.11) and (2.12).

Sample variance (s²) = ∑(xi − x̄)² / (n − 1)     (2.11)

Sample standard deviation (s) = √s²     (2.12)

The corresponding Excel functions to calculate sample variance and sample standard
deviation are =VAR.S() and =STDEV.S(), respectively. Again, it should be noted that
VAR.S and STDEV.S are newer versions of the Excel functions VAR and STDEV.

Excel solution

Figures 2.30 to 2.32 illustrate the Excel solutions.

Figure 2.30 Data set and column calculations

Page | 123
Figure 2.31

Figure 2.32

From Excel:

• Mean = 48846.75
• Population variance = 674805055.85
• Population standard deviation = 25977.01
• Range = 80989
• Q1 = 29780.75
Page | 124
• Median = 42948.00
• Q3 = 64993.25

SPSS solution

Method: Frequencies

Select Analyze > Descriptive Statistics > Frequencies.

Figure 2.33 SPSS frequencies menu

Transfer Advertising_spend to the Variable(s) box

Figure 2.34 SPSS Frequencies menu

Page | 125
Click on Statistics

Choose Quartiles, Mean, Median, Std. deviation, Variance, Range

Figure 2.35 SPSS frequencies statistics options

Click Continue.

Click OK

Figure 2.36 SPSS solutions

From SPSS:

• Mean = 48846.75
• Population variance = 674805055.85
• Population standard deviation = 25977.01
• Range = 80989

Page | 126
• Q1 = 29780.75
• Median = 42948.00
• Q3 = 64993.25
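If you want to verify the population and sample versions of these formulas outside Excel and SPSS, the following Python sketch (our own assumed illustration) uses the standard library's statistics module, whose pvariance/pstdev and variance/stdev functions mirror equations (2.7)–(2.12) and Excel's VAR.P/STDEV.P and VAR.S/STDEV.S respectively.

from statistics import pvariance, pstdev, variance, stdev

advertising_spend = [15712, 53527, 66528, 31118, 95460, 15712,
                     29335, 96701, 38706, 60389, 35783, 47190]

print(pvariance(advertising_spend))  # population variance (divisor N), about 674805055.85
print(pstdev(advertising_spend))     # population standard deviation, about 25977.01
print(variance(advertising_spend))   # sample variance (divisor n - 1)
print(stdev(advertising_spend))      # sample standard deviation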

Example 2.7

Consider the following e-commerce module marks achieved by 13 students.

ID e-Commerce module marks


1 21
2 28
3 35
4 44
5 50
6 50
7 54
8 57
9 58
10 64
11 76
12 81
13 82
Table 2.8

If we solve this problem, we will find that the summary statistics are as follows:

• N = 13
• First quartile, Q1 = 39.5
• Median = second quartile, Q2 = 54
• Third quartile, Q3 = 70
• Interquartile range, IQR = Q3 – Q1 = 70 – 39.5 = 30.5
• Semi-interquartile range, SIQR = (Q3 – Q1)/2 = 30.5/2 = 15.25

What is the meaning of these numbers?

Since half the scores in a distribution lie between Q1 and Q3, the semi-interquartile
range is half the distance needed to cover half the scores. In a symmetric distribution, an
interval stretching from one semi-interquartile range below the median to one semi-
interquartile range above the median will contain half of the scores, as illustrated in
Figure 2.37.

Page | 127
Figure 2.37 Location of the IQR and SIQR for the Example 2.7 data set

If this were a symmetric distribution, then with SIQR = 15.25 and median Q2 = 54, 50%
of all the scores would lie between 38.75 (= 54 – 15.25) and 69.25 (= 54 + 15.25). This
interval is close to, but not exactly the same as, the interval from Q1 = 39.5 to Q3 = 70
found above. The reason for the discrepancy is that our sample is not symmetric (and it
is a very small sample).

The interquartile and semi-interquartile ranges are more stable than the range because
they focus on the middle half of the data values. Therefore, they are far less influenced
by extreme values. The SIQR is used in conjunction with the median for a highly skewed
distribution or to describe an ordinal data set. The interquartile range (and semi-
interquartile range) are more influenced by sampling fluctuations in normal
distributions than is the standard deviation. Therefore, they are not often used for data
that are approximately normally distributed. In general, for a normal distribution, the
interquartile range is about 30% larger than the standard deviation.

Although the SIQR is inferior to the standard deviation as a measure of dispersion, we
can see that sometimes it makes sense to use it.

Check your understanding

X2.6 The daily incomes (£) of workers in a factory are: 95, 110, 105, 130, 135, 155,
170. Calculate a measure of central tendency and dispersion. Provide an
explanation for your choice of average.

X2.7 Table 2.9 represents the time in days that a second-hand furniture store takes to
sell tables. Calculate the mean time and an appropriate measure of dispersion.
Provide an explanation for your choice of average.

24 27 36 48 52 52 53 55 59 60 85 90 92
Table 2.9 Time to sell tables (days)

X2.8 A local garden centre allows customers to order goods online via its own e-
commerce website. The company quality assurance process includes the

Page | 128
processing of customer orders and the time to deliver (working days). A sample
of 30 orders is presented in Table 2.10. Calculate an appropriate measure of
average and dispersion. Provide a rationale for your choice of average and
dispersion.

25 25 32 16 25 29 30 28 26 26
20 23 28 25 18 18 22 18 21 25
28 22 32 19 28 28 27 18 33 26
28 19 18 18 29 25 20 20 23 30
Table 2.10 Time to deliver orders (days)

X2.9 Greendelivery.com has recently decided to review the weekly mileage of its
delivery vehicles that are used to deliver shopping purchased online to customer
homes from a central parcel depot. The sample data collected and provided in
Table 2.11 is part of the first stage in analysing the economic benefit of
potentially moving all vehicles to biofuels from diesel.

10 9 9 6 7 5 12 8 6 8
2 9 4 10 5 5 5 7 9 9
6 7 7 8 6 4 8 7 5 6
Table 2.11 Weekly mileage for delivery vehicles

a. Use Excel to construct a frequency distribution and plot the histogram with
class intervals of 10 and classes 75–84, 85–94, …, 175–184. Comment on the
pattern in mileage travelled by the company vehicles.
b. Use the raw data to determine the mean, median, standard deviation and
interquartile range.
c. Comment on which measure you would use to describe the average and
measure of dispersion. Explain using your answers to (a) and (b).

Interpretation of the standard deviation

If we collect data from a population (or sample) then we can calculate the mean and
standard deviation for this data set. For any data set we can use the calculated standard
deviation to tell us about the proportion of data values that lie within a specified
interval about the population mean. Recall that we used the squared differences
between every data point and the mean to calculate both variance and standard
deviation. Therefore, neither the variance nor the standard deviation can ever be
negative.

We stated that the variance is not expressed in the same units as the data points and the
mean. However, when we convert variance into standard deviation, we get the same
units as the original data. If the original data are in square metres, then both the mean
and standard deviation are in square metres. If the original data are in degrees Celsius,
then the mean and standard deviation are in degrees Celsius.

You will notice, as you work on many other examples, that usually the standard
deviation is much smaller than the mean. Why? Because the standard deviation is a
measure of dispersion, in other words, it measures how data are dispersed around their

Page | 129
mean. Another way to express this is to say that the standard deviation measures how
well the mean represents the data. Let us explain this. If we have 100 data points and if
70 or 80 of them, for example, are very close to their mean in value, then we can say that
the mean represents this data set very well. Another way to say that is: if a great amount
of data is within the range that is defined as 𝑥̅ ± 1 standard deviation, then we have a
narrow spread (dispersion) of data and the mean represents the data set very well.

In the next chapter we will introduce the so called normal distribution. This is a classic
bell-shaped distribution. For such distributions, the standard deviation defines exactly
how wide the spread is for all the data points about the average value. Figure 2.38
illustrates this point.

Figure 2.38 Percentage points for the normal distribution

As we will see in Chapter 5, once we know the mean and the standard deviation, and if
we assume that the data follow the normal distribution, we will be able to say that
68.3% of all the values in our data set are within μ ± 1 standard deviation, that 95.5% of
all the values in our data set are within μ ± 2 standard deviations, and that 99.7% of all
the values in our data set are within μ ± 3 standard deviations.

If, for example, we measured the height of students at the local secondary school and
found a mean height of x̄ = 165 cm, with standard deviation s = 15 cm, and under the
assumption that the height of the students follows the normal distribution, we can say
that 68% of all the students in this school are between 165 – 15 = 150 cm and 165 + 15
= 180 cm, and that 95% of all the students are between 165 – 2 × 15 = 135 cm and
165 + 2 × 15 = 195 cm.
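As a small worked sketch (using the assumed height figures from the example above, not real school data), these intervals can be generated directly in Python:

# mean height 165 cm and standard deviation 15 cm, as assumed in the text
mean_height, sd_height = 165, 15

for k in (1, 2, 3):
    lower = mean_height - k * sd_height
    upper = mean_height + k * sd_height
    print(k, lower, upper)
# k = 1 -> 150 to 180 cm (about 68% of students, if heights are normally distributed)
# k = 2 -> 135 to 195 cm (about 95%)
# k = 3 -> 120 to 210 cm (about 99.7%)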

The coefficient of variation

The coefficient of variation is a statistic calculated as the ratio of the standard deviation
to the mean. It is invaluable when comparing the degree of variation from one data set
to another, particularly when the two data sets have different units or when their values
differ greatly in magnitude.

Page | 130
For example, the value of the standard deviation of a set of weights will be different,
depending on whether they are measured in pounds or kilograms. The coefficient of
variation, however, will be the same in both cases as it does not depend on the unit of
measurement. The coefficient of variation, V, is defined by equation (2.13):

V = (Standard deviation / Mean) × 100 = (s / x̄) × 100     (2.13)

For example, if the coefficient of variation is 10% then this means that the standard
deviation is equal to 10% of the average. For some measures, the standard deviation
changes as the average changes. If this is the case, the coefficient of variation is the best way
to summarise the variation.

Example 2.8

Consider the following problem that compares UK factory and US factory average
earnings: (a) mean earnings in the UK are £125 per week with a standard deviation of
£10 and (b) mean earnings in the USA are $145 per week with a standard deviation of
$16.

For the UK, therefore, V = (10 / 125) × 100 = 8%

For the USA, we have V = (16 / 145) × 100 = 11.03%

Although one set of data is given in pounds sterling and the other in US dollars, the
coefficient of variation returns the values as percentages, so that two sets of data can be
compared. In the case of the UK, the standard deviation is 8% of the mean value and in
the case of the USA, the percentage is 11.03%. We can conclude that the spread of
earnings in the USA is greater than the spread in earnings in the UK.
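The same comparison can be sketched in a couple of lines of Python (our own illustration of equation (2.13), using the UK and USA figures from Example 2.8; the function name is hypothetical):

def coefficient_of_variation(std_dev, mean):
    # equation (2.13): V = (standard deviation / mean) * 100
    return std_dev / mean * 100

print(coefficient_of_variation(10, 125))   # 8.0 (UK, %)
print(coefficient_of_variation(16, 145))   # about 11.03 (USA, %)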

Check your understanding

X2.10 A manufacturing company sells a new type of car tyre with a mean life of 27,000
miles and a standard deviation of 6000 miles. Calculate the coefficient of
variation.
X2.11 A salesman earns commission based upon the number of sales above a certain
value. Calculate the coefficient of variation if the mean commission is €200 with
a standard deviation of €40.

2.4 Measures of shape


Most of the time when conducting statistical analysis, we are just trying to get the location
and variability of a data set. However, there are other measures that could be included
in this analysis, and they are the measures of distribution shape: skewness and
kurtosis. Skewness is a measure of symmetry, or more precisely, the lack of symmetry.
A distribution, or data set, is symmetric if it looks the same to the left and right of the
centre point.

Page | 131
Measuring skewness: distribution symmetry
The histogram is an effective graphical technique for showing both the skewness and
kurtosis for a data set. Consider three distributions A, B and C as illustrated in Figures
2.39–2.41.

Figure 2.39 Symmetric distribution

Figure 2.40 Right-skewed (positive skewness)

Figure 2.41 Left-skewed (negative skewness)

Page | 132
Distribution A is said to be symmetrical. The mean, median and mode have the same
value. Distribution B has a high frequency of relatively low values and a low frequency
of relatively high values. Consequently, the mean is 'dragged' toward the right (the high
values) of the distribution. It is known as a right-skewed (or positively skewed)
distribution. Distribution C has a high frequency of relatively high values and a low
frequency of relatively low values. Consequently, the mean is 'dragged' toward the left
(the low values) of the distribution. It is known as a left-skewed (or negatively skewed)
distribution.

The skewness of a frequency distribution can be an important consideration. For


example, if your data set is salary, your employer would prefer a situation that led to a
positively skewed distribution of salary to one that is negatively skewed. To measure
skewness, we can use one of several different methods.

Pearson’s coefficient of skewness

One measure of skewness is Pearson's coefficient of skewness as defined by equation


(2.14):

PCS = 3 × (Mean – Median) / Standard deviation     (2.14)

As we can see, equation (2.14) is relatively simple, but there are several points to
remember related to the measurement of skewness:

1. The direction of skewness is given by the sign.


2. A large negative value means the distribution is negatively skewed or left-
skewed.
3. A large positive value means the distribution is positively skewed or right
skewed.
4. A value of zero means no skewness at all (symmetric distribution).
5. The coefficient compares the sample distribution with a normal distribution.
6. The larger the value, the more the distribution differs from the normal
distribution.

Fisher–Pearson skewness coefficient

This is an alternative way to measure skewness; Excel and SPSS use this measure,
the Fisher–Pearson skewness coefficient, as defined by equation (2.15), where s is the
sample standard deviation.

Sample skewness = [n / ((n − 1)(n − 2))] × [∑(X − X̄)³ / s³]     (2.15)

If the skewness is positive, the data are positively skewed or right skewed, meaning that
the right tail of the distribution is longer than the left. If the skewness is negative, the
data are negatively skewed or left-skewed, meaning that the left tail is longer than the
right. If the skewness is zero, the data are perfectly symmetrical. However, a skewness

Page | 133
of exactly zero is quite unlikely for real-world data, so how can you interpret the
skewness? Here we list some simple rules of thumb:

1. If skewness is less than −1, or greater than +1, the distribution is highly skewed.
2. If skewness is between −1 and −0.5, or between +0.5 and +1, the distribution is
moderately skewed.
3. If skewness is between −0.5 and +0.5, the distribution is approximately
symmetric.

Example 2.9

Consider the e-commerce module marks achieved by 25 students as illustrated in Table


2.12.

ID E-Commerce marks, x
1 73
2 78
3 75
4 75
5 76
6 69
7 69
8 82
9 74
10 70
11 63
12 68
13 64
14 70
15 72
16 64
17 72
18 67
19 74
20 70
21 74
22 77
23 68
24 78
25 72
Table 2.12 e-Commerce marks

To calculate the sample skewness, we will solve equation (2.15).

Page | 134
Sample skewness = [n / ((n − 1)(n − 2))] × [∑(X − X̄)³ / s³]

Step 1 From previous calculations we can show that the sample size (n), sample mean,
and sample standard deviation (s) have the following values

N = 25

Mean X̄ = ∑X / N = (73 + 78 + ⋯ + 72) / 25 = 71.76

Sample standard deviation, s = √( ∑(X − X̄)² / (n − 1) ) = 4.7371

Step 2 Calculate the column statistic (X – mean X)3 and sum all these values

ID E-Commerce marks, x (X - Xbar)^3


1 73 1.9066
2 78 242.9706
3 75 34.0122
4 75 34.0122
5 76 76.2250
6 69 -21.0246
7 69 -21.0246
8 82 1073.7418
9 74 11.2394
10 70 -5.4518
11 63 -672.2214
12 68 -53.1574
13 64 -467.2886
14 70 -5.4518
15 72 0.0138
16 64 -467.2886
17 72 0.0138
18 67 -107.8502
19 74 11.2394
20 70 -5.4518
21 74 11.2394
22 77 143.8778
23 68 -53.1574
24 78 242.9706
25 72 0.0138
Table 2.13 Column calculation for (X – mean X)3

∑(X - Xbar)³ = 4.108

Page | 135
Now, substitute these values into equation (2.15)

Sample skewness = [n / ((n − 1)(n − 2))] × [∑(X − X̄)³ / s³]

Sample skewness = [25 / ((25 − 1)(25 − 2))] × [4.108 / 4.7371³]

Sample skewness = 0.0018

Since the calculated value is 0.0018 (very close to zero), this indicates that this
distribution is almost perfectly symmetrical.

Excel solution

Figure 2.42 Example 2.9 Excel solution

Observe in the Excel solution that we created a column called (X – X̄)³ and calculated
these values by placing the formula =(B4 - $G$5)^3 in cell C4, and then copying the
formula down from C4 to C28. Cells G9 and G11 show the same value, which is 0.0018.
In cell G9, we used the manual formula as in equation (2.15), and in cell G11 we used the
equivalent Excel function =SKEW(). Since the calculated value is 0.0018 (very close to
zero), this indicates that this distribution is almost perfectly symmetrical.

SPSS solution

Using the data in Table 2.12, we can extract the same statistics from SPSS in the
following manner.

Enter data into SPSS

Page | 136
Figure 2.43 Example 2.9 SPSS data set

With SPSS we have three methods to calculate descriptive statistics: Frequencies,


Explore, and Descriptives. To illustrate, let us choose the Frequencies method to
calculate the value of skewness.

Frequencies method

Select Analyze > Descriptive Statistics > Frequencies

Transfer variable eCommerceMarks into the Variable(s) box

Figure 2.44 SPSS descriptives menu

Click on Options and choose Skewness, as illustrated in Figure 2.45.

Page | 137
Figure 2.45 SPSS descriptives options

Click on Continue

Click on OK

SPSS output

The output is shown in Figure 2.46.

Figure 2.46 SPSS Frequencies solution

The value of skewness is given as 0.002, which agrees with the Excel value to 3 decimal
places.

Descriptives method solution

Page | 138
Figure 2.47

Explore method solution

Figure 2.48
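For completeness, the sample skewness of equation (2.15) can also be checked with a short Python sketch (an assumed illustration of our own, not the textbook's workflow); the result should match the Excel =SKEW() and SPSS values above.

from statistics import mean, stdev

# Example 2.9 e-commerce module marks (Table 2.12)
marks = [73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
         70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72]

n = len(marks)
x_bar = mean(marks)
s = stdev(marks)                                 # sample standard deviation (n - 1 divisor)
sum_cubed = sum((x - x_bar) ** 3 for x in marks)

skewness = n / ((n - 1) * (n - 2)) * sum_cubed / s ** 3   # equation (2.15)
print(round(skewness, 4))                        # approximately 0.0018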

Check your understanding

X2.12 A newspaper delivery business delivers newspapers to customers' homes and earns
commission based upon the volume of newspapers delivered per week. Table 2.14 shows
the commission earned (£) for the last 30 weeks. Calculate a measure of
skewness. Can we state that the distribution is symmetric?

80 165 159 143 140


136 138 118 120 124
159 131 93 145 109
163 136 163 142 80
106 111 123 161 179
144 145 91 112 146
170 105 131 141 122
137 152 109 122 126
114 155 92 143 165
Table 2.14 Commission earned (£’s)

Page | 139
Measuring kurtosis: distribution outliers and peakedness
The other common measure of shape is called the kurtosis. Traditionally, kurtosis has
been explained in terms of the central peak. You’ll see statements like this one: higher
values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
Recent developments in the understanding of kurtosis suggest that higher kurtosis
means that more of the variance is the result of infrequent extreme deviations, as
opposed to frequent modestly sized deviations. In other words, it’s the tails that mostly
account for kurtosis, not the central region.

The reference standard is a normal distribution, which has a kurtosis of 3. In


recognition of this, it is the excess kurtosis that is often presented as the kurtosis. For
example, the ‘kurtosis’ reported by Excel and SPSS is actually the excess kurtosis. There
are several points, and expressions, to remember related to measurement of kurtosis:

1. A normal distribution has kurtosis exactly 3 (or excess kurtosis 0). Any
distribution with kurtosis ≈ 3 (excess kurtosis ≈ 0) is called mesokurtic.
2. A distribution with kurtosis less than 3 (excess kurtosis less than 0) is called
platykurtic. Compared to a normal distribution, its tails are shorter and thinner,
and often its central peak is lower and broader.
3. A distribution with kurtosis greater than 3 (excess kurtosis greater than 0) is
called leptokurtic. Compared to a normal distribution, its tails are longer and
fatter, and often its central peak is higher and sharper.

Figure 2.49 compares two normal population distributions with the same mean but
different standard deviations.

Figure 2.49 Comparison of two distributions

To assess the length of the tails and how peaked the distribution is we can calculate a
measure of kurtosis, and Excel and SPSS provide Fisher’s kurtosis coefficient as defined
by equation (2.16):

Page | 140
Sample kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × [∑(X − X̄)⁴ / s⁴] − [3(n − 1)² / ((n − 2)(n − 3))]     (2.16)

Where s represents the sample standard deviation. Example 2.10 illustrates the
calculation details.

Example 2.10

Consider the e-commerce module marks achieved by 25 students as illustrated in


Example 2.9 Table 2.12.

To calculate the sample kurtosis, we will solve equation (2.16).

Sample kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × [∑(X − X̄)⁴ / s⁴] − [3(n − 1)² / ((n − 2)(n − 3))]

Step 1 From previous calculations we can show that the sample size (n), sample mean,
and sample standard deviation (s) have the following values

n = 25

Mean X̄ = ΣX / n = (73 + 78 + … + 72) / 25 = 71.76

Sample standard deviation, s = √[Σ(X − X̄)² / (n − 1)] = 4.7371

Step 2 Calculate the column statistic (X – X̄)⁴ and sum all these values

ID E-Commerce marks, x (X - Xbar)^4


1 73 2.3642
2 78 1516.1367
3 75 110.1996
4 75 110.1996
5 76 323.1941
6 69 58.0278
7 69 58.0278
8 82 10995.1163
9 74 25.1763
10 70 9.5951
11 63 5888.6593
12 68 199.8717
13 64 3626.1593
14 70 9.5951
15 72 0.0033

Page | 141
16 64 3626.1593
17 72 0.0033
18 67 513.3668
19 74 25.1763
20 70 9.5951
21 74 25.1763
22 77 753.9198
23 68 199.8717
24 78 1516.1367
25 72 0.0033
Table 2.15 Column calculation for (X – X̄)⁴

Σ(X − X̄)⁴ = 29601.7352

Now, substitute these values into equation (2.16):

Sample kurtosis = [25 (25 + 1) / ((25 − 1)(25 − 2)(25 − 3))] × [29601.7352 / 4.7371⁴] − [3 (25 − 1)² / ((25 − 2)(25 − 3))]

Sample kurtosis = −0.2686

The kurtosis value is equal to –0.2686. Since the calculated value is negative, this
indicates that the distribution is platykurtic.
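If you would like to verify these results outside Excel and SPSS, the short Python sketch below is one way to do it. This is only an illustrative check, not part of the book's workflow: it assumes the scipy library is installed, and it relies on scipy's bias-corrected estimators, which are intended to match Excel's =SKEW() and =KURT() functions.

from scipy.stats import skew, kurtosis

# e-commerce marks for the 25 students (Table 2.12 / Table 2.15)
marks = [73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
         70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72]

# bias=False gives the adjusted (sample) estimators used in this chapter
sample_skewness = skew(marks, bias=False)                           # expected to be close to 0.002
sample_excess_kurtosis = kurtosis(marks, fisher=True, bias=False)   # equation (2.16), close to -0.2686

print(round(sample_skewness, 3), round(sample_excess_kurtosis, 4))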

Excel solution

Figures 2.50 and 2.51 illustrate the Excel solutions.

Page | 142
Figure 2.50 Example 2.7 Excel data set and calculation of column statistics

Observe that in the Excel solution we created a column called (X – X̄)³, calculated by
placing the formula =(B4 - $I$5)^3 in cell C4 and copying it down through C4:C28, and a
column called (X – X̄)⁴, calculated by placing the formula =(B4 - $I$5)^4 in cell E4 and
copying it down through E4:E28.

Figure 2.51 Excel solution continued

The kurtosis value is given in cells I15 and I17 and is equal to –0.2686. In cell I15, we
used the manual formula as in equation (2.16) and in cell I17 the equivalent Excel

Page | 143
function =KURT(). Note that SPSS and Excel show excess kurtosis rather than the
proper kurtosis value. Excess kurtosis is the proper kurtosis value minus 3 (in our case,
the proper kurtosis value equals 2.7314). Since the calculated value is –0.2686 (a negative
excess kurtosis), this indicates that the distribution is platykurtic.

SPSS solution

With SPSS we have three methods to calculate descriptive statistics: Frequencies,


Explore, and Descriptives. To illustrate, let us choose the Frequencies method to
calculate the value of kurtosis. From the SPSS Statistics menu bar, select Analyze >
Descriptives > Frequencies

Transfer eCommerceMarks to the Variable(s) box

Figure 2.52 SPSS descriptives menu

Click on Statistics and choose Skewness and Kurtosis, as illustrated in Figure


2.53.

Figure 2.53 SPSS descriptives options

Page | 144
Click on Continue

Click on OK

SPSS output

The output is shown in Figure 2.54.

Figure 2.54 SPSS Frequencies solution

The value of excess kurtosis is given as –0.269, which agrees with the Excel value to 3
decimal places.

Explore solution

Figure 2.55 Explore solution

Descriptives solution

Page | 145
Figure 2.56 Descriptives solution

Check your understanding

X2.13 A newspaper delivers newspapers to customer homes and earns a commission


rate based upon the volume of newspapers delivered per week. Table 2.16
represents the commission earned (£) for the last 30 weeks. Calculate a measure
of kurtosis.

19 28 17 16 18 23 19 21 24 17
20 20 21 25 20 21 17 20 20 22
15 16 17 21 21 21 13 16 15 19
Table 2.16 Commission earned (£’s)

Calculating a five-number summary


We will now discuss one very simple, yet very effective and intuitive method that
combines several measures we have covered so far (central tendency, dispersion, and
shape). The five-number summary is a simple method that provides measures of
average, spread and the shape of the distribution. This five-number summary consists
of the following numbers in the data set:

• Smallest value
• First quartile, Q1
• Median or second quartile, Q2
• Third quartile, Q3
• Largest value.

For symmetrical distributions, the following rule would hold:

Q3 – Median = Median – Q1

Largest value – Q3 = Q1 – smallest value

For non-symmetrical distributions, the following rule would hold:

Right-skewed distributions:

Largest value – Q3 greatly exceeds Q1 – Smallest value

Left-skewed distributions:
Q1 – Smallest value greatly exceeds Largest value – Q3

Page | 146
Example 2.11

Consider the student results obtained in a statistics examination as presented in Table


2.17.

73 69 63 64 74
78 69 68 72 77
75 82 64 67 68
75 74 70 74 78
76 70 72 70 72
Table 2.17 Statistics examination marks

Using the methods explored earlier in this chapter we can calculate the required
statistics.

Statistic Value
Minimum 63.00
Q1 68.50
Median 72.00
Q3 75.00
Maximum 82.00
Table 2.18

Excel solution

Figure 2.57 illustrates the Excel solution.

The five-number summary, provided in columns D:F, is as follows:

Page | 147
Figure 2.57 Example 2.9 Excel data set and solution

• Smallest value = 63.00


• First quartile, Q1 = 68.50
• Median or second quartile, Q2 = 72.00
• Third quartile, Q3 = 75.00
• Largest value = 82.00

To identify symmetry

Using the numbers in Figure 2.57, we conclude:

• The distance from Q3 to the median (75 – 72 = 3) is similar to the distance between
the median and Q1 (72 – 68.5 = 3.5).
• The distance between Q3 and the largest value (82 – 75 = 7) is not the same
as the distance between Q1 and the smallest value (68.5 – 63 = 5.5).

These summary values indicate that the distribution is right-skewed, because the
distance between Q3 and the largest value (82 - 75 = 7) is longer than the distance
between Q1 and the smallest value (68.5 – 63 = 5.5).

Page | 148
Please note that the right-skewness here is very slight. If you calculate the
measure of skewness for this data set, the value is 0.002, which suggests the data are
effectively symmetric.

To identify outliers

The following process is followed to identify possible outliers.

The interquartile range (IQR) is defined by equation (2.17) and represents the value of
the middle 50% of the data distribution.

IQR = Q3 – Q1 (2.17)

The IQR equation can then be used to identify any data outliers within the data set as
described by the following set of rules.

Construct inner fences:

Lower inner fence = Q1 – 1.5 IQR (2.18)


Upper inner fence = Q3 + 1.5 IQR (2.19)

Construct outer fences:

Lower outer fence = Q1 – 3 IQR (2.20)


Upper outer fence = Q3 + 3 IQR (2.21)

In Example 2.9, the interquartile range is IQR = 75 – 68.5 = 6.5. The inner and outer fence
values are as follows:

• Lower inner fence = Q1 – 1.5 IQR = 58.75
• Upper inner fence = Q3 + 1.5 IQR = 84.75
• Lower outer fence = Q1 – 3 IQR = 49.00
• Upper outer fence = Q3 + 3 IQR = 94.50

If data values are located between the inner and outer fences, then these data values
would be classified as mild outliers. If data values are located outside the outer fences,
then these would be classified as extreme outliers. We conclude we have no outliers
within the data set.
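As an optional cross-check of the five-number summary and the fence rules in equations (2.17)–(2.21), here is a small Python sketch. It is a sketch only: the quartiles are computed with the (n + 1) position convention so that they match the textbook values Q1 = 68.5 and Q3 = 75 (different software packages use different quartile conventions, so built-in functions may give slightly different answers).

# marks data used in Example 2.9 / 2.11
marks = sorted([73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
                70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72])

def quartile(sorted_x, p):
    # (n + 1) position convention; assumes the position falls inside the data
    h = p * (len(sorted_x) + 1)
    lo = int(h) - 1                 # 0-based index just below the position
    frac = h - int(h)
    return sorted_x[lo] + frac * (sorted_x[lo + 1] - sorted_x[lo])

q1, q2, q3 = quartile(marks, 0.25), quartile(marks, 0.50), quartile(marks, 0.75)
five_number_summary = (marks[0], q1, q2, q3, marks[-1])    # (63, 68.5, 72, 75, 82)

iqr = q3 - q1                                               # equation (2.17)
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # equations (2.18), (2.19)
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr       # equations (2.20), (2.21)

print(five_number_summary, iqr, (lower_inner, upper_inner), (lower_outer, upper_outer))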

SPSS solution

The five-number summary can be calculated using the SPSS Statistics Frequencies
command.

Input data into SPSS

Page | 149
Figure 2.58 Example 2.9 SPSS data set

Select Analyze > Descriptives > Frequencies.

Transfer eCommerceMarks to the Variable(s) box

Figure 2.59 SPSS frequencies menu

Click on Statistics.

Choose Quartiles, Minimum, Maximum

Page | 150
Figure 2.60 SPSS frequencies statistics

Click Continue

Click OK.

SPSS output

The output is shown in Figure 2.61.

Figure 2.61 SPSS frequencies solution

According to SPSS, the five-number summary is:

• Minimum = 63
• First quartile = 68.50
• Second quartile = 72.00
• Third quartile = 75.00
• Maximum = 82

We observe that the five-number summaries are the same in the manual, Excel and SPSS
solutions.
Page | 151
You can also use SPSS Explore menu to generate the same results.

Check your understanding

X2.14 The manager at Big Jim’s restaurant is concerned at the time it takes to process
credit card payments at the counter by counter staff. The manager has collected
the processing time data (time in minutes) shown in Table 2.19 and requested
that summary statistics are calculated.

73 73 73 73 73 73 73 73 73 73
78 78 78 78 78 78 78 78 78 78
75 75 75 75 75 75 75 75 75 75
75 75 75 75 75 75 75 75 75 75
Table 2.19 Time to process credit cards (minutes)

a. Calculate a five-number summary for this data set.


b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

X2.15 The local regional development agency is conducting a major review of the
economic development of a local community. One economic measure to be
collected is local house prices. These reflect the economic well-being of this
community. The development agency has collected the house price data (£)
shown in Table 2.20.

House price data (n = 40)


1.57 1.38 1.97 1.52 1.39
1.09 1.29 1.26 1.07 1.76
1.13 1.59 0.27 0.92 0.71
1.49 1.73 0.79 1.38 2.46
0.98 2.31 1.23 1.56 0.89
0.76 1.23 1.56 1.98 2.01
1.40 1.89 0.89 1.34 3.21
0.76 1.54 1.78 4.89 1.98
Table 2.20 House prices (£s)

a. Calculate a five-number summary.


b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

Creating a box plot


We have already discussed techniques for visually representing data (see histograms
and frequency polygons). In this section we present another important method, called
the box plot (also known as box-and-whisker plot). A box plot is a graphical method of
displaying the symmetry or skewness in a data set. It shows a measure of central
location (the median), two measures of dispersion (the range and interquartile range),

Page | 152
the skewness (from the orientation of the median relative to the quartiles) and potential
outliers.

Example 2.12

Consider the student marks presented in Example 2.9. Figure 2.62 shows the box-and-
whisker plot for the statistics marks example, where the summary statistics are as
follows: minimum = 63, first quartile Q1 = 68.5, median Q2 = 72, third quartile Q3 = 75
and maximum = 82.

Figure 2.62 Example 2.9 box plot

The box-and-whisker plot shows that the lowest 25% of the statistics marks are less
spread out than the highest 25% of the distribution. The plot also shows that the two
middle quarters are approximately equally spread out. This corresponds to the
five-number summary analysis in the previous section.

To identify symmetry

The box plot is interpreted as follows. If the median within the box is not equidistant
from the whiskers (or hinge), then the data are skewed. The box plot indicates right-
skewness because the distance between the median and the highest value is greater
than the distance between the median and the lowest value. Furthermore, the top
whisker (the distance between Q3 and the maximum) is longer than the lower whisker
(the distance between Q1 and the minimum).

To identify outliers

The box plot is interpreted as follows. The minimum and maximum points (or whiskers)
are identified and enable identification of any extreme values (or outliers). A simple rule
to identify an outlier (or suspected outlier) is that the whisker (maximum value –

Page | 153
minimum value) should be no longer than three times the length of the box (Q3 – Q1). In
this case the difference between the maximum and minimum is 82 – 63 = 19 and 3(Q3 – Q1)
= 3 × 6.5 = 19.5. The conclusion is that extreme values are not present in the data set and
that the distribution is slightly right skewed. We have 3 methods we can use to create a box
plot:

1. Create a boxplot using the five-number summary – see result in Figure 2.62.
2. Create a boxplot using the Excel box-and-whisker plot method – see below.
3. Use SPSS to create boxplot – see below.

We will now look at the last two methods listed (2, 3).
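For completeness, a box plot can also be produced in Python; the matplotlib sketch below is offered only as an additional illustration alongside the Excel and SPSS routes that follow, and it assumes matplotlib is installed. Note that matplotlib's default whiskers extend at most 1.5 × IQR beyond the box and plot anything further out as individual outlier points, which is a slightly different convention from the simple minimum-to-maximum whiskers used above.

import matplotlib.pyplot as plt

marks = [73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
         70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72]

fig, ax = plt.subplots()
ax.boxplot(marks)                     # box = Q1 to Q3, line inside the box = median
ax.set_title("Box plot for statistics data")
ax.set_ylabel("Statistics mark")
ax.set_xticks([1])
ax.set_xticklabels(["Value"])
plt.show()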

Excel solution

Figure 2.63 Example 2.10 Excel solution for outliers

Highlight cells B3:B28 (complete data set, including label)

Select Insert Statistic Chart

Figure 2.64 Insert statistics chart

Select Box and Whisker (this is the same as a boxplot)

Page | 154
Figure 2.65 Choose box and whisker plot

Clicking on Insert Box and Whisker gives the solution shown in Figure 2.66.

Figure 2.66 Excel box and whisker plot

Now edit the chart by adding the chart title ‘Box plot for statistics data’ and the vertical
axis title ‘Statistics mark’, replacing the number 1 on the horizontal axis with Value,
removing the chart border, and changing the vertical axis scale from 0–100 to 50–90, as
illustrated in Figure 2.67.

Page | 155
Figure 2.67 Excel box plot

SPSS solution

Input data into SPSS as illustrated in Figure 2.68.

Figure 2.68 Example 2.12 SPSS data

Select Graphs > Legacy Dialogs > Boxplot

Page | 156
Figure 2.69 SPSS boxplot option

Select Simple

In the Data in Chart Are box choose Summaries of separate variables.

Figure 2.70 SPSS boxplot

Select Define

Transfer StatisticsMarks into the Variable box.

Page | 157
Figure 2.71 SPSS define simple boxplot

Click OK.

Finally, edit the chart by double-clicking on it.

Figure 2.72 SPSS box plot

Check your understanding

X2.16 Create a boxplot for the data in X2.14.


X2.17 Create a boxplot for the data in X2.15.

Page | 158
2.5 Using the Excel Data Analysis menu
A selection of summary statistics can be calculated very easily in Excel by using the Data
Analysis add-in. Almost all the measures we described in this chapter, and a few
extra, are included in the automatic output. This tool will generate a report based
upon your univariate data set, including the mean, median, mode, standard deviation,
sample variance, kurtosis, skewness, range, minimum, maximum, sum, count, and the
largest and smallest values. The skewness and kurtosis values can be used to provide
information about the shape of the distribution.
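If you do not have the Data Analysis add-in available, a comparable report can be produced in Python with pandas; the sketch below is illustrative only and assumes pandas is installed. pandas' std(), var(), skew() and kurt() use the sample (bias-corrected) formulas, so they should agree with the Excel report.

import pandas as pd

marks = pd.Series([73, 78, 75, 75, 76, 69, 69, 82, 74, 70, 63, 68, 64,
                   70, 72, 64, 72, 67, 74, 70, 74, 77, 68, 78, 72],
                  name="e-Commerce marks")

report = {
    "Mean": marks.mean(),
    "Median": marks.median(),
    "Mode": marks.mode().tolist(),
    "Standard deviation": marks.std(),   # sample standard deviation
    "Sample variance": marks.var(),
    "Kurtosis": marks.kurt(),            # excess kurtosis, as reported by Excel
    "Skewness": marks.skew(),
    "Range": marks.max() - marks.min(),
    "Minimum": marks.min(),
    "Maximum": marks.max(),
    "Sum": marks.sum(),
    "Count": int(marks.count()),
}
print(report)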

Example 2.13

Consider the e-Commerce marks example.

ID e-Commerce marks, x
1 73
2 78
3 75
4 75
5 76
6 69
7 69
8 82
9 74
10 70
11 63
12 68
13 64
14 70
15 72
16 64
17 72
18 67
19 74
20 70
21 74
22 77
23 68
24 78
25 72
Table 2.21

Page | 159
Excel Solution

The Descriptive Statistics procedure in the Excel Analysis ToolPak add-in can be used to calculate
the required statistics.

Enter the data into Excel

Figure 2.73 Example 2.10 Excel data set

From the Data tab, select Data Analysis.

Page | 160
Figure 2.74 Excel data Analysis Descriptive Statistics menu

Select Descriptive Statistics

Input data range: B3:B28,

Grouped By: Columns

Click on Labels in first row

Type in Output Range: D7.

Tick Summary statistics.

Figure 2.75 Excel Descriptive Statistics menu

Click OK

The Excel results would then be calculated and printed out in the Excel worksheet

Page | 161
Figure 2.76 Excel solution

Check your understanding

X2.18 The manager at Big Jim’s restaurant is concerned at the time it takes to process
credit card payments at the counter by counter staff. The manager has collected
the processing time data (time in minutes) shown in Table 2.22 and requested
that summary statistics are calculated.

73 73 73 73 73 73 73 73 73 73
78 78 78 78 78 78 78 78 78 78
75 75 75 75 75 75 75 75 75 75
75 75 75 75 75 75 75 75 75 75
Table 2.22 Time to process credit cards (minutes)

Use Excel Data Analysis to calculate:


a. A five-number summary for this data set.
b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

X2.19 The local regional development agency is conducting a major review of the
economic development of a local community. One economic measure to be
collected is the local house prices. These reflect the economic well-being of this
community. The development agency has collected the following house price
data (£) as presented in Table 2.23.

House price data (n = 40)


1.57 1.38 1.97 1.52 1.39
1.09 1.29 1.26 1.07 1.76
1.13 1.59 0.27 0.92 0.71

Page | 162
1.49 1.73 0.79 1.38 2.46
0.98 2.31 1.23 1.56 0.89
0.76 1.23 1.56 1.98 2.01
1.40 1.89 0.89 1.34 3.21
0.76 1.54 1.78 4.89 1.98
Table 2.23 House prices (£s)

Use Excel Data Analysis to calculate:


a. A five-number summary.
b. Do we have any evidence for a symmetric distribution?
c. Which measures would you use to provide a measure of average and spread?

Chapter summary
This chapter expanded from using tables and charts to summarising data using
measures of average and dispersion. The average provides a measure of the central
tendency (or middle value). The mean is the most commonly used average to represent the
measure of central tendency. We learned that this measure uses all the data within
the calculation and therefore outliers will affect the value of the mean. Accordingly, the
value of the mean may not be representative of the underlying data set. If outliers are
present in the data set, we also learned that we can either eliminate these outlier values,
or use the median to represent the average. The next calculation to perform is to
provide a measure of the spread of the data within the distribution. The standard
deviation is the most common measure of dispersion (or spread) but, like the mean, the
standard deviation is influenced by the presence of outliers in the data set. If outliers
are present, then again you can either eliminate these outlier values or use the semi-
interquartile range to represent the degree of dispersion.

We learned that the degree of skewness in the data set can be estimated by calculating the
Pearson or Fisher–Pearson skewness coefficient. To estimate the degree of
‘peakedness’, we used the Fisher kurtosis coefficient. The last topic covered was the
box plot, a graph that allows you to visualise the degree of symmetry or
skewness in the data set. The chapter explored the calculation process for raw data and
frequency distributions. However, it is very important to note that the graphical method
will not be as accurate as the raw data method when calculating the summary statistics.
Table 2.24 provides a summary of which statistics measures to use for different types of
data.

Summary statistic to be applied


Data type           Average   Spread or dispersion
Nominal             Mode      Not applicable
Ordinal             Mode      Range
                    Median    Range, interquartile range
Ratio or interval   Mode      Range
                    Median    Range, interquartile range
                    Mean      Variance, standard deviation, skewness, kurtosis
Table 2.24 Which summary statistic to use?

Page | 163
Test your understanding
TU2.1 Calculate the mean and median for the following data set: 28, 23, 27, 19, 22, 19,
23, 26, 34, 30, 29, 25.

TU2.2 Calculate the 10th and 74th percentile values for the following data set: 38, 41,
38, 48, 56, 35, 44, 31, 46, 41, 54, 51, 33.

TU2.3 Table 2.25 represents the English language results for a sample of students
attending a pre-university course. Calculate: (a) mean, (b) sample standard
deviation, (c) median, (d) sample skewness, and (e) sample kurtosis. Use your
results to comment on the shape of the sample distribution.

70 90 82 80 88 80 69 90
86 97 79 79 71 84 71 81
82 83 93 82 75 61 88 75
77 73 74 85
Table 2.25 English language results

TU2.4 Table 2.26 represents the multiple-choice results out of 40 for a mid-term
economics module. Calculate: (a) mean, (b) sample standard deviation, (c)
median, (d) sample skewness, and (e) sample kurtosis. Use your results to
comment on the shape of the sample distribution.

26 16 22 28 28 15 28 26
28 29 25 26 24 31 29 28
29 32 19 26 28 32 22 35
22 25 27 18
Table 2.26 Midterm economics results

TU2.5 For the data in TU2.3 construct a five-number summary and a box plot. Comment
on the distribution shape. Does your answer agree with your answer to TU2.3?

TU2.6 For the data in TU2.4 construct a five-number summary and a box plot. Comment on the
distribution shape. Does your answer agree with your answer to TU2.4?

TU2.7 A local delivery company is assessing the times for delivery of an order for the
last 35 customer orders. The data are presented in Table 2.27. Calculate the
mean, median, sample standard deviation, interquartile range, and measures of
sample skewness and kurtosis. Use these summary statistics to comment on the
distribution shape.

28 28 28 28 28 28 28 28 28
23 23 23 23 23 23 23 23 23
27 27 27 27 27 27 27 27 27
19 19 19 19 19 19 19 19 19
Table 2.27 Sample data for the delivery times (minutes)

Page | 164
TU2.8 Maxim’s, the wine merchant, supplies vintage wine to restaurants. Concerns have
been raised at the sales to one restaurant during the last 12 months. Maxim’s has
collected the last 48 weeks of sales data as presented in Table 2.28. Based upon
the weekly data calculate: (a) mean, (b) standard deviation, (c) median, (d) five-
number summary, and (e) box plot. Based upon your answers, comment on the
central tendency and whether the data are skewed.

43 39 31 34 37 36 34 44 29 31
44 38 34 53 29 40 27 53 43 38
35 28 25 29 46 31 43 38 35 25
41 46 32 38 39 32 33 42 27 45
26 50 38 37 36 46 41 38
Table 2.28 Weekly sales data

TU2.9 Joe runs a business delivering packages on behalf of a national postal business to
customers. He keeps detailed records of the daily number of deliveries and has
provided data for the last 32 days (Table 2.29). Calculate: (a) mean, (b) standard
deviation, (c) median, (d) five-number summary, and (e) box plot. Based upon
your answers, comment on the central tendency and whether the data are
skewed.

43 39 31 34 37 36 34 44
29 31 44 38 34 53 29 40
27 53 43 38 35 28 25 29
46 31 43 38 35 25 43 18
Table 2.29 Daily number of deliveries over 32 days

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A2Wa Inferring the population skewness value from the sample


2. A2Wb Inferring the population kurtosis value from the sample
3. A2Wc Chebyshev’s theorem
4. A2Wd Measures of average and dispersion for a frequency distribution
5. A2We Generating a grouped frequency distribution from raw data using
SPSS

Page | 165
Chapter 3 Probability distributions
3.1 Introduction and learning objectives
The topics covered in previous chapters would conventionally be called descriptive
statistics. This means that the techniques we described are very useful to describe
variables and populations of interest. However, statistics has another category called
inferential statistics.

As the word ‘inference’ implies, we will be drawing conclusions from something. This is
precisely what inferential statistics does. It analyses the results from smaller samples
that describe the same phenomena as the whole population, and then draws
conclusions from these samples and applies them to the whole population. The
fundamental tool that enables us to make these conclusions is probability theory.

This chapter starts with an introduction to common probability terms, such as sample
and expected value. These will help the reader understand the concepts described later
in the chapter. The focus of this chapter is to introduce the reader to probability
distributions that are commonly used in statistical hypothesis testing and to describe
them. These include the normal distribution, Student’s t distribution, the F distribution,
chi-square distribution, binomial distribution, and Poisson distribution.

Learning objectives

On completing this chapter, you will be able to:

1. Understand key probability terms, such as: experiment, outcome, sample space,
relative frequency, sample probability, mutually exclusive events, independent
events, and tree diagrams.
2. Identify a continuous probability distribution and calculate the mean and
variance.
3. Identify a discrete probability distribution and calculate the mean and variance.
4. Solve problems using Microsoft Excel and IBM SPSS software packages.

The concept of probability is an important aspect of the study of statistics. In this


chapter we will introduce you to some of the concepts that are relevant to probability.
However, the main aim of Chapter 3 is to focus on the ideas of continuous and discrete
probability distributions and not on the fundamentals of probability theory.

We will first explore continuous probability distributions (normal, Student’s t, and F)


and then introduce the concept of discrete probability distributions (binomial and
Poisson). Table 3.1 summarises the most frequently used probability distributions
according to whether the data variables are discrete or continuous and whether the
distributions are symmetric or skewed.

Page | 166
Measured          Variable type
characteristic    Discrete distributions              Continuous distributions
Shape             Symmetric    Skewed                 Symmetric       Skewed
Distributions     - Uniform    - Poisson              - Uniform       - F
                  - Binomial   - Hypergeometric       - Student’s t   - Exponential
                                                      - Normal
Table 3.1 Variable type versus measurement characteristic

Suppose that you are conducting a survey among students about attitudes towards
smoking. In your sample you included young people between 18 and 23 years of age.
You get some interesting results. Can you assume that everyone of similar age has
similar views? Perhaps. To get a definitive answer, you need to apply some fundamental
principles of probability. You need to figure out what is the probability that the similar
views are held by the rest of the population.

Let us now assume that you have graduated and that you got a job as a market research
analyst with a telephone service provider. The company would like to conduct a pilot
study in Ireland and, if it works, to implement the findings not only in Ireland, but also
in the UK. Do you have the tools to confirm with confidence that the results are
applicable in the rest of Ireland, as well as in the UK? If you do, what is the level of
confidence, or, to put it another way, the level of risk that you are prepared to tolerate
for the things to go wrong?

The above examples are just two out of many, many possible problem areas that this
chapter will help you understand better. Probability plays a key role in statistics. As you
collect data, or record any kind of data set, you will quickly realise that the data set
behaves in a particular fashion and has some specific characteristics that make it unique. We
refer to this behaviour, and the associated characteristics, as the data distribution.
There are several well-known and well-defined types of distributions. The best-known
one is the so-called normal distribution. Every data point that belongs to this, or any
other distribution, is defined by certain rules and probability laws. By learning how
these probability laws apply, you will learn how to deal with the questions that we put
forward at the beginning of this section.

3.2 What is probability?


Introduction to probability

There are several words and phrases that encapsulate the basic concept of probability:
‘chance’, ‘probable’, ‘odds’ and so on. In all cases we are faced with a degree of
uncertainty and concerned with the likelihood of an event happening. These words and
phrases are too vague, so we need some measure of the likelihood of an event occurring.
This measure is termed probability and is measured on a scale between 0 and 1, with 0
representing no possibility of the event occurring and 1 representing certainty that the
event will occur (Figure 3.1). For all practical situations, the value of the probability will
lie between 0 and 1.

Page | 167
Figure 3.1 Range of probability values

To determine the probability of an event occurring, data must be collected. For example,
this can be achieved through experience, desk research, observation or empirical
methods. The term ‘experiment’ is used when we want to make observations for a
situation of uncertainty. The actual results of the uncertain situation are called the
outcome or sample point.

If the result of an experiment remains uncertain from one repetition to another then the
experiment is called a random experiment. In a random experiment, the outcome
cannot be stated with certainty. An experiment may consist of one or more
observations. If there is only a single observation, the term ‘random trial’ or ‘simple
trial’ is used. An LED bulb may be selected from a factory to examine if it is defective or
not. A single LED bulb being selected is a trial. We can select any number of LED bulbs.
The number of observations will be equal to the number of LED bulbs. A random
experiment has the following properties:

1. It may be repeated any number of times under similar conditions.


2. It has more than one possible outcome.
3. Outcomes vary from trial to trial even when the initial conditions are the same.

Here are some examples of random variables:

1. In an experiment involving measuring the time for a LED bulb to fail, the random
variable X would be the time taken for an LED bulb to fail.
2. In an experiment involving measuring the starting salary of recently graduated
students, the random variable X would be the value of this starting salary for
each student measured within the experiment.

Since random variables cannot be predicted exactly, they must be described in the
language of probability where every outcome of the experiment will have a probability
associated with it. The result of an experiment is called an ‘outcome’. It is the single
possible result of an experiment – for example, tossing a coin produces a ‘head’, or
rolling a die gives a 3.

If we accept the proposition that an experiment can produce a finite number of


outcomes, then we could in theory define all these outcomes. The set of all possible
outcomes is defined as the sample space. For example, the experiment of rolling a die
could produce the outcomes 1, 2, 3, 4, 5, 6 which would thus define the sample space.

Page | 168
Another basic notion is the concept of an event. Think of it as simply an occurrence of
one of the possible outcomes – this implies that an event is a subset of the sample space.
For example, in the experiment of rolling a die, the event of obtaining an even number
would be defined as the subset {2, 4, 6}. Finally, two events are said to be mutually
exclusive if they cannot occur together. By rolling a die, for example, the event stated as
‘obtaining a 2’, is mutually exclusive of the event ‘obtaining a 3’. The event ‘obtaining a
2’ and the event ‘obtaining an even number’ are not mutually exclusive since both can
occur together, since {2} is a subset of {2, 4, 6}.

This section provides a very basic overview and a refresher of some of the most
elementary concepts needed to be understood to follow the rest of the chapter. The
online chapters provide a more comprehensive introduction and a refresher into
elementary probability theory.

Relative frequency

Suppose we perform the experiment of throwing a die and note the score obtained. We
repeat the experiment many times. We will assign the symbol n to the number of times
we repeated the experiment. We also observe how many times event A occurs out of the total
number of experiments. We will use the symbol m for the number of occurrences of event A. The
ratio between m and n is called the relative frequency. In general, if event A occurs m
times, then your estimate of the probability that A will occur is given by equation (3.1).
P(A) = m / n    (3.1)

Example 3.1

Consider the result of running the die experiment where the die has been thrown 10
times and the number of times each possible outcome (1, 2, 3, 4, 5, 6) recorded. Now
consider the result of running the die experiment where the die has been thrown 1000
times and the number of times each possible outcome (1, 2, 3, 4, 5, 6) recorded. The
result of this die experiment is also illustrated in Table 3.2.

Score 1 2 3 4 5 6

Frequency for 10 runs 3 1 2 1 0 3

Relative frequency for 10 runs 0.3 0.1 0.2 0.1 0 0.3

Frequency for 1000 runs 173 168 167 161 172 159

Relative frequency for 1000 runs 0.173 0.168 0.167 0.161 0.172 0.159
Table 3.2 Calculation of relative frequencies

We can see big differences between the relative frequencies for 10 throws when
compared with 1000 throws of a die. As the number of experiments increases, the
relative frequency stabilises and approaches the true probability of the event. Thus, if
we had performed the above experiment 2000 times we might expect 'in the long run'

Page | 169
the relative frequencies of all the scores to approach 0.167. This implies that P(1) = P(2) = P(3)
= P(4) = P(5) = P(6) = 0.167. Actually, for this experiment the theoretical values for each
event would be P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.
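The long-run behaviour described above is easy to simulate. The following Python sketch (illustrative only; it uses only the standard library) rolls a fair die n times and prints the relative frequency m/n for each score, so you can watch the values settle towards the theoretical 1/6 ≈ 0.167 as n grows.

import random
from collections import Counter

random.seed(1)                      # fixed seed so the run is repeatable

for n in (10, 1000, 100000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    counts = Counter(rolls)
    relative_frequencies = {score: counts[score] / n for score in range(1, 7)}
    print(n, relative_frequencies)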

It is a very common practice to calculate probabilities through this relative frequency


approach. This approach is also called the ‘empirical approach’ or the ‘experimental
probability’ approach. A good example for this approach is a scenario where a
manufacturer indicates that he is 99% certain (P = 99% or 0.99) that an electric light
bulb will last 200 hours. This figure will have been arrived at from experiments which
have tested numerous samples of light bulbs. You can also read this statement
differently: there is a 1% chance (risk) that the bulb will not last 200 hours.

Several important issues are assumed, and should be remembered, when approaching
probability problems: The probability of each event within the probability experiment
lies between 0 and 1. The sum of probabilities of all events in this experiment equals 1.
If we know the probability of an event occurring in the experiment, then the probability
of it not occurring is P(event not occurring) = 1 – P(event occurring).

Sample space

We already know that the sample space contains all possible outcomes of an experiment
and that one or more outcomes constitute an event. Here, rather than resort to the
notion of relative frequency, we will look at probability as defined by equation (3.2):

P(Event) = Number of outcomes in the event / Total number of outcomes    (3.2)

An example below is used to illustrate this notion via the construction of the sample
space.

Example 3.2

If an experiment consists of rolling a die then the possible outcomes are 1, 2, 3, 4, 5, 6.


The probability of obtaining a 3 from one roll of the die can then be calculated using
equation (3.2):

P(Obtaining a 3) = Number of outcomes producing a 3 / Total number of outcomes = 1/6

The probability of obtaining a 3 is 0.166…, or 16.7%, or 16⅔%.

Discrete and continuous random variables

A random variable is a variable that provides a measure of the possible values


obtainable from an experiment. For example, we may wish to count the number of times
that the number 3 appears on the tossing of a fair die, or we may wish to measure the
weight of people participating in a new diet programme. They are both random
variables.

Page | 170
The probabilities of a particular outcome for a random variable are distributed in a
certain way. These probability distributions will be different, depending on our random
variable being either discrete or continuous.

Here is an example of a discrete random variable. Let the random variable consist of the
numbers 1, 2, 3, 4, 5, 6 (a six-sided die, for example). If the die were fair, then on each
toss of the die each possible number (or outcome) will have an equal chance of
occurring. The numbers 1, 2, 3, 4, 5, 6 represent the values of the random variable for
this experiment. As the values are the whole-number answers (not a continuum such as
1.1, 1.2, etc.), this is an example of a discrete random variable. Several discrete
probability distributions will be discussed in this and online chapters, including
binomial and Poisson.

If the numbers can take any value with respect to measured accuracy (160.4 lbs, 160.41
lbs, 160.414 lbs, etc.), then this is an example of a continuous random variable. In this
chapter, we will explore the concept of a continuous probability distribution with the
focus on introducing the reader to the normal probability distribution. However, several
other continuous probability distributions will be discussed in this and online chapters,
including Student’s t distribution, the chi-square distribution, and the F distribution.

3.3 Continuous probability distributions


Introduction

In probability theory, an expected value is the theoretical mean value of a numerical


experiment over many repetitions of the experiment. The phrases “the mean” and “the
expected value” can be used interchangeably.

For any continuous probability distribution, the expected value, E(X), and variance,
VAR(X), can be found by solving the integral equations (3.3) and (3.4), with the function
f (x) known:

E(X) = ∫[a, b] x f(x) dx    (3.3)

Equation (3.3) represents the expected value E(X) of X, which is the total area under the
function, x f (x), between the lower limit (a) and upper limit (b). Essentially this implies
that when we are referring to the whole distribution, then the expected value is equal to
the mean value. We already said that we will use the phrases “expected value” and “the
mean” interchangeably.

VAR(X) = E[(X − µ)²] = ∫[a, b] (x − µ)² f(x) dx    (3.4)

Equation (3.4) represents the variance VAR(X) of X, which is the total area under the
function (x − µ)² f(x) between the lower limit (a) and upper limit (b). We will provide

Page | 171
more detailed explanations of this concept in the context of a specific continuous
probability distribution.
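Equations (3.3) and (3.4) can be checked numerically. The sketch below (an illustration only, assuming scipy is installed) integrates a simple continuous density, the uniform density f(x) = 1/(b − a) on [a, b], and recovers the well-known results E(X) = (a + b)/2 and VAR(X) = (b − a)²/12.

from scipy.integrate import quad

a, b = 0.0, 10.0
f = lambda x: 1.0 / (b - a)        # uniform probability density on [a, b]

E_X, _ = quad(lambda x: x * f(x), a, b)                  # equation (3.3)
VAR_X, _ = quad(lambda x: (x - E_X) ** 2 * f(x), a, b)   # equation (3.4)

print(E_X, VAR_X)                  # approximately 5.0 and 8.333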

The normal distribution

In probability theory, the normal distribution is a very common continuous probability


distribution. The normal distribution is important because of the central limit
theorem. The central limit theorem states that, under certain conditions, averages of
samples of observations of random variables independently drawn from independent
distributions converge in distribution to the normal. This statement is a powerful tool
that will enable us to make several inferences and is the foundation of inferential
statistics.

The statement about the central limit theorem means that if we took many, many
samples from any kind of distribution (even non-normal distribution), when we
calculate the averages for each and every one of these samples, these averages will
follow the normal distribution, regardless of what the distribution of the original
population happens to be. Many of the real-life variables that have a normal distribution
can, for example, be found in manufacturing (weights of tin cans) or can be associated
with the human population (people’s heights).

The probability density of the normal distribution is defined by equation (3.5):

f(x) = (1 / (σ√(2π))) exp[ −(1/2) ((x − µ) / σ)² ]    (3.5)

Where:

• µ is the population mean or expectation of the distribution.


• σ is the population standard deviation.

The probability density function gives the relative likelihood that the random variable
takes a value close to any particular point x. To put it another way, if you integrate the
function over an interval, then the area under the function in that interval is equal to the
probability that the random variable will fall in that interval.

Think of the probability density function as a function that defines the probability of
occurrence of every value from a random variable. How these probabilities are shaped
and distributed is determined by the probability density function. The following
conventions are often used in relation to normal distribution:

1. The population mean and population standard deviation are represented by the
notation µ and σ respectively.
2. If a variable X follows a normal distribution, we write X ~ N(µ, σ²), which is read
as ‘X varies in accordance with a normal distribution, whose mean is µ and
whose variance is σ²’.
3. The total area under the curve represents the total probability of all events
occurring, which equals 1.

Page | 172
4. The mean of the random variable is µ, which is the same as saying that the
expected value E(X) = µ.
5. The variance of the random variable is σ², which is the same as saying that the
variance value VAR(X) = E[(X – µ)²] = σ².

Equation (3.5) can be represented graphically by Figure 3.2, which illustrates the
symmetrical characteristics of the normal distribution. For the normal distribution the
mean, median, and mode are all aligned and have the same numerical value. The normal
distribution is sometimes called the ‘bell curve’.

Figure 3.2 Percentage points of the normal distribution

It is a property of the normal curve that 68.3% of all the values reside between µ ± 1σ,
95.5% of all the values reside between µ ± 2σ and 99.7% of all the values reside
between µ ± 3σ.

To calculate the probability of a value of X occurring we would use Excel (or SPSS, or
statistical tables) to find the corresponding value of the probability.

Example 3.3

A manufacturing firm’s quality department checks the components manufactured, and


historically the length of a tube is found to be normally distributed with a population
mean of 123 cm and a population standard deviation of 13 cm. Calculate the probability
that a random sample of one tube will have a length of at least 136 cm.

From the information provided we define X as the tube length in centimetres, with
population mean µ = 123 and standard deviation σ = 13. This can be represented using
the notation X ~ N(123, 132). The problem we must solve is to calculate the probability
that one tube will have a length of at least 136 cm. This can be written as P(X ≥ 136) and
is represented by the shaded area illustrated in Figure 3.3.

Page | 173
Figure 3.3 Region represents P(X ≥ 136)

Excel solution

The Excel solution is illustrated in Figure 3.5. This problem can be solved by using the
Excel function =NORM.DIST(x, µ, σ, TRUE). This function calculates the area to the left
of X = 136, i.e. P(X ≤ 136). Therefore, P(X ≥ 136) = 1 – NORM.DIST(136, 123, 13, TRUE).

Figure 3.4 Relationship between P(X ≥ 136) and NORM.DIST Excel function

Figure 3.5 Example 3.3 Excel solution

Page | 174
From Excel:

P(X ≥ 136) = 0.1587

We observe that the probability that an individual tube length is at least 136 cm is
0.1587, or 15.87%.
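The same probability can be checked in Python; the sketch below is illustrative only and assumes scipy is installed. scipy's norm.cdf plays the role of Excel's =NORM.DIST(…, TRUE).

from scipy.stats import norm

# X ~ N(123, 13^2); P(X >= 136) = 1 - P(X <= 136)
p = 1 - norm.cdf(136, loc=123, scale=13)
print(round(p, 4))                 # approximately 0.1587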

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.6. Please note you can enter any
number– see Figure 3.6.

Figure 3.6 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities.

Select Transform > Compute Variable

Target Variable: Example 3


Numeric expression = 1 – CDF.NORMAL (136, 123, 13).

Figure 3.7 Use of compute variable to calculate P(X ≥ 136)

Click OK

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example 3.

Figure 3.8 SPSS solution, P(X ≥ 136) = 0.158655

Page | 175
The probability that an individual tube length is at least 136 cm is 0.1587 (or 15.87%).
This agrees with the Excel solution illustrated in Figure 3.5.

Example 3.4

Using the same assumptions as in Example 3.3, calculate the probability that X lies
between 110 and 136 cm. In this example, we are required to calculate P(110 ≤ X ≤ 136)
which represents the area shaded in Figure 3.9. The value of P(110 ≤ X ≤ 136) can be
calculated using Excel’s =NORM.DIST() function.

Figure 3.9 Shaded region represents P(110 ≤ X ≤ 136)

Excel solution

The Excel solution is illustrated in Figure 3.10.

Figure 3.10 Excel solution for P(110 ≤ X ≤ 136)

The =NORM.DIST() function can be used to calculate P(110 ≤ X ≤ 136) = 0.682689492.


Thus, the probability that an individual tube length lies between 110 and 136 cm is
0.6827 or 68.27%.
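The interval probability of Example 3.4 follows the same pattern in Python (again, an illustrative sketch assuming scipy is installed): it is simply the difference between two cumulative probabilities.

from scipy.stats import norm

p = norm.cdf(136, loc=123, scale=13) - norm.cdf(110, loc=123, scale=13)
print(round(p, 4))                 # approximately 0.6827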

Page | 176
SPSS solution

Enter data into SPSS

As before, note that in these examples we have no data to input but we must enter a
data value to be able to use the methods described below. In this example, we have
entered the number 1 into column 1 (VAR00001).

Figure 3.11 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities.


Repeat the calculation above but this time use:

Select Transform > Compute Variable


Target Variable: Example 4
Numeric expression = CDF.NORMAL (136, 123, 13) – CDF.NORMAL (110, 123,
13).

Figure 3.12 Use computer variable to calculate P(110 ≤ X ≤ 136)

Click OK.

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example 4.

Figure 3.13 SPSS solution, P(110 ≤ X ≤ 136) = 0.682689

The probability that an individual tube length lies between 110 and 136 cm is 0.6827
(or 68.27%). This agrees with the Excel solution shown in Figure 3.10.

Page | 177
Check your understanding

X3.1 Calculate the following probabilities, where X ~ N(100, 25): (a) P(X ≥ 95), (b)
P(95 ≤ X ≤ 105), (c) P(105 ≤ X ≤ 115), (d) P(93 ≤ X ≤ 99). For each
probability identify the region to be found by shading the area on the normal
probability distribution graph.

The standard normal distribution (Z distribution)

Assume you are researching two different populations, both following normal
distributions. However, it could be difficult to compare if the units are different, or the
means and variances might be different. If this was the case, we would like to be able to
standardise these distributions so that we can compare them. This is possible by
creating the standard normal distribution. The corresponding probability density
function f (z) is given by equation (3.6):

f(z) = (1 / √(2π)) exp(−z² / 2)    (3.6)

The standard normal distribution is a normal distribution whose mean is always 0


(µ = 0) and whose standard deviation is always 1 (σ = 1).

This means that every value of X in a normal distribution, can be transformed to a value
of Z in the standard normal distribution. To achieve this, we are using equation (3.7):

Z = (X − µ) / σ    (3.7)

Where X, µ, and σ are the variable score value, population mean, and population
standard deviation respectively, taken from the original normal distribution. Equation
(3.7) can also be solved for X, which means that if we know the values of Z, µ and σ, we
can calculate the value of X.

This is done by rearranging equation (3.7) into Zσ = X – µ, which ultimately yields X =
Zσ + µ, or X = µ + Zσ.

The advantage of this method is that the Z values are not dependent on the original data
units, and this allows tables of Z values to be produced with corresponding areas under
the curve. This also allows for probabilities to be calculated if the Z value is known, and
vice versa, which allows a range of problems to be solved.

Figure 3.14 illustrates the standard normal distribution (or Z distribution) with Z scores
between –3 and +3 and how they correspond to the actual X values.

Page | 178
Figure 3.14 Normal and standard normal curve

The Z-value is effectively a standard deviation from a standard normal distribution.


Because the standard deviation of Z is always identical to 1, this means that the standard
normal distribution will always have 68.3% of all the values between ±1, 95.5% of all
the values between ±2, and 99.7% of all the values between ±3.

The Excel function =NORM.S.DIST() (not to be confused with the =NORM.DIST()


function) calculates the probability P(Z ≤ z) as illustrated in Figure 3.15.

Figure 3.15 Shaded region represents P(Z ≤ z)

If Z corresponds to the standard deviation of the standard normal distribution, and in


the box above we said ±2 covers 95.5% of the distribution, how does this translate into
the statements that we made about the ordinary (non-standard) normal distribution?

Just before Example 3.3 we stated that any normal distribution covers 68.3% of the
values for µ ± 1σ, 95.5% of the values for µ ± 2σ, and 99.7% of the values for µ ± 3σ. In
the case of the standard normal distribution µ = 0, which means that we need 1.96Z (not
2σ or 2Z) to cover exactly 95% of the values and 2.58Z (not 3σ or 3Z) to cover exactly
99% of the values.

Page | 179
We can show, for example, that the proportion of values between ±1, ±2, and ±3
population standard deviations from the population mean of zero is 68.3%, 95.5%, and
99.7% respectively as illustrated in Figure 3.16.

Figure 3.16 Population proportions within ±1, ±2, ±3 population standard deviations

We’ll illustrate the method of calculating the area between the mean ± 1 standard
deviation (µ ± 1σ), or ±1z. Note that these values can be found using critical tables or
using software such as Excel and SPSS.

Say we want to calculate the probability that Z lies between −1 and +1. This is
represented by the statement P(−1 ≤ Z ≤ +1). Remember that the total area underneath
the curve (and above the horizontal axis) represents the total probability, which equals
1. We can write this as:

P(−∞ ≤ Z ≤ +∞) = 1

Therefore,

1 = P(Z ≤ −1) + P(−1 ≤ Z ≤ +1) + P(Z ≥ +1)

Rearranging this equation gives

P(−1 ≤ Z ≤ +1) = 1 – P(Z ≤ −1) – P(Z ≥ +1)

Because the normal distribution is symmetric, P(Z ≤ −1) = P(Z ≥ +1).

Therefore,

P(−1 ≤ Z ≤ +1) = 1 – P(Z ≥ +1) – P(Z ≥ +1)

P(−1 ≤ Z ≤ +1) = 1 – 2 × P(Z ≥ +1)

Page | 180
From table 3.3, we can look up the probabilities associated with certain Z values. The Z
values listed in this table provide the right-hand tail probabilities for positive values of Z
i.e. P(Z ≥ +z).

Z 0.00 0.01 0.02 0.03 0.04


0.0 0.500 0.496 0.492 0.488 0.484
0.1 0.460 0.456 0.452 0.448 0.444
0.2 0.421 0.417 0.413 0.409 0.405
0.3 0.382 0.378 0.374 0.371 0.367
0.4 0.345 0.341 0.337 0.334 0.330
0.5 0.309 0.305 0.302 0.298 0.295
0.6 0.274 0.271 0.268 0.264 0.261
0.7 0.242 0.239 0.236 0.233 0.230
0.8 0.212 0.209 0.206 0.203 0.200
0.9 0.184 0.181 0.179 0.176 0.174
1.0 0.159 0.156 0.154 0.152 0.149
1.1 0.136 0.133 0.131 0.129 0.127
1.2 0.115 0.113 0.111 0.109 0.107
1.3 0.097 0.095 0.093 0.092 0.090
Table 3.3 Use of critical values to find P(Z ≥ 1)

From Table 3.3, P(Z ≥ +1) = P(Z ≥ +1.00) = 0.159 to 3 decimal places.

P(- 1 ≤ Z ≤ +1) = 1 – 2 * 0.159

P(- 1 ≤ Z ≤ +1) = 0.682

For the standard normal distribution where z = 1 is the same as one standard deviation,
the proportion of all values between µ ± 1σ is 0.682, or 68.2%.

Other values can be looked up in a similar manner from table 3.3. For example, to look
up P(Z ≥ 1.84), find the row for 1.8 and the 0.04 column. At the intersection of the row
and column (1.8 + 0.04 = 1.84) it can be seen that P(Z ≥ 1.84) = 0.033.
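The tail probabilities in Table 3.3 can also be reproduced in code. The sketch below (illustrative only, assuming scipy is installed) uses the survival function norm.sf(z), which returns P(Z ≥ z) for the standard normal distribution.

from scipy.stats import norm

for z in (1.00, 1.84):
    print(z, round(norm.sf(z), 3))   # expected to print roughly 0.159 and 0.033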

Example 3.5

Reconsider Example 3.3 but find P(X ≥ 136) by calculating the value of Z. If a variable X
varies as a normal distribution with a mean of 123 and a standard deviation of 13, then
the value of Z when X = 136 would be given by equation (3.7):

Z = (136 − 123) / 13 = +1

From table 3.3:

P(X ≥ 136) = P(Z ≥ 1) = 0.159.

Therefore, the probability that X ≥ 136 is 15.9%. This agrees with the answer given in
Example 3.3.

Page | 181
Excel solution

The value of P(Z ≥ 1) can be calculated using Excel’s =NORM.S.DIST() function. The
Excel solution is illustrated in Figure 3.18.

Figure 3.18 Example 3.5 Excel solution P(Z ≥ 1)

We used two functions to calculate that P(X ≥ 136) or P(Z ≥ 1). The first function in cell
C10 is the Excel =NORM.DIST() function and the other one in cell C15 is the Excel
=NORM.S.DIST() function. The results are the same, although the input parameters are
different.

The first function (cell C10) requires as an input the X values with the corresponding
mean and the standard deviation values. The second function (cell C15) requires only
the Z value to calculate the same probability. In cell C14, instead of using manual
formula, we could have used the Excel function =STANDARDIZE(x, mean, standard-dev).
Either way, we get the same Z value. This solution can be represented graphically by
Figure 3.19. From Excel, the =NORM.S.DIST() function can be used to calculate P(Z ≥ +1)
= 0.158655.

Figure 3.19 Shaded region represents P(Z ≥ 1)

Page | 182
We observe that the probability that an individual tube length is at least 136 cm is
0.1587 or 15.87% (P(X ≥ 136) = P(Z ≥ 1) = 0.1587).

Take a note of the following remarks:

1. The Excel function =NORM.DIST() calculates the value of the normal distribution
for the specified mean and standard deviation.
2. The Excel function =NORM.S.DIST() calculates the value of the normal
distribution for the specified Z score value.
3. The value of the Z score can also be calculated using the Excel function
=STANDARDIZE().
4. If the mean is equal to 0 and the standard deviation is equal to 1, then
=NORM.S.DIST() and =NORM.DIST() produce identical results. If µ ≠ 0 and σ ≠ 1,
then only the =NORM.DIST() function can be used.

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.20 (please note you can enter any
number).

Figure 3.20 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities.

Select Transform > Compute Variable


Target Variable: Example 5
Numeric expression = 1-CDF.NORMAL (136, 123, 13)

Figure 3.21 Use compute variable to calculate P(X ≥ 136)

Click OK.

Page | 183
The value will not be in the SPSS output file but in the SPSS data file in a column
called Example5.

Figure 3.22 SPSS solution P(Z ≥ 1) = 0.158655

The probability that an individual tube length is at least 136 cm is 0.158655 (or 15.9%).
This agrees with the Excel solution illustrated in Figure 3.18.

Example 3.6

A local authority purchases 250 emergency lights to be used by the emergency services.
The lifetime in hours of these lights follows a normal distribution, where X ~ N(230,
18²). Calculate: (a) what number of lights might be expected to fail within the first 215
hours; (b) what number of lights may be expected to fail between 227 and 235 hours;
and (c) after how many hours would we expect 10% of the lights to fail?

Excel solution

a. From this information we have a population mean, µ, of 230 hours and a variance, σ²,
of 324 (which is 18²). This problem can be solved using either the
=NORM.DIST() or =NORM.S.DIST() Excel function.

This solution can be represented graphically by Figure 3.23. This problem


involves finding P(X ≤ 215), and then multiplying it by the number of lights
purchased (250) to obtain the number expected to fail within the first 215 hours.

Figure 3.23 Shaded area represents P(X ≤ 215)

The Excel solution is illustrated in Figure 3.24. The =NORM.DIST() or


=NORM.S.DIST() function can be used to calculate P(X ≤ 215) = 0.2023. The
number of lights that are expected to fail out of the 250 lights purchased is E(fail)
= 250 × P(X ≤ 215) = 50.58, or 51 of the purchased lights.

Page | 184
Figure 3.24 Example 3.6 (a) Excel solution P(X ≤ 215)

b. The second part of the problem requires the calculation of the probability that X lies
between 227 and 235 hours, and the estimated number of purchased lights out of
250 which will fail.

This problem consists of finding P(227 ≤ X ≤ 235), as shown graphically by


Figure 3.25.

Figure 3.25 Shaded region represents P(227 ≤ X ≤ 235)

The Excel solution is illustrated in Figure 3.26. The =NORM.DIST() or


=NORM.S.DIST() function can be used to calculate P(227 ≤ X ≤ 235) =
0.175592357.

The number of purchased lights that are expected to fail between 227 and 235
hours out of the 250 lights is then E(fail) = 250 × P(227 ≤ X ≤ 235) = 43.898 or
44 purchased lights.

Page | 185
Figure 3.26 Example 3.6(b) Excel solution

c. The final part of this problem involves calculating the number of hours for the first
10% to fail. This corresponds to calculating the value of x where P(X ≤ x) = 0.1. To
solve this problem, we need two new Excel functions: =NORM.INV() and
=NORM.S.INV(). Figures 3.27 and 3.28 illustrate the graphical and Excel solutions.

Figure 3.27 Shaded region represents P(X ≤ x) = 0.1

Page | 186
Figure 3.28 Example 3.6(c) Excel solution

From Excel, the expected time for the first 10% to fail is 206.93 hours or, rounding up, 207 hours.
We show two different ways to solve this in cells C11 and C14. In cell C11 we calculate X
directly using Excel =NORM.INV() function. The result is 206.93 hours.

The Excel function =NORM.INV() calculates the value of X from a normal distribution for
the specified probability, mean and standard deviation. The Excel function
=NORM.S.INV() calculates the value of Z from the standard normal distribution for the specified
probability value. In cell C14 we calculate X directly from equation (3.7), which was
solved for X:

Z = (X − µ) / σ  →  Zσ = X − µ  →  X = Zσ + µ

We find that P(X ≤ x) = 0.1 corresponds to Z = –1.28 (cell C13). We can now use the
above equation to obtain X = (–1.28 × 18) + 230 = 206.96 (slight error here due to the
use of 2 decimal places in the Z value).
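All three parts of Example 3.6 can be reproduced with a few lines of Python; this is an illustrative sketch only, assuming scipy is installed, with norm.cdf standing in for =NORM.DIST()/CDF.NORMAL and norm.ppf standing in for =NORM.INV()/IDF.NORMAL.

from scipy.stats import norm

mu, sigma, n_lights = 230, 18, 250

p_a = norm.cdf(215, mu, sigma)                             # P(X <= 215)
expected_a = n_lights * p_a                                # about 50.6, i.e. 51 lights

p_b = norm.cdf(235, mu, sigma) - norm.cdf(227, mu, sigma)  # P(227 <= X <= 235)
expected_b = n_lights * p_b                                # about 43.9, i.e. 44 lights

hours_c = norm.ppf(0.1, mu, sigma)                         # x such that P(X <= x) = 0.1
print(expected_a, expected_b, hours_c)                     # hours_c is about 206.9 hours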

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.29. Please note you can enter any
number. Now we can use SPSS Statistics to calculate the associated probabilities.

Figure 3.29 Enter number 1 to represent VAR00001

a. Repeat the calculation above, but this time use:

Select Transform > Compute Variable
Target Variable: Example6
Numeric expression = CDF.NORMAL(215, 230, 18)

Figure 3.30 Use compute variable to calculate P(X ≤ 215)

Click OK.

Figure 3.31 SPSS solution P(X ≤ 215) = 0.202328

Target Variable: Example6_expected

Numeric expression = Example6*250

Figure 3.32 Use compute variable to calculate expected value

Click OK.

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example6_expected.

Figure 3.33 SPSS solution E(X ≤ 215) = 50.582095 or 51

The number of lights that are expected to fail out of the 250 lights is E(fail) = 250 × P(X
≤ 215) = 51 lamps. This agrees with the Excel solution illustrated in Figure 3.24.

b. Repeat the calculation above, but this time use:

Select Transform > Compute Variable
Target Variable: Example6b
Numeric expression = CDF.NORMAL(235, 230, 18) – CDF.NORMAL(227, 230, 18).

Figure 3.34 Use compute variable to calculate P(227 ≤ X ≤ 235)


Click OK.

Target Variable: Example6b_expected


Numeric expression = Example6b*250

Figure 3.35 Calculate expected value

Click OK.

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example6b_expected.

Figure 3.36 SPSS solution E(227 ≤ X ≤ 235) = 43.898089 or 44.

The number of lights that are expected to fail between 227 and 235 hours out of the 250
lights is E(fail) = 250 × P(227 ≤ X ≤ 235) = 44. This agrees with the Excel solution shown
in Figure 3.26.

c. Repeat the calculation above, but this time use:

Select Transform > Compute Variable


Target Variable: Example6c
Numeric expression = IDF.NORMAL(0.1, 230, 18).

Figure 3.37 Use compute variable to calculate x given P(X ≤ x) = 0.1

Click OK.

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example6c.

Figure 3.38 SPSS solution x = 206.932072 or 207.

The expected time for 10% to fail is 207 hours. This agrees with the Excel solution
shown in Figure 3.28.

Check your understanding

X3.2 Calculate the following probabilities, where X ~ N(100, 25): (a) P(X ≤ 95), (b)
P(95 ≤ X ≤ 105), (c) P(105 ≤ X ≤ 115), (d) P(93 ≤ X ≤ 99). In each case
convert X to Z. Compare with your answers from X3.1.

X3.3 Given that a normal variable has a mean of 12 and a variance of 36, calculate the
probability that a member chosen at random is: (a) 15 or greater, (b) 15 or
smaller, (c) 5 or smaller, (d) 5 or greater, (e) between 5 and 15.

X3.4 The lifetimes of certain brand of car tyres are normally distributed with a mean
of 60285 km and standard deviation of 7230 km. If the supplier guarantees them
for 50000 km, what proportion of tyres will be replaced under guarantee?

X3.5 Audio sensors have a maximum design frequency of 48 kHz. The sensors are produced
on a line with an output distributed as N(48.1, 1.03). Sensors with a maximum frequency
below 47.9 kHz or above 48.2 kHz are rejected. Find: (a) the proportion that
will be rejected; (b) the proportion that would be rejected if the mean were
adjusted so as to minimise the proportion of rejects; (c) by how much the
standard deviation would need to be reduced (leaving the mean at 48.1 kHz) so
that the proportion of rejects below 47.9 kHz would be halved.

Checking for normality


Normality tests assess the likelihood that the given data set comes from a normal
distribution. This is an important concept in statistics, given that the parametric
assumption relies on the data being normally distributed or approximately normally

distributed. Several statistical tests exist to test for normality, such as the Shapiro–Wilk
test. However, several visual tests can also be used, such as:

1. Constructing a five-number summary and box plot


2. Constructing a normal probability plot.

We are already familiar with the five-number summary and box plot. The second
approach, a normal probability plot, involves constructing a graph of data values against
corresponding Z values, where Z is based upon the ordered value.

Example 3.7

The manager at Big Jim’s restaurant is concerned at the time it takes to process credit
card payments at the counter by counter staff. The manager has collected the processing
time data (time in minutes for each of 19 cards) shown in Table 3.4 and requested that
the data be checked to see if they are normally distributed.

0.64 0.71 0.85 0.89 0.92 0.96 1.07 0.76 1.09 1.13
1.23 0.76 1.18 0.79 1.26 1.29 1.34 1.38 1.5
Table 3.4 Processing cards (n=19)

Excel solution

The method to create the normal probability plot is as follows (refer to Figure 3.39):

1. Order the data values (1, 2, 3, …, n) with 1 referring to the smallest data value
and n representing the largest data value (column E).
2. Show the data (y) sorted in ascending order (column F)
3. For the first data value (smallest) calculate the cumulative area using the
formula: = 1/(n + 1) (cell G4).
4. Repeat for the other values, where the cumulative area is given by the formula:
=old area + 1/(n + 1) (cells G5 down).
5. Calculate the value of Z for this cumulative area using the Excel function:
=NORM.S.INV(cumulative area) (column I).
6. Plot data values y (column F) against Z values (column I) for each data point
(Figure 3.40).

Figure 3.39 illustrates the Excel solution.
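
For readers who prefer scripting, the six steps above can be reproduced with a short Python sketch using numpy and scipy (an optional cross-check only; the 19 processing times are typed in directly from Table 3.4).

import numpy as np
from scipy.stats import norm

times = [0.64, 0.71, 0.85, 0.89, 0.92, 0.96, 1.07, 0.76, 1.09, 1.13,
         1.23, 0.76, 1.18, 0.79, 1.26, 1.29, 1.34, 1.38, 1.50]

y = np.sort(times)                       # step 2: data in ascending order
n = len(y)
area = np.arange(1, n + 1) / (n + 1)     # steps 3-4: cumulative areas i/(n + 1)
z = norm.ppf(area)                       # step 5: Z value for each cumulative area
# step 6: plotting y against z (e.g. with matplotlib) gives the normal probability plot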

Figure 3.39 Example 3.7 Use Excel to create a normal probability plot

Figure 3.40 shows the normal probability curve plot for this example.

Figure 3.40 Example 3.7 Normal probability plot

We observe from the graph that the relationship between the data values y and Z is
approximately a straight line. For data that are normally distributed we would expect
the relationship to be linear. In this situation, we would accept the statement that the
data values are approximately normally distributed.

SPSS solution

Enter the data into SPSS in one column – we have called the column variable
Normalcheck.

Figure 3.41 Example 3.7 SPSS data

Select Analyze > Descriptive Statistics > Explore.

Figure 3.42 SPSS Explore menu

Transfer variable into Dependent List box

Figure 3.43 SPSS explore menu

Click on Plots.

Click on Normality plots with tests

Figure 3.44 SPSS explore plots options


Click Continue
Click OK

SPSS output

This will output: (a) descriptive statistics, (b) tests of normality, and (c) normal Q-Q
plot.

Figure 3.45 SPSS descriptives solution

From the descriptive statistics in Figure 3.45, we see that the skewness is 0.115 and the
kurtosis –1.156. The following can be inferred:

a. The descriptive statistics output already suggests that these data are approximately
normally distributed, and you can see this from the skewness and kurtosis values.
Remember that SPSS and Excel report excess kurtosis rather than the proper kurtosis
value. Excess kurtosis is the proper kurtosis value minus 3 (in our case, the proper
kurtosis value equals 1.844). For normality, both the ratio of skewness to its standard
error and the ratio of excess kurtosis to its standard error are expected to lie within the
range ±1.96 for a 95% confidence interval. In our example (0.115/0.524 = 0.219 and
–1.156/1.014 = –1.140), both ratios lie within the range –1.96 to +1.96, and therefore
from these descriptives alone we can conclude that the data are approximately
normally distributed.

b. SPSS test of normality in Figure 3.45 presents the results from two well-known tests
of normality, namely the Kolmogorov–Smirnov test and the Shapiro–Wilk test.

The Shapiro–Wilk test is more appropriate for small sample sizes (less than 50) but
can also handle sample sizes as large as 2000. For this reason, we will use the
Shapiro–Wilk test as our numerical means of assessing normality. If the p-value (Sig.
in Figure 3.45) of the Shapiro–Wilk test is greater than 0.05, the data are
approximately normally distributed. If it is below 0.05, the data significantly deviate

from a normal distribution. In this example, the significance value is 0.584 > 0.05, and
we conclude that the data are approximately normally distributed. The rationale
behind this conclusion will become much clearer after we introduce hypothesis
testing in Chapter 6.
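
The Shapiro–Wilk test is also available outside SPSS; for example, a minimal Python sketch using scipy.stats.shapiro on the same 19 processing times is shown below. The exact statistic and p-value may differ very slightly from the SPSS output because of implementation details, but the conclusion (p-value well above 0.05) is the same.

from scipy.stats import shapiro

times = [0.64, 0.71, 0.85, 0.89, 0.92, 0.96, 1.07, 0.76, 1.09, 1.13,
         1.23, 0.76, 1.18, 0.79, 1.26, 1.29, 1.34, 1.38, 1.50]

stat, p_value = shapiro(times)   # p_value > 0.05, so do not reject normality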

c. To determine normality graphically, we can use the output of a normal Q-Q plot. If
the data are normally distributed, the data points will be close to a straight line. If
the data points stray from the line in an obvious nonlinear fashion, the data are not
normally distributed. From Figure 3.46, we observe the data are approximately
normally distributed. This agrees with the Excel solution illustrated in Figure 3.40.

Figure 3.46 Line fit to normal probability plot

If you are at all unsure of being able to correctly interpret the graph, rely on the
numerical methods instead because it can take a fair bit of experience to correctly judge
the normality of data based on plots.

To conclude, let us illustrate just three possible scenarios to better understand how
decisions are made on the symmetry of a distribution and the shape of the normal
probability curve.

First, Figure 3.47 illustrates a normal distribution where largest value minus Q3 equals
Q1 minus smallest value.

Second, Figure 3.48 illustrates a left-skewed distribution where Q1 minus the smallest
value greatly exceeds the largest value minus Q3.

Finally, Figure 3.49 illustrates a right-skewed distribution where largest value minus Q3
greatly exceeds Q1 minus smallest value.

Figure 3.47 Line fit to a distribution that is normally distributed

Figure 3.48 Line fit to a distribution that is left skewed

Figure 3.49 Line fit to a distribution that is right skewed

Check your understanding

X3.6 Use SPSS to assess whether the data set in Table 3.5 may have been drawn from
a normal distribution by comparing (a) skewness, (b) kurtosis, (c) Shapiro–Wilk
test statistic, and (d) normal Q-Q plot. [Hint: access the 'Want to learn more –
common assumptions about data' document to help you answer this question.]

3.4 3.8 3.9 2.7 4.8 4.7
4.3 3.2 3.7 3.2 3.8 4.7
4.4 3.3 3.9 4.4 3.3 3.1
3.7 3.2 3.5 4.5 4.1 3.2
4.1 4.0 4.2 4.7 3.5 4.1
Table 3.5 Data set

Student’s t distribution
In probability and statistics, Student’s t distribution (or simply the t distribution) is any
member of a family of continuous probability distributions that arise when estimating
the mean of a normally distributed population in situations where the sample size is
small and the population standard deviation is unknown.

By ‘small sample’, we typically mean a sample with a maximum of 30 observations. The
t distribution was developed by William Gosset under the pseudonym ‘Student’. Whereas a normal
distribution describes a full population (it can also describe a sample), t distributions
always describe samples drawn from a full population. Accordingly, the t distribution
for each sample size is different, and the larger the sample, the more the distribution
resembles a normal distribution. For interested readers the probability density function
(pdf) of the t distribution is defined by equation (3.8):

f(t) = Γ((df + 1)/2) / [√(df × π) × Γ(df/2)] × (1 + t²/df)^(−(df + 1)/2)     (3.8)

Where the degrees of freedom df > 0 (to be explained shortly) and –∞ < t < +∞.

The t distribution is symmetric and bell-shaped, like the normal distribution, but has
heavier tails, meaning that it is more prone to producing values that fall far from its
mean.

Figure 3.50 provides a comparison between the effects of different sample sizes on the t
distribution compared to the standard normal curve. As the sample size increases, the t
distribution approaches the standard normal curve and the t distribution can be used in
place of the normal distribution when the population standard deviation (or variance)
is unknown.

Figure 3.50 Different Student’s t distributions compared to the normal
distribution

If we take a sample of n observations from a continuously distributed population with
population mean µ, then the sample mean and sample variance are given by equations
(3.9) and (3.10), respectively:

x̄ = (x₁ + x₂ + x₃ + ⋯ + xₙ)/n     (3.9)

S² = [1/(n − 1)] × Σᵢ₌₁ⁿ (xᵢ − x̄)²     (3.10)

Given equations (3.9) and (3.10), we can calculate the t-value using equation (3.11):

t = (x̄ − µ) / (S/√n)     (3.11)

The t distribution with n – 1 degrees of freedom is the sampling distribution of the
t-value when the samples consist of independent and identically distributed
observations from a normally distributed population.

This t-test equation will prove very important in later chapters when we want to
construct a confidence interval or conduct a hypothesis test on data collected from a
normal or approximately normal population and we do not know the value of the
population variance (and hence the standard deviation).

Example 3.8

A sample has been collected from a normal distribution where the population standard
deviation is unknown. After careful consideration, the business analyst decides that a t
test would be appropriate for the required analysis. Use Excel and SPSS to calculate the

value of the t statistic (assuming all the area is in the upper right-hand tail), when the
number of degrees of freedom is 18 and with a significance level of 0.1, 0.05 and 0.01.

Excel solution

Figure 3.51 illustrates the Excel solution. The t-values for a t distribution with 18
degrees of freedom are 1.33, 1.73 and 2.55 when the significance levels are 0.1, 0.05 and
0.01, respectively.

Figure 3.51 Example 3.8 Excel solution

The t-values for the significance levels of 0.1, 0.05 and 0.01 (cells C4:E4) are 1.33, 1.73
and 2.55 respectively (cells C6:E6).
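
These critical values can also be verified with a short Python sketch using scipy (an optional cross-check, not part of the Excel/SPSS solution).

from scipy.stats import t

df = 18
for alpha in (0.1, 0.05, 0.01):
    print(alpha, t.ppf(1 - alpha, df))   # approximately 1.33, 1.73 and 2.55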

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.52. Please note you can enter any
number.

Figure 3.52 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities.

For a significance value of 0.1, repeat the calculation above but this time use:

Select Transform > Compute Variable


Target Variable: Example8a
Numeric expression = IDF.T(1-0.1,18).

Figure 3.53 Use compute variable to calculate t(0.1, 18)

Click OK.

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example8a.

Figure 3.54 SPSS solution t(0.1, 18) = 1.33

The t-value for a t distribution with 18 degrees of freedom is 1.33 when the significance
level is 0.1. This agrees with the Excel solution shown in Figure 3.51.

For a significance value of 0.05, repeat the calculation above but this time use:

Select Transform > Compute Variable


Select Target Variable: Example8b
Numeric expression = IDF.T(1-0.05,18).

Figure 3.55 Use compute variable to calculate t(0.05, 18)

Click OK.

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column called
Example8b.

Figure 3.56 SPSS solution t(0.05, 18) = 1.73

The t-value for a t distribution with 18 degrees of freedom is 1.73 when the significance
level is 0.05. This agrees with the Excel solution shown in Figure 3.51.

For a significance value of 0.01, repeat the calculation above but this time use:

Select Transform > Compute Variable


Select Target Variable: Example8c
Numeric expression = IDF.T(1-0.01,18).

Figure 3.57 Use compute variable to calculate t(0.01, 18)

Click OK.

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example8c.

Figure 3.58 SPSS solution t(0.01, 18) = 2.55

The t-value for a t distribution with 18 degrees of freedom is 2.55 when the significance
level is 0.01. This agrees with the Excel solution shown in Figure 3.51.

Example 3.9

The business analyst in the previous example finds that the value of the t statistic equals
1.24 with 18 degrees of freedom. Estimate the value of the area such that the variable is
less than 1.24 and the value of the probability density function at this value of t and df.

Excel solution

Figure 3.59 illustrates the Excel solution.

Figure 3.59 Example 3.9 Excel solution

The probability that the t value, when we have 18 degrees of freedom, is less than or
equal to 1.24 is 0.88 (or 88%). The value of the probability density function is 0.18 (or
18%) when t = 1.24, df = 18. This is the right tail distribution, and we will explain the
details in the following chapter.
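
As an optional cross-check, the same two quantities can be obtained in Python with scipy:

from scipy.stats import t

print(t.cdf(1.24, 18))   # approximately 0.88
print(t.pdf(1.24, 18))   # approximately 0.18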

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.60. Please note you can enter any
number.

Figure 3.60 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities. Repeat the
calculation above but this time use:

Select Transform > Compute Variable


Select Target Variable: Example9a
Numeric expression = CDF.T(1.24,18).

Figure 3.61 Use compute variable to calculate P(t ≤ 1.24)

Click OK.

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example9a.

Figure 3.62 SPSS solution P(t ≤ 1.24) = 0.88

Now, calculate f(x)

Select Target Variable: Example9b


Numeric expression = PDF.T(1.24,18).

Figure 3.63 Use compute variable to calculate probability density function f(x)
when t = 1.24, df = 18. The value will not be in the SPSS output file but in the
SPSS data file in a column called Example9b.

SPSS output

Figure 3.64 SPSS solution f(x) = 0.18

The probability that the t value is less than or equal to 1.24 is 0.88 (or 88%) when we
have 18 degrees of freedom. The value of the probability density function is 0.18 when t
= 1.24, df = 18. This agrees with the Excel solution illustrated in Figure 3.59.

Check your understanding

X3.7 Calculate the following t-distribution probabilities when df = 6: (a) P(t ≤ –1.45),
(b) P(t ≤ 0), (c) P(0.3 ≤ t ≤ 1.4), (d) P(–1.34 ≤ t ≤ 1.8).

X3.8 Calculate the area in the right-hand tail if the t-value equals 2.05, P(t ≥ 2.05), and
the t distribution has 15 degrees of freedom.

F distribution
In probability theory and statistics, the F distribution, also known as Snedecor's F
distribution or the Fisher–Snedecor distribution (after Ronald Fisher and George W.
Snedecor) is another continuous probability distribution. You will see in the chapters
that follow that this distribution arises frequently as the null distribution of a test
statistic, most notably in the analysis of variance. The F-test statistic is defined by
equation (3.12):

F = s₁² / s₂²     (3.12)

Where s₁² and s₂² are the sample 1 and sample 2 variances, respectively. The shape of
the distribution depends upon the numerator and denominator degrees of freedom (df1
= n1 – 1, df2 = n2 – 1), and the F distribution is written as a function of n1 and n2 as
F(n1, n2). The probability density function (pdf) of the F distribution is defined by equation
(3.13):

f(x) = [Γ((df1 + df2)/2) / (Γ(df1/2) × Γ(df2/2))] × df1^(df1/2) × df2^(df2/2) × x^(df1/2 − 1) / (df2 + df1·x)^((df1 + df2)/2)     (3.13)

Where x > 0 and Γ(df) denotes the gamma function, with Γ(df) = (df – 1)! for positive
integer values of df. The gamma function is one of the ‘standard’ functions in mathematics
and is used to extend the factorial function to fractions and complex numbers. The factorial
n! is defined for a positive integer n as n! = n × (n – 1) × (n – 2) × ⋯ × 2 × 1. For example,
5! = 5 × 4 × 3 × 2 × 1 = 120. From a calculation perspective, we do not need to worry about
using equation (3.13), given we have access to published tables or software like Excel and
SPSS to do the calculations. Figure 3.65 illustrates the shape of the F distribution for dfA
= 17 and dfB = 24.

Figure 3.65 F distribution with dfA = 17, dfB = 24

Example 3.10

Calculate the probability of F ≤ 4.03 if the numerator and denominator degrees of


freedom are 9 and 9, respectively. Based on this answer, calculate the P(F  4.03).

Excel solution

Figure 3.66 illustrates the Excel solution. The probability that F will be less than 4.03 is
0.98 (or 98%) and greater than 4.03 is 0.02 (or 2%), given the numerator and
denominator degrees of freedom are 9.

Figure 3.66 Example 3.10 Excel solution

Cells C9:C11 in Figure 3.66 show three different ways to achieve the same result.
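
A quick Python cross-check of the same probabilities, using scipy, is sketched below (illustrative only).

from scipy.stats import f

print(f.cdf(4.03, 9, 9))   # P(F <= 4.03), approximately 0.98
print(f.sf(4.03, 9, 9))    # P(F >= 4.03), approximately 0.02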

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.67. Please note you can enter any
number.

Figure 3.67 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities.

a. To obtain P(F ≤ 4.03), repeat the calculation above but this time use:

Select Transform > Compute Variable


Target Variable: Example10a

Numeric expression = CDF.F(4.03,9,9).

Figure 3.68 Use compute variable to calculate P(F ≤ 4.03)

Click OK

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example10a.

Figure 3.69 SPSS solution P(F ≤ 4.03) = 0.98

P(F ≤ 4.03) = 0.98

The probability that F will be less than 4.03 is 0.98 (or 98%). This agrees with
the Excel solution shown in Figure 3.66.

b. To obtain P(F  4.03), repeat the calculation above but this time use:

Select Transform > Compute Variable


Target Variable: Example10b
Numeric expression = 1-CDF.F(4.03,9,9).

Figure 3.70 Use compute variable to calculate P(F ≥ 4.03)

Click OK

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example10b.

Figure 3.71 SPSS solution P(F ≥ 4.03) = 0.02

P(F  4.03) = 0.02

The probability that F will be greater than 4.03 is 0.02 (or 2%)

The probability that F will be less than 4.03 is 0.98 (or 98%) and greater than 4.03 is
0.02 (or 2%), given the numerator and denominator degrees of freedom are 9. This
agrees with the Excel solution shown in Figure 3.66.

Check your understanding

X3.9 Calculate F when α = 0.1 and numerator (df1) and denominator (df2) degrees of
freedom are 5 and 7, respectively.
X3.10 Calculate the probability that F ≥ 2.34 when numerator and denominator
degrees of freedom are 12 and 18, respectively.

Chi-square distribution
The chi-square (χ2) distribution is a widely used distribution for solving statistical
inference problems involving contingency tables, and it can also be used to test whether
a sample of data came from a population with a specific distribution. The probability
density function (pdf) of the chi-square distribution is defined by equation (3.14):

f(x, k) = x^(k/2 − 1) e^(−x/2) / [2^(k/2) × Γ(k/2)],  x > 0     (3.14)

Where x > 0 and Γ denotes the gamma function that we have already briefly described.
From a calculation perspective, we again do not need to worry about using this
equation, given we have access to published tables or software like Excel and SPSS to do
the calculations. Figure 3.72 illustrates how the shape of the chi-square distribution
varies as the degrees of freedom vary between 3 and 9.

Figure 3.72 Chi square distribution curves for different degrees of freedom

Example 3.11

Calculate: (a) the probability that the chi-square test statistic is 1.86 or less if the
number of degrees of freedom is 8; (b) find the value of x given P(χ2 ≥ x) = 0.04 and 10
degrees of freedom.

Excel solution

Figure 3.73 illustrates the Excel solution.

Figure 3.73 Example 3.11 Excel solution

We can see that (a) P(χ2 ≤ 1.86) with 8 degrees of freedom yields probability 0.015, and
(b) P(χ2 ≥ x) = 0.04 with 10 df gives x = 19.02.
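
As before, an optional Python sketch with scipy reproduces both parts of the answer.

from scipy.stats import chi2

print(chi2.cdf(1.86, 8))        # part (a): approximately 0.015
print(chi2.ppf(1 - 0.04, 10))   # part (b): approximately 19.02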

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.74. Please note you can enter any
number.

Figure 3.74 Enter number 1 to represent variable VAR00001

Now we can use SPSS Statistics to calculate the associated probabilities.

a. To find P(χ2 ≤ 1.86), repeat the calculation above but this time use:

Select Transform > Compute Variable


Target Variable: Example11a
Numeric expression =CDF.CHISQ(1.86, 8).

Figure 3.75 Use compute variable to calculate P(chi-square ≤ 1.86)


Click OK.

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example11a.

Figure 3.76 SPSS solution P(chi-square ≤ 1.86) = 0.015

P(χ2 ≤ 1.86) is 0.015 (or 1.5%). This agrees with the Excel solution shown in
Figure 3.73.

b. To find the value of x given P(χ2 ≥ x) = 0.04, repeat the calculation above but this
time use:

Select Transform > Compute Variable


Target Variable: Example11b
Numeric expression =IDF.CHISQ(1-0.04, 10).

Figure 3.77 Use compute variable to calculate x, given P(chi-square ≥ x) = 0.04
Click OK

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example11b.

Figure 3.78 SPSS solution, x = 19.02

P(χ2 ≥ x) = 0.04 with 10 df gives x = 19.02. This agrees with the Excel solution
shown in Figure 3.73.

Check your understanding

X3.11 Calculate the probability that χ2 ≥ 13.98 given 7 degrees of freedom.

X3.12 Find the value of x given P(χ2 ≤ x) = 0.0125 with 12 degrees of freedom.

3.4 Discrete probability distributions


Introduction

In this section, we shall explore discrete probability distributions when dealing with
discrete random variables. The two specific distributions we include are the binomial
probability distribution and the Poisson probability distribution.

Example 3.12

To illustrate the idea of a discrete probability distribution, consider the frequency
distribution presented in Table 3.6, representing the distance travelled by delivery vans
per day. The table tells us that 4 vans out of 100 travelled between 400 and 449 miles,
44 out of 100 travelled between 450 and 499 miles, and so on.

Distance Frequency, f
400 - 449 4
450 - 499 44
500 - 549 36
550 - 599 15
600 - 649 1
Total = 100
Table 3.6 Frequency distribution

From Table 3.6 we can work out the relative frequency distribution. Table 3.7 illustrates
the calculation process. We observe from Table 3.7 that the relative frequency for 400–449
miles travelled, for example, is 4/100 = 0.04. This implies that we have a chance, or
probability, of 4/100 that the distance travelled lies within this class.

Distance Frequency, f Relative frequency


400 - 449 4 0.040000
450 - 499 44 0.440000
500 - 549 36 0.360000
550 - 599 15 0.150000
600 - 649 1 0.010000
Total = 100 1.000000
Table 3.7 Calculation of relative frequencies

Excel solution

Figure 3.79 Example 3.12 Excel solution

Thus, relative frequencies provide estimates of the probability for that class, or value, to
occur. If we were to plot the histogram of relative frequencies, we would in fact be
plotting the probabilities for each event: P(400–449) = 0.04, P(450–499) = 0.44, etc.
The distribution of probabilities given in Table 3.7 and the graphical representation in
Figure 3.80 are different ways of illustrating the probability distribution.

Figure 3.80 Histogram for the distance travelled

For the frequency distribution, the area under the histogram is proportional to the total
frequency. However, for the probability distribution, the area is proportional to total
probability (= 1). Given a probability distribution, we can determine the probability for
any event associated with it. For example, P(400 – 549) = P(400 ≤ X ≤ 549) is the area
under the distribution from 400 to 549, or P(400 – 449) + P(450 – 499) + P(500 – 549)
= 0.04 + 0.44 + 0.36 = 0.84. Thus, we have a probability estimate of approximately 84%
for the distance travelled to lie between 400 and 549 miles.

Now, imagine that in Figure 3.80 we decreased the class width towards zero and
increased the number of associated bars observable. Then the outline of the bars in
Figure 3.80 would approximate a curve – the probability distribution curve. The value
of the mean and standard deviation can be calculated from the frequency distribution
by using equations (3.15) and (3.16), respectively. If you undertake the calculation (see
the next example) then the mean value is 507.00 with a standard deviation of 40.85
miles. By using relative frequencies to determine the mean we have in fact found the
mean of the probability distribution.

The expected value of a discrete random variable X, which is denoted in many ways,
including E(X) and µ, is also known as the expectation or mean. For a discrete random
variable X under probability distribution P, the expected value of the probability
distribution E(X) is defined by equation (3.15):

E(X) = ∑ X × P (3.15)

Where X is a random variable with a set of outcome variables X1, X2, X3, …, Xn occurring
with probabilities P1, P2, P3, …, Pn. Equation (3.15) can be written as:

E(X) = ∑ X × P

E(X) = X1P1 + X2P2 + ……. + XnPn

Further thought along the lines used in developing the notion of expectation would
reveal that the variance of the probability distribution, VAR(X), represents a measure of
how broadly distributed the random variable X tends to be and is defined as the
expectation of the squared deviation from the mean:
VAR(X) = E[(X − E(X))²]

What does this mean? First, let us rewrite the definition explicitly as a sum. If X takes
values X1, X2, …, Xn, with each X value having an associated probability, P(X), then the
variance equation can be written as follows:
VAR(X) = ∑(X − E(X))² × P(X)     (3.16)

In words, the formula for VAR(X) says to take a weighted average of the squared
distance to the mean. By squaring, we make sure we are averaging only non-negative
values, so that the spread to the right of the mean will not cancel that to the left. From
equation (3.16), the standard deviation for the probability distribution is calculated
using the relationship given in equation (3.17):

SD(X) = √VAR(X) (3.17)
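
To make the arithmetic behind equations (3.15)–(3.17) concrete, the short Python sketch below applies them to the class midpoints and relative frequencies from Table 3.7 (using the midpoint of each class as its representative value, which is the same assumption made in Example 3.13).

mids  = [424.5, 474.5, 524.5, 574.5, 624.5]   # class midpoints from Table 3.7
probs = [0.04, 0.44, 0.36, 0.15, 0.01]        # relative frequencies

e_x   = sum(x * p for x, p in zip(mids, probs))                # E(X)   = 507.0
var_x = sum((x - e_x) ** 2 * p for x, p in zip(mids, probs))   # VAR(X) = 1668.75
sd_x  = var_x ** 0.5                                           # SD(X)  is approximately 40.85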

Example 3.13

Returning to the distances travelled by the delivery vans, we can easily calculate the mean and
the corresponding measure of dispersion as illustrated in Figures 3.81–3.83. Remember
that LCB and UCB stand for the lower-class boundary and upper-class boundary,
respectively.

Figure 3.81 Example 3.13 Excel solution

Figure 3.82 Example 3.13 Excel solution continued

Figure 3.83 Example 3.13 Excel solution continued

We can see that the mean, or expected value, is 507.0 miles (cell C14), with a standard
deviation of 40.85 miles (cell C16).

Check your understanding

X3.13 Give an appropriate sample space for each of the following experiments:

a. A card is chosen at random from a pack of cards.


b. A person is chosen at random from a group containing 5 females and 6 males.
c. A football team records the results of each of two games as 'win', 'draw' or 'lose'.

X3.14 A dart is thrown at a board and is likely to land on any one of eight squares
numbered 1 to 8 inclusive. A represents the event the dart lands in square 5 or 8.
B represents the event the dart lands in square 2, 3 or 4. C represents the event
the dart lands in square 1, 2, 5 or 6. Which two events are mutually exclusive?

X3.15 Table 3.8 provides information about 200 school leavers and their destination
after leaving school. Determine the following probabilities that a person selected
at random:

Leave school at Leave school at


16 years a higher age
Full time education, E 14 18
Full time job, J 96 44
Other 15 13
Table 3.8 Destination of school leavers

a. Went into full-time education
b. Went into a full-time job
c. Either went into full-time education or went into a full-time job
d. Left school at 16
e. Left school at 16 and went into full-time education.

Binomial probability distribution


One of the most elementary discrete random variables, the binomial, is associated with
questions that only allow yes or no type answers, or a classification such as male or
female, or recording a component as defective or not defective. If the outcomes are also
independent, (e.g., the possibility of a defective component does not influence the
possibility of finding another defective component) then the variable is a binomial
variable.

Consider the example of a supermarket that runs a two-week television campaign to


increase its volume of trade. During the campaign, all customers are asked if they came
to the supermarket because of the television advertising. Each customer response can
be classified as either yes or no. At the end of the campaign the proportion of customers
who responded ‘yes’ is determined. For this study, the experiment is the process of
asking customers if they came to the supermarket because of the television advertising.

The random variable, X, is defined as the number of customers who responded ‘yes’.
Each of the n customers contributes a value of either 1 (“yes”) or 0 (“no”), so X counts
the number of 1s. Consequently, the random variable is discrete.
characteristics that define the binomial experiment. The experiment consists of n
identical trials:

1. Each trial results in one of two outcomes which for convenience we can define as
either a success or a failure.
2. The outcomes from trial to trial are independent.
3. The probability of success (p) is the same for each trial.
4. The probability of failure is q = 1 – p.
5. The random variable equals the number of successes in the n trials and can take
a value from 0 to n.

These five characteristics define the binomial experiment and are applicable for
situations of sampling from finite populations with replacement or for infinite
populations with or without replacement.

Example 3.14

A group of archers is interested in calculating the probability of hitting the centre of
the target from the recommended beginners’ distance of 5 yards, as illustrated in
Figure 3.84 (not to scale). The historical data collected from the archery club give a
probability of 0.3 of hitting the centre for beginners who have attended the required
training courses. If an archer makes 3 attempts, calculate the probability distribution of
the number of hits.

Figure 3.84 Archery target (not to scale)

This experiment can be modelled by a binomial distribution since:

1. We have three identical trials (n = 3).


2. Each trial can result in hitting the target (success) or not hitting the target
(failure).
3. The outcome of each trial is independent.
4. The probability of a success (P(hitting target) = p = 0.3) is the same for each trial.
5. The random variable is discrete.

Let T represent the event that the archer hits the target, and T′ represent the event
that the target is missed. The corresponding individual event probabilities are:

Probability of hitting target, P(T) = 0.3

Probability of missing target, P(T′) = 1 – P(T) = 1 – 0.3 = 0.7

Figure 3.85 illustrates the tree diagram that represents the described experiment.

Figure 3.85 Example 3.14 tree diagram


From this tree diagram we can identify the possible routes that could be achieved by
having 3 attempts on the target, where X represents the number of targets hit on 3
attempts:

1. Target missed on all attempts out of the 3 attempts, X = 0


2. Target hit on 1 occasion out of the 3 attempts, X = 1
3. Target hit on 2 occasions out of the 3 attempts, X = 2
4. Target hit on 3 occasion out of the 3 attempts, X = 3

This is the complete set of possible outcomes for this experiment, which consists
of allowing 3 attempts on the target. We can now use the tree diagram to identify the
routes to achieve each of these alternative possibilities and the associated probabilities,
given the probability of hitting the target at each attempt, p = 0.3 (therefore, probability
of missing target on each attempt q = 1 – p = 0.7).

Target missed on all attempts out of the 3 attempts, X = 0

This can be achieved via the route: 1st attempt missed, 2nd attempt missed, and
3rd attempt missed. We can write this as a probability equation as follows:

Probability that target missed on all 3 attempts = P(X = 0)

P(X = 0) = P(1st attempt missed, 2nd attempt missed, and 3rd attempt missed)

P(X = 0) = P(T’ and T’ and T’)

Given that each of these 3 attempts are independent events, then we can show
that

P(T’ and T’ and T’) = P(T’)  P(T’)  P(T’)

P(X = 0) = 0.7 × 0.7 × 0.7

P(X = 0) = (0.7)3

P(X = 0) = 0.343

The important lesson is to note how we can use the tree diagram to calculate an
individual probability but also note the pattern identified in the relationship between
the probability, P(X = x), and the individual event probability of success, p, or failure, q.
If we continue the calculation procedure we find:

P(1 target hit) = P(X = 1 success)

P(1 target hit) = P(TT′T′ or T′TT′ or T′T′T)

P(1 target hit) = 0.3 × 0.7 × 0.7 + 0.7 × 0.3 × 0.7 + 0.7 × 0.7 × 0.3

P(1 target hit) = 0.3 × (0.7)2 + 0.3 × (0.7)2 + 0.3 × (0.7)2

P(1 target hit) = 3 × 0.3 × (0.7)2

P(1 target hit) = 3pq2 = 0.441

P(2 targets hit) = P(X = 2 successes)

P(2 targets hit) = P(TTT′ or TT′T or T′TT)

P(2 targets hit) = 0.3 × 0.3 × 0.7 + 0.3 × 0.7 × 0.3 + 0.7 × 0.3 × 0.3

P(2 targets hit) = (0.3)2 × 0.7 + (0.3)2 × 0.7 + (0.3)2 × 0.7 = 3 × (0.3)2 × 0.7

P(2 targets hit) = 3p2q = 0.189

P(3 targets hit) = P(X = 3 successes)

P(3 targets hit) = P(TTT)

P(3 targets hit) = 0.3 × 0.3 × 0.3

P(3 targets hit) = (0.3)3 = p3 = 0.027

From these calculations, we can now note the probability distribution for this
experiment (see Table 3.9 and Figure 3.86).

x Formula P(X = x)
0 q³ 0.343
1 3pq² 0.441
2 3p²q 0.189
3 p³ 0.027
Total = 1.000
Table 3.9 Probability distribution table

Figure 3.86 Bar chart representing the binomial distribution event

From the probability distribution given in Table 3.9, we observe that the total
probability equals 1. This is expected since the total probability would represent the
total experiment. We can express the total probability for the experiment by equation
(3.18):

∑ 𝑃(𝑋 = 𝑥) = 1 (3.18)

If we increase the size, n, of the experiment, then it becomes quite difficult to calculate
the event probabilities. We really need to develop a formula for calculating binomial
probabilities. Using the ideas generated earlier, we have

Total probability = P(X = 0 or X = 1 or X = 2 or X = 3)

Total probability = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)

Using the information given in Table 3.9 we can re-write this equation

Total probability = (0.7)³ + 3 × (0.3) × (0.7)² + 3 × (0.3)² × (0.7) + (0.3)³

Total probability = q³ + 3pq² + 3p²q + p³

Total probability = p³ + 3p²q + 3pq² + q³

Repeating this experiment for increasing values of n would enable the identification of a
pattern that can be used to develop equation (3.19). This equation can then be used to
calculate the probability of x successes given n attempts of the binomial experiment.

P(X = x) = 𝐶𝑥𝑛 px qn−x (3.19)

The term 𝐶𝑥𝑛 calculates the binomial coefficients, which are the numbers in front of the
letter terms in the binomial expansion. In the previous example we found that the total
probability was p³ + 3p²q + 3pq² + q³, with the numbers 1, 3, 3, 1 in front of the letters. These
numbers are called the ‘binomial coefficients’ and are calculated using equation (3.20):

𝑛!
𝐶𝑥𝑛 = 𝑥! (𝑛−𝑥)! (3.20)

Where n! (pronounced ‘n factorial’) is factorial of a positive integer n, and is defined by


equation (3.21):

n! = n × (n – 1) × (n – 2) × (n – 3) × ⋯ × 3 × 2 × 1     (3.21)

The term 𝐶𝑥𝑛 calculates the number of ways of obtaining x successes from n attempts of
the experiment. For example, the values of 3! 2! 1!, and 0! are as follows:

3! = 3 × 2 × 1 = 6

2! = 2 × 1 = 2

1! = 1

0! = 1.

It can be shown that the mean and variance of a binomial distribution are given by
equations (3.22) and (3.23):

Mean of a binomial distribution, E(X) = np (3.22)

Variance of a binomial distribution, VAR(X) = npq (3.23)

To illustrate equation (3.19), suppose now that the probability of hitting the target is
p = 0.35, so that n = 3, p = 0.35 and q = 1 – p = 1 – 0.35 = 0.65.

Let us use equation (3.19) to calculate the probability that the target is not hit on any of
the 3 attempts, P(X = 0).

P(X = 0)

Substituting these values into equation (3.19) gives for P(X = 0):

P(X = x) = 𝐶𝑥𝑛 pˣ qⁿ⁻ˣ

P(X = 0) = 𝐶₀³ × 0.35⁰ × 0.65³⁻⁰

P(X = 0) = 𝐶₀³ × 0.35⁰ × 0.65³

Inspecting this equation, we have three terms that are multiplied together to provide
the probability of the target not being hit on the three attempts:

𝐶₀³

(0.35)⁰ = 1

(0.65)³ = 0.65 × 0.65 × 0.65

The second and third terms are straightforward to calculate, and the first can be
calculated from equation (3.20) as follows:

𝐶₀³ = 3! / (0! (3 − 0)!)

𝐶₀³ = 3! / (0! 3!)

𝐶₀³ = (3 × 2 × 1) / (1 × 3 × 2 × 1)

𝐶₀³ = 1

Substituting this value into the problem solution gives:

P(X = 0) = 1 × 1 × 0.65³

P(X = 0) = 0.274625

Equation (3.19) can now be used to calculate P(X = 1), P(X = 2), and P(X = 3):

P(X = 1)

P(X = 1) = 𝐶₁³ × 0.35¹ × 0.65³⁻¹

𝐶₁³ = 3! / (1! (3 − 1)!)

𝐶₁³ = 3! / (1! 2!)

𝐶₁³ = 3

P(X = 1) = 3 × 0.35¹ × 0.65²

P(X = 1) = 3 × 0.35 × 0.65 × 0.65

P(X = 1) = 0.443625

P(X = 2)

P(X = 2) = 𝐶₂³ × 0.35² × 0.65³⁻²

𝐶₂³ = 3! / (2! (3 − 2)!)

𝐶₂³ = 3! / (2! 1!)

𝐶₂³ = 3

P(X = 2) = 3 × 0.35² × 0.65¹

P(X = 2) = 3 × 0.35 × 0.35 × 0.65

P(X = 2) = 0.238875

P(X = 3)

P(X = 3) = 𝐶₃³ × 0.35³ × 0.65³⁻³

𝐶₃³ = 3! / (3! (3 − 3)!)

𝐶₃³ = 3! / (3! 0!)

𝐶₃³ = 1

P(X = 3) = 1 × 0.35³ × 0.65⁰

P(X = 3) = 1 × 0.35 × 0.35 × 0.35

P(X = 3) = 0.042875

Given that we have calculated all the probabilities for this binomial experiment when
we had 3 attempts, then equation (3.18) should be true.

∑ 𝑃(𝑋 = 𝑥) = 1

Check by adding the individual probabilities together

Total probability = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)

Total probability = 0.274625 + 0.443625 + 0.238875 + 0.042875

Total probability = 1
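
The same four probabilities (and the cumulative values used later) can be cross-checked with a few lines of Python using scipy's binomial functions; this sketch assumes n = 3 and p = 0.35, as above.

from scipy.stats import binom

n, p = 3, 0.35
for x in range(4):
    print(x, binom.pmf(x, n, p))   # 0.274625, 0.443625, 0.238875, 0.042875

print(binom.cdf(2, n, p))          # P(X <= 2) = 0.957125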

Excel solution

Figure 3.87 illustrates the Excel solution.

Figure 3.87 Example 3.14 Binomial distribution

SPSS solution

To use SPSS to calculate values we require data in the SPSS data file. If no data is
present, then enter a number to represent variable VAR00001. In this example, we
entered the number 1 as illustrated in Figure 3.88. Please note you can enter any
number.

Figure 3.88 Enter number 1 to represent VAR00001

Now we can use SPSS Statistics to calculate the probability distribution and individual
associated probabilities.

Probability distribution

In SPSS type 0, 1, 2, 3 in one column and label X.


Select Transform > Compute Variable
Target Variable: Prob
Numeric expression = PDF.BINOM(X, 3, 0.35)

Figure 3.89 Compute variable

Click Ok

The value will not be in the SPSS output file but in the SPSS data file in a column
called X and Prob, as illustrated in Figure 3.90.

Figure 3.90 Binomial distribution

These results agree with the manual and Excel solutions.

Individual probabilities

We can use SPSS to calculate the individual probabilities as illustrated for P(X = 3):

P(X = 3 given binomial and n = 3, p = 0.35)

SPSS can be used to calculate the probability of a binomial event occurring using
the PDF.BINOM function.

Select Transform > Compute Variable


Target Variable: Example14a
Numeric expression = PDF.BINOM(3, 3, 0.35)

Figure 3.91 Use compute variable to calculate P(X = 3)

Click OK

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column
called Example14a

Figure 3.92 SPSS solution P(X = 3) = 0.042875

P(X = 3 given binomial and n = 3, p = 0.35) = 0.042875. This agrees with the Excel
solution illustrated in Figure 3.87.

Calculate P(X ≤ 2)

If we want to calculate P(X ≤ 2), then we can solve this as follows:

P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)

P(X ≤ 2) = 0.274625 + 0.443625 + 0.238875

P(X ≤ 2) = 0.957125

Therefore, we have a 95.71% chance of at most 2 target hits out of the
3 attempts.

Excel solution

We can use Excel to solve this problem by using the Excel function:

P(X ≤ 2) = BINOM.DIST(2, n, p, TRUE)

P(X ≤ 2) = 0.957125

SPSS solution

We can use SPSS to solve this problem as follows:

Select Transform > Compute Variable


Target Variable: Prob_X
Numeric expression = CDF.BINOM(2, 3, 0.35)

Figure 3.93 Use compute variable to calculate P(X ≤ 2)

Figure 3.94 SPSS solution P(X ≤ 2) = 0.957125

Calculate P(X < 2)

If you wanted to calculate the probability that X < 2 then we can solve this as
follows:

P(X < 2) = P(X = 0) + P(X = 1)

P(X < 2) = 0.274625 + 0.443625

P(X < 2) = 0.71825

Therefore, we have a 71.83% chance that we have less than 2 target hits out of
the 3 attempts.

Excel solution

We can use Excel to solve this problem by using the Excel function

P(X < 2) = BINOM.DIST(1, n, p, TRUE)

P(X < 2) = 0.71825

SPSS solution

We can use SPSS to solve this problem as follows:

Select Transform > Compute Variable


Target Variable: Prob_Xa
Numeric expression = CDF.BINOM(1, 3, 0.35)

Figure 3.95 Use compute variable to calculate P(X < 2)

Figure 3.96 SPSS solution P(X < 2) = 0.718250

Notes:

1. The mean of a binomial distribution is given by equation (3.22), mean = np = 3 ×
0.35 = 1.05.
2. The variance of a binomial distribution is given by equation (3.23), variance =
npq = 3 × 0.35 × 0.65 = 0.6825.
3. The standard deviation of a binomial distribution is the square root of the
variance, standard deviation = 0.8261 to 4 decimal places.

Useful Excel functions:

1. Binomial coefficients 𝐶𝑥𝑛 can be calculated using the Excel function =COMBIN(n,
x). For example, COMBIN(3, 0) = 1, COMBIN(3, 1) = 3, COMBIN(3, 2) = 3 and
COMBIN(3, 3) = 1.
2. Factorial values n! can be calculated using the Excel function =FACT(). For
example, FACT(0) = 1, FACT(1) = 1, FACT(2) = 2, FACT(3) = 6, FACT(4) = 24,
FACT(5) = 120, and so on.

Normal approximation to the binomial distribution

The normal distribution is generally considered to be a pretty good approximation for


the binomial distribution when np ≥ 5 and n(1 – p) ≥ 5. For values of p close to 0.5, the
number 5 on the right side of these inequalities may be reduced somewhat, while for
more extreme values of p (especially for p < 0.1 or p > 0.9) the value 5 may need to be
increased.

Check your understanding

X3.16 Evaluate the following: (a) 𝐶13 , (b) 𝐶310 , (c) 𝐶02 .

X3.17 A binomial model has n = 4 and p = 0.6.

a. Find the probabilities of each of the five possible outcomes (i.e. P(0), P(1), …, P(4)).
b. Construct a histogram of the data.

X3.18 Attendance at a cinema has been analysed and shows that audiences consist of
60% men and 40% women for a film. If a random sample of six people were
selected from the audience during a performance, find the following
probabilities: (a) all women are selected; (b) three men are selected; and (c) less
than three women are selected.

X3.19 A quality control system selects a sample of three items from a production line. If
one or more is defective, a second sample is taken (also of size 3), and if one or
more of these are defective then the whole production line is stopped. Given that
the probability of a defective item is 0.05, what is the probability that the second
sample is taken? What is the probability that the production line is stopped?

X3.20 Five people in seven voted in an election. If four of those on the roll are
interviewed, what is the probability that at least three voted?

X3.21 A small tourist resort has a weekend traffic problem and is considering whether
to provide emergency services to help mitigate the congestion that results from
an accident or breakdown. Past records show that the probability of a
breakdown or an accident on any given day of a four-day weekend is 0.25. The
cost to the community caused by congestion resulting from an accident or
breakdown is as follows: a weekend with 1 accident day costs £20,000, a
weekend with 2 accident days costs £30,000, a weekend with 3 accident days
costs £60,000, and a weekend with 4 accident days costs £125,000. As part of its
contingency planning, the resort needs to know:

a. The probability that a weekend will have no accidents


b. The probability that a weekend will have at least two accidents
c. The expected cost that the community will have to bear for an average
weekend period
d. Whether to accept a tender from a private firm for emergency services of
£20,000 for each weekend during the season.

Poisson probability distribution


In the previous section, we explored the concept of the binomial distribution, a discrete
probability distribution that enables the probability of achieving r successes from n
independent experiments to be calculated. Each experiment (or event) has two possible
outcomes (‘success’ or ‘failure’) and the probability of ‘success’ (p) is known.

The Poisson distribution is a discrete probability distribution that enables the


probability of x events occurring during a specified interval (time, distance, area, and
volume) to be calculated if the average occurrence is known and the events are
independent of the specified interval since the last event occurred. It has been usefully
employed to describe probability functions of phenomena such as product demand,
demand for service, numbers of accidents, numbers of traffic arrivals, and numbers of
defects in various types of lengths or objects.

Like the binomial, it is used to describe a discrete random variable. With the binomial
distribution, we have a sample of definite size and we know the number of successes
and failures. There are situations, however, when to ask how many 'failures' would not
make sense and/or the sample size is indeterminate. For example, if we watch a football
match, we can report the number of goals scored but we cannot say how many were not
scored. In such cases we are dealing with isolated cases in a continuum of space and
time, where the number of experiments (n) and the probability of success (p) and
failure (q) cannot be defined. What we can do is divide the interval (time, distance, area,
volume) into very small sections and calculate the mean number of occurrences in the
interval. This gives rise to the Poisson distribution defined by equation (3.24):

P(X = x) = λˣ e⁻λ / x!     (3.24)

Where:

• P(X = x) is the probability of x events occurring.
• The symbol x represents the number of occurrences of an event and can take any
value from 0 to ∞ (infinity).
• x! is the factorial of x, which can be calculated using the Excel function =FACT().
• λ (Greek letter lambda) is a positive real number that represents the expected
number of occurrences for a given interval. For example, if we found that we had
an average of 4 stitching errors in 1 metre of cloth, then for 2 metres of cloth we
would expect the average number of errors to be λ = 4 × 2 = 8.
• The symbol e represents the base of the natural logarithms (e = 2.71828…).

If we determine the mean and variance of a Poisson distribution, either using the
frequency distribution or the probability distribution, we will find that the relationship
is as given in equation (3.25):

E(X) = VAR(X) = λ     (3.25)

The characteristics of a Poisson distribution are:

1. The variance is equal to the mean.


2. Events are discrete and randomly distributed in time and space.
3. The mean number of events in each interval is constant.
4. Events are independent.
5. Two or more events cannot occur simultaneously.

Once it has been identified that the mean and variance have the same numerical value,
ensure that the other conditions above are satisfied, and this will indicate that the
sample data most likely follow the Poisson distribution.

Example 3.15

An ice cream parlour estimates that the footfall arriving in the parlour averages 16
customers per hour.

a) Use equation (3.24) to calculate the probability that thirteen customers will
arrive in one hour. Compare this answer to the Excel and SPSS solutions.
b) Use Excel and SPSS to create the Poisson probability distribution for this
distribution from X = 0 to X = 30. Comment on the shape of the distribution.

Answer a)

a) Use equation (3.24) to calculate the probability that thirteen customers will arrive in
one hour. Compare this answer to the Excel and SPSS solutions.

Equation (3.24) can be used to calculate the Poisson probability P(X = 13), given λ =
16 per hour and x = 13.

P(X = x) = λˣ e⁻λ / x!

P(X = 13) = 16¹³ e⁻¹⁶ / 13!

Use a calculator to show:

• 16¹³ = 4.503599627 × 10¹⁵
• 13! = 6227020800
• e⁻¹⁶ = 1.12535 × 10⁻⁷

Substituting these values gives

P(X = 13) = 0.081389

Probability that 13 customers will arrive per hour is 0.081389 or 8.1%.
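
An optional one-line Python cross-check with scipy gives the same probability.

from scipy.stats import poisson

print(poisson.pmf(13, 16))   # approximately 0.081389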

Excel solution

Figure 3.97 illustrates the Excel solution for P(X = 13).

Figure 3.97 Excel solution

SPSS solution

SPSS can be used to calculate P(X = 13)

Select Transform > Compute Variable


Target Variable: Example15a
Numeric expression = PDF.POISSON(13, 16)

Figure 3.98 Use compute variable to calculate P(X = 13)

Figure 3.99 SPSS solution P(X = 13) = 0.081389

Answer b)

b) Use Excel and SPSS to create the Poisson probability distribution from X = 0 to
X = 30. Comment on the shape of the distribution.

Excel solution

Thus, we can now determine the probability distribution using equation (3.24) as
illustrated in Figure 3.100.

Figure 3.100 Example 3.15 Excel solution

Figure 3.101 illustrates Poisson probability plot for the variation with P(X = x) with
individual values of X from 0 to 30. We can see from the graph that the distribution
shape looks quite symmetric and looks like a normal probability distribution.

Remember for the Poisson distribution:

Poisson mean = λ = 16

Poisson variance = λ = 16

Poisson standard deviation = √16 = 4

Figure 3.101 Poisson distribution with λ = 16 customers per hour

Normal approximation to the Poisson distribution

Furthermore, the shape of the Poisson distribution in Figure 3.101 suggests that, when λ
is sufficiently large, we can approximate a Poisson distribution with a normal
distribution with population mean µ = λ and population variance σ² = λ.

SPSS solution

Thus, we can now determine the probability distribution using equation (3.24) as
illustrated in Figure 3.102.

Enter data into SPSS

Figure 3.102 SPSS data file

Select Transform > Compute Variable


Target Variable: Prob
Numeric expression = PDF.POISSON(X, 16)

Figure 3.103 Use compute variable to calculate P(X = x)

Click Ok

SPSS output

The value will not be in the SPSS output file but in the SPSS data file in a column called
Prob.

Figure 3.104 SPSS solutions P(X = x) for X = 0 to 30

From this data we can ask SPSS to create a scatterplot for the P(X = x) and x as
illustrated in Figure 3.105.

Figure 3.105 Poisson distribution with λ = 16 customers per hour

What is the real difference between a binomial and a Poisson distribution, and could they be
used interchangeably? When students need to decide which distribution to use, they
often get confused between the two. The binomial distribution should be used when we
try to fit the distribution to n cases, each with a probability of success p. The Poisson

distribution is used when we have an infinite number of cases n, but there is a very, very
small (infinitesimally small) probability of success p.

To use mathematical language, if you start with a binomial distribution and let n
approach infinity (n → ∞) and p approach zero (p → 0) in such a way that np remains
constant, then the binomial distribution approaches a Poisson distribution with
parameter λ = np. In practical terms, if n is very large and you happen to know only the
mean rate of occurrence, you should use a Poisson distribution rather than a binomial.
In the case of a binomial distribution, you know the number of trials n and the
probability of success p for each trial, which is not the case for a Poisson variable.
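
This limiting behaviour is easy to see numerically. The Python sketch below (illustrative only; the values of n and p are chosen arbitrarily so that np = 16) shows that a binomial probability with a very large n and a very small p is almost indistinguishable from the corresponding Poisson probability.

from scipy.stats import binom, poisson

lam = 16
n = 100000
p = lam / n                     # tiny p, with np = 16 held fixed

print(binom.pmf(13, n, p))      # very close to ...
print(poisson.pmf(13, lam))     # ... 0.081389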

Check your understanding

X3.22 Calculate P(0), P(1), P(2), P(3), P(4), P(5), P(6) and P(>6) for a Poisson variable
with a mean of 1.2. Using this probability distribution determine the mean and
variance.

X3.23 In a machine shop the average number of machines out of operation is 2.


Assuming a Poisson distribution for machines out of operation, calculate the
probability that at any one time there will be: (a) exactly one machine out of
operation; and (b) more than one machine out of operation.

X3.24 A factory estimates that 0.25% of its production of small components is


defective. These are sold in packets of 200. Calculate the percentage of the
packets containing one or more defectives.

X3.25 The average number of faults in a metre of cloth produced by a particular


machine is 0.1.

a. What is the probability that a length of 4 metres is free from faults?


b. How long would a piece have to be before the probability that it contains no
flaws is less than 0.95?

X3.26 A garage has three cars available for daily hire. Calculate the following
probabilities if the variable is Poisson with a mean of 2 cars hired per day:

a. Find the probability that on a given day exactly 0, 1, 2, and 3 cars will be hired.
b. The charge to hire a car is £25 per day and the total outgoings per car,
irrespective of whether it is hired, are £5 per day. Determine the expected
daily profit from hiring these three cars.

X3.27 Accidents occur in a factory randomly and on average at the rate of 2.6 per
month. What is the probability that in each month (a) no accidents will occur,
and (b) more than one accident will occur?

Chapter summary
In this chapter, we introduced the concept of probability using the idea of relative frequency. We then defined key terms such as experiment and sample space, and described the relationship between a relative frequency distribution and a probability distribution. Further, we used the concept of relative frequency to introduce probability distributions and expectation. We then covered continuous and discrete probability distributions and provided examples to illustrate the different types of continuous (normal, Student's t, F and chi-square) and discrete (binomial, Poisson) distributions.

To start with, we introduced the normal continuous probability distribution, and we showed the probability density functions that define the shapes of the distribution and some of the associated parameters. We described how, by knowing the mean and the variance of a distribution, we can determine the probability that a certain value falls within a given range. The inverse, where, knowing the probability, we can work out the value of X, was also shown.

The standard normal distribution was introduced, and the concept of Z-value was
explained. We demonstrated how to calculate these values either from the statistical
table or using Excel/SPSS. Several examples illustrated applications of how Z-values and
the standard normal distribution can be used to answer business questions. And finally,
a very useful test for testing for normality of the data was introduced and explained.
Three further continuous distributions were introduced, namely, Student’s t
distribution, the F distribution, and the chi-square distribution. All have specific uses
when dealing with different types of samples and situations. Just like with the normal
distributions, we showed examples of how to calculate certain parameters, given that
we have some limited knowledge of every distribution.

Finally, we introduced two discrete distributions, the binomial and Poisson distributions. Just like continuous distributions, they are designed to handle different scenarios, except that they are specifically suited to discrete data. As before,
for each distribution we showed how to execute calculations based on certain
parameters that are known upfront. The material from this chapter will enable us to
explore, in the following chapter, the concept of data sampling from normal and non-
normal population distributions and to introduce the central limit theorem. In Chapters
4–5, we will apply the central limit theorem to provide point and interval estimates to
certain population parameters (mean, variance, proportion) based on sample
parameters (sample mean, sample variance, sample proportion).

Test your understanding


TU3.1 A gas boiler engineer is employed to assess customer gas boilers and to service,
repair or replace boilers if required. The engineer records the reasons for failure
as either an electrical fault, gas fault, or other fault. The current records collected
by the engineer involving either electrical or gas faults are as shown in Table
3.10.

                          Electrical fault
                          Yes      No
Gas fault      Yes        53       11
               No         23       13
Table 3.10 Type of faults

a. Calculate the probability that failure involves an electrical fault given that it
involves a gas fault.
b. Calculate the probability that failure involves a gas fault given that it involves
an electrical fault.

TU3.2 Bakers Ltd is currently in the process of reviewing the credit line available to
supermarkets which they have defined as a ‘good’ or ‘bad’ risk. Based upon a
£100,000 credit line, the profit is estimated to be £25,000 with a standard
deviation of £5,000. Calculate the probability that the profit is greater than
£28,000.

TU3.3 The quality controller at a sweet factory monitors a range of quality control
variables associated with the production of a range of chocolate bars. Based
upon current information, the probability that a chocolate bar is overweight is
0.02. At the end of a shift the quality controller samples 70 chocolate bars and
rejects the shift’s production if more than 4 bars are overweight. Calculate the
probability that he will reject the output of chocolate bars created by this shift. Is
there a problem with the production process if the factory manager considers
rejection rates over 2% to be an issue?

TU3.4 A local DIY store receives an average of 4 complaints per day from customers.
Calculate on a given day the following probabilities: (a) no complaints, (b)
exactly one complaint, (c) exactly two complaints, and (d) more than 2
complaints.

TU3.5 The number of bookings at a local gym can be modelled using a Poisson
distribution with mean value of 1.8 per hour. Karen works for 3 hours between
tea breaks and is surprised that there are no bookings during this 3-hour period.
What is the probability that no bookings occur during Karen’s shift? Should
Karen be surprised?

TU3.6 A dentist estimates that 23% of patients will require reappointments for dental
treatment. Over any given week the dentist handles 100 patients. What are the
mean and standard deviation for the number of subsequent reappointments?
Calculate the probability that the dentist will have to see more than 30
reappointments in any week.

TU3.7 The Skodel Ltd credit manager knows from experience that if the company
accepts a 'good risk' applicant for a £60,000 loan the profit will be £15,000. If it
accepts a 'bad risk' applicant, it will lose £6000. If it rejects a 'bad risk' applicant
nothing is gained or lost. If it rejects a 'good risk' applicant, it will lose £3000 in
good will.

a. Complete the profit and loss table (Table 3.11) for this situation.
DECISION
Accept Reject
Good
Type of Risk
Bad
Table 3.11 Profit and loss table

b. The credit manager assesses the probability that an applicant is a 'good risk' is
1/3 and a 'bad risk' is 2/3. What would be the expected profits for each of the
two decisions? Consequently, what decision should be taken for the applicant?

c. Another manager independently assesses the same applicant to be four times


as likely to be a bad risk as a good one. What should this manager decide?

d. Let the probability of being a good risk be x. What value of x would make the
company indifferent between accepting and rejecting an applicant?

TU3.8 A local hospital records the average rate of patient arrival at its accident and
emergency department during the weekend. Calculate the probability that a
patient arrives less than 23 seconds after the previous patient, if the average rate
is 0.45 patients per minute.

TU3.9 At a bottling plant the quantity of chutney that is bottled is expected to be


normally distributed with a population mean of 36 oz and a standard deviation
of 0.1 oz. Once every 30 minutes a bottle is selected from the production process
and its content measured. If the amount goes below 35.8 oz or above 36.2 oz,
then the bottling process will be out of control and the bottling stopped.

a. Calculate the probability that the process is out of control.


b. Calculate the probability that the number of bottles found out of control in
16 inspections will be zero.
c. Calculate the probability that the number of bottles found out of control in
16 inspections will be equal to 1.
d. Calculate the probability of the process being out of control if the analysis of
the historical data shows that the population mean and standard deviation
are actually 37 oz and 0.4 oz, respectively.

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A3Wa The probability laws – introduction to the concept of probability and


probability laws.
2. A3Wb Probability distributions and approximations – explores how certain
probability distributions can be used to approximate other probability
distributions and introduces two other probability distributions that have
business applications: the hypergeometric and exponential probability
distributions.
3. A3Wc Other useful probability distributions – introduces a few other
distributions, such as the hypergeometric discrete probability distribution and
binomial approximation, as well as the exponential distribution.

Chapter 4 Sampling distributions
4.1 Introduction and learning objectives
In Chapter 3 we introduced the concept of a probability distribution and learned about
two distinct types of distribution: continuous and discrete. This was followed by the
definition of the key probability distributions in the context of business applications: the
normal distribution, Student’s t distribution, F distribution, chi-square distribution,
binomial distribution, and finally the Poisson distribution. Furthermore, the online
resource includes information on probability distribution approximations and an
introduction to the exponential and hypergeometric probability distributions.

In this chapter we will see that we can take numerous samples from a population and
calculate their means, for example. The collection of these means will also create a
distribution, just like the actual variable values create a distribution. These distributions
of the mean, or proportion if this is the statistic that we are calculating, are called
sampling distributions. We will specifically investigate the sampling distribution of the
means and the sampling distribution of the proportions.

The chapter begins with simulation, during which we generate numerous samples and
calculate the mean values for every sample. We then show how these sample means are
distributed to form the sampling distribution. The concept that follows shows that it is
not necessary to draw numerous samples and that we can use some of the properties of
the sampling distribution to infer the properties of the population based on one single
sample.

The middle section of the chapter shows that the sampling distribution of the mean will
always follow the normal distribution, even if we used data from a non-normal
distribution to calculate multiple sample means. This unique property, called the central
limit theorem, is a useful tool that we can then use to estimate any population,
regardless of what distribution it follows, just based on the sample we took from this
distribution.

The chapter concludes by defining yet another sampling distribution, the sampling
distribution of the proportion. Principles similar to those for the sampling distribution
of the mean are used to infer the population proportions based on the sample
proportions.

Learning objectives

On completing this chapter, you will be able to:

1. Distinguish between the concept of the population and sample


2. Recognise different types of sampling – probability (simple, systematic, stratified,
cluster) and non-probability (judgement, quota, chunk, convenience)
3. Recognise reasons for sampling error – coverage error, non-response error,
sampling error, measurement error
4. Understand the concept of a sampling distribution: mean and proportion

5. Understand sampling from a normal population
6. Understand sampling from a non-normal population – central limit theorem
7. Estimate an appropriate sample size given the confidence interval
8. Solve problems using Microsoft Excel and IBM SPSS software packages.

4.2 Introduction to sampling


‘Sampling’ is one of the key words in the title of this chapter, and intuitively we
understand the word. In fact, we probably engage in sampling activities more often than
we think we do. A good example is browsing through the television channels to find a
programme we may wish to watch. Once we have seen a short sample of the
programme that is currently on, we decide to stay on that channel. This process is very
similar to the scientific approach of sampling. In our case, based on a few seconds of a
programme (a sample), we draw conclusions about the whole programme (a
population). In science, and practical research, the sample taken leads us to make
conclusions about the whole population. This methodology is called making an
inference about the population based upon the sample observations.

There is nothing rigorous or scientific about watching a sample of a TV programme and


making an inference, while in real life this inference must be based on some structured
rules. But why do we even bother with sampling? In our TV case, the reason is that we
do not have time to watch the whole programme and then decide that we will or will
not like it. We are looking for a shortcut. In real life, with various research questions, the
size of the population is such that it is impractical to measure all members of the
population. In this situation a proportion of the population would be selected from the
population of interest to the researcher. This proportion is achieved by sampling from
the population and the proportion selected is called the sample.

The first question we would like to illuminate is: what kinds of inference are we likely to
make once we start working as professionals?

Depending on your job there will be many research questions you will wish to answer
that involve populations that are too large to measure every member of the population.
How have the wages of German car workers changed over the past ten years? What are
the management practices of foreign exchange bankers working in Paris? How many
voters are planning to vote for a political party at a local election?

These are all relevant topics, but the second question is: what formal methods do we
use to collect the relevant data, and how do we go about making inference? This is really
the heart of this chapter. It will teach you the formal procedures and the shortcuts that
can be used to make inferences about the whole population.

Some form of a survey instrument is the most common way of conducting the sampling,
but could also be achieved by observation, archival record, or other methods. However,
no matter what method is used to collect the data, the purpose is typically to make
estimates of the population parameters. It is then crucial to determine how, and how
well, the data set can be used to generalise the findings from the sample to the
population. It is important to avoid data collection methods that maximise the
associated errors. A bad sample may well render findings misleading or meaningless.

Sampling in conjunction with survey research is one of the most popular approaches to
data collection in business research. The concept of random sampling provides the
foundational assumption that allows statistical hypothesis testing to be valid.

The primary aim of sampling is to select a sample from the population that has the same
characteristics as the population itself. For example, if the population average height of
grown men between the ages of 20 and 50 is 176 cm, then the sample average height
would also be expected to be 176 cm, unless we have sampling error (which will be
discussed later in this chapter). Ideally, sample and population values should agree. To
put it another way, we expect the sample to be representative of the population being
measured.

Before we describe the main sampling methods, we need to define the terminology we
will use in this and later chapters. A few statements hold true in general when dealing
with sampling:

• Samples are always drawn from a population.


• A sample should reflect the properties of the target population. Sometimes, for
reasons of practicality or convenience, the sampled population is more restricted
than the target population. In such cases, precautions must be taken to ensure
that the conclusions only refer to the sampled population.
• It is common practice to divide the population into parts that are called
sampling units. These units must cover the whole of the population and they
must not overlap. In other words, every element in the population must belong
to one and only one sampling unit. For example, in sampling the supermarket
spending habits of people living in a town, the unit may be an individual or
family or a group of individuals living in a postcode.
• To put the sampling units into what is called a sampling frame, is often one of the
major practical problems. The frame is a list that contains the population you
would like to measure. For example, market research firms will access local
authority census data to create a sample. The list of registered students may be
the sampling frame for a survey of the student body at a university. The practical
problems can arise in sampling frame bias. Telephone directories are often used
as sampling frames, for instance, but many people have unlisted numbers, or
perhaps they opted out of landlines and use mobile phones only.
• Samples can be collected using either probability sampling or non-probability
sampling.

Types of sampling

There are several different ways to create a sample. Samples can be divided into two
types: probability samples and non-probability samples.

Probability sampling

The idea behind probability sampling is random selection. More specifically, each
sample from the population of interest has a known probability of selection under a
given sampling scheme. There are five kinds of probability sampling: simple random

sampling; systematic random sampling; stratified random sampling; cluster sampling;
and multi-stage sampling.

Simple random sampling

Simple random sampling is the most widely known type of random sampling. Every
member of the population has the same probability of selection. A sample of size n from
a population of size N is selected and every possible sample of size n has equal chance of
being drawn.

Example 4.1

Consider the task that a cinema chain is facing when selecting a random sample of 300
viewers who visited a local cinema during a given period. The researcher notes that the
cinema chain would like to seek the views of its customers on a proposed refurbishment
of the cinema. The total number of viewers within this period is 8,000. With a
population of this size we could employ several ways of selecting an appropriate sample
of 300. For example, we could place 8,000 consecutively numbered pieces of paper (1,
…, 8,000) in a box, draw a number at random from the box, shake the box, and select
another number to maximise the chances of the second pick being random, continuing
the process until all 300 numbers are selected.

These numbers would then be used to select a viewer purchasing the cinema ticket,
with the customer chosen based on the number selected from the random process. To
maximise the chances that customers selected would agree to complete the survey we
could enter them into a prize draw. These 300 customers will form our sample, with
each number in the selection having the same probability of being chosen. When
collecting data via random sampling it is generally difficult to devise a selection scheme
to guarantee that we have a random sample. For example, the selection from a
population might not be from the total population that you wish to measure.
Alternatively, during the time interval when the survey is conducted, we may find that
the customers sampled may be unrepresentative of the population due to unforeseen
circumstances.

Systematic random sampling

With systematic random sampling, we create a list of every member of the population.
From this list, we will sample 200 numbers from the population of 8,000 number
values. From the list, we choose an initial sample value and then select every (N/n)th
sample value. The term (N/n) is called the sampling interval or sampling fraction (f).
This method involves choosing the nth element from a population list as follows:

1. Divide the number of cases in the population by the desired sample size
(f=N/n=8000/200=40).
2. Select a random number between 1 and the value found in step 1. For example,
we could pick the number 23 (x1=23). This tells us that the first number chosen
from the list is the 23rd number value.
3. We start with the sampling number value chosen in step 2 (the 23rd number) and the sampling fraction f = 8,000/200 = 40. This tells us to sample the 23rd number, 63rd number, 103rd number, 143rd number, and so on until we have chosen 200 sample points. The position of the nth sample value is given by the formula: nth sample value = first sample value + sampling fraction × (n − 1). Therefore, the 200th sample value would be the 23 + 40 × (200 − 1) = 7,983rd number value; a small spreadsheet sketch of this calculation is given below.
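A small spreadsheet sketch of this calculation (the cell references and layout are illustrative, assuming the first sample value, 23, is held in cell B1 and the sampling fraction, 40, in cell B2):

Cell A5: 1, cell A6: 2, and so on down to 200 – these are the sample numbers n.
Cell B5: =$B$1 + $B$2*(A5 - 1) returns the position in the population list of the nth sample value; copy this formula down alongside column A.

The last entry, for n = 200, should return 7,983, in agreement with the calculation above.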

Systematic sampling has some advantages compared to simple random sampling. It is easier to draw the sample from the population, and we also avoid potential bias and unintentional clustering of members of the population. The main disadvantage is that not every possible sample of size n is equally likely to be selected.

Stratified random sampling

The procedure for stratified random sampling is to divide the population into two or
more groups (subpopulations or strata). This could be by age, region, or some other
research area of interest. Each stratum must be mutually exclusive (i.e., non-
overlapping), and together the strata must include the entire population. A random
sample is then drawn from each group. As an example, suppose we conduct a national
survey in England. We might divide the population into groups (or strata) based on the
counties in England. Then we would randomly select from each group (or stratum). The
advantage of this method is that it guarantees that every group within the population is
selected and provides an opportunity to carry out group comparisons.

Stratified random sampling nearly always results in a smaller variance for the estimated
mean or other population parameters of interest. However, the main disadvantage of a
stratified sample is that it may be costlier to collect and process the data compared to a
simple random sample.

Two different categories of stratified random sampling are available, as follows:

• Proportionate stratification. With proportionate stratification, the sample size of


each stratum is proportionate to the population size of the stratum (same
sampling fraction). The method provides greater precision than for simple
random sampling with the same sample size, and this precision is better when
dealing with characteristics that are the same between strata.
• Disproportionate stratification. With disproportionate stratification, the
sampling fraction may vary from one stratum to the next. If differences are
explored in the characteristics being measured between strata, then
disproportionate stratification can provide better precision than proportionate
stratification. In general, given similar costs, you would always choose
proportionate stratification.

Example 4.2

Consider the task where we wish to sample the views of graduate job applicants to a
major financial institution. The nature of this survey is to collect data on the application
process from the applicants’ perspective. The survey will therefore have to collect the
views from the different specified groups within the identified population. For example,

this could be based on gender, race, type of employment requested (full- or part-time),
or whether an applicant is classified as disabled.

Remember that a simple random sample might fail to obtain a representative sample
from one of these groups. This could happen due, for example, to the size of the group
relative to the population. This is the reason why we would employ stratified random
sampling. We want to ensure that appropriate numbers of sample values are drawn
from each group in proportion to the percentage of the population.

Stratified sampling offers several advantages over simple random sampling: it guards
against an unrepresentative sample (e.g., all-male samples from a predominately female
population); it provides sufficient group data for separate group analysis; it requires a
smaller sample; and it can achieve greater precision compared to simple random
sampling for a sample of the same size.

Cluster sampling

Cluster sampling is a sampling technique in which the entire population of interest is


divided into groups, or clusters, and a random sample of these clusters is selected. Each
cluster must be mutually exclusive, and together the clusters must include the entire
population. Once clusters are chosen, then all data points within the chosen clusters are
selected. No data points from non-selected clusters are included in the sample. This
differs from stratified sampling, in which some data values are selected from each
group.

When all the data values within a cluster are selected, the technique is referred to as
‘one-stage cluster sampling’. If a subset of units is selected randomly from each selected
cluster, it is called ‘two-stage cluster sampling’. Cluster sampling can also be made in
three or more stages: it is then referred to as ‘multistage cluster sampling’. The main
reason for using cluster sampling is that it is usually much cheaper and more
convenient to sample the population in clusters rather than randomly. In some cases,
constructing a sampling frame that identifies every population element is too expensive
or impossible. Cluster sampling can also reduce costs when the population elements are
scattered over a wide area.

Multistage sampling

With multistage sampling (not to be confused with multistage cluster sampling), we


select a sample by using combinations of different sampling methods. For example, in
stage 1, we might use cluster sampling to choose clusters from a population. Then, in
stage 2, we might use simple random sampling to select a subset of elements from each
chosen cluster for the final sample.

Non-probability sampling

In many situations, it is not possible to select the kinds of probability samples used in
large-scale surveys. For example, we may be required to seek the views of local family-
run businesses that are experiencing financial difficulties. In this situation, there are no
easily accessible lists of businesses experiencing difficulties, or there may never be a list

created or available. The question of obtaining a sample in this situation is achievable
by using non-probability sampling methods. The two primary types of non-probability
sampling methods are convenience sampling and purposive sampling.

Convenience sampling

Convenience sampling is a method of choosing subjects who are available or easy to


find. This method is also sometimes referred to as haphazard, accidental, or availability
sampling. The primary advantage of the method is that it is very easy to carry out,
relative to other methods. Problems can occur with this survey method in that you can
never guarantee that the sample is representative of the population. Convenience
sampling is a popular method with researchers and provides some data that can
analysed, but the type of statistics that can be applied to the data is compromised by
uncertainties over the nature of the population that the survey data represent.

Example 4.3

When a student researcher is eager to begin conducting research with people as


subjects but may not have a large budget or the time and resources that would allow for
the creation of a large, random sample, they may choose to use the technique of
convenience sampling. This could mean stopping people as they enter and leave a
supermarket, or surveying other students, or others to whom the researcher has regular
access.

For example, suppose that a business researcher is interested in studying online


shopping habits among university students. The researcher is enrolled on a course and
decides to give out surveys during class for other students to complete and hand in. This
is an example of a convenience sample because the researcher is using subjects who are
convenient and readily available. In just a few minutes the researcher can conduct an
experiment with possibly a large research sample, given that introductory courses at
universities can have as many as several hundreds of students enrolled.

Purposive sampling

Purposive sampling is a sampling method in which elements are chosen based on the
purpose of the study. Purposive sampling may involve studying the entire population of
some limited group (the accounts department at a local engineering firm) or a subset of
a population (chartered accountants). As with other non-probability sampling methods,
purposive sampling does not produce a sample that is representative of a larger
population, but it can be exactly what is needed in some cases – a study of an
organisation, community, or some other clearly defined and relatively limited group.
Examples of two popular purposive sampling methods include: quota sampling and
snowball sampling.

Quota sampling

Quota sampling is designed to overcome the most obvious flaw of convenience


sampling. Rather than taking just anyone, you set quotas to ensure that the sample you
get represents certain characteristics in proportion to their prevalence in the

population. Note that for this method, you must know something about the
characteristics of the population ahead of time. There are two types of quota sampling:
proportional and non-proportional.

• In proportional quota sampling you want to represent the major characteristics


of the population by sampling a proportional amount of each. For instance, if you
know the population has 25% women and 75% men, and that you want a total
sample size of 400, you will continue sampling until you get those percentages
and then you will stop. So, if you've already got the 100 women for your sample,
but not the 300 men, you will continue to sample men; if legitimate women
respondents come along, you will not sample them because you have already
‘met your quota’. The primary problem with this form of sampling is that even
when we know that a quota sample is representative of the characteristics for
which quotas have been set, we have no way of knowing if the sample is
representative in terms of any other characteristics. If we set quotas for age, we
are likely to attain a sample with good representativeness on age, but one that
may not be very representative in terms of gender, education, or other pertinent
factors.
• In non-proportional quota sampling you specify the minimum number of
sampled data points you want in each category. In this case you are concerned
not with having the correct proportions but with achieving the numbers in each
category. This method is the non-probabilistic analogue of stratified random
sampling in that it is typically used to ensure that smaller groups are adequately
represented in your sample.

Finally, researchers often introduce bias when allowed autonomy in selecting


respondents, which is usually the case in this form of survey research. In choosing
males, interviewers are more likely to choose those who are better-dressed, seem more
approachable and less threatening. That may be understandable from a practical point
of view, but it introduces bias into research findings.

Snowball sampling

In snowball sampling, you begin by identifying someone who meets the criteria for
inclusion in your study. You then ask them to recommend others they may know who
also meet the criteria. Thus, the sample group appears to grow like a rolling snowball.
This sampling technique is often used in hidden populations which are difficult for
researchers to access, including firms with financial difficulties, or students struggling
with their studies. The method creates a sample with questionable representativeness,
and it can be difficult to judge how a sample compares to a larger population.
Furthermore, an issue arises in how respondents choose others to refer you to; for
example, friends will refer you to friends but are less likely to refer you to those they
don't consider as friends, for whatever reason. This creates a further bias within the
sample that makes it difficult to say anything about the population.

The primary difference between probability methods of sampling and non-probability


methods is that in the latter you do not know the likelihood that any element of a
population will be selected for study.

Types of errors

In this chapter we are concerned with sampling from populations using probability
sampling methods, which are the prerequisite for the application of statistical tests.
However, if we base our decisions on a sample, rather than the whole population, by
definition, we are going to make some errors. The concept of sampling implies,
therefore, that we’ll also have to deal with several types of errors, including sampling
error, coverage error, measurement error, and non-response error. Let us just briefly
define these terms:

• Sampling error is the calculated statistical imprecision due to surveying a


random sample instead of the entire population.
• The margin of error provides an estimate of how much the results of the sample
may differ due to chance when compared to what would have been found if the
entire population were interviewed.
• Coverage error is associated with the inability to contact portions of the
population. Telephone surveys usually exclude people who do not have access to
a landline telephone in their homes. They will also miss people who are not at
home (e.g., at work, in prison, or on holiday).
• Measurement error is error, or bias, that occurs when surveys do not measure
what they intended to measure. This type of error results from flaws in the
measuring instrument (e.g. question wording, question order, interviewer error,
timing, and question response options). This is the most common type of error
faced by the polling industry.
• Non-response error results from not being able to interview people who would
be eligible to take the survey. Many households use telephone answering
machines and caller identification that prevent easy contact, or people may
simply not want to respond to calls. Non-response bias is the difference in
responses of those people who complete the survey against those who refuse to
for any reason. While the error itself cannot be calculated, response rates can be
calculated by dividing the number of responses by the number invited to
respond.

Now that we understand what types of samples are possible, we will focus on the key type of sample: samples that have been randomly selected. We will explore the statistical techniques that can be applied to randomly selected data sets.

Check your understanding

X4.1 Compare random sampling errors and systematic (non-sampling) errors.


X4.2 Explain the reasons for taking a sample rather than a complete census.
X4.3 Name and describe the types of non-probability sampling.
X4.4 Name and describe the types of probability sampling.
X4.5 List the stages in the selection of a sample.

4.3 Sampling from a population
So, we said that when we wish to know something about a population it is usually
impractical, especially when considering large populations, to collect data from every
unit of that population. It is more efficient to collect data from a sample of the
population under study and then to make estimates of the population parameters from
the sample. Essentially, we make generalisations about a population based on a sample.

Population versus sample

To describe the difference between a population and a sample, we reiterate the two
terms as follows:

• A population is a complete set of counts or measurements derived from all objects possessing one or more common characteristics. For example, if you want to know the average height of the residents of London, then your population is the residents of London. Measures such as means and standard deviations derived from the population data are known as population parameters.
• A sample is a set of data collected by a defined procedure from a proportion of the population under study. Measures such as means and standard deviations derived from samples are known as sample statistics or estimates.

In this section we will explore the concept of taking a sample from a population and use
this sample to provide population estimates for the mean, standard deviation and
proportion.

The method of using samples to estimate population parameters is known as statistical


inference. Statistical inference draws upon the probability results discussed in previous
chapters. To distinguish between population parameters and sample statistics, very
often the symbols presented in Table 4.1 are used (the symbols µ, σ, π, ρ are the Greek
symbols ‘mu’, ‘sigma’, ‘pi’ and ‘rho’, respectively).

Parameter              Population     Sample
Size                   N              n
Mean                   µ              x̄
Standard deviation     σ              s
Proportion             π              ρ
Table 4.1 Symbols employed to differentiate between population and sample

One of the easiest ways to generate a random sample from a given probability distribution is to use Excel. Excel can be used to generate random samples from a range of probability distributions, including the normal, binomial, and Poisson distributions. To generate a random sample:

Select Data > Data Analysis > Random Number Generation (Figure 4.1)

Figure 4.1 Excel Data Analysis menu

Click OK

Provide the information required in the dialog box that appears (Figure 4.2):

• Input the number of variables (or samples)


• Input number of data values in each sample
• Select the distribution
• Input distribution parameters e.g. for normal: µ, σ.
• Decide on where the results should appear (Output range).

Figure 4.2 Excel random number generator


Click OK.
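If the Data Analysis ToolPak is not available, a worksheet-formula alternative is to combine =RAND() with the inverse of the required distribution (a sketch only; this is not the method used in the figures, and the parameter values are purely illustrative):

=NORM.INV(RAND(), 100, 15) returns a single random value drawn from a normal distribution with mean 100 and standard deviation 15.

Copying this formula into a block of cells produces a table of random values, although note that =RAND() recalculates every time the worksheet changes, so the values will not stay fixed unless they are pasted back as values.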

Example 4.4

Consider the problem of sampling from a population which consists of the salaries of
public sector employees employed by a national government. The historical data
suggest that the population data are normally distributed with mean £45,000 and
standard deviation £1,000. We can use Excel to generate N random samples, with each
sample containing n data values.

a. Create 10 columns of random values, each with 1000 data points.
b. Treating each row across the 10 columns as a sample of size 10, calculate the mean for each of the 1000 rows.
c. Plot the histogram representing the sampling distribution of the sample mean.

Excel solution

a. Generate N = 10 samples (variables) with n = 1000 data values.

Select Data > Data Analysis > Random Number Generation (Figure 4.3).

• Inputs:
• Number of Variables = 10
• Number of Random Numbers = 1000
• Distribution: Normal distribution
• Parameter Mean, µ = 45000
• Parameter Standard deviation, σ = 1000.
• Output range: cell B5.

Figure 4.3 Example 4.4 Excel random number generator

Click OK.

The N samples are in the columns of the table of values (sample 1, B5:B1004; sample 2,
C5:C1004; …; sample 10, K5:K1004).

b. Calculate the sample means.

Now, we can create 1000 sample means of size 10 by treating each row across the 10 columns as a sample of size 10. Figure 4.4 illustrates the first four rows and their sample means.

Figure 4.4 Calculate the first 4 row means (sample means in cells L5:L1004)

c. Create histogram bins and plot a histogram of sample means.

We note that from the spreadsheet the smallest and largest sample means are
43,917.62 and 46,045.67 , respectively. Based upon these two values, we then
determine the histogram bin range as 43,900 to 46,300 with step size 600 (43,900,
44,500, … , 46,300), as illustrated in Figure 4.5.

Figure 4.5 Creation of the histogram based upon the 1000 row means

To create the histogram, select Data > Data Analysis > Histogram and select
values as illustrated in Figure 4.6:

Input Range: L4:L1004


Bin Range: N13:N18
Click on Labels
Output Range: P9.

Figure 4.6 Excel histogram menu

Click OK.

Figures 4.7 and 4.8 illustrate the frequency distribution and corresponding
histogram.

Figure 4.7 Frequency distribution

Now, use Excel to create a histogram for the frequency against Bin as illustrated
in Figure 4.8. Highlight data cells N20:O25. Select Insert > Insert Column or Bar
Chart > 2-D Column (option 1). Now, edit the bar chart to remove bin value, add
title, axes titles, and reduce bar gap to zero to give the histogram illustrated in
Figure 4.8.

Figure 4.8 Histogram

The histogram shows a normal distribution for the sample means. The overall
mean value of all 1000 sample means is £45,005 with a standard deviation of
£329.

From the histogram we note that the histogram values are centred about the
population mean value of £45,000. If we repeated this exercise for different
values of sample size n, we would find that the range would reduce as the sample
sizes increase. If you don’t like the Excel method for generating a histogram
described above, then you could just plot frequency against the sample means as
illustrated in Figure 4.9.

Figure 4.9 Plot of frequency against the sample means

SPSS solution

We can use SPSS to re-create the sampling distribution for this example.

a. Create the variables X1, X2, …, X10.

Name the first column X1.

Enter a number (any) in the 1000th cell of the first column to define the variable
size (i.e., the size of the sample). If you have a problem with selecting the 1000th
value, then create the 1000 case data values in Excel and copy to the SPSS data
file as illustrated in Figure 4.10.

Figure 4.10 Enter ID values into SPSS

Now enter any number into the 1000th cell for X1

Figure 4.11 Add any number into ID = 1000

Now create the sample values in column X1

Transform > Compute Variable


Target Variable: X1.
Enter in Numeric Expression: RV.NORMAL(45000, 1000)

Figure 4.12 Use compute variable to randomly select from the normal population

Click OK.

SPSS will now carry out the calculation and store the result in the data file under
the column labelled X1 (the first 10 of the 1000 values are shown in Figure 4.13).

Figure 4.13 First 10 values for the first sample

Since we want 10 samples, we need to repeat this calculation for X2, …, X10. The
first 10 values for the first five samples are shown in Figure 4.14.

Figure 4.14 First 10 values for the first 5 samples

b. Calculate the row means to create 1000 sample means of size 10.

Now, calculate the mean values for each of the rows to create 1000 row sample
means of size 10.

Transform > Compute Variable


Target Variable: Xbar.
Enter in Numeric Expression: MEAN(X1,X2,X3,X4,X5,X6,X7,X8,X9,X10).

Figure 4.15 Calculate the row sample means

The first three average values are presented in Figure 4.16.

Figure 4.16 The first 3 row means
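For readers who prefer SPSS syntax to the dialog boxes, the two Compute Variable steps above can also be written as syntax (a sketch; pressing Paste in the dialogs produces equivalent commands):

COMPUTE X1 = RV.NORMAL(45000, 1000).
EXECUTE.
* Repeat for X2 to X10, then combine the ten columns into the row means.
COMPUTE Xbar = MEAN(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10).
EXECUTE.

Note that RV.NORMAL() draws a fresh random value for every case, so your numbers will differ from those shown in Figures 4.13 and 4.14.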

c. Calculate the overall Xbar mean and standard deviation and construct the histogram.

Analyze > Descriptive Statistics > Frequencies.

Transfer Xbar to the Variables box.


Uncheck Display frequency tables (ignore the warning).
Click on Charts and select Histogram.
Click Continue.
Click on Statistics and select Mean, Std. deviation.
Click Continue.
Click OK

Figure 4.17 SPSS solution representing the mean of the 1000 row means

Summary statistics are shown in Figure 4.17. The overall mean value of all 1000
sample means gives a mean of £45,007 with a standard deviation of £318. The
histogram (Figure 4.18) shows a fairly normal distribution for the sample means.

Figure 4.18 Histogram for the sampled data

From Excel, the overall sample average of the means is £45,005 with a standard
deviation of £329. From SPSS, the overall sample average of the means is £45,007 with a
standard deviation of £318. We observe that the average of the sample means from
Excel and SPSS is approximately equal to the population mean that we sampled from,
and which we know is £45,000. Furthermore, the standard deviation of these sample
means is £329 and £318 from Excel and SPSS respectively. What is important to observe
is that these values are much smaller than the standard deviation of the population we sampled from, which is equal to £1,000. Why the difference? This will be explained soon when we talk about the central limit theorem.

In this section we have generated several random samples and then calculated the mean
value for every random sample. This has enabled us to estimate the true mean of the
overall population. However, in practice we do not have to do that. The central limit
theorem provides us with a shortcut. What we noticed in the previous example was that
when we gather the mean values calculated from numerous random samples, they
begin to follow the normal distribution. In other words, the mean values from multiple
samples are creating a sampling distribution. We could have calculated any other
statistic for each one of these random samples, and we would see that they also form
their own sampling distributions.

This brings us to the key point. Any variable can be distributed in a number of different
ways, and many, though not all, of them follow the normal distribution. Each one of
these distributions that depict a variable is defined by certain parameters, such as the
mean and standard deviation. If we collected a large number of samples from that one
distribution and calculated their statistics (such as the mean or the standard deviation),
then these statistics will also create a distribution. We call this the sampling
distribution. The two sampling distributions that we will explore are the sampling
distribution of the mean and the sampling distribution of the proportion.

Sampling distribution of the mean

In this section we will continue to explore what is understood by the phrase sampling
distribution of the mean. Our Example 4.4 already illustrated that we can take random
samples from a population and calculate the mean value for every sample. If we
generate a large number of samples and have calculated the mean value for every
sample, then these mean values will also form a distribution. This is what we mean by
the sampling distribution of the mean. What is important here is that the mean of all the
sample means has some interesting properties. It is identical to the overall population
mean. A sample mean is called an unbiased estimator since the mean of all sample
means of size n selected from the population is equal to the population mean, µ.

Example 4.5

To illustrate this property, consider the problem of tossing a fair die. We know that the
die has 6 numbers (1, 2, 3, 4, 5, 6), with each number likely to have the same frequency
of occurrence. As an example, we can then take all possible samples of size 2 with
replacement from this population. Let us illustrate two important results of the sampling distribution of the sample means using this example. To refresh our memory,
the population mean, and population standard deviation are calculated using equations
(4.1) and (4.2), respectively:

µ = ∑X / n     (4.1)

σ = √(∑X²/n − µ²)     (4.2)

The sample mean is calculated exactly the same way as equation (4.1), but if we have
grouped data, then we can use equation (4.3):
X̄ = ∑fx / ∑f     (4.3)

If we take several samples and for every sample calculate a sample mean, we’ll end up
with a number of sample means. The mean of all these sample means (𝑋̿) is calculated
using equation (4.4):
X̿ = ∑fX̄ / ∑f     (4.4)

Can we calculate the standard deviation of the sample means around their mean? If you
understood the question (read it again!), the answer is: yes. The standard deviation of
the means around their central mean is called the standard error of the sample
means. Effectively, the standard error of the sample means measures the standard
deviation of all sample means from the overall mean. Equation (4.5) shows the formula
for the standard error (sometimes the second part of the phrase, ‘of the sample means’,
is dropped):
σX̄ = √(∑fX̄²/∑f − X̿²)     (4.5)

Equation (4.5) can be re-written as:

σx̄ = √(∑(x̄ − x̿)² / (n − 1))     (4.6)

Remember, this equation applies only if we have a very large population and draw
numerous random cases to calculate their respective means 𝑥̅ . From the population data
values (1, 2, 3, 4, 5, 6) we can calculate the population mean and standard deviation
using equations (4.1) and (4.2), together with Table 4.2.

Die value, X X2
1 1
2 4
3 9
4 16
5 25
6 36
N = 6
∑X = 21
∑X² = 91
Mean µ = 3.5
Population standard deviation σ = 1.7078
Table 4.2 Calculation table for population mean and standard deviation

Using Table 4.2:

µ = ∑X / n = 21/6 = 3.5

σ = √(∑X²/n − µ²) = √(91/6 − 3.5²) = 1.7078

Let us now take all possible samples of size 2 (n = 2) with replacement from this
population. From a population of size N = 6, there are 36 possible samples of size n = 2.
Table 4.3 illustrates the results with pairs such as (1, 2) and (2, 1) combined to give a
frequency of 2, for example.

ID    Value 1    Value 2    Sample pair mean, X̄    f    fX̄    fX̄²
1 1 1 1 1 1 1
2 1 2 1.5 2 3 4.5
3 1 3 2 2 4 8
4 1 4 2.5 2 5 12.5
5 1 5 3 2 6 18
6 1 6 3.5 2 7 24.5
7 2 2 2 1 2 4
8 2 3 2.5 2 5 12.5
9 2 4 3 2 6 18
10 2 5 3.5 2 7 24.5
11 2 6 4 2 8 32
12 3 3 3 1 3 9
13 3 4 3.5 2 7 24.5
14 3 5 4 2 8 32
15 3 6 4.5 2 9 40.5
16 4 4 4 1 4 16
17 4 5 4.5 2 9 40.5
18 4 6 5 2 10 50
19 5 5 5 1 5 25
20 5 6 5.5 2 11 60.5
21 6 6 6 1 6 36
Table 4.3 Calculation for sample mean and standard deviation

We can calculate the mean of these sample means and corresponding standard
deviation of the sample means using the Table 4.3 frequency distribution. For example,
for sample pair (2, 6) the sample mean is equal to 4. The pair (2, 6) occurs twice, given
we can have (2, 6) or (6, 2).

From Table 4.3, recalling from equation (4.4) that the mean of the sample means is denoted by X̿, and denoting the standard deviation of the sample means by σX̄, we have:

∑f = 36

∑fX̄ = 126

∑fX̄² = 493.5

X̿ = ∑fX̄ / ∑f = 126/36 = 3.5

σX̄ = √(∑fX̄²/∑f − X̿²) = √(493.5/36 − 3.5²) = 1.2076

The mean of the sample means is equal to 3.5. We already stated that the mean of the sample means is an unbiased estimator of the population mean, which means that:

X̿ = µ     (4.7)

The standard deviation of the sample means, σX̄, is equal to 1.2076. Recall, however, that the population standard deviation was 1.7078. If we take the value of σ = 1.7078 and divide it by √n (which is √2), we get 1.2076.

First, we see that the standard deviation of the sample means is not equal to the
population standard deviation (𝜎𝑋̅ < σ). In fact, the standard deviation of the sample
means (or, as we will call it from now on, the standard error of the sample means) is a
biased estimate of the population standard deviation. Secondly, it can be shown that
the relationship between sample and population is given by equation (4.8):
σX̄ = σ / √n     (4.8)

We have shown this in our example:

σ / √n = 1.7078 / √2 = 1.2076 = σX̄

Although the full name for equation (4.8) is the standard error of the sample means,
more often it is called just the standard error. From equation (4.8) we observe that as
n increases, the value of the standard error approaches zero (𝜎𝑋̅ → 0). In other words,
as n increases, the spread of the sample means decreases towards zero. This means that the sample mean X̄ is approaching the true value of µ. Make sure you understand the difference between the
standard error and the standard deviation:

1. The standard deviation measures how much individual values are scattered
around their mean (in either a sample or in the population).
2. The standard error measures how much sample means are scattered around the
overall mean, in other words, how representative is our sample mean when
compared to the true mean value.

Remember: the standard deviation is a descriptive statistic and the standard error is an
inferential statistic. The standard error of the mean is effectively the standard deviation
of a number of sample means around their overall mean.

Excel solution

Figures 4.19 and 4.20 illustrate the Excel solution.

Figure 4.19 Example 4.5 Excel solution

The formulae in Figure 4.19 use two different methods to calculate the mean and
standard deviation in Excel. One method is using Excel functions =AVERAGE() and
=STDEV(), and the other method shows manual calculations. Now let us look at all
possible samples of 2 as illustrated in Figure 4.20.

Figure 4.20 Example 4.5 Excel solution continued
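If you wish to reproduce the frequency-table calculations shown in Figure 4.20 yourself, a minimal worksheet sketch is given below (the cell references are illustrative and not necessarily those used in the figure). Assume the 21 distinct sample means X̄ are in cells D5:D25 and their frequencies f in cells E5:E25:

∑f: =SUM(E5:E25)
∑fX̄: =SUMPRODUCT(E5:E25, D5:D25)
∑fX̄²: =SUMPRODUCT(E5:E25, D5:D25^2)
Mean of the sample means: =SUMPRODUCT(E5:E25, D5:D25)/SUM(E5:E25), which should return 3.5.
Standard error: =SQRT(SUMPRODUCT(E5:E25, D5:D25^2)/SUM(E5:E25) - 3.5^2), which should return 1.2076.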

SPSS solution

No built-in SPSS solution.

Sampling from a normal population

If we select a random sample X from a population that is normally distributed with


population mean µ and standard deviation σ, then we can state this relationship using
the notation X ~ N(µ, σ²). Figure 4.21 shows the distribution of X.

Figure 4.21 Normal distribution X ~ N(µ, σ²)

If we choose a number of samples from a normal population, then we can show that the sample means are also normally distributed, with a mean of µ and a standard deviation of the sampling mean given by equation (4.8), where n is the sample size on which the sampling distribution was based. Figure 4.22 shows the distribution of X̄.

Figure 4.22 Normal distribution for sample means X̄ ~ N(µ, σ²/n)

Example 4.6

We return to Example 4.4 with 40,000 random samples from a population that is
assumed to be normally distributed with mean £45,000 and standard deviation
£10,000. The population values are based on 40,000 data points and the sampling
distribution is illustrated in Figure 4.23. We observe from Figure 4.23 that the
population data are approximately normal.

Figure 4.23 Shape of the histogram for the population data (40,000 values)

Figures 4.24-4.27 show the sampling distributions of the mean where n = 2, 5, 10, and
40 respectively.

From Figures 4.24–4.27 we observe that all the sampling distributions of the mean are
approximately normal, but the shape changes depending on the size n. We observe that
the sample means are less spread out about the mean as the sample sizes increase.

Figure 4.24 Shape of histogram when n = 2

Figure 4.25 Shape of histogram when n = 5

Figure 4.26 Shape of histogram when n = 10

Figure 4.27 Shape of histogram when n = 40

From these observations we conclude that if we sample from a population that is normally distributed with mean µ and standard deviation σ (X ~ N(µ, σ²)), then the sample means are also normally distributed with mean µ and standard deviation of the sample means σX̄ = σ/√n. This relationship is represented by equation (4.9):

X̄ ∼ N(µ, σ²/n), i.e. X̄ has mean µ and standard deviation σ/√n     (4.9)

Now that we know the sample mean is normally distributed, we can solve a range of
problems using the method that will be described in Chapters 5–6.

Before we proceed, let us remind ourselves of equation (3.6), which refers to the standardised values for the normal distribution. Just as equation (3.6) enabled us to convert all the values of X into Z, if we have a distribution of sample means X̄, we can use a similar equation to convert every value of X̄ to Z.

The standardised sample mean Z value is given by equation (4.10).


Z = (X̄ − µ) / σX̄ = (X̄ − µ) / (σ/√n)     (4.10)

Equation (3.6) shows how to convert any x value into a Z-score. However, in this chapter we are dealing with the distribution of the means; therefore, in equation (3.6), X becomes X̄, µ remains the same, and σ becomes σX̄, which is the standard error of the means, SE = σ/√n. This is how equation (3.6) becomes equation (4.10).

Why is this standardised sample mean Z value so important? Because this is the
shortcut that will enable us to make estimates about the population without a need to
draw numerous random samples to estimate the true parameters of the population. We
will be able to take one sample and make inference about the whole population.

Example 4.7

Diet X runs several weight reduction centres in a large town in the North East of
England. From historical data it was found that the weight of participants is normally
distributed with a mean of 180 pounds and a standard deviation of 30 pounds, X ~
N(180, 30²).

Calculate the probability that the average sample weight is greater than 189 pounds
when 25 participants are randomly selected for the sample.

Given X ~ N(µ, σ²) = N(180, 30²), calculate P(sample mean > 189) when we have a randomly chosen sample of size n = 25. From the central limit theorem, we have:

X̄ ~ N(µ, σ²/n) = N(180, 30²/25), with standard error σX̄ = 30/√25 = 6

The central limit theorem is a ‘rule’ which says that the means calculated from a large number of samples taken from a large population will be approximately normally distributed, even if the population itself is not normally distributed. It also implies that the mean of all the sample means is equal to the true mean of the population.

The problem requires us to find P(X̄ > 189). Figure 4.28 illustrates the region that represents this probability. Excel can be used to solve this problem using either the =NORM.DIST() or =NORM.S.DIST() function.

Figure 4.28 Shaded region represents P(𝑋̅ > 189)

From equation (4.10) we have:

Z = (X̄ − µ)/(σ/√n) = (189 − 180)/(30/√25) = 9/6 = +1.5

Note that what is 189 on the actual scale (X̄ = 189 pounds) becomes 1.5 (Z = 1.5) on the standardised scale:

P(X̄ > 189) = P(Z > +1.5)

From normal tables:

P(X̄ > 189) = P(Z > +1.5) = 0.06681

The probability that the sample mean is greater than 189 pounds is 0.06681 or 6.7%.

Excel solution

Figure 4.29 Example 4.7 Excel solution

The solution in Figure 4.29 shows two alternative ways of calculating Z and P in Excel. We have already described both Excel functions, =NORM.DIST() and =NORM.S.DIST(); make sure you do not confuse them. Both use the population mean (µ = 180), the population standard deviation (σ = 30), the sample size (n = 25), and the standard error of the sample mean (σX̄ = σ/√n = 30/√25 = 6).

Therefore, from Excel the probability that the sample mean is greater than 189 is
0.06680 or 6.7%.
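As a further check on the Excel and SPSS output, the Example 4.7 calculation can be reproduced in a few lines of Python; the sketch below is our own illustration and assumes the scipy library is available.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 180, 30, 25          # population mean, population sd, sample size (Example 4.7)
se = sigma / sqrt(n)                # standard error of the sample mean = 6

# P(X-bar > 189) via the standardised value, equation (4.10)
z = (189 - mu) / se                 # +1.5
p = norm.sf(z)                      # sf = 1 - cdf, the upper-tail probability

# Equivalent call working directly on the sampling distribution of the mean
p_direct = norm.sf(189, loc=mu, scale=se)

print(round(z, 2), round(p, 5), round(p_direct, 5))   # 1.5 0.06681 0.06681
```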

SPSS solution

Select Transform > Compute Variable


Target Variable: EXAMPLE7
Enter in Numeric Expression: 1-CDF.NORMAL(189, 180, 6).

Figure 4.30 Use compute variable to calculate P(𝑋̅ > 189)

Click OK.

SPSS will now carry out the calculation and store the result in the data file under
column labelled EXAMPLE7.

Figure 4.31 SPSS solution P(𝑋̅ > 189) = 0.066807

This result agrees with the Excel solution shown in Figure 4.29.

Example 4.8

A population of weights is normally distributed with a mean of 163 pounds and a standard deviation of 35 pounds. Calculate the probability that the sample mean lies between 156 and 165 pounds when a random sample of 38 participants is selected.

We have X ~ N(µ, σ²) = N(163, 35²). We require P(156 ≤ X̄ ≤ 165) when we have a randomly chosen sample of size n = 38. From the central limit theorem, we have:

X̄ ~ N(µ, σ²/n) = N(163, 35²/38)

We are given the population mean (µ = 163), the population standard deviation (σ = 35), the sample size (n = 38), and the standard error of the sample mean (σX̄ = σ/√n = 35/√38 = 5.6777). Figure 4.32 illustrates the region that represents this probability. Again, Excel can be used to solve this problem using either the =NORM.DIST() or =NORM.S.DIST() function.

Figure 4.32 Shaded region represents P(156 ≤ 𝑋̅ ≤ 165)

From equation (4.10) we have:

Z1 = (X̄1 − µ)/(σ/√n) = (156 − 163)/(35/√38) = −1.2329

Z2 = (X̄2 − µ)/(σ/√n) = (165 − 163)/(35/√38) = +0.3523

P(156 ≤ 𝑋̅ ≤ 165) = 𝑃(− 1.2329 ≤ 𝑍 ≤ +0.3523)

From normal critical tables:

P(156 ≤ 𝑋̅ ≤ 165) = 𝑃(− 1.2329 ≤ 𝑍 ≤ +0.3523)

P(156 ≤ X̄ ≤ 165) = 1 − P(Z ≥ +1.2329) − P(Z ≥ +0.3523)

P(156 ≤ 𝑋̅ ≤ 165) = 1 − 0.10935 − 0.36317

P(156 ≤ 𝑋̅ ≤ 165) = 0.52748 or 52.8%

The probability that the sample mean is between 156 and 165 lbs is 0.52748 or 53%.

Excel solution

Figure 4.33 illustrates the Excel solution using the standard formula and function
methods.

Figure 4.33 Example 4.8 Excel solution

P(156 ≤ 𝑋̅ ≤ 165) = 0.5289 or 52.9%

The probability that the sample mean is between 156 and 165 lbs is 0.5289 or 53%.
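A similar programmatic check of this interval probability, again assuming scipy is available, is sketched below.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 163, 35, 38           # Example 4.8 values
se = sigma / sqrt(n)                 # approximately 5.6777

# P(156 <= X-bar <= 165) = CDF(165) - CDF(156) on the sampling distribution of the mean
p = norm.cdf(165, loc=mu, scale=se) - norm.cdf(156, loc=mu, scale=se)
print(round(se, 4), round(p, 4))     # 5.6777 0.5289
```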

SPSS solution

Select Transform > Compute Variable


Name Target Variable: example8
Enter in Numeric Expression: CDF.NORMAL(165, 163, 5.6777)-
CDF.NORMAL(156, 163, 5.6777).

Figure 4.34 Use compute variable to calculate P(156 ≤ 𝑋̅ ≤ 165)

Click OK

SPSS will now carry out the calculation and store the result in the data file under
column labelled example8.

Figure 4.35 SPSS solution P(156 ≤ 𝑋̅ ≤ 165) = 0.528869 or 52.9%

This result agrees with the Excel solution illustrated in Figure 4.33 and the manual
solution.

Sampling from a non-normal population

In the previous section, we sampled from a population which is normally distributed, and we stated that the sample means will be normally distributed with mean µ and standard error of the mean σX̄. What if the data do not come from a normal distribution? It can be shown that if we select a random sample from a non-normal distribution, then the sample mean will still be approximately normal with mean µ and standard deviation σX̄, provided the sample size is sufficiently large.

In most cases the value of n should be at least 30 for non-symmetric distributions and at
least 20 for symmetric distributions, before we apply this approximation. This
relationship is already represented by equation (4.9).

This leads to an important concept in statistics that is known as the central limit
theorem, which we briefly mentioned above. The central limit theorem provides us with
a shortcut to the information required for constructing a sampling distribution. By
applying the theorem, we can obtain the descriptive values for a sampling distribution,
usually the mean and the standard error, computed from the sampling variance.

We can also obtain probabilities associated with any of the sample means in the
sampling distribution. The central limit theorem states that if you have a population
with mean μ and standard deviation σ and take sufficiently large random samples from
the population with replacement, then the distribution of the sample means will be
approximately normally distributed. As we just stated, this will hold true regardless of
whether the source population is normal or skewed, provided the sample size is
sufficiently large (usually n > 30).

If the population is normal, then the theorem holds true even for samples smaller than
30. This means that we can use the normal probability model to quantify uncertainty
when making inferences about a population mean based on the sample mean. The
central limit theorem also implies that certain distributions can be approximated by the
normal distribution, for example:

1. Student’s t distribution, t(df), is approximately normal with mean 0 and variance 1
   when the degrees of freedom df is large.
2. The chi-square distribution, χ²(k), is approximately normal with mean k and
   variance 2k, for large k.
3. The binomial distribution, B(n, p), is approximately normal with mean np and
   variance np(1 – p) for large n and for p not too close to 0 or 1.
4. The Poisson distribution, Po(λ), with mean value λ is approximately normal with
   mean λ and variance λ for large values of λ.

Example 4.9

Consider the sampling of 38 electrical components from a production run where, historically, a component’s average lifetime was found to be 990 hours with a standard deviation of 150 hours. The population data are right-skewed and therefore not normally distributed. Calculate the probability that a randomly chosen sample mean is less than 995 hours.

A population variable X has a mean µ = 990 and standard deviation σ = 150 (right-skewed). We need to calculate P(X̄ < 995) when we have a randomly chosen sample of size n = 38.

The population distribution is right-skewed, but the sample size is greater than 30, so
we can use the central limit theorem to state:

X̄ ~ N(µ, σ²/n) = N(990, 150²/38)

The problem requires us to find 𝑃(𝑋̅ < 995).

Figure 4.36 illustrates the region to be found that represents this probability. Excel can
be used to solve this problem by using either the =NORM.DIST() or =NORM.S.DIST()
function.

Figure 4.36 Shaded region represents 𝑃(𝑋̅ < 995) = 0.5814.

We are given the population mean (µ = 990), the population standard deviation (σ = 150), and the sample size (n = 38); the standard error is:

σX̄ = σ/√n = 150/√38 = 24.33321317

From equation (4.10) we have:

Z = (X̄ − µ)/(σ/√n) = (995 − 990)/(150/√38) = +0.2054

From normal tables:

𝑃(𝑋̅ < 995) = 𝑃(𝑍 < +0.2054)

𝑃(𝑋̅ < 995) = 1 − 𝑃(𝑍 > +0.2054)

𝑃(𝑋̅ < 995) = 1 − 0.41683

𝑃(𝑋̅ < 995) = 0.58317

Based on a random sample, the probability that the sample mean is less than 995 hours
is 0.58317 or 58%.

Excel solution

Figure 4.37 illustrates the Excel solution.

Figure 4.37 Example 4.9 Excel solution

Based on a random sample, the probability that the sample mean is less than 995 hours
is 0.58140 or 58%.

SPSS solution

Select Transform > Compute Variable


Name Target Variable: Example9
Enter in Numeric Expression: CDF.NORMAL(995, 990, 24.33321317).

Figure 4.38 Use compute variable to calculate P(𝑋̅ < 995)

Click OK.

SPSS will now carry out the calculation and store the result in the data file under
column labelled Example9.

Figure 4.39 SPSS solution P(𝑋̅ < 995) = 0.581402

The result agrees with the Excel solution shown in Figure 4.37 and the manual solution.

Sampling without replacement

In the previous cases we assumed that sampling takes place with replacement (or from a very large or infinite population). If there is no replacement, then equation (4.8) has to be modified by a finite population correction factor to give equation (4.11):

σX̄ = (σ/√n) × √((N − n)/(N − 1))     (4.11)

Where N is the size of the population and n is the size of the sample.
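To make the correction factor explicit, a small helper function is sketched below (our own illustration, not from the textbook files); the printed value reproduces the corrected standard error used in Example 4.10.

```python
import math

def corrected_standard_error(sigma, n, N):
    """Equation (4.11): standard error with the finite population correction factor."""
    fpc = math.sqrt((N - n) / (N - 1))        # correction for sampling without replacement
    return (sigma / math.sqrt(n)) * fpc

# Values from Example 4.10: sigma = 12, sample n = 38, population N = 98
print(round(corrected_standard_error(12, 38, 98), 4))   # about 1.531
```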

Example 4.10

A random sample of 38 part-time employees is chosen without replacement from a firm employing 98 part-time workers. The mean number of hours worked per month is 45, with a standard deviation of 12.

Determine the probability that the sample mean: (a) will lie between 45 and 48 hours; (b) will be over 47 hours.

In this example we have a finite population of size N (= 98) and a sample size n (= 38). From equation (4.11) we can calculate the standard error of the sampling mean and then use Excel to calculate the two probability values. Since the sample size (n = 38) is sufficiently large, we will apply the central limit theorem to the problem and assume that the distribution of the sample means is approximately normal, as given by equation (4.8):

X̄ ~ N(µ, σ²/n)

a. The problem requires us to find P(45 ≤ X̄ ≤ 48).

Figure 4.40 illustrates the region that represents this probability. Excel can be used to solve this problem using either the =NORM.DIST() or =NORM.S.DIST() function.

Figure 4.40 Shaded region represents P(45 ≤ X̄ ≤ 48)

Calculate P(45 ≤ X̄ ≤ 48).

Given the population size (N = 98), the population mean (µ = 45), the population standard deviation (σ = 12) and the sample size (n = 38), the corrected standard error is:

σX̄ = (σ/√n) × √((N − n)/(N − 1)) = (12/√38) × √((98 − 38)/(98 − 1)) = 1.5310

From equation (4.10) we have:

Z = (X̄ − µ)/σX̄ = (48 − 45)/1.5310 = +1.9595

P(45 ≤ X̄ ≤ 48) = P(0 ≤ Z ≤ 1.9595)

P(45 ≤ X̄ ≤ 48) = 0.5 − P(Z ≥ 1.9595)

From critical normal tables:

P(45 ≤ X̄ ≤ 48) = 0.5 − 0.02500

P(45 ≤ X̄ ≤ 48) = 0.475

Based upon a random sample, the probability that the sample mean lies between 45 and 48 is 0.475 or 47.5%.

Excel solution

Figure 4.41 illustrates the Excel solution.

Figure 4.41 Example 4.10a Excel solution

Both methods provided the same answer to the problem of calculating the required
probability.

SPSS solution

Select Transform > Compute Variable


Name Target Variable: Example10a
Enter in Numeric Expression: CDF.NORMAL(48, 45, 1.5310) - CDF.NORMAL(45, 45, 1.5310).

Figure 4.42 Use compute variable to calculate P(45 ≤ X̄ ≤ 48)

Click OK.

SPSS will now carry out the calculation and store the result in the data file under
column labelled Example10a

Figure 4.43 SPSS solution P(45 ≤ X̄ ≤ 48) = 0.475

This result agrees with the Excel solution illustrated in Figure 4.41 and the manual solution which uses the critical normal tables.

b. The problem requires us to find P(X̄ > 47).

Given the population mean (µ = 45), the population standard deviation (σ = 12), the population size (N = 98), and the sample size (n = 38).

Using the finite population correction factor, the corrected standard error given by equation (4.11) is:

σX̄ = (σ/√n) × √((N − n)/(N − 1)) = (12/√38) × √((98 − 38)/(98 − 1)) = 1.5310

Figure 4.44 shows the region that represents the probability P(X̄ > 47).

Figure 4.44 Shaded region represents P(X̄ > 47)

From equation (4.10) we have:

Z = (X̄ − µ)/σX̄ = (47 − 45)/1.5310 = +1.3063

From critical normal tables:

P(X̄ > 47) = P(Z > +1.3063) = 0.09510

The probability that the sample mean is greater than 47 is 0.09510 or 9.5%.

Excel solution

The Excel solution is illustrated in Figure 4.45.

Figure 4.45 Example 4.10b Excel solution

From Excel, the probability that the sample mean is greater than 47 is 0.0957 or 9.6%,
which agrees with the manual solution using critical normal tables.

SPSS solution

Select Transform > Compute Variable


Name Target Variable: Example10b
Enter in Numeric Expression: 1-CDF.NORMAL(47, 45, 1.5310)

Figure 4.46 Use compute variable to calculate P(X̄ > 47)

Click OK

SPSS will now carry out the calculation and store the result in the data file under the column labelled Example10b.

Figure 4.47 SPSS solution P(X̄ > 47) = 0.095719

This result agrees with the Excel solution shown in Figure 4.45 and the manual solution
using critical normal tables.

Sampling distribution of the proportion

Consider the case where a variable has two possible values, ‘yes’ or ‘no’, and we are
interested in the proportion of people who choose ‘yes’ or ‘no’ in some survey. An
example for this scenario could be a measure of responses of shoppers in deciding
whether to purchase product A. From historical data it is found that 38% of people
surveyed preferred product A. We would define this as the estimated population
proportion, π, who prefer product A.

If we then took a random sample from this population, it would be unlikely that exactly
38% would choose product A, but, given sampling error, it is likely that this proportion
could be slightly less or slightly more than 38%. If we continued to sample proportions
from this population, then each sample would have an individual sample proportion
value, which when placed together, would form the sampling distribution of the
sample proportion who choose product A.

The sampling distribution of the sample proportion is approximated using the binomial
distribution, given that the binomial distribution represents the distribution of r
successes (choosing product A) from n trials (or selections). Remember, the binomial
distribution is the distribution of the total number of successes, whereas the
distribution of the population proportion is the distribution of the mean number of
successes.

Although we use the binomial distribution to approximate the sampling distribution of the sample proportions, they are not identical. Given that the mean is the total divided by the sample size, n, the sampling distribution of the proportion is somewhat different from the binomial distribution. Why? The sample proportion is effectively the mean of the scores, whereas the binomial distribution deals with the total number of successes.

Let us work out the details. We know from equation (3.22) that the mean of a binomial
distribution is given by the equation µ = nπ, where π represents the population
proportion. If we divide this expression by n, then this equation gives equation (4.12).
Equation (4.12) represents the unbiased estimator of the mean of the sampling
distribution of the proportion.

𝜇𝜌 = 𝜋 (4.12)

Equation (3.23) represents the variance of the binomial distribution, which, when scaled by the sample size n, leads to equation (4.13). Equation (4.13) represents the standard deviation of the sampling proportion (or standard error), σρ, where π represents the population proportion:

σρ = √(π(1 − π)/n)     (4.13)

From equations (4.12) and (4.13), the sampling distribution of the proportion is approximated by a binomial distribution with mean (µρ) and standard deviation (σρ). Furthermore, the sampling distribution of the sample proportion (ρ) can be approximated with a normal distribution when the probability of success is approximately 0.5, and nπ and n(1 − π) are both at least 5:

ρ ~ N(π, π(1 − π)/n)     (4.14)

The standardised sample mean Z value is given by modifying equation (4.10) to give equation (4.15):

Z = (ρ − π)/σρ = (ρ − π)/√(π(1 − π)/n)     (4.15)

Example 4.11

It is known that 32% of workers in a factory own a personal computer. Find the
probability that at least 40% of a random sample of 38 workers will own a personal
computer.

In this example, we have the population proportion π = 0.32 and sample size n = 38. The problem requires us to find P(ρ ≥ 0.4). From equation (4.13) the standard error for the sampling distribution of the proportion is:

σρ = √(π(1 − π)/n) = √(0.32 × (1 − 0.32)/38) = 0.075672424

Substituting this value into equation (4.15) gives the standardised Z value:

Z = (ρ − π)/σρ = (0.4 − 0.32)/0.075672 = 1.057

P(ρ ≥ 0.4) = 𝑃(𝑍 ≥ 1.057)

Figure 4.48 illustrates the area representing P(ρ ≥ 0.4).

Figure 4.48 Shaded region represents P(ρ ≥ 0.4)

From normal tables:

P(ρ ≥ 0.4) = 𝑃(𝑍 ≥ 1.057) = 0.14457

The probability that at least 40% of a random sample of 38 workers will own a personal
computer is 14.5%.
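The same calculation can be checked in Python (a minimal sketch of our own, assuming scipy is available).

```python
from math import sqrt
from scipy.stats import norm

pi, n = 0.32, 38                                   # population proportion and sample size (Example 4.11)
se_rho = sqrt(pi * (1 - pi) / n)                   # equation (4.13)

z = (0.40 - pi) / se_rho                           # equation (4.15)
p = norm.sf(0.40, loc=pi, scale=se_rho)            # P(rho >= 0.40)

print(round(se_rho, 4), round(z, 3), round(p, 4))  # 0.0757 1.057 0.1452
```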

Excel solution

Figure 4.49 illustrates the Excel solution.

Figure 4.49 Example 4.11 Excel solution

From Excel, the probability that at least 40% of a random sample of 38 workers will
own a personal computer is 14.5% which agrees with the manual solution.

SPSS solution

Select Transform > Compute Variable


Name Target Variable: Example11
Enter in Numeric Expression: 1-CDF.NORMAL(0.4, 0.32, 0.075672424)

Figure 4.50 Use compute variable to calculate P(ρ ≥ 0.4)

Click OK.

SPSS will now carry out the calculation and store the result in the data file under
column labelled Example11

Figure 4.51 SPSS solution P(ρ ≥ 0.4) = 0.145213

This result agrees with the Excel solution shown in Figure 4.49 and the manual solution.

Check your understanding

X4.6 Five people have made insurance claims for the amounts shown in Table 4.4. A
sample of two people is to be taken at random, with replacement, from the five.
Derive the sampling distribution of the mean and prove: (a) 𝑥̿ = 𝜇, and (b)
𝜎𝑥̅ = 𝜎/√𝑛.

Person 1 2 3 4 5
Insurance claim (£) 500 400 900 1000 1200
Table 4.4 Insurance claims (£)

X4.7 If X is a normal random variable with mean 10 and standard deviation 2, that is,
X ~ N(10, 4), define and compare the sampling distribution of the mean for
samples of size: (a) 2, (b) 4, (c) 16.

X4.8 If X is any random variable with mean 63 and standard deviation 10, define and
compare the sampling distribution of the mean for samples of size: (a) 40, (b) 60,
and (c) 100.

X4.9 Use Excel to generate a random sample of 100 observations from a normal
distribution with mean 10 and a standard deviation 4. Calculate the sample mean
and standard deviation. Why are these values different from the population
values?

Chapter summary
In this chapter we have introduced the important statistical concept of sampling and the
concept of the central limit theorem. This enabled us to conclude that the sampling
distribution can be approximated by the normal distribution.

We have shown how the central limit theorem can eliminate the need to construct a
sampling distribution by examining all possible samples that might be drawn from a
population. The central limit theorem allows us to determine the sampling distribution
by using the population mean and variance values or estimates of these obtained from a
sample.

Furthermore, we established that the sample mean provides an unbiased estimate of the population mean, whereas the sample variance (or standard deviation) is a biased estimate of the population variance (or standard deviation). We also introduced the standard error of the mean, which measures how closely a sample mean is likely to estimate the population mean.

We have shown that as the sample size increases, the standard error decreases.
However, any advantage quickly vanishes as improvements in standard error tend to be
smaller as the sample size gets larger. The next chapter will use these results to
introduce the idea of calculating point and interval estimates for the population mean
(and proportion) based on sample data and the assumption that the underlying x
variable is normally distributed.

Test your understanding


TU4.1 Is the sampling distribution of the sample means dependent on the underlying
population distribution? For what values of sample size would the central limit
theorem apply?

TU4.2 The central limit theorem allows a sampling distribution of the sample means to
be developed for large sample size even if the shape of the population
distribution is unknown:

a. What happens to the value of the standard deviation of the sampling means if
the sample size increases?
b. What happens to the sampling distribution of the sample means as the sample
size increases from n = 15, n = 23, n = 30, and n > 30?
c. What value of the sample size should we use to randomly select samples such
that the central limit theorem can be applied even when the shape of
population distribution is unknown?

TU4.3 A series of samples are to be taken from a population with a population mean of
100 and population variance of 67. The central limit theorem applies when the
sample size is at least 30. If a random sample of size 34 is taken then calculate:
(a) the mean and variance of the sampling distribution of the sample means; (b)
the probability that a sample mean is greater than 103; and (c) the probability that a
sample mean lies between 98.4 and 103.2.

TU4.4 Current airline rules assume that the average weight for male passengers
is 88 kg, with no information provided on the shape of the population
distribution. If we assume that the population standard deviation is 14 kg,
describe the sampling distribution for the sample means if the sample size is
large. Calculate the probability that a sample mean is greater than 96 kg when
we randomly sample 38 passengers.

TU4.5 A population variable has an unknown shape but known mean and standard
deviation of 74 and 18, respectively. If we randomly choose a sample of size 48,
calculate the probability that the sample mean: (a) is greater than 69, (b) is less
than 69, (c) is between 69 and 73, and (d) is greater than 76.

TU4.6 According to a recent UK Office for National Statistics report, the average weekly
household spending was £528.90 during 2016, with a standard deviation of £90
and a population distribution that is positively skewed. What is the probability
that the household sample mean spend is greater than £534.87 if we randomly
sample 56 households?

TU4.7 A research assistant collects data on the time spent in minutes by gym users on a
new piece of gym equipment. The assistant randomly selects a sample of 57 from
the population where the population mean is 18 minutes with a standard
deviation of 1.8 minutes.

a. What is the shape of the sampling distribution of the sample means?


b. Is your answer to part (a) dependent upon the shape of the population
distribution?
c. Calculate the mean and standard deviation of the sampling means.
d. Calculate the probability that a sample mean is greater than 18.4 minutes.

TU4.8 A university academic spends on average 6 minutes answering each student


email. The shape of the population distribution is unknown, but the standard
deviation is known to be 2.2 minutes. The academic decides to calculate the
sample mean and wishes to know the shape of the sampling distribution of the
sample means and to calculate a series of probabilities associated with the
sample means.

a. If we select a small random sample can we apply the central limit theorem
(say n < 30)?
b. For what values of sample size can we then apply the central limit theorem?
c. What does the central limit theorem tell us about the sampling distribution of
the sample means?
d. What is the value of the mean of the sampling means?
e. What is the value of the standard error of the sampling means if we select a
random sample of size 67?

f. Calculate the probability that the sample mean is greater than 6.6 minutes
when n = 67.
g. Calculate the probability that the sample mean lies between 6.2 and 6.8
minutes when n = 67.

TU4.9 A local supermarket employs a staff member at the counter who deals with the
collection and returns of goods purchased online via the supermarket e-
commerce website. The time spent dealing with each customer has a population
mean of 4.5 minutes with a standard deviation of 0.35 minutes. The shape of the
population distribution is unknown.

a. What is the shape of the sampling distribution of the sample means if the
supermarket randomly selects a sample of size 45?
b. What is the name of the theorem stated in your answer to part (a)?
c. What is the value of the mean of the sampling means if the central limit
theorem applies?
d. What is the value of the standard error of the sampling means if the central
limit theorem applies?
e. Calculate the probability that the sample mean time will be greater than 4.6
minutes?
f. Calculate the probability that the sample mean time will lie between 4.55
and 4.7 minutes?

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. AW4 Use SPSS to demonstrate the Central Limit Theorem

Chapter 5 Point and interval estimates
5.1 Introduction and learning objectives
Chapter 4 enabled us to understand that the sampling distributions provide very useful
clues about the population parameters, and we specifically looked at the mean and
proportions. Effectively, this chapter takes the sampling distribution of the sample
mean and proportion that we investigated in the previous chapter to the next level. We
will learn how to use the sample mean (or proportion) to provide an estimate of the
population value.

We will first define two different types of estimates, which are a point estimate and
interval estimate. We will then define the criteria for a good estimator. From there, we
will learn how to make some of the specific point estimates of the population mean,
variance and proportion. This will be followed by exploring interval estimates for the
same population statistics.

We will clarify how interval estimates are related to confidence intervals and conclude
the chapter by providing specifics of the procedure for calculating the interval estimate
of the population mean (µ) and proportion (π), provided that σ is known and the sample is larger than 30 observations. Then we will do the same for the situation where σ is unknown and the sample is smaller than 30 observations.

We will complete the chapter with some practical advice on how to calculate the sample
size. Students might find this advice useful not only when designing surveys for their
dissertations, but also later in professional life.

Learning objectives

On completing this chapter, you will be able to:

1. Calculate point estimates for population parameters (mean, proportion, variance)


for one and two samples
2. Calculate sampling errors and confidence intervals when the population standard
deviation is known or unknown (z and t tests) for one and two samples
3. Determine sample sizes
4. Solve problems using Microsoft Excel and IBM SPSS.

An important skill in business statistics is to be able to analyse sample data such that we
can infer the population statistics. The simplest method is to calculate the sample mean
and infer that this is equal to the value of the population mean. If we have collected
sufficient sample data to minimise the chance of sampling error, then we would expect
the sample mean to be an unbiased estimator of the population mean. Because no
estimate can be 100% reliable, you would want to know how confident you can be in
your estimate and whether to act on it from a business decision-making perspective.

In statistics, a confidence interval gives the probability that an estimated range of


possible values includes the actual value being estimated. For example, you may decide
that a printing shop can print 2000 A4 pages per day. Because the printing shop cannot
be expected to print exactly 2000 A4 pages on each day, a confidence interval can be
created to give a range of possibilities. You may state that there is a 95% chance that the
shop prints between 1800 and 2200 A4 pages per day. The confidence interval is 95%,
and the probability that the actual number of pages printed is outside this estimated
range is 5%. You can also think of this 5% figure as a risk factor. In other words, there is
a 5% risk that actual number of pages is not between 1800 and 2200.

The contents of this chapter will be useful to you if you need to make an estimate based
on sample information. This happens almost every day. You might be in business and
need to decide, based on research data, whether to launch a new product. If this is the
case, the content of this chapter will help you frame your decision in a compelling way.
You might be in manufacturing and deciding, based on a scrap sample, whether you have a quality problem in your process. Again, the content of this chapter will
provide the basis for such decision-making. You might be in public service and need to
decide, based on local information, if the funding on a broader level is appropriate. The
methods covered in this chapter are essential in this case.

5.2 Point estimates


In the previous chapter we explored the sampling distribution of the mean and
proportion. We showed that these distributions can be normal with population
parameters µ and 2. Because the distribution of the means (or proportions) is
effectively normal, this will enable us to make estimates of the unknown population
mean (or proportion), based on the sample mean (or proportion). This can be phrased
as: our objective is to determine the value of a population parameter based on a sample
statistic. As we will see shortly, this population parameter is determined with a degree
of probability, which is called the level of confidence. We reiterate, the method
described in this section is dependent upon the sampling distribution being
(approximately) normally distributed.

There are two estimates of the population value that we can make: either a point
estimate or an interval estimate. Some procedures provide a single value (called point
estimates), while others provide a range of values (called interval estimates). Figure 5.1
illustrates the relationship between the point and interval estimates for a population
mean .

Figure 5.1 Point and interval estimate

Suppose that you want to find the mean weight of all football players who play in a local
football league. Due to practical constraints you are unable to measure all the players,
but you can select a sample of 45 players at random and weigh them to provide a
sample mean. From your survey, this sample mean is calculated as 78 kg.

From Chapter 4, we know that the sampling distribution of the mean is approximately
normally distributed for large sample sizes. We also know that the sample mean is an
unbiased estimator of the population mean. Because of these two facts, you can treat
this number of 78 kg also as the point estimate of the population mean.

If we know, or can estimate, the population standard deviation (σ), then we can apply
equation (4.9) to provide an interval estimate for the population mean based upon
some degree of error between the sample and population means. This interval estimate
is called the confidence interval for the population mean. This approach is also valid
for the confidence interval for the population proportion if we are measuring
proportions.

Performing a parametric statistical analysis requires the justification of several


assumptions. These assumptions are as follows:

1. Data are collected randomly and independently.


2. Data are collected from normally distributed populations.
3. Variances are equal when comparing two or more groups.

The word ‘parametric’ means that sample data come from a known distribution (often a
normal distribution, though not necessarily) whose parameters are fixed. There are
many ways to check the assumption of normality (covered in subsequent chapters):

1. Normality can be checked by plotting a histogram to observe if the shape looks


normal, or we can undertake some statistical analysis and calculate the
Kolmogorov–Smirnov test statistic.
2. Equal variance can be checked by undertaking Levene’s test for equality of
variance.
3. Hypothesis tests involving means and regression analysis can be run (more
about that in the chapters that follow).

If the data collected are not normally distributed, or the equality of variance assumption
is violated, there are alternative ways to make inferences or carry out hypothesis tests.
These include carrying out an equivalent nonparametric test which does not assume
that the distribution is normally distributed (or that the distribution parameters are
fixed). These so-called nonparametric tests measure data that are not at the scale/ratio
level but are of ordinal or categorical form (we shall explore some of them in Chapter 8).

An alternative method that can be used when the parametric assumptions have been
violated is to use Excel or IBM SPSS to implement a method called bootstrapping.
Bootstrapping is a numerical sampling technique where the data are sampled with
replacement. This means that you require an initial sample and then use this sample to
generate more samples. In this way you obtain many samples from the original sample
which can then be used to calculate descriptive statistics, such as the mean, median and
variance.

Bootstrapping can be used to create many resamples, and bootstrapping statistics allow
the researcher to analyse any distribution and carry out hypothesis tests (covered in the
next Chapter).

As the objective of this chapter is to estimate certain statistics, we might ask ourselves
what constitutes a good estimator. A good estimator should be unbiased, consistent,
and efficient:

1. An unbiased estimator of a population parameter is an estimator whose


expected value is equal to that parameter. As we already know from the previous
chapter, the sample mean 𝑥̅ is an unbiased estimator of the population mean, µ.
In other words, the expected value of the sample mean equals the population
mean, as expressed in equation (5.1):

E(X̄) = μ     (5.1)

2. An unbiased estimator is said to be a consistent estimator if the difference


between the estimator and the parameter grows smaller as the sample size
grows larger. The sample mean 𝑥̅ is a consistent estimator of the population
mean, µ, with variance given by equation (5.2):

σ2
̅) =
VAR(X (5.2)
n

As n grows larger, the variance of the sample mean grows smaller.

3. If there are two unbiased estimators of a parameter, the one whose variance is
smaller is said to be the more efficient estimator. For example, both the sample
mean, and median are unbiased estimators of the population mean. Which one
should we use? If the sample median has a greater variance than the sample
mean, we choose the sample mean since it is more efficient.

Point estimate of the population mean and variance

A point estimator draws an inference about a population parameter by estimating the


value of an unknown parameter given as a single point or data value. For example, the
sample mean is the best estimator of the population mean. It is unbiased, consistent,
and the most efficient estimator as long as the sample was drawn from a normal
population. If it was not drawn from a normal population but the sample was
sufficiently large, then the central limit theorem states that the sampling distribution
can be approximated by a normal distribution, so it is still unbiased, consistent and the
most efficient estimator.

To summarise, the sample mean represents a point estimate of the population mean, µ, and this relationship is defined by equation (5.3):

μ = X̄     (5.3)

Applying the central limit theorem, we would expect the point estimator to get closer
and closer to the true population value as the sample size increases. The degree of error
is not reflected by the point estimator. However, we can employ the concept of the
interval estimator to put a probability on the value of the population parameter lying
between two values, with the middle value being the point estimator. In Section 5.3, we
will discuss the concept of an interval estimate or confidence interval.

In statistics, the standard deviation is often estimated from a random sample drawn
from a population.

In Chapter 4, we showed, via a simple example, that the sampling distribution of the
means gives the following rules:

1. The mean of the sample means is an unbiased estimator of the population mean. In other words, the expected value of the sample means equals the population mean (E(X̄) = μ).
2. The sample variance is a biased estimator of the population variance. In other words, the expected value of the (uncorrected) sample variance is not equal to the population variance (E(s²) ≠ σ²).

As the sample mean is an unbiased estimator, it is obvious that estimating the


population mean from the sample mean is straightforward. Let us just consider the
variance and standard deviation.

Because the sample variance is biased, the bias can be corrected using Bessel’s
correction. This corrects the bias in the estimation of the population variance, and some,
but not all, of the bias in the estimation of the population standard deviation. The Bessel
correction factor is given by expression (5.4):

n/(n − 1)     (5.4)

The relationship between the unbiased sample variance (s2) and the biased sample
variance (s_b²) is given by equation (5.5):

s² = s_b² × n/(n − 1)     (5.5)

From equation (5.5), the unbiased sample variance is given by equation (5.6):

s² = Σ (Xi − X̄)² / (n − 1), summing over i = 1 to n     (5.6)

The Excel function to calculate an unbiased estimate of the population variance (s²) is =VAR.S(). From equation (5.6), the unbiased sample standard deviation is given by equation (5.7):

s = √[ Σ (Xi − X̄)² / (n − 1) ]     (5.7)

The Excel function to calculate an unbiased estimate of the population standard


deviation (s) is =STDEV.S().

Why n – 1 and not n in equations (5.6) and (5.7)?

This is because the population variance (σ²) is estimated from the sample mean (X̄) and
from the deviations of each measurement from the sample mean. If, for example, we
lacked any one of these measurements (the sample mean or a single deviation value),
we could still calculate it from the rest of the data. So, with n measurements (data
points) only n – 1 of them are free variables in the calculation of the sample variance.

This means that a missing observation can be found from the other n – 1 observations
and the sample mean. The term n – 1 is called the degrees of freedom. Also note that if
you used n rather than n-1, you would effectively underestimate the true population
variance or standard deviation.
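In software this choice of denominator appears as an explicit option. The numpy sketch below (our own illustration using arbitrary data values) contrasts the biased divisor n with the unbiased divisor n − 1, mirroring Excel's =VAR.P() and =VAR.S() functions.

```python
import numpy as np

x = np.array([5.1, 4.9, 5.3, 5.0, 4.8])      # arbitrary illustrative sample

biased_var = x.var(ddof=0)      # divides by n, like Excel's =VAR.P()
unbiased_var = x.var(ddof=1)    # divides by n - 1 (Bessel's correction), like =VAR.S()

print(round(biased_var, 4), round(unbiased_var, 4))
print(round(biased_var * len(x) / (len(x) - 1), 4))   # equation (5.5) recovers the unbiased value
```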

Unfortunately, it can be shown mathematically that not all the bias is removed when
using n – 1 in the equation rather than n. Fortunately, however, the amount of bias is
negligible. Despite this negligible bias, we can safely assume that equation (5.7) is an
unbiased estimator of the population standard deviation.

From equation (5.7), we have n – 1 degrees of freedom. The larger the sample size, n,
the smaller the correction involved in using the degrees of freedom (n – 1). For example,
Table 5.1 compares the value of 1/n and 1/(n – 1) for different sample sizes n = 15, 25,
30, 40, 50, 100, …, 10000. We conclude that very little difference exists between 1/n and
1/(n – 1) for large n.

n 1/n 1/(n - 1) % difference


15 0.0667 0.0714 0.0667
25 0.0400 0.0417 0.0400
30 0.0333 0.0345 0.0333
40 0.0250 0.0256 0.0250
50 0.0200 0.0204 0.0200
100 0.0100 0.0101 0.0100
1000 0.0010 0.0010 0.0010
10000 0.0001 0.0001 0.0001
Table 5.1 Difference between 1/n and 1/(n-1) as n increases

Similarly, the bias in the sample standard deviation is negligible when n – 1 is used
instead of n in the denominator. For example, for a normally distributed variable, the
unbiased estimator of the population standard deviation (𝜎̂) can be shown to be given
by equation (5.8):

σ̂ = s × (1 + 1/(4(n − 1)))     (5.8)

In equation (5.8) s is the sample standard deviation and n is the number of observations
in the sample.

Table 5.2 explores the degree of error between the unbiased estimator of the population
standard deviation and the sample standard deviation. The table shows that when the
sample size is 4 the underestimate is 8.33% and when the sample size is 30 the
underestimate is 0.86%. The difference between the two values quickly reduces with
increasing sample size.

n= 4 10 20 30 40 50 100 1000 10000
Error = 0.083 0.028 0.013 0.009 0.006 0.005 0.003 0.000 0.000
% error = 8.333 2.778 1.316 0.862 0.641 0.510 0.253 0.025 0.003
Table 5.2 Degree of error as n increases

Now we know how to calculate the unbiased standard deviation (s) for a sample by
using equation (5.7), this value of s can be used to estimate the standard deviation of the
population. This is done via the standard error of the mean.

We already know that the standard error of the sample means is defined by equation (4.8), so we repeat it here for convenience, replacing σ with s, the standard deviation of the sample:

σX̄ = s/√n     (5.9)

This value 𝜎𝑋̅ , or the standard error of the sample mean (which is effectively the
standard deviation of the sampling means), will become the key statistic that will enable
us to make estimates of the population parameters, based on the sample parameters.

Example 5.1

A manufacturer takes 25 measurements of the weight of a product (Kgs), with the measurement results presented in Table 5.3.

Sample data
5.02 5.32 5.14 5.12 4.97
5.11 4.89 5.23 4.92 4.86
5.06 4.97 4.84 5.03 4.79
5.30 4.75 4.85 5.27 4.56
4.46 4.93 5.29 4.82 5.30
Table 5.3 Weight of product (Kgs)

Calculate an unbiased estimate of the mean, and the standard deviation and standard
error of your estimate of the mean.

The unbiased estimates of the population mean, standard deviation, and standard error
of the mean are provided by solving equations (5.3), (5.7), and (5.9).

a. Calculate the sample statistics (sample size, sample mean, sample standard
deviation)

Sample data, X (X-Xbar)^2


5.02 0.000784
5.11 0.013924
5.06 0.004624
5.30 0.094864
4.46 0.283024
5.32 0.107584

4.89 0.010404
4.97 0.000484
4.75 0.058564
4.93 0.003844
5.14 0.021904
5.23 0.056644
4.84 0.023104
4.85 0.020164
5.29 0.088804
5.12 0.016384
4.92 0.005184
5.03 0.001444
5.27 0.077284
4.82 0.029584
4.97 0.000484
4.86 0.017424
4.79 0.040804
4.56 0.186624
5.30 0.094864
Table 5.4 Calculation of the summary statistics

Summary statistics:

Sample size, n = 25

Sample mean, X̄:

X̄ = (5.02 + 5.11 + … + 4.56 + 5.30) / 25 = 124.8 / 25 = 4.9920 Kgs

Sample variance, s²:

s² = Σ (Xi − X̄)² / (n − 1) = (0.000784 + 0.013924 + … + 0.186624 + 0.094864) / (25 − 1) = 1.2588 / 24 = 0.0525 Kgs²

Sample standard deviation, s

𝑠 = √𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒

𝑠 = √0.0525

s = 0.2290 Kgs

b. Population estimates

Unbiased estimate of the population mean, µ̂, is:

µ̂ = X̄ = 4.9920 Kgs

Unbiased estimate of the population standard deviation, σ̂, is:

σ̂ = s = 0.2290 Kgs

Unbiased estimate of the standard deviation of the sampling means (the standard error, σX̄) is:

σ̂X̄ = σ̂/√n = s/√n = 0.2290/√25 = 0.0458 Kgs

We will show shortly how this standard error is used for interval estimates.
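The same three estimates can also be reproduced programmatically. The Python sketch below is our own check (assuming numpy is available) and uses the 25 weights from Table 5.3.

```python
import numpy as np

# The 25 weights from Table 5.3 (Kgs)
weights = np.array([
    5.02, 5.32, 5.14, 5.12, 4.97,
    5.11, 4.89, 5.23, 4.92, 4.86,
    5.06, 4.97, 4.84, 5.03, 4.79,
    5.30, 4.75, 4.85, 5.27, 4.56,
    4.46, 4.93, 5.29, 4.82, 5.30,
])

mean = weights.mean()                # point estimate of mu, equation (5.3)
s = weights.std(ddof=1)              # unbiased sample standard deviation, equation (5.7)
se = s / np.sqrt(weights.size)       # standard error of the mean, equation (5.9)

print(round(mean, 4), round(s, 4), round(se, 4))   # 4.992 0.229 0.0458
```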

Excel solution

Method 1. Formula method – equations (5.3), (5.7) and (5.9)

Figures 5.2 and 5.3 illustrate the formula methods to calculate the required summary
statistics.

Figure 5.2 Example 5.1 Excel solution

Figure 5.3 Example 5.1 Excel solution continued

From Excel, the population estimates are:

1. Estimate of the population mean is 4.9920 Kgs


2. Estimate of the population standard deviation is 0.2290 Kgs
3. Estimate of the standard error is 0.0458 Kgs

Method 2 Excel function method

Figure 5.4 illustrates the formula method to calculate the required summary statistics.

Figure 5.4 Example 5.1 Excel function solution continued

SPSS Solution

SPSS solution using SPSS Frequencies

Input data into SPSS.

Figure 5.5 Example 5.1 SPSS data

Frequencies method

Select Analyze > Descriptive Statistics > Frequencies

Figure 5.6 SPSS Frequencies

Transfer Data into the Variable(s) box.

Switch off Display frequency tables (ignore warning).

Figure 5.7 SPSS frequencies menu

Click on Statistics.

Choose Mean, Std.deviation, S.E. mean

Figure 5.8 SPSS frequencies statistics option

Click Continue.

Figure 5.9 SPSS frequencies menu

Click OK

SPSS output

Figure 5.10 SPSS frequencies solution

The SPSS solutions presented in Figure 5.10 agree with the manual solution and Excel
solutions presented in Figures 5.3 and 5.4.

Alternatively, you could use the SPSS Descriptives and Explore menus to provide the
results.

Descriptives method

Select Analyze > Descriptive Statistics > Descriptives.


Transfer Data into the Variable(s) box.
Click on Options.
Choose Mean, Std.deviation, S.E. mean.
Click Continue.
Click OK.

SPSS output

Figure 5.11 SPSS descriptives solution

Explore method

Select Analyze > Descriptive Statistics > Explore.


Transfer Data into the Dependent List box.
Click on Statistics.
Choose Descriptives.
Click Continue.
Click OK.

SPSS output

Figure 5.12 SPSS explore solution

Observe that the Explore method provides a range of descriptive statistics and not just
the statistics that the first two SPSS methods provided.

What are the implications of these statistics that we have just calculated? We calculated
that the average weight is 4.9920 Kgs. The standard error of 0.0458 Kgs will help us, as
we will demonstrate shortly, to estimate how ‘close’ we are to the true mean weight of all the products that we manufacture.

Point estimate of the population proportion and variance

In the previous section we provided the equations to calculate the point estimate for
the population mean based upon the sample data. A similar principle can be used if
we have a sample proportion and we want to provide the point estimate of the
population proportion.

Equations (5.10) and (5.11) provide unbiased estimates of the population proportion (π̂) and standard error (σρ), where ρ is the sample proportion.

Estimate of the population proportion:

π̂ = ρ     (5.10)

Estimate of the population standard deviation:

σ̂ρ = √(ρ(1 − ρ)/n)     (5.11)

This means we find the sample proportion ρ and from there we estimate the population proportion π̂ and the population standard deviation σ̂ρ.

Example 5.2

A local call centre has randomly sampled 30 staff from a total of 143 to ascertain if the
staff are in favour of moving to a new working shift pattern. The results of the survey
are presented in Table 5.5. Provide a point estimate of the population proportion of total
workers who disagree with the new shift pattern and give an estimate for the standard
error of your estimate.

ID    Staff outcome (A - Agree, D - Disagree)
1 D
2 D
3 A
4 A
5 A
6 A
7 A
8 D
9 D
10 A
11 D
12 A
13 A
14 A
15 A
16 D
17 D
18 D
19 D
20 A
21 A
22 A
23 D
24 A
25 A
26 D
27 A
28 D
29 A
30 D
Table 5.5 Call centre staff survey results

Unbiased estimates of the population proportion and standard error of the proportion
are found by solving equations (5.10) and (5.11).

Unbiased estimate of the population proportion

π̂ = ρ = 13/30 = 0.4333

Unbiased estimate of the standard error of the proportion:

σ̂ρ = √(ρ(1 − ρ)/n) = √(0.4333 × (1 − 0.4333)/30) = 0.0905

Estimate of the population proportion is 0.4333 with an estimate of the standard error
equal to 0.0905.
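The equivalent calculation in Python (a minimal sketch of our own, not from the textbook files) applies equations (5.10) and (5.11) directly.

```python
from math import sqrt

n = 30                 # sample size
disagree = 13          # number of staff who disagree (from Table 5.5)

rho = disagree / n                         # sample proportion, equation (5.10)
se_rho = sqrt(rho * (1 - rho) / n)         # standard error, equation (5.11)

print(round(rho, 4), round(se_rho, 4))     # 0.4333 0.0905
```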

Excel solution

Figures 5.13 and 5.14 show the Excel solution.

Figure 5.13 Example 5.2 Excel solution

Figure 5.14 Example 5.2 Excel solution continued

From Excel, the estimates of the population proportion and standard error are 0.43
(43%) and 0.0905 (9.05%), respectively.

SPSS solution

You can calculate the proportions by using the SPSS Frequency menu.

Input data into SPSS

Figure 5.15 Example 5.2 SPSS data

Select Analyze > Descriptive Statistics

Figure 5.16 Frequencies menu

Select Frequencies.

Transfer Outcome into Variable(s) box

Figure 5.17 SPSS frequencies menu

Click OK

SPSS output

The output is shown in Figure 5.18.

Figure 5.18 SPSS frequencies solution

We observe that the estimate of the population proportion who disagree is equal to the sample proportion who disagree, which is 43.3% or 0.433. As before, we may now use equation (5.11) to estimate the value of the standard error of the proportion:

σ̂ρ = √(ρ(1 − ρ)/n) = √(0.4333 × (1 − 0.4333)/30) = 0.0905

Again, we will shortly demonstrate how this standard error of the proportion is used to provide interval estimates.

Pooled estimates

If more than one sample is taken from a population then the resulting sample statistics
can be combined to provide pooled estimates for the population mean, variance and
proportion. If we have two samples of sizes n1 and n2, then the estimate of the
population mean is provided by the pooled sample mean:

X̄ = (n1X̄1 + n2X̄2) / (n1 + n2)     (5.12)

The estimate of the population variance is provided by the pooled sample variance:

σ̂² = (n1s1² + n2s2²) / (n1 + n2 − 2)     (5.13)

The estimate of the population proportion is provided by the pooled sample proportion:

π̂ = (n1π̂1 + n2π̂2) / (n1 + n2) = (n1ρ1 + n2ρ2) / (n1 + n2)     (5.14)
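A compact Python sketch of equations (5.12)–(5.14) is given below; it is our own illustration, and the two small samples are invented purely to show the calculations.

```python
def pooled_mean(n1, m1, n2, m2):
    """Equation (5.12): pooled sample mean."""
    return (n1 * m1 + n2 * m2) / (n1 + n2)

def pooled_variance(n1, s1_sq, n2, s2_sq):
    """Equation (5.13): pooled estimate of the population variance."""
    return (n1 * s1_sq + n2 * s2_sq) / (n1 + n2 - 2)

def pooled_proportion(n1, p1, n2, p2):
    """Equation (5.14): pooled sample proportion."""
    return (n1 * p1 + n2 * p2) / (n1 + n2)

# Invented example: sample 1 has n=20, mean=50, variance=4; sample 2 has n=30, mean=53, variance=5
print(pooled_mean(20, 50, 30, 53))             # 51.8
print(pooled_variance(20, 4, 30, 5))           # about 4.79
print(pooled_proportion(20, 0.40, 30, 0.50))   # 0.46
```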

Check your understanding

X5.1 A random sample of five values was taken from a population: 8.1, 6.5, 4.9, 7.3,
and 5.9. Estimate the population mean and standard deviation, and the standard
error of the estimate for the population mean.

X5.2 The mean of 10 readings of a variable was 8.7 with standard deviation 0.3. A
further five readings were taken: 8.6, 8.5, 8.8, 8.7, and 8.9. Estimate the mean and
standard deviation of the set of possible readings using all the data available.

X5.3 Two samples are drawn from the same population as follows: sample 1 (0.4, 0.2,
0.2, 0.4, 0.3, and 0.3) and sample 2 (0.2, 0.2, 0.1, 0.4, 0.2, 0.3, and 0.1). Determine
the best unbiased estimates of the population mean and variance.

X5.4 A random sample of 100 rods from a population were measured and found to
have a mean length of 12.132 with standard deviation 0.11. A further sample of
50 is taken. Find the probability that the mean of this sample will be between
12.12 and 12.14.

X5.5 A random sample of 20 children in a large school were asked a question and 12
answered correctly. Estimate the proportion of children in the school who would
answer correctly and the standard error of this estimate.

X5.6 A random sample of 500 fish is taken from a lake and marked. After a suitable
interval, a second sample of 500 is taken and 25 of these are found to be marked.
By considering the second sample, estimate the number of fish in the lake.

5.3 Interval estimates
As we know by now, if we take just one sample from a population, we can use the
sample statistics to estimate a population parameter. Our knowledge of sampling error
would indicate that the standard error provides an evaluation of the likely error
associated with the estimate. If we assume that the sampling distribution of the sample
means is normal, then we can provide a measure of this error in terms of a probability
value that the value of the population mean will lie within a specified interval.

This interval is called an interval estimate (or as we will explain later, it can be called
the confidence interval), and it is centred at the point estimate for the population mean.

To create an interval around the sample mean X̄, so that you can state with a certain confidence that the true mean resides within it, we need three parameters. We need to calculate the sample mean X̄ (see equation (4.3)), we need the value of the standard error σX̄ (see equation (4.8) or (5.9)), and we need the Z-value which corresponds to the desired probability level (see equation (4.10)).

With these three parameters, we can make an estimate that the true population mean is
within the interval, as described by equation (5.15) or equation (5.16):

μ = X̄ ± Z σX̄     (5.15)

or

μ = X̄ ± Z s/√n     (5.16)

From our knowledge of the normal distribution we know that 95% of the distribution
lies within ± 1.96 standard deviations of the mean. Therefore, for the distribution of
sample means, 95% of these sample means will lie in the interval defined by equation
(5.16).

This equation tells us that an interval estimate (or confidence interval) is centred at X̅
with a lower and upper confidence interval given by equations (5.17) and (5.18):

Lower confidence interval boundary value:

μ1 = X̄ − 1.96 s/√n     (5.17)

Upper confidence interval boundary value:

μ2 = X̄ + 1.96 s/√n     (5.18)

Figure 5.19 graphically illustrates the position of the point estimate, the lower and upper boundaries of the confidence interval (µ1, µ2), and the confidence interval itself relative to these statistics.

Figure 5.19 Shaded region represents the confidence interval

In other words, if we select a sample, we can calculate the sample mean X̄. From there, we can be 95% confident that the true population mean µ lies somewhere in the interval estimated as X̄ ± 1.96 s/√n.

Why?

Because, if we took many samples, as explained earlier, 95% of all the intervals would
be around the true population mean. This also implies that we would expect that 5% of
the intervals would not contain the population mean.

For example, suppose we take a sample from a population which is normally distributed
with a population mean of 90 and standard deviation of 5. We can take random samples
of size 36 from this population and calculate the sample means. If we wish to calculate a
95% confidence interval, then this interval would be given by the following equation:

μ = X̄ ± 1.96 s/√n

Now, if we collect four samples with sample means of 88.9, 91.0, 87.2 and 95.3 respectively, then the 95% confidence intervals for each sample are illustrated in Figure 5.20.

Figure 5.20 95% confidence intervals for four sample means

From Figure 5.20, we observe:

a. For sample 1, the interval is 86.3–91.5, which includes the assumed population
mean of 90.
b. For sample 2, the interval is 88.4–93.6, which includes the assumed population
mean of 90.
c. For sample 3, the interval is 84.6–89.8, which does not include the assumed
population mean of 90.
d. For sample 4, the interval is 92.7–97.9, which does not include the assumed
population mean of 90.

In the real world, you will choose one sample where the population mean µ is unknown; in this situation you have no guarantee that the confidence interval based upon this sample mean will include the population mean µ.

However, if you take all possible samples of size n and compute the corresponding sample means, 95% of the confidence intervals will include the population mean and only 5% will not. This tells us that we have 95% confidence that the population mean µ lies within the sample confidence interval.

If we return to Example 5.1, we had a mean value of 4.9920 and a standard error of
0.0458. If we take Z = 1.96, then we can provide an estimate of the population weight to
lie between 4.9022 (= 4.9920 – 1.96  0.0458) and 5.0818 (= 4.9920 + 1.96  0.0458).

In fact, because we used 1.96 for the Z-value, this means that we are 95% certain that
the true weight lies somewhere between 4.9022 and 5.0818. A different value of Z
would give us a different confidence interval.

From the above examples you can conclude that the estimate interval and the
confidence interval are connected. Equation (5.15) gives the width of the interval in which the true mean is likely to lie (the estimate interval), and the value of Z gives the probability that the true mean value is in this interval (which is the
confidence interval). It is intuitive to expect that the higher the probability (i.e. the
higher the confidence level), the wider the estimate interval will be, and vice versa.

Now that we understand how interval estimates are made and how this is connected with a confidence interval for the estimate, we will look at how to calculate confidence intervals for both the population mean and proportion, depending on the sample size.

Interval estimate of the population mean where σ is known and the sample is larger than 30 observations

To reiterate: if a random sample of size n is taken from a normal population N(μ, σ²), then the sampling distribution of the sample means will be normal, X̄ ~ N(μ, σ²/n).

As we have shown, we can use equation (5.15) to give a confidence interval for the population mean:

X̄ − Z σX̄ ≤ μ ≤ X̄ + Z σX̄    (5.19)

Or

X̄ − Z × (σ/√n) ≤ μ ≤ X̄ + Z × (σ/√n)    (5.20)

Example 5.3

Calculate a 95% confidence interval for the population mean using the data presented in Example 5.1, but this time assume that the data are normally distributed. If you carry out the calculations, then you will find that the required statistics are as follows:

Population data

Data: X ~ N(μ, σ²)

Sample data

Sample means: X̄ ~ N(μ, σ²/n)

Sample size, n = 25

Sample mean, X̄ = 4.9920

Sample standard deviation, s = 0.2290

Standard error of the means, σX̄ = 0.0458

Confidence intervals

The value of the critical z statistic at a given significance level can be found from the
normal distribution tables.

With two tails, your risk is equally split between the two tails, so you have 2.5% in the left tail, 2.5% in the right tail, and 95% in between. Table 5.6 shows an example of this, with the critical value z = 1.96 identified for the probability P(Z ≥ z) = 2.5% = 0.025 (right-hand tail in Figure 5.19).

Z     0.00   0.01   0.02   0.03   0.04   0.05   0.06
0.0   0.500  0.496  0.492  0.488  0.484  0.480  0.476
0.1   0.460  0.456  0.452  0.448  0.444  0.440  0.436
0.2   0.421  0.417  0.413  0.409  0.405  0.401  0.397
…
1.8   0.036  0.035  0.034  0.034  0.033  0.032  0.031
1.9   0.029  0.028  0.027  0.027  0.026  0.026  0.025
2.0   0.023  0.022  0.022  0.021  0.021  0.020  0.020
2.1   0.018  0.017  0.017  0.017  0.016  0.016  0.015
Table 5.6 Calculation of z when P(Z ≥ 1.96) = 0.025

From Table 5.6, critical z value = 1.96 when P(Z ≥ z) = 0.025. Given that we have two
tails, then the critical z value = ± 1.96.
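If you prefer to avoid the printed tables, the critical value can be obtained directly. The following minimal Python sketch is our own illustration (the equivalent Excel function would be =NORM.S.INV()); it recovers z = 1.96 from the 0.025 upper-tail probability.

# Minimal sketch: recover the two-tailed 95% critical z value from the
# upper-tail probability of 0.025.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # inverse of the standard normal CDF
print(round(z_crit, 2))            # 1.96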

Estimate of the standard error for the sample means is:

σ̂X̄ = s/√n = 0.2290/√25 = 0.0458

Then our interval endpoints are:

μ₁ = X̄ − Z × (s/√n) = 4.9920 − 1.96 × 0.0458 = 4.9022
μ₂ = X̄ + Z × (s/√n) = 4.9920 + 1.96 × 0.0458 = 5.0818

Figure 5.21 illustrates the 95% confidence interval for the population mean.

Figure 5.21 Shaded region represents 95% confidence interval

Thus, the 95% confidence interval for µ is = 4.9920 ± 1.96 × 0.0458, that is, from 4.9022
to 5.0818.

Another way of saying the same thing is to conclude that there is a 5% risk that the population mean is not between 4.9022 and 5.0818.
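Before turning to the Excel workbook, here is a minimal Python sketch (our illustration, not one of the textbook's files) that reproduces the Example 5.3 calculation from the summary statistics alone.

# Minimal sketch: 95% confidence interval for the mean of Example 5.3,
# using the sample standard deviation in place of the unknown sigma.
from math import sqrt
from scipy.stats import norm

x_bar, s, n = 4.9920, 0.2290, 25
se = s / sqrt(n)                       # standard error of the mean
z = norm.ppf(0.975)                    # critical z for 95% confidence
lower, upper = x_bar - z * se, x_bar + z * se
print(round(lower, 4), round(upper, 4))  # approximately 4.9022 and 5.0818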

Excel solution

Figures 5.22 and 5.23 illustrate the Excel solution (first 10 out of 25 observations shown).

Figure 5.22 Example 5.3 Excel solution

Figure 5.23 Example 5.3 Excel solution continued

From Excel, the 95% confidence interval for µ is 4.9022 to 5.0818.

Note:

You can also solve this problem using the Excel CONFIDENCE.NORM function:
=CONFIDENCE.NORM(alpha, standard_dev, size)

Where,

1. Alpha is the significance level used to compute the confidence level. The
confidence level equals 100*(1 - alpha)%, or in other words, an alpha of 0.05
indicates a 95 percent confidence level.
2. Standard_dev is the population standard deviation for the data range and is
assumed to be known. Please note in the example above this value is not known
but you could replace this with the sample standard deviation to obtain the
results given in Figure 5.23.
3. Size is the sample size, n.

SPSS solution

There is no built-in SPSS solution, though there are workarounds. We will show some possible ways to use SPSS for this later in the text.

Interval estimate of the population mean where σ is not known and the
sample is smaller than 30 observations

In the previous example we calculated the point and interval estimates when the
population was normally distributed, and the population standard deviation was
known. In most cases the population standard deviation is unknown, and we have to
use the sample value to estimate the population value with associated errors.

With the population standard deviation unknown, the population mean estimate is still given by the value of the sample mean, but what about the interval estimate? In the previous example the sample mean and sample size were used to provide this interval. However, in this new case we have an extra unknown that has to be estimated from the sample data in order to find the interval estimate of a population mean when the sample size is small.

This is often the case in student research projects, which typically involve small samples where the population standard deviation is unknown. The question then becomes how we can create interval estimates when the population standard deviation is unknown and the sample sizes are small, and whether we can measure how much smaller the corresponding probability will be.

This question was answered by W. S. Gossett who determined the distribution of the
mean when divided by an estimate of the standard error. The resulting distribution is
called Student’s t distribution. If the random variable X is normally distributed, then the
test statistic has a t distribution with n – 1 degrees of freedom and is defined by
equation (5.21).

t_df = (X̄ − μ) / (s/√n)    (5.21)

As we can see, the above equation has the same form as equation (4.10), although the t
distribution is not the same as the Z distribution. As we already know, the t distribution
is very similar to the normal distribution when the estimate of variance is based on
many degrees of freedom (df = n – 1), but has relatively more scores in its tails when
there are fewer degrees of freedom. The t distribution is symmetric, like the normal distribution, but has a lower peak and heavier tails (it is leptokurtic). As a reminder, Figure 5.24 compares the t distribution with 5 degrees of freedom and the standard normal distribution.

Figure 5.24 Normal versus t distribution

Since the t distribution is leptokurtic, the percentage of the distribution within ±1.96
standard deviations of the mean is less than the 95% for the normal distribution.
However, if the number of degrees of freedom (df) is large (df = n – 1 ≥ 30) then there is
very little difference between the two probability distributions. The sampling error for
the t distribution is given by the sample standard deviation (s) and sample size (n), as
defined by equation (5.22), which has the same form as equations (4.8) and (5.9):

σX̄ = σ̂/√n = s/√n    (5.22)

The degrees of freedom and the interval estimate are given by equation (5.23):

df = n – 1 (5.23)

This yields the interval estimate for the population mean to be defined as in equation
(5.24):
X̄ − t_df × (s/√n) ≤ μ ≤ X̄ + t_df × (s/√n)    (5.24)

Example 5.4

Calculate a 95% confidence interval for the data presented in Table 5.7 (assume the data are normally distributed).

X
12.00
13.54
12.22
10.99
10.09
10.82
11.62
12.12
12.49
9.95
12.57
11.50
10.22
11.98
10.98
12.61
12.40
11.23
Table 5.7

For these data, we can calculate the summary statistics:

• Sample mean, X̄ = 11.6294
• Sample size, n = 18
• Sample standard deviation, s = 0.9858

The value of the critical t statistic at a given significance level and degrees of freedom can be found from Student's t distribution tables. Table 5.8 shows an example of this, with the critical t value identified for a probability P(T ≥ t) = 2.5% = 0.025 (right-hand tail in Figure 5.25 below), alpha = 2 × 0.025 = 0.05, and degrees of freedom n – 1 = 18 – 1 = 17. From the table, the critical t value is 2.11 when P(T ≥ t) = 0.025 for 17 degrees of freedom.

ALPHA  50%   20%   10%   5%     2.50%
df     0.5   0.20  0.1   0.05   0.025
1      1.00  3.08  6.31  12.71  25.45
2      0.82  1.89  2.92  4.30   6.21
…
14     0.69  1.35  1.76  2.14   2.51
15     0.69  1.34  1.75  2.13   2.49
16     0.69  1.34  1.75  2.12   2.47
17     0.69  1.33  1.74  2.11   2.46
18     0.69  1.33  1.73  2.10   2.45
19     0.69  1.33  1.73  2.09   2.43
20     0.69  1.33  1.72  2.09   2.42
Table 5.8 Calculation of t for P(T ≥ t) = 0.025 with 17 df

The confidence interval is then found using equations (5.22)–(5.24). The standard error is:

σ̂X̄ = s/√n = 0.9858/√18 = 0.2323

Then, we can calculate the confidence interval using equation (5.24).

Lower confidence interval boundary, Lower CI:

Lower CI = X̄ − t_df × (s/√n) = 11.6294 − 2.1098 × 0.2323 = 11.1392

Upper confidence interval boundary, Upper CI:

Upper CI = X̄ + t_df × (s/√n) = 11.6294 + 2.1098 × 0.2323 = 12.1197

Figure 5.25 shows the 95% confidence interval for the population mean.

Figure 5.25 Shaded region represents 95% confidence interval for µ

Thus, the 95% confidence interval is from 11.14 to 12.12. To put it another way, we are
95% confident that, based on this small sample, the true population mean is between
11.14 and 12.12.
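As a cross-check of the manual calculation, here is a minimal Python sketch (our illustration, not one of the textbook's workbooks) that computes the same t-based interval directly from the Table 5.7 data.

# Minimal sketch: 95% t-based confidence interval for the Example 5.4 data.
import numpy as np
from scipy.stats import t

data = np.array([12.00, 13.54, 12.22, 10.99, 10.09, 10.82, 11.62, 12.12, 12.49,
                 9.95, 12.57, 11.50, 10.22, 11.98, 10.98, 12.61, 12.40, 11.23])

n = data.size
x_bar = data.mean()
s = data.std(ddof=1)                 # sample standard deviation
se = s / np.sqrt(n)                  # standard error of the mean
t_crit = t.ppf(0.975, df=n - 1)      # critical t for 95% confidence, 17 df
print(round(x_bar - t_crit * se, 2), round(x_bar + t_crit * se, 2))  # about 11.14 and 12.12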

Excel solution

Method 1. Excel formula method

Figures 5.26 and 5.27 illustrate the Excel solution using equations (5.22)–(5.24).

Figure 5.26 Example 5.4 Excel solution

Figure 5.27 Example 5.4 Excel solution continued

From Excel, the 95% confidence interval for µ is 11.14 to 12.12.

Method 2. Excel function method

Figure 5.28 illustrates the Excel solution using Excel functions. The results are
the same as for the formula method.

Figure 5.28 Example 5.4 Excel solution continued

From Excel, the 95% confidence interval for µ is 11.14 to 12.12.

In this example we used the Excel CONFIDENCE.T function:


=CONFIDENCE.T(alpha, standard_dev, size).

Where,
1. Alpha is the significance level used to compute the confidence level. The
confidence level equals 100*(1 - alpha)%, or in other words, an alpha of 0.05
indicates a 95 percent confidence level.
2. Standard_dev is the population standard deviation for the data range and is
assumed to be known.
3. Size is the sample size.

Given the population standard deviation (Standard_dev) is unknown, we have replaced
this value with the sample standard deviation to obtain the same results as illustrated in
Figure 5.28.

SPSS solution

Enter data into SPSS

Figure 5.29 Example 5.4 SPSS data

Select Analyze > Descriptive Statistics > Explore

Transfer Sample_data to the Dependent List: box

Figure 5.30 SPSS Explore

Choose Statistics

Figure 5.31 SPSS Explore Statistics option

Click on Continue.

Click OK

SPSS output

The output is shown in Figure 5.32

Figure 5.32 SPSS solution continued

From Figure 5.32, the point estimate for the population mean is 11.63 and the 95% estimate interval is from 11.14 to 12.12. These results agree with the Excel solution shown in Figures 5.26 and 5.27. We are 95% confident that the population mean is contained within the interval from 11.14 to 12.12.

Interval estimate of a population proportion

If the population is normally distributed or the sample size is large (central limit theorem, n ≥ 30), then the confidence interval for a proportion is calculated by using equation (5.11) and transforming equation (5.16) to give equation (5.25), where the population proportion, π, is estimated from the sample proportion, ρ:

ρ − Z σρ ≤ π ≤ ρ + Z σρ    (5.25)

Which can also be written as:

ρ − Z × √(ρ(1 − ρ)/n) ≤ π ≤ ρ + Z × √(ρ(1 − ρ)/n)

Example 5.5

Fit a 95% confidence interval for the population proportion given the sample
proportion is 0.4 and sample size is 38.

The confidence interval is given by equation (5.25). From the data we can state:

1. Sample proportion ρ = 0.4.
2. Sample size n = 38.
3. Population proportion distribution unknown but sample size large.
4. Given point 3, we can apply the central limit theorem to state ρ ∼ N(π, σρ²).
5. Z for 95% confidence is ±1.96 (Table 5.6).

Substituting these values into equations (5.10), (5.11), and (5.25) gives:

Estimate of the population proportion via equation (5.10):

π̂ = ρ = 0.4

Estimate of the standard error of the sample proportion via equation (5.11):

σ̂ρ = √(ρ(1 − ρ)/n) = √(0.4 × (1 − 0.4)/38) = 0.07947

Estimate the 95% confidence interval for the population proportion using equation (5.25):

Lower confidence interval boundary (LCI), Z = −1.96:

LCI = ρ − Z × √(ρ(1 − ρ)/n) = 0.4 − 1.96 × 0.07947 = 0.2442

The lower 95% confidence boundary is 0.2442.

Upper confidence interval boundary (UCI), Z = +1.96:

UCI = ρ + Z × √(ρ(1 − ρ)/n) = 0.4 + 1.96 × 0.07947 = 0.5558

The upper 95% confidence boundary is 0.5558.

Thus, the 95% confidence interval for the population proportion, π, is 0.4 ± 1.96 × 0.07947, that is, from 0.2442 to 0.5558. Figure 5.33 illustrates the 95% confidence interval for the population proportion.

Figure 5.33 Shaded region represents 95% confidence interval

We are 95% confident that the true population proportion is contained within the
interval from 0.2442 to 0.5558.
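A minimal Python sketch of the same calculation is shown below (our illustration; in Excel the individual terms could be built up with SQRT() and NORM.S.INV()).

# Minimal sketch: 95% confidence interval for a population proportion,
# using the Example 5.5 figures (sample proportion 0.4, sample size 38).
from math import sqrt
from scipy.stats import norm

rho, n = 0.4, 38
se = sqrt(rho * (1 - rho) / n)       # standard error of the sample proportion
z = norm.ppf(0.975)                  # critical z for 95% confidence
print(round(rho - z * se, 4), round(rho + z * se, 4))  # about 0.2442 and 0.5558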

Concept of risk

Let us reiterate one point. In business, people often use the concept of risk. Risk can be defined in relation to the confidence interval. In the most simplistic terms, risk and the confidence interval are two opposite sides of the same phenomenon. If we state with 95% confidence that the true value is, for example, between 4 and 6, then implicitly we also state that there is a 5% risk that the true value is not between 4 and 6.

Excel solution

Figure 5.34 illustrates the Excel solution.

Figure 5.34 Example 5.5 Excel solution

SPSS solution

There is no built-in SPSS solution but you could solve this problem in SPSS by using the
SPSS transform method to calculate the individual statistics.

Check your understanding

X5.7 The standard deviation for a method of measuring the concentration of nitrate
ions in water is known to be 0.05 ppm. If 100 measurements give a mean of 1.13
ppm, calculate the 90% confidence limits for the true mean.

X5.8 In trying to determine the sphere of influence of a sports centre, a random sample of 100 visitors was taken. This indicated a mean travel distance (d) of 10 miles with a standard deviation of 3 miles. Calculate a 90% confidence interval for the mean travel distance (D).

X5.9 The masses, in grams, of 13 ball bearings taken at random from a batch are: 21.4, 23.1, 25.9, 24.7, 23.4, 24.5, 25.0, 22.5, 26.9, 26.4, 25.8, 23.2, 21.9. Calculate a 95% confidence interval for the mean mass of the population, supposed normal, from which these masses were drawn.

5.4 Calculating sample sizes


If you take a closer look at equation (5.20), you will notice that we can control the width
of the confidence interval by determining the sample size necessary to produce narrow
intervals. For example, if we assume that we are sampling a mean from a population
that is normally distributed then we can modify equation (4.10) to calculate an
appropriate sample size.

Equation (4.10) stated that:

Z = (X̄ − μ) / (σ/√n)

This can be re-written to give equation (5.26):

X̄ − μ = Z × (σ/√n)    (5.26)

One way to look at equation (5.26) is to say that X̄ − μ is in fact an interval within which our estimate should fall. In this case we can rewrite the above equation as:

Interval estimate = 2 × Z × (σ/√n)    (5.27)

Why did we insert the number 2 in equation (5.27)? The graph in Figure 5.35 illustrates the point.

Figure 5.35 Relationship between sample mean and confidence interval

Another way to look at equation (5.26) is to say that X̄ − μ is effectively an error between the estimated mean value and the true mean value. In this case, from equation (5.26) we can specify this error as:

E = Z × (σ/√n)    (5.28)

You can also think of E in equation (5.28) as a margin of error. If, for example, you would like your results to be within 10% accuracy, then the margin of error is expressed in decimal numbers as 0.1. If the desired accuracy is 5%, then the margin of error is expressed as 0.05. Either way, a margin of error applies to a given confidence level that is determined by Z.

If X̄ − μ is effectively an error E, we can now rearrange equation (5.28) into:

√n = Zσ/E

This gives the sample size n, as per equation (5.29):

n = (Zσ/E)²    (5.29)

Figure 5.36 illustrates the confidence interval, margin of error, and sample size.

Figure 5.36 Confidence interval for the population mean µ

Example 5.6

A researcher working in the quality control department of a brewery wishes to determine the sample size where the margin of error is no more than 0.05 units of alcohol and with 98% confidence. Historical data provided to the researcher indicates that the population data are normally distributed with a population standard deviation of 0.2 units of alcohol.

Information provided:

• Population data normally distributed
• Confidence interval = 98%
• From Table 5.6, Zcri for 98% CI = ±2.326347874
• Population standard deviation = 0.2 units
• Required margin of error E = 0.05 units

The sample size can now be calculated using equation (5.29).

n = Z²σ²/E² = (2.326348² × 0.2²) / 0.05² = 86.59031…

To meet the requirements of the researcher, a sample size of 87 is required.
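The same calculation can be expressed in a few lines of Python (a minimal sketch of equation (5.29), not one of the textbook's workbooks); rounding up gives the required sample size.

# Minimal sketch of equation (5.29): sample size for estimating a mean,
# using the Example 5.6 figures (sigma = 0.2, margin of error E = 0.05, 98% confidence).
from math import ceil
from scipy.stats import norm

sigma, E, conf = 0.2, 0.05, 0.98
z = norm.ppf(1 - (1 - conf) / 2)     # two-tailed critical z, about 2.3263
n = (z * sigma / E) ** 2
print(n, ceil(n))                    # about 86.59, so a sample of 87 is needed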

Excel solution

Figure 5.37 illustrates the Excel solution

Figure 5.37 Example 5.6 Excel solution

From Excel, the sample size to achieve the result would be 87.

SPSS solution

There is no built-in SPSS solution but you could solve this problem in SPSS by using the
SPSS transform method to calculate the individual statistics.

Impact of margin of error on the confidence interval

To see what impact the selection of the margin of error and confidence interval has on the sample size, we will run a small simulation. Table 5.9 illustrates how the sample size changes with differing margins of error and confidence intervals.

By keeping the same margin of error but changing the confidence interval, we can see how the sample size changes. Effectively, in this example, we need to increase the sample size by almost two and a half times if we want our confidence interval to increase from 90% to 99% (see Table 5.9).

Margin of error  0.05  0.05  0.05  0.05
Conf. interval   90%   95%   98%   99%
Sample size      44    62    87    107
Table 5.9 Sample size changes if the margin of error and confidence interval change

Let us now keep the confidence interval constant, at 90%, but let’s change the margin of
error. Table 5.10 shows the sample size required.

Margin of error  0.15  0.10  0.05  0.01
Conf. interval   90%   90%   90%   90%
Sample size      5     10    44    1083
Table 5.10 How sample size changes when the margin of error changes but the confidence interval is constant

As we can see, the margin of error has a tremendous impact on the sample size. It is
particularly important to emphasise here that the margin of error depends very little on
the size of the population from which we are sampling if the sampling fraction is less
than 5% of the total population. For very large populations, the impact is almost
negligible.
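The figures in Tables 5.9 and 5.10 can be reproduced with a short loop. The sketch below is our own illustration, again assuming σ = 0.2 as in Example 5.6; it simply re-applies equation (5.29) for each combination (note that rounding up at E = 0.10 gives 11 rather than the 10 shown in Table 5.10).

# Minimal sketch: re-apply equation (5.29) across different confidence levels
# and margins of error, assuming sigma = 0.2 as in Example 5.6.
from math import ceil
from scipy.stats import norm

sigma = 0.2

def sample_size(conf, E):
    z = norm.ppf(1 - (1 - conf) / 2)
    return ceil((z * sigma / E) ** 2)

# Table 5.9: fixed margin of error, varying confidence level
print([sample_size(c, 0.05) for c in (0.90, 0.95, 0.98, 0.99)])   # [44, 62, 87, 107]

# Table 5.10: fixed confidence level, varying margin of error
print([sample_size(0.90, E) for E in (0.15, 0.10, 0.05, 0.01)])   # [5, 11, 44, 1083]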

The same equation (4.10) can be used to extract n, as we did in equation (5.29), or E or
Z, as below:

E = Z × (s/√n)    or    Z = E × (√n/s)

If we solve the equation for E, this tells us what the size of our error is going to be if we use a given level of confidence (Z) and a given sample size (n). If we solve the equation for Z, this tells us the expected level of confidence given the size of our error (E) and a given sample size (n).

A note to remember: the error margin is expressed in the same units as the original data values, and so is the standard deviation. If the data units are kg, for example, then the mean value, the standard deviation and the error margin are also in kg. If the mean value and the standard deviation are percentages, for example, then the error margin is also expressed in percentages. However, if your original values and the standard deviation are in kg, for example, and you would like the error margin to be 10% of the target value expressed in kg, then you need to multiply the target value by 0.1, which results in the correct error margin expressed in kg.

Sample size for the proportion estimate

If we do not have the standard deviation of the population, which is very often the case,
and we use proportions, then there is another equation to determine the size of the
sample. Let us assume that p is the proportion of some variable and q = 1 – p. In this
case, the sample size is calculated as:

n = p q (Z/E)²    (5.30)

Example 5.7

Suppose that we do not know what the true proportion of people who like Marmite is, which means we will assume the neutral position of 50% (p = 0.5). Thus, q = 1 – 0.5 = 0.5. In other words, there are also 50% of people who do not like Marmite. Let us also assume that we would like 95% confidence in our results, which means that Z = 1.96. And finally, let us say that we will accept an error of 5% (E = 0.05). To calculate the size of the sample needed, we insert these values in equation (5.30):

n = 0.5 × 0.5 × (1.96/0.05)² = 384.16

A sample of 384 people will give us results within the 95% confidence limit and will not generate an error that is greater than ±5%. Equation (5.30) will work for very large populations, but if the population is small (and denoted by N), then the sample size n must be corrected using equation (5.31), called Cochran's correction formula:
n₁ = n / (1 + (n − 1)/N)    (5.31)

Example 5.8

We will use the same data as in Example 5.7 but let us make two changes. Let us assume
that we got survey results from somewhere and we know that only 41% of the people
like Marmite (p = 0.41 and q = 0.59). Let us also assume that we would like to apply
these numbers and survey the population of the first year at our business school, and
this population is only 500 (N = 500). What should be the sample size we need to take in
this case, still assuming 95% confidence limit and 5% margin of error?

By inserting the new numbers into equation (5.30) we get:

n = 0.41 × 0.59 × (1.96/0.05)² = 1032.54 ≈ 1033

We use this value of n in equation (5.31) to get the new estimate of the sample:

n₁ = n / (1 + (n − 1)/N) = 1033 / (1 + (1033 − 1)/500) = 337.09

Now we have the corrected sample size, which indicates that we must include 337
people in our survey and that we can expect the results within a 5% error.
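A minimal Python sketch of equations (5.30) and (5.31) is shown below (our illustration, not one of the textbook's workbooks). It uses the Example 5.7 inputs (p = 0.5, E = 0.05, 95% confidence) and, purely for illustration, applies Cochran's correction for an assumed small population of N = 500.

# Minimal sketch of equations (5.30) and (5.31): sample size for a proportion,
# with Cochran's correction for a small population. Inputs follow Example 5.7;
# the population size N = 500 is assumed here purely for illustration.
from math import ceil
from scipy.stats import norm

p, E, conf = 0.5, 0.05, 0.95
q = 1 - p
z = norm.ppf(1 - (1 - conf) / 2)          # about 1.96

n = p * q * (z / E) ** 2                  # equation (5.30)
N = 500                                   # assumed small population size
n1 = n / (1 + (n - 1) / N)                # equation (5.31), Cochran's correction
print(round(n, 2), ceil(n1))              # about 384.15 and 218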

As before, the same equation (4.10) can be used to extract n for estimating proportions, as we did in equation (5.30), or E or Z:

E = Z × √(pq/n)    or    Z = E / √(pq/n)

If we solve the equation for E, this tells us what the size of our error is going to be if we use a particular proportion (p) at a level of confidence (Z) and a given sample size (n). If we solve the equation for Z, this tells us the expected level of confidence given the size of our error (E), a given sample size (n), and the expected proportion (p).

Check your understanding

X5.10 A business analyst has been requested by the managing director of a national
supermarket chain to undertake a business review of the company. One of the
key objectives is to assess the level of spending of shoppers who historically have
weekly mean levels of spending of €168 with a standard deviation of €15.65.
Calculate the size of a random sample to produce a 98% confidence interval for
the population mean spend, given that the margin of error is €3. Is the sample
size appropriate given the practical factors?

Chapter summary
In this chapter we have explored methods that can be used to provide point and interval
estimates for population parameters. We learned how to estimate the population mean
and population proportion from the sample mean and the sample proportion,
respectively.

Once we learned how to provide point estimates, we extended these principles to interval estimates. What we learned was that we can assign a probability that these
point estimates will reside in an interval. To this effect we learned how to make interval
estimates for the population mean and a population proportion. We also learned that
interval estimates, although closely related to confidence intervals, are not the same
thing. One provides the interval of values where the population parameter is likely to be
(interval estimate) and the other provides the probability (confidence interval) that the
true value is in this interval. And finally, we learned how to handle estimates if the
samples are small, i.e. less than 30 observations.

Table 5.11 summarises the various equations and formulae that can be used to calculate
certain population and sample parameters, as well as how they can be used to estimate
the population parameters from a sample.

Statistic | Population | Sample > 30 | Sample < 30
Mean | μ = ΣfX / Σf | x̄ = Σfx / Σf | x̄ = Σfx / Σf
Standard deviation | σ = √(Σ(X − μ)²/N) | s = √(Σ(x − x̄)²/(n − 1)) | s = √(Σ(x − x̄)²/(n − 1))
Standard error for x̄ | | SEx̄ = s/√n | SEx̄ = s/√n
z-value / t-value | z = (x − μ)/σ | z = (x̄ − μ)/SEx̄ | t = (x̄ − μ)/SEx̄
x or x̄ value | x = zσ + μ | x̄ = z SEx̄ + μ | x̄ = t SEx̄ + μ
Proportion | π | ρ | ρ
Standard error for ρ | | SEρ = √(ρ(1 − ρ)/n) | SEρ = √(ρ(1 − ρ)/n)
Probability p given x or z | For both the population and samples > 30, use the tables or =NORM.DIST() if you know x, or =NORM.S.DIST() if you know z. For samples < 30 use =T.DIST(), =T.DIST.2T() or =T.DIST.RT(). | |
Expected value | E(X) = μ = x̄ | E(x) = p x | E(x) = p x
Estimate interval for μ | | x̄ − z SEx̄ < μ < x̄ + z SEx̄ | x̄ − t SEx̄ < μ < x̄ + t SEx̄
Estimate interval for π | | ρ − z SEρ < π < ρ + z SEρ | ρ − t SEρ < π < ρ + t SEρ
Table 5.11 Equations to calculate estimates and confidence intervals

Test your understanding


TU5.1 During a production process to create chocolate bars a small sample of size 60
was randomly selected. The average weight of the sample was 62.1 grams with a
standard deviation of 2.06 grams. Calculate: (a) an estimate of the population
mean weight, and (b) a 99% confidence interval for the mean weight of all
chocolate bars.

TU5.2 A random sample of 75 students were selected and their heights measured.
Given the sample average height was 182.56 cm with a standard deviation of
5.02 cm, calculate a 95% confidence interval for the mean height of all students.

TU5.3 A local newsagent measures the time that shoppers spend in the store. From the
historical data the time spent is normally distributed with a standard deviation
of 4 minutes. A random sample of 40 shoppers in the shop had a mean time of
24.5 minutes. Calculate: (a) an estimate of the average time spent in the store for
all shoppers, and (b) a 95% confidence interval for this population average time.

TU5.4 A warehouse supplies greengrocers with supplies which are carefully added to
bags for delivery. One of the items bagged is a speciality sugar. Historically the
weights of bags are normally distributed with a standard deviation of 1.45
ounces. The contents of a random sample of 50 bags had a mean weight of 22.8
ounces. Calculate the 98% confidence interval for the population weight of all
bags of speciality sugar bagged by the warehouse.

TU5.5 A random sample of 300 final-year undergraduate business dissertation students was selected from the last 3 years, and the number of times a student was absent from their 25 one-to-one supervisory meetings was recorded. The analysis of the
data suggests the average number of missed meetings was 9.7 with a sample
standard deviation of 3.2. Provide an estimate of the population average number
of missed meetings and develop a 98% confidence interval.

TU5.6 During the manufacturing process for a dual-core 3.0 GHz processor, a company
randomly selects a sample of 65 microprocessors and measures the
microprocessor clock speed. The sample statistics gives an average of 3.01 GHz
with a standard deviation of 0.28 GHz. Calculate a 99% confidence interval for
the true microprocessor speed. Should the company be concerned when you
compare the official microprocessor speed to 3.0 GHz, given the 99% confidence
interval?

TU5.7 A computer company sells computers to small and medium-sized businesses via
its e-commerce operation. The historical data collected by the company shows
that the time for delivery of the equipment, within business hours, to business
customers is normally distributed with a standard deviation of 4.5 hours.
Construct a 98% confidence interval for the population mean delivery time if we
select 38 orders and with a sample average was 17.8 hours. If the company
advertises that it delivers within 20 business hours, then should the company be
concerned?

TU5.8 A forestry conservation organisation has decided to supply timber to a local builder. The organisation needs to calculate if it has enough trees to fell in order
to supply the builder. The method used is to estimate the mean diameter of trees
in an area of forest to determine whether there is enough lumber to harvest.
Historically, the average diameter of trees is normally distributed with a
standard deviation of 3 inches. The organisation needs to estimate the
population average diameter to within 0.4 inches with a 95% confidence
interval. What sample size is enough to guarantee this?

TU5.9 A halogen 400 W light bulb has stated lifetime of 2000 hours. A random sample
of size 70 was collected from the manufacturing process and the sample average
lifetime calculated to be 2010 hours with a standard deviation of 48.6 hours.
Calculate a 95% confidence interval for the true mean lifetime of the bulbs.
Should the company be concerned with the confidence interval results, given the
company advertises that the bulbs have average lifetime of 2000 hours?

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. AW5 Resampling with Replacements (Bootstrapping)

Chapter 6 Hypothesis testing
6.1 Introduction and Learning Objectives
Imagine that you and your friend have just handed in a big piece of coursework, having
worked hard on it for weeks. You head to the student union for celebratory drinks, but
find the main bar deserted. Your friend messages his flatmates to ask whether they’re
coming out for a drink, and six out of nine flatmates say they’re saving themselves (and
their money) for a big student night at the city’s main nightclub tomorrow. You suggest
waiting for the three others to respond rather than going to check another student bar,
but your friend says that, given the response of his flatmates, it’s likely that everywhere
will be empty, so you’d be better off going home and ordering a pizza. Your friend’s
statement could be considered a ‘hypothesis’ – a proposed explanation for something,
based on limited data. If you were still keen to go for drinks and wanted to test its truth,
to assess whether his conclusion is scientifically sturdy enough to act upon, you could
use ‘hypothesis testing’.

This chapter looks at a range of ways in which we can test a hypothesis – that is, to
evaluate whether our proposed explanation is valid or not – particularly in the context
of drawing conclusions from normally distributed data. In Chapter 7 we will explore
what is meant specifically by parametric hypothesis testing and we'll learn how to
conduct some of the tests. In Chapter 8 we will then go on to explore some more
hypothesis tests, including so-called nonparametric hypothesis tests. Hypothesis testing
helps businesses make good decisions so that they can achieve their goals. On an
individual level, the scientific support that hypothesis testing provides gives a decision-
maker the confidence to pursue their chosen course of action. It helps them to explain
and defend their actions to colleagues and external shareholders. Sooner or later you
will be confronted with a question, such as, ‘I can see how this region is responding to
my campaign, but can I assume that the whole market will react in the same way?’

You can answer this and many similar questions with full confidence only if you
understand the concepts of hypothesis testing. The effectiveness of a new drug, for
example, is analysed using hypothesis testing to decide if the drug will work and be
approved by the regulating authority. This technique is very powerful, yet not very
complicated, and it will provide you with a very solid basis to base your business
decisions upon. Hypothesis testing might sound like a very academic concept, but it is a
powerful and extremely practical tool that any business professional can easily use to
sense-check statements and enhance his or her argument. Consider, for example, a US
pet food company that is attempting to enter the UK market. The company appoints a
distributor in Scotland who reports that, on average, 1.9 kg of the product is consumed
per week, per pet. Savvy professionals would not just accept this statement at face value – it
would prompt several questions. For example:

• Does this number represent consumption in the whole of the UK?
• What confidence can be assigned to these numbers?
• Are these figures from the UK comparable to their domestic US market?
• Just how much confidence can be placed in the inference that the UK is comparable to the US market, and that there is no difference between the two populations?
• And finally (based on the answers to the previous questions), would it be wise for the company to expand its network of distributors?

This chapter will show you how to answer these typical questions, through setting up a
hypothesis and testing it using the statistics of the mean, variance, or proportion.

Learning objectives

On completing this chapter, you will be able to:

1. Understand the concept of hypothesis testing


2. Differentiate between one-tail and two-tail tests, as well as one-sample and two-
sample tests
3. Understand the difference between dependent and independent samples
4. Learn how sampling from different types of distributions and knowledge about the
parameters from these distributions impact testing
5. Understand the concept of a Type I and Type II error and statistical power and be
able to calculate these values for simple problems
6. Learn the five typical steps of hypothesis testing
7. Solve problems using Microsoft Excel and IBM SPSS Statistics software packages.

6.2 What is hypothesis testing?


A hypothesis is a proposed explanation that is made based on limited evidence as a
starting point. The analogy we used in our chapter overview about your friend saying, ‘if
this bar is empty, the others will be, too’, is a good example of a hypothesis. Sometimes
the example might sound a little more complicated, but essentially it is always the same
principle. You have some data, but it is limited in scope, and you would like to draw
conclusions based on these limited data. The conclusions that you would like to draw, in
a business context, need to be credible and defendable. This means that you need to find
a scientific way to state how true your hypothesis is (your proposed explanation), and
how confident you are that you are drawing the correct conclusions.

When we use the phrase “a scientific way”, we mean to conduct a statistical test. A
statistical test is a formal technique that relies on the probability distribution to reach a
conclusion concerning the reasonableness of the hypothesis. These formal techniques
are called hypothesis tests. In this textbook the hypothesis, or the proposed
explanations that we are testing, typically apply to the mean, variance, or proportion.

What are parametric and nonparametric statistical tests?

To generalise about the population from the sample, statistical tests are used. These
hypothesis tests are classified as parametric or nonparametric tests. A parametric test
is one which has information about the population parameters. It can be used to make
statements about the mean or proportion of the parent population, for example. The

test assumes that the variables are measured predominantly at the interval or ratio
level.

A nonparametric test is one where the researcher has no information regarding the population parameters. It is not based on underlying distributional assumptions and does not require knowledge of the population's distribution. The test is mainly based on differences in medians; hence, it is alternatively known as a distribution-free test. The test assumes that the variables are measured at the nominal or ordinal level. Table 6.1 provides a comparison between the two types of test.

Comparison | Parametric test | Non-parametric test
What it means? | A statistical test in which specific assumptions are made about the population parameter. | A statistical test in which specific assumptions are not made about the population parameter.
Distribution known? | Yes | Arbitrary
Measurement level? | Interval, ratio, nominal, ordinal | Nominal, ordinal
Measure? | Mean, proportion | Median
Information about population data? | Known | Unknown
Test applied to? | Variables | Variables and attributes
Table 6.1 Comparison between parametric and nonparametric tests

Once we have covered the following two chapters, you will see that the parametric and nonparametric hypothesis tests covered in this textbook (including the online chapters) are as follows:

Analysis required | Parametric test (means) | Chapter | Non-parametric test (medians) | Chapter
Compare a sample average against a constant | 1 sample t test | 7 | Sign test; One sample Wilcoxon test | 8
Compare two independent sample averages | Two sample t test for independent samples | 7 | Mann-Whitney test | 8
Compare two dependent sample averages | Two sample t test for dependent samples | 7 | Wilcoxon rank sum test | 8
Compare three or more independent sample averages | One-way ANOVA | Online | Kruskal-Wallis test | Online
Compare three or more dependent sample averages | Factorial ANOVA | Online | Friedman test | Online
Estimate the degree of association between two quantitative variables | Pearson coefficient of correlation | 9 | Spearman's rank correlation | 9
Table 6.2 Type of analysis given parametric or nonparametric tests

In summary, Chapter 7 is dedicated to some of the parametric tests and Chapter 8 to
some of the nonparametric tests. Either way, they all start with formulating the
hypothesis.

Hypothesis statements H0 and H1

To conduct a hypothesis test, you always start with stating the so-called null
hypothesis (H0), also known as the hypothesis of no difference. The null hypothesis
(H0) is formulated in anticipation of being rejected as false. Typical phrases used are:

• There is no difference between the two means


• Proportion A is equal to proportion B
• The population mean is equal to the sample mean

If we state the null hypothesis in hope that we will reject it as false, what is the
alternative? Well, we need something that is called the alternative hypothesis (H1).
The alternative hypothesis (H1) is a proposed explanation that is contrary to the null
hypothesis and which states that a significant difference exists. Typical phrases used
are:

• There is a difference between the two means


• Proportion A is not equal to proportion B
• The population mean is not equal to the sample mean

The shorthand that is usually used to state these two hypotheses is as follows:

Null hypothesis

H0: μ = 100

Alternative hypothesis

H1: μ ≠ 100

or H1: μ < 100

or H1: μ > 100

The first line of this shorthand reads ‘Our null hypothesis is that the mean value is equal
to 100’. The second line shows three different options: ‘Our alternative hypothesis is
that the mean value is not equal to 100’, ‘Our alternative hypothesis is that the mean
value is less than 100’ or ‘Our alternative hypothesis is that the mean value is greater
than 100’. We typically use only one of them, depending on what we are testing.

Hypothesis testing has its own language. What we mean by this is that you will often see
phrases such as ‘the evidence suggests that we reject the null hypothesis’ or ‘the
evidence suggests that we fail to reject the null hypothesis’. It would be incorrect to use
the phrase ‘accept the null hypothesis’. Why this convoluted language?

The way the test philosophy is used implies that you can never be completely sure that
the null hypothesis is true. If you are ‘not rejecting’ the null hypothesis, this does not
mean that it is true. It just means that you are still ‘retaining’ it, as there is some small
possibility that it might be true. In fact, you can think of these tests as a method to help
you collect evidence for rejecting or not rejecting the null hypothesis. If your evidence
suggests that you cannot reject the null hypothesis, this means that you do not have
enough evidence to do so. It does not mean that it is true, which is the reason why we
should not use the word ‘accept H0’, but rather ‘evidence suggests we do not reject H0’.

One- and two-tailed tests

Depending on how we state the hypothesis to test, we will have to use either a one-
tailed test or a two-tailed test. A typical two-tailed test states the null hypothesis as H0: μ = 100 and H1: μ ≠ 100, for example. The symbol ≠ determines that we need a two-tailed test. If the hypotheses are stated as H0: μ = 100 and H1: μ > 100, or H0: μ = 100 and H1: μ < 100, for example, then a one-tailed test is appropriate. The symbol '<' (less than)
or ‘>’ (greater than) will determine whether we will use a lower-tail or an upper-tail
test.

Whether you have a one- or two-tailed test will be important because of the rejection
region. The rejection region is in the tail(s) of the distribution. The exact location is
determined by the way H1 is expressed. If H1 simply states that there is a difference, for example H1: μ ≠ 100, then the rejection region is in both tails of the sampling distribution with areas equal to α/2. For example, if α (the level of significance) is set at 0.05 then the area in both tails will be 0.025 (see Figure 6.1). This is known as a two-
tailed test.

Figure 6.1 Two-tailed tests and test rejection areas (shaded)

A two-tailed test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are in both tails of the probability distribution. If H1 states that there is a direction of difference, for example μ < 100 or μ > 100, then the rejection region is in one tail of the sampling distribution and we have a one-tailed test — the tail being defined by the direction of the difference. If we have 'less than' (H1: μ < 100, for example), the left-hand tail is used, and this is known as a lower-tailed test (see Figure 6.2).

Figure 6.2 (Lower) one-tailed test and test rejection area (shaded)

Conversely, if we have 'greater than' (H1: μ > 100, for example), the right-hand tail is
used, and this is known as an upper-tailed test (see Figure 6.3).

Figure 6.3 Upper one-tailed test and test rejection area (shaded)

One- and two-sample tests

The hypothesis testing procedure will vary, depending on how many samples we use,
whether they are dependent on each other or independent, and what kind of
conclusions we want to draw. For each of these options, a slightly modified test is used.
Although the logic and the procedures are almost identical, the formulae used are
somewhat different. We will now describe the different tests available to us.

How we form the hypotheses and how we go about executing the test will depend,
among other things, on whether we are dealing with one or multiple samples. All the
tests we will cover in this chapter are applicable to only one or two samples. A one-sample test involves testing a sample parameter (e.g. the mean value, variance or proportion) against a perceived population value to ascertain whether there is a significant difference between the sample statistic and population parameter. For a two-sample test, we test one sample against another to ascertain whether there is a significant difference between them and, consequently, whether the two samples represent different populations.

When we talk about ‘one sample’ we mean that the results associated with one group of
observations, are compared with the population results. When we talk about ‘two
samples’ we mean that the results are compared between two groups of results
(products, individuals, groups, or anything similar). Typical examples are:

• One individual produces X of product as opposed to the average amount Y of the whole group. Is he 'in line' with the rest of the group?
• Are the results of the sales team in region A comparable with the rest of the
company, or is the team underperforming?
• The quality in one factory seems to be different from that in another. Should we
be concerned or is it within our overall quality assurance standards?
• After attending a sales course, a team’s sales effectiveness seems to have gone
up. Is this by chance or does the training really have impact on our sales force?
• You are planning a promotion campaign and want to give certain products for
free if a customer buys some other products. Can you afford this and how should
you structure the ‘bundle’?

Independent and dependent samples/populations

When we come to using the tests that apply to two samples/populations, most of them
assume that the samples come from two independent populations. In some cases, the
two populations are dependent. These two different scenarios can be defined as follows.

Two populations are said to be independent if the measured values of the items
observed in one population do not affect the measured values of the items observed in
the other population. For example, consider the following two populations: all
unmarried men aged 28 in Wales (population A) and all married men aged 28 in Wales
(population B). The variable we are interested in measuring in these men is the amount
of weight they have gained/lost since they were 18 years old. In this case, we would say
that the two populations are independent, because the amount of weight gained by an
individual in population A will not affect the amount of weight gained by an individual
in population B (and vice versa).

Two populations are dependent if the measured values of the items observed in one
population directly affect the measured values of the items observed in the other
population. Typically, the items in two dependent populations are paired, in the sense
that each item in one population is directly linked to a corresponding item in the other
population. For example, in a study to determine the effectiveness of sales training, we
define population A as the population of salespeople before the training course and
population B as the population of salespeople after the training course.

The variable being measured is the sales effectiveness (measured as a ratio of open
quotations to closed orders). In this example, each item in population A is directly
linked to each item in population B (the individual before training and the same
individual after training). Clearly, the sales effectiveness value after the training
(population B) is somewhat reliant on the original value before the training (population
A), that is, they are dependent.

Sampling distributions from different population distributions

In Chapter 4, we explored the central limit theorem, which states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows. Thus, even though we might not know the shape of the distribution where our data comes from, the central limit theorem says that we can treat the sampling distribution as if it were normal.

Furthermore, as we sample from a normal or non-normal distribution, the samples could be either large or small. And finally, for every case the standard deviation is either known or not known. All these sampling options are depicted below in Figure 6.4:

Figure 6.4 Different sampling scenarios

We will now describe briefly how to handle each of these cases.

Sampling from a normal distribution, large sample and known σ (AAA)

We already know that if the observed sample data X1, X2, …, Xn are (i) independent, (ii) have a common population mean μ, and (iii) have a common variance σ² (according to equation (6.1)), then the sample mean value X̄ has mean μ and variance σ²/n (equation (6.2)).

X ~ N(μ, σ²)    (6.1)

Then the sampling distribution of the sample means is:

X̄ ~ N(μ, σ²/n)    (6.2)

And the corresponding standardised Z equation is:

Z = (X̄ − μ) / (σ/√n)    (6.3)

Sampling from a non-normal distribution, large sample size and known σ (BAA)

Suppose now that the observed sample data X1, X2, …, Xn are (i) independent, (ii) have a common population mean μ, and (iii) unknown common variance σ². For populations that are not normally distributed we can make use of the central limit theorem. Because of the central limit theorem, many test statistics are approximately normally distributed for large samples (n ≥ 30). In this case, the unknown population standard deviation σ would be replaced by the sample standard deviation s (equation (6.4)).

Then the sampling distribution of the sample means is the same as equation (6.2), with the exception that we use s instead of σ:

X̄ ~ N(μ, s²/n)    (6.4)

And the corresponding standardised Z equation is the same as equation (6.3), with the exception that we use s instead of σ:

Z = (X̄ − μ) / (s/√n)    (6.5)

Sampling from a normal distribution, small sample size and unknown σ (ABB)

If the observed sample data X1, X2, …, Xn are (i) independent, (ii) have a common population mean μ, and (iii) unknown common variance σ², what do we do if the sample size is less than 30 and we do not know the population standard deviation?

If the population data are normally distributed, then we can replace the normal distribution with Student's t distribution with df = n – 1 degrees of freedom (equation (6.6)). Then the sampling distribution of the sample means is:

X̄ ~ t_df(μ, s²/n)    (6.6)

And the corresponding standardised t equation is:

t = (X̄ − μ) / (s/√n)    (6.7)
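To make the distinction concrete, the minimal Python sketch below (our illustration, with made-up sample figures) computes the standardised statistic both ways: equation (6.3) when σ is known, and equation (6.7) with s and the t distribution when it is not.

# Minimal sketch: standardised test statistics from equations (6.3) and (6.7),
# using made-up illustrative figures (hypothesised mean 100, sample of size 25).
from math import sqrt
from scipy.stats import norm, t

mu0, n = 100, 25
x_bar = 103.2           # illustrative sample mean
sigma = 8.0             # population standard deviation, if it were known
s = 7.4                 # sample standard deviation, when sigma is unknown

z = (x_bar - mu0) / (sigma / sqrt(n))     # equation (6.3)
t_stat = (x_bar - mu0) / (s / sqrt(n))    # equation (6.7), df = n - 1
print(round(z, 2), round(t_stat, 2))      # 2.0 and about 2.16

# Upper-tail probabilities of values this extreme under H0
print(round(1 - norm.cdf(z), 4), round(1 - t.cdf(t_stat, df=n - 1), 4))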

Why a sample size of 30?

This is a historical issue from the time before appropriate software packages for statistical analysis were available, when a distinction would be made between small-sample and large-sample versions of t tests. The small- and large-sample versions did not differ in how we calculated the test statistic but in how the critical test statistic was obtained.

For the small-sample test, one used the critical value of t, from a table of critical t-values.
For the large-sample test, one used the critical value of z, obtained from a table of the
standard normal distribution. The other difference is that to calculate t-values, we need
one more piece of information, the degrees of freedom.

Today we can use statistical software to carry out t tests, such as Excel and SPSS, which
will print out for a given test the value of the test statistic and test statistic p-value.
When we solve statistical hypothesis problems in this and later chapters, we will use the
manual/critical tables method, the Excel method, and where possible provide the SPSS
solution. As a reminder, Figure 6.5 shows a comparison between the normal and t
distribution, where the number of degrees of freedom increases from 2 to 30. We
observe that the difference between the normal and t distribution decreases as the
number of degrees of freedom increases. In fact, very little numerical difference exists
between the normal and t distributions when we have sample sizes of at least 30.

Figure 6.5 Comparison between normal and t distribution
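The convergence shown in Figure 6.5 can also be seen numerically. The short sketch below (our illustration) prints the two-tailed 5% critical value of t for increasing degrees of freedom alongside the normal value of about 1.96.

# Minimal sketch: two-tailed 5% critical values of the t distribution for
# increasing degrees of freedom, compared with the normal value of about 1.96.
from scipy.stats import norm, t

print("normal:", round(norm.ppf(0.975), 3))
for df in (2, 5, 10, 30, 100):
    print(f"t, df={df}:", round(t.ppf(0.975, df), 3))
# By df = 30 the t critical value (about 2.04) is already close to 1.96.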

Sampling from a normal distribution, large sample and unknown σ (AAB)

In this case, we are sampling from a normal distribution with an unknown σ but a large sample size. Therefore, the sampling distribution is given by equation (6.4).

Sampling from a normal distribution, small sample and known σ (ABA)

In this case, we are sampling from a normal distribution with a known σ and a small sample size. Therefore, the sampling distribution is given by equation (6.2).

Sampling from a non-normal distribution, large sample and unknown σ (BAB)

In this case, we are sampling from a non-normal distribution with an unknown σ and a large sample size. Therefore, the sampling distribution is given by equation (6.4) or equation (6.6).

Sampling from a non-normal distribution, small sample and known σ (BBA)

In this case, we are sampling from a non-normal distribution with a known σ and a small sample size. In this situation, we would use a non-parametric method described in Chapter 8.

Sampling from a non-normal distribution, small sample and unknown σ (BBB)

In this case, we are sampling from a non-normal distribution with an unknown σ and a small sample size. In this situation, we would use a non-parametric method described in Chapter 8.

Check your understanding

State the null hypothesis for the following statements:

X6.1 The incidence of spending is higher for men than for women.

X6.2 On the playground, more children play on the swings than on the slide.

X6.3 A high score on a quantitative methods module predicts success as a business analyst.

X6.4 Dog Luxuries Ltd sells luxury dog food direct to customers via its e-commerce
web site. The company monitors sales in real time and undertakes regular
checks on sales. The company would like to check if sales have changed since
January 2018 when 5500 sales per month were recorded with a sales value to be

recorded at the end of May 2018 for the May sales period. Use this information to
answer the following questions:

a. State the null and alternative hypotheses?


b. Is the alternative hypothesis one- or two-tailed?
c. If you changed the word ‘changed’ to ‘reduced’ how would this affect your
answers to (a) and (b)?

6.3 Introduction to hypothesis testing procedure


The hypothesis testing procedure can be condensed into just five steps. Some steps are just very simple statements, others contain calculations, and some involve making decisions.

Steps in hypothesis testing procedure

We’ll define the five steps that apply to every hypothesis test. Every step is colour-coded
to correspond with the same colour area in the spreadsheet to make it easier to identify
what is happening during every step. In this section, we will not conduct any tests, just
explain how the procedure works.

Step 1 Provide the formal hypothesis statements H0 and H1

As we said, the null hypothesis (H0) and alternative hypothesis (H1) are two
competing statements concerning the population parameters of interest. These
statements determine the entire testing procedure. The general idea is to state
H0 so that we can reject it. How H0 and H1 are stated will depend on the problem
and the data available.

We know that, on average, men are taller than women. How do you prove it?
Let’s first state what we want to prove: that men are taller than women. We took
a sample of 37 men and measured their height. Also, we took a sample of 41
women and measured their height. (Note that the two groups do not always have
to be the same size.) We have population averages for both groups: μ1 is the average for men and μ2 is the average for women. Our hypotheses could be stated as follows:

H0: μ1 ≤ μ2 (on average men are not taller than women)

H1: μ1 > μ2 (on average men are taller than women)

Note that because we want to prove that men are taller, this becomes H1. As we
said, H0 is usually stated in anticipation of being rejected.

Step 2 Determine the test to apply for the given hypothesis statement

The statistical test we apply will depend, for example, on how many samples we
have (one or two or more), which statistic are we using (the mean or
proportions), how much we know about the population (we know the mean

and/or variance, or neither), and the size of the sample or population. We will
address all these criteria as we go through specific tests.

Figure 7.1 from the following Chapter provides a high-level map of how to select
the appropriate test.

For the above example with the average height for men and women, we would
choose a two-sample test for independent samples, comparing two means and
assuming the variances are unequal (a t-test). The following sections in this
chapter will explain all these terms.

Step 3 Set the level of significance level, 

The significance level, usually denoted by the Greek letter alpha (α), is a fixed
probability of making the error of ‘rejecting the null hypothesis H0, even though
it is true’ (more about that a bit later). This probability is arbitrary and specified
by the person conducting the test. It also represents the degree of accuracy that
the test should exhibit. For example, if we choose α = 0.05, we are saying that we
would like to make the mistake of incorrectly rejecting H0 in at most 5% of cases
when this test is conducted.

Another way to think about this is to say that the level of significance represents
the amount of risk that an analyst will accept when deciding. The use of the
significance level is connected with the aim of seeking to put beyond reasonable
doubt the notion that the findings are due to chance. The value of α normally takes the value 5% (0.05) or 1% (0.01), but it could be any other value. The value of α depends upon how sure you want to be that your decisions are an accurate
reflection of the true population relationship.

For the above example with the average height for men and women, the phrase
‘taller than’ implies that we are willing to take a chance on rejecting H0, when it
might be true. This means that we are ready to accept that in 5% of cases, we
might be rejecting the hypothesis which is true. You can also think about this
percentage as the confidence level (1 – α = 1 – 0.05 = 0.95), implying that we can
be 95% certain that our conclusion is correct.

Step 4 Extract the relevant statistic

A test statistic is a quantity calculated from sample data (an equation or a


formula). It is used to determine if H0 should be rejected. This statistic is always
calculated under the assumption that H0 is true. Most of the time we imply that
the data are distributed in accordance with the normal distribution (z) or in
accordance with the t distribution (t).

The critical value of the test statistic will be denoted by either zcri or tcri (or
sometimes zα or zα/2). These values (for the given level of significance α) are
compared with the calculated values that we call zcal or tcal. You must be patient
as we will explain this shortly.

There are two alternative ways to calculate the necessary statistic. We can either
use the critical value of the test statistic (zcri), or alternatively the so-called p-
value, which is the probability associated with this calculated statistic.

The critical value is a quantile (related to the probability α) from the sampling
distribution of the test statistic. Critical values separate the range of test statistic values for which we do not reject H0 from the rejection region: if the observed value of the test statistic lies in the rejection region defined by the critical values, then we
reject H0. The p-value is the probability of getting a value of the test statistic as
extreme as or more extreme than that observed by chance alone, if the null
hypothesis is true. Typically, if the p-value is less than α, we will reject the null
hypothesis.

For the above example with the average height for men and women, let’s assume
that zcri = 1.65 and we calculated zcal = 3.5. We know that zcri corresponds to the α
value, which is why it is sometimes called zα. Think of α as the probability level
corresponding to zcri = 1.65. As an alternative method, we also calculated the p-
value. Let’s assume that the p-value is 0.0002. Think of this as the probability level
corresponding to zcal=3.5.

Step 5 Make a decision

The final step in the hypothesis testing procedure is to decide whether the null
hypothesis should be rejected. We do this by deciding whether the test statistic is
‘large’, that is, whether the ‘distance’ between a sample statistic and a
hypothesised parameter is large.

Based on the value of the test statistic, we can determine if the sample statistic is
‘close’ to the hypothesised parameter (in which case the evidence suggests that
we fail to reject H0), or if it is ‘far away’ from the hypothesised parameter (in
which case we will reject H0). To facilitate this interpretation, we consider two
different approaches that we can use to decide what constitutes ‘far away’ and
what is ‘close’. We can use either the critical test statistic value or the p-value.

In both approaches, the way in which we make a decision depends on the nature
of the alternative hypothesis (whether it is right-sided, left-sided or two-sided).

Example 6.1

For the above example with the average height for men and women, if zcri = 1.65 and zcal = 3.5, this means that zcal > zcri. We can therefore reject, at the 0.05 level of significance, the null hypothesis H0, which states that men are not taller than women. In fact, we are 95% confident that we have not made a wrong decision based on our samples.

Using the alternative method, because the p-value is 0.0002 and α = 0.05, we can see
that the p-value is less than α. Because of that, we can reject the null hypothesis H0. As
expected, although we used two alternative methods, they lead us to identical
conclusions. The situation can be visually summarised as shown in Figure 6.6.

Figure 6.6 Comparison between α and p-value, and between zcal (zα) and zcri

Because we used the phrase ‘taller than’, we had to use a one-tailed test. If we had used
the phrase ‘taller or shorter than’, then we would have had to use a two-tailed test. In
the one-tailed example above, the yellow area represents all the values that are greater
than 1.65 on the x-axis and smaller than 0.05 on the y-axis. This is called the rejection
area.

Every zcal value, if we use the zcri method, that is to the right of 1.65, falls in the rejection
area. This means if zcal > zcri, we must reject H0. The same logic applies if we use the p-
value method. Every value of p that is less than 0.05 (in this case) falls in the rejection
area, and H0 must be rejected. Rejecting H0 means that we need to accompany it with a
statement that reads something like this: ‘We reject the hypothesis that men are not
taller than women at the 5% level of significance’.

This implies that we are 95% certain that men are, on average, taller than women. If we
used the left-tailed test, the same logic would apply, only we would be looking for the
numbers to the right of zcri. The two-tailed test uses similar reasoning, as we will see
shortly.
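Although this textbook works in Excel and SPSS, the decision logic above can also be sketched in a few lines of Python using the scipy library. The sketch below is purely illustrative: it assumes the values quoted in Example 6.1 (zcal = 3.5, α = 0.05, upper one-tailed test) and simply reproduces the two equivalent decision rules.

from scipy.stats import norm

alpha = 0.05
z_cal = 3.5                          # calculated test statistic from Example 6.1

# critical value method (upper one-tailed test)
z_cri = norm.ppf(1 - alpha)          # approximately 1.645
reject_by_critical = z_cal > z_cri   # True: z_cal falls in the rejection area

# p-value method
p_value = 1 - norm.cdf(z_cal)        # approximately 0.0002
reject_by_p = p_value < alpha        # True: same conclusion as the critical value method

Both flags agree, mirroring the conclusion reached above with either method.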

How do we make decisions?

As we already demonstrated whether we use the critical test statistic value or p-value
makes no difference. These two approaches to hypothesis testing lead to identical
conclusions.

Intuitively, we can say that the critical test statistic values are thresholds indicating
when a test statistic’s value is close to zero or not. If the test statistic exceeds the critical
value, its value is ‘far away’ from zero (there is a ‘large distance’ between the sample
statistic and the hypothesised parameter). Consequently, we would reject the null
hypothesis H0 and accept the alternative hypothesis H1.

If the test statistic lies anywhere else, then it is ‘close’ to zero (the ‘distance’ between the
sample statistic and the hypothesised parameter is small enough to be considered

negligible). Consequently, we would fail to reject the null hypothesis H0 and reject the
alternative hypothesis H1.

The p-value approach is directly related to the critical value approach, except that
instead of basing the decision on the test statistic’s actual value compared to a critical
value, we base it on a probability associated with the test statistic, called the p-value.
We then compare the p-value to the level of significance 𝛼, specified for the test, and
decide between rejecting and failing to reject the null hypothesis, H0. Once again, the
nature of the alternative hypothesis will affect the way in which we make the decision.

Example 6.2

To illustrate how critical values and rejection regions are defined, we consider the
situation in which we test if the population mean is equal to 100 hours, H0: μ = 100. The
test statistic selected for this test follows a standard normal distribution for each of
these different alternatives if we assume that the sample data used in this test are
obtained from a normally distributed population with known variance, σ2.

The test statistic is given by equation (6.3):

z = (X̄ − 100) / (σ/√n)

That is, z follows a standard normal distribution if the conditions stated in the null
hypothesis are true (we usually shorten this by simply saying ‘z has a standard normal
distributed under H0’).

Note that we distinguish between the random variable z and the actual calculated value
of the test statistic by denoting the calculated value by zcal. Just to make it easier, we will
select α = 0.05 for all our examples.

Left-sided test

Critical value method, Zcri

The hypothesis statement for a left-sided test that tests whether the population
mean is 100 hours versus the alternative that it is less than 100 hours is:

H0 : μ = 100

H1 : μ < 100

The critical value for this test is the lower α = 0.05 quantile of the standard normal distribution: it will be –Zα = –Z0.05 = –1.645.

This value can be obtained using statistical tables or using a software package
such as Excel. If the observed value of the test statistic is smaller than this critical
value (if it falls in the rejection region), then we reject H0. In other words, if Zcal <
Zα (or, as it is sometimes written, Zcal < Zcri), we reject H0.
p-value method, p

We calculate the p-value using the formula:

p = P (Zα < Zcal ) (6.8)

To perform the test, we compare the p-value to the value of the significance level
(α). If the p-value is smaller than α, we reject H0. Figures 6.7 and 6.8 show two
scenarios for a left-sided test: Scenario I (Figure 6.7) shows the case where the
test statistic lies in the rejection region, and scenario II (Figure 6.8) shows the
case where the test statistic lies in the ‘fail to reject H0’ region.

Figure 6.7 Left-sided test showing both critical values and p-values –
scenario 1: H0: μ = x, H1: μ < x. If the p-value is less than α, reject H0

Figure 6.8 Left-sided test showing both critical values and p-values –
scenario 2: H0: μ = x, H1: μ < x. If the p-value is greater than α, fail to reject H0
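As a minimal illustration (again outside the book's Excel/SPSS workflow), the left-sided decision rule of equation (6.8) can be sketched in Python; the value of zcal used here is a hypothetical figure chosen only to show the mechanics.

from scipy.stats import norm

alpha = 0.05
z_cal = -2.20                  # hypothetical calculated test statistic

z_cri = norm.ppf(alpha)        # lower critical value, approximately -1.645
p_value = norm.cdf(z_cal)      # area to the left of z_cal, equation (6.8)

reject_h0 = z_cal < z_cri      # equivalent to: p_value < alpha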

Right-sided test

Critical value method

The hypothesis statement for a right-sided test that tests whether the population
mean is 100 hours versus the alternative that it is greater than 100 hours is:

H0 : μ = 100

H1 : μ > 100

The critical value for this test is the upper α = 0.05 quantile for the standard
normal distribution: it will be Zα = Z0.05 = 1.645. Again, this value can be obtained
using the Excel function =NORM.S.INV(). If the observed value of the test statistic
is greater than this critical value (if it falls in the rejection region), then we reject
H0. In other words, if Zcal > Zα, we reject H0.

p-value method

We calculate the p-value using the formula:

p = P (Z𝛼 > Zcal ) (6.9)

To perform the test, we compare the p-value to α. If the p-value is smaller than
α, we reject H0. Figures 6.9 and 6.10 below show the two scenarios for a right-
sided test: Scenario I (Figure 6.9) shows the case where the test statistic lies in
the rejection region, and scenario II (Figure 6.10) shows the case where the test
statistic lies in the ‘fail to reject H0’ region.

Figure 6.9 Right-sided test showing both critical values and p-values –
scenario 1: H0: μ = x, H1: μ > x. If the p-value is less than α, reject H0

Figure 6.10 Right-sided test showing both critical values and p-values –
scenario 2: H0: μ = x, H1: μ > x. If the p-value is greater than α, fail to reject H0

Two-sided test

Critical value method

The hypothesis statement for a two-sided test used to test whether the
population mean is 100 hours versus the alternative that it is not 100 hours is:

H0 : μ = 100

H1 : μ ≠ 100

There are two critical values in this test: the lower α/2 = 0.025 quantile and the
upper α/2 = 0.025 quantile, both for the standard normal distribution, that is,
they will be Zα/2 = – Z0.025 = – 1.96 and Zα/2 = Z0.025 = 1.96. If the observed value of
the test statistic is greater than Zα/2 or is less than – Zα/2, then we reject H0. In
other words, if Zcal < – Z(α/2), or if Zcal > Z(α/2), we reject H0. The critical values and
rejection region are illustrated in Figure 6.11.

p-value method

If the calculated value of the test statistic Zcal is positive, then the p-value is
calculated using the formula:

p = P(Z𝛼 < − Zcal ) + P(Z𝛼 > Zcal )

p = 2 × P(Z𝛼 > Zcal ) (6.10)

If the calculated value of the test statistic Zcal is negative, then the p-value is
calculated using the formula:

p = P(Z𝛼 > − Zcal ) + P(Z𝛼 < Zcal )

p = 2 × P(Z𝛼 < Zcal ) (6.11)

Note that this calculation differs from the previous two because we need to
calculate the probability to the left and to the right. Once again, to decide, we
compare the p-value to α. If the p-value is smaller than α, we reject H0. The
critical values and rejection region are illustrated in Figures 6.11 and 6.12.

Figure 6.11 Two-sided test showing both critical values and p-values –
scenario 1: H0: μ = x, H1: μ ≠ x. If the p-value is less than α, reject H0

Figure 6.12 Two-sided test showing both critical values and p-values –
scenario 2: H0: μ = x, H1: μ ≠ x. If the p-value is greater than α, fail to reject H0
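The right-sided and two-sided rules of equations (6.9)–(6.11) follow the same pattern. The Python sketch below is illustrative only and uses a hypothetical zcal; note how the two-sided p-value doubles the one-tailed probability.

from scipy.stats import norm

alpha = 0.05
z_cal = 2.30                              # hypothetical calculated test statistic

# right-sided test (H1: mu > x)
z_cri_upper = norm.ppf(1 - alpha)         # approximately +1.645
p_right = 1 - norm.cdf(z_cal)             # equation (6.9)

# two-sided test (H1: mu not equal to x)
z_cri_two = norm.ppf(1 - alpha / 2)       # approximately 1.96 (use +/- this value)
p_two = 2 * (1 - norm.cdf(abs(z_cal)))    # equations (6.10) and (6.11)

reject_right = p_right < alpha
reject_two = p_two < alpha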

Types of errors and statistical power

Making decisions always implies that there is a possibility of making an error. When
making decisions in hypothesis testing, we can distinguish between two types of
possible errors: Type I error and Type II error.

Null hypotheses again

Earlier we defined the notion of the null hypothesis (H0), which is the hypothesis that
the phenomenon to be demonstrated is in fact absent. For example, it is the hypothesis
that there is no difference between a population mean and an observed sample mean or
no difference between the means (μ1 = μ2) in a t test.

The null hypothesis is important because it is what researchers are most often testing in
their studies. If they can reject the null hypothesis at a certain alpha level (e.g., p < 0.05),
then they can accept as probable whatever alternative hypothesis makes sense, for
example, that the population mean is not a predefined value for reasons other than
chance (e.g., μ < 100 at p < 0.05) in a t test. Once again, focusing on rejecting the null
hypothesis and declaring a ‘significant’ (at p < 0.05) mean difference is how researchers
typically proceed.

Type I and Type II error

Most often, the probability statements in the above example are taken to indicate the
probability that the researcher will accept the alternative hypothesis when the null
hypothesis is true (see α in top left-hand corner of Table 6.3). That seems to be the
primary concern of most researchers in their studies. However, there is another way to
look at these issues that involves what are called Type I and Type II errors.

From this perspective, α is the probability of making a Type I error (accepting the
alternative hypothesis when the null hypothesis is true), and β is the probability of
making a Type II error (accepting the null hypothesis when the alternative
hypothesis is true). By extension, 1 – α is the probability of not making a Type I error,
and 1 – β is the probability of not making a Type II error.

                                      State of Nature (reality)
                                      H0 is true                          H0 is false
Decision      Reject H0               Type I error (false positive)       Correct (true positive)
about the                             Probability of making this          Probability of getting this
null                                  error = α                           correctly, Power = 1 – β
hypothesis    Fail to reject H0       Correct (true negative)             Type II error (false negative)
(H0)                                  Probability of getting this         Probability of making this
                                      correctly, the level of             error = β
                                      confidence = 1 – α
Table 6.3 Scenarios of rejecting, or not rejecting, H0 that is true or not true

The primary concern of most researchers is to guard against Type I errors, errors that
would lead to interpreting observed differences as non-chance (or probably real) when

they are due to chance fluctuations. However, researchers often don’t think about Type
II errors and their importance. Recall that Type II errors are those that might lead us to
accept that a set of results is null (i.e., there is nothing in the data but chance
fluctuations) when the alternative hypothesis is true. Researchers may be making Type
II errors every time they accept the null hypothesis because they are so tenaciously
focused on Type I errors (α) while completely ignoring Type II errors (β).

Statistical power

The term statistical power (or just power) represents the probability that you will reject
a false null hypothesis and therefore accept a true alternative hypothesis. This can be
written as power = P(rejecting H0 given H0 is false). Based on these definitions, we can
write the following equation:

Statistical power = 1 – β (6.12)

For example, if the Type II error (β) is equal to 23% then the statistical power is 1 – 0.23 = 77%, and we would conclude that we would correctly reject a false null hypothesis 77% of the time. If statistical power is high, the probability of making a Type II error, or
concluding there is no effect when, in fact, there is one, is low. The statistical power of
an experiment is determined by the following factors:

a. The level of significance to be used.


b. The variability of the data (as measured, for example, by their standard
deviation).
c. The size of the difference in the population it is required to detect.
d. The size of the samples.

By setting the power (often 80%) and any three of these four values, the remaining one can be calculated. However, since we usually use a 5% level of significance, we need only set the power and two out of (b), (c) and (d) to determine the remaining one. The variability of the
data (b) needs to be approximately assessed, usually from previous studies or from the
literature, and then the sample sizes (d) can be determined for a given difference (c) or,
alternatively, for a specific sample size the difference likely to be detected can be
calculated.
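To make the interplay of these four factors concrete, the sketch below computes the power of an upper one-tailed one-sample z test in Python. All the numbers (σ, the difference to detect, and n) are assumed purely for illustration; the formula used, power = 1 − Φ(zα − δ√n/σ), is the standard expression for this simple case and is not a general power routine.

from math import sqrt
from scipy.stats import norm

alpha = 0.05      # (a) level of significance
sigma = 15        # (b) assumed variability of the data
delta = 5         # (c) size of the difference we wish to detect
n = 45            # (d) sample size

# power of an upper one-tailed one-sample z test under these assumptions
z_alpha = norm.ppf(1 - alpha)
power = 1 - norm.cdf(z_alpha - delta / (sigma / sqrt(n)))
beta = 1 - power  # probability of a Type II error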

Software packages to calculate statistical power

There are several software packages that specialise in the calculation of statistical
power:

G*Power - Free software

G*Power is free and can be downloaded from http://gpower.hhu.de. G*Power is


a tool to compute statistical power analyses for many different t tests, F tests, χ2
tests, Z tests and some exact tests. G*Power can also be used to compute effect
sizes and to display graphically the results of power analyses.

SPSS - Not free software

IBM SPSS SamplePower software enables you to quickly find the right sample
size for your research and test the possible results before you begin your study.
The software provides advanced statistical techniques such as means and
differences in means, correlation, one-way and factorial analysis of variance
(ANOVA), regression and logistic regression, survival analysis and equivalence
tests.

Check your understanding

X6.5 A supermarket is supplied by a consortium of milk producers. A recent quality


assurance check suggests that the amount of milk supplied is significantly
different from the quantity stated within the contract.

a. Define what we mean by significantly different.


b. State the null and alternative hypothesis statements.
c. For the alternative hypothesis do we have a two-tailed, lower one-tailed, or
upper one-tailed test?

X6.6 A business analyst is attempting to understand visually the meaning of the


critical test statistic and the p-value. For a z-value of 2.5 and a significance level
of 5% provide a sketch of the normal probability distribution and use the sketch
to illustrate the location of the following statistics: test statistic, critical test
statistic, significance value, and p-value (you do not need to calculate the values
of zcri or the p-value).

X6.7 At the 2% significance level, what are the critical z values for (a) a two-tailed test,
(b) a lower one-tailed test, and (c) an upper one-tailed test?

X6.8 A marketing manager has done a hypothesis test to test for the difference
between accessories purchased for two different products. The initial analysis
has been performed and an upper one-tailed z test chosen. Given that the z value
was calculated to be 3.45, find the corresponding p-value. What would you
conclude from this result?

Chapter summary
In this chapter we have introduced the important statistical concept of hypothesis
testing. What is important in hypothesis testing is that you can recognise the nature of
the problem and should be able to convert this into two appropriate hypothesis
statements (H0 and H1) that can be tested. If you are comparing more than two samples
then you would need to employ advanced statistical parametric hypothesis tests that
are beyond the scope of this book – these statistical tests are called analysis of variance
(ANOVA) tests.

In this chapter we have described a simple five-step procedure to aid the solution
process. The main emphasis is placed on the use of the p-value, which quantifies the probability of obtaining a result at least as extreme as the one observed if the null hypothesis (H0) were true. Thus, if the measured p-value is greater than α then we fail to reject H0 and the result is not statistically significant. Remember, the
value of the p-value will depend on whether we are dealing with a one- or two-tailed
test. So, take extra care with this concept since this is where most students slip up.

The alternative part of the decision-making process described the use of the critical test
statistic in making decisions. This is the traditional textbook method which uses
published tables to provide estimates of critical values for various test values.
Moreover, we learned that we have two types of errors with hypothesis testing: Type I,
when we reject a true null hypothesis; and Type II, when we fail to reject a false null
hypothesis. This concept was then extended to the concept of statistical power, when
we reject a false null hypothesis, and the relationship between statistical power and the
probability of making a Type II error.

Test your understanding


TU6.1 Calculate the critical z value if you have a two-tailed test and you choose a
significance level of 0.05 and 0.01.

TU6.2 If you conduct a z test and it is a lower one-tailed test, what is your decision if the
significance level is 0.05 and the value of the test statistic is – 2.01?

TU6.3 Calculate the value of the p-value for question TU6.2.

TU6.4 Calculate the value of the z statistic if the null hypothesis is H0:  = 63, where a
random sample of size 23 is selected from a normal population with a sample
mean of 66 (assume the population standard deviation is 15).

TU6.5 Calculate the probability that a sample mean is greater than 68 for the TU6.4
question, when the alternative hypothesis is (a) two-tailed, and (b) upper one-
tailed (assume the significance level is 0.05).

TU6.6 Repeat TU6.5 but with the information that the population standard deviation is not known. Describe the test you would use to solve this problem. Given that the sample standard deviation was estimated to be 16.2, answer TU6.5 (a) and (b).

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A6Wa Common assumptions about data


2. A6Wb Meaning of the p-value

Chapter 7 Parametric hypothesis tests
7.1 Introduction and Learning Objectives
As we describe a variety of tests in this Chapter, you will notice that in general we will
only focus on testing three different statistics: the mean, the proportion and the
variance (in fact, the variance is only covered in online chapters).

As you will see, we will have a variety of test permutations, because we could have one or more samples, the samples could be dependent or independent, they could be large or small, and the population standard deviation may be known or unknown. This variety of permutations often “clouds” the hypothesis testing chapters.

It is difficult, at least at first glance, to differentiate one test from another. Make sure you
can clearly understand which test is applied to what combination of conditions.

This chapter is dedicated to parametric hypothesis tests only. Figure 7.1 shows a high-
level classification of some of the most important hypothesis tests.

Figure 7.1 Which test to use?

Regardless of whether we use one- or two-sample tests, in this chapter we will focus only
on testing the mean and proportion. The online material also includes variance tests.
Table 7.1 provides a list of parametric statistical tests described in this book and
identifies which methods are solved using Excel and SPSS.

Statistics test                                                              Excel    SPSS
One sample z test for the population mean                                    Yes      No
One sample t test for the population mean                                    Yes      Yes
One sample z test for the population proportion                              Yes      No
Two sample z test for two independent population means                       Online   No
Two sample z test for two independent population proportions                 Online   No
Two sample t test for two population means (independent samples,
equal variance)                                                              Yes      Yes
Two sample t test for two population means (independent samples,
unequal variance)                                                            Yes      Yes
Two sample t test for two population means (dependent samples)               Yes      Yes
Two sample F test for two population variances                               Online   Online
Table 7.1 Statistics tests covered in textbook

Learning objectives

On completing this chapter, you will be able to:

1. Understand the concept of parametric hypothesis testing for one and two samples.
2. Be able to apply the tests for small and large samples as well as if the population
standard deviation is known or not.
3. Conduct one- and two-sample hypothesis tests for the sample mean and proportion.
4. Solve problems using Microsoft Excel and IBM SPSS Statistics software packages.

7.2 One-sample hypothesis tests


In this section we will explore the application of one-sample parametric hypothesis
testing to carry out one-sample z tests for the population mean, one-sample t tests
for the population mean, and one-sample z tests for the population proportion.

One-sample z test for the population mean

The first test we will explore is the one-sample z test for the population mean, with the
following test assumptions:

1. The sample is a simple random sample from a defined population.


2. The variables of interest in the population are measured on an interval/ratio scale.
3. The population standard deviation is known.
4. The variable being measured is normally distributed in the population.

When dealing with a normal sampling distribution we calculate the z statistic using
equation (6.3):

Zcal = (X̄ − μ) / (σ/√n)

As we already know, by convention we take X̄ to be the sample mean, μ is the population mean and σ is the population standard deviation. You should also remember that the denominator in equation (6.3), σ/√n, is called the standard error (or, to give it its full title, the standard error of the sampling distribution of the means, σX̄).

Example 7.1

A toy manufacturer undertakes regular assessment of employee performance. This


performance testing consists of measuring the number of toys that employees can
make per hour, with the historical data recording a rate of 85 per hour with a standard
deviation of 15 units. All new employees are tested after a period of training. A new
employee is tested on 45 separate random occasions and found to have an output given
in table 7.2. Does this indicate that the new employee's output is significantly different
from the average output? Test at the 5% significance level.

ID Sample data ID Sample data


1 88.6 24 81.8
2 63.8 25 88.8
3 95.6 26 83.2
4 94.4 27 94.0
5 118.2 28 96.7
6 84.4 29 79.9
7 81.6 30 93.1
8 78.6 31 101.7
9 56.6 32 55.5
10 77.8 33 88.6
11 87.2 34 79.3
12 82.0 35 80.7
13 100.3 36 93.3
14 91.6 37 102.5
15 126.2 38 71.9
16 99.8 39 91.1
17 94.1 40 109.6
18 92.5 41 96.0
19 96.0 42 82.6
20 85.8 43 108.3
21 69.0 44 95.5
22 90.3 45 76.8
23 88.0
Table 7.2 Number of units per hour after training

The five-step procedure to conduct this test progresses as follows:

Step 1 State hypothesis

Null hypothesis H0:  = 85 (population mean is equal to 85 units per hour)

Alternative hypothesis H1:  ≠ 85 (population mean is not 85 units per hour)

The ≠ sign implies that a two-tailed test will be appropriate.

Step 2 Select test

We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:

• Number of samples – one sample.


• The statistic we are testing – testing for a difference between a sample
mean and population mean (µ = 85). Population standard deviation is
known (σ = 15).
• Size of the sample – relatively large (n = 45 ≥ 30).
• Nature of population from which sample drawn – population distribution
is not known but sample size is large. For large n, the central limit
theorem states that the sample mean approximately follows a normal
distribution.

Then the sampling distribution of the sample means is, from equation (6.2),

X̄ ∼ N(μ, σ²/n)

where σ is known (σ = 15).

Therefore, a one-sample z test of the mean is selected.

Step 3 Set the level of significance

α = 0.05

Step 4 Extract the relevant statistic

When dealing with a normal sampling distribution we calculate the one-sample z


test statistic using equation (6.3):

Zcal = (X̄ − μ) / (σ/√n)

If we undertake the calculations, we find:

Sample size n = 45

Sample mean X̄ = 88.74 to 2 decimal places

Substituting these values into equation (6.3) gives:

Zcal = (88.74 − 85) / (15/√45) = +1.6726 to 4 decimal places

Step 5 Make a decision

The calculated test statistic Zcal = + 1.6726.

We need to compare this test statistic with the critical z-test statistic, Zcri. For 5%
significance and a two-tailed test, Zcri = ±1.96 (see Figure 7.2).

Figure 7.2 Areas of the standardised normal distribution

Does the test statistic lie within the rejection region?

Compare the calculated and critical z values to determine which hypothesis


statement (H0 or H1) to accept. We observe that the sample z value (zcal) does not
lie in the upper rejection zone (+1.6726 < +1.96), so we will fail to reject H0. The
sample mean value (88.74 units per hour) is close enough to the population
mean value (85 units per hour) to allow us to assume that the sample comes
from that population.

We conclude from the evidence that there is no significant difference, at the 0.05
level, between the new employee's output and the firm’s existing employee
output.
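For readers who also use Python, the calculation above can be reproduced with the short sketch below (scipy library). It takes the summary figures quoted in this example (X̄ = 88.74, μ = 85, σ = 15, n = 45) and is intended only as a cross-check on the manual working; the book's own solutions follow in Excel and SPSS.

from math import sqrt
from scipy.stats import norm

mu0, sigma, n = 85, 15, 45
x_bar = 88.74                                 # sample mean from Table 7.2

z_cal = (x_bar - mu0) / (sigma / sqrt(n))     # approximately +1.6726
z_cri = norm.ppf(1 - 0.05 / 2)                # two-tailed critical value, about 1.96
p_two = 2 * (1 - norm.cdf(abs(z_cal)))        # approximately 0.0944

reject_h0 = p_two < 0.05                      # False: fail to reject H0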

Excel solution

Figures 7.3 and 7.4 illustrate the Excel solution.

Figure 7.3 Excel data for Example 7.1

Figure 7.4 Excel solution for Example 7.1

Critical Z value method

Two-tailed critical test statistic Zcri = ±1.96


Zcal = 1.6726
Given Zcal lies between the lower and upper Z critical (- 1.96 < 1.6726 < + 1.96),
then fail to reject the null hypothesis.

P-value method

Two-tailed p-value = 0.0944


Given two-tailed p-value > significance level α (0.0944 > 0.05), fail to reject the
null hypothesis.

We conclude that the evidence suggests there is no significant difference, at the 0.05 level, between the new employee's output and the firm's existing employee output after the training. Figure 7.5 illustrates the relationship between using the critical test statistic
and using the p-value method in deciding on which hypothesis statement to accept.

Figure 7.5 Relationship between the p-value, test statistic, and critical test
statistic for a two-tailed test

Table 7.3 provides the Excel functions to calculate the critical z test statistic or p-values.

Calculation        P-values                                       Critical test statistic
Lower one-tail     =NORM.S.DIST(z value, TRUE)                    =NORM.S.INV(significance level)
Upper one-tail     =1-NORM.S.DIST(z value, TRUE)                  =NORM.S.INV(1-significance level)
Two-tail           =2*(1-NORM.S.DIST(ABS(z value), TRUE))         =NORM.S.INV(significance level/2) for the lower
                                                                   critical z value and
                                                                   =NORM.S.INV(1-significance level/2) for the upper
                                                                   critical z value
Table 7.3 Excel functions to calculate critical values and p-values

Confidence interval method

From the previous chapter, we know that the true mean resides somewhere in the interval defined by equation (5.16). This was part of the interval estimation procedure, where we also used the expression ‘confidence interval’. We can use this confidence interval to decide between the null and alternative hypotheses. The confidence interval for the population mean μ is given by rearranging equation (6.3) to give equation (7.1):

μ = X̄ ± Zcri × (σ/√n)     (7.1)

If we carry out the calculation for the 5% significance level, then the 95% confidence
interval for the population mean would be from 84.36 to 93.12 as illustrated in Figure
7.6.

Figure 7.6 Confidence interval solution to make a hypothesis test decision

Cells N22 and N26 in Figure 7.6 calculate the same value, but using two different Excel functions. The end results, in cells N23:N24 and N27:N28, are identical.

Observe that this 95% confidence interval (84.36 to 93.12) does contain the known
population mean (85) in the hypothesis test. We conclude from the evidence that there
is no significant difference, at the 0.05 level, between the new employee's output and
the firm’s existing employee output.
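Equation (7.1) can be verified in the same illustrative way; the sketch below (Python, scipy) uses the figures from Example 7.1 and simply checks whether the hypothesised mean of 85 lies inside the interval.

from math import sqrt
from scipy.stats import norm

x_bar, sigma, n, alpha = 88.74, 15, 45, 0.05

z_cri = norm.ppf(1 - alpha / 2)
half_width = z_cri * sigma / sqrt(n)

ci_lower, ci_upper = x_bar - half_width, x_bar + half_width   # roughly 84.36 to 93.12
contains_mu0 = ci_lower <= 85 <= ci_upper                     # True: fail to reject H0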

Checking assumptions

To use the z test, the data are assumed to represent a random sample from a population
that is normally distributed. One-sample Z tests are considered ‘robust’ for violations of
the normal distribution assumption. This means that the assumption can be violated
without serious error being introduced into the test. The central limit theorem tells us
that, if our sample is large, the sampling distribution of the mean will be approximately
normally distributed irrespective of the shape of the population distribution. Knowing
that the sampling distribution is normally distributed is what makes the one-sample Z
test robust for violations of the assumption of normal distribution.

If the underlying population distribution is not normal and the sample size is small, then
you should not use the Z test. In this situation you should use an equivalent
nonparametric test (see Chapter 8).

Check your understanding

X7.1 A mobile phone company is concerned at the lifetime of phone batteries supplied
by a new supplier. Based upon historical data, this type of battery should last for
900 days with a standard deviation of 150 days. A recent randomly selected
sample of 40 batteries was selected and the sample battery life was found to be
942 days. Is the population battery life significantly different from 900 days
(significance level 5%)?

X7.2 A local Indian restaurant advertises home delivery times of 30 minutes. To


monitor the effectiveness of this promise the restaurant manager monitors the
time that the order was received and the time of delivery. Based upon historical
data, the average time for delivery is 30 minutes with a standard deviation of 5
minutes. After a series of complaints from customers regarding this promise the
manager decided to analyse the last 50 data orders, which resulted in an average
time of 32 minutes. Conduct an appropriate test at the 5% significance level.
Should the manager be concerned?

One-sample t test for the population mean

In many real-world cases of hypothesis testing, one does not know the standard
deviation of the population. In such cases, it must be estimated using the sample
standard deviation. That is, s (calculated with division by n – 1) is used to estimate σ.

Other than that, the calculations are identical to those we saw for the z test for a single sample
– but the test statistic is called t, not z, and we conduct a one-sample t test for the
population mean with the following t-test assumptions:

1. The sample is a simple random sample from a defined population.


2. The variables of interest in the population are measured on an interval/ratio scale.
3. The sampling distribution of the sample means is normal (the central limit theorem
tells you when this will be the case).
4. The population standard deviation is estimated from the sample.

Student’s t test is built around a t distribution with the value of the t-test statistic given
by equation (6.7):

tcal = (X̄ − μ) / (s/√n)

Where the sample standard deviation (s) is given by equations (2.7) and (2.8):

s = √( Σ(X − X̄)² / (n − 1) )

For a single-sample t test, we must use a t distribution with n – 1 degrees of freedom.

As this implies, there is a whole family of t distributions, with degrees of freedom


ranging from 1 to infinity. All t distributions are symmetrical about t = 0, like the
standard normal. In fact, the t distribution with df = ∞ is identical to the standard
normal distribution.

Example 7.2

A car dealer offers a generous package to customers who would like high-quality extras
fitted to the cars. Historically, people who access this offer spend £2300 per customer.
The owner is concerned that recently this average spend has changed and requested,
after discussions with a data analyst, that this is tested.

The analyst recommended a one-sample t test. To test this hypothesis, the data analyst
checked the data for the last 10 years to confirm that the population spend follows
approximately a normal distribution and then collected the spending data for the last
thirteen customers as illustrated in table 7.4.

ID Sample data, £'s


1 2595
2 1670
3 2899
4 2194
5 2313
6 2469
7 2131
8 2131
9 2657
10 1817
11 2473
12 1890
13 2330
Table 7.4 Sample data

Test the hypothesis that the average spend is £2300 (test at the 5% significance level).

The five-step procedure to conduct this test progresses as follows:

Step 1 State hypothesis

Null hypothesis H0: μ = 2300 (population mean spend on extras is equal to £2300).

Alternative hypothesis H1: μ ≠ 2300 (population mean is not equal to £2300).

The ≠ sign implies a two-tailed test.

Step 2 Select test

We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:

a. Number of samples – one sample.


b. The statistic we are testing – testing for a difference between a sample
mean and population mean (µ = 2300). Therefore, we want a two-tailed
test.
c. Size of the sample – small (n = 13).
d. Nature of population from which sample drawn – normal population
distribution, sample size is small, and population standard deviation is
unknown. The sample standard deviation will be used as an estimate of
the population standard deviation and the sampling distribution of the
mean is a t distribution with n – 1 degrees of freedom.

Then the sampling distribution of the sample means is given by equation (6.6):

X̄ ∼ t_df(μ, s²/n)

and the corresponding standardised t equation is given by equation (6.7):

t = (X̄ − μ) / (s/√n)

We conclude that a one-sample t test of the mean is appropriate.

Step 3 Set the level of significance

α = 0.05

Step 4 Extract relevant statistic

Sample data

Sample size, n = 13

Sample mean, X̄ = 2274.538462

Sample standard deviation, s = 352.4974268

Standard error of the means, σX̄ = s/√n = 97.76519591

Substituting these values into equation (6.7) gives:

t = (X̄ − μ) / (s/√n) = (2274.5385 − 2300) / 97.7652 = −0.2604

with the number of degrees of freedom, df = n – 1 = 12.

Step 5 Make a decision

Critical value method

From statistical tables, the two-tailed critical test statistic, tcri = t(0.05/2, 12) =
2.18.

Figure 7.7 Critical values of the t distribution

The calculated test statistic tcal = – 0.2604 and the critical t value ± 2.18 are
compared to decide which hypothesis statement to accept.

Given tcal lies between the lower and upper critical t values (– 2.18 and + 2.18),
we fail to reject the null hypothesis H0. Figure 7.10 illustrates the relationship
between the p-value, test statistic and critical test statistic.

The evidence suggests that there is no significant difference, at the 0.05 level, between
the extras purchased by the sample (i.e., the customers today) and the historical extras
purchased of £2300.
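As an illustrative cross-check in Python (not part of the textbook's Excel and SPSS solutions), scipy's one-sample t test can be applied directly to the thirteen values in Table 7.4:

from scipy import stats

spend = [2595, 1670, 2899, 2194, 2313, 2469, 2131,
         2131, 2657, 1817, 2473, 1890, 2330]              # data from Table 7.4

t_cal, p_two = stats.ttest_1samp(spend, popmean=2300)     # two-sided test by default
t_cri = stats.t.ppf(1 - 0.05 / 2, df=len(spend) - 1)      # approximately 2.18

reject_h0 = p_two < 0.05    # False: t_cal (about -0.26) lies between -t_cri and +t_cri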

Excel solution

Figures 7.8 and 7.9 illustrate the Excel solution.

Figure 7.8 Example 7.2 data

Figure 7.9 Example 7.2 Excel solution

From Excel:

Critical value method

The calculated test statistic tcal = –0.26 and the critical t value tcri = ±2.18. Given
that tcal lies between the lower and upper critical t values (– 2.18 and + 2.18), we
fail to reject the null hypothesis H0.

P-value method

From the Excel and SPSS solutions, the value of the two-tail p-value = 0.7989 >
significance level (0.05), so we fail to reject the null hypothesis.

Figure 7.10 illustrates the relationship between using the critical test statistic and using
the p-value method in deciding on which hypothesis statement to accept.

Figure 7.10 Relationship between the p-value, test statistic, and critical test
statistic.

Table 7.5 provides the Excel functions to calculate Student’s critical test statistic or p-
values using Excel functions.

Calculation        P-values                                        Critical test statistic
Lower one-tail     =T.DIST(t value, degrees of freedom, TRUE)      =T.INV(significance level, degrees of freedom)
Upper one-tail     =T.DIST.RT(t value, degrees of freedom)         =T.INV(1-significance level, degrees of freedom)
Two-tail           =T.DIST.2T(ABS(t value), degrees of freedom)    =T.INV.2T(significance level, degrees of freedom)
                                                                    for the upper critical t value and
                                                                    =-T.INV.2T(significance level, degrees of freedom)
                                                                    for the lower critical t value
Table 7.5 Excel functions to calculate critical t values or p-values

Confidence interval method

The confidence interval for the population mean μ is given by rearranging equation (6.7) to give equation (7.2). In other words, the true μ will be somewhere inside the interval defined by equation (7.2):

μ = X̄ ± tcri × (s/√n)     (7.2)

If we carry out the calculation for a 5% significance level, then a 95% confidence
interval for the population mean difference would be – 238.47 to + 187.55 as illustrated
in Figure 7.11.

Figure 7.11 Confidence interval solution to make a hypothesis test decision

As before, cells K23 and K32 use two different Excel functions to yield the same result.
Observe that this 95% confidence interval for the mean difference (sample mean – population mean = 2274.54 – 2300 = – 25.46) does contain zero, the difference specified by the null hypothesis. Therefore, we fail to reject the null hypothesis. We conclude from the evidence that there is no significant difference, at the 0.05 level, between the extras purchased by the sample (i.e. the customers today) and the historical extras purchased of £2300.

SPSS solution

Enter data into SPSS

Figure 7.12 Example 7.2 SPSS data

Select Analyze > Compare Means

Figure 7.13 SPSS One-Sample T Test

Select One-Sample T Test

Transfer Data_value into Test Variable(s) box

Type 2300 into the Test Value box.

Figure 7.14

Click on Options.

Type 95% in the Confidence Interval Percentage box.

Figure 7.15 SPSS one-sample t test options

Click on Continue.

Figure 7.16 SPSS one-sample t test

Click OK

SPSS output

Figure 7.17 gives the one-sample statistics and one-sample hypothesis test results.

Figure 7.17 SPSS solution

Summary

Observe that the manual, Excel, and SPSS solutions all agree.

Hypothesis test method

t = - 0.26
df = 12
2 tail p-value = 0.799 > 0.05, so fail to reject null hypothesis

Confidence interval method

Mean difference = - 25.46


95% confidence interval – 238.47 to + 187.55
Observe that this confidence interval for the mean difference (– 238.47 to + 187.55) contains 0, so fail to reject the null hypothesis.

Checking assumptions

The assumptions of the one-sample t test are identical to those of the one-sample z test.
To use the t test, the data are assumed to represent a random sample from a population
that is normally distributed. In practice, if the sample size is not too small and the
populations are nearly symmetrical, the t distribution provides a good approximation to
the sampling distribution of the mean when the population standard deviation is
unknown. The t test is called a robust test (not sensitive to departures) in that it does
not lose power if the shape of the distribution departs from a normal distribution and
the sample size is large (n ≥ 30); this allows the test statistic to be influenced by the
central limit theorem. If the underlying population distribution is not normal and the
sample size is small, then you should not use the t test. In this situation you should use
an equivalent nonparametric test (see Chapter 8).

Check your understanding

X7.3 Calculate the critical t values for a significance level of 1% and 12 degrees of
freedom for (a) a two-tailed test, (b) a lower one-tailed test, and (c) an upper
one-tailed test.

X7.4 After further data collection the marketing manager (Exercise X6.8) decides to
revisit the data analysis and changes the type of test to a t test.

a. Explain under what conditions a t test could be used rather than the z test.
b. Calculate the corresponding p-value if the sample size was 13 and the test
statistic equal to 2.03. From this result what would you conclude?

X7.5 A tyre manufacturer conducts quality assurance checks on the tyres that it
manufactures. One of the tests was on its medium-quality tyres with an
independent random sample of 12 tyres providing a sample mean and standard
deviation of 14,500 km and 800 km, respectively. Given that the historical

Page | 376
average is 15,000 km and that the population is normally distributed, test
whether the sample gives cause for concern.

X7.6 A new low-fat fudge bar is advertised with 120 calories. The manufacturing
company conducts regular checks by selecting independent random samples and
testing the sample average against the advertised average. Historically the
population varies as a normal distribution and the most recent sample consists
of the values 99, 132, 125, 92, 108, 127, 105, 112, 102, 112, 129, 112, 111, 102,
122. Is the population value significantly different from 120 calories (at
significance level 5%)?

One-sample z test for the population proportion

We now consider the one-sample z test for the proportion, π. This test also relies on the
central limit theorem to ensure a standard normal distribution for its test statistic, and
so we can only apply it to large samples.

Null hypothesis for this test is

H0: π = π0

Alternative hypothesis for this test is

Left-sided test

H1: π < π0

or right-sided test

H1: π > π0

or two-sided test

H1: π ≠ π0

Here π0 is the population proportion value specified in the null hypothesis. The only
assumption that this test requires is that the sample size be sufficiently large that the
test statistic in equation (7.3) will follow a standard normal distribution:
Z = (p − π) / √( π(1 − π) / n )     (7.3)

Where p is the sample proportion, n is the sample size and π is the population proportion value specified in H0. The calculated value of the test statistic is denoted by
zcal.

Example 7.3

The human resources manager of a large company is asked to investigate a claim by the
union that 20% of the employees of the company are experiencing high levels of work
stress. The company, hoping to prove that this number is not as high as the union claims
it to be, conducts a large study on 695 randomly selected employees. The study shows
that 123 of the 695 employees included in the study suffer from high levels of work
stress. The company would like to know if it can use this result to refute the claim.
Assume the data sampled follow a normal distribution and test at the 5% significance
level.

The five-step procedure to conduct this test is as follows.

Step 1. State hypothesis

It was claimed that the population proportion of employees experiencing high


levels of work stress is 20% or 0.2, so the null hypothesis is H0: π = 0.2. The
company would like to prove that the true proportion is, in fact, lower than 0.2,
so the alternative hypothesis is H1: π < 0.2. The symbol < implies that a left-sided
test will be used.

Step 2 - Select test

We now need to choose an appropriate statistical test for testing H0. From the information provided we note:
• Number of samples – one sample.
• Clearly the parameter of interest is the population proportion π.
• The test statistic will need to reflect the distance between the sample
proportion (p = 123/695= 0.176978) and the hypothesised population
proportion (π0 = 0.2).
• The data collected follow a normal distribution.

Step 3 - Set the level of significance

α = 0.05

Step 4 - Extract the relevant statistic

Note that the sample proportion of employees experiencing high levels of work
stress is p = 123/695= 0.176978 and that the hypothesised value of the
population proportion is given as π0 = 0.2. Therefore, the test statistic calculated
from this information and equation (7.3) is:

Z = (p − π) / √( π(1 − π) / n )

Z = (0.176978 − 0.2) / √( 0.2 × (1 − 0.2) / 695 ) = −1.59

Step 5. Make a decision

Using a significance level of α = 0.05, the critical value of a left-sided test is – Zα = – Z0.05 = –1.645 (from tables; see Figure 7.18).

Figure 7.18 Critical values of the standardised normal distribution

Comparing the calculated test statistic, zcal = –1.59, to the critical value, – Z0.05 = –1.645, we find that zcal does not fall below the critical value (–1.59 > –1.645), so it does not lie in the rejection region. We therefore fail to reject H0.

Please note that the values are quite close, and we could conclude that the sample evidence is inconclusive on whether we reject or fail to reject the null hypothesis. A possible course of action would be to re-do the data collection and review how the sample data were collected.
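A minimal Python sketch of equation (7.3), using the counts quoted in this example (123 stressed employees out of 695), is shown below for illustration only; the book's Excel solution follows.

from math import sqrt
from scipy.stats import norm

x, n, pi0, alpha = 123, 695, 0.20, 0.05
p_hat = x / n                                        # sample proportion

z_cal = (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)    # test statistic, equation (7.3)
z_cri = norm.ppf(alpha)                              # left-sided critical value
p_value = norm.cdf(z_cal)                            # left-sided p-value

reject_h0 = p_value < alpha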

Excel solution

Figure 7.19 illustrates the Excel solution.

Figure 7.19 Example 7.3 Excel solution

Critical value method

Zcal = - 1.590

Lower one-tail Zcri = - 1.645

Given Zcal > Zcri (- 1.590 > - 1.645), we conclude that we fail to reject the null
hypothesis.

P-value method

Lower one-tail p-value = 0.0558

Given a one-tailed test, we compare this p-value with the significance level (α = 0.05).

Given p-value > α (0.0558 > 0.05), fail to reject the null hypothesis.

Please note that the values are quite close, and we could conclude that the sample evidence is inconclusive on whether we reject or fail to reject the null hypothesis. A possible course of action would be to re-do the data collection and review how the sample data were collected.

Although the company hoped to disprove the claim that 20% of the work force suffers
from stress, they cannot do so (at the 0.05 significance level). The rejection region is
shown in Figure 7.20.

Figure 7.20 Graphical representation of the critical values, test statistic value and
p-value of the test

Confidence interval method

The confidence interval for the population proportion π is given by rearranging equation (7.3) to give equation (7.4). As before, this means that the true proportion for the population is somewhere within the interval defined by equation (7.4):

π = p ± Zcri × √( p(1 − p) / n )     (7.4)

If we carry out the calculation for the 5% significance level, then a 95% confidence interval for the population proportion would be from 0.15 to 0.21 as illustrated in Figure 7.21.

Figure 7.21 Confidence interval solution for Example 7.3

Observe that this 95% confidence interval does contain the assumed population
proportion in the hypothesis test of 0.2. Therefore, we fail to reject the null hypothesis.
Although the company hoped to disprove that 20% of the work force suffers from
stress, it cannot do so (at the 0.05 significance level).

Checking assumptions

To use the z test, the data are assumed to represent a random sample from a population
that is normally distributed. One-sample Z tests are considered robust for violations of
the normal distribution. This means that the assumption can be violated without
serious error being introduced into the test. The central limit theorem tells us that, if
our sample is large, the sampling distribution of the mean will be approximately normal
irrespective of the shape of the population distribution. Knowing that the sampling
distribution is normally distributed is what makes the one-sample Z test robust for
violations of the assumption of the normal distribution. If the underlying population
distribution is not normal and the sample size is small, then you should not use the Z
test. In this situation you should use an equivalent nonparametric test (see Chapter 8).

Check your understanding

X7.7 Do 9% of Teesside commuters travel to work by car? A survey on commuting by


car was done on a random sample of 250 commuters and found car commuting
to be 13%. Test the claim using a 5% level of significance.

X7.8 A national provider of gas and electricity within a national market claims that
86% of its customers are very satisfied with the service they receive. To test this
claim, the company regularly undertakes random sampling surveys of its
customers. A recent random sample of 100 customers showed an 80% rating at
the very satisfied level. Based on these findings, can we reject the hypothesis that
86% of the customers are very satisfied? Assume a significance level of 0.05.

X7.9 The company in X7.8 now claims that at least 86% of its customers are very
satisfied. Again, 100 customers are surveyed using simple random sampling,
with 80% very satisfied. Based on these results, should we accept or reject the
company’s hypothesis? Assume a significance level of 0.05.

X7.10 A university reviews student progress and over the last two years has
implemented many initiatives to improve the retention rate. The historical data
based upon the last three years show a failure rate of 6% across all university
programmes. After implementing several new initiatives during the two
academic years, the university re-evaluated its failure rate using a random
sample of 250 students and found the failure rate for this academic year had
changed to 2.5%. Test whether the failure rate has improved.

7.3 Two-sample hypothesis tests


So far, we have learned how to test hypotheses involving one sample, where we
contrasted what we observed with what we expected from the population. Often,
researchers are faced with hypotheses about differences between groups. For example,
do interest rates rise more quickly when wages increase or stay the same? In this
section we will explore a range of two-sample tests covering comparing means,
proportions, and variances. Table 7.6 provides a list of these. We will concentrate on
two-sample tests that are available in both Excel and SPSS; other tests are explored in
the online materials.

Statistics test                                                              Excel    SPSS
Two sample z test for two independent population means                       Online   No
Two sample z test for two independent population proportions                 Online   No
Two sample t test for two population means (independent samples,
equal variance)                                                              Yes      Yes
Two sample t test for two population means (independent samples,
unequal variance)                                                            Yes      Yes
Two sample t test for two population means (dependent samples)               Yes      Yes
Two sample F test for two population variances                               Online   Online
Table 7.6 Two sample hypothesis tests

Two-sample t test for the population mean: independent samples

In testing for the difference between means we assume that the populations are
normally distributed, with either equal or unequal population variances.

Pooled-variance t test – equal population variances assumed

For situations in which the two populations have equal variances, the pooled-variance t
test is robust to moderate departures from the assumption of normality, provided the
sample sizes are large (nA ≥ 30, nB ≥ 30).

In this case you can use the pooled-variance t test without serious effects on the power.
The test is used to test whether the population means are significantly different from
each other, using the means from randomly drawn samples. For example, do males and
females differ in terms of their exam scores?

When dealing with a normal sampling distribution we calculate the t-test statistic using
equation (7.5), an estimate of the population standard deviation from equation (7.6),
and the number of degrees of freedom given by equation (7.7).

Test statistic

tcal = ( (X̄A − X̄B) − (μA − μB) ) / ( σ̂A+B × √( 1/nA + 1/nB ) )     (7.5)

Pooled standard deviation

σ̂A+B = √( ( (nA − 1)S²A + (nB − 1)S²B ) / ( nA + nB − 2 ) )     (7.6)

Degrees of freedom

df = nA + nB – 2 (7.7)

Equation (7.6) is an estimator of the pooled standard deviation of the two samples. As
indicated above, the null hypothesis for this test specifies a value for μA – μB, the difference between the population means. When testing using equation (7.5), H0 specifies that μA – μB = 0. For that reason, most textbooks omit μA – μB from the numerator of equation (7.5). The independent-samples (or unpaired) t test has degrees
of freedom df = nA + nB – 2.

The assumptions for the equal-variance t test are as follows:

1. Both samples are simple random samples.


2. The two samples are independent.
3. The sampling distribution of X̄A − X̄B is normal.
4. The populations from which you sampled have equal variances.

The two-sample t test is robust to departures from normality. When checking


distributions graphically, look to see that they are symmetric and have no outliers.
There are a range of statistical tests that you could employ to see if the normality
assumption is violated (e.g. the F test for population variances, Kolmogorov–Smirnov
test or Shapiro–Wilk test).

If you have an issue with the homogeneity of variance assumption then a rule of thumb
on the relative sizes is that if the larger of the two variances is no more than 4 times the
smaller, the t-test approximation is probably good enough, especially if the sample sizes
are equal. Homogeneity of variances means that population variances are equal. The
other word often used is homoscedasticity.

Page | 384
It is important to note, however, that heterogeneity of variance and unequal sample
sizes do not mix. If you have reason to anticipate unequal variances, make every effort
to keep your sample sizes as equal as possible. If the heterogeneity of variance is too
severe, there are versions of the independent-group t test that allow for unequal
variances. The method used in SPSS, for example, is called Welch’s unequal variances t
test. If you have a problem of this nature, then think about using a nonparametric test
called the Mann–Whitney U test.

Separate-variance t test for the difference between two means: unequal population
variances

The assumption of equal variances is critical if the sample sizes are markedly different.
Welch developed an approximation method for comparing the means of two
independent normal populations when their variances are not necessarily equal.
Because Welch’s modified t test is not derived under the assumption of equal variances,
it allows users to compare the means of two populations without first having to test for
equal variances.

With equations (7.5)–(7.7), we assumed that the variances were equal for the two
samples and conducted a two-sample pooled t test. If the variances are unequal, we
should not use these equations but use the Welch’s unequal variances t test statistic
defined by equation (7.8):

tcal = [(X̄A − X̄B) − (μA − μB)] / √(S²A/nA + S²B/nB)     (7.8)

Here, S²A and S²B are the sample variances of the two groups (the unbiased estimators of
the population variances). Unlike in Student's t test, the denominator is not based on a pooled variance
estimate. For use in significance testing, the distribution of the test statistic is
approximated as an ordinary Student’s t distribution with the degrees of freedom
calculated using the Welch–Satterthwaite equation:

df = (S²A/nA + S²B/nB)² / [ (S²A/nA)² / (nA − 1) + (S²B/nB)² / (nB − 1) ]     (7.9)
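For readers who also use Python, the sketch below is a minimal illustration (assuming numpy and scipy are installed, and using two made-up samples) that evaluates equations (7.8) and (7.9) directly and checks the result against scipy's Welch t test.

import numpy as np
from scipy import stats

# Hypothetical samples with unequal spread (illustration only)
A = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.3])
B = np.array([9.8, 13.2, 10.4, 12.9, 8.7, 11.6, 14.0])

vA = A.var(ddof=1) / len(A)      # S_A^2 / n_A
vB = B.var(ddof=1) / len(B)      # S_B^2 / n_B

# Welch test statistic, equation (7.8), with hypothesised difference = 0
t_cal = (A.mean() - B.mean()) / np.sqrt(vA + vB)

# Welch-Satterthwaite degrees of freedom, equation (7.9)
df = (vA + vB) ** 2 / (vA ** 2 / (len(A) - 1) + vB ** 2 / (len(B) - 1))

# scipy's unequal-variances (Welch) t test should agree with t_cal
t_check, p_value = stats.ttest_ind(A, B, equal_var=False)
print(t_cal, t_check, df, p_value)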

The assumptions for the unequal-variance t test are as follows:

1. Both samples are simple random samples.


2. The two samples must be independent.
3. The sampling distribution of X̄A − X̄B is normal.

The two-sample t test is robust to departures from normality. When checking


distributions graphically, look to see that they are symmetric and have no outliers.
Welch's t test remains robust for skewed distributions and large sample sizes.

Page | 385
If you have a problem of this nature, then think about using a nonparametric test called
the Mann–Whitney U test.

Example 7.4

A newsagent sells tea bags to the local community at slightly above cost price. The bags
are packed and delivered via a distribution centre that receives unwanted goods from
national supermarkets. The newsagent shop owner decides to check the number of tea
bags in each bag, to confirm that the bags contain approximately the same number of
tea bags.

To enable this analysis the shop owner checks two random independent samples at two
different time points and counts the number of tea bags with the data provided in table
7.7 (G1=Group1 and G2=Group2). Conduct a two-sample t test for the population mean
to test this hypothesis at the 5% significance level (assume sampling distributions are
normally distributed).

G1 G2
425 429 417 414 421 409 471 439 368
385 381 359 444 323 387 505 303 375
464 420 418 417 368 385 364 392 283
396 443 364 403 485 358 456 339
365 417 357 401 405 393 444 413
387 407 351 464 426 440 566
351 381 468 364 402 413 434
446 386 421 409 402 455 522
411 349 303 379 407 434 469
426 376 436 434 408 357 369
Table 7.7 Number of tea bags in bags at time t1 and t2

If we calculate the mean number of tea bags for group 1 and group 2, we will find:

Group 1 mean = 399.5349


Group 2 mean = 413.6571

We observe that the mean number of tea bags differs between the two groups, but the
question is whether this difference is statistically significant. To answer this question,
we will undertake a two-sample t test for two independent samples.

In this case, we are not told if the population variances are equal or unequal.

The five-step procedure to conduct this test progresses as follows.

Step 1. State hypothesis

Null hypothesis: H0: μ1 = μ2

Alternative hypothesis: H1: μ1 ≠ μ2

Page | 386
The ≠ sign implies a two-tailed test.

Step 2. Select test

We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:

• Number of samples – two samples.


• The statistic we are testing – testing whether the number of tea bags per bag
is the same at the two time points. Both population standard deviations are
unknown.
• Nature of population from which sample drawn – population
distributions are normal.

Step 3. Set the level of significance

 = 0.05

Step 4. Extract relevant statistic

Summary statistics are shown in Table 7.8

Sample statistic Sample Group 1 Sample Group 2


Sample size 43 35
Sample mean 399.5349 413.6571
Sample standard deviation 37.6983 57.7918
Table 7.8 Sample statistics

Option 1. Equal population variances assumed

If H0 is true (µA – µB = 0) then equations (7.5) to (7.7) give the t test


statistic, pooled standard deviation, and degrees of freedom.

Pooled standard deviation

σ̂1+2 = √[((n1 − 1)S²1 + (n2 − 1)S²2) / (n1 + n2 − 2)]

σ̂1+2 = √2279.8498 = 47.7478

Test statistic

tcal = [(X̄1 − X̄2) − (μ1 − μ2)] / [σ̂1+2 × √(1/n1 + 1/n2)]

t cal = −1.2992

Page | 387
Degrees of freedom

df = n1 + n2 – 2

df = 76

Option 2. Unequal population variances assumed

If H0 is true (µA – µB = 0) then equation (7.8) and (7.9) can be used to


calculate the t test statistic and the degrees of freedom:

Test statistic

tcal = [(X̄A − X̄B) − (μA − μB)] / √(S²A/nA + S²B/nB)

tcal = - 1.2458

Degrees of freedom

df = (S²A/nA + S²B/nB)² / [ (S²A/nA)² / (nA − 1) + (S²B/nB)² / (nB − 1) ]

df = 56.1711
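If you want to double-check the hand calculations in Step 4 without re-entering the raw data, the summary statistics in Table 7.8 are enough. A short Python sketch (assuming scipy is installed) is shown below; the numbers are those quoted in Table 7.8.

from scipy import stats

# Summary statistics from Table 7.8
m1, s1, n1 = 399.5349, 37.6983, 43
m2, s2, n2 = 413.6571, 57.7918, 35

# Option 1: pooled-variance t test (expect t close to -1.2992, df = 76)
t_eq, p_eq = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=True)

# Option 2: Welch's t test (expect t close to -1.2458, df about 56.17)
t_uneq, p_uneq = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)

print(t_eq, p_eq)      # approximately -1.299 and 0.198
print(t_uneq, p_uneq)  # approximately -1.246 and 0.218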

Step 5. Make a decision

Option 1 Equal population variances assumed

tcal = - 1.2992

df = 76

Critical t value for 5% two tailed and df = 76 can be found from the critical
t tables as illustrated in Figure 7.22.

Page | 388
Figure 7.22 Critical values of the Student’s t distribution

Therefore, we require the critical t value for a significance level of 5% two-tail
for df = 76, given that the two-tail critical t value is 1.99 at df = 70 and 1.99
at df = 80.

Therefore, using linear interpolation:

t at df76 = t at df70 + (6/10) * (t at df80 – t at df70)

t at df76 = 1.99 + (6/10) *(1.99 – 1.99)

t at df76 = 1.99

The critical t value for 5% two-tail with 76 degrees of freedom is ±1.99.

Therefore, compare the calculated value of t with this critical value.

tcal = - 1.2992

tcri = ±1.99

Given tcal (= - 1.2992) lies between – 1.99 and +1.99, we fail to reject the
null hypothesis.

Option 2. Unequal population variances assumed

tcal = - 1.2458

df = 56.1711

Critical t value for 5% two tailed and df = 56.1711 can be found from the
critical t tables as illustrated in Figure 7.23.

Page | 389
Figure 7.23 Critical values of the Student’s t distribution

Therefore, we require the critical t value for a significance level of 5% two-tail
for df = 56.1711, given that the two-tail critical t value is 2.01 at df = 50
and 2.00 at df = 60.

Therefore, using linear interpolation:

t at df56.1711 = t at df50 + (6.1711/10) * (t at df60 – t at df50)

t at df56.1711 = 2.01 + (6.1711/10)*(2.00 – 2.01)

t at df56.1711 = 2.0038

Therefore, the critical t value for 5% two-tail with df = 56.1711 is approximately
±2.00.

Therefore, compare the calculated value of t with this critical value.

tcal = - 1.2458

tcri = ±2.00

Given tcal (= - 1.2458) lies between – 2.00 and +2.00, we fail to reject the
null hypothesis.

Therefore, the analysis suggests that the number of tea bags in the bags is not
significantly different at a 5% significance.
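The interpolation of printed tables used in Step 5 can also be avoided if statistical software is to hand. The following is a minimal Python sketch (assuming scipy is installed) that returns the exact two-tailed critical values and p-values for both options.

from scipy import stats

alpha = 0.05

# Exact two-tailed critical values (interpolated above as roughly 1.99 and 2.00)
t_crit_pooled = stats.t.ppf(1 - alpha / 2, df=76)
t_crit_welch = stats.t.ppf(1 - alpha / 2, df=56.1711)

# Two-tailed p-values for the calculated test statistics
p_pooled = 2 * stats.t.sf(abs(-1.2992), df=76)       # about 0.198
p_welch = 2 * stats.t.sf(abs(-1.2458), df=56.1711)   # about 0.218

print(t_crit_pooled, p_pooled)
print(t_crit_welch, p_welch)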

Excel solution

Figures 7.24–7.25 show the Excel solutions for (a) population variance equal and (b)
population variance not equal.

Enter data into Excel

Page | 390
Figure 7.24 Example 7.4 data

Page | 391
Figure 7.25 Example 7.4 Excel solution

Two sample pooled t test for means

Figure 7.26 represents the Excel solution.

Figure 7.26 Two-sample pooled t test

From Excel:

tcal = −1.2992
two-tailed tcri = ±1.99
two-tailed p-value = 0.1978

Based upon these statistics we fail to reject the null hypothesis.

Page | 392
We conclude that, based upon the sample data collected, the quantity of teabags is not
different at the 5% level of significance. Figure 7.27 illustrates the relationship between
the p-value and the test statistic.

Figure 7.27 Comparison of t, critical t, and p-value


Two sample t test for means assuming unequal variances

Figure 7.28 represents the Excel solution.

Figure 7.28 Two sample t test for means assuming unequal variances

From Excel:

tcal = - 1.2458
Page | 393
two-tailed tcri = ±2.00
two-tailed p-value = 0.2180

Based upon these statistics we fail to reject the null hypothesis.

We conclude that, based upon the sample data collected, the quantity of teabags is not
different at the 5% level of significance.

Figure 7.29 illustrates the relationship between the p-value and test statistic.

Figure 7.29 Comparison of t, critical t, and p-value

Note that Excel provides the test statistic, critical test statistic, and p-value. SPSS
provides the test statistic and p-value.

Confidence interval solutions

We can use the confidence interval method to make hypothesis test decisions.

Two sample pooled t test for means

Figure 7.30 95% confidence interval for pooled t test

Page | 394
The 95% confidence interval is −35.77 to 7.53, with a margin of error of 21.65. Given
that the confidence interval contains the hypothesised difference of zero, we fail to
reject the null hypothesis.

Two sample t test for means assuming unequal variances

Figure 7.31 95% confidence interval for t test assuming unequal variances

The 95% confidence interval is −36.83 to 8.59, with a margin of error of 22.71. Given
that the confidence interval contains the hypothesised difference of zero, we fail to
reject the null hypothesis.
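The two confidence intervals can likewise be reproduced from the Table 7.8 summary statistics alone. The sketch below (Python, assuming numpy and scipy are installed) should land close to the figures quoted above (−35.77 to 7.53 and −36.83 to 8.59).

import numpy as np
from scipy import stats

m1, s1, n1 = 399.5349, 37.6983, 43    # summary statistics from Table 7.8
m2, s2, n2 = 413.6571, 57.7918, 35
diff = m1 - m2

# 95% interval assuming equal population variances (pooled t)
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se_eq = sp * np.sqrt(1 / n1 + 1 / n2)
t_eq = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(diff - t_eq * se_eq, diff + t_eq * se_eq)          # about -35.77 to 7.53

# 95% interval assuming unequal population variances (Welch)
v1, v2 = s1**2 / n1, s2**2 / n2
se_uneq = np.sqrt(v1 + v2)
df_w = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
t_uneq = stats.t.ppf(0.975, df=df_w)
print(diff - t_uneq * se_uneq, diff + t_uneq * se_uneq)  # about -36.83 to 8.59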

SPSS solution

Enter data into SPSS

Figure 7.32 Example 7.4 SPSS data

Given that we have a two-sample independent test to perform, we have created two
variables for SPSS. The Group variable takes values 1 and 2, representing the samples
taken at the two collection time points, respectively.

Page | 395
Select Analyze > Compare Means

Figure 7.33 Choosing SPSS Independent samples t test

Select Independent-Samples T Test

Transfer the tea bag count variable into the Test Variable(s) box.

Transfer Group variable into the Grouping Variable box.

Click on Define Groups, and type 1 in the Group 1 box and 2 in the Group 2 box.

Click Continue.

Figure 7.34 Independent samples t test

Page | 396
Click on Options.

Type 95% in the Confidence Interval Percentage box.

Figure 7.35 Options

Click Continue.

Figure 7.36 SPSS independent samples t-test menu

Click OK

SPSS output

Figure 7.37 gives the group statistics for group 1 and group 2. Figures 7.38 and 7.39
give the hypothesis test results for the two-sample independent t test.

Figure 7.37 Example 7.4 SPSS solution

Page | 397
Figure 7.38 Example 7.4 SPSS solution continued

Figure 7.39 Example 7.4 SPSS solution continued

Equal variances assumed

From SPSS, we note that we have two results. When we assume equal variances,
we obtain t = −1.299 and two-tailed p-value = 0.196. This agrees with the Excel
results (t = −1.2992 and two-tailed p-value = 0.1978). Given that the two-tailed p-
value is greater than 0.05, we fail to reject H0.

Equal variances test (Levene’s test)

The SPSS printout includes the value of the F-test statistic which can be used to
test whether the two populations for group 1 and group 2 have equal population
variances. The hypothesis test is H0: variances equal, and H1: variances not equal.

The value of F is given as 3.235, and the associated test p-value is 0.076. Since
our significance level is 0.05 and our p-value is greater than this, we fail to reject
the null hypothesis H0 and conclude at the 5% level of significance that the two
population variances are not significantly different.

This also means that we should use the two-sample t test for independent
samples with pooled variances.
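Excel has no built-in Levene's test, but it is also available outside SPSS. The following is a minimal Python sketch (assuming scipy is installed) using two small hypothetical samples simply to show the call; in practice you would pass in the two groups from Table 7.7. Setting center='mean' gives the classic mean-centred Levene statistic, which is the version SPSS reports; scipy's default, center='median', is the Brown–Forsythe variant.

from scipy import stats

# Hypothetical samples (illustration only); substitute the two groups from Table 7.7
group1 = [402, 398, 410, 395, 401, 407, 399, 404]
group2 = [388, 431, 365, 442, 377, 420, 360, 451]

# Levene's test of H0: the two population variances are equal
stat, p_value = stats.levene(group1, group2, center='mean')
print(stat, p_value)   # a p-value above 0.05 supports using the pooled-variance t test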

Unequal variances assumed

If we did have evidence for unequal variances, then we can use the second t test.
When we do not assume equal variances, we obtain t = - 1.246 and two-tailed p-

Page | 398
value = 0.218. This is identical to the Excel result. Given the two-tailed p-value is
greater than 0.05, we fail to reject the null hypothesis H0.

Conclusions

We conclude, based upon the sample data collected, that we have no evidence to
suggest that the number of tea bags in groups 1 and 2 is significantly different at the
5% level of significance.

Excel Data Analysis solutions

As an alternative to either of the two previous methods, we can use a method embedded
in Excel Data Analysis.

Select Data > Data Analysis > t-Test: Two Sample Assuming Equal Variances:

Figure 7.40 Excel Data Analysis menu: two-sample t test assuming equal
variances

Figure 7.41 Excel Data Analysis menu: two-sample t test assuming equal
variances

Page | 399
We observe from Figure 7.41 that the relevant results agree with the previous
results.

Select Data > Data Analysis > t-Test: Two Sample Assuming Unequal Variances:

Figure 7.42 Excel data analysis solution – t test for 2 sample assuming unequal
variances

Figure 7.43 Excel data analysis solution – t test for 2 sample assuming unequal
variances

Warning:

It should be noted that the Excel Analysis ToolPak method for this statistical test
will round the value of the degrees of freedom to a whole number and use this value
to calculate the critical value.

We observe from Figure 7.43 that the relevant results agree with the previous results
(see Excel warning about the degrees of freedom).

Conclusion

Page | 400
We conclude, based upon the sample data collected, that we have no evidence that
the number of tea bags differs between the two data collection time points at the
5% level of significance.

Check your understanding

X7.11 During an examination board meeting concerns were raised concerning the
marks obtained by students sitting the final year advanced economics (AE) and
e-marketing (EM) papers (see Table 7.9). Historically the sample data follow a
normal distribution and the population standard deviations are approximately
equal. Assess whether there is a significant difference between the two sets of
results (test at 5%).

AE AE EM EM EM
51 63 71 68 61
66 35 69 53 59
50 9 63 65 55
48 39 66 48 66
54 35 43 63 61
83 44 34 48 58
68 68 57 47 77
48 36 58 53 73
45 68 64 54
Table 7.9 Student marks

X7.12 A university finance department would like to compare the travel expenses
claimed by staff attending conferences. After initial data analysis, the finance
director has identified two departments which seem to have very different levels
of claims. Based upon the data provided in Table 7.10, carry out a suitable test to
assess whether the level of claims from department A is significantly greater
than that from department B. You can assume that the population expenses data
are normally distributed and that the population standard deviations are
approximately equal.

Department A Department B
156.67 146.81 147.28 140.67 108.21 109.10 127.16
169.81 143.69 157.58 154.78 142.68 110.93 101.85
130.74 155.38 179.89 154.86 135.92 132.91 124.94
158.86 170.74
Table 7.10 Travel expenses

X7.13 Repeat Exercise X7.11 but do not assume equal variances. Are the two sets of
results significantly different? Test at 5%.

X7.14 Repeat Exercise X7.12 but do not assume equal variances. Are the expenses
claimed by department A significantly different than those by department B?
Test at 5%.

Page | 401
Two-sample t-test for the population mean: dependent or paired samples

The paired sample t-test, sometimes called the dependent sample t-test, is a statistical
procedure used to determine whether the mean difference between two sets of
observations is zero. In a paired sample t-test, each subject is measured twice, resulting
in pairs of observations.

Suppose you are interested in evaluating the effectiveness of a weight loss diet. One
approach you might consider would be to measure the weight of a person before
starting the diet and the weight after being on the diet for a period of time. We would
then analyse the differences using a paired sample t-test. The null hypothesis for this
paired samples t test is that the difference scores are a random sample from a
population in which the mean difference has some value which you specify.

The test assumptions are as follows:

1. The samples are simple random samples.


2. The sample data consist of matched pairs.
3. The number of matched pairs is large (n ≥ 30) or the paired differences from the
population are approximately normally distributed.

The t-test statistic is given by equations (7.10) and (7.11), with the degrees of freedom
given by equation (7.12):

Test statistic

t = (d̄ − D) / (sd / √n)     (7.10)

Standard deviation of the differences

sd = √[ Σ(d − d̄)² / (n − 1) ]     (7.11)

Degrees of freedom

df = n – 1 (7.12)

In equations (7.10)–(7.11), d̅ is the sample mean of the difference scores, D is the mean
difference in the population, given a true H0 (often D = 0, but not always), sd is the
sample standard deviation of the difference scores, n is the number of matched pairs,
and df is the degrees of freedom.
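Equations (7.10)–(7.12) amount to a one-sample t test on the paired differences, and most statistical libraries wrap this up directly. The sketch below (Python, assuming numpy and scipy are installed, with a small made-up before/after data set purely for illustration) computes the statistic both by hand and with scipy.

import numpy as np
from scipy import stats

# Hypothetical matched pairs (illustration only)
after = np.array([58.0, 64.0, 47.0, 62.0, 70.0, 55.0, 54.0, 66.0])
before = np.array([52.0, 61.0, 48.0, 57.0, 66.0, 49.0, 55.0, 60.0])

d = after - before
n = len(d)

sd = d.std(ddof=1)                              # equation (7.11)
t_cal = (d.mean() - 0) / (sd / np.sqrt(n))      # equation (7.10), with D = 0
df = n - 1                                      # equation (7.12)

# scipy performs the same paired (dependent samples) test in one call
t_check, p_value = stats.ttest_rel(after, before)
print(t_cal, t_check, df, p_value)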

Example 7.5

A university department is running extra support for international students on basic


study skills. The unit that deals with this assesses student report writing skills at the
start of the course and re-assesses the students at the end of the course using a
standardised test. Please note that this study skills course is compulsory for all

Page | 402
international students but is not part of the student degree course being studied. Table
7.11 illustrates the test results for the 2019-20 academic year.

Conduct a two-sample t test for the population mean for paired samples to test the
hypothesis that the before and after test results are significantly different. If significant,
do we have any evidence that the support worked in improving the student results?
What hypothesis test would you conduct?

Pairs Pairs Pairs Pairs Pairs


X1 X2 X1 X2 X1 X2 X1 X2 X1 X2
56.86 42.66 40.48 45.93 49.75 39.09 67.93 67.07 64.28 69.46
57.88 55.71 44.78 38 54.71 47.32 70.21 53.23 70.41 79.53
44.64 59.12 87.45 63.27 60.43 66.12 55.63 52.21 55.29 44.73
48.87 55.75 66.28 30.72 70.05 36.76 46.76 56.55 66.83 33.38
68.34 40.91 59.11 46.81 78.55 82.46 26.72 36.58 70.18 62.43
55.76 25.58 50.69 58.23 50.75 44.92 45.67 44.88
46.83 55.4 57.24 43.36 66.87 57.28 60.75 59.18
56.66 59.2 48.14 69.73 46.5 31.35 65.99 40.2
56.18 48.44 68.17 62.13 43.39 67.8 30.62 53.42
51.1 61.18 64.48 66.09 76.26 67.08 40.14 60.01
Table 7.11 Before and after test results

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

The hypothesis statement implies that the population mean difference between
test results is zero.

Null hypothesis:

No difference in population test results

H0: D = μ1 − μ2 = 0

Alternative hypothesis

Difference in population test results

H1: D ≠ 0

(or D = μ1 − μ2 ≠ 0)

The ≠ sign implies a two-tailed test.

Page | 403
Step 2 Select test

We now need to choose an appropriate statistical test for testing H0. From the
information provided we note:

• Number of samples – two samples.


• The statistic we are testing – testing that the test results before and after
are significantly different. Both population standard deviations are
unknown.
• Size of both samples – large (n1 and n2 = 45 > 30).
• Nature of population from which sample drawn – population distribution
is not known; we will assume that the population is approximately
normal given the sample size is greater than 30.

In this case, we have two variables that are related to each other (test result
before and test result after) and we will conduct a paired-sample t test for
means.

Step 3 Set the level of significance

 = 0.05

Step 4 Extract relevant statistic

Given equations (7.10) – (7.12) we need to calculate: sample pair size n, standard
deviation of the differences (sd), t statistic, and the degrees of freedom, df. Tables
7.12 – 7.13, represent the calculation of the summary statistics

Person   Assessment score after extra help, X1   Assessment score before extra help, X2   d = X1 − X2   (d − d̄)²
1 56.86 42.66 14.20 102.53
2 57.88 55.71 2.17 3.63
3 44.64 59.12 -14.48 344.27
4 48.87 55.75 -6.88 120.00
5 68.34 40.91 27.43 545.48
6 55.76 25.58 30.18 681.50
7 46.83 55.4 -8.57 159.88
8 56.66 59.2 -2.54 43.75
9 56.18 48.44 7.74 13.44
10 51.1 61.18 -10.08 200.35
11 40.48 45.93 -5.45 90.72
12 44.78 38 6.78 7.32
Table 7.12 Summary statistics

Page | 404
Person   Assessment score after extra help, X1   Assessment score before extra help, X2   d = X1 − X2   (d − d̄)²
13 87.45 63.27 24.18 404.23
14 66.28 30.72 35.56 991.34
15 59.11 46.81 12.30 67.66
16 50.69 58.23 -7.54 134.90
17 57.24 43.36 13.88 96.15
18 48.14 69.73 -21.59 658.66
19 68.17 62.13 6.04 3.86
20 64.48 66.09 -1.61 32.31
21 49.75 39.09 10.66 43.37
22 54.71 47.32 7.39 10.99
23 60.43 66.12 -5.69 95.34
24 70.05 36.76 33.29 853.55
25 78.55 82.46 -3.91 63.75
26 50.75 44.92 5.83 3.08
27 66.87 57.28 9.59 30.42
28 46.5 31.35 15.15 122.67
29 43.39 67.8 -24.41 811.36
30 76.26 67.08 9.18 26.07
31 67.93 67.07 0.86 10.33
32 70.21 53.23 16.98 166.55
33 55.63 52.21 3.42 0.43
34 46.76 56.55 -9.79 192.22
35 26.72 36.58 -9.86 194.17
36 45.67 44.88 0.79 10.79
37 60.75 59.18 1.57 6.27
38 65.99 40.2 25.79 471.57
39 30.62 53.42 -22.80 722.24
40 40.14 60.01 -19.87 573.34
41 64.28 69.46 -5.18 85.64
42 70.41 79.53 -9.12 174.09
43 55.29 44.73 10.56 42.06
44 66.83 33.38 33.45 862.92
45 70.18 62.43 7.75 13.51
Table 7.13 Summary statistics continued

From tables 7.12 and 7.13:

D = 0, n = 45, Σd = 183.35, Σ(d − d̄)² = 10288.72

Therefore,

Page | 405
Average difference d̅ = 4.0744

Substituting into equation (7.11) gives the standard deviation for the differences, sd

sd = √[ Σ(d − d̄)² / (n − 1) ] = √(10288.72 / (45 − 1)) = 15.2916

Substituting into equation (7.10) gives the t test value

t = (d̄ − D) / (sd / √n) = (4.0744 − 0) / (15.2916 / √45) = 1.7874

Substituting into equation (7.12) gives the number of degrees of freedom

df = n – 1

df = 45 – 1 = 44
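A few lines of Python (standard library only) confirm the Step 4 arithmetic from the totals reported above, Σd = 183.35 and Σ(d − d̄)² = 10288.72 with n = 45.

import math

n = 45
sum_d = 183.35           # from Tables 7.12-7.13
sum_sq_dev = 10288.72
D = 0                    # hypothesised mean difference

d_bar = sum_d / n                           # about 4.0744
sd = math.sqrt(sum_sq_dev / (n - 1))        # about 15.2916, equation (7.11)
t_cal = (d_bar - D) / (sd / math.sqrt(n))   # about 1.7874, equation (7.10)
df = n - 1                                  # 44, equation (7.12)
print(d_bar, sd, t_cal, df)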

Step 5. Make a decision

Given t = 1.7874 and df = 44, we need to use the t critical tables to find the
critical value for df = 44 and a 5% significance level (see Figure 7.44).

Figure 7.44 Critical values for the Student’s t distribution

From statistical tables:

For a two-tail 5% significance we have:

Critical t value at df = 40 is 2.02

Critical t value at df = 50 is 2.01

Page | 406
Can use linear interpolation to find the critical value for df = 44 as follows:

tdf = 44 = tdf = 40 + (4/10)*(tdf=50 – tdf=40)

tdf = 44 = 2.02 + (4/10)*(2.01 – 2.02)

tdf = 44 = 2.016

We have:

tcal = + 1.7874

tcri = ±2.016

Given tcal lies between the critical t values (−2.016 ≤ +1.7874 ≤ +2.016),
then we would fail to reject the null hypothesis, H0.

We conclude that there is no evidence to suggest the test scores before or after are
significantly different at a 5% significance level.

Excel solution

Figures 7.45–7.46 illustrate the Excel solution. Note that in Figure 7.45 rows 11–45 are
hidden.

Figure 7.45 Example 7.5 Excel solution

Page | 407
Figure 7.46 Example 7.5 Excel solution

Critical t value method

From Excel, t = 1.7874 and the two-tail critical t value = ±2.0154. Given 1.7874
lies between – 2.0154 and + 2.0154, we fail to reject the null hypothesis.

P-value method

From Excel, the two-tail p-value = 0.0808 > 2 tail significance level of 0.05, again
we fail to reject the null hypothesis.

Figure 7.47 illustrates the relationship between the p-value and test statistic.

Page | 408
Figure 7.47 Relationship between t, critical t value, and the p-value

We conclude that there is no evidence to suggest the test scores before or after are
significantly different at a 5% significance level.

Confidence interval method

If you calculate the confidence intervals you will have the following results.

Figure 7.48 Confidence interval calculations

Page | 409
From Excel, the 95% confidence interval is – 0.15 to 8.67.

Notice that the interval contains the hypothesised mean difference of zero (the
sample mean difference itself is 4.07).

This tells us that we fail to reject the null hypothesis. We conclude that there is
no evidence to suggest the test scores before or after are significantly different at
a 5% significance level.

Excel Data Analysis add-in solution for a two-sample t test for the mean

As an alternative to either of the two previous methods, we can use a method


embedded in Excel Data Analysis.

Select Data > Data Analysis > t-Test: Paired Two Sample for Means.

Figure 7.49 Example 7.5 Excel Data Analysis solution

Figure 7.50 Example 7.5 Excel Data Analysis solution

Page | 410
SPSS solution

Input data into SPSS

Figure 7.51 Example 7.5 SPSS data

Analyze > Compare Means >

Figure 7.52 Paired sample t test

Select Paired-Samples T Test

Transfer variables Score_1 and Score_2 into the Paired Variables box.

Page | 411
Figure 7.53 SPSS paired samples t test menu

Click OK

SPSS output

Figure 7.54 shows the paired sample statistics. Figure 7.55 shows the paired samples
correlations. Figures 7.56 and 7.57 show the outcome of the two-sample dependent t
test.

Figure 7.54 SPSS solution

Figure 7.55 SPSS solution continued

Figure 7.56 SPSS solution continued

Page | 412
Figure 7.57 SPSS solution continued

From SPSS:

t value is 1.787
two-tailed p-value is 0.081

This matches the Excel results (t = 1.7874 and two-tailed p-value = 0.0808). Given that
the two-tailed p-value is greater than 0.05, we fail to reject H0.

We conclude that there is no evidence to suggest the test scores before or after
are significantly different at a 5% significance level.

Note that the equivalent nonparametric test for two dependent (paired) samples is the
Wilcoxon signed-rank test. We will cover this test in Chapter 8.

Check your understanding

X7.15 Choko Ltd provides training to its salespeople to aid the ability of each
salesperson to increase the value of their sales. During the last training session
15 salespeople attended and their weekly sales before and sales after are
provided in Table 7.14. Assuming the populations are normally distributed
assess whether there is any evidence for the training company’s claims that its
training is effective (test at 5% and 1%).

Person Before After Person Before After


1 2911.48 2287.22 9 2049.34 2727.41
2 1465.44 3430.54 10 2451.25 2969.99
3 2315.36 2439.93 11 2213.75 2597.71
4 1343.16 3071.55 12 2295.94 2890.20
5 2144.22 3002.40 13 2594.84 2194.37
6 2499.84 2271.37 14 2642.91 2800.56
7 2125.74 2964.65 15 3153.21 2365.75
8 2843.05 3510.43
Table 7.14 Change in value of sales

X7.16 Concern has been raised at the standard achieved by students completing final-
year project reports within a university department. One of the factors identified
as important is the mark achieved in the research methods module, which is
studied before the students start their project. The department has now
collected data for 18 students as given in Table 7.15. Assuming the populations
are normally distributed, is there any evidence that the marks are different? Test
at 5%.

Page | 413
Student RM Project Student RM Project
1 38 71 9 48 43
2 50 46 10 14 62
3 51 56 11 38 66
4 75 44 12 47 75
5 58 62 13 58 60
6 42 65 14 53 75
7 54 50 15 66 63
8 39 51
Table 7.15 Research methods results

Chapter summary
The focus of parametric tests is that the underlying variables are at the interval/ratio
level of measurement and the population(s) being measured is normally or
approximately normally distributed. In the next chapter we explore hypothesis tests for
variables that are at the nominal or ordinal level of measurement by employing the
concept of the chi-square test and nonparametric tests.

As a summary, Table 7.16 provides a comparison between the z test and t test. For large
sample size (n  30), the results from Student’s t test will be approximated by the
normal distribution, given that the sample standard deviation is assumed to be
approximately equal to the population standard deviation (σ ≈ s).

Furthermore, Student’s t distribution approximates the shape of the normal distribution


when sample size is large.

Parameter | Test | Number of samples | Samples independent or dependent | Test statistic distribution | Population standard deviation | Sample size
Test for the mean | z-test | 1 | | Normal | Known |
Test for the mean | t-test | 1 | | Student's t | Not known | < 30
Test for the proportion | z-test | 1 | | Normal | |
Test for the means | z-test | 2 | Independent | Normal | Known |
Test for the proportions | z-test | 2 | Independent | Normal | |
Test for the means | t-test | 2 | Independent | Student's t | Not known | < 30
Test for the means | t-test | 2 | Dependent | Student's t | Not known | < 30
Table 7.16 One and two sample z and t tests

Page | 414
Test your understanding
TU7.1 Calculate the critical z value if you have a two-tailed test and you choose a
significance level of 0.05 and 0.01.

TU7.2 If you conduct a z test and it is a lower one-tailed test, what is your decision if the
significance level is 0.05 and the value of the test statistic is – 2.01?

TU7.3 Calculate the value of the p-value for the TU7.2 question.

TU7.4 Calculate the value of the z statistic if the null hypothesis is H0: μ = 63, where a
random sample of size 23 is selected from a normal population with a sample
mean of 66 (assume the population standard deviation is 15).

TU7.5 Calculate the probability that a sample mean is greater than 68 for the TU7.4
question, when the alternative hypothesis is (a) two-tailed, and (b) upper one-
tailed (assume the significance level is 0.05).

TU7.6 Repeat TU7.5 but with the information that the population standard deviation is
not known. Describe the test you would use to solve this problem. Given that the
sample standard deviation was estimated to be 16.2, answer TU7.5 (a) and (b).

TU7.7 According to a recent report by a local car dealership, the mean age of cars that
are scrapped is 8.7 years. To test whether this age has changed the dealership
regularly keeps details of the age of cars when a decision to send to the scrap
yard is made. The managing director of the dealership has provided an analyst
(you) with a sample size of 52 and a sample average and standard deviation of
9.5 and 3.2, respectively. The managing director has asked you to analyse the data.

a. State the null and alternative hypothesis.


b. Explain whether you will use a z or t test,
c. Calculate the critical z value (assume significance levels of 0.05 and 0.01).
d. Calculate the value of the test statistic.
e. Decide whether H0 or H1 should be accepted and provide reasons for your
decision (use the critical test statistic and p-value in making this decision).
f. Explain any reservations you may have in making your decision with
reference to the test assumptions.

TU7.8 A small manufacturing company sells ice cream in 430g pots. The quality control
process includes the measuring and recording of all 430g pots which are filled by
hand. A recent sample of 36 pots provided a sample mean and standard
deviation of 396g and 14g, respectively. You have been given the task of
answering the following questions:

a. What type of test could we use to test the hypothesis H0: μ = 430, H1: μ ≠
430?
b. Conduct the test and decide whether H0 or H1 should be accepted.

Page | 415
c. Provide details of any issues that the company needs to be aware of. Include a
discussion of test assumptions and any potential impact on the quality
control process.

TU7.9 The quality controller for the ice cream company (TU7.8) has been given
responsibility for monitoring individual staff performance on the 430 g
production line. The concern is that there is a difference in worker performance
in terms of the number of pots filled between the morning and afternoon shifts.
Given the sample data in Table 7.17, can we assume that there is a difference
between the morning and afternoon shifts? Test at the 0.01 level of significance
and assume independent samples. If a difference exists, what would the
consequences be for the company and employees?

Morning 72 68 91 69 78 73 77 80 77
shift
Afternoon 81 65 88 76 75 81 74 76
shift
Table 7.17 Worker performance

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A7Wa Two-sample z test using Excel.


2. A7Wb Comparing population variances: variance ratio F test and Levene’s
test
3. A7Wc Welch ANOVA test
4. A7Wd Statistical power and type II error

Factorial experiments workbook

Furthermore, you will find a factorial experiments workbook that explores using Excel
and SPSS the solution of data problems with more than two samples using both
parametric and non-parametric tests.

Page | 416
Chapter 8 Chi square and non-parametric hypothesis tests
8.1 Introduction and learning objectives
In previous chapters we explored choosing sample estimates from populations that may
be normally distributed, or large samples that came from non-normal distributions. We
carried out specific inference or hypothesis tests using z and Student’s t tests. We made
assumptions about the nature of the sample and distribution. However, very little has
been said about what to do if these assumptions are not met.

In this chapter tests will be presented for circumstances in which the assumptions for z
and t tests have not been met. In Chapters 5 and 7 we considered only interval- or ratio-
level variables. What can you do if you need to test differences or relationships among
nominal or ordinal variables? We will study tests that do not involve the actual data
values that were observed (e.g., test scores). Instead, we will look at two types of test.
The first type will involve either the counts (or frequencies) of observations (applicable
to chi-square tests). The second type will involve tests that use the ranks of scores
instead of the scores themselves (applicable to nonparametric tests).

Parametric tests from the previous chapter were used to assess whether the differences
between means (or proportions) are statistically significant. Within parametric tests we
sample from a distribution with a known parameter value, for example the population
mean (), variance (2), or proportion (π).

Important things to remember about the techniques described in Chapter 7 can be


summarised by three assumptions:

1. The underlying population follows a normal distribution.


2. The level of measurement is of equal interval or ratio scaling.
3. The population variances are equal.

Unfortunately, we will come across data that do not fit these assumptions. How do we
measure the difference between the attitudes of people surveyed in assessing their
favourite brand, when the response by each person is of the form 1, 2, 3,…, n? In this
situation, we have ordinal data for which taking differences between the numbers (or
ranks) is meaningless.

Another example is if we are asking for opinions where the opinion is of a categorical
form (e.g. strongly agree, agree, disagree, strongly disagree). The concept of difference
is again meaningless. The responses are words, not numbers. You can solve this
problem by allocating a number to each response, with 1 for strongly agree, 2 for agree,
and so on. This gives you a rating scale of responses. However, recall that the opinions
of people are not quite the same as measuring time or measuring the difference
between two points. Can we say that the difference between strongly agree and agree is
the same as the difference between agree and disagree? Another way of looking at this
problem is to ask the question: can we say that a rank of 5 is five times as strong as a
rank of 1?

Page | 417
As before, the relevance of this chapter is in specific applications that you may meet as
you conduct your business. A question, such as whether the two regions that you are
responsible for are performing in a uniform fashion, is a good example that can be
solved using the techniques from this chapter. Another example is whether male
customers respond to my advertising campaign in the same way as female customers.
You might think that the answer is obvious. For example, if 65% of women and 55% of
men respond positively to your campaign, does this mean that the women are more
convinced by the campaign? Not necessarily. Given the population size and other
factors, you might be drawing incorrect conclusions. If you apply the methods from this
chapter to this specific example, you might, on the other hand, end up telling your
management something different.

You might say that although it appears that women are more susceptible to the
campaign, you have evidence that this is not the case. In fact, you will be able to provide
evidence that both genders respond similarly, despite what the actual percentages
show. This is the practical value inherent in the methods covered in this chapter.

To take yet another example to illustrate how the tests in this chapter can be used,
imagine you are doing a small survey for your dissertation. Part of your overall research
project is to establish whether students’ attitudes towards work ethics change as they
progress through their studies. To establish this, you interview a group of first-year
students and a group of final-year students and ask them certain questions to illuminate
this problem.

You present the results in a simple table, where the rows represent first-year and last-
year students and the columns represent their attitudes (a scale such as strongly agree,
agree, disagree, etc.). Once you have constructed such a table, you can establish if the
maturity of students is in some way linked with their views on work ethics. The chi-
square test would test this claim of an association between their views on work ethics.

Table 8.1 provides a list of chi-square and nonparametric statistical tests described in
this book together and shows which methods are solved via Excel and SPSS.

Statistics test Excel SPSS


Chi-square test of independence test Yes Yes
Chi-square test for two proportions (independent samples) test Yes Yes
McNemar’s test for the difference between two proportions Yes Yes
(dependent samples) test
Chi-square goodness-of-fit test Yes Yes
Sign test Yes Yes
Wilcoxon signed rank sum test for matched pairs Yes Yes
Mann-Whitney test for two independent samples Yes Yes
Table 8.1 Chi-square and nonparametric tests in Excel and SPSS

As was the case in the previous chapter, it is sometimes difficult to select the
appropriate test for the given conditions. Figure 8.1 provides a flow chart of the
decisions required to decide which test to use to carry out the correct hypothesis test.

Page | 418
Figure 8.1 Which hypothesis test to use?

The key questions are:

1. What are you testing: difference or association?


2. What type of data is being measured?
3. Can we assume that the population is normally distributed?
4. How many samples?

We will begin this chapter with tests based around the chi-square distribution.

Learning objectives

On completing this chapter, you will be able to:

1. Apply a chi-square test to solve a range of problems. These include analysing


tabulated data, goodness-of-fit tests, testing for normality, and testing for equality of
variance.
2. Apply a range of nonparametric tests to solve a range of problems. These include the
sign test, Wilcoxon signed-rank test for two paired samples, and the Mann–Whitney
U test for two independent samples
3. Solve problems using Microsoft Excel and IBM SPSS Statistics software packages.

8.2 Chi-square tests


A chi-square test (or 2 test, from the Greek letter chi) is any statistical hypothesis test
where the sampling distribution of the test statistic is a chi-square distribution when
the null hypothesis is true. This test is versatile and widely used. It can be used when

Page | 419
dealing with data that are categorical (or nominal or qualitative) in nature and cannot
be stated as a number (e.g. responses such as ‘yes’, ‘no’, ‘red’, and ‘disagree’).

In Chapter 1 we explored tabulating such data and the use of bar and pie charts. Charts
were appropriate when we were dealing with proportions that fall into each of the
categories measured, which form a sample from a proportion of all possible responses.
In Chapter 7 we explored the situation of comparing two proportions where we
assumed that the underlying population distribution is normally distributed. In this
section we will explore a series of methods that make use of the chi-square distribution
to make inferences about two or more proportions:

1. Test of independence.
2. Chi-square test for two populations (independent samples)
3. McNemar’s test for differences between two proportions (dependent samples)

Chi-square test of independence

In this section, we test a claim that the row and column variables are independent of
each other. A test of independence tests the null hypothesis that there is no association
between the variables in a contingency table. There is no requirement for the
population data to be normally distributed or follow any other distribution. However,
the following three assumptions are made:

1. Simple random sample: the sample data are a random sample from a fixed
population where every member of the population has an equal probability of
selection.
2. Independence: the observations must be independent of each other. This means
the chi-square cannot test for correlated data (e.g. matched pairs). In that
situation McNemar’s test is more appropriate.
3. Cell size: The chi-square test is valid if at least 80% of the expected frequencies
exceed 5 and all the expected frequencies exceed 1.

If these assumptions hold, the chi-square test statistic follows a chi-square distribution.
Suppose a university department surveyed 594 students to see which module they had
chosen for their final year of study. The objective was to determine whether males and
females differed in module preference for five specific modules (Table 8.2). The
question we would like to answer is whether we have an association between the
module chosen and a person’s gender.

Males Females
BIN3020-N 65 86
BIN3029-N 66 82
Module

BIN3022-N 51 38
BIN3678-N 59 40
BIN3045-N 59 48
Table 8.2 Gender versus module attended

Page | 420
To answer this question, we can employ the chi-square test of independence (or test
of association). This is used to determine whether the frequency of occurrences for
two category variables (or more) are significantly related (or associated) to each other.

Null hypothesis, H0

The null hypothesis (H0) states that the row and column categorical variables (e.g. the
module, student gender) are not associated (are independent).

Alternative hypothesis, H1

The alternative hypothesis (H1) states that the row and column variables are associated
(are dependent).

The chi-square test statistic is defined by equation (8.1):

χ² = Σ (O − E)² / E, summed over all cells     (8.1)

Where O is the observed frequency in the cell of a contingency table and E is the
expected frequency in the cell of a contingency table, given the null hypothesis is true.

It can be shown that if the null hypothesis is true then the chi-square (χ²) test statistic
approximately follows a chi-square distribution with (r – 1)(c – 1) degrees of freedom.
Why are the degrees of freedom equal to (r - 1) (c - 1)?

Suppose that we know the marginal totals for each of the levels of our categorical
variables. In other words, we know the total for each row and the total for each column.
For the first row, there are c columns in our table, so there are c cells. Once we know the
values of all but one of these cells, then because we know the total of all the cells it is a
simple algebra problem to determine the value of the remaining cell. If we were filling in
these cells of our table, we could enter c – 1 of them freely, but then the remaining cell is
determined by the total of the row. Thus, there are c – 1 degrees of freedom for the first
row.

We continue in this manner for the next row, and there is again c – 1 degrees of
freedom. This process continues until we get to the penultimate row. Each of the rows
except for the last one contributes c – 1 degrees of freedom to the total. By the time that
we have all but the last row, then because we know the column sum, we can determine
all the entries of the final row. This gives us r - 1 rows with c - 1 degrees of freedom in
each of these, for a total of (r – 1) (c – 1) degrees of freedom.

As with parametric tests, you would reject the null hypothesis H0 if the value of the test
statistic is equal to or greater than the upper critical value of the chi-square distribution
with (r – 1)(c – 1) degrees of freedom:

Reject null hypothesis: χ2cal ≥ χ2cri

Fail to reject the null hypothesis: χ2cal < χ2cri

Page | 421
Figure 8.2 illustrates the region of acceptance/rejection for the null hypothesis.

Figure 8.2 Chi-square region of acceptance/rejection for the null hypothesis

Again, just like with parametric tests, we can also make decisions by using either the
critical value criterion or the p-value.

If the null hypothesis is true, then we can use the observed frequencies in the table to
calculate the expected frequencies (E) for each cell using equation (8.2):

E = (Row total × Column total) / Total sample size     (8.2)

To test the null hypothesis, we would compare the expected cell frequencies with the
observed cell frequencies and calculate the Pearson chi-square test statistic given by
equation (8.3), which is a more detailed version of equation (8.1):

χ²cal = Σ (Oi − Ei)² / Ei, summed over all n cells     (8.3)

The chi-square test statistic enables a comparison to be made between the observed
frequency (O) and expected frequency (E). Equation (8.2) tells us what the expected
frequencies would be if there was no association between the two categorical variables,
for example, gender and course.

If the values were close to one another then this would provide evidence that there is no
association. Conversely, if we find large differences between the observed and expected
frequencies, then we have evidence to suggest an association does exist between the
two categorical variables: gender and course.

Page | 422
Statistical hypothesis testing allows us to confirm whether the differences are likely to
be statistically significant. The chi-square distribution varies in shape with the number
of degrees of freedom. Therefore, we need to find this value before we can look up the
appropriate critical values. The number of degrees of freedom (df) for a contingency
table is given by equation (8.4):

df = (r – 1)(c – 1) (8.4)

Where r is the number of rows and c is the number of columns. We will identify a
rejection region using either the critical test statistic or the p-value calculated from the
test statistic, as with the hypothesis testing methods described in Chapter 6.
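Outside Excel and SPSS, the whole procedure (expected frequencies, test statistic, degrees of freedom and p-value) is a single call in most statistical libraries. The sketch below is a minimal Python illustration (assuming numpy and scipy are installed) applied to the module/gender counts in Table 8.2; correction=False asks for the plain Pearson statistic of equation (8.3).

import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Table 8.2 (rows = modules, columns = male, female)
observed = np.array([[65, 86],
                     [66, 82],
                     [51, 38],
                     [59, 40],
                     [59, 48]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p_value)   # expect roughly 11.27 with 4 df and p about 0.024
print(expected)             # the E values given by equation (8.2)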

Example 8.1

We are interested in establishing if gender and course selection are associated. We have
two attributes: gender, and module, both of which have been divided into categories:
two for gender and five for module. The resulting table is called a 5 × 2 contingency
table because it consists of five rows and two columns. To determine whether gender
and chosen module are associated (or independent), we conduct a chi-square test of
association on the contingency table. Table 8.3 repeats Table 8.2, but with the addition
of row, column, and overall totals.

Males Females Totals


BIN3020-N 65 86 151
BIN3029-N 66 82 148
Module

BIN3022-N 51 38 89
BIN3678-N 59 40 99
BIN3045-N 59 48 107
Totals = 300 294 594
Table 8.3 Gender versus module chosen

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

Null hypothesis H0: Gender and module choice are not associated (are
independent)

Alternative hypothesis H1: There is an association between gender and module


chosen (they are dependent)

Step 2 Select test

• Number of samples – two category data variables (gender and module).


• Random sample.
• Values represented as frequency counts within the contingency table.
• The statistic we are testing – testing for an association between the two
category data variables.

Page | 423
Apply a chi-square test of association to the sample data.

Step 3 Set the level of significance

 = 0.05

Step 4 Extract relevant statistic

Calculate the expected frequencies using equation (8.2) and the chi-square value
for each observed/expected frequency pair using equation (8.3). The results are
given in Tables 8.4 and 8.5.

Expected male                                    Expected female
76.2626 =151*300/594 74.7374 =151*294/594
74.7475 =148*300/594 73.2525 =148*294/594
44.9495 =89*300/594 44.0505 =89*294/594
50.0000 =99*300/594 49.0000 =99*294/594
54.0404 =107*300/594 52.9596 =107*294/594
Table 8.4 Calculation of the expected frequencies

We can now calculate the ratio (O - E)2 / E for each cell pairing of observed (O)
and expected frequencies E.

(O − E)²/E male          (O − E)²/E female


1.6633 1.6972
1.0237 1.0446
0.8144 0.8311
1.6200 1.6531
0.4552 0.4645
Table 8.5 Calculation of the chi-square values

Calculate the test statistic using equation (8.3):

χ2cal = 1.6633 + 1.0237 + … + 1.6531 + 0.4645

χ2cal = 11.2670

Calculate the number of degrees of freedom using equation (8.4) and the values
from Table 8.4:

df = (r – 1)(c – 1) = 4 × 1 = 4

Page | 424
Step 5 Make a decision

We can identify a rejection region using the critical test statistic method. From
chi-square critical tables, we find a critical value of 9.49 for the 5% significance
level and 4 degrees of freedom (Figure 8.3).

Figure 8.3 Percentage points of the chi-square distribution

Decision rule

χ2cal = 11.2670

χ2cri = 9.49

From the calculations, we have χ2cal > χ2cri , so we reject the null hypothesis, H0. We
conclude that there is a significant relationship (or association) between the
category variables gender and module preference. If you look at table 8.5, you
can see that the main contribution to this overall χ2cal value of 11.2670 comes
from module BIN3020-N with more females compared to males, and BIN3678-N
with more males than females. Figure 8.4 illustrates the relationship between the
test statistic and the critical test statistic.


Figure 8.4 Chi-square distribution with df = 4

Page | 425
One of the requirements of this test is that we have adequate expected cell counts.
What do we mean by an adequate expected cell count?

For a 2 × 2 table, every cell should have an expected frequency of at least 5; when this
assumption is not met, Yates's correction for continuity (Yates's chi-square statistic) is
applied. For a larger table, at least 80% of the cells should have expected frequencies of
5 or more, all expected frequencies must be greater than 1, and no cell should have an
expected frequency of zero. If these criteria cannot be met, then we must increase the
sample size and/or combine classes to eliminate frequencies that are too small.

Any cell frequencies < 5

Note that none of the cells have an expected frequency less than 5.

Yates’s correction for continuity

Using the chi-square distribution to interpret the chi-square statistic requires


one to assume that the discrete probability of observed binomial frequencies in
the table can be approximated by the continuous chi-square distribution. This
assumption is not quite correct and introduces some error. To reduce the error
in approximation, Frank Yates suggested a correction for continuity that adjusts
the formula for the chi-square test by subtracting 0.5 from the difference
between each observed value and its expected value in a 2 × 2 contingency table.

Equation (8.5) shows Yates's correction for continuity (Yates’s chi-square statistic) for
Pearson's chi-square statistic:

χ2Yates = Σ (|Oi − Ei| − 0.5)² / Ei, summed over all n cells     (8.5)

If you calculate Yates’s chi-square test statistic by using equation (8.5) in place of
equation (8.3) then the values will be:

χ2cal = 9.9552 χ2cri = 9.49

From the calculations, we have χ2cal > χ2cri , so we reject the null hypothesis, H0. We
conclude that there is a significant relationship (or association) between the category
variables gender and module preference. Notice that the corrected value of chi-square is
smaller than uncorrected chi-square value: Yates's correction makes the test more
‘conservative’, meaning that it is harder to get a significant result (although in this case
chi-square remains statistically significant). Note that the Excel function CHISQ.TEST
does not use Yates’s chi-square continuity correction. Note also that SPSS calculates chi-
square using Yates's correction for continuity for a 2 × 2 table when the data are in raw
data format (see Example 8.3).
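Because Excel's CHISQ.TEST ignores the correction and routines such as scipy's chi2_contingency only apply it automatically to 2 × 2 tables, the corrected statistic for this 5 × 2 table is easiest to compute by hand. The sketch below (Python, assuming numpy is installed) evaluates equations (8.3) and (8.5) from the observed counts and should reproduce the 11.267 and 9.9552 quoted above.

import numpy as np

# Observed frequencies from Table 8.3 (rows = modules, columns = male, female)
observed = np.array([[65, 86],
                     [66, 82],
                     [51, 38],
                     [59, 40],
                     [59, 48]], dtype=float)

# Expected frequencies via equation (8.2): row total x column total / grand total
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()

pearson = ((observed - expected) ** 2 / expected).sum()              # equation (8.3)
yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()  # equation (8.5)
print(pearson, yates)   # about 11.267 and 9.955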

Excel solution

Input the data.

Page | 426
Calculate the expected frequencies and chi-square ratio (O – E)2/E as illustrated in
Figure 8.5.

Figure 8.5 Example 8.1 Excel solution

Observe in Figure 8.5, we created extra columns called Em (Expected male), Ef


(Expected female), row totals, column totals, and (O – E)2/E.

Expected frequencies in Figure 8.5 are calculated as:

Expected frequencies Cell C16 Formula:=$E5*C$10/$E$10


male (Em) Copy formula down C17:E20
Expected frequencies Cell D16 Formula:=$E5*D$10/$E$10
female (Ef) Copy formula down D17:D20
(O-E)2/E Cell E15 Formula:=(C5-C16)^2/C16
Copy formula down and across to F20
Table 8.6

Now we can proceed and calculate the test statistics for the chi-square hypothesis test
as illustrated in Figure 8.6.

Page | 427
Figure 8.6 Example 8.1 Excel solution continued
2
Using 𝜒𝑐𝑟𝑖 to make a decision

From Excel:

χ2cal = 11.267

χ2cri = 9.49

From the calculations, we have χ2cal > χ2cri , so we reject the null hypothesis, H0.

Use test p-value to make a decision

From Excel:
P-value = 0.0237
Significance level α = 0.05
From the calculations, the p-value is less than the significance level (0.0237 <
0.05). Therefore, reject the null hypothesis, H0.

We conclude that there is a significant relationship (or association) between the


category variables gender and module preference.

Any cell frequencies < 5


Page | 428
Note in cell J29 we are checking that none of the cells have an expected
frequency less than 5.

In Figure 8.7, we observe that χ2cal lies in the rejection zone (11.267 > 9.488).

Figure 8.7 Chi-square distribution with df = 4

Yates’s continuity correction

You could calculate Yates’s continuity correction value of chi-square = 9.9552 by
subtracting 0.5 from each |O − E| in the chi-square formula, i.e. using
(|O − E| − 0.5)²/E in cells E16:F20.

SPSS solution

We can use SPSS to solve this problem by converting the independent categories
(Gender, Module) into SPSS codes and creating a third variable to represent the
frequency of occurrence within the contingency table. Table 8.7 represents the coding
to be used for the two independent categories and the frequency count for each of the
category codes.

Module        Module code   Gender   Gender code   Frequency
BIN3020-N 1 Male 1 65
BIN3020-N 1 Female 2 86
BIN3029-N 2 Male 1 66
BIN3029-N 2 Female 2 82
BIN3022-N 3 Male 1 51
BIN3022-N 3 Female 2 38
BIN3678-N 4 Male 1 59
BIN3678-N 4 Female 2 40
BIN3045-N 5 Male 1 59
BIN3045-N 5 Female 2 48
Table 8.7 SPSS codes

Page | 429
Enter these data into SPSS as illustrated in Figure 8.8.

Figure 8.8 SPSS Data

The next step is to weight the cases given the count values are frequencies.

Click on Data > Weight Cases

Move Frequency variable into the Frequency Variable box as illustrated in Figure 8.9.

Figure 8.9 Weight cases

Click OK

Now run SPSS to solve this problem.

Click Analyze > Descriptive Statistics > Crosstabs…

Transfer Course to Row(s) box and Gender to Column(s) box as illustrated in Figure
8.10.

Page | 430
Figure 8.10 SPSS Crosstabs menu

Click on Statistics,
Choose Chi-square
Click Continue.

Click on Cells
Choose Observed and Expected frequencies
Click Continue.

Click OK.

SPSS output

The SPSS output solution is split into three parts: a case processing summary (Figure
8.11), a crosstabulation table (Figure 8.12) and a table labelled ‘Chi-Square Tests’
(Figure 8.13).

Figure 8.11 Example 8.1 SPSS solution

Page | 431
Figure 8.12 Example 8.1 SPSS solution continued

Figure 8.13 Example 8.1 SPSS solution continued

If you compare the SPSS and Excel solution (Figure 8.6) you will observe that you have
the same results as with the Excel solution.

The chi-square p-value associated with the chi-square score (+11.267) is 0.024, which
means that there is a probability of 0.024 that we would get a value of chi-square as
large as the one we have if there were no effect in the population. Given p-value = 0.024
< 0.05 (significance level), we reject the null hypothesis and accept the alternative
hypothesis.

Conclusions

The value of the chi-square test statistic is 11.267 and is statistically significant given the p-value = 0.024 < 0.05. This indicates that gender is significantly associated with the module preferences of the students.

How do you solve problems when you have raw data?

Calculation of a chi-square test of association using Excel and IBM SPSS when we have
raw data rather than data in a contingency table is illustrated in Example 8.2.

Page | 432
Example 8.2

If you have spent any time watching political campaign adverts, you probably know that
such adverts can vary in terms of what they emphasise. For example, some adverts may
emphasise polishing a candidate's image, while other adverts may emphasise the
candidate's stand on issues. You may also have noticed that adverts can vary in terms of
how they attempt to motivate voters. For example, some adverts may attempt to appeal
to fears that will scare voters into voting for the candidate, or at least against some
opponent of the candidate. Other adverts may rely on other motivational strategies.

Johnson and Kaid set out to learn, among other things, whether these two variables,
emphasis (on either image or issues) and persuasion strategy (using a fear appeal or not
using a fear appeal), might be related. (See Johnson, A. and Kaid, L.L. (2002), Image ads
and issue ads in U.S. presidential advertising: Using video style to explore stylistic
differences in televised political ads from 1952 to 2000. Journal of Communication, 52, 281–300.)

As an example, they hypothesised that fear appeals are more common in issue adverts
than in image adverts. To investigate that possibility, Johnson and Kaid’s study
examined 1213 presidential campaign ads run between 1952 and 2000. Figure 8.14
provides a screenshot of the first 10 records out of a total of 1213 records.

Figure 8.14 First 10 records out of 1213 records

For this problem, we wish to check if there is an association between the type of advert
and the advert’s appeal. Therefore, this problem involves running a chi-square test of
association, where the null hypothesis states that we have no association between the
two category variables (type of advert and appeal of advert). The alternative hypothesis
states that an association exists. We will test at the 5% significance level ( = 0.05).

Excel solution

Enter data into Excel as shown in Figure 8.15 (only the first and the last 7 records are
shown, i.e. rows 11:1209 are hidden).

Page | 433
Figure 8.15 Example 8.2 Excel solution

Create the crosstab table.

In Excel, the crosstab table is an Excel PivotTable.

Click in the data area and click on Insert PivotTable (Figure 8.16).

Figure 8.16 Create PivotTable


Click OK.

Now we need to tell the Excel PivotTable what should go in the PivotTable rows,
columns, and sum values boxes (Figure 8.17).

Figure 8.17 Excel PivotTable

Page | 434
Now copy this PivotTable below the current table, but this time right-click and choose
Paste Special > Values. Repeat this step underneath your first copy to produce Figure
8.18.

Figure 8.18 Copy values from the PivotTable

You could change the number formatting in the PivotTable from numbers to a
percentage. This is achieved by clicking in the PivotTable data field (say, cell F4), and
the PivotTable Fields menu will appear. Now, in the Σ Values box, click on Count of Appeal of Advert, choose Value Field Settings…, then Show Values As, and choose % of Row Total.

Figure 8.19 Excel PivotTable showing percentages

We observe from Figure 8.19 that 26.02% of issue adverts have a fear factor, compared
to 13.99% of image adverts. Now we can use the third table. Remove the observed frequency values (cells F16:G17) and insert the Excel formulae into these cells to calculate the expected frequencies (cell F16: =H10*F12/H12, cell F17: =H17*F18/H18, cell G16: =H16*G18/H18, and cell G17: =H17*G18/H18).

Note that to aid understanding we have labelled the observed and expected frequency
tables in Figure 8.20.

If all you want is to decide if the null or the alternative hypothesis should be accepted
then calculate the chi-square p-value for this example using the Excel function
=CHISQ.TEST(F10:G11, F16:G17), as illustrated in Figure 8.20.

Page | 435
Figure 8.20 Example 8.2 Excel solution continued

From Excel:

• chi-square test statistic = 23.584 > Chi-square critical value = 3.841


• chi-square p-value = 0.0000012 < test significance level = 0.05

Both methods, as expected, yield the same conclusion: reject the null hypothesis. We
conclude that there is evidence to suggest an association between type of advert (image,
issue) and the appeal of the advert (no fear appeal, fear appeal).

Any expected frequencies < 5

Check from the table that no cell has an expected frequency < 5 (cell G37).

Yates's continuity correction

For completeness, we can also use Yates’s correction for continuity (equation
(8.5)) instead of equation (8.3) by modifying the equation in cells H25:H28 as
illustrated in Figure 8.21. Our conclusion is unchanged.

Page | 436
Figure 8.21 Example 8.2 Excel solution with Yates's continuity correction applied
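
The same figures can be cross-checked in Python. This is a sketch only: scipy is assumed to be available, the variable names are ours, and the cell counts below are reconstructed from the figures quoted in the text (60 and 204 fear-appeal adverts, 1213 adverts in total, and the 13.99% and 26.02% row percentages), so they are our reading of the data rather than a listing from the workbook.

import numpy as np
from scipy.stats import chi2_contingency

#                     no fear appeal   fear appeal
observed = np.array([[369,             60],      # image adverts (reconstructed counts)
                     [580,            204]])     # issue adverts (reconstructed counts)

chi2_uncorr, p_uncorr, dof, expected = chi2_contingency(observed, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)   # Yates's correction

print(round(chi2_uncorr, 3), p_uncorr)   # approx. 23.58 and 1.2e-06, as in Figure 8.20
print(round(chi2_yates, 3), p_yates)     # approx. 22.88, the corrected value in Figure 8.21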

SPSS solution

Import the data into SPSS to create the SPSS data file (the SPSS data file is available for
download from the online resource centre).

Figure 8.22 Example 8.2 SPSS data

Choose Analyze > Descriptive Statistics > Crosstabs… (Figure 8.23).

Figure 8.23 SPSS Crosstabs menu

Page | 437
Choose Statistics…
Select Chi-Square
Click Continue

Choose Cells…,
Select Observed and Expected counts
Click Continue

Click on OK

SPSS output

The first output table (Figure 8.24) shows that we have 1213 cases (adverts) in the study with no missing cases observed – this agrees with the Excel solution.

Figure 8.24 SPSS solution

The second output table (Figure 8.25) shows the crosstabulation data for this example,
and we observe that more issue adverts used a fear appeal (204) than image adverts (60).

Figure 8.25 SPSS solution continued

The final output table (Figure 8.26) shows the results of the chi-square test: the Pearson
chi-square value is 23.584, with a two-sided p-value given as 0.000. Although the SPSS
solution shows the p-value equal to 0.000, this does not imply that the probability is zero, only that it is smaller than 0.0005 (SPSS reports p-values to three decimal places).

Page | 438
Figure 8.26 SPSS solution continued

For this example, the null hypothesis should be rejected. We conclude that there is
evidence to suggest a significant association between the type of advert (image, issue)
and the appeal of the advert (no fear appeal, fear appeal). This agrees with the Excel
solution provided in Figure 8.20.

Finally, given this is a 2 × 2 table, Yates’s chi-square continuity value is provided in


Figure 8.26, and is equal to 22.882. This value agrees with the Excel solution provided in
Figure 8.21.

The chi-square p-value associated with the chi-square score (+ 23.584) is 0.000, which
means that there is a probability of 0.000 that we would get a value of chi-square as
large as the one we have if there were no effect in the population. Given p-value = 0.000
< 0.05 (significance level), we reject the null hypothesis and accept the alternative
hypothesis.

Conclusions

The value of the chi-square test statistic is 23.584 and is highly significant given the p-
value = 0.000 < 0.05. This indicates that an association exists between the type of advert and the advert's appeal.

Check your understanding

X8.1 A business consultant requests that you perform some preliminary calculations
before analysing a data set using Excel:

a. Calculate the number of degrees of freedom for a contingency table with


three rows and four columns.
b. Find the upper-tail critical χ2 value with a significance level of 5% and 1%.
What Excel function would you use to find this value?
c. Describe how you would use Excel to calculate the test p-value. What does
the p-value represent if the calculated chi-square test statistic equals 8.92?

X8.2 A trainee risk manager for an investment bank has been told that the level of risk
is directly related to the industry type (manufacturing, retail and financial).
Table 8.8 shows the frequencies for the type of risk and different industries, as

Page | 439
collected in a survey. For these data, analyse whether or not a perceived risk is
dependent upon the type of industry identified. Use a significance level of 5%. If
the two variables are associated, then what is the form of the association?

Industrial Class
Level of Risk
Manufacturing Retail Financial
Low 81 38 16
Moderate 46 42 33
High 22 26 29
Table 8.8 Type of industry versus level of risk

X8.3 A manufacturing company is concerned at the number of defects produced in the


manufacture of office furniture. The firm operates three shifts and has classified
the number of defects as low, moderate, high, or very high. Table 8.9 shows the
number of defects recorded for the different shifts over a period of time. Is there
any evidence to suggest a relationship between types of defect and shifts? Use a
significance level of 5%. If the two variables are associated, then what is the form
of the association?

Defect Type
Shift
Low Moderate High Very high
1 29 40 91 25
2 54 65 63 8
3 70 33 96 38
Table 8.9 Defect type per shift

X8.4 A local trade association is concerned about the level of business activity within
the local region. As part of a research project a random sample of business
owners were surveyed on how optimistic they were for the coming year. Table
8.10 shows their responses. Based upon the table, do we have any evidence to
suggest different levels of optimism for business activity? Use a significance level
of 5%. If the two variables are associated, then what is the form of the
association?
Optimism Type of business
level Bankers Manufacturers Retailers Farmers
High 38 61 59 96
No change 16 32 27 29
Low 11 26 35 41
Table 8.10 Type of business versus levels of optimism

X8.5 A group of 412 students at a language school volunteered to sit a test to assess
the effectiveness of a new method to teach German to English-speaking students.
To assess the effectiveness, the students sat two different tests, one in English
and the other in German. Table 8.11 shows their results. Is there any evidence to
suggest that the student test performances in English are replicated by their test
performances in German? Use a significance level of 5%. If the two variables are
associated, then what is the form of the association?

Page | 440
French
German
≥ 60% 40% - 59% < 40%
≥ 60% 90 81 8
40% - 59% 61 90 8
< 40% 29 39 6
Table 8.11 French versus German performance

Chi-square test for two proportions (independent samples)

In Chapter 7 we explored the application of the z test to solve problems involving two
proportions. If we are concerned that the parametric assumptions are not valid then we
can use the chi-square test to test two independent proportions. With the chi-square
test for two independent samples we have two samples that involve counting the
number of times a categorical choice is made. In this situation we can develop a
crosstabulation (or contingency table) to display the frequency that each possible value
was chosen.

Example 8.3

To illustrate the concept, consider the example of a firm trying to establish its
environmental footprint by conducting a survey of whether employees use the bus to
travel to work. Table 8.12 summarises the responses for only the people who work on
Mondays and Fridays. The question is whether we have a significant difference between
the Monday and Friday employees who travel to work by bus.

Monday Friday Total


Take bus to work 105 90 195
Do not take bus to work 70 100 170
Total 175 190 365
Table 8.12 Employees’ method of travel to work

                             Column variable
                             1              2              Totals
Row variable      1          n1             n2             N
                  2          t1 – n1        t2 – n2        T – N
                  Totals     t1             t2             T
Table 8.13 Generic 2 x 2 contingency table

In general, a 2×2 contingency table can be structured as illustrated in Table 8.13. From
this table, we can estimate the proportion of employees who will use the bus (and by
extension the probability of their doing so) by calculating the overall proportion (ρ)
using equation (8.6):

ρ = (n1 + n2)/(t1 + t2) = N/T        (8.6)

We can now use this estimate to calculate the expected frequency (E) for each cell
within the contingency table. We do this by multiplying the column total by ρ for the

Page | 441
cells linked to travelling by bus and by 1 – ρ for the cells linked to not travelling by bus, using equation (8.7):

E = ρ × column total        (8.7)

We then calculate the chi-square test statistic to compare the observed and expected
frequencies using equation (8.3):

χ2cal = Σ (from i = 1 to n) (Oi − Ei)^2 / Ei

Where Oi is the observed frequency in a cell, and Ei is the expected frequency in a cell
calculated if the null hypothesis is true. The number of degrees of freedom is given by
equation (8.4).

df = (r – 1)(c – 1)

In this case, we would expect the proportion of employees taking the bus on the two
days to be the same. This fact can then be used to calculate the expected frequencies.
From the observed and expected frequencies, we can calculate the chi-square test
statistic. We would then compare this value with the critical chi-square test statistic or
calculate the test p-value and compare with the significance level.

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

Given that the population proportions are π1 and π2, the null and alternative hypotheses are as follows:

H0: π1 = π2 (proportions travelling by bus on the two days are the same)

H1: π1 ≠ π2 (proportions different)

The ≠ sign indicates a two-tailed test.

Step 2 Select the test

• Two independent samples


• Categorical data
• Chi-square test for the difference between two proportions.

Step 3 Set the level of significance

 = 0.05

Step 4 Extract relevant statistic

Page | 442
Calculate ρ using equation (8.6):

ρ = (n1 + n2)/(t1 + t2) = N/T = 195/365 = 0.534

Calculate the expected frequencies for each cell using equation (8.7). For
example, for the bus on Monday the expected frequency would be 195×175/365
= 195×0.479 = 93.493. Repeat this calculation for the other cells within the
contingency table. To calculate the chi-square test statistic, we now need to
calculate for each cell the ratio (O – E)2/E given in equation (8.2), as illustrated
in Table 8.14.

            Observed frequency                Expected frequency       Chi-square value       Chi-square value
            Monday    Friday    Total         EM          EF           for Monday             for Friday
                                                                       (O – E)^2/E            (O – E)^2/E
Bus         105       90        195           93.5        101.5        1.4                    1.3
Not bus     70        100       170           81.5        88.5         1.6                    1.4
Total       175       190       365
Table 8.14 Calculate the expected frequencies and chi-square values

Note we have checked that no expected frequency is less than 5.

We sum these values to give the χ2cal test statistic:

χ2cal = Σ (O − E)^2/E = 1.416 + 1.304 + 1.624 + 1.496 = 5.841

The degrees of freedom are calculated from equation (8.4):

df = (r – 1)(c – 1) = 1.

Step 5 Make a decision

The critical value can be found using statistical tables with a two-tailed
significance level of 0.05 and degrees of freedom = 1.

Figure 8.27 Percentage points of the chi-square distribution

Page | 443
From Figure 8.27, the critical chi-square value is χ2cri = 3.84.

Does the test statistic lie within the rejection region?

Compare the calculated and critical chi-square values to determine which


hypothesis statement (H0 or H1) to accept. We observe that the χ2cal lies in the
rejection region (5.841 > 3.84), and we reject the null hypothesis H0.

We conclude that there is a significant difference in the proportions travelling by bus on


Monday compared to Friday.

Any expected frequencies < 5

From table 8.14, we have no expected frequencies < 5.

Yates's continuity correction

You could calculate Yates's continuity-corrected value of chi-square = 5.345 by subtracting 0.5 from |O − E| in each cell of Table 8.14, i.e. using (|O − E| − 0.5)^2/E.

Relationship between Z and χ2 when we have 1 degree of freedom

When we have 1 degree of freedom, we can show that there is a simple


relationship between the value of χ2cal and the corresponding value of Zcal.

If you carry out the calculation of a two-independent-sample z test for proportions:

p1 = X1/n1 = 105/175 = 0.6

p2 = X2/n2 = 90/190 = 0.47368

p = (X1 + X2)/(n1 + n2) = 195/365 = 0.53425

If H0 is true, then π1 = π2, and Z equals

z = (p1 − p2) / √[ p(1 − p) (1/n1 + 1/n2) ] = 2.4169

Note that Z^2 = (2.4169)^2 = 5.84139

This value is equal to χ2cal.

Page | 444
If we are interested in testing for direction in the alternative hypothesis (e.g. H1: π1 > π2), then we cannot use a χ2 test; we must use the normal distribution z test instead.
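
A short Python sketch makes the Z–χ2 relationship concrete. It assumes scipy is available and uses our own variable names; the frequencies are those of Table 8.12, and the script is a cross-check of the manual working above rather than a prescribed method.

import numpy as np
from scipy.stats import chi2_contingency, norm

#                     Monday  Friday
observed = np.array([[105,     90],     # take bus to work
                     [ 70,    100]])    # do not take bus to work

chi2_cal, p_chi, dof, expected = chi2_contingency(observed, correction=False)

# Two-proportion z test (the parametric test from Chapter 7)
p1, p2 = 105/175, 90/190
p_pool = 195/365
z = (p1 - p2) / np.sqrt(p_pool*(1 - p_pool)*(1/175 + 1/190))
p_z = 2*norm.sf(abs(z))

print(round(chi2_cal, 3), round(p_chi, 3))      # approx. 5.841 and 0.016
print(round(z, 4), round(z**2, 3), round(p_z, 3))   # approx. 2.4169, 5.841, 0.016

As expected, z squared equals the chi-square statistic and both approaches return the same p-value.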

Excel solution

Figures 8.28 and 8.29 illustrate the Excel solution.

Figure 8.28 Example 8.3 Excel solution

Expected frequencies are calculated as:

Expected frequencies    Cell C13    Formula: =$E6*C$8/$E$8
                                    Copy formula down and across C13:D14
(O – E)^2/E             Cell E13    Formula: =(C6-C13)^2/C13
                                    Copy formula down and across E13:F14
Table 8.15

Page | 445
Figure 8.29 Example 8.3 Excel solution continued

From Excel:

χ2cal = 5.841

χ2cri = 3.841

P-value = 0.016

Any expected frequencies < 5

Note we have checked that no expected frequency is less than 5.

Yates's continuity correction

You could calculate Yates's continuity-corrected value of chi-square = 5.345 by subtracting 0.5 from |O − E| in each cell, i.e. using (|O − E| − 0.5)^2/E, in cells E13:F14.

Does the test statistic lie within the rejection region?

Page | 446
Critical value solution

Given χ2cal = 5.841 > χ2cri = 3.841, reject the null hypothesis

P-value solution

Given p-value = 0.016 < 0.05, reject the null hypothesis

We conclude that there is a significant difference in the proportions travelling by bus on


Monday compared to Friday.

SPSS solution

Enter data into SPSS

Figure 8.30 Example 8.3 SPSS data

Codes: Day (1 = Monday, 2 = Friday) and Bus (1 = take bus to work, 2 = do not take bus
to work). Given this is in grouped frequency form, we need to weight cases in SPSS so
that SPSS knows the count (or frequency) values for each pairing of Day and Bus.

Figure 8.31 SPSS Weight Cases menu


Click OK.

Select Analyze > Descriptive Statistics > Crosstab (Figure 8.32).

Transfer Day variable into Rows and Bus variable into Columns.

Click Statistics
Select Chi-square
Select Continue.

Page | 447
Click Cells
Select Observed and Expected
Select Continue.

Figure 8.32 SPSS Crosstabs menu

Click OK.

SPSS output

The output is shown in Figures 8.33–8.35.

Figure 8.33 Example 8.3 SPSS solution

Figure 8.34 SPSS solution continued

Page | 448
Figure 8.35 SPSS solution continued

The Pearson chi-square test statistic is 5.841 with a two-sided p-value of 0.016. This is
the same as the Excel solution.

Conclusions

Conclude that there is a significant difference in the proportions travelling by bus on


Monday compared to Friday [Chi-square test statistic = 5.841, p-value = 0.016 < 0.05].

Check your understanding

X8.6 A local wine shop is considering running an advertisement to sell wine during
the time when a major athletics competition takes place. The shop owner has
conducted a small survey to check on his customers’ likely television viewing
habits during this athletic competition (watch, do not watch) and whether they
drink wine during the competition (drink wine, do not drink wine). Based upon
the sample data presented in Table 8.16, carry out an appropriate test to test the
hypothesis that the proportions drinking wine are not affected by their viewing
habits.

Drink wine?              Watch athletics    Do not watch athletics
Drink wine 16 24
Do not drink wine 4 56
Table 8.16 Television viewing habits

X8.7 Extra Hotels owns two hotels in a popular tourist town. The managing director
would like to check if the two hotels (X, Y) have similar return rates and has done
a small survey to ascertain if this is true. The results are presented in Table 8.17.
Conduct a suitable test to test that the proportions returning are the same for
hotels X and Y.

Choose hotel again? X Y


Yes 163 154
No 64 108
Table 8.17 Hotel return rates
Page | 449
McNemar’s test for the difference between two proportions (dependent
samples)

The contingency table methods described so far are based on the data being
independent. For a 2 × 2 table consisting of frequency counts that result from matched
pairs, the independence rule is violated. In this situation, we can use McNemar’s test for
matched pairs. In general, the 2 × 2 contingency table can be structured as illustrated in
Table 8.18. From this table, we observe that the sample proportions are given by
equations (8.8) and (8.9):

                                    Condition 2 (column variable)
                                    Yes           No            Totals
Condition 1          Yes            a             b             a + b
(row variable)       No             c             d             c + d
                     Totals         a + c         b + d         N

Table 8.18 Generic 2 x 2 table

Where:

a = number who answer yes to condition 1 and yes to condition 2.


b = number who answer yes to condition 1 and no to condition 2.
c = number who answer no to condition 1 and yes to condition 2.
d = number who answer no to condition 1 and no to condition 2.
N = total sample size who answered the questions

The question is whether the population proportion answering yes to condition 1 is the same as the population proportion answering yes to condition 2; in other words, is π1 = π2? The sample proportions who answer yes under conditions 1 and 2 are given by equations (8.8) and (8.9).

ρ1 = (a + b)/N        (8.8)

ρ2 = (a + c)/N        (8.9)

Equation (8.10) represents the McNemar test statistic used to test the null hypothesis H0: π1 = π2.

Zcal = (b − c) / √(b + c)        (8.10)

If H0 is true and b + c ≥ 25, then McNemar's z-test statistic defined by equation (8.10) is approximately normally distributed. McNemar's test can be one- or two-tailed. For a two-tailed test, you can use either the chi-square form of the test statistic (Z^2 with 1 degree of freedom) or the approximate normal distribution; for a one-tailed test, you can only use the approximate normal distribution test.

The assumptions for the test are as follows:

Page | 450
1. The sample data have been randomly selected from the population.
2. The sample data consist of matched pairs of frequency counts.
3. The sample data are at the nominal level of measurement.

Exact binomial solution

If either b or c is small (b + c < 10) then the test statistic is not well approximated by the
chi-square distribution. In this case, an exact binomial test can be used, where the
smaller of b and c is compared with a binomial distribution with parameters n = b + c and p = 0.5. To achieve a two-sided p-value, the p-value of the
extreme tail should be multiplied by 2 as in equation (8.11):

Exact binomial 2-tail p-value = 2 × Σ (from k = b to n) nCk (0.5)^k (1 − 0.5)^(n−k)        (8.11)

This is simply twice the binomial tail probability with p = 0.5 and n = b + c. The continuity-corrected version of McNemar's test approximates the binomial exact p-value.

Example 8.4

Consider the problem of establishing the impact a change in recipe has on consumers of
pop tarts. Two focus groups of consumers are selected at random and their opinions,
after blind-tasting the product (‘I would never buy this product’ or ‘I would definitely
buy this product’), recorded. Both groups are then offered the product with the
modified recipe and their revised opinions recorded. The question that arises is
whether the change in recipe yielded the desired effect.

In this case, we have two groups who are recorded before and after, and we recognise
that we are dealing with paired samples. To solve this problem, we can use McNemar’s
test for two sets of nominal data that are randomly sampled. Table 8.19 shows
consumers’ before and after preferences.

After
Before
Would never buy Would buy
Would never buy 138 42
Would buy 21 99
Table 8.19 Before versus after buying intentions

The question is whether the change in recipe has been successful and moved those
consumers who would never buy this product to the group of those that would buy it.
To simplify the problem, we shall look at whether the proportion of those who would
never buy has significantly changed.

Our hypotheses are:

H0: Proportion stating that they would never buy has not changed

H1: Proportion stating that they would never buy has changed

Page | 451
In mathematical notation this can be written as: H0: π1 = π2, H1: π1 ≠ π2, where π1 is the population proportion stating they would never buy before the recipe change and π2 is
the population proportion stating that they would never buy after the recipe change.

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

Given that the population proportions are π1 and π2, respectively denoting the proportion stating that they would never buy before and after the recipe change, the null and alternative hypotheses are as follows:

H0: π1 = π2

H1: π1 ≠ π2

The ≠ symbol indicates a two-tailed test.

Step 2 Select the test

• Sample data have been randomly selected.


• The sample consists of matched pairs.
• The data are at the nominal level of measurement.
• Since b = 42 and c = 21, b + c = 63 ≥ 25, we can use a normal distribution or an
exact binomial solution. Let us use McNemar’s z test and compare it with the
exact binomial test.

Step 3 Set the level of significance

α = 0.05

Step 4. Extract relevant statistic

Using McNemar’s z test, we have, by equation (8.10):

Zcal = (b − c)/√(b + c) = (42 − 21)/√(42 + 21) = 21/√63 = 2.6458

For a two-tail critical z value with 5% significance, Zcri = ±1.96.

Binomial solution

Page | 452
For comparison, the exact two-tailed binomial solution is given by solving
equation (8.11):
Exact binomial 2-tail p-value = 2 × Σ (from k = b to n) nCk (0.5)^k (1 − 0.5)^(n−k)

This is simply twice the binomial tail probability with p = 0.5, b = 42 and n = b + c = 42 + 21 = 63. We obtain:

p = 2 × Σ (from k = 42 to 63) 63Ck (0.5)^k (0.5)^(63−k)

p = 2 × [63C42 (0.5)^42 (0.5)^21 + 63C43 (0.5)^43 (0.5)^20 + …… + 63C63 (0.5)^63 (0.5)^0]

This is quite a lengthy calculation to solve manually, so it is a good idea to use a


software package like Excel as illustrated in Figure 8.36.

Figure 8.36 Calculation of the 2-tail binomial p-value

From Figure 8.36 (cell F3):

p = 0.011141

Page | 453
Step 5 Make a decision

McNemar’s z test solution

Zcal = 2.6458

Critical Zcri = ±1.96

Our calculated test statistic is Zcal = 2.6458, and the Zcri is ± 1.96. We
observe that Zcal lies in the rejection region (2.6458 > 1.96), so we accept
the alternative hypothesis, H1.

We conclude that there is a significant difference in ‘never would buy’


intentions after the recipe change compared with before the recipe
change.

Binomial solution

For comparison, the two-tailed binomial exact p-value is 0.0111, which is


less than the significance level of 0.05, so we accept the alternative
hypothesis, H1.

We conclude that there is a significant difference in ‘never would buy’


intentions after the recipe change compared with before the recipe
change.
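
The McNemar calculations above can also be reproduced with a few lines of Python. This is a minimal sketch, assuming scipy is available; b and c are the discordant cell counts from Table 8.19 and the variable names are our own.

from math import sqrt
from scipy.stats import norm, binom

b, c = 42, 21                            # discordant pairs from Table 8.19
n = b + c

# McNemar z test, equation (8.10)
z_cal = (b - c) / sqrt(n)
p_z = 2 * norm.sf(abs(z_cal))

# Exact binomial two-tailed p-value, equation (8.11): 2 * P(X >= b) with X ~ Bin(n, 0.5)
p_exact = 2 * binom.sf(b - 1, n, 0.5)

print(round(z_cal, 4), round(p_z, 4))    # approx. 2.6458 and 0.0082
print(round(p_exact, 4))                 # approx. 0.0111

Both p-values lead to the same decision as the manual solution above.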

Excel solution

Figures 8.37 and 8.38 illustrate the Excel solution for McNemar’s z and 2 tests.

Figure 8.37 Example 8.4 Excel solution

Page | 454
Figure 8.38 Example 8.4 Excel solution continued

Using McNemar’s method, we have:

McNemar’s Z-test statistic = 2.6458


Two-tailed critical z statistic = ±1.96
Two-tailed Z-test p-value = 0.0082.

Given Z = 2.6458 > + 1.96, reject the null hypothesis and accept the alternative
hypothesis.

Given the two-tailed p-value = 0.0082 < 0.05, reject the null hypothesis and
accept the alternative hypothesis.

Using the binomial exact solution, we have:

two-tailed binomial p-value = 0.0111.

Given two-tailed binomial p-value = 0.0111 < 0.05, reject the null hypothesis and
accept the alternative hypothesis.

We observe that the manual and Excel solutions agree.

Page | 455
We conclude that there is a significant difference in ‘never would buy’ intentions after
the recipe change compared with before the recipe change.

SPSS solution

Enter data

Figure 8.39 Example 8.4 SPSS data

Code: Before (1=No, 2=Yes) and After (1=No, 2=Yes).

Given this is frequency data we will weight cases by count (frequency value).

Select Data > Weight Cases > transfer Count variable into the Frequency Variable: box
(Figure 8.40).

Figure 8.40 SPSS Weight Cases menu

Click OK.

Select Analyze > Descriptive Statistics > Crosstabs

Transfer Before into Row(s) box


Transfer After into Column(s) box
Select Statistics and choose McNemar’s.
Select Cells and choose Counts (Observed and Expected) and Percentages (Rows
and Columns).

Page | 456
Figure 8.41 SPSS Crosstabs menu

Click OK.

SPSS output

SPSS outputs descriptive statistics as shown in Figure 8.42 and test statistics as shown
in Figure 8.43. In SPSS, the binomial distribution is used for McNemar's test.

Figure 8.42 Example 8.4 SPSS solution continued

Page | 457
Figure 8.43 Example 8.4 SPSS solution continued

Conclusions

Given that the 2-tail binomial exact p-value = 0.0111 < significance level of 0.05, we reject the null hypothesis and accept the alternative hypothesis H1. We conclude that there is a significant difference in ‘never
would buy’ intentions after the recipe change compared with before the recipe change.

This agrees with the manual and Excel solution given in Figure 8.38.

Check your understanding

X8.8 A business analyst requests answers to the following questions:

a. What is the p-value when the χ2 test statistic is 2.89 and we have 1 degree of freedom?
b. If you have 1 degree of freedom, what is the value of the corresponding z test statistic?
c. Find the critical χ2 value for a significance level of 1% and 5%.

X8.9 During the summer of 2018 petrol prices have raised concerns with new car
sellers that potential customers are taking prices into account when choosing a
new car. To provide evidence to test this possibility a group of five local car
showrooms agree to ask fleet managers and individual customers during August
2018 whether they are or are not influenced by petrol prices. The results are as
shown in Table 8.20. At the 5% level of significance, is there any evidence for the
concerns raised by the car showroom owners? Answer this question using both
the critical test statistic and p-value.

Are petrol prices influencing you in purchasing?    Fleet customers    Individual customers
Yes                                                  56                 66
No                                                   23                 36
Table 8.20 Are customers influenced by petrol prices?

X8.10 A business analyst has been asked to confirm the effectiveness of a marketing
campaign on people’s attitudes to global warming. To confirm that the campaign
was effective a group of 500 people were randomly selected from the population.
They were asked whether they agree that national governments should be
concerned, answering ‘yes’ or ‘no’. The results are as shown in Table 8.21. At the
5% level of significance, is there any evidence that the campaign has increased

Page | 458
the number of people requesting that national governments should be concerned
that global warming is an issue? Answer this question using both the critical test
statistic and p-value.

                         After campaign
Before campaign          Yes          No
Yes                      202          115
No                       89           75
Table 8.21 Attitude to global warming

8.3 Nonparametric tests


Many statistical tests require that your data follow a normal distribution. Parametric
statistical hypothesis tests assume that the data on which they are applied possess
certain characteristics or parameters:

1. The two samples are independent of one another.


2. The two populations have equal variance or spread.
3. The two populations are normally distributed.

There is no getting around assumption 1. That assumption must be satisfied for a t-test.
When assumptions 2 and 3 (equal variance and normality) are not satisfied but the
samples are large (say, greater than 30), the results are approximately correct. But
when our samples are small and our data is skewed or non-normal, we probably should
not place much faith in the t test.

This is where nonparametric tests come in, given they can solve statistical problems
where assumptions 2 and 3 are violated. Whereas the null hypothesis of the two-sample
t test is equal means, the null hypothesis of nonparametric tests is concerned with equal
medians.

Another way to think of the null hypothesis is that the two populations have the same
distribution with the same median. If we reject the null, that means we have evidence
that one distribution is shifted to the left or right of the other as illustrated in Figure
8.44.

Figure 8.44 Two distributions with the same shape but different median values

Page | 459
Since we are assuming the two distributions have the same shape, rejecting the null hypothesis means we have evidence that the medians of the two populations differ. All the methods
described in this section use the method of ranks rather than a distribution shape (e.g.,
the population or sampling distribution does not follow a normal or Student’s t
distribution) to carry out an appropriate nonparametric hypothesis test. We use the
median rather than the mean as a measure of the centre of the distribution. The
decision often depends on whether the mean or median more accurately represents the
centre of your data distribution:

• If the mean accurately represents the centre of your distribution and your
sample size is large enough, consider a parametric test.
• If the median better represents the centre of your distribution, consider a
nonparametric test.

Given that nonparametric tests use less of the information in the data than parametric tests, they generally have less power than parametric tests; however, this loss of power mainly applies when the data really are normally distributed. In addition, if you have a very small sample size, you might be stuck with using a nonparametric test (or collect more data next time if possible!). Note that the sample size guidelines for using a parametric test are not really that demanding. Your chance of detecting a significant effect, when one exists, can be very small when you have both a small sample size and you need to use a less efficient nonparametric test.

In this section, we shall explore three nonparametric tests: the sign test, Wilcoxon
signed-rank test, and Mann–Whitney U test. Table 8.22 compares the nonparametric
tests with the equivalent parametric tests for one- and two-sample tests discussed in
Chapter 7.

Test                      Parametric test                     Nonparametric test
One sample                One sample z test                   Sign test
                          One sample t test                   Wilcoxon signed-rank test
Paired samples            Two paired sample z test            Sign test
                          Two paired sample t test            Wilcoxon signed-rank test
Independent samples       Two independent sample t test       Mann–Whitney U test (Wilcoxon rank sum test)
Table 8.22 Comparison of nonparametric versus equivalent parametric tests

Sign test

The sign test is used to test the null hypothesis that the median of a distribution is equal
to some value. It can be used (a) in place of a one-sample t test, (b) in place of a paired t
test or (c) for ordered categorial data where a numerical scale is inappropriate but
where it is possible to rank the observations.

The assumptions of the sign test are as follows:

1. The sign test is a nonparametric (distribution-free) test, so we do not assume that


the data are normally distributed.

Page | 460
2. Data should be one or two samples. The population may differ for the two samples.
3. Dependent samples should be paired or matched.

Types of sign test

The sign test is used to test the null hypothesis that the median of a distribution is equal
to some value:

a. A one sample sign test where the sample median value is compared to the
hypothesized median value.
b. A two-sample sign test where we compare if two sample medians provide evidence
that the population median difference is zero.

The one-sample and two-sample sign tests replace the one-sample t-test and two-
sample paired t-tests where we have evidence to suggest the data is not normally
distributed.

The observations in a random sample of size n are X1, X2, …, Xn (these observations
could be paired differences); the null hypothesis is that the population median is equal
to some value M. Suppose that X+ of the observations are greater than M and X− are
smaller than M (in the case where the sign test is being used in place of a paired t test, M
would be zero). Values of X which are exactly equal to M are ignored; the sum of X+ and
X− may therefore be less than n; we will denote it by n′.

Hypothesis statements

Under the null hypothesis we would expect half the X’s to be above the median and half
below. Therefore, under the null hypothesis both X+ and X− follow a binomial
distribution with probability of success p = 0.5 and n = n′.

We will assume that our random variable X is a continuous random variable with
unknown median M. Upon taking a random sample X1, X2, X3, …., Xn we are interested in
testing whether the median M takes on a value Ma.

That is, we are interested in testing the null hypothesis:

H0: M = Ma

Against any of the possible alternative hypotheses, H1:

H1: M > Ma

or

H1: M < Ma

or

H1: M ≠ Ma

Page | 461
Binomial exact solution

In this case, the probability distribution is a binomial distribution with probability (or
proportion) of success = 0.5 and the number of trials represented by the number of
paired observations (n).

In this case we can model the situation using a binomial distribution X ~ Bin(n, p). The
value of the binomial probability is given by equation (8.12):

P(X = x) = nCx p^x (1 − p)^(n−x)        (8.12)

This value given by the binomial equation represents an exact p-value. The p-value for a sign test is found using the binomial distribution where p = hypothesised proportion of positive differences = 0.5, n = number of non-zero differences, and x = number of positive differences:

1. Upper one-tail (H1: p > 0.5): p-value = probability of obtaining the observed number of positive differences or more = P(X ≥ x)
2. Lower one-tail (H1: p < 0.5): p-value = probability of obtaining the observed number of positive differences or fewer = P(X ≤ x)
3. Two-tail (H1: p ≠ 0.5): p-value = twice the corresponding one-tail probability, e.g. 2P(X ≥ x)

Solution procedure:

1. List the data values X1, X2, …., Xn.

2. List the values that are greater than M and those that are less than M.
3. Remove the data values equal to M.
4. Count the number of values greater than M (positive differences) to give x+.
5. Count the number of values less than M (negative differences) to give x–.
6. Choose the value of x depending upon the alternative hypothesis statement from x–, x+.
7. Calculate P(X ≥ x) given the binomial distribution with p = 0.5 and n = n′.
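
This procedure is short enough to write as a small Python function. The sketch below is our own illustration, assuming scipy is available; the function name sign_test_two_sided is not a library routine, and it covers only the two-tailed case.

from scipy.stats import binom

def sign_test_two_sided(differences):
    """Exact two-sided sign test on a list of paired differences
    (or sample values minus the hypothesised median)."""
    non_zero = [d for d in differences if d != 0]    # step 3: drop zero differences
    n = len(non_zero)                                # adjusted sample size n'
    x_plus = sum(d > 0 for d in non_zero)            # step 4: count positive differences
    x_minus = n - x_plus                             # step 5: count negative differences
    x = max(x_plus, x_minus)                         # step 6 for a two-sided test
    p_value = 2 * binom.sf(x - 1, n, 0.5)            # step 7: 2 * P(X >= x) with p = 0.5
    return min(1.0, p_value)

Passing the column of paired differences from a worked example (such as the d = B – A column in Example 8.5 below) returns the exact two-tailed binomial p-value.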

Example 8.5

To illustrate the concept, consider a sales manager who oversees 16 inside and 16
outside sales representatives. Every inside sales representative is paired with an
outside sales representative. After interviewing all of them, the sales manager assigns
values to the level of understanding that every inside and every outside salesperson
have of their shared territories (see Table 8.23). A mark of 1 means great insight and a
mark of 5 means little or no insight.

The null hypothesis statement is that there is the same level of understanding and
insight of their respective territories among the inside and outside salespeople.

The alternative hypothesis is that the outside sales staff have a different insight
compared to the inside sales staff.

Page | 462
Territory A B Territory A B
1 3 4 9 4 5
2 4 3 10 5 4
3 3 5 11 2 4
4 5 3 12 2 1
5 3 2 13 4 3
6 5 3 14 5 2
7 1 2 15 2 5
8 4 2 16 4 5
Table 8.23 Sales representatives levels of understanding

The five-step procedure to conduct this test progresses as follows

Step 1 State hypothesis

H0: The median of the differences is zero.


H1: The median of the differences is not zero.
Two-tailed test

Step 2 Select test

• Two dependent samples, both samples consist of ordinal data assigned by the
sales manager, and no information on the form of the distribution.
• Conduct sign test.

Step 3 Set the level of significance

Significance level  = 0.05

Step 4 Extract relevant statistic

The solution process can be broken down into a series of steps:

Enter data
Denote the sales manager’s marks for inside salespeople by A and for outside
salespeople by B.

Calculate the differences d = B – A.


State the sign (+ or –) for each paired difference.

Territory A (Inside) B (Outside) d=B-A Sign


1 3 4 1 +
2 4 3 -1 -
3 3 5 2 +
4 5 3 -2 -
5 3 2 -1 -
6 5 3 -2 -
7 1 2 1 +

Page | 463
8 4 2 -2 -
9 4 5 1 +
10 5 4 -1 -
11 2 4 2 +
12 2 1 -1 -
13 4 3 -1 -
14 5 2 -3 -
15 2 5 3 +
16 4 5 1 +
Table 8.24 Calculation of the positive and negative signs

The median difference, d = median(1, −1, 2, …., 1) = −1.0

The median difference for B – A is equal to −1.0, which suggests that the outside sales staff tend to receive lower marks than the inside sales staff.

Allocate ‘+’ and ‘–‘, depending on whether d > 0 or d < 0 as illustrated in Table
8.24.

Total number of trials, n = 16

From Table 8.24:

Number of – (negative signs), x– = 9

Number of + (positive signs), x+ = 7

Calculate the number of paired values that give d = 0, and find x and n′

Number of values with d equal to zero, n0 = 0

Calculate x = max(x–, x+) = max (9,7) = 9

Adjust n to remove the number of values with d equal to zero, n′ = n – n0 =


16

Calculate binomial probabilities, P(X ≥ x)

This problem can be solved exactly given this is a binomial distribution


with x successes from n′ trials when the probability of each trial p = 0.5.

• Probability of success, p = 0.5.

• Number of trials, n’ = 16.

• Number of successes, x = 9.

Solve P(X ≥ x) when x = 9, n′ = 16. The corresponding binomial p-value is


calculated from equation (8.12):

Page | 464
p = P(X ≥ x) = Σ (from x = 9 to 16) 16Cx (0.5)^x (1 − 0.5)^(16−x)

In this case we wish to solve the problem:

p = P(X ≥ 9) = P(X = 9, 10, 11, 12, 13, 14, 15, 16)

p = P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12) + P(X=13) + P(X=14) +


+ P(X=15) + P(X=16)

To find the binomial probability P(X ≥ 9) where n = 16 and p = 0.5:

P(X = x) = 16Cx p^x q^(n−x)

P(X ≥ 9) = P(X=9) + P(X=10) + …… + P(X=16)

P(X = 9) = 16C9 (0.5)^9 (1 – 0.5)^(16−9)

P(X = 9) = 16C9 (0.5)^9 (0.5)^7

P(X = 9) = 16C9 (0.5)^16

Remember that

nCx = n! / (x! (n − x)!)

16C9 = 16! / (9! (16 − 9)!) = 16! / (9! 7!)

16C9 = (16 × 15 × 14 × 13 × 12 × 11 × 10 × 9!) / (9! 7!)

16C9 = (16 × 15 × 14 × 13 × 12 × 11 × 10) / (7 × 6 × 5 × 4 × 3 × 2 × 1)

16C9 = 57657600 / 5040

16C9 = 11440

Page | 465
Therefore,

P(X = 9) = 16C9 (0.5)^16

P(X = 9) = 11440 × 1.525878907E−5

P(X = 9) = 0.1746

Repeat the calculations for the other terms to give P(X ≥ 9) as illustrated in Table 8.25.

Solve P(X ≥ 9)
P(X = 9) = 0.174561
P(X = 10) = 0.122192
P(X = 11) = 0.066650
P(X = 12) = 0.027771
P(X = 13) = 0.008545
P(X = 14) = 0.001831
P(X = 15) = 0.000244
P(X = 16) = 0.000015
Table 8.25 Binomial probabilities

p = P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12) + P(X=13) + P(X=14) +


+ P(X=15) + P(X=16)

p = P(X ≥ 9) = 0.1746 + 0.1222 + ……. + 0.0002 + 0.0000

P(X ≥ 9) = 0.402

This represents the one-tailed binomial p-value.

Step 5 Make a decision

Given we have a two-tailed test, then the 2-tailed binomial p-value = 2 x 0.402 = 0.804.
Given 2-tailed binomial p-value = 0.804 > 0.05, fail to reject the null hypothesis.
Conclude the evidence suggests that we have no significant difference at a 5%
significance level.

Large sample sign test – Normal approximation

If n is sufficiently large (n > 30), we can use a normal approximation with the value of
the population mean and standard deviation given by equations (3.23) and (3.24): n′ =
16, p = 0.5.

For a binomial distribution:

Binomial population mean = n × p

Page | 466
Binomial population variance = n × p × q

Therefore, for n sufficiently large, we can approximate the binomial with a normal distribution:

Normal population mean, μ = np

Normal population variance, σ^2 = npq

Normal population standard deviation, σ = √(npq)

Therefore, for the previous example:

μ = np = 16 × 0.5 = 8

σ^2 = npq = 16 × 0.5 × (1 – 0.5) = 4

σ = √(npq) = √4 = 2

Now we need to solve P(X ≥ 9) for the binomial distribution. Remember that the binomial distribution is a discrete distribution and we wish to approximate it with the normal distribution, which is a continuous distribution. Therefore, applying a continuity correction, the binomial value X ≥ 9 corresponds to the normal value X ≥ 8.5:

P(X ≥ 9, binomial) = P(X ≥ 8.5, normal)

Now calculate the value of the Z test statistic

Zcal = (X − μ)/σ = (8.5 − 8)/2 = 0.25

Calculate the critical z-test statistic

From the standardised normal table, Zcri = ±1.96 given a two-tailed test at a 5% significance level.

Decision

Given that Zcal = 0.25 lies between –1.96 and +1.96, we fail to reject the null hypothesis. We conclude that outside salespeople have the same level of insight into their sales territories as inside salespeople.
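
Both the exact binomial p-value and the normal approximation for Example 8.5 can be checked with a short Python sketch (scipy assumed available; variable names are our own).

from scipy.stats import binom, norm

n, x = 16, 9                                 # non-zero differences and the larger sign count

p_exact_one_tail = binom.sf(x - 1, n, 0.5)   # P(X >= 9) = 0.402
p_exact_two_tail = 2 * p_exact_one_tail      # exact two-tail p-value, 0.804

mu = n * 0.5                                 # mean of the approximating normal, 8
sigma = (n * 0.5 * 0.5) ** 0.5               # standard deviation, 2
z = (x - 0.5 - mu) / sigma                   # continuity correction: P(X >= 9) -> P(X >= 8.5)
p_norm_two_tail = 2 * norm.sf(z)             # approx. 0.803

print(round(p_exact_two_tail, 3), round(p_norm_two_tail, 3))

The two routes agree closely, matching the Excel and SPSS results that follow.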

Excel solution

Figures 8.45 and 8.46 illustrate the Excel solution. Values labelled as A represent the
inside salespeople, and B are the outside salespeople. The territories to which every
matched pair belongs are marked 1, …, 16. They are considered matched because they
exchange and discuss the business issues related to their territory.

Page | 467
Figure 8.45 Example 8.5 Excel solution

Observe in Figure 8.45 that we created extra columns called d = B – A and Sign. These are calculated as follows: d = B – A (cell E4, formula: =D4-C4, copy formula down E4:E19), and Sign (cell G4, formula: =IF(E4<0,"-",IF(E4>0,"+","0")), copy formula down G4:G19).

Figure 8.46 Example 8.5 Excel solution continued

Identify rejection region using the p-value method

Page | 468
From Excel, the two-sided p-value is 0.804 (cell L37). Does the test statistic lie in the
rejection region? Compare the chosen significance level (α) of 5% (or 0.05) with the
calculated two-sided p-value of 0.804. We observe that the p-value is greater than α, and
we fail to reject the null hypothesis, H0. We conclude that outside salespeople have the same level of insight into their sales territories as inside salespeople.

Normal approximation

Figure 8.47 Example 8.5 Excel solution continued

From Excel:

P(X normal ≥ 8.5) = 0.40129

Therefore, two-tail p-value = 2 × P(X normal ≥ 8.5) = 2 × 0.40129 = 0.803

Given two-tail p-value = 0.803 > 0.05, fail to reject the null hypothesis, H0.

We conclude that outside salespeople have the same level of insight into their sales territories as inside salespeople.

SPSS solution

Enter data into SPSS (Figure 8.48).

Page | 469
Figure 8.48 Example 8.5 SPSS data

Select Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related samples.

Transfer variables A and B into Test Pairs box (Figure 8.49).

Figure 8.49 SPSS Two-Related-Samples Tests menu

Click on Exact and choose Exact (Figure 8.50).

Figure 8.50 SPSS Exact Tests option

Page | 470
Click Continue.

Click on Options.

Choose Descriptives and select Descriptives and Quartiles (Figure 8.51).

Figure 8.51 SPSS Two-Related-Samples Tests menu


Click Continue
Click OK.

SPSS output

Figures 8.52 – 8.54 represent the SPSS solutions

Figure 8.52 SPSS solution

Figure 8.53 SPSS solution continued

Page | 471
Figure 8.54 SPSS solution continued

The output is shown in Figures 8.52–8.54. From Figure 8.54, the two-tailed p-
value is 0.804. This is the same as with the Excel p-value of 0.804 for the exact
binomial method.

Conclusions

Conclude that outside salespeople have the same insight into their sales territories as inside salespeople [two-tail p-value = 0.804 > 0.05].

Comparing one sample against a population median

If you have one sample, then you can conduct a sign test by using the
binomial test in Excel and SPSS. For example, consider a single sample of
results with the null hypothesis H0: median = 22, and the alternative
hypothesis H1: median ≠ 22.

Excel solution

Change the variable labelled B to the value of the population median = 22.

SPSS solution

Sample data X.
Compute a new variable called newX = X – 22. The 22 is the value from the
null hypothesis.

Now remove any zero values for newX. In SPSS, Select Data > Select Cases.

Click on If condition is satisfied and move the variable newX into Numeric
Expression box and add ~=0 (should read newX ~= 0). The SPSS symbol ~= represents ≠ (not equal to), so this step excludes the zero differences from the analysis.

Now select Analyze > Nonparametric Tests > Legacy Dialogs > Binomial.
Transfer the new variable newX to the Test Variable box.

Under Define Dichotomy, choose Cut point = 0.


Test proportion = 0.5.

Click OK.
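
The same one-sample sign test can be sketched in Python. The data values below are purely illustrative (only the hypothesised median of 22 comes from the text), and scipy is assumed to be available.

from scipy.stats import binom

sample = [25, 19, 27, 30, 22, 24, 26, 18, 28, 23]     # hypothetical observations
hypothesised_median = 22

# Differences from the hypothesised median, with zero differences removed
diffs = [x - hypothesised_median for x in sample if x != hypothesised_median]
n = len(diffs)                           # n' = 9 non-zero differences
x_plus = sum(d > 0 for d in diffs)       # 7 values above the hypothesised median
x = max(x_plus, n - x_plus)
p_two_tail = min(1.0, 2 * binom.sf(x - 1, n, 0.5))
print(n, x_plus, round(p_two_tail, 3))   # 9, 7, approx. 0.180 - so H0: median = 22 is not rejected here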

Page | 472
Check your understanding

X8.11 A researcher has carried out a sign test with the following results: the number of positive and negative signs is 15 and 4, respectively, with 3 ties. Given that
binomial p = 0.5, assess (at the 5% significance level) whether there is evidence
that the median value is greater than 0.5.

X8.12 A teacher of 40 university students studying the application of Excel within a


business context is concerned that students are not taking a group work
assignment seriously. This is deemed to be important given that the group work
element is contributing to the development of personal development skills. To
assess whether this is a problem the module tutor devised a simple experiment
which judged the individual level of cooperation by each individual student
within their own group. In the experiment, a rating scale was employed to
measure the level of cooperation: 1 = limited cooperation, 5 = moderate
cooperation and 10 = complete cooperation. The testing consisted of an initial
observation, a lecture on working in groups, and a final observation. Given the
raw data in Table 8.26 conduct a test to assess (at the 5% significance level)
whether we can observe that cooperation has significantly changed.

5, 8 4, 6 3, 3 6, 5 8, 9 10, 9 8, 8 4, 8 5, 5 8, 9
3, 5 5, 4 6, 5 4, 4 7, 8 7, 9 9, 9 8, 7 5, 8 5, 6
8, 7 8, 8 3, 4 5, 6 6, 7 4, 8 7, 8 9, 10 10, 10 8, 9
8, 8 4, 6 4, 5 7, 8 5, 7 7, 9 8, 10 3, 6 5, 6 7,8
Table 8.26 Data in pairs. The number before the comma sign is assessment
before the experiment, and after the comma is after the experiment

X8.13 A leading business training firm advertises in its promotional material that class
sizes at its Paris branch are no greater than 25. Recently the firm has received
feedback from many disgruntled students complaining that class sizes are
greater than 25 for most of its courses in Paris. To assess this claim, the company
randomly selects 15 classes and measures the class sizes as follows: 32, 19, 26,
25, 28, 21, 29, 22, 27, 28, 26, 23, 26, 28, and 29. Carry out an appropriate test to
assess (at the 5% significance level) whether there is any justification to the
complaints (assess at 5%). What would your decision be if you assessed at the
1% significance level?

Wilcoxon signed-rank test for matched pairs

The Wilcoxon signed-rank test is the nonparametric equivalent of the paired two-sample t test. It is used in those situations in which the observations are paired, and you have not met the assumption of normality.

The Wilcoxon signed-rank test assumptions are:

1. Each pair is chosen randomly and independently.


2. Data are paired and come from the same population.

Page | 473
3. Both samples consist of quantitative or ordinal (rank) data. Remember your
quantitative data will be converted to rank data.
4. The distribution of the paired differences is a symmetric distribution or at least
not very skewed.

If the fourth assumption is violated, then you should consider using the sign test, which
does not require symmetry.

As for the sign test, the Wilcoxon signed-rank test is used to test the null hypothesis that
the median of a distribution is equal to some value. The method considers the
differences between n matched pairs as one sample. If the two population distributions
are identical, then we can show that the sample statistic has a symmetric null
distribution.

When the number of paired observations is small (n ≤ 20) we need to consult tables; but
when the number of paired observations is large (n > 20) we can use a test based on the
normal distribution. Although the Wilcoxon signed-rank test assumes neither normality
nor homogeneity of variance, it does assume that the two samples are from populations
with the same distribution shape. It is also vulnerable to outliers, although not to nearly
the same extent as the t test. If we cannot make this assumption about the distribution,
then we should use a test called the sign test for ordinal data. McNemar’s test is
available for nominal paired data relating to dichotomous qualitative variables.

In this section we shall apply the Wilcoxon signed-rank test where we have a large and
small number of paired observations. In the case of many paired observations (n > 20)
we shall use a normal approximation to provide a test of our hypothesis. Furthermore, for
many paired observations we shall use Excel to calculate both the p-value and critical z
value to decide. The situation of a small number of paired observations (n ≤ 20) will be
described together with an outline of the solution process.

The solution process can be broken down into a series of steps:

1. State hypotheses H0 and H1.


2. Calculate the differences between pairs (d = X – Y).
3. If the difference between pairs d = 0, then remove this data pair from the
analysis.
4. Record the sign of the difference in one column, the absolute value of the
difference in the other column.
5. Rank the absolute differences from the smallest to the largest.
6. Reattach the signs of the differences to the respective ranks to obtain signed
ranks, then average to obtain the mean rank.
7. Calculate the number of paired values and adjust for zero differences (n0): n′ = n – n0.
8. Calculate the sum of the ranks, T– and T+.
9. State Tcal = minimum value of T– and T+.
10. Use statistical tables for the Wilcoxon signed-rank test to find the probability of
observing a value of T or lower. This is your p-value if the test is one-sided. If
your alternative hypothesis is two-sided, then double this probability to give the
p-value. For large samples (n > 20), the T statistic is approximately normally

Page | 474
distributed under the null hypothesis that the population differences are centred
at zero.

We shall solve the Wilcoxon signed-rank test using the normal approximation.

Example 8.6

A study is made to determine whether there is a difference between husbands' and wives' attitudes towards online marketing advertisements. A questionnaire measuring this was given to 24 couples, with the results summarised in Table 8.27 (ordinal scale from 0 (hate) to 20 (love)). Is there a significant difference between the couples' attitudes to the online advertisements at a 5% level of significance?

ID Wife, W Husband, H
1 15 17
2 8 19
3 11 18
4 19 19
5 13 17
6 4 5
7 16 13
8 5 3
9 9 16
10 15 21
11 12 12
12 11 9
13 14 10
14 4 17
15 11 12
16 17 24
17 14 12
18 5 12
19 9 8
20 8 16
21 9 12
22 11 7
23 11 17
24 12 13
Table 8.27 Survey outcomes

The five-step procedure to conduct this test progresses as follows

Step 1 State hypothesis

Page | 475
Under the null hypothesis, we would expect the distribution of the differences to
be approximately symmetric around zero and the distribution of positives and
negatives to be distributed at random among the ranks.

H0: no difference in attitudes to adverts between husband and wife

Median W - Median H = 0

H1: difference in attitudes to adverts between husband (H) and wife (W)

Median W - Median H ≠ 0
Two-tailed test at 5% significance

Step 2 Select test

• Two dependent samples.


• Both samples consist of ordinal (rank) data.
• No information on the form of the distribution but we shall assume the
differences follow a symmetric distribution.
• Wilcoxon signed-rank test.

Step 3 Set the level of significance

 = 0.05

Step 4 Extract relevant statistic

Calculate the difference (d = W – H), and sign of the difference

Please note that the pairs where the difference d = W – H = 0 (shaded green in the workbook) are not included in the analysis.

ID  Wife, W  Husband, H  Difference, d = W - H  Sign of difference
1 15 17 -2 -
2 8 19 -11 -
3 11 18 -7 -
4 19 19 0
5 13 17 -4 -
6 4 5 -1 -
7 16 13 3 +
8 5 3 2 +
9 9 16 -7 -
10 15 21 -6 -
11 12 12 0
12 11 9 2 +
13 14 10 4 +

14 4 17 -13 -
15 11 12 -1 -
16 17 24 -7 -
17 14 12 2 +
18 5 12 -7 -
19 9 8 1 +
20 8 16 -8 -
21 9 12 -3 -
22 11 7 4 +
23 11 17 -6 -
24 12 13 -1 -
Table 8.28 Calculation of sign of the differences

Calculate the magnitude of the differences (d = W – H) and rank these absolute
differences, noting whether each difference is positive (R+) or negative (R−)

ID Magnitude of difference, d = W - H Rank difference, R


1 2.0 6.5
2 11.0 21.0
3 7 18.5
4
5 4 12
6 1 2.5
7 3 9.5
8 2 6.5
9 7 18.5
10 6 14.5
11
12 2 6.5
13 4 12
14 13 22
15 1 2.5
16 7 18.5
17 2 6.5
18 7 18.5
19 1 2.5
20 8 20
21 3 9.5
22 4 12
23 6 14.5
24 1 2.5
Table 8.29 Magnitude and rank of differences

Identify positive and negative ranks

ID Positive ranks, R+ Negative ranks, R-


1 6.5
2 21
3 18.5
4
5 12
6 2.5
7 9.5
8 6.5
9 18.5
10 14.5
11
12 6.5
13 12
14 22
15 2.5
16 18.5
17 6.5
18 18.5
19 2.5
20 20
21 9.5
22 12
23 14.5
24 2.5
Table 8.30 Identification of positive and negative ranks

Calculate the median difference

Median difference d = median(-2, -11, -7, ..., 4, -6, -1) = -1.5

The median difference is – 1.5 which supports the alternative hypothesis


that d ≠ 0. The question then is whether this difference is statistically
significant.

Identify difference values d = 0 and remove from the analysis

The fourth and eleventh paired data points give a difference value d = 0.
These pairs are then removed from the solution (cells shaded green in Tables
8.28 – 8.30).

Calculate number of paired values

Number of paired ranks, n = 24

Number of data pairs where d = 0 is 2, that is, n0 = 2.

Adjust n to remove data pairs with d = 0, n′ = n – n0 = 24 – 2 = 22.

Calculate the sums of the ranks, T+ and T-:

T+ = Sum of positive ranks

T+ = 9.5 + 6.5 + 6.5 + 12 + 6.5 + 2.5 +12

T+ = 55.5

T– = Sum of negative ranks

T– = 6.5+ 21+ 18.5+ 12 + 2.5 + 18.5 + 14.5 + 22 + 2.5 + 18.5 + 18.5 +


20 + 9.5 + 14.5 + 2.5

T– = 198.5

Calculate the Wilcoxon signed-ranks test statistic, Tcal

Tcal = min (T–, T+)

Tcal = min(198.5, 55.5)

Tcal = 55.5

Critical value

At this stage, we can use Wilcoxon signed-rank test critical tables to look up a
critical T value for a 5% significance level and with 22 paired data values.

Figure 8.55 Critical values of the Wilcoxon matched-pairs signed rank test

From Figure 8.55, the critical value of T when we have n = 22 and we are testing
at a 5% significance level is Tcri = 75.

Thus, Tcal < Tcri (55.5 < 75), reject the null hypothesis and accept the alternative
hypothesis.

Step 5 Make decision.

We conclude there is a significant difference between a wife and her husband's attitude
to online adverts.

Normal approximation to the Wilcoxon signed-rank test

For large samples (n > 20), the T statistic is approximately normally distributed under
the null hypothesis that the population differences are centred at zero. When this is
true, the mean and standard deviation values are given by equations (8.13) and (8.14).

μT = n′(n′ + 1) / 4                                        (8.13)

σT = √[ n′(n′ + 1)(2n′ + 1) / 24 ]                         (8.14)

Then, for large n, the distribution of the random variable, Z, is approximately standard
normal, where

ZT = (Tcal − μT) / σT                                      (8.15)

Applying equations (8.13)–(8.15) given n′ = 22:

Population mean:

μT = n′(n′ + 1) / 4 = 22 × (22 + 1) / 4 = 126.5

Population standard deviation:

σT = √[ n′(n′ + 1)(2n′ + 1) / 24 ] = √[ 22 × (22 + 1) × (2 × 22 + 1) / 24 ] = √(22770 / 24) = 30.802

Value of Z given Tcal = 55.5, μT = 126.5, and σT = 30.802:

ZT = (Tcal − μT) / σT = (55.5 − 126.5) / 30.802 = −2.305
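The same arithmetic can be reproduced with a short Python sketch (an illustration only, not part of the Excel workbook), using the values n′ = 22 and Tcal = 55.5 obtained above.

import math

n_prime = 22                # paired observations remaining after removing d = 0
T_cal = 55.5                # min(T-, T+) from Step 4

mu_T = n_prime * (n_prime + 1) / 4                                      # equation (8.13): 126.5
sigma_T = math.sqrt(n_prime * (n_prime + 1) * (2 * n_prime + 1) / 24)   # equation (8.14): ~30.802
z = (T_cal - mu_T) / sigma_T                                            # equation (8.15): ~-2.305
print(mu_T, round(sigma_T, 3), round(z, 3))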

Notes:

1. If the alternative hypothesis is upper one-tailed, then reject the null hypothesis if

   ZT = (Tcal − μT) / σT > + Zα                             (8.16)

2. If the alternative hypothesis is lower one-tailed, then reject the null hypothesis if

   ZT = (Tcal − μT) / σT < − Zα                             (8.17)

Decision

The calculated test statistic Zcal = -2.305.

The critical z values can be found from tables to give the two-tailed 5% Zcri = ± 1.96.

Does the test statistic lie within the rejection region?

Compare the calculated and critical z values to determine which hypothesis (H0
or H1) to accept. Given Zcal < lower Zcri (– 2.305 < –1.96), reject the null
hypothesis.

We conclude there is a significant difference between a wife and her husband's attitude
to online adverts.

Continuity correction for Z

The standardised Z test is a continuous distribution that provides an
approximation to the discrete T statistic (ranked data) by applying the continuity
correction shown in equation (8.18):

Zcal = ( |Tcal − μT| − 0.5 ) / σT                           (8.18)

For this example, Tcal = 55.5, μT = 126.5 and σT = 30.802. Therefore, the corrected Z value
= 2.2888 > 1.96, and we again conclude there is a significant difference between a wife
and her husband's attitude to online adverts.
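As a quick check, the continuity-corrected value can be reproduced in a couple of lines of Python (illustrative only), using the same three quantities.

T_cal, mu_T, sigma_T = 55.5, 126.5, 30.802
z_corrected = (abs(T_cal - mu_T) - 0.5) / sigma_T   # equation (8.18): ~2.2888 > 1.96, so reject H0
print(round(z_corrected, 4))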

Issue of tied ranks

If there are a large number of ties, then equation (8.14) can be replaced by
equation (8.19), which provides a better estimator of the standard deviation:

σT = √[ n(n + 1)(2n + 1)/24 − (1/48) Σ (fi³ − fi) ]          (8.19)

where i varies over the set of tied ranks and fi is the frequency with which the
rank i appears.

Excel solution

Figures 8.56 – 8.58 illustrate the Excel solution.

Figure 8.56

Figure 8.57

Figure 8.58

Figure 8.59

Figure 8.60

Figure 8.61

We conclude there is a significant difference between a wife and her husband's attitude
to online adverts.

Values not corrected for ties

The Excel values above are not corrected for ties. The tied groups of absolute
differences in Table 8.28 have frequencies f = 4 (|d| = 1), f = 4 (|d| = 2), f = 2
(|d| = 3), f = 3 (|d| = 4), f = 2 (|d| = 6) and f = 4 (|d| = 7), so the adjustment to
the variance in equation (8.19) is Σ(fi³ − fi)/48 = 216/48 = 4.5. If you update the
sigma value accordingly, then your value of Z will be very close to the SPSS solution.
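The tie adjustment can also be verified with a short Python sketch (illustrative only; the tie frequencies below are read from the absolute differences in Table 8.28). Alternatively, scipy.stats.wilcoxon applied to the raw paired data applies the same tie correction automatically and should report a two-tailed p-value close to the 0.021 given by SPSS.

import math

n = 22                         # non-zero differences
T_cal, mu_T = 55.5, 126.5
f = [4, 4, 2, 3, 2, 4]         # tied groups of |d| = 1, 2, 3, 4, 6 and 7
tie_adj = sum(fi**3 - fi for fi in f) / 48                      # = 4.5
sigma_T = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_adj)   # ~30.73
z = (T_cal - mu_T) / sigma_T                                    # ~-2.31, close to the SPSS value
print(round(sigma_T, 3), round(z, 3))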

SPSS solution

Enter data into SPSS

Figure 8.62 Example 8.6 SPSS data

Click on Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related Samples.

Transfer the two paired variables (the wife and husband scores) to the Test Pairs box.

Click on Wilcoxon test

Figure 8.63 SPSS Two-Related-Samples Tests menu

Click Options

Choose Descriptives and Quartiles

Figure 8.64 Options

Click OK

SPSS output

Figure 8.65 Example 8.6 SPSS solution

Figure 8.66 Example 8.6 SPSS solution continued

Figure 8.67 Example 8.6 SPSS solution continued

The rows labelled Asymptotic Sig. and Exact Sig. tell us the probability that a test
statistic of at least that magnitude would occur if there were no differences between
groups.

1. If you have a small sample use the Exact Sig. value.


2. If you have a large sample (n > 10) use the Asymptotic Sig. value. The test value Z
is approximately normally distributed for large samples.

From SPSS:
Z = - 2.311 with 2-tail asymptotic p-value = 0.021.

From Excel:
Z = - 2.305 with 2-tail asymptotic p-value = 0.021.

The two-tail p-value associated with the Z-score (- 2.311) is 0.021, which means that
there is a probability of 0.021 that we would get a value of Z as large as the one we have
if there were no effect in the population. Given two-tail p-value = 0.021 < 0.05
(significance level), we reject the null hypothesis and accept the alternative hypothesis.

Conclusion

We conclude there is a significant difference between a wife and her husband's attitude
to online adverts.

Check your understanding

X8.14 The Wilcoxon paired ranks test is more powerful than the sign test. Explain why.

X8.15 A company is planning to introduce new packaging for a product that has used
the same packaging for over 20 years. Before it decides on the new packaging the
company decides to ask a panel of 20 participants to rate the current and
proposed packaging (using a rating scale of 0–100, where higher scores are more
in favour of change); see Table 8.31. Is there any evidence that the new
packaging is more favourably received than the older packaging? Assess at the
5% significance level.

Participant Before After Participant Before After


1 80 89 11 37 40
2 75 82 12 55 68
3 84 96 13 80 88
4 65 68 14 85 95
5 40 45 15 17 21
6 72 79 16 12 18
7 41 30 17 15 21
8 10 22 18 23 25
9 16 12 19 34 45
10 17 24 20 61 80
Table 8.31 Rating of current and proposed packing

X8.16 A local manufacturer is concerned at the number of errors made by machinists in


the production of kites for a multinational retail company. To reduce the number
of errors being made the company decides to retrain all staff in a new set of
procedures. To assess whether the training worked, a random sample of 30
machinists was selected and the number of errors made before and after the
training recorded, as illustrated in Table 8.32. Is there any evidence that the
training has reduced the number of errors? Assess at the 5% significance level.

Machinist
1 2 3 4 5 6 7 8 9 10
Before 49 34 30 46 37 28 48 40 42 45
After 22 23 32 24 23 21 24 29 27 27
11 12 13 14 15 16 17 18 19 20
Before 29 45 32 44 49 28 44 39 47 41
After 23 29 37 22 33 27 35 32 35 24
21 22 23 24 25 26 27 28 29 30
Before 33 38 35 35 47 47 48 35 41 35
After 37 37 24 23 23 37 38 30 29 31
Table 8.32 Number of errors

Mann–Whitney U test for two independent samples

The Mann–Whitney U test is a nonparametric test that can be used in place of an


unpaired t test. It is used to test the null hypothesis that two samples come from the

same population (i.e. have the same median) or, alternatively, whether observations in
one sample tend to be larger than observations in the other. This test is also called the
Mann-Whitney-Wilcoxon test and is equivalent to the Wilcoxon Rank Sum test.

Although it is a nonparametric test it does assume that the two distributions are similar
in shape. Where the samples are small, we need to use tables of critical values to
determine whether to reject the null hypothesis. Where the sample is large, we can use
a test based on the normal distribution.

The basic premise of the test is that once all the values in the two samples are put into a
single ordered list, if they come from the same parent population, then the ranks at
which values from sample 1 and sample 2 appear will be determined purely by chance. If the two samples
come from different populations, then the rank at which the sample values will appear
will not be random and there will be a tendency for values from one of the samples to
have lower ranks than values from the other sample.

We are thus testing for different locations of the two samples. When you want to
compare the distributions in two samples which are independent of each other then you
have two tests you can apply: the Mann–Whitney U test or the equivalent Wilcoxon
rank-sum test. In this section, we will adopt the Mann–Whitney U test method to solve
this type of problem. Whenever the sample sizes are greater than 20, a large-sample
approximation can be used for the distribution of the Mann–Whitney U statistic.

The Mann–Whitney U test assumptions are as follows:

1. Random samples from populations.


2. Independence within samples and mutual independence between samples.
3. Both samples consist of quantitative or ordinal (rank) data. Remember your
quantitative data will be converted to rank data.
4. Populations for each sample have the same shape; under the null hypothesis this
implies that the two populations have the same median value.

Although it is a nonparametric test, it does assume that the two distributions are similar
in shape. This can be assessed by creating a histogram (or a five-number summary and
boxplot) for each of the two samples and using this information to compare the shapes
of the two distributions.

This is the nonparametric equivalent to the two-sample t test with equal variances. It is
used primarily when the data have not met the assumption of normality (or should be
used when there is enough doubt). The test is based on ranks. It has good efficiency,
especially for symmetric distributions. There are exact procedures for this test given
small samples with no ties, and there are large-sample approximations.

The solution process can be broken down into a series of steps:

1. For each observation create a column that can be used to identify the group
membership for each sample value.
2. Add the sample data values into the next column.
3. Now rank the data – if samples values are the same then share the rank value.

4. Calculate the size of sample 1 and sample 2.
5. Calculate the sum of ranks for sample 1 and sample 2.
6. Calculate the test statistic U1 for sample 1 and U2 for sample 2.
7. Calculate the Mann-Whitney test statistic U = min (U1, U2).
8. Use statistical tables for the Mann-Whitney U test to find the probability of
observing a value of U or lower. If the test is one-sided, this is your p-value; if the
test is a two-sided test, double this probability to obtain the p-value.

For large samples (n > 20), the U statistic is approximately normally distributed under
the null hypothesis that the population differences are centred at zero.
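As with the Wilcoxon test, the steps above can be sketched in Python for readers who want to check their hand calculations. The two arrays below are hypothetical scores used purely to illustrate the mechanics; the NumPy and SciPy libraries are assumed.

import numpy as np
from scipy.stats import rankdata

# Hypothetical scores for two independent groups (replace with your own data)
sample1 = np.array([34, 40, 21, 34, 23, 35, 23, 38, 25, 28])
sample2 = np.array([36, 37, 34, 28, 24, 43, 33, 30, 35, 21, 36, 25])

combined = np.concatenate([sample1, sample2])
ranks = rankdata(combined)                       # step 3: tied values share the average rank
n1, n2 = len(sample1), len(sample2)              # step 4: sample sizes
T1, T2 = ranks[:n1].sum(), ranks[n1:].sum()      # step 5: rank sums
U1 = n1 * n2 + n1 * (n1 + 1) / 2 - T1            # step 6: U statistic for sample 1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - T2            # step 6: U statistic for sample 2
U = min(U1, U2)                                  # step 7: Mann-Whitney test statistic
print(T1, T2, U1, U2, U)                         # note the check: U1 + U2 = n1 * n2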

Example 8.7

A company is considering adopting a new method of training for its employees. To


assess whether the new method improves employees’ effectiveness, the firm has
collected two random samples from the population of employees sitting the training
assessment. Training type 1 employees have studied via the traditional method, and
training type 2 employees via the new method.

The exam scores are given in Table 8.33. The firm has analysed previous data, and the
results provide evidence that the scores are not normally distributed but are skewed
to the left. This information raises concerns about the
suitability of using a two-sample independent t test for the analysis. Instead, we decide
to use a suitable distribution-free test. In this case, the appropriate test is the Mann–
Whitney U test.

Training Sample Training Sample Training Sample Training Sample


type value type value type value type value
2 36 2 38 2 25 2 23
1 34 2 42 2 33 2 39
2 37 2 39 2 32 2 27
1 40 1 38 2 24 2 35
1 21 2 32 1 39 2 31
2 34 1 27 1 40 2 21
2 28 2 28 2 32 1 36
1 34 1 35 2 35 2 34
2 24 2 34 2 40 1 29
1 23 2 39 2 34 2 39
1 35 2 36 2 34 1 38
2 43 1 38 1 28 1 30
1 23 2 31 2 38 1 36
2 33 1 29 2 39 1 38
1 38 1 27 2 28 2 34
1 25 1 34 2 28 2 39
2 30 2 40 2 18 1 27
2 35 1 22 1 31 2 40

2 21 2 37 2 40 1 29
2 36 2 42 2 39 1 40
1 28 2 33 2 40 2 30
2 25 2 35 2 37 1 39
2 43 1 27 2 27 2 35
2 34 1 39 1 32 1 25
1 31 2 28 1 23 1 34
Table 8.33 Training method comparison

The five-step procedure to conduct this test progresses as follows.

Step 1 State hypothesis

Null hypothesis

H0: no difference in effectiveness between the two training methods

Median 1 – Median 2 = 0

This is equivalent to saying that the two samples come from the same
population.

Alternative hypothesis

H1: difference exists between the training methods

Median 1 – Median 2 ≠ 0

Two-tailed test.

Step 2 Select test

• Comparing two independent samples


• Both samples consist of ordinal (ranked) data
• Unknown population distribution but assumed similar shapes between the
two populations.

Mann–Whitney U test.

Step 3 Set the level of significance

 = 0.05

Step 4 Extract relevant statistic

The solution process can be broken down into a series of steps:

Input samples into two columns and rank data

Combined sample: sample 1 = 1, and sample 2 = 2. The convention is to
assign rank 1 to the smallest value and rank n (the combined sample size,
n = n1 + n2) to the largest value. If you have any shared ranks, then the
policy is to assign the average rank to each of the shared values, as
illustrated in Tables 8.34 – 8.37.

ID  Training type  Combined sample  Rank*
1 2 36 66
2 1 34 51
3 2 37 70
4 1 40 92.5
5 1 21 3
6 2 34 51
7 2 28 25
8 1 34 51
9 2 24 10.5
10 1 23 8.5
11 1 35 60
12 2 43 99.5
13 1 23 8.5
14 2 33 44
15 1 38 75
16 1 25 13.5
17 2 30 33
18 2 35 60
19 2 21 3
20 2 36 66
21 1 28 25
22 2 25 13.5
23 2 43 99.5
24 2 34 51
25 1 31 36.5
Table 8.34 Calculation of ranks

ID  Training type  Combined sample  Rank*
26 2 38 75
27 2 42 98.5
28 2 39 83.5
29 1 38 75
30 2 32 40.5
31 1 27 18.5
32 2 28 25

33 1 35 60
34 2 34 51
35 2 39 83.5
36 2 36 66
37 1 38 75
38 2 31 36.5
39 1 29 30
40 1 27 18.5
41 1 34 51
42 2 40 92.5
43 1 22 5
44 2 37 70
45 2 42 98.5
46 2 33 44
47 2 35 60
48 1 27 18.5
49 1 39 83.5
50 2 28 25
Table 8.35 Calculation of ranks cont.

ID  Training type  Combined sample  Rank*
51 2 25 13.5
52 2 33 44
53 2 32 40.5
54 2 24 10.5
55 1 39 83.5
56 1 40 92.5
57 2 32 40.5
58 2 35 60
59 2 40 92.5
60 2 34 51
61 2 34 51
62 1 28 25
63 2 38 75
64 2 39 83.5
65 2 28 25
66 2 28 25
67 2 18 1
68 1 31 36.5
69 2 40 92.5
70 2 39 83.5
71 2 40 92.5

72 2 37 70
73 2 27 18.5
74 1 32 40.5
75 1 23 8.5
Table 8.36 Calculation of ranks

ID  Training type  Combined sample  Rank*
76 2 23 8.5
77 2 39 83.5
78 2 27 18.5
79 2 35 60
80 2 31 36.5
81 2 21 3
82 1 36 66
83 2 34 51
84 1 29 30
85 2 39 83.5
86 1 38 75
87 1 30 33
88 1 36 66
89 1 38 75
90 2 34 51
91 2 39 83.5
92 1 27 18.5
93 2 40 92.5
94 1 29 30
95 1 40 92.5
96 2 30 33
97 1 39 83.5
98 2 35 60
99 1 25 13.5
100 1 34 51
Table 8.37 Calculation of ranks cont.

Median values for type 1 and type 2

If you calculate the median value for the two training types, then the
median values are:

M1 = 32 for sample 1
M2 = 34 for sample 2

We can observe that the median for sample 2 is larger than for sample 1
(34 > 32). The question now reduces to whether this difference is
significant.

Count the number of data points in each sample:

Number in sample 1, n1 = 39

Number in sample 2, n2 = 61

Calculate the sum of the ranks, T1 and T2:

T1 = Sum of sample 1 (traditional method) ranks

T1 = 51 + 92.5 + … + 13.5 + 51 = 1778.00

T2 = Sum of sample 2 (new method) ranks

T2 = 66 + 70 + ………. + 33 + 60 = 3273.00

Calculate U1, U2 and the test statistic Ucal

The values of U1 and U2 are given by equations (8.20) and (8.21):

U1 = n1 n2 + n1(n1 + 1)/2 − T1                             (8.20)

U2 = n1 n2 + n2(n2 + 1)/2 − T2                              (8.21)

The test statistic U is equal to the difference between the maximum possible
values of T for the sample versus the observed values of T: U1 = T1,max – T1 and U2
= T2,max – T2.

Applying equations (8.20) and (8.21) enables U1 and U2 to be calculated:

U1 = n1 n2 + n1(n1 + 1)/2 − T1

U1 = 39 × 61 + 39 × (39 + 1)/2 − 1778.00

U1 = 2379 + 780 − 1778.00

U1 = 1382

U2 = n1 n2 + n2(n2 + 1)/2 − T2

U2 = 39 × 61 + 61 × (61 + 1)/2 − 3273

U2 = 2379 + 1891 − 3273

U2 = 997

Please note we only need to calculate either U1 or U2 given that we can find the
other value from equation (8.22):

U1 + U2 = n1n2 (8.22)

Check:

U1 + U2 = 1382 + 997 = 2379

n1n2 = 39 × 61 = 2379

The value of Ucal can be either U1 or U2 and for this example we will choose Ucal =
minimum value of U1, U2.

Ucal = min(U1, U2) = min(1382, 997) = U2 = 997

The Wilcoxon rank sum test statistic (W) is defined as the smaller of the two
rank sums, T1 and T2. For this example, W is as follows:

T1 = 1778.00
T2 = 3273.00

W = minimum (T1, T2) = minimum (1778.00, 3273.00) = 1778.00

Critical U value

Next, we can use Mann-Whitney tables for n1 = 39 and n2 = 61 to find Ucri


or the associated test statistic p-value. Given both n1 and n2 are large we
will use a normal approximation.

Normal approximation to the Mann–Whitney test

If the null hypothesis is true, then we would expect U1 and U2 both to be centred at the
mean value μU, with the mean and standard deviation given by equations (8.23) and (8.24):

μU = n1 n2 / 2                                              (8.23)

μU = (39 × 61) / 2 = 1189.5

σU = √[ n1 n2 (n1 + n2 + 1) / 12 ]                          (8.24)

σU = √[ 39 × 61 × (39 + 61 + 1) / 12 ] = √(240279 / 12) = 141.5035

For large sample sizes (both at least 10), the distribution of the random variable
given by equation (8.25) is approximated by the normal distribution.

Z = (Ucal − μU) / σU                                        (8.25)

In our example the calculated Z test statistic is:

Z = (997 − 1189.5) / 141.5035 = −1.3604
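A short Python sketch (illustrative only) reproduces the normal approximation using the quantities calculated above; it assumes SciPy is available for the normal distribution.

import math
from scipy.stats import norm

n1, n2, U_cal = 39, 61, 997
mu_U = n1 * n2 / 2                                   # equation (8.23): 1189.5
sigma_U = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # equation (8.24): ~141.50
z = (U_cal - mu_U) / sigma_U                         # equation (8.25): ~-1.360
p_two_tail = 2 * norm.sf(abs(z))                     # ~0.17, so fail to reject H0 at the 5% level
print(round(z, 4), round(p_two_tail, 4))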

Critical value

The critical z value can be found from statistical tables.

For a two-tail test at 5%, Zcri = ± 1.96.

Does the test statistic lie within the rejection region?

Compare the calculated and critical z values to determine which


hypothesis statement (H0 or H1) to accept.

Given Zcal lies between the lower and upper Zcri values (- 1.96 < – 1.3604 <
+ 1.96), we fail to reject the null hypothesis, H0.

Step 5 Make a decision

There is not enough evidence at the 5% significance level to indicate a difference in
effectiveness between the two training methods.

Continuity correction for Z

The standardised z test is a continuous distribution that provides an
approximation to the discrete U statistic (ranked data) by applying a continuity
correction to equation (8.25), as shown in (8.26):

Zcal = ( |Ucal − μU| − 0.5 ) / σU                           (8.26)

For this example, Ucal = 997, μU = 1189.5, σU = 141.5035

Zcal = −1.3569, two-tail p-value = 0.1748

Issue of tied ranks

If there are many ties, then equation (8.27) provides a better estimator of the
standard deviation:

σU = √{ [ n1 n2 / (n(n − 1)) ] × [ (n³ − n)/12 − Σ (fi³ − fi)/12 ] }        (8.27)

where i varies over the set of tied ranks, fi is the number of times (i.e. the
frequency) that the rank i appears, and n = n1 + n2.

Excel solution

Figures 8.68 and 8.69 illustrate the calculation of the ranks (first and final 10 data
points used to illustrate)

Figure 8.68 Example 8.7 Excel solution

Figure 8.69 Example 8.7 Excel solution continued

Figures 8.70 and 8.71 illustrate the Excel Mann–Whitney U test solution.

Figure 8.70 Example 8.7 Excel solution continued

Figure 8.71 Example 8.7 Excel solution continued

From Excel:

Median sample 1 = 32
Median sample 2 = 34
Number in group 1, n1 = 39
Number in group 2, n2 = 61
Sum of group 1 ranks, T1 = 1778.00
Sum of group 2 ranks, T2 = 3273.00

T1 max = 3159
T2 max = 4270
U1 = 1382
U2 = 997

Check:

U1+U2 = n1 n2

U1+U2 = 2379

n1 n2 = 2379

Choose Ucal = minimum value of U1 and U2

Ucal = min(U1, U2) = Ucal = min(1382, 997)

Ucal = 997 (this is from the second sample, n2 = 61)

Mann-Whitney test statistic, Ucal = 997

Wilcoxon rank sum test statistic, W= Minimum (T1, T2) = 1778.00

Normal approximation

Two-tail p-value = 0.1748 > 0.05, so we fail to reject the null hypothesis H0.

There is not enough evidence at the 5% significance level to indicate a difference in
effectiveness between the two training methods.

SPSS solution

Enter data into SPSS as illustrated in Figures 8.72 – 8.75

The first column is called Group and the values 1 and 2 represent the traditional method
and the new method, respectively. The second column shows the performance of
students.

Figure 8.72 Example 8.7 SPSS data

Figure 8.73 Example 8.7 SPSS data cont.

Figure 8.74 Example 8.7 SPSS data

Figure 8.75 Example 8.7 SPSS data cont.

Select Analyze > Nonparametric Tests > Legacy Dialogs > 2 Independent Samples

Transfer Combined_sample to the Test Variable List.

Transfer Training_type to the Grouping Variable box.

Figure 8.76 SPSS Two-Independent-Samples Tests

Click on Options and choose Descriptives and Quartiles

Figure 8.77 Enter number of samples

Click Continue

Figure 8.78 SPSS Two-Independent-Samples Tests

Click OK.

SPSS output

Figure 8.79 Example 8.7 SPSS solution

Figure 8.80 Example 8.7 SPSS solution continued

Figure 8.81 Example 8.7 SPSS solution continued

The rows labelled Asymptotic Sig. and Exact Sig. tell us the probability that a test
statistic of at least that magnitude would occur if there were no differences between
groups.

• If you have a small sample (n < 50) use the Exact Sig. value.
• If you have a large sample use the Asymptotic Sig. value. The test value Z is
approximately normally distributed for large samples.

From SPSS:

Mann-Whitney U test statistic = 997, Z = - 1.363 with 2-tail asymptotic p-value =


0.173.
[The Wilcoxon rank sum test statistic, W = minimum (T1, T2) = 1778.00].

From Excel:

Mann-Whitney U test statistic = 997, Z = - 1.360, with 2-tail p-value = 0.174.
[The Wilcoxon rank sum test statistic, W = 1778.00].

Conclusion

There is not enough evidence at the 5% significance level to indicate a difference in
effectiveness between the two training methods.

Check your understanding

X8.17 What assumptions need to be made about the type and distribution of the data
when the Mann–Whitney test is used?

X8.18 Two groups of randomly selected students are tested on a regular basis as part of
professional appraisals that are conducted on a two-year cycle. The first group
has eight students, with their sum of ranks equal to 65 and the second group has
nine students. Is there sufficient evidence to suggest that the performance of the
second group is higher than the performance of the first group? Assess at the 5%
significance level.

X8.19 The sale of new homes is closely tied to the level of confidence within the
financial markets. A developer builds new homes in two European countries (A
and B) and is concerned that there is a direct relationship between the country
and the interest rates obtainable to build properties. To provide answers the
developer decides to carry out market research to see what interest rates would
be obtainable if he decided to borrow €300,000 over 20 years from ten financial
institutions in country A and 13 financial institutions in country B. Based upon
the data in Table 8.38, do we have any evidence to suggest that the interest rates
are significantly different?

A: 10.20 10.97 10.63 10.70 10.50 10.30 10.65


10.25 10.75 11.00
B: 10.60 10.80 11.40 10.90 11.10 11.20 10.89
10.78 11.05 11.15 10.85 11.16 11.18
Table 8.38 Regional interest rates

Chapter summary
In this chapter we have explored the concept of hypothesis testing for category data
using the chi-square distribution. After that, we extended the parametric tests to the
case of nonparametric tests (or so-called distribution-free tests). These tests do not
require the assumption of a normal population (or sample) distribution. This chapter
adopted the simple five-step procedure described in Chapter 7 to aid the solution
process.

The main emphasis is placed on the use of the p-value. This gives the probability of
obtaining a test statistic at least as extreme as the one observed if the null hypothesis is
true; it is compared with the chosen significance level to decide whether to reject the
null hypothesis. The value of the p-value depends on whether we are dealing with a
two- or one-tailed test.

The second part of the decision-making described the use of the critical test statistic in
making decisions. This is the traditional textbook method. It uses published tables to
provide estimates of critical values for various test parameter values.

In the case of the chi-square test we looked at a range of applications, including:

1. Testing for differences in proportions


2. Testing for association
3. Testing how well a theoretical probability distribution fits collected sample data.

In the case of nonparametric tests, we looked at a range of tests, including:

1. Sign test for one sample


2. Two-paired-sample Wilcoxon signed-rank test
3. Two-independent-sample Mann–Whitney test.

In the case where we have more than two samples then we would have to use
techniques like the Kruskal–Wallis test if we are dealing with independent samples. For
dependent samples we would use the Friedman test.

Test your understanding


TU8.1 During a local election the sample data in Table 8.39 were collected to ascertain
how local voters see green issues. Given these data, is there any evidence for a
difference in how the voters who completed the survey are likely to vote? Test at
the 5% significance level.

Political party Support green Indifferent Opposed


issues
Conservative 37 72 1
Labour 67 85 1
Liberal democrats 106 124 3
Table 8.39 Local voters and green issues

TU8.2 A factory makes specific components that are used in a range of family cars. The
company undertakes regular sampling to check the quality of the components
from three different machines. Each component in the sample is checked for any
defects which would result in a batch being rejected by the company. Based on
the sample data in Table 8.40, is there any association between the proportion of
defectives and the machine used? Test at the 5% significance level.

Outcome Machine 1 Machine 2 Machine 3
Defective 16 12 20
Non-defective 70 81 75
Table 8.40 Check on quality of components

TU8.3 Further historical research by the company described in TU8.2 shows that
machine 1 has a production record as follows: 80% excellent, 17% good, 3%
rejected. After machine 1 has completed a refit a sample of 200 components from
machine 1 produced the following results: 157 excellent, 42 good and 1 rejected.
Carry out a chi-square goodness-of-fit test to test whether the refit has changed
the quality of the output from machine 1. Test at the 5% significance level.

TU8.4 The manager of a university computer network has collected data on the number
of phishing attempts on university staff computers. Based on the sample data in
Table 8.41, do we have any evidence that we may be able to model the
relationship with a Poisson distribution? Test at the 5% significance level.

Number of phishing 0 1 2 3 4 5 6
attempts per day
Frequency 200 240 125 52 27 16 12
Table 8.41 Number of computer phishing attempts

TU8.5 Reconsider TU8.4 with the Poisson population average equal to 1.25.

TU8.6 A dairy farmer’s milk production over the last 12 months is shown in Table 8.42.
Based upon historical data, the milk production has been found to be uniformly
distributed. Based upon the sample data given in the table, is there any statistical
evidence at the 5% significance level that the sample data is not uniformly
distributed?

Month Milk quantity Month Milk quantity


(litres) (litres)
1 2678 7 2410
2 2602 8 2350
3 2649 9 2495
4 2588 10 2558
5 2530 11 2602
6 2397 12 2665
Table 8.42 Quantity of milk produced

TU8.7 The waiting times at a railway station for rail tickets are normally distributed
and have a population standard deviation of 8 minutes. The station master would
like to know if the variation in waiting times has been reduced after a review of
performance at the customer windows. The station master has collected the
following sample data: sample standard deviation s = 4 and sample size n = 25. Is
there evidence at the 5% significance level that the population standard
deviation has been reduced?

TU8.8 The business manager for several health centres suggested that the median
waiting time for each patient to see the doctor at a practice was 22 minutes. It is
believed that the median waiting time in other practice health centres is greater
than 22 minutes. A random sample of 20 visits to other health centres resulted in
the following results: 9.4, 13.4, 15.6, 16.2, 16.4, 16.8, 18.1, 18.7, 18.9, 19.1, 19.3,
20.1, 20.4, 21.6, 21.9, 23.4, 23.5, 24.8, 24.9, 26.8. Conduct a sign test to assess if
there is statistical evidence to conclude that the median visit length in these
other health practices is greater than 22 minutes.

TU8.9 Table 8.43 represents the amount spent by 17 people at two restaurants (X, Y) in
a city. Conduct a Wilcoxon signed-rank test to assess whether the amount spent
is different between the two restaurants. Use a significance level of 0.05.

X Y X Y
20.2 22.8 20.3 20.9
19.5 14.2 19.2 22.6
18.6 14.1 19.5 16.9
20.9 16.1 18.7 21.4
23.1 25.2 18.2 18.5
18.6 20.2 21.6 23.4
19.6 16.7 22.4 21.3
23.2 21.3
21.8 18.7
20.2 22.8
Table 8.43 Restaurant spend

TU8.10 The registrar in a business school is exploring staff resources allocated to


courses to provide staff support to students while studying for their
examinations. Two modules have been chosen, and the data in Table 8.44 are
from two independent random samples of 11 students studying statistics on the
business degree and 11 students studying e-commerce on the business
technology degree. Test the sample data given in Table 8.44 to assess whether
there is a difference in the median hours studied between the two groups of
students?

Hours spent studying for Hours spent studying for the
the statistics examination e-commerce examination
Person Rating Person Rating
1 7 1 14
2 8 2 13
3 12 3 12
4 10 4 11
5 9 5 9
6 13 6 17
7 11 7 16
8 9 8 11
9 5 9 12
10 14 10 9
11 13 11 6
Table 8.44 Hours spent studying

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A8Wa Chi-square goodness of fit


2. A8Wb Chi-square test for one population variance
3. A8Wc Chi-square test for normality.

Factorial experiments workbook

Furthermore, you will find a factorial experiments workbook that explores using Excel
and SPSS the solution of data problems with more than two samples using both
parametric and non-parametric tests.

Chapter 9 Linear correlation and regression analysis
9.1 Introduction and chapter overview
Cross tabulation and the chi-square distributions we covered in previous chapters were
used to illustrate and define possible associations between two variables. In this
chapter we will introduce other methods that will help us to quantify the strength of the
relationship between two variables. Once we have established that there is some form of
relationship between the two variables, we can build models to see how one variable
impacts the other.

The simplest way to visualise a possible association between two variables is to create a
scatter plot of one variable against the other. Such a plot will help us decide visually if
an association exists, as well as what the possible form of this association is, e.g. linear or
non-linear. If the scatter plot suggests a possible association, then we can also use least
squares regression to fit a model to the data set, as we will see in the second part of this
chapter.

The first part of this Chapter is dedicated to establishing possible relationships, or


associations, between two interval, or ordinal, variables. For interval data, the strength
of this association can then be assessed by calculating Pearson’s correlation coefficient.
For ordinal data, we can use Spearman’s rank order correlation coefficient. Although the
relationships between variables could take various shapes, in this Chapter we are
focusing strictly on linear types. This will lead us to learn how to model linear
relationships using a single equation. The basic model that we will introduce is a simple
linear regression analysis.

What are the typical applications of the methods covered in this chapter? Imagine that a
local council office contracts you to participate in a project about the impact of
homelessness on the local community. You conduct some data mining and establish that
the level of homeless people in the municipality is correlated with the level of crime.
Does this mean that homeless people commit more crime? You do not know that. But
you can speculate that some other factors might be behind both variables. Could it be
drug abuse? You do not know that either, but at least now you know what must be
investigated next. This is an example of how, by going through a discovery phase and
using a simple tool such as correlation, you can decide what the next step in this project
could be.

The correlation coefficient will tell you if there is association (and how strong it is)
between two variables. Regression analysis is in a way an extension of the principle of
association between the variables, but it goes beyond that. When people apply for loans
or credit cards, their credit rating is checked. The model that decides if you are going to
be approved for a credit card is possibly based on regression analysis. The model inputs
several variables that define you and your lifestyle and then predicts from these
variables if you are “credit-worthy”.

If you are in marketing, you can use this technique to predict the changes in demand as
you modify the pricing policy, or how one brand will fare based on your knowledge of

another brand. The examples are numerous. In other words, correlation and regression
analysis are powerful business tools for modelling relationships between data sets and
for predicting the results. You can think of this technique as one of the most valuable
“assets” that you should carry with you into your professional life.

Learning objectives

On completing this unit, you should be able to:

1. Understand the meaning of calculating the correlation coefficient.


2. Apply a scatter plot to demonstrate visually a possible association between two
data variables.
3. Calculate Pearson’s correlation coefficient for interval data and be able to
interpret the strength of this value.
4. Calculate Spearman’s rank correlation coefficient for ordinal ranked data and be
able to interpret the strength of this value.
5. Understand the meaning of simple linear regression analysis.
6. Fit this simple linear model to the scatter plot.
7. Fit a simple linear regression model to two variables.
8. Predict/estimate a dependent variable using an independent variable.
9. Use the coefficient of determination (or the r-squared value) to establish the
strength of the model.
10. Estimate how well the regression model fits the variables.
11. Check the model assumptions and assess if they have been violated.
12. Construct a prediction interval for the population parameter estimate.
13. Solve problems using Microsoft Excel and SPSS.

9.2 Introduction to linear correlation


Correlation analysis is one of the simplest and most effective tools to provide greater
insight into data. It is based on the intuitive assumption that if two variables move in
the same direction, there must be some sort of “connection” between them. Equally, if
they move in the opposite direction to one another, the connection is still there, but
with the inverse effect.

Imagine that you work for a food conglomerate that makes a variety of snacks. By sifting
through data, you discover that the increase in sale of a chocolate bar is accompanied by
the increase in sale of a certain dog food. When one drops, the other one drops too. You
have no idea why this is happening, but you have established that they are related. We
call this correlation. However, one thing that you cannot say is that the increase in the
sale of one of the items is causing the increase in the sale of the other item. They just
move in the same direction for reasons unknown to us.

Correlation is often confused with causation, which is a mistake. Two variables could
show a very high level of correlation, but this may not mean that either of them can cause
changes in the other. It may mean that they both respond in the same way to some other
undefined and invisible variables. The fact that they may not cause changes to one
another, does not mean that they cannot be used as good predictors of one another. This
is the feature that we will explain in the sections that follow. The only limitation to the

techniques described in this Chapter is that the relationship between the variables
needs to be linear. Otherwise our linear correlation measurement is not appropriate.

What do we mean by linear relationship? Imagine that the supply of strawberries on the
market grows from week to week in June as follows: 10, 20, 30, 40 tons per week.
Because the increments between the numbers are constant (10 tons per week), this is
called linear growth. The non-linear growth would be: 10, 20, 40, 80 tons per week.
Figures 9.1 and 9.2 illustrate these examples in a graphical format.

Figure 9.1 Linear trend representing growth

Figure 9.2 Non-linear trend representing growth

As we can see, in the case of non-linear growth, the increments from week to week are
10, 20, 40, etc., i.e. they are doubling every week. The price for strawberries might
decline from week to week as: 4.00, 3.50, 3.00, 2.50 GBP per kilo. This is also linear
movement, though a declining one. A non-linear decline would be something like 4.00,
2.00, 1.00, 0.50 GBP per kilo. Figures 9.3 and 9.4 illustrate these two examples.

Figure 9.3 Linear trend representing decline

Figure 9.4 Non-linear trend representing decline

Note that non-linear does not necessarily look like a parabola, as in our two examples,
but it can be any other curve that is not a straight line.

Using the techniques from this chapter, you can only calculate correlation between the
two variables that are moving in linear fashion. For non-linear movements, other
techniques not mentioned in this textbook are required.

9.3 Linear correlation analysis


If we suspect that there is a relationship between various sets of data, then our aims are
to confirm that a relationship exists and to establish how strong it is. Even if we do not suspect a
relationship, using the techniques from this chapter, we might discover that there is
one. The techniques we will use are:

• Scatter plots.
• Pearson’s correlation coefficient (r) for interval data.
• Spearman’s rank correlation coefficient (rs) for ordinal ranked data.
• Undertake an inference test on the value of the correlation coefficients (r and rs)
being significant (online only)

Scatter plots

To introduce the concept of correlation, we need to start with scatter plots. Scatter
plots are like line graphs in that they use horizontal and vertical axes to plot data points.
However, the objective of scatter plots is to show if the two variables are in any way
connected. If two variables are connected in some way, then we can say that there is a
relationship between them. The relationship between two variables is called their
correlation.

A point on a scatter plot is where the two variables intersect. These points will create a
“cloud”. The closer the “cloud” of data points come to making a straight line when
plotted, the higher the correlation between the two variables. The higher the
correlation, the stronger the relationship. Note that this line does not have to be
straight, but in this case the relationship is non-linear, and we will not consider it in this
chapter.

Example 9.1

Table 9.1 consists of two sets of data from the Office for National Statistics in the UK.
Both data sets cover the period between January 2016 and May 2019. The first set
shows monthly level of employment in the UK, in percentages, for the ages of 16 and
above. The second set shows monthly visits abroad, in thousands, for all ages of UK
citizens in the same period.

Table 9.1 Two UK data sets for period from Jan 2016 to May 2019 (Source: ONS UK)

By just looking at numbers, we can see that both data sets are increasing. We can
present these two data sets in a graph, and we use the graph with the left axis showing
the scale for employment, and the right axis showing the scale for all UK visits abroad.

Figure 9.5 UK data sets for age 16 and over employed and for all visits abroad (all
ages)

From Figure 9.5, we can see both data sets flow with an upward trend from month to
month, as the corresponding numbers show. We might speculate that if we knew the
number of UK people 16 and over that are employed, we might be able to predict how
many people will travel abroad from the UK. In other words, travel abroad might be
related to the number of people that are employed. Think of travel abroad as a dependent
variable and the percentage of people employed as an independent variable. Park this
thought; we will return to it later in the chapter, as this is the key point.

Excel Solution

Rather than showing the two data sets as line graphs, as in Figure 9.5, we could plot
each pair of values as a point on a graph. This graph, shown in Figure 9.6, is called a
scatter plot. How to construct the scatter plot using Excel has already been
demonstrated in Chapter 1. We will therefore skip this part of explanation and
concentrate on the content of the plot.

A dot on the graph represents the intersection of the two values, both of which are
captured at the same point in time. In August 2017, for example, we have the value of
61.0 on the x axis intersecting with the value of 7,210 on the y axis. This intersection is
shown as a dot. Clearly scatter plots do not show any time dimension (the date
when the intersection happened is not shown on this graph), just which value from the
first variable corresponds with which value from the second variable.

Figure 9.6 Scatter plot for the percentage of age 16 over employed and all visits
abroad

A scatterplot hints that there is some form of relationship. Look at the graph in Figure 9.6;
as the percentage of those that are employed increases (horizontal axis), there is a
tendency for more people to travel abroad too (vertical axis). The data, therefore, would
indicate a positive relationship. As we will show later, it is possible to describe this
relationship by fitting a line or curve to the data set. This will enable us to predict the
number of visits abroad based on any given percentage of employed people aged 16 and over.
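If you also work in Python, an equivalent scatter plot can be produced with the matplotlib library. The short lists below are illustrative stand-ins for the full Table 9.1 columns; this sketch is not part of the Excel solution.

import matplotlib.pyplot as plt

# Illustrative values standing in for the two ONS series in Table 9.1
employed_pct = [60.4, 60.2, 60.6, 60.9, 61.2, 61.7]      # % of age 16 and over employed
visits_abroad = [6350, 6760, 6900, 7100, 7250, 7370]     # visits abroad (thousands)

plt.scatter(employed_pct, visits_abroad)
plt.xlabel("Percentage of age 16 and over employed")
plt.ylabel("UK visits abroad (thousands)")
plt.title("Scatter plot of visits abroad against employment")
plt.show()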

SPSS Solution

SPSS data file: Chapter 9 Example 1 Scatter plot.sav (only first 8 records illustrated
below)

Figure 9.7 Example 9.1 SPSS data

A scatter plot is constructed following the procedure below.

Graphs > Legacy Dialogs > Scatter/Dot

Choose Simple Scatter

Figure 9.8 SPSS scatter/plot menu

Click Define

We move number visited to Y Axis and employed to X Axis.

Figure 9.9 SPSS simple scatterplot menu

Click OK

SPSS Output

Figure 9.10 SPSS solution scatterplot

Just like the same plot constructed in Excel, this one shows how employed and number of
visits are distributed. We also see that low values of employed have low values of
number of visits and high values of employed have high values of number of visits. Let
us now conduct a little experiment. In Figure 9.11 we modified the y axis scale to run
from 4,000 to 11,000 instead of 5,000 to 9,000. More importantly, we also changed one
point to illustrate the issue of outliers.

Figure 9.11 Identifying outliers in data

What are outliers? The scatter plot can be used to identify possible outliers within the
data set. We can see in Figure 9.11 the same data set as in Figure 9.6 or 9.10, but with
one data value of y changed (Jan 2018 = 10000, instead of 7450). This value is far greater
than any other y data point and it is called an outlier. It could have been a
data-entry error, or it could have been a genuine “freaky” number. The point here is that
outliers could have undue influence on the values of the correlation coefficients
estimated and, therefore, need to be somehow handled.

Regardless of whether an outlier is an error or a genuine extreme value, we cannot


leave it as it is. This one single value would distort our model and give us a false
representation of reality. One of the solutions to this problem is to delete the outlier
value from the data set. If the data set is not time based, then deleting the outlier is an
acceptable option. Deleting just one observation from a large set will not distort the
results. However, if we are using time-based data, then deleting the data point does not
seem right. By deleting one observation we create a gap in the continuum of the time
series. For this type of data, it is better to substitute the outlier with some more
acceptable value that is more “in line” with other values in the data set.

There is no universally accepted method on how to deal with outliers. Some advocate that
outliers that lie beyond ±1.5 standard deviations around the mean value should be
excluded. In many cases, we suggest a “quick and dirty” method, which is to use the two
neighbouring values, calculate their average and substitute this value for an outlier. If we
followed this approach in our example, we would use 7260 and 7250 (values for Dec 2017
and Feb 2018) and find that their average is 7255. This value should replace 10000 (the
value for Jan 2018), which was an outlier. Given that we know that the original value was
7450, substituting it with 7255 is much more acceptable than 10000. It will
have no significant impact on the overall relationship between the two variables.
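The "quick and dirty" substitution described above is easy to automate. The following Python sketch is illustrative only: the list values around the outlier are hypothetical apart from the three figures quoted in the text (7260, 10000 and 7250).

# Replace a suspected outlier in a time-based series with the average of its neighbours
visits = [7180, 7260, 10000, 7250, 7310]          # Jan 2018 (10000) is the outlier
i = 2                                             # position of the suspect value
visits[i] = (visits[i - 1] + visits[i + 1]) / 2   # (7260 + 7250) / 2 = 7255
print(visits)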

Covariance

The scatter plot enabled us to visualise and conclude that the two variables are potentially
jointly related, i.e. it shows us the movements of one variable in relation to the
movements of the other variable. A quantitative measure that tells us the same thing is
movements of the other variable. A quantitative measure that tells us the same thing is
called covariance. Covariance is a measure of how changes in one variable are
associated with the changes in a second variable.

Covariance takes either positive or negative values, depending on whether the two
variables move in the same or opposite directions. If the covariance value is zero, or
close to zero, then this is an indication that the two variables do not move closely
together. Equation (9.1) defines the sample covariance.

Sx,y = Σ (xi − x̄)(yi − ȳ) / (n − 1)                        (9.1)

In equation (9.1), "x" represents individual values of the first data set (percentage of
employed, for example) and "y" represents individual values of the second data set
(number of people travelling abroad, for example). The average value for people
employed is 𝑥̅ and the average value for the variable describing people travelling abroad
is 𝑦̅. The symbol n represents the number of cases per data set, which in our case is 50.

Example 9.2

We are using the same data as in Example 9.1 to demonstrate how the calculations are
executed. Table 9.2 illustrates the calculations (only the first row and the last four rows
are shown).

Table 9.2 Example 9.2 data table and calculation of column statistics

Following equation (9.1), we first calculate the averages for x and y:

𝑥̅ = 60.8 𝑦̅ = 7299.5

From there, we calculate deviations for every value from their average, multiply them
and then add them all up:

Sx,y = [ (x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + ⋯ + (xn − x̄)(yn − ȳ) ] / (n − 1)

Sx,y = [ (60.4 − 60.8)(6350 − 7299.5) + (60.2 − 60.8)(6760 − 7299.5) + ⋯ + (61.7 − 60.8)(7370 − 7299.5) ] / 47

𝑆𝑥,𝑦 = 131.46

The covariance value for these two data sets is 131.46. As will be explained shortly,
the covariance is an important building block for calculating the coefficient of
correlation between two variables.
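Equation (9.1) and the calculation above translate directly into a few lines of Python/NumPy. The x and y arrays below are illustrative values only (the full Table 9.1 columns would be used in practice); note that NumPy's cov function uses the same n − 1 divisor by default.

import numpy as np

x = np.array([60.4, 60.2, 60.6, 60.9, 61.2, 61.7])   # illustrative employment percentages
y = np.array([6350, 6760, 6900, 7100, 7250, 7370])   # illustrative visits abroad (thousands)

# Sample covariance, equation (9.1)
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
print(s_xy, np.cov(x, y)[0, 1])                      # the two values should agree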

Excel solution

Figure 9.12 illustrates the Excel solution (rows 12-40 are hidden to make the table more
compact for presentation purposes). The two Excel functions that were used,
=COVARIANCE.P(), aimed at population data, and =COVARIANCE.S(), aimed at sample
data, return similar values. If these two datasets were the whole population, which they
are not, then the population covariance would be the appropriate value to use. We will
use the sample covariance of 131.46, because the datasets we used are subsets of larger
datasets. The "population" is much larger than just the interval between Jan 2016 and
May 2019, so this is just a sample.

Figure 9.12 Excel solution to calculate covariance

The sample covariance value implies that the relationship is positive (because the
number is positive) and that its strength is 131.46. But what is the meaning of 131.46?
Not much, used in isolation. The covariance is expressed in some mixed “units” between
the two variables. In other words, it is difficult to interpret and to compare with other
covariances, if this is what we wanted to do.

From Excel, the sample covariance is 131.46, implying that both variables are moving in
the same direction (indicated by the positive value). A major shortcoming of calculating
just the covariance is that it can take any value, so we are unable to establish
the strength of the relationship in relative terms. For this value of 131.46 we do not know
if this represents a strong or weak relationship between UK employment and UK visits

abroad. To measure this strength in relative terms, we need to use the correlation
coefficient.

SPSS solution

SPSS data file: Chapter 9 Example 2 Covariance.sav (only first 8 records illustrated
below)

Figure 9.13 Example 9.2 SPSS data

Select Analyze > Correlate > Bivariate

Figure 9.14 SPSS correlate menu selection

Transfer both variables to the Variables box


Select Options

Figure 9.15 SPSS bivariate correlations menu

Choose Means and standard deviations and Cross-product deviations and


covariances.

Figure 9.16 SPSS bivariate correlations options

Click Continue

Click OK

SPSS Output

Figure 9.17 SPSS solution

The value of the sample covariance is printed twice. The first one shows covariance
between UK employment level and UK visits abroad, and the second one the reverse, i.e.
UK visits abroad and UK employment level, which is the same. From SPSS, the
covariance = 131.46.

In summary, the difficulty with covariance is that it will be larger if the values of X and Y
are larger, and smaller if the values of the two variables are smaller numbers.
Effectively, the value of covariance is defined by the range of values of X and Y, so it is
impossible to get any meaningful interpretation of the covariance number. Neither can
we compare two covariances. To address the problem of interpretation and comparison,
we need to standardize the measure that will tell us more about the relationship
between variables. This new measure, or the statistic, is called the correlation
coefficient. Think about the covariance as the building block that will help us calculate
the correlation coefficient.

Pearson’s correlation coefficient, r

The sample correlation coefficient that can be used to measure the strength and
direction of a linear relationship is called Pearson’s product moment correlation
coefficient, r, defined by equation (9.2).

r = Sxy / (Sx Sy)                                          (9.2)

Where Sx is the sample standard deviation of sample variables x, Sy is the sample
standard deviation of sample variables y, and Sxy is the sample covariance between
variables x and y. If we substitute equation (9.1) into (9.2), we get an alternative
equation representing Pearson's correlation coefficient that is often used. This is given
by equation (9.3).

r = [ 1/(n − 1) ] Σ (xi − x̄)(yi − ȳ) / (Sx Sy)             (9.3)

Where symbols Sx, Sy are the same as in equation (9.2) i.e. they are the standard
deviations for variables x and y respectively, and n is the number of paired (x, y) data
values. In some textbooks, you will find an even more complex equation for calculating
Pearson's coefficient of correlation, given by equation (9.4).

r = [ Σ xi yi − (Σ xi)(Σ yi)/n ] / √{ [ Σ xi² − (Σ xi)²/n ] [ Σ yi² − (Σ yi)²/n ] }        (9.4)

Equation (9.4) was frequently used before the days of computers and spreadsheets and
today we have much more elegant ways to calculate the coefficient of correlation.

Example 9.3

In Example 9.2 we already calculated the sample covariance as 131.46. We have also
calculated (not shown here) the standard deviations for x and y, and they are respectively
sx=0.37 and sy=436.74. Following equation (9.2), this produces the correlation coefficient
as:

r = 131.46 / (0.37 × 436.74) = 0.81

How do we interpret this value of 0.81? The values of r can be anywhere between -1 and
+1. The way this number is interpreted is as follows:

a) If r lies between – 1 ≤ r ≤ - 0.7 or 0.7 ≤ r ≤ 1, a strong association, or correlation, is


assumed.
b) If r lies between – 0.7 ≤ r ≤ - 0.3 or 0.3 ≤ r ≤ 0.7, a medium association, or correlation,
is assumed.
c) If r lies between – 0.3 ≤ r ≤ - 0.1 or 0.1 ≤ r ≤ 0.3, a weak association, or correlation, is
assumed.
d) If r lies between – 0.1 ≤ r ≤ 0.1, there is virtually no association, or correlation.
e) If r = 0, there is no linear association, or correlation, between the variables (although
a non-linear relationship may still exist).

In our case r = 0.81, which represents a very strong association (correlation) between
these two variables.
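In Python, Pearson's r (and its p-value) can be obtained with scipy.stats.pearsonr, or built by hand from the covariance and the two standard deviations as in equation (9.2). The arrays below are illustrative; using the full Table 9.1 columns should reproduce the r = 0.81 reported above.

import numpy as np
from scipy.stats import pearsonr

x = np.array([60.4, 60.2, 60.6, 60.9, 61.2, 61.7])   # illustrative employment percentages
y = np.array([6350, 6760, 6900, 7100, 7250, 7370])   # illustrative visits abroad (thousands)

r, p_value = pearsonr(x, y)                          # Pearson's r and its two-sided p-value
print(round(r, 2), round(p_value, 4))

# Equivalent calculation using equation (9.2): r = Sxy / (Sx * Sy)
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
print(round(r_manual, 2))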

Figures 9.18 and 9.19 show examples of perfect positive correlation (r = +1) and perfect
negative correlation (r=-1).

Figure 9.18 Perfect positive correlation example

Figure 9.19 Perfect negative correlation example

The rules above on how to interpret the strength of association between the two
variables are not rigorous or strict. Depending on the context, they could be relaxed.
Think of them as the “consensus” views typically taken in business and management.

Excel solution

Figure 9.20 represents the Excel solution with rows 10 to 40 hidden.

Figure 9.20 illustrates the Excel solution to calculate Pearson’s correlation
coefficient, r.

We used three different Excel functions to calculate the coefficient of correlation. The
first two, =PEARSON() and =CORREL() are standard Excel functions and the third one
converts equation (9.2) into Excel syntax. In any case, all three are returning the same
value, which is +0.81. This value indicates a strong positive linear association between
the value of the visits abroad (y) and the percentage of employed age 16 and over (x),
confirming the impression from the scatter plot in Figures 9.6 and 9.10.

SPSS solution

SPSS data file: Chapter 9 Example 3 Pearson correlation analysis.sav (only first 8
records illustrated below)

Figure 9.21 Example 9.3 SPSS data


Run SPSS Correlation Test - Pearson

Select Analyze > Correlate > Bivariate


Transfer UK employment and UK visits abroad variables to the Variables box
Choose Pearson

Figure 9.22 SPSS bivariate correlations menu
Click OK

SPSS Output

Figure 9.23 SPSS solution

The correlation itself is 0.81. This indicates a strong (positive) linear relationship between employment and the number of visits abroad. The p-value, denoted by “Sig. (2-tailed)”, is 0.000. This p-value of 0.000 is smaller than 0.05, the level of significance α, so we conclude that the correlation between the two variables is significant. The results are based on N = 41 cases. In summary, a strong linear relationship is observed between the employment variable and the number of visits, with Pearson correlation = 0.81 and p-value = 0.000 (2-sided).

In addition to seeing how employment and the number of visits are distributed separately, we also see that low values of employment go with low values of the number of visits, and high values of employment go with high values of the number of visits.

It should be noted that if we included the outlier illustrated in Figure 9.11 rather than the
original value for Jan 2018, then the value of the correlation coefficient (r) would reduce
to 0.63 and would suggest a reduced correlation between the two variables (x & y). This
illustrates how much one single outlier can distort the true correlation between two
variables if not handled properly.

Let us clarify what the value of r does not indicate:

1. Correlation only measures the strength of a relationship between two variables; it does not prove a cause-and-effect relationship.
2. A value of r ≈ 0 indicates no linear relationship between x and y, but it may also indicate that the true form of the relationship is non-linear.

To show the case of negative correlation, we will take a look at the relationship between
the UK unemployment rate and visits abroad in Figure 9.24. Note that Figure 9.24
replaces employment data from Table 9.1, with the unemployment data (same source)
vs. the total number of people traveling abroad in the UK.

In Figure 9.24 we observe that in this case, as x increases, y decreases; the correlation between x and y is negative. Because the line runs from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation. The correlation coefficient in this case is -0.85.

Figure 9.24 Example of negative correlation

The correlation coefficient of -0.87 between the unemployment rate for those aged 16 and over in the UK and the total travel abroad from the UK indicates that there is a strong negative correlation between these two variables. As unemployment goes up, fewer people in the UK travel abroad. Conversely, the fewer people who are unemployed, the more people will travel abroad.

The coefficient of determination, r2 or R-Squared

What happens if we raise the Pearson correlation coefficient to the power of two, in other words r²? We get a new measure called the coefficient of determination. Most software packages refer to this statistic as R-squared or R-square. In our Example 9.3, the value of r = 0.81. This means that R-squared, or r² = 0.81² = 0.65. How do we interpret this statistic and the corresponding value?

Unlike the coefficient of correlation, whose range is between -1 (negative correlation), via 0 (no correlation), to +1 (positive correlation), R-squared goes only between 0 and 1. The meaning of 1 is that the changes in one variable are 100% accompanied by the changes in another variable. The meaning of 0 is that the two variables have no impact on one another.

The value of 0.65 for R-squared in the example above means that 65% of the variations
in visits abroad in the UK can be explained by the variations in employment. The
remaining 35% (100-65=35) of variations is attributed to some other factors beyond the
employment rate.

We will return to the coefficient of determination in the context of linear regression and explain how R-squared can be used to help us decide how well our predictions match the original variable.

We still have not examined how significant the linear correlation, expressed as the correlation coefficient r, is. That is, do the conclusions we have made about the sample data apply to the whole population? To do this, we need to conduct a hypothesis test.
The result will confirm if the same conclusion applies to the whole phenomenon
(population) and, importantly, at what level of significance. This test is included in the
web chapters that accompany this textbook: AW8a Testing the significance of a linear
correlation between the two variables.

Spearman’s rank correlation coefficient, rs

We will now cover an example for data collected in ranked form. In this case, a rank correlation coefficient can be determined. Equations (9.2)–(9.4) provide the value of Pearson’s correlation coefficient between two data variables x and y, which are both measured on interval scales. The question then arises: what do we do if the data variables are both ranked? In this case we can show algebraically that equation (9.4) is equivalent to equation (9.5).

$$r_s = 1 - \frac{6\sum_{r=1}^{n}(X_r - Y_r)^2}{n(n^2 - 1)} \qquad (9.5)$$

Where Xr = rank order value of X, Yr = rank order value of Y, and n = number of paired
observations.

Equation (9.5) is known as Spearman’s rank correlation coefficient. If the characteristics of any two variables cannot be expressed quantitatively, but can be ranked, we can still measure whether they are correlated by using Spearman’s rank correlation coefficient. Equivalence between equations (9.4) and (9.5) will only be true for situations where no tied ranks exist. When tied ranks exist, discrepancies between the values of r and rs arise. As with the majority of other nonparametric tests included in this textbook, ties are handled by giving each tied value the mean of the rank positions for which it is tied.

The interpretation of rs is like that for r, namely:

(a) A value of rs near 1.0 indicates a strong positive relationship, and
(b) A value of rs near -1.0 indicates a strong negative relationship.

As a reminder, note that similar to r, the same rules apply to rs:

a) If rs lies between −1 ≤ rs ≤ −0.7 or 0.7 ≤ rs ≤ 1 (strong association)
b) If rs lies between −0.7 ≤ rs ≤ −0.3 or 0.3 ≤ rs ≤ 0.7 (medium association)
c) If rs lies between −0.3 ≤ rs ≤ −0.1 or 0.1 ≤ rs ≤ 0.3 (weak association)
d) If rs = 0, as before, there is no association between the ranks

Example 9.4

A company makes seven different brands (A, B, …, G) and they are sold to two export
markets, Germany and France. The brands are ranked differently in each market. You are
asked to decide whether the brand rank in Germany correlates with the rank for seven
brands in France. The data is provided in Table 9.3. Since the information is ranked, we
use Spearman's correlation coefficient to measure the correlation between German and
French ranks.

Table 9.3 Ranks of different brands (brand A to brand G) in two markets
(Germany and France)

Excel solution

Excel does not offer a built-in Spearman rank correlation coefficient function, so we will use Excel to conduct the calculations manually. Figure 9.25 shows these calculations.

Figure 9.25 Excel solution for calculating the Spearman’s rank correlation

From Figure 9.25, the Spearman rank correlation is positive, rs = 0.643, indicating that
there is a reasonably high positive rank correlation in this case.

The way our brands A to G are ranked in German and French market indicates moderate
correlation. This implies that these two markets, as far as the ranking of our brands are
concerned, are not very strongly correlated. If this number was closer to +1, we would be
able to claim even stronger positive rank correlation.

Although Excel does not have a procedure for directly computing Spearman’s rank correlation coefficient, there is a workaround. Since the formula for Spearman’s coefficient essentially measures the same thing as Pearson’s correlation coefficient (but for ranked values), we can use Pearson’s, provided that we have first converted the x and y variables to rankings (see in Excel: Data > Data Analysis > Rank and Percentile).
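As a quick cross-check of equation (9.5), here is a minimal Python sketch. The two lists of ranks are hypothetical and are not the Table 9.3 data; the sketch assumes the ranks contain no ties (with ties, rank the raw values, giving tied values the mean rank, and apply Pearson's formula to the ranks instead).

```python
def spearman_rs(x_rank, y_rank):
    """Spearman's rank correlation via equation (9.5), assuming no tied ranks."""
    n = len(x_rank)
    d_squared = sum((xr - yr) ** 2 for xr, yr in zip(x_rank, y_rank))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical ranks of seven brands in two markets (illustration only):
german_rank = [1, 2, 3, 4, 5, 6, 7]
french_rank = [2, 1, 4, 3, 7, 5, 6]
print(round(spearman_rs(german_rank, french_rank), 3))
```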

SPSS solution

Spearman correlation coefficient

SPSS data file: Chapter 9 Example 4 Spearman correlation analysis.sav

Figure 9.26 Example 9.4 SPSS data

Select Analyze > Correlate > Bivariate

Transfer German Rank and French Rank variables to the Variables box
Click on Spearman

Figure 9.27 SPSS bivariate correlations menu
Click OK

SPSS Output

Figure 9.28 SPSS solution

From SPSS: Spearman’s correlation coefficient is 0.643.

This indicates a moderate (positive) linear relationship between the French and
German ranks. For pairs of data considered to have a strong relationship, just as in the
case of Pearson’s correlation coefficient, you will need to confirm that the value is
significant. Using Excel, this test is included in the web chapters that accompany this
textbook: AW8b Testing the significance of Spearman’s rank correlation coefficient, rs.

Using SPSS, this test is conducted by using the p-value. In Figure 9.28, under the heading “Sig. (2-tailed)”, we can see the value of 0.119. This is the p-value. Since the p-value = 0.119 > 0.05, we conclude that the correlation between the two variables is not significant. The results are based on N = 7 cases.

Check your understanding

X9.1 Why do you need to identify outliers and decide if you need to deal with them?

X9.2 If the correlation coefficient is zero, does this mean that there is no correlation
between the two variables, or that the correlation is negative?

X9.3 What is the difference between r and r-squared?

X9.4 For measuring correlations between the ranked values, would you use the
Pearson’s or Spearman’s correlation coefficient formula?

X9.5 What values of the correlation coefficient would you use to describe: strong
association between two variables, medium association and weak association?

9.4 Introduction to linear regression


Most of the time, just measuring association is not enough. Once we know that there is
association between the variables, we can also fit a line equation to a data set. This
becomes a model that will allow estimates / predictions to be provided for a dependent
variable (y) given an independent (or predictor) variable (x).

Linear regression analysis is one of the most widely used techniques for modelling a
linear relationship between variables. It is frequently used in business, economics,
psychology, and social sciences in general. This section is dedicated to linear regression
only, but the online chapters cover also non-linear and multiple regression modelling.

It is important to realize that regression analysis can be used on all types of data:

• temporal or time series data (monthly sales values, for example).


• categorical or nominal data (example is the gender of the respondent).
• ordinal data (example is ordering respondents into low economic status,
medium and high economic status).
• interval data (example are equally spaced categories, such as earnings below
20K, between 20-40K and 40-60K).

The point here is that regression analysis can be applied to virtually any kind of data type. In some cases, data manipulations might be necessary. Coefficients of correlation just establish the strength of the relationship between two variables; regression analysis, on the other hand, is a technique that will help you establish how to estimate, or predict, changes in one variable if you know the other.

9.5 Linear regression


Linear regression analysis attempts to model associations between two variables in the form of a straight-line relationship. If the relationship between any two variables were linear, and if we wanted to describe this relationship for the whole population, then we would use equation (9.6).

𝑌 = 𝛽0 + 𝛽1 𝑋 (9.6)

Where Y is the dependent variable, X is the independent variable, β0 is the intercept for this linear equation and β1 is the slope, i.e. the rate at which this straight line grows or declines. No doubt you are familiar with this equation, as it is identical to the straight-line equation.

However, in real life we seldom have a luxury of dealing with the whole population. Most
of the time we deal with a sample, or just a section of all the data available. This means
that our equation (9.6) is more likely to look like equation (9.7).

𝑦̂ = 𝑏0 + 𝑏1 𝑥 (9.7)

Where ŷ is the estimated value of the dependent variable (y) for the given values of the independent variable (x). The values of the constants b0 and b1 are effectively estimates of the true values of β0 and β1.

Let us assume that, using our model in (9.7), we are trying to estimate one of the data points, for example Y7. Our model estimates this point to be ŷ7 = b0 + b1x7. Clearly the true point Y7 and the estimate ŷ7 might not be identical (because b0 and b1 are just estimates of the true values of β0 and β1). This means that effectively for every estimate ŷ there is a potential error element: y = b0 + b1x + error. We can use a practical example to explain this better.

Imagine if you could gather the data from all the students in a country telling you how
much time they spent preparing for every exam. Let us call this population data set X.
Then you check the test results (from 0 to 100) they got for every exam. We will call this
population data set Y. If you knew this, the relationship between these two variables
would be something like Y = 40 + 0.5X (note that this equation is purely fictional and used
here just for the illustration purposes).

What can you conclude from this? First, if X=0, then Y=40. This means, if you have spent
zero hours on revisions, you are likely to get the test score of 40. What if X=50? Well, in
this case Y=65 (i.e. =40+0.5×50). In other words, 50 hours of revision is likely to give you
the test score of 65. However, the point here is that you do not know that β0 is 40 and that
β1 is 0.5. If you have conducted a quick survey at your campus, you might get the values
of b0 to be 45 and b1 to be 0.35. This shows that b0 and b1 are just estimates of β0 and β1.
By using b0 and b1 you generalize that if students spend x number of hours revising for
an exam, they will get the test score equivalent to 𝑦̂. To validate this conclusion, you will
need to conduct some further tests. However, it is quite clear that 𝑦̂ and Y might not be
identical, which is the reason why we said that individual values in the model are likely
to contain some error element.
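The arithmetic of the fictional population line Y = 40 + 0.5X can be sketched in a couple of lines of Python; the numbers are the purely illustrative ones used in the text, not estimates from any real survey.

```python
def predicted_score(hours, b0=40.0, b1=0.5):
    """Fictional population line Y = 40 + 0.5X used in the text."""
    return b0 + b1 * hours

print(predicted_score(0))    # 40.0 - no revision, expected score 40
print(predicted_score(50))   # 65.0 - 50 hours of revision, expected score 65
```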

Every model, including a regression analysis model, is just an abstraction, or approximation, of reality. This means that all models contain a certain amount of error when approximating reality. Analysing these errors, as we will see shortly, will help us decide how good our model is.

How do we calculate the constants b0 and b1 from the limited number of data points representing x and y? To do this, regression analysis uses the method of least squares. The method assumes that the line will pass through the point of intersection of the mean values of x and y, (x̄, ȳ).

Let us demonstrate this using the following illustration. Let us assume that we have asked
six students from our campus to tell us how many hours they have spent revising for their
last Statistics examination. We also asked them to tell us the test results they obtained for
this examination. The data is captured as follows:

Hours of revision (x):  20   80   170   130   210   110
Test results (y):       45   35    82    70    77    65
Table 9.4 Hours spent studying for the statistics examination

On average, these students spend 120 hours on revision and on average they got the score
of 62. These are the mean values: 𝑥̅ = 120 and 𝑦̅ =62. We can plot the results as a scatter
diagram and insert the lines representing 𝑥̅ and 𝑦̅ into the graph, as in Figure 9.29.

Figure 9.29 Line fitted through the mean value of the data points for the x and y variables

We also added the regression line (red colour in this graph), although we have not shown
yet how it was calculated (patience just for a bit longer). We just wanted to illustrate the
point above, which is that the regression line will have to pass through the intersection
point of the two means (𝑥̅ , 𝑦̅).

OK, so now we know that the regression line must go through the intersection, but what
is the correct angle? There could be many lines going through the intersection. Figure
9.30 shows several possible options.

Figure 9.30 Possible line fits

The answer is that the least squares method ensures that the regression line is pivoted about this intersection point until:

I. The sum of the vertical distances of the data points above the line equals those
below the line (i.e. the sum is zero).
II. The sum of the vertical squared distance of the data points is a minimum.

Again, a graphical representation of what we have just said is given in Figure 9.31.

Figure 9.31 Fitting a line to data using least squares method

We can see in Figure 9.31 that we measure the distance of every actual point from the regression line. When the two conditions are satisfied, we can say that we have found the regression line that best represents (or fits) this relationship. In practice, it would be very difficult to pivot the regression line until it meets the two criteria, so we use some algebra to achieve this more efficiently. Algebraically, the above conditions are defined as:

I. $\sum_{i=1}^{n}(y_i - \hat{y}_i) = 0$

and

II. $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \text{minimum}$

Where ŷ represents the values on the regression line, which are effectively estimates of y. From this concept, two 'normal equations' are defined:
$$\sum_{i=1}^{n} y_i = n\,b_0 + b_1 \sum_{i=1}^{n} x_i$$

and

$$\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$

The phrase “normal equations” implies that the distances from every point to the regression line must be orthogonal (see Figure 9.31), which is also called “normal”.

By solving the above equations simultaneously, we obtain the values of b0 and b1. These give the estimated equation of the line of regression of Y on X, where Y is the dependent variable and X is the independent variable. If we rearrange the two ‘normal equations’, then b0 and b1 are calculated using equations (9.8) and (9.9).

$$b_1 = \frac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \qquad (9.8)$$

$$b_0 = \frac{\sum_{i=1}^{n} y_i - b_1\sum_{i=1}^{n} x_i}{n} \qquad (9.9)$$
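Before turning to Excel, a minimal Python sketch of equations (9.8) and (9.9), applied to the small Table 9.4 revision data, may help to make the formulas concrete; the coefficients in the final comment are simply the result of this arithmetic, rounded.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) from equations (9.8) and (9.9)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = (sum_y - b1 * sum_x) / n
    return b0, b1

# Table 9.4 data: hours of revision (x) and test results (y)
x = [20, 80, 170, 130, 210, 110]
y = [45, 35, 82, 70, 77, 65]
b0, b1 = least_squares_fit(x, y)
print(round(b0, 2), round(b1, 3))   # approximately 34.58 and 0.231
```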

Later, we will show that Excel can be used in several different ways to undertake
regression analysis and calculate the required coefficients b0 and b1. The three possible
options that Excel offers are:

1. Dedicated statistical functions – Excel contains embedded functions that allow a range of regression coefficient calculations to be undertaken.
2. Standard worksheet functions – Standard Excel functions can be used to reproduce the manual solution, e.g. the =SUM(), =SQRT() functions, etc.
3. Excel Data Analysis > Regression – This method provides a complete set of solutions.

Regardless of which option is used, or if we do it manually, the process of going through regression analysis can be split into a series of steps:

• Always start with a scatter plot to get an idea about a possible model.
• Calculate and fit model to sample data.
• Conduct a goodness-of-fit test of the model (using the coefficient of
determination, for example).
• Test whether the predictor variables are significant contributors (undertake a t-
test).
• Test whether the overall model is a significant contributor (undertake an F-test if
you have more than one independent variable).
• Calculate a confidence interval for the population slope, β1.
• Check model assumptions.

The third Excel method listed above (Excel Data Analysis option) will automatically
deliver most of the steps listed here. The first two require manual calculations, though
they are supported by the ready-made Excel functions.

Fit line to a scatter plot

In previous sections we learned how to create a scatter plot. In this section, we will learn how to fit a line to such a scatter plot and what that implies. Excel contains several functions that allow you to directly calculate the values of b0 and b1 in equations (9.8) and (9.9).

Example 9.5

We will use the same data sets as in Example 9.1. As both Excel and SPSS offer very elegant
and time saving options to complete the regression analysis, we will avoid manual
calculations using equations (9.8) and (9.9).

Excel Solution

We will use built in Excel functions for the intercept and slope to calculate the fitted line,
as per equations (9.8) and (9.9), that will go through the data points on a scatter diagram.
Figure 9.32 represents the Excel solution to fitting a line to the Example 9.5 data set (note
that rows 10-40 are hidden).

Figure 9.32 Calculating the slope and intercept in Excel

From Excel: b0 = -50324.72 and b1 = 947.42.

This means that the equation of the sample regression line is ŷ = -50324.72 + 947.42x, or, if we rearrange the terms, ŷ = 947.42x – 50324.72.

The above equation tells us that the number of UK visits abroad can be predicted from
the percentage of employed in the UK for the ages of 16 and over as:

Number of visits abroad = (947.42 × Percentage employed) – 50324.72

If x = 60.9 (percentage employed), then the number of visits abroad according to the model is 7373 (we simply “plug in” the numbers: 947.42 × 60.9 - 50324.72 = 7373). This is the number of visits abroad that the model returns, given this percentage of employed. However, from Figure 9.32 we can see that in February 2018 we had 60.9% employed, and the number of visits abroad was actually 7250, not 7373. Obviously, this model, as we already know, produces some errors, but we will learn how to deal with this later.

We can also show a relationship between the correlation coefficient r and the slope of the linear regression, b. They are expressed as $b = r\left(\frac{S_y}{S_x}\right)$, which means that $r = b\left(\frac{S_x}{S_y}\right)$, where Sx and Sy are the standard deviations for x and y respectively. Try these equations and you will see that they work.
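A quick numerical check of this identity, again in Python and again on the Table 9.4 revision data (any consistent divisor, n or n − 1, gives the same result):

```python
import math

# Table 9.4 data: hours of revision (x) and test results (y).
x = [20, 80, 170, 130, 210, 110]
y = [45, 35, 82, 70, 77, 65]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)             # std dev of x
sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)             # std dev of y
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n    # covariance
r = sxy / (sx * sy)                                             # Pearson's r

# Slope b from equation (9.8) versus slope recovered from r:
b = (n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y)) \
    / (n * sum(a * a for a in x) - sum(x) ** 2)
print(round(b, 4), round(r * sy / sx, 4))   # both approximately 0.2313
```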

For every value of x (percentage of those aged 16 and over employed) we can now, using the model, estimate the value of the number of people travelling abroad. If we plotted these estimated values, they would represent the line of regression (sometimes called a trend line). The calculated regression line (column E in Figure 9.32) has been fitted as a dotted line to the scatter plot, as shown in Figure 9.33.

Figure 9.33 A regression line fitted through the data on the scatter diagram

From Figure 9.33 you can see that most of the data points do not reside on the fitted line,
which is to be expected. Nevertheless, the line seems to approximate the direction that
these two variables are taking. Since not all the points lie on the fitted line, we call this a
model error (sometimes called residuals or variations) between the data y value and the
value of the line at each data point ŷ (which is an estimate). This concept of error is used
to establish different types of statistics indicators, including:

• coefficient of determination (COD, or r², or R-squared)
• standard error of the estimate (SEE)
• a range of inference indicators used to assess the suitability of the regression model

These indicators serve different purposes in completing the analysis, and we will return to them later.

The simplest method to calculate the regression line in Excel is just to right-click on one
of the data points in the graph and select Add Trendline option from the box (see Figure
9.34).

Figure 9.34 Using Excel to fit the line automatically

Figure 9.35 illustrates the Format Trendline menu.

Select the following options:

• Trend/Regression Type – Linear.


• Display Equation on chart.
• Display R-squared on chart.

Click Close.

Figure 9.35 Excel Format Trendline menu

With this simple operation, we now get not just the regression line, but also the
equation, which is identical to the one we calculated using Excel functions =SLOPE() and
=INTERCEPT(), as well as the R-squared value.

Figure 9.36 The result of the Add Trendline option in Excel

We already know that the value of R2 of 0.65 implies that this regression line captures
65% of the variations in visits abroad related to the variations of the percentage of
employed of 16 and over in the UK. In other words, 35% of the variations in visits abroad
are not explained by this model but are dependent on some other factors not captured by
this model. As 65% is a reasonably high number, we can be satisfied that our model
represents the reality well.

As we already know, the observed values of y and those estimated by the regression line
(ŷ) will not be identical. In other words, the model will not be perfectly accurate and it
will have some errors. These errors, or the differences, in the context of regression
analysis are also called the residuals, and are defined by equation (9.10).

Residual = 𝑦 − 𝑦̂ (9.10)

Figure 9.37 below shows the value of these errors, or residuals, calculated at every point
(rows 10 – 40 hidden).

Figure 9.37 Example 9.5 Excel solution

These errors, or residuals as they are called in the context of linear regression analysis, are a very important part of regression analysis. Before we learn how to handle residuals in regression analysis, we need to become familiar with just a few additional concepts.

Sum of squares defined

Regression analysis involves identifying three important measures of variation: regression sum of squares (SSR), error sum of squares (SSE), and total sum of squares (SST). Figure 9.38 illustrates the relationship between these different measures. For the sake of clarity, we are showing only one point, yi. Note that this principle applies when we sum up the squared differences of all the points, hence the phrase sum of squares.

Figure 9.38 Understanding the relationship between SST, SSR, and SSE

As in our illustration at the beginning of this chapter, we can see the intersection of 𝑥̅ and
𝑦̅, and the regression line passing through this intersection. We can see that the data point
yi (blue dot) is somewhat above the regression line. The distance between this data point
and the regression line will be used to calculate the SSE (Sum of Squares of Error). In fact,
we use the word Sum, because we will square and sum up all the distances between all
the points and the regression line, though in this example only one single point is shown.

On the other hand, the distance between the regression line and the value of 𝑦̅, when
squared and summed up for all the points, will be called SSR (Sum of Squares for
Regression). If we add these two sums of squares (SSE and SSR), we get SST (Sum of
Squares in Total), which measures the difference between every data point and their
mean value 𝑦̅. Again, if we used algebra, these expressions can be expressed in a more
elegant way.

Regression sum of squares (SSR) – sometimes called explained variations is defined as:

SSR = ∑𝑛𝑖=1(𝑦̂𝑖 − 𝑦̅)2 (9.11)

Regression error sum of squares (SSE) – sometimes called unexplained variations – is defined as:

SSE = ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̂𝑖 )2 (9.12)

Regression total sum of squares (SST) – sometimes called the total variation is defined
as:

SST = ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̅)2 (9.13)

The above equations include the following symbols:

yi = actual data or observation


𝑦̅ = the mean value of actual data set
ŷi = predicted data using the regression model

The total sum of squares is equal to the regression sum of squares plus the error sum of
squares:

SST = SSR + SSE (9.14)

What we observe here is that SSR, or explained variation, measures deviations of the predicted values from the overall data mean. SSE, or unexplained variation, measures deviations of the actual values from the predicted values. And lastly, SST, or total variation, measures deviations of the actual data values from their mean. However, remember that the word “deviations” as used here in fact refers to the sum of the squared values of all these deviations.
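The three sums of squares, and the identity in (9.14), can be checked with a short Python sketch on the Table 9.4 revision data used earlier (the least-squares coefficients are recomputed inline so that the block runs on its own):

```python
# Table 9.4 data: hours of revision (x) and test results (y).
x = [20, 80, 170, 130, 210, 110]
y = [45, 35, 82, 70, 77, 65]
n = len(x)

# Least-squares coefficients from equations (9.8) and (9.9).
b1 = (n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y)) \
     / (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = (sum(y) - b1 * sum(x)) / n

y_hat = [b0 + b1 * xi for xi in x]          # model predictions
y_bar = sum(y) / n                          # mean of the actual data

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation (9.11)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation (9.12)
sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation (9.13)

print(round(ssr + sse, 4) == round(sst, 4))   # identity (9.14): SST = SSR + SSE
print(round(ssr / sst, 3))                    # SSR/SST, the R-squared discussed below
```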

Regression assumptions

Now, we have completed our calculations and produced the equation that defines the
regression line for our two variables. How do we know that this equation is
appropriate? Well, for a regression equation to truly represent the relationship between
the variables, it must satisfy certain assumptions, and there are four of them. The four
assumptions of regression are: (1) linearity, (2) independence of errors, (3) normality
of errors, and (4) constant variance.

1. Linearity

Linearity assumes that the relationship between the two variables is linear. One
of the methods to assess linearity is by plotting the residuals (or errors) against
the independent variable, x. In Excel if you go to Data > Data Analysis >
Regression, you will automatically get this plot if requested. From Figure 9.39 we
cannot see any apparent pattern between the residuals and x. Also, the residuals
are evenly spread out around the zero line (if the dot falls on this line, this means
that the error is zero).

Figure 9.39 Example 9.5 residuals versus x.

In this example, because the errors are randomly spread, the conclusion is that a line fitted to the data set appears appropriate. If, from the scatter plot, we conclude that the relationship is non-linear (i.e. there is some pattern), then a non-linear model should be fitted to the data set.

2. Independence of errors

The regression independence of errors assumption implies that the current error values are not dependent on the previous error values. To measure this effect, we use the Durbin-Watson statistic, and the effect is called serial correlation. Some textbooks will use the expression autocorrelation. Serial correlation and autocorrelation are synonyms, though they are usually used in different contexts. Both autocorrelation and the Durbin-Watson statistic are covered online.

3. Normality of errors

The assumption of normality of errors implies that the measured errors (or
residuals) have to be normally distributed for each value of the independent
variable, x. The violation of this assumption can produce unrealistic estimates for
the regression coefficients b0, b1, as well as the measures of correlation. Also,
calculations of the confidence intervals assume that the errors are normally
distributed. Normality of errors assumption can be evaluated using two
graphical methods:

(i) Construct a histogram for the errors against x and check whether the shape
looks normal, or
(ii) Create a normal probability plot of the residuals (available from the Excel
Data > Data Analysis > Regression).

Figure 9.40 illustrates a normal probability plot based upon the Example 9.1 data
set.

Figure 9.40 Example 9.5 normal probability plot of the residuals

For a plot to confirm the normality of errors, it must follow an approximately


straight line, i.e. it cannot be non-linear or “jump” too much up and down. From
Figure 9.40 we observe that the relationship is linear, and we conclude that the
normality assumption is not violated.

4. Constant variance

The equal variance assumption (or homoscedasticity) assumes that the variance of the errors is constant for all values of x. This requirement implies that the variability of the y values is the same for all values of x. In order to make correct inferences about β0 and β1, we must adhere to this assumption.

In Figure 9.41 we observe that the errors are not growing as the value of x
changes. This plot shows that the variance assumption is not violated. If the
value of error changes greatly as the value of x changes then we would assume
that the variance assumption is violated.

Figure 9.41 Example 9.5 residuals

If there are violations of this assumption, then we can use some form of data
transformations to attempt to improve model accuracy (beyond the scope of this book).
What happens if any of the four assumptions are violated? In this case the conclusion is
that linear regression is not the best method for fitting to the data set. We would in this
case need to find an alternative method or model.

Test how well the model fits the data (Goodness-of-fit)

Of the several methods used to assess how good the regression line (or the model) is, we shall discuss:

(a) Coefficient of Determination (COD), or R-squared.


(b) Residuals and the Standard Error of the Estimate (SEE).

These two statistics measure two completely different properties, but both fall under
the goodness of fit measures.

Coefficient of Determination or R-squared

We already introduced the coefficient of determination in relation to the coefficient of


correlation. Here we explain the role of this statistic in the context of regression
analysis.

The regression line effectively summarises the relationship between x and y. However, the line will only partially explain the variability of the observed values. We saw that when we examined the residuals. As we already explained, the total variability of y can be split into two components:

(i) variability explained or accounted for by the regression line.


(ii) unexplained variability as indicated by the residuals.

The COD, or R2, is defined as the proportion of the total variation in y that is explained
by the variation in the independent variable x. This definition is represented by
equations (9.15) and (9.16) below:

$$R^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}} = \frac{SSR}{SST} = \frac{\text{Explained variations}}{\text{Total variations}} \qquad (9.15)$$

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \qquad (9.16)$$

Equation (9.16) is a more efficient and symbolic representation of (9.15). We can also
show that the coefficient of determination (COD) can also be given by equation (9.17):

COD = (Correlation Coefficient)2 = R2 (or r2) (9.17)

Equation (9.17) shows that if we take the square root of the COD, we get the correlation
coefficient (which means that COD is r squared). This explains why the coefficient of
determination is also called the R-squared. Note that the coefficient of determination
equation (9.15) can be re-written in terms of SSE and SST by making use of the
relationship SSR = SST - SSE.

$$r^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} \qquad (9.18)$$

To summarise, the coefficient of determination (COD), or R-squared value as it is often called, indicates how well the model fits the data or, to put it differently, how much of the variability in the dependent variable is explained by the variability of the independent variable.

Standard Error of the Estimate (SEE)

Residuals play an important role in establishing if we selected the correct model to fit
the dataset. One of the first steps is to calculate the residuals, which as we now know,
are the difference between the actual values (y) and the values predicted by the model
(𝑦̂). The next step is to calculate the standard deviation for these residuals. This
measure, or statistic, is known as the standard error of the estimate (SEE):

$$SEE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}} \qquad (9.19)$$

We can see that the numerator in equation (9.19) is the Regression Error Sum of
Squares shown in equation (9.12), also called unexplained variations. We can,
therefore, also re-write equation (9.19) into equation (9.20).

$$SEE = \sqrt{\frac{SSE}{n - 2}} \qquad (9.20)$$

Standard Error of the Estimate (SEE) provides a measure of the scatter of observed
values y around the corresponding estimated values ŷ on the regression line. SEE is
measured in the same units as y. In other words, if y are inches or gallons, SEE will be
expressed in the same units.

SEE is effectively the standard deviation of the actual values from the predicted values. We remember that ±1 standard deviation covers roughly 68% of the population/sample. This means that we can say we are 95% confident that the true value of y will be in the interval ŷ ± 1.96SEE. To be 99% certain, we would need to take 2.58SEE, i.e. ŷ ± 2.58SEE, etc.
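Continuing the sums-of-squares sketch above (so it assumes sse, n and y_hat from that block are still defined), equation (9.20) and the rough 95% band become:

```python
import math

see = math.sqrt(sse / (n - 2))     # standard error of the estimate, equation (9.20)
print(round(see, 2))

# Rough 95% band around any model estimate, as described above:
y_hat_p = y_hat[0]
print(round(y_hat_p - 1.96 * see, 1), round(y_hat_p + 1.96 * see, 1))
```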

The Excel function =STEYX() enables the calculation of the standard error of the
estimate (SEE) as illustrated in Figure 9.42.

Remember: R-squared (or COD) measures if the model selected is a good fit for the
actual dataset. SEE measures how well the actual data points are estimated by the
regression model. The higher the value of R-squared, the more confident we are that we
have selected the correct model. The smaller the SEE, the more precise our model is,
and the estimated values are a better fit for the actual data.

Online chapters provide more detailed explanations and illustrations of how to


interpret these two statistics in regression analysis.

Example 9.6

Re-consider the Example 9.1 data set and test the linear regression model reliability.
Figure 9.42 illustrates the Excel solution to calculate the coefficient of determination
and the standard error of the estimate (again rows 10-40 are hidden in this Figure).

Figure 9.42 Example 9.6 Excel solution

By plotting the regression line onto the scattergram, as shown in Figure 9.36, we know
that many of the observed data points do not lie on the line. These differences between
the actual points and the fitted points, or residuals, are calculated in column G in Figure
9.42. It is always advisable to plot the residuals.

Plotting the residuals provides information about possible modifications of, or areas of
caution, in applying the regression line. In plotting the residuals, we would look for a

random, equal scatter about the zero-residual line. This would indicate that the derived
line was relatively free from error. Figure 9.43 shows the errors for our Example.

Figure 9.43 Example 9.6 errors plot

In Excel, as in many other software packages, the coefficient of determination (COD) is labelled R-squared. The value is calculated in cell D60 in Figure 9.42 using the Excel function =RSQ(). This is just one method to calculate the coefficient of determination. As we already said, r² = 0.65 implies that this regression model explains 65% of all the variations in travel abroad through the variations in the employment level. The remainder of the variations is subject to other influences not embedded in this model. This is a reasonable number (65%), so we will accept this linear model as a good fit for this relationship. Just to repeat, from the value of COD we can find Pearson’s correlation coefficient: r = √COD = √0.65 = 0.81.

In cell D51 in Figure 9.42 we have the value of SEE of 260.56. As the visits abroad are
expressed in thousands (for example, June 2018 shows 7,690,000 visits), this means
that the standard error of the estimate is also in thousands of visits (260,560 visits). We
will show below how to use SEE to draw further conclusions about our regression
model.

SPSS Solution

SPSS can be used to undertake the calculations described in this chapter based upon the
data in Example 9.1. In this example, we are fitting a straight line relationship between
the number of visits (y) and whether employed (x).

Input data into SPSS

SPSS data file: Chapter 9 Example 6 Linear regression analysis.sav

Figure 9.44 SPSS data (Only the first 6 rows of data are shown)

Select Analyze > Regression > Linear

Figure 9.45 SPSS linear regression menu selection

Transfer UK visits to the Dependent box

Transfer UK Employment to the Independent(s) box

Figure 9.46 SPSS linear regression menu

Note that the Method option selected is Enter

Click on Statistics
Tick the boxes as shown

Figure 9.47 SPSS linear regression statistics options

Click Continue

Click OK

SPSS Output – this will be saved to an SPSS output file

Given that we have selected several option boxes in the Statistics menu option, the
output from SPSS is comprehensive.

Figure 9.48 SPSS solution

Figure 9.49 SPSS solution continued

Figure 9.50 SPSS solution continued

If you compare Figure 9.49 with Figure 9.42 (Excel output), you will see that R-Square is
0.65 and that SEE is 260.56, which is matching Excel results.

Prediction interval for an estimate of Y

Two variables, one independent and one dependent, can be modelled using a linear
regression equation ŷ = b0 + b1 x . In this case ŷ are the estimates of y for the given value
of x. For example, we may want to know what the number of UK visits abroad (y) would
be if the UK age 16 and over employed value was set at 60% (x). The prediction interval
for y, at a value of x, is given by equation (9.21).

𝑦̂ − 𝑒 < 𝑦 < 𝑦̂ + 𝑒 (9.21)

This implies that, within a certain probability, we are confident that the true value of y is
somewhere in the interval between 𝑦̂ ± 𝑒. The error term e is calculated using equation
(9.22), where xp is the value of x for which the error is calculated.

$$e = t_{cri} \times SEE \times \sqrt{1 + \frac{1}{n} + \frac{n(x_p - \bar{x})^2}{n(\sum x^2) - (\sum x)^2}} \qquad (9.22)$$

We can see in equation (9.22) that to calculate the error values for the prediction interval,
it is not enough to just multiply the value of SEE with the t-value (or z-value). We need
another, complicated expression, given under the square root. The value of e calculated

in this way and combined with the value of ŷ that comes from the linear regression, will
provide the prediction interval.

The prediction interval is essentially a confidence interval for linear regression models. Just as the confidence interval for the population mean states that you are confident that the true mean is somewhere in the interval of values x̄ ± SE (SE = standard error of the mean), the prediction interval provides a confidence level that the true possible data value y is somewhere in the interval of values ŷ ± SEE (SEE = standard error of the estimate).

Example 9.7

Fit a prediction interval at x = 60 (i.e. if employment was 60%) to the Example 9.1 data
set. Figures 9.51 and 9.52 illustrate the Excel solution to calculate the predictor interval
(cells 21 – 42 hidden)

Figure 9.51 Example 9.7 Excel solution

Figure 9.52 Example 9.7 Excel solution continued

From Excel:

x = 60, n = 41, significance level = 0.05, tcri = ±2.012, SEE = 260.56, x̄ = 60.8, Σx = 2493.7, and Σx² = 151677.3.

Substituting these values into equation (9.22) gives:

$$e = 2.012 \times 260.56 \times \sqrt{1 + \frac{1}{41} + \frac{41(60.0 - 60.8)^2}{41 \times 151677.3 - (2493.7)^2}} = 564.23$$

The value of ŷ is calculated using equation (9.7): ŷ = b0 + b1x = -50324.72 + 947.4 × 60 = 6520.9. In the above equation for e, note that 2.012 is the t-value for a 95% confidence interval.

The 95% prediction interval for x = 60 is therefore between 5956.5 and 7085.0 (calculated as 6520.8 ± 564.23).

If the percentage of those aged 16 and over employed in the UK were 60%, we can predict/estimate that the number of UK visits abroad would in this case reach 6520.9. In fact, we can state that we are 95% confident that this number of visits abroad would be somewhere between 5956.5 and 7085.0.
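A minimal Python sketch of equation (9.22), plugging in the rounded quantities quoted above, is shown below; because the inputs (x̄, Σx, Σx², the slope and the intercept) are rounded, the printed values will differ slightly from the Excel and SPSS figures.

```python
import math

# Quantities quoted above for Example 9.7 (rounded).
n, t_cri, see = 41, 2.012, 260.56
x_bar, sum_x, sum_x2 = 60.8, 2493.7, 151677.3
b0, b1 = -50324.72, 947.42
x_p = 60.0                                   # value of x we are predicting for

e = t_cri * see * math.sqrt(1 + 1 / n
                            + n * (x_p - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2))
y_hat_p = b0 + b1 * x_p                      # point estimate from equation (9.7)
print(round(y_hat_p, 1), round(y_hat_p - e, 1), round(y_hat_p + e, 1))
```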

SPSS solution

The prediction interval can be calculated using SPSS as follows. Let us say we want to
create a prediction interval for when the employment level is 60% (x = 60). To do a

prediction, simply enter the value of the predictor variable at the last row of the data
sheet under the predictor variable and go through the model building.

Figure 9.53 Enter predictor value in the SPSS data file (Xp = 60)

Select Analysis > Regression > Linear

Transfer UK visits to the Dependent box


Transfer UK Employment to the Independent(s) box
Method: Enter

Figure 9.54 SPSS Linear Regression menu

Click on Save

Figure 9.55 SPSS Linear Regression Save menu

Now in the box labelled Prediction Values, click on Unstandardized.


This will give the predicted Y-values from the model.
The data window will have a column labelled pre_1.

For the prediction intervals, in the boxes near the bottom labelled Prediction
Intervals, put check marks in front of Mean and Individual. In the data window,
will now be columns, labelled LMCI_1, UMCI_1, LICI_1, and UICI_1. LMCI and
UMCI stand for Lower Mean Confidence Interval and Upper Mean Confidence
Interval, respectively. LICI and UICI stand for Lower Individual Confidence
Interval and Upper Individual Confidence Interval, respectively.

The values for LICI and UICI are 5956.546 and 7085.003 respectively, which is the 95% prediction interval for UK visits abroad, provided that the employment level is 60, as specified.

Click Continue

Figure 9.56 SPSS Linear Regression menu

Click OK

SPSS output

The calculated values are stored in the SPSS data file as illustrated in Figure 9.57.

Figure 9.57 95% Confidence and predictor intervals

From the SPSS data file, we have:

1. Predicted number of visits given x = 60 is given by PRE_1 = 6520.77


2. The 95% confidence interval for the individual response when x = 60 is given by
LICI_1 and UICI_1 (5956.54, 7085.00).
3. We also get the 95% confidence interval for mean response when x = 60. This is
given by LMCI_1 and UMCI_1 (6319.31, 6722.23).

We can see that SPSS solutions agree with Excel.

Excel data analysis regression solution

If we wanted to avoid doing most of the calculations from the previous example, we can
use Excel ToolPak Regression. This package provides a complete set of solutions,
including:

• Calculate the equation of the line
• Calculate measures of goodness of fit
• Check that the predictor is a significant contributor (t and F tests)
• Calculate a confidence interval for b0 and b1.

Example 9.8

Re-consider Example 9.1 data set and use the Excel Data Analysis tool to fit the linear
regression model and calculate required reliability and significance test statistics.

Excel Solution

Select Data > Data Analysis > Select Regression

• Y Range: D5:D45
• X Range: C5:C45
• Confidence Interval: 95%
• Output Range: G3
• Click on residuals, residual plots, and normal probability plot

Figure 9.58 Example 9.8 Excel data analysis regression menu

Click OK

Excel will now calculate and output the required regression statistics and charts as
Illustrated in Figure 9.59.

Figure 9.59 Excel data analysis regression solution

The printout in Figure 9.59 might look somewhat puzzling, so we will explain all the cells
from this printout and connect some of them with the terms from the Section on Sum of
Squares.

Cell H6 = Multiple R (can also be obtained using the =CORREL() or =PEARSON() functions; =RSQ() returns its square)
Cell H7 = R-Square (can be obtained as =(H6)^2)
Cell H8 = A refined version of R2, adjusted R-Square for the sample size and the number
of dependent variables (not described in this textbook)
Cell H9 = Standard Error of Estimate (SEE) (can also be obtained using =STEYX()
function)
Cell H10 = Number of observations n
Cell H14 = dfA (the number of degrees of freedom for regression v1)
Cell H15 = dfB (the number of degrees of freedom for residuals v2=n-m-1, where m=
number of independent variables, i.e. 1 in this case)
Cell H16 = dfT (total of H14 and H15)
Cell I14 = SSR (Explained variations)
Cell I15 = SSE (Unexplained variations)
Cell I16 = SST (=SSR+SSE)
Cell J14 = MSR (this is the result of I14/H14, i.e. =SSR/v1)
Cell J15 = MSE (this is the result of I15/H15, i.e. =SSE/v2). If you take a square root of this
value, you get the standard error of the estimate, as per Cell H9.
Cell K14 = F-statistic (this is the result of J14/J15, i.e. =MSR/MSE)
H19 = b0
H20 = b1
I19 = sb0 (Standard Error for b0)
I20 = sb1 (Standard Error for b1)
J19 = t-stat, or t-calc, for b0 (this is the result of H19/I19)
J20 = t-stat, or t-calc, for b1 (this is the result of H20/I20)
K19 = p-value for the intercept
K20 = p-value for the slope
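To see how the ANOVA block of Figure 9.59 hangs together, here is a small Python sketch that reproduces the same quantities (MSR, MSE, F and SEE = √MSE) on the little Table 9.4 data set used earlier; the numbers therefore differ from the Example 9.1 output, but the relationships between the cells are the same.

```python
import math

# Table 9.4 data: hours of revision (x) and test results (y).
x = [20, 80, 170, 130, 210, 110]
y = [45, 35, 82, 70, 77, 65]
n, m = len(x), 1                                    # m = number of independent variables

b1 = (n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y)) \
     / (n * sum(a * a for a in x) - sum(x) ** 2)
b0 = (sum(y) - b1 * sum(x)) / n
y_hat = [b0 + b1 * xi for xi in x]
y_bar = sum(y) / n

ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained (cell I14)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (cell I15)
msr, mse = ssr / m, sse / (n - m - 1)                   # cells J14 and J15
f_stat = msr / mse                                      # cell K14
print(round(f_stat, 2), round(math.sqrt(mse), 2))       # F, and SEE = sqrt(MSE) (cell H9)
```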

From Figure 9.59 we can identify the required regression statistics as illustrated in
Table 9.5:

Calculation                                     Regression statistic         Excel cell
Fit model to sample data                        b0 = -50324.72               Cell H19
                                                b1 = 947.4                   Cell H20
Test model reliability using the
coefficient of determination                    COD = 0.65                   Cell H7
                                                SEE = 260.5                  Cell H9
Test whether the predictor variables are
significant contributors – t test
H0: β0 = 0 vs. H1: β0 ≠ 0                       t = -7.48, p = 4.727E-9      Cells J19 and K19
H0: β1 = 0 vs. H1: β1 ≠ 0                       t = 9.56, p = 1.690E-10      Cells J20 and K20
Calculate the test statistics and p-values
using Excel – F test
H0: β1 = 0 vs. H1: β1 ≠ 0                       F = 73.38, p = 1.690E-10     Cells K14 and L14
Confidence interval for β0 and β1
95% CI for β0                                   –63931.4 to –36719.0         Cells L19 and M19
95% CI for β1                                   723.71 to 1171.13            Cells L20 and M20
Table 9.5 Linear regression test statistics

Regression and p-value explained

Cells K19 and K20 in Figure 9.59 contain the so-called p-value. What is the p-value?

The p-value measures the chance (or probability) of achieving a test statistic equal to or
more extreme than the sample value obtained, assuming H0 is true. We are already
familiar with p-value from the previous chapters on hypothesis testing. As we already
know, in order to make a decision when testing the hypothesis, we compare the
calculated p-value with the level of significance (say 0.05 or 5%). If p < 0.05, then we
reject H0. In case of linear regression, the same principle applies. If p < 0.05 (assuming
we used 5% significance level), then our model is valid. As before, H0 implies that there
is no connection between x and y (remember, we set H0 with intentions to reject it).

In practical terms, Excel applies the t-test to tell us if the predictor variable (in this case
the percentage of employed, x) is a significant contributor to the value of y (UK visits
abroad, y), given that the p-value in cell K20 in Figure 9.59 (= 1.690E-10) < 0.05. As
1.690E-10 < 0.05, we conclude that x is a significant contributor to y. Besides this, we can
see in cell K19 in Figure 9.59 that the intercept (b0) is also a significant contributor to
the value of y. The value of p shown in cell K19 in Figure 9.59 shows the value of p =
4.727E-09. This is much less than 0.05, hence the conclusion that the intercept is very
important in this equation.

The significance of the F-test shown in cell L14 in Figure 9.59, confirms that the model is
a significant contributor to the value of the dependent variable (p = 1.690E-10 < 0.05).
This confirms the t-test solution and we conclude that there is a significant relationship
between the percentage of employed and visits abroad. For simple models with just one independent variable, the relationship between t and F is given as t = √F = √73.380 ≈ 8.57. See cell J20 in Figure 9.59, which is the t value, and cell K14 in Figure 9.59, which is the F value. The Regression Data Analysis tool also helps with checking some of the assumptions, namely linearity, constant variance, and normality, as illustrated in Figures 9.60–9.62.
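The same t and F values can be approximated directly from the correlation coefficient: the t statistic for testing r (covered in the online chapter AW8a) is t = r√(n − 2)/√(1 − r²), and for a single predictor F = t². The Python sketch below uses the rounded r = 0.81 with n = 41, so the result is close to, but not exactly equal to, the Excel figures above.

```python
import math

r, n = 0.81, 41
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t statistic for the slope / for r
print(round(t, 2), round(t ** 2, 2))               # t, and F = t squared
```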

Figure 9.60 Residual output (note that rows 36-56 are hidden for clarity of
output)

Figure 9.61 Plot of residuals against x

Figure 9.61 demonstrates that we have no observed pattern within the residual plot. We
can, therefore, assume that the linearity assumption is not violated. A further conclusion
is that both the residuals and the variance are not growing and are bounded between a
high and low point. We use this conclusion to state that the variance assumption is also
not violated.

Figure 9.62 Assumption check for normality

The normal probability plot in Figure 9.62 illustrates that the relationship is linear. We
conclude that the normality assumption is not violated. In summary, our model does not
violate any of the linear regression analysis assumptions, and therefore, it is a good
representation of the relationship between the level of employment in the UK and travel
abroad from the UK.

Check your understanding

X9.6 State what the two coefficients, b0 and b1, in the regression equation represent and what they are estimating.

X9.7 What is the point at which the regression line pivots and what are the conditions
that need to be satisfied for the best fit regression line?

X9.8 What is the meaning of the word residual in regression analysis and what other
expressions are used for the same term?

X9.9 Explain what other term is used to describe the total variation in regression analysis, and what kinds of variation constitute the total variation in this type of analysis.

X9.10. State the four assumptions that need to be satisfied for the linear regression to be
considered appropriate.

Chapter summary
In this chapter, we have introduced the concepts of correlation, coefficient of
determination, or R-squared value, and regression analysis. These concepts explain
relationships between variables / data sets. As we progress through the remaining
chapters, we will see that they are also important building blocks for other statistical
techniques. The coefficient of correlation is based on the concept of covariance that
measures how closely the two variables are associated. If the number is positive, it
implies positive association (as one variable grows, so will the other). If the number is
negative, it implies negative association (as one variable grows, the other will decline).
However, the absolute value of the covariance has very little meaning, and it is impossible to compare covariances for data sets that use values from different ranges of numbers. To address this issue, we introduced the coefficient of correlation.

The correlation coefficient standardizes the variances and, regardless of what range of
values or units is used in the data set, it always returns the values between -1 and +1.
The closer the coefficient of correlation to +1, the stronger the association between the
variables (growth in one variable is accompanied by the growth in another). The
opposite case is: the closer the coefficient of correlation to -1, the more opposite the
association between the variables (growth in one variable is accompanied by the
decline in another). The value of the coefficient of correlation that is zero, or close to
zero, indicates that there is no meaningful linear association between the variables.
However, there may be a non-linear association between the variables.

We also emphasized that correlation and causation are not to be confused. The fact that
two variables are highly correlated does not necessarily mean that they influence or
cause movements between each other. Two specific correlation coefficients were
introduced. The first one was Pearson’s correlation coefficient and the second one was
Spearman’s rank correlation coefficient. Pearson’s coefficient of correlation measures
linear relationship between two variables, whilst the Spearman’s coefficient measures
the ranked values rather than raw data. As both coefficients of correlation apply to
sample data, we also introduced (online) how to use the hypothesis testing to gather
evidence as to whether the sample data results can be applied to the whole population.

The squared value of the coefficient of correlation is called the coefficient of


determination (r2, or R-squared), and we introduced this concept too. Unlike the
correlation coefficient, the coefficient of determination can take values between 0 and 1.
These numbers can be interpreted as percentages indicating what percentage of the
variations in one variable is associated with the variations in another variable. The
difference between 1 and the value of the coefficient of determination indicates how much of the variation in one variable depends on other, i.e. external, factors that are beyond the association between these two variables.

Continuing with the premise that the two variables are in linear relationship, we
introduced a simple linear regression analysis tool. We demonstrated how to fit an
appropriate model to the data set using the least squares regression. We defined the
meaning of different types of variations and what is the relevance of residuals, i.e. errors
in regression analysis. This led us to explain the assumptions of linear regression, as
well as how to test the reliability of the model using either the t-test or the F-test (online
chapters). We also introduced the standard error of the estimate as well as how to use it
to create a prediction interval.

Test your understanding


TU9.1 Take the following two data sets: X1: 2, 4, 7, 5, 9, 13, 13, 15, 14, 18 and X2: 1, 3, 8,
4, 4, 9, 12, 22, 14, 15. Construct the scatter diagram.

TU9.2 How would you deal with an outlier in TU9.1?

TU9.3 Calculate the correlation coefficient for the TU9.1 example.

TU9.4 Show that the correlation coefficient from TU9.3 is significant and not just
random.

TU9.5 Level 1 university students ranked their five top Lecturers as: John, Lucy, Steve,
Mark and Alice. Level 2 students ranked the same lecturers as follows: Lucy,
John, Steve, Alice and Mark. How closely are the Level 1 students and Level 2
students’ perceptions about their top Lecturers aligned?

TU9.6 In the regression equation ŷ = b0 + b1x, the value of b0 is given by the equation:

A. b0 = (∑Y − b1∑X²)/n    B. b0 = (∑Y − b1∑X)/(2n)
C. b0 = (∑Y − b1∑X)/n    D. b0 = (∑Y − n∑X)/n

TU9.7 In the regression equation ŷ = b0 + b1x, the value of b1 is given by the equation:

A. b1 = (n∑XY² − ∑X∑Y)/(n∑X² − (∑X)²)    B. b1 = (n∑XY − ∑X∑Y)/(n∑X² − (∑X)²)
C. b1 = (n∑XY − ∑X∑Y)/(n∑X − (∑X)²)    D. b1 = (n∑XY − ∑X∑Y)/(n∑X² − (∑X))

Use the ANOVA table 9.6 to answer exercise questions TU9.8 – TU9.11:

ANOVA df SS MS F Significance F
Regression 1 169759.3 169759.3 261.6392 9.76E-16
Residual 28 18167.14 649.8263
Total 29 187925.5
Table 9.6 ANOVA table

TU9.8 Calculate the coefficient of determination, COD:

A. 0.78 B. 0.9 C. 0.80 D. 1.80

TU9.9 Calculate the value of Pearson’s correlation coefficient, r:

A. 0.99 B. 1.89 C. 0.95 D. 0.89

TU9.10 Calculate the value of the standard estimate of the error, SEE:

A. 155.20 B. 133.18 C. 35.47 D. 25.47

TU9.11 In 2019 Pet-Dog Ltd. ascertained the amount spent on advertising and the
corresponding sales revenue by ten marketing clients.

Advertising (£000s), x    Sales (£000s), y

5 104
13 173
11 121
16 156
2 50
19 182
22 199
8 76
11 95
Table 9.7 Spend on advertising versus sales

(a) Plot a scatter plot and comment on a possible relationship between sales
and advertising.
(b) Use Excel regression functions to undertake the following tasks:
i. Fit linear model,
ii. Check model reliability (r and COD),
iii. Undertake appropriate inference tests (t and F test),
iv. Check model assumptions (residual and normality checks),
v. Provide a 95% confidence interval for the predictor variable.

TU9.12 Over the last 14 days you register the morning temperature in degrees Celsius
and count the number of students present in your Business Statistics 101 class.
The two variables are showing the following values:

Morning temperature 16 18 18 16 18 24 22 18 16 17 20 24 20 18
Students in the class 78 65 70 76 75 60 64 72 75 72 70 59 65 70
Table 9.8 Temperature versus attendance

Establish if there is any correlation between these two events and conduct the tests to determine if the outside temperature in general has an impact on the number of students attending Business Statistics classes.

TU9.13 The cinemas are currently showing four new blockbuster movies. The movies
are labelled as A to D. In London, the movies are ranked per number of viewers
as: D,A,B,C. In Manchester the same movies are ranked per number of viewers as:
D,C,B,A. Are the audience tastes in these two cities comparable, at least as far as
these four movies are concerned?

TU9.14 Assignment and final examination marks for 13 undergraduate students in Business Statistics are given in Table 9.9. Fit an appropriate equation to this data set to create a model from which the final examination marks can be predicted given the assignment marks.

Assignment 72 40 48 37 100 80 100 88 60 45 70 48 46
Examination 80 64 70 62 80 81 81 73 65 58 69 59 60
Table 9.9 Examination versus assignment mark

(a) Plot a scatter plot and comment on a possible relationship between the examination and assignment marks.
(b) Use Excel regression functions to undertake the following tasks:
i. Fit linear model,
ii. Check model reliability (r and COD),
iii. Undertake appropriate inference tests (t and F test),
iv. Check model assumptions (residual and normality checks),
v. Provide a 95% confidence interval for the predictor variable.

Want to learn more?


The web chapters contain additional sections to provide further information on the
following topics:

1. A9Wa Testing the significance of linear correlation between the two variables.
2. A9Wb Testing the significance of Spearman rank correlation coefficient.
3. A9Wc The use of the t-test to test whether the predictor variable is a significant
contributor.
4. A9Wd The use of the F-test to test whether the predictor variable is a significant
contributor.
5. A9We Confidence interval estimate for the slope.
6. A9Wf Autocorrelation.
7. A9Wg Standard error for the autocorrelation function.
8. A9Wh Significance of the autocorrelation coefficients and evaluation.
9. A9Wi Partial autocorrelation coefficient.
10. A9Wj Error and residual inspection.
11. A9Wk Non-linear regression analysis.
12. A9Wl Multiple regression analysis.
13. A9Wm Linear regression and Durbin Watson test for autocorrelation.
14. A9Wn Regression goodness of fit.

Chapter 10 Introduction to time series data, long-term
forecasts and seasonality
10.1 Introduction and chapter overview
Time series are time-based variables. They are a series of data points listed in time
order. The aim of this chapter is to provide the reader with a set of tools which can be
used for time series analysis and extrapolation. This chapter will allow you to apply
several time series methods for long-term forecasting and extrapolation. The methods
we will cover are applicable to all types of temporal data. However, we are restricting
our applications to one single time series. This is often called the univariate approach to
time series analysis and forecasting. The methods covered are suitable for many
applications in economics, business, finance, social sciences, and natural sciences.

We will start the chapter by explaining what types of time series are likely to be found
and what differentiates them. Next, we will explain how different types of models can be
fitted to different types of time series. This will be followed by a brief overview of
different types of error measurements that are used to establish the quality of our
forecasts and the suitability of the model used. Further on, we will cover the prediction
interval for time series analysis. We will modify the formula for calculating the standard
error of the estimate to show that the prediction interval should grow wider the further
in the future we extrapolate the time series. This will match the intuitive assumption
that the further into the future we go, the uncertainty grows larger and larger.

The last topic we will cover is dedicated to seasonal time series. We will use the classical decomposition method to extract different components from the time series and learn how to predict seasonality and cyclical movements in a variable.

In many ways, this chapter is an extension of the Linear Regression chapter, but with
one difference. While simple regression analysis involved two variables that may not be
temporal, this chapter applies strictly to time series. In fact, we will still use two
variables. One variable, measured in time, will be treated as a dependent variable. The
other variable, the independent one and a predictor, will be just sequential units
representing the time.

The practical value of time series analysis is immense. The applicability of these methods
is universal and covers almost any commercial or scientific discipline. No matter what
you do, the chances are that you will have some data ordered in time. This could be sports
results, inflation figures, mortality data, spending patterns, crime rate, etc. Once you have
these figures ordered in a time series, you will invariably ask yourself: can I establish a
trend here and see where this is going? If you can establish “where this is going”, you can
effectively predict what this variable will be in x number of time units from now.

Why is predicting something so important? Well, if we take a business example, you will
soon realize why. If someone told you that next month the demand for your product is
going to double, what would you do? You would probably double your production today,
in order to meet the next month’s demand. This is precisely the purpose of forecasting.
You can take actions today and be better prepared for the future. Forecasts provide

insight into uncertainty that tomorrow brings. By learning how to forecast future events,
you are effectively managing the uncertainty of the future. This means that you can take
actions, make decisions, or plan better to be ready for this future.

If you can gain glimpses into what tomorrow brings, you will be more confident in making
decisions today. When we use the phrase “gain glimpses”, we do not mean that you will
know exactly what will happen. We mean that the statistics will help you identify the most
probable area, or range of numbers, likely to happen in the future. This is what time series
analysis, or forecasting, is all about.

Learning objectives

On completing this unit, you should be able to:

1. Understand the terminology associated with time series analysis.
2. Be able to inspect and prepare data for forecasting.
3. Plot the data and visually identify patterns.
4. Fit an appropriate model to the data set using the time series approach.
5. Use the identified model to provide forecasts.
6. Calculate a measure of error for the model fit to the data set.
7. Learn how to calculate a prediction interval.
8. Learn how to handle seasonal time series and apply the decomposition method.
9. Solve problems using Microsoft Excel and SPSS.

10.2 Introduction to time series analysis


In this chapter, we apply the regression analysis principles to one single variable,
measured in time. This implies that we are still using two variables, but one of them is the
time. Previously we fitted a line equation that best describes relationship between a
dependent variable (y) and an independent (or predictor) variable (x). In this chapter we
still use only one dependent variable (y), whilst the independent variable (x) is a
sequence of numbers representing time.

In linear regression, we used the model to estimate the value of y, given the knowledge of
x. This means the objective was primarily to estimate y for the current range of x (though
it could be used for extrapolating y for future values of x). The forecasting objective is primarily to extrapolate the variable into the future by relying on its history. Extrapolation implies that we can take the future values of x and estimate what the potential value of y might be.

So, how do we define a time series? A time series is a variable that is measured and
recorded in equidistant units of time. A good example is inflation. We can record monthly
inflation, quarterly inflation, or annual inflation. All three data sets represent a time
series. In other words, it does not matter what units of time we use as long as we are
consistent, and the time units are sequential. By consistent, we mean that we are not
allowed to mix the units of time (daily with monthly data or minute with hourly data, for
example). By sequential, we mean that we are not allowed to skip any data points and
have empty values for any point in time. Should this happen, we need to somehow

estimate the missing value. The easiest way is to calculate the average of the two neighbouring values. Other, more sophisticated methods might be even more appropriate.
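As a simple illustration of this gap-filling idea, here is a minimal Python sketch (the series and its values are invented; this is not part of the book's Excel/SPSS material) that replaces a single missing observation with the average of its two neighbours.

import numpy as np

# invented monthly series with one missing observation (np.nan)
series = np.array([12.0, 14.0, np.nan, 15.0, 17.0, 16.0])

for i in np.where(np.isnan(series))[0]:
    # average of the two neighbouring values (assumes the gap is not at either end)
    series[i] = (series[i - 1] + series[i + 1]) / 2

print(series)   # the gap is filled with (14 + 15) / 2 = 14.5

For longer gaps, a linear interpolation (for example pandas' Series.interpolate()) achieves the same effect in one call.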

What is the purpose of time series analysis? Well, the main purpose of time series analysis
methods is to identify the historical pattern that a time series exhibits through time, and
to predict the future movements of a variable based on these historical patterns. In other
words, forecasting the future values is the main concern. To assess if the correct
forecasting method has been used, several other auxiliary methods have been developed. They all fall into the category of time series analysis. Nevertheless, forecasting remains the main purpose.

To be fully equipped to deal with time series and forecasting, like with any other area,
you need to understand the terminology of this area of statistics. This terminology is
primarily related to the types of data that you will encounter, and to the types of methods
that are available to be used. The following sections will provide a brief overview of the
terminology used in time series analysis.

Stationary and non-stationary data sets

Example 10.1 consists of two time series data sets. The data sets come from the World
Bank and they show the birth rates for India and Switzerland per 1,000 people from
2000-2018. If we just looked at the data in columns B and C, in Figure 10.1, we could
conclude very little. However, if we plot these numbers, as in Figure 10.2, we
immediately see the difference.

As we will describe below, some time series oscillate around the fixed mean value (they
are called stationary time series) and some around a continuously changing value (they
are called nonstationary time series). The estimation principles and the statistics used
to describe the horizontal time series vs. upward or downward moving time series are
different.

This implies that very often two different sets of methods are used when handling
stationary vs. nonstationary time series. Stationary time series have constant mean,
variance, etc., whilst nonstationary time series do not. Because of that, nonstationary
time series violate a number of assumptions that are used for modelling stationary time
series, and therefore, we have a separate set of models dedicated to nonstationary time
series.

Example 10.1

Figure 10.1 Birth rates for India and Switzerland between 2000-2018

Figure 10.2 A graph for birth rates for India and Switzerland between 2000-2018

The data for Switzerland shows a line that seems to be moving horizontally, oscillating
around some central value. Meanwhile, the data for India are undoubtedly moving
downwards, which implies that data do not oscillate around some central value, but
around some moving value. The time series representing birth rates for Switzerland,
following a horizontal line, is called stationary, whilst the time series for India is called
a non-stationary time series. Every time series must fall in one of these two categories.
Why is this important? Most of the time you cannot use the same method to successfully

handle a stationary and a non-stationary data set. A variety of methods have been
invented to handle either the stationary or non-stationary time series.

Another point that can be inferred from the opening of this section is that we can ‘see’
very little by just looking at the data values. This implies that charting the data is not
optional. It is one of the pre-requisites in time series analysis.

Figure 10.3 Birth rates in Switzerland 2000-2018

Before we proceed, let us look at Figure 10.3. We charted the values for the birth rates
in Switzerland between 2000 and 2018. The axis of the chart on the left specifies the
years, but we could have easily shown just sequential numbers, which is what we did
with the chart on the right. As long as the title (or sometimes the chart legend) indicates that the starting year is 2000, the number 1 on the x axis implies the year 2000, the number 2 the year 2001, and so on until we come to 2018, which is sequential number 19.

This also implies that when dealing with time series, the variable that we are charting is
typically the dependent variable and the independent variable is simply time. The time
is in this case defined by the context, but we can use the expression ‘time period’ and
mark every observation with the sequential numbers starting from one onwards. This
column will in fact become a variable, as we will see in the pages to follow.

Seasonal time series

So, every time series is either stationary or non-stationary, and furthermore, every
time series can also be either seasonal or non-seasonal. Intuitively we understand that
the word seasonal means a time series that shows some repeated pattern over the units
of time less than a year (monthly, quarterly, etc.). If the pattern repeats over longer
time intervals, like several years, the time series is called cyclical. However, as both
types show repeated pattern regardless of the time units, for forecasting purposes they
are treated in a similar way. A variety of methods exist to treat seasonal time series.

Example 10.2

Here is an example of one seasonal and stationary time series (Figure 10.4) and one
seasonal and non-stationary time series (Figure 10.5).

Figure 10.4 Seasonal stationary time series

Figure 10.5 Seasonal non-stationary time series

Both seasonal and cyclical patterns repeat themselves after some fixed number of
time units (days, weeks, months, or quarters for the seasonal data and years for the
cyclical data). This is also sometimes called periodicity. Remember that seasonal data
sets represent a special set of time series and we will learn that there are methods
dedicated to use exclusively with seasonal time series.

Check your understanding

X10.1 Chart the following time series and decide if it is stationary and/or seasonal.

x 1 2 3 4 5 6 7 8 9 10
y 24 60 72 72 48 60 84 60 96 108
Table 10.1

X10.2 The time series in Table 10.2 is seasonal. What is the periodicity, or the number of periods over which you think this time series shows seasonality?

x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
y 15 45 75 45 15 45 75 105 75 45 75 105 135 105 75
Table 10.2

X10.3 Is it possible to have a time series that is seasonal and non-stationary? If so, can you draw a graph showing how one such series could look?

X10.4 Go to one of the web sites that allow you to download financial time series (such
as http://finance.yahoo.com/) and plot the series of your choice in several
identical line graphs. Change the scale of the y-axis on every graph and make
sure that they are radically different scales. What can you say about the
appearance of every graph?

X10.5 Describe why you think different forecasting methods are used for stationary as opposed to non-stationary time series.

10.3 Trend extrapolation as long-term forecasting method


Long term forecasting is essentially the same as trend extrapolation. If we are going to
produce long-term forecasts, this implies that we are not necessarily interested in every
detail of the future time series. We are interested in the general direction and the speed
(slope) at which the future values of the variable are going to happen.

Long-term forecasts do not mean that we must go several years in the future. It means
that we want to go several time periods in the future. What we are saying is that
regardless of units of time (years, days, minutes), long term-forecasting implies that we
are going to forecast the values of the variable for a considerable number of time periods.
What is a considerable number of time periods? Well, it depends on how long the time series is. A general and empirical rule is that long-term forecasts should not exceed one third of the total number of observations (n/3) we have in our data set (we call these the historical observations).

As an example, if your time series consists of 36 observations (n), the maximum number
of forecasts to be produced should not exceed 12 (n/3).

In general, we need to exercise common sense and see how far in the future it makes
sense to forecast. If we go too far, the prediction interval will be so wide that the uncertainty will not be reduced. In other words, if everything is possible, why bother to forecast?

A trend component

A trend can be described as a general shape and direction in which something is moving. In the context of forecasting, a trend is a component that best describes the
shape and direction of the time series. Graphically, this shape and direction can be
approximated by any curve, though in this textbook we mainly deal with linear trends.
In addition to trend, other components can be present in time series, but for now we
will focus on the trend component only.

Let us say that, for all practical purposes, we are only interested in estimating the trend.
After we calculated the trend that models the time series, the difference between every
actual value in the time series and every trend value is effectively a series of residuals
(R). We can say that the time series Y, using this simplified model, consists of only two
components, as defined by equation (10.1).

Y=T+R (10.1)

However, we also need to make one further assumption and that is, in this case, that the
residuals are something that should randomly oscillate around the trend (we know this
already from regression analysis in Chapter 9). In other words, if we can estimate the
underlying trend of a time series, we will not worry about these random residuals
fluctuating around the trend line (at least not in the context of long-term forecasts).

After we extrapolated this trend, the trend effectively becomes a forecast of the time
series. We realise that this forecast will not be 100% accurate, i.e. every trend value will
not be the same as the actual value, but that is OK because we are only interested in the
shape and direction of the time series.

In practice, extrapolating the trend is the same as producing the long-term forecasts.

Example 10.3

Our simplified example in Figure 10.6 shows that the trend we calculated is in fact the
estimate (or, forecast) Ŷ. If we add the residuals to the trend, we will get the values of
the actual time series Y. Just as per our equation (10.1).

Figure 10.6 The actual data Y, their trend value T and the residuals R

This simplified example makes the point that you can create reasonable long-term
forecasts by only focusing on the trend. For short term forecasts this approach would be
“short” of expected precision, but we will tackle this problem in the following Chapter.
Let us stay with the trend for now.

Another key point here is that before we extrapolate the trend, we always need to fit the
trend to the existing values. Figure 10.6 does not show any extrapolated values, just fitted
trend values (column C in Figure 10.6) to historical values of the data set, i.e. the time
series.

Fitting a trend to a time series

If a trend is the underlying pattern that indicates the general movements and the
direction of a time series, then this implies that a trend can be described by some
regular curve. This usually means a smooth curve, such as a straight line, a parabola or a
sinusoid, or any other well-defined curve.

Rather than starting with manual calculations, on this occasion we will start with Excel.
Excel is very well equipped to help us define the trend and fit it to time series and
extrapolate it into the future. The way Excel is used to achieve this, is identical to the
way we used it to demonstrate how to apply certain elements of regression analysis in
previous Chapter.

Example 10.4

We will use a time series representing average annual UK petrol prices in pence per litre
1983-2020 (leaded 4-star up to 1988, unleaded thereafter). The time series consists of
38 observations, as illustrated in Table 10.3.

Year    Price per litre (p)    Year    Price per litre (p)
1983 36.7 2002 69.9
1984 38.7 2003 77.9
1985 42.8 2004 77.9
1986 38.2 2005 79.9
1987 37.8 2006 88.9
1988 34.7 2007 87.9
1989 38.4 2008 103.9
1990 40.2 2009 89.9
1991 39.5 2010 111.9
1992 40.3 2011 129.9
1993 45.9 2012 134.1
1994 48.9 2013 138.9
1995 50.9 2014 130.9
1996 52.9 2015 109.9
1997 57.9 2016 103.9
1998 60.9 2017 117.9
1999 61.9 2018 115.9
2000 76.9 2019 111.9
2001 77.9 2020 111.9
Table 10.3 A time series with 38 observations

When charted as a line graph, the time series looks as illustrated in Figure 10.7.

Figure 10.7 A graph of the time series from Table 10.3

Excel Solution

To fit a trend line to the time series is a very easy graphical process in Excel, as we
already demonstrated in Chapter 9.

Figure 10.8 Fitting a trend to a time series in Excel

To fit a trend line to the time series right-click on any data point in the Excel graph, as
illustrated in Figure 10.8, and select Add Trendline. After selecting Add Trendline,
choose Linear Option, as well as Display Equation on chart, and Display R-squared on
chart (see Figure 10.9). Click on close.

Figure 10.9 Excel Trendline options box

The final graph with the trend line is automatically added as illustrated in Figure 10.10
with the line equation and coefficient of determination (R2) included.

Figure 10.10 Graph of the time series from Table 10.3 and its trend line with the
trend equation details

As we can see, we were able to get instantly a straight line that describes the underlying
movement and the direction of our time series, i.e. the trend. This trend, when
extrapolated, becomes a forecast.

Using a trend chart function to forecast time series

It is quite a simple task to use the fitted Excel trend line to forecast, for example, ten
time periods into the future. Right-click on the trend line on the graph and choose
Format Trendline as illustrated in Figure 10.11.

Figure 10.11 Formatting the existing trendline in Excel

Figure 10.12 Trendline options box

After we clicked on Format Trendline, a dialogue box as in Figure 10.12 will appear. We
are already familiar with this dialogue box. Under the Forecast option, we can see the
field called Forward. In the box next to it we will opt for an extension of the trend line
that will extrapolate the trend to future 10 periods, as seen in Figure 10.12.

Figure 10.13 illustrates the modified time series chart with the trend line extended by 10 time periods to provide a forecast for time points 39 to 48.

Figure 10.13 Graph of the time series from Table 10.3 and its trend line with the
trend equation details and 10-period forecasts

We can see that the actual time series is not a smooth straight line, but it oscillates
around the smooth straight line that we have estimated. By extrapolating our straight
line, or linear trend, into the future, we are anticipating that our forecasts might not be
completely accurate, but we believe that this trend represents well the direction and
how steep are the movements of our variable.

Excel does not just give us a pictorial of this trend line, but the actual equation of this
line. From Figure 10.13, we can see that this trend line is moving in accordance with the
equation: y = 20.964 + 2.88x. The R-squared (or R2) value is 0.89.

As we know, the closer R2 is to the value of 1, the better the fit of the trend to time series.
In our case R-squared is 0.89, which is very good. This confirms that our trend is
approximating, or fitting, the historic data very well. Only 11% (1 - 0.89 = 0.11) of data
variations are not explained by this linear model. This is more than reasonable.

SPSS Solution

SPSS data file: Chapter 10 Example 4 Time series trend.sav

Enter data into SPSS – only first 15 data values shown in Figure 10.14.

Figure 10.14 The first 15 values of the time series from Table 10.3

SPSS Curve Fit

Analyze > Regression > Curve Estimation

Transfer the variable Series to the Dependent (s) box

Choose Independent Variable: time

Choose Models: Linear

Figure 10.15 Selecting the linear trend model in SPSS Curve Estimation mode

Select Save

Choose Save Variables:


• Predicted values
• Residuals
• Prediction intervals, 95% confidence interval
• Predict Cases – Predict through Observation = 48

Figure 10.16 Selecting 10 future observations to be forecasted (the time series has 38 observations)

Select Continue
Select OK

When you click OK the following menu appears, choose OK

Figure 10.17 A notification box from SPSS informing about the creation of 4
additional variables

SPSS Data File

The SPSS data file is modified to include predicted values, residuals, forecast values for time points 39-48, and prediction intervals. We will return to prediction intervals in greater detail later in this chapter. Figure 10.18 shows only the first 15 data values in the screenshot.

Figure 10.18 The first 15 observations of data from the table 10.3 and four new
variables created whilst building the forecasting model

Figure 10.19 represents the forecast values for time points 39-48 [manually entered time points t = 39 to 48].

Figure 10.19 The forecasts (future 10 observations of data from the table 10.3)
and four new variables created whilst building the forecasting model

SPSS Output

Model summary

The trend line equation statistics are provided in Figure 10.20 (T = 20.964 + 2.88 * time
point) with a time series plot in Figure 10.21.

Figure 10.20 SPSS model summary for linear trend forecasts for data from Table
10.3

Figure 10.21 SPSS graph of the data from table 10.3 and the fitted linear trend

Going back to the trend equation, we said that the trend line equation, in this case, was y = 20.964 + 2.88x. Excel extrapolated our trend line ten periods into the future but, unlike the SPSS printout, we know neither the past values nor the future values of this trend line. All we have is the chart that does this for us. We need to learn how to calculate these values manually, or by using the built-in Excel functions.

Trend parameters and calculations

Think of the equation y = 20.964 + 2.88x as a specific case, fitted to our data set. In the previous chapter we used the regression equation ŷ = b0 + b1x, which looks similar. In fact, this is the same equation.

Do not be confused with the notation. In most textbooks this equation is written as

y = ax + b, or y = a + bx

Whatever the case, the letter that stands alone (without x) is called an intercept and the
other letter associated with x, is called the slope.

In our case, the value of the intercept is 20.964 and the value of the slope is 2.881.
Chapter 9 explains the meaning of these two parameters. Refresh your memory if you
need to and re-read Chapter 9. Equations (9.8) and (9.9) refer to b1 and b0 respectively.
Equation (9.9) that applies to b0, or letter a as used here, is the intercept. Equation (9.8)
that applies to b1, or letter b as used here, is the slope of the trend. Effectively, to
calculate our past and future trend values we just need these two parameters. The
values of x are presented by the sequential numbers that represent time periods.

Example 10.5

We will use the same numbers as in Example 10.4 to calculate the intercept a, as well as
the slope b and to show how these calculations are executed manually. We will modify
equations (9.8) and (9.9) to make them a bit easier to apply for calculating trend, so the
equations for slope (10.2) and the intercept (10.3) are:

b = (∑xy − n·x̄·ȳ) / (∑x² − n·x̄²)    (10.2)

a = ȳ − b·x̄    (10.3)

Where, y is the variable, x are the incremental numbers representing the time periods, 𝑦̅
is the mean value for y, 𝑥̅ is the mean value for x and n is the number of observations, or
data points available. The manual calculations are illustrated in Figure 10.22:

Figure 10.22 Manual procedure for fitting a linear trend to data from Table 10.3

Summary statistics: n = 38, x̄ = 19.5, ȳ = 77.123, ∑xy = 70308.8, n·x̄·ȳ = 57148.6, ∑x² = 19019

b = (70308.8 − 57148.6) / (19019 − 38 × 19.5²) = 2.879

a = 77.123 − 2.879 × 19.5 = 20.964

As we can see these numbers agree with what we got from Excel in Figure 10.13 (as
well as SPSS in Figure 10.20). Let us now show how to calculate the same values in an
even more elegant way using Excel built in functions.

Example 10.6

To fit a trend line to a time series data set, we need the value of the slope of the trend line and its intercept. We can either use the built-in Excel functions =SLOPE() and =INTERCEPT(), or use one single function called =TREND().

Excel Solution

Figures 10.23 and 10.24 illustrate both approaches. In addition, we extrapolated the values of ŷ ten periods into the future (rows 8-27 are hidden).

Figure 10.23 Excel calculations for finding the intercept and the slope of the
linear trend

Figure 10.24 Excel calculations for linear trend using the coefficient method or
single function =TREND() method

Column H in Figure 10.24 shows manual calculations of the trend, after we used Excel
functions for the slope and intercept. Column L in Figure 10.24 shows the same values,
but this time calculated using a dedicated Excel function for trend. The forecasts, or
future trend values at time points 39 to 48 (H34:H43 and L34:L43), are produced in the same way as the historical values.

The future values of x should always be a sequential continuation of the time period
numbers used in the past. In our case, the last observation is for period 38, which means
that the future values of x are 39, 40, …, 48. There are exceptions, where the future values
start from 1, 2, … , m, but we will make sure that this is clearly understood when we come
to these methods.

The principles of calculating the linear trend, as described here can be applied to other
types of curves. The Manual and the Function method in Excel work with any curve,
though the equations are different. In Excel, if we choose to apply the Function method,
in addition to =TREND() another function called =GROWTH() can be applied. GROWTH
is an Excel function that fits exponential trends. It is invoked and used in exactly the same way as the TREND function used for linear time series.
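For completeness, the Python sketch below (an illustration we added, using the first eight values of Table 10.3; it is not one of the book's workbook files) applies equations (10.2) and (10.3) directly, cross-checks the result against numpy's least squares fit, and then extrapolates the trend in the same spirit as =TREND().

import numpy as np

# first eight values of Table 10.3, kept short for readability
y = np.array([36.7, 38.7, 42.8, 38.2, 37.8, 34.7, 38.4, 40.2])
x = np.arange(1, len(y) + 1)          # time periods 1, 2, 3, ...
n = len(y)

# slope and intercept, equations (10.2) and (10.3)
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
a = y.mean() - b * x.mean()

# cross-check with numpy's least squares fit (returns slope first, then intercept)
b_check, a_check = np.polyfit(x, y, 1)

# extrapolate the trend three periods ahead, in the same spirit as =TREND()
future_x = np.arange(n + 1, n + 4)
forecasts = a + b * future_x

print(a, b, b_check, a_check, forecasts)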

SPSS Solution

Figures 9.44 – 9.47 show how to calculate the linear regression/trend line coefficients
‘a’ and ‘b’ in linear trend, so we will not repeat this procedure here.

Check your understanding

X10.6 What would you call a model represented by the equation y = a + bx + cx², and would you say that this is a linear model?

X10.7 Define what residuals are in the context of fitting and extrapolating trends.

X10.8 Does R-squared = 0.90 indicate a good fit? Is this the same statistic used in Chapter 9, dedicated to linear regression?

X10.9 Extrapolate the time series below and go 3 time periods in the future. Use
TREND function. Why do you think it would not make sense to extrapolate this
time series 10 time periods in the future?

X 1 2 3 4 5 6 7 8 9 10 11 12
Y 20 25 25 27 29 33 29 33 35
Table 10.4

10.4 Error measurements


You can ask yourself a simple question: Why do we need forecasts? The essential
purpose of forecasting is to try to manage uncertainty that the future brings. We can
never eliminate uncertainty associated with the future, but good forecasts can reduce it
to the acceptable level. Let’s “unpack” this statement.

What would be a good forecast? An intuitive answer is: A good forecast is the one that
shows the smallest error when compared with actual event. However, it is impossible to
measure errors until the event happened. Essentially, in order to produce good
forecasts, we would like to measure errors before the future unfolds. How do we do
this?

As we showed in Example 10.6 where we fitted a trend to the actual data, before we use
the model to extrapolate the data, we first “back-fit” the existing time series using the
model. This is sometimes called ex-post forecasting. Once we produced the ex-post
forecasts (which is just another way of saying fitting a model to the actual data), it is easy to measure
deviations from the actual data. These deviations between the actual and model fitted
data (ex-post forecasts), are called forecasting errors. Essentially, forecasting errors
will tell us how good our method or model is. In the context of regression analysis, we
referred to these errors as residuals.

As already shown in the previous chapter where we addressed regression residuals, calculating errors is a trivial exercise. An error is a difference between what happened
(y) and what we thought would happen according to the model (ŷ). The same principle
applies to any forecasting of the time series.

An error is the difference between the actual data and the data produced by a model, or
ex-post forecasts. This can be expressed as a formula:

et = At - Ft , or et = yt - ŷt (10.4)

Where et is an error for a period t, At (or yt) is the actual value in a period t and Ft (or ŷt)
is forecasted value for the same period t. Just like with the regression analysis,
calculating errors, or residuals, will tell us if our method, or a model, fits the actual data
well. Again, remember that regression residuals had to be random, otherwise the
assumption was that the model did not fit the data well. The same rule applies here.

Example 10.7

Let us take a hypothetical and non-linear time series and fit an incorrect linear model to
it. Figure 10.25 shows a time series that is non-linear, and we deliberately fitted the
wrong linear model to it (calculations not shown here).

Figure 10.25 A linear model fitted to a non-linear time series

If we plot the errors, as in Figure 10.26, they clearly show that they are not random. If
errors are not random, then we picked the wrong model. This is a good example of how
errors can help us decide if the model, or the method selected, is the correct one for the
data.

Figure 10.26 Errors for the linear model fitted to non-linear time series in Figure
10.25
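The same check can be scripted. The Python sketch below (an invented series, not the data behind Figures 10.25 and 10.26) deliberately fits a straight line to a clearly curved series; the residuals then run in long positive and negative stretches rather than fluctuating randomly, which is the signal that the linear model is the wrong choice.

import numpy as np

x = np.arange(1, 21)
y = 0.5 * x ** 2 + 3 * x + 10         # an invented, clearly non-linear series

b1, b0 = np.polyfit(x, y, 1)           # deliberately fit the wrong (linear) model
residuals = y - (b0 + b1 * x)

# long runs of +1, then -1, then +1 again: a systematic pattern, not random noise
print(np.sign(residuals).astype(int))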

Another use of error measurement is if we are unsure which one of several models
calculated is the most appropriate one. We can use several different forecasting
methods and apply them to the same data set. We then calculate forecasting errors for
all of them and compare them. Whichever model/method shows the smallest errors in
the past, will probably make the smallest errors when extrapolated in the future. In
other words, the model with smallest historical errors will be the best to quantify the
uncertainty that the future brings. This is the key assumption.

However, often it is not enough to calculate just a series of simple forecasting errors. We
need to calculate some statistics based on these errors. For example, as with any other
variable, we can sum all the errors and find an average error:

ē = ∑(At − Ft) / n    or    ē = ∑(yt − ŷt) / n    (10.5)

As we will see shortly, the average error is more often called the Mean Error (ME), and
it is just one of many different types of error statistics that we can use to evaluate our
model and our forecasts.

Example 10.8

We will use the same mini-data sample as in Example 10.3.

Period Actual Y Forecast Ŷ
1 130.0 110
2 120.0 120
3 110.0 130
4 135.0 140
5 155.0 150
6 160.0 160
7 180.0 170
Table 10.5 A short time series and its forecasts

Figure 10.27 shows the results in a graphical way.

Figure 10.27 A graph for the data in Table 10.5

The simple calculations of error values are:

Period Actual Y Forecast Ŷ Error


1 130.0 110 20.0
2 120.0 120 0.0
3 110.0 130 -20.0
4 135.0 140 -5.0
5 155.0 150 5.0
6 160.0 160 0.0
7 180.0 170 10.0
Sum 990.0 980.0 10.0
Average 141.4 140.0 1.4
Table 10.6 Errors for the data from Table 10.5

From table 10.6:

e1 = y1 - ŷ1, e2 = y2 - ŷ2, …, e7 = y7 - ŷ7

From these individual errors, we calculate:

∑(yt − ŷt) = 10

ē = ∑(At − Ft) / n = 10.0 / 7 = 1.4

For period 1 (t=1) our forecast is below the actual values, which is presented as 20.0,
because errors are calculated as actual minus forecasted. For period t=2, our forecast is
identical to the actual value. For period 3 (t=3), we are above the target, showing an
error of -20.0, etc. And lastly, for period 7 (t=7), our forecast is below the actual,
showing an error of 10.0. What can we conclude from this? If these were the first 7 weeks of our new business venture, and if we add all these numbers together, then our cumulative forecast for these seven weeks would have been 980. The business generated 990. This implies that the method we used made a cumulative error of 10, or given the above formula, we underestimated the reality by 10 units. If we divide this cumulative value by the number of weeks to which it applies, i.e. 7, we get the average value of our error of 1.4.

The average error that our method generates per period is +1.4 and, because errors are defined as differences between the actual and forecast values, this means that on average the actual values are 1.4 units higher than our forecasts. Given the earlier assumption that the method will probably continue to perform in the future as in the past (assuming there are no dramatic or step changes), our method will probably generate similar errors in the future. Given how small the errors are, on average we have a good forecasting method.

Excel Solution

Figure 10.28 shows an example of how to calculate forecasting errors in Excel.

Figure 10.28 An example of calculating the forecasting errors.

Columns B, C and D contain just the values (calculations not shown), and column E uses a simple formula for calculating errors, e.g. E4=C4-D4, etc.

Let us conduct a brief simulation. Assume that we use two different methods to produce
forecasts. One of the methods generates an average error of -2 and the other one 4. If
you assume that these are the actual units in tonnes of the product that you are
forecasting, then the first method overshoots the actual by 2 tonnes, and the second one
undershoots the actual by 4 tonnes. In the first case you might end up with 2 tonnes of
product not sold and in the second you could have made more money by selling another

4 tonnes, but you did not have them available, because your forecast was short. Which
forecast would you prefer?

Difficult question. One approach is to say that in absolute terms 2 is less than 4,
therefore, we would recommend the first method as a much better model for
forecasting this business venture. On the other hand, missing the opportunity of selling
4 tonnes might be more important to a business than 2 tonnes of extra product on
inventory. The point we are making is: there could be various business scenarios why
you might prefer one or the other forecast. As we do not know the circumstances, we
will adopt a purely numerical approach and go for the lowest absolute value.

The above examples and the simple simulation illustrated how forecasting methods can
help us understand and potentially quantify uncertainty. We are effectively using errors
as measures of uncertainty. We learned how to calculate an average, or mean error, but
in practice, other error measurements or error statistics are used too, and we will cover
this in the section below.

SPSS Solution

SPSS does not provide error calculations without the context of the method for which the
errors are used. We will, therefore, show how the errors and error statistics are executed
in SPSS as we cover specific extrapolation methods.

Types of error statistics

A variety of error measurements, or error statistics, can be used to assess how good
the forecasts are. The six most commonly used error statistics are: the mean error (ME),
the mean absolute error (sometimes called the mean absolute deviation and
abbreviated as MAD), the mean square error (MSE), the root mean square error (RMS),
the mean percentage error (MPE) and the mean absolute percentage error (MAPE).
These errors are calculated as follows:

ME = ∑(At − Ft) / n = ∑et / n    (10.6)

MAD = ∑|At − Ft| / n = ∑|et| / n    (10.7)

MSE = ∑(At − Ft)² / n = ∑et² / n    (10.8)

RMS = √(∑et² / n)    (10.9)

MPE = ∑((At − Ft) / At) / n = ∑(et / At) / n    (10.10)

MAPE = ∑(|At − Ft| / At) / n = ∑(|et| / At) / n    (10.11)

Where, At represents the actual values in the time series yt, Ft represents the forecasts,
or ŷt, and et are the errors. For the number of errors considered, we used a symbol of n.

Mean error (ME) is exactly what the phrase and the equation (10.6) implies, an average
error. Unfortunately, positive errors might cancel the negative errors, and often this
error will be equal to zero. It does not mean that we do not have errors, just that they
cancelled each other. This is the reason why this error is often used in conjunction with
other error measurements.

Mean absolute error (MAD) eliminates the problem of positive and negative errors
cancelling each other and produces a typical error, without saying if it is positive or
negative.

Mean square error (MSE) also eliminates the problem of ME by squaring the errors and
thus eliminating the cancelling of the positive errors from negative errors. It is a
frequently used indicator, but squaring a number implies that we do not know the units
in which errors are produced.

Root mean square error (RMSE) solves the problem of MSE and provides a
measurement in the same units as the time series. If the units of the time series are
miles or litres/head, then RMS is also expressed in the same units, i.e. miles or litres per
head. As we will see later, it is also a kind of standard deviation for forecasts.

Mean percentage error (MPE) provides a mean percentage value that forecasts deviate
from actual data, rather than the unit value as ME or MAD do.

Mean absolute percentage error (MAPE) eliminates the sign in front of the MAD and
provides the mean percentage error as a typical value, without the plus or minus sign.

When evaluating forecasts and/or forecasting models, at least one or potentially several
of these error statistics, will be used to provide evaluation.
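The same six statistics can be scripted in a few lines. The Python sketch below (our illustration, not one of the book's files) uses the actual and forecast values from Table 10.5 and reproduces the Excel results shown in the examples that follow.

import numpy as np

actual   = np.array([130.0, 120.0, 110.0, 135.0, 155.0, 160.0, 180.0])
forecast = np.array([110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0])

e = actual - forecast                 # errors, equation (10.4)

ME   = e.mean()                       # mean error, equation (10.6)
MAD  = np.abs(e).mean()               # mean absolute error, equation (10.7)
MSE  = (e ** 2).mean()                # mean square error, equation (10.8)
RMS  = np.sqrt(MSE)                   # root mean square error, equation (10.9)
MPE  = (e / actual).mean()            # mean percentage error, equation (10.10)
MAPE = (np.abs(e) / actual).mean()    # mean absolute percentage error, equation (10.11)

# ME = 1.43, MAD = 8.57, MSE = 135.71, RMS = 11.65, MPE close to 0, MAPE = 0.07 (rounded)
print(ME, MAD, MSE, RMS, MPE, MAPE)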

Example 10.9

Using the same example as in 10.3 and 10.8, we will show how to do the error statistic
calculations in Excel. Figure 10.29 shows the calculations.

Excel Solutions

Figure 10.29 Calculating various errors

Errors in column E are calculated as the values from column C minus the values from
column D. For example, E4=C4-D4. The same cell in column F (MAD) is calculated as
F4=ABS(E4). Cell G4 (MSE) is calculated as G4=E4^2 and cell H4 (RMSE) as
H4=SQRT(G4). Cell I4 (MPE) is I4=E4/C4 and cell J4 (MAPE) is calculated as J4=F4/C4.

From Excel:

MAD = 8.57

MSE = 135.71

RMS = 11.65

MPE = 0

MAPE = 0.07

There is another, faster and more elegant method. Rather than calculating individual
errors (as in columns E to J in Figure 10.29) and adding all the individual error values
(as in row 9) or calculating the average (as in row 10), we could calculate all these
errors with a single formula line for each type of error.

Example 10.10

Using some of the built-in Excel functions, these errors can be calculated as illustrated in Figure 10.30.

Excel Solutions

Figure 10.30 A method of calculating various error statistics as a single function in Excel

Note that MAD, MPE and MAPE formulae have curly brackets on both sides of the
formulae. Do not enter these brackets manually. Excel enters the brackets automatically
if, after you typed the formula, you do not press just the Enter button, but
CTRL+SHIFT+ENTER button (i.e. all three at the same time). This means that the range is
treated as an array.

You might ask yourself a question: Why should we bother with so many different error
statistics? That is a good question, especially given that there are a few more types of
error statistics that we did not include in this chapter. The simplest answer is: they are
sensitive to different things and often we will calculate more than one type of error
statistic and, depending on the results, they will help us make a better judgement about
our method and our forecasts. Let us take a look at just MSE as an example.

The concept of MSE was also extensively used in linear regression where we used the
residuals (errors) to evaluate the model. The rationale behind MSE is that large numbers
(errors) when squared will get even larger. Take for example two errors, one with the
value of 2 and the other one with the value of 10. The second one is five times bigger than
the first one. However, when you square them, then 100 is 25 times bigger than 4. This
means that if we have some large errors, in other words, our model is not fitting the actual
data “tightly” enough, then this model will show large MSE. This means that MSE favours
smaller errors and “penalises” models that have a few larger errors. Whether this is right
or not, this is the reason why MSE is one of the error measurements that is most often
used to assess forecasts and models.

A general rule to follow is: the lower the error statistic when compared between several
potential forecasting methods, the better the model. So always select the model that has
lower error statistic. What if you have several models and you calculated several different
error statistics (ME, MSE and MAD, for example) for every model, and some errors are
lower for one model and the other errors are lower for the other model? There are no
rules for such cases, so use other statistics that we will cover shortly to make a judgement
which model to select.

SPSS Solution

SPSS will provide fit measures such as RMSE, MAPE, etc. when fitting a time series model
to the data set using Analyze > Forecasting > Create Traditional Models > Statistics and
choosing your model fit statistics. Unfortunately, you cannot enter actual and forecast
data values into SPSS and run the SPSS Forecast command to reproduce these results
(unless you execute it as a manual formula). For this reason, the error statistics in SPSS
are covered in the context of specific methods.

Check your understanding

X10.10 What do you think is the difference between accuracy and precision? How would
you apply these definitions in forecasting context, i.e. what are the consequences
if your forecasts are precise, but not accurate? Could you have accurate forecasts
that are not precise?

X10.11 Why is the MAD or MSE type of error measurement preferred over the ME type of
error?

X10.12 Two forecasts were produced, as shown below. The ME for the second forecast is
8 times larger than the ME for the first forecast. However, the MSE is 64 times
larger. Can you explain?

Table 10.7 Two variables, their forecasts, errors, ME and MSE

X10.13 Is it acceptable to see some regularity in the pattern when examining the series of residuals, or forecasting errors?

X10.14 If the forecasted values are close to the actual values, what do you expect to see on a scatter diagram?

10.5 Prediction interval


We need to remind ourselves that in the sampling chapter we used the standard error of the mean (SE), together with the z-value and the estimated mean value x̄, to make an estimate that the true mean value μ is somewhere in a given interval. This interval is defined as the confidence interval CI:

CI = x̄ ± z × SE    or    CI = x̄ ± z × (s/√n)    (10.12)

See equations (4.6) and/or (5.7) to refresh your memory on the standard error of the
mean (SE).

Depending on the value of z, we get different confidence intervals (CI). For example: (a)
z = 1.64 for 90% CI, (b) z = 1.96 for 95% CI, and (c) z = 2.58 for 99% CI.

If the sample, or the data set, is relatively short and represents just a small sample of the
true population data values, then the t distribution is used for the computation of the
confidence interval, rather than the z-value.

CI = x̄ ± tvalue × SE    or    CI = x̄ ± tvalue × (s/√n)    (10.13)

The only difference between equation (10.12) and (10.13) is that the t-value in the
above equation will be determined not just by the level of significance (as it was the
case with the z-values), but also by the number of degrees of freedom.

If you compare equation (10.13) to equation (9.21) from the chapter on regression
analysis, you will see that they convey identical message. In other words, to find an
interval in which the true value of y is likely to reside, you need to add to ŷ the standard
error of the estimate multiplied by the desired value of z or t, depending what is the
required level of confidence.

A general rule is that for larger samples greater than 100 observations, the z-value and
the t-value produce similar results, so it is discretionary which one to use. If your time
series is shorter than 100 observations, you should use t-value and for the series longer
than 100 observations, you can use either one, but z-value is easier to use.

In any case, remember that in the context of mean estimates, the confidence interval
(CI) is based on the z-value or t-value and it enables us to claim with x% confidence
that, on the basis of the sample data, the true mean resides somewhere in this given
interval.
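As a quick reminder of where these multipliers come from, the short Python sketch below (purely illustrative, with an invented sample) builds both versions of the confidence interval for the mean, following equations (10.12) and (10.13).

import numpy as np
from scipy import stats

sample = np.array([22.0, 25.0, 19.0, 30.0, 27.0, 24.0, 26.0, 23.0])   # invented data
n = len(sample)
se = sample.std(ddof=1) / np.sqrt(n)       # standard error of the mean, s / sqrt(n)

z = stats.norm.ppf(0.975)                  # 1.96 for a 95% interval
t = stats.t.ppf(0.975, df=n - 1)           # larger than z for small samples

ci_z = (sample.mean() - z * se, sample.mean() + z * se)    # equation (10.12)
ci_t = (sample.mean() - t * se, sample.mean() + t * se)    # equation (10.13)
print(ci_z, ci_t)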

Standard errors in time series

The section above restated that the standard error of the estimate of the mean (SE)
measures the differences between every observation and the mean. When dealing with
time series, it would seem logical to use the same principle, but rather than calculating
deviations from the mean, we calculate deviations between the actual and predicted
values. We can modify equation (5.7), where we measure deviations from the mean, and
instead of the mean value use the predicted values as defined by equation (10.14). This
is called the standard error of the estimate of the predicted values. As before, we
just shorten it to the standard error, and from the context it is clear if the phrase
applies to the mean or predicted values.

SEŷ,y = √( ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) )    (10.14)

Here yi are the actual observations and ŷi are the predicted values. As we will see shortly, the Excel version of this formula is =SQRT(SUMXMY2(array_x, array_y)/(n-2)).
Excel offers an even more elegant function as a substitute for this formula. The function
is called: =STEYX (known_y’s, known_x’s). Both functions return the standard error of
the predicted ŷ-value for each x in the regression. If you look into Excel’s Help file, you
will see that this function is a very elegant representation of an unfriendly looking
equation given by (10.15).

SEŷ,y = √( (1/(n−2)) × ( ∑(y − ȳ)² − [∑(x − x̄)(y − ȳ)]² / ∑(x − x̄)² ) )    (10.15)

Either way, remember that the Excel formulae =SQRT(SUMXMY2(array_x, array_y)/(n-2)) and =STEYX(known_y’s, known_x’s) are identical. They both return the standard error for the predicted values. Equally, equations (10.14) and (10.15) are absolutely identical, although (10.14) uses ŷ for calculating SE and (10.15) uses x̄ and ȳ to do the same. Either way, they deliver the same value for the standard error of the estimate.

As you read this, or any other statistics textbook, for the sake of expedience they all use
just the phrase the standard error. From the context you need to “decipher” if this
phrase applies to the standard error of the estimate of the mean or the standard error of
the estimate of the predicted values. The formula for the former one is given by equation
(5.7) and the formula for the latter one is given by equation (10.14).

Given that we now have equation for SEyŷ as in (10.14), this means that we can modify
equations (10.12) and (10.13) into equation (10.16).

ŷ ± tvalue × SEŷ,y    or    ŷ ± z × SEŷ,y    (10.16)

Where, ŷ are the predicted values, SEŷ, y is the standard error of prediction and the tvalue
is the t-value from the Student’s t critical table (z-value used for longer time series).
Equation (10.16) is known as the prediction interval, and it is equivalent to equation
(9.21) from the previous Chapter.

A prediction interval tells us that although we estimated a model value to be ŷ, the true
value of y is likely to be somewhere in this interval. The confidence, or probability that
the true value y is in this interval, is given by the t-value or z-value. As we said that, for
example, for z=1.96, this confidence (or probability) is 95%. Do not confuse the
confidence value with the width of the prediction interval. It is logical to expect that the
higher the confidence level, the wider the prediction interval will be.
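To make equations (10.14) and (10.16) concrete, the Python sketch below (an invented series, used only for illustration and not tied to the book's Excel layout) fits a linear trend, computes the standard error of the estimate and places a constant-width 95% prediction interval around the fitted values.

import numpy as np
from scipy import stats

# invented series, for illustration only
y = np.array([105.0, 110.0, 108.0, 118.0, 121.0, 127.0, 125.0, 133.0, 138.0, 141.0])
x = np.arange(1, len(y) + 1)
n = len(y)

b1, b0 = np.polyfit(x, y, 1)               # linear trend
y_hat = b0 + b1 * x

# standard error of the estimate, equation (10.14) - the equivalent of Excel's =STEYX()
se = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

t_value = stats.t.ppf(0.975, df=n - 2)     # 95% prediction interval
lower = y_hat - t_value * se               # equation (10.16)
upper = y_hat + t_value * se
print(round(se, 2), lower[:3].round(1), upper[:3].round(1))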

Example 10.11

In Examples 10.4 - 10.6 we used UK petrol prices per litre between 1983-2020 and extrapolated them 10 periods into the future. We will now shorten the time series and use the same data, but only 2006 - 2020. From this 15-observation time series we will extrapolate 5 periods ahead to 2025, but this time we will also calculate the prediction interval.

Excel Solution

We will use the Excel =TREND() function to produce forecasts and calculate the prediction interval for the trend forecasts. Figures 10.31 and 10.32 illustrate the technique.

Figure 10.31 A time series with its trend, prediction interval and deviations from
the mean

Column D contains the values calculated using Excel =TREND() function, i.e. the value of
cell D4 is: D4=TREND($C$4:$C$18, $B$4:$B$18, B4), etc. Columns E and F are based on
equation (10.16). For example, E4=D4-$H$5*$H$8 and F4=D4+$H$5*$H$8, and the
values of $H$5 (SE) and $H$8 (t-value) come from Figure 10.32.

Figure 10.32 Key indicators of goodness of fit

Note that the cells H3 and H4 contain manual formulas for calculating the standard
error, though they use different Excel functions. Cell H5 uses the dedicated Excel
function.

As a side note, we can also use the Pearson correlation coefficient and the Total Sum of
Squares (SST) to calculate the standard error:

SEŷ,y = √( (1 − ρ²) × SST / (n − 2) )    (10.17)

Where, SST is defined by equation (9.13) from the previous chapter. This formula was
used in H13 (this gives us H3, H4, H5 and H13 as four alternatives to calculate the same
value, using different equations or functions). Equations (10.14) and (10.17) produce
the same value, as we can see from the cells H5 and H13.
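
As a quick numerical check of equation (10.17), the hedged Python sketch below (invented data again) confirms that √((1 − ρ²) × SST / (n − 2)) returns the same value as the residual-based formula.

```python
import numpy as np

# Illustrative data only
x = np.arange(1, 13, dtype=float)
y = np.array([5.2, 6.1, 6.8, 8.0, 8.7, 9.9, 10.4, 11.6, 12.1, 13.3, 13.8, 15.0])
n = len(y)

# Residual-based standard error of the estimate, equation (10.14)
b, a = np.polyfit(x, y, 1)
residual_se = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

# Equation (10.17): standard error from the correlation coefficient and SST
r = np.corrcoef(x, y)[0, 1]
sst = np.sum((y - y.mean()) ** 2)
se_via_r = np.sqrt((1 - r ** 2) * sst / (n - 2))

print(residual_se, se_via_r)  # the two values agree
```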

In Figure 10.31, the trend function was extrapolated 5 periods in the future. Figure
10.33 illustrates the graph for the prediction and the corresponding prediction interval.

Figure 10.33 A graph of the time series from Figure 10.31 with the trend and the
prediction interval

The equations and calculations we performed above show how to calculate the prediction
interval when forecasting time series. Unfortunately, this does not comply with one
intuitive assumption, which is that the width of the prediction interval should not be
constant and that it should change with time. In particular, the further we go in the future,
the wider the interval should be as the uncertainty increases.

How do we calculate the prediction interval and make sure it changes with time?

As we can see from the above example, the value of the standard error is a constant
value. To make the prediction interval change with time, we will need to replace
equation (10.14) with equation (10.18).

SEŷ,x = SEŷ,y × √[ 1 + 1/n + (xi − x̄)² / Σ(xi − x̄)² ]      (10.18)

Equation (10.18) is effectively “correcting” the standard error SEŷ,y for ŷ given by
equation (10.14) or (10.15) for the changing value of x.

If you compare the square root portion of equation (10.18), it looks similar to the
square root portion of equation (9.22). In fact, they are identical. Although slightly
different calculations are executed in (9.22) as opposed to (10.18), it can be easily
proven that these two expressions are identical:

√[ 1 + 1/n + (xi − x̄)² / Σ(xi − x̄)² ]   =   √[ 1 + 1/n + n(xi − x̄)² / (n(Σx²) − (Σx)²) ]
From (10.18)                                From (9.22)

As the square root expression on the left, from equation (10.18) is easier to implement
in Excel, we will use this format to calculate the corrections of the standard error of
estimate of ŷ for every value of x.
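
A minimal Python sketch of this correction is shown below, again with an invented series rather than the petrol-price data; it applies equation (10.18) to every historical and future time point so that the half-width of the interval grows as x moves away from x̄.

```python
import numpy as np
from scipy import stats

# Illustrative series; in the text this role is played by the petrol-price data
y = np.array([98.0, 101.2, 103.5, 102.9, 106.4, 108.8, 110.1, 112.7,
              113.9, 116.2, 118.5, 119.1, 121.8, 123.4, 125.7])
x = np.arange(1, len(y) + 1, dtype=float)
n = len(y)

b, a = np.polyfit(x, y, 1)
se = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))   # SE of the estimate, (10.14)
t_val = stats.t.ppf(0.975, n - 2)                         # 95% two-tailed t-value

# Apply the correction of equation (10.18) to historical and 5 future time points
x_all = np.arange(1, n + 6, dtype=float)
forecast = a + b * x_all
se_x = se * np.sqrt(1 + 1 / n + (x_all - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

lower, upper = forecast - t_val * se_x, forecast + t_val * se_x
# The interval is narrowest near the mean of x and widens towards the ends
print(np.column_stack([x_all, lower, upper]))
```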

We will use the time series from Example 10.11 to demonstrate the effects of this
additional formula.

Example 10.12

Figures 10.34 and 10.35 illustrate the Excel solution for calculating the interval estimate.

Figure 10.34 Same time series as in Figure 10.31, but using SEŷ,x for the prediction
interval

Columns D, F and G are calculated in the same way as columns D, E and F in Figure 10.31.

The first value in column E (SEŷ,x) is calculated:

E4=$J$6*SQRT(1+(1/COUNT($B$4:$B$18))+(B4-AVERAGE($B$4:$B$18))^2/DEVSQ($B$4:$B$18)) and copied down.

This is equation (10.18) translated into Excel syntax. The value of $J$6 (SEŷ,y) is found in Figure 10.35 as the value of the standard error of the estimate (the same value is calculated in $J$7 using a different function).

Figure 10.35 Two ways to calculate the standard error of prediction SEŷ,y

Cell J4 is calculated as =COUNT(C4:C18)-2, which gives the n−2 degrees of freedom used in equation (10.17). And finally, cell J5 is the t-value, calculated as =T.INV.2T(J3,J4).

The only difference between Figure 10.31 and Figure 10.34 is that in Figure 10.31 we
used one fixed value of SEŷ,y (cell I3) to calculate the prediction interval in columns E
and F. In Figure 10.34, we created an additional column E where SEŷ,x was calculated as
per equation (10.18). This means that SEŷ,x is no longer a single value; it changes as the value of x changes. This makes the prediction interval change with time.

Figure 10.36 illustrates the Excel graphical solution.

Figure 10.36 A graph of the time series from Figure 10.34 with the trend and the
widening prediction interval

Note that the prediction interval is the narrowest where x = 𝑥̅ , which is implied from
equation (10.18).

Comparing Figure 10.36 and Figure 10.33, we can see that the prediction interval in
Figure 10.36 based on equation (10.18), now complies with a more intuitive assumption
that the further in the future we extrapolate our trend, the greater the uncertainty, i.e. the
more widely our forecasts are likely to be spread.

SPSS Solution

In order to understand how to implement these calculations in SPSS, you first need to go back to Example 10.4 and repeat what has been shown in Figures 10.14 to 10.19. Figures 10.18 and 10.19 already contain the prediction interval, marked as LCL (lower confidence limit) and UCL (upper confidence limit). We will repeat the whole process, as on this occasion we are using a shorter time series from 2006-2020 (average annual UK unleaded petrol prices per litre).

SPSS data file: Chapter 10 Example 12 Prediction interval.sav

Enter data into SPSS in Figure 10.37.

Figure 10.37 The last 15 values of the time series from Table 10.3

SPSS Curve Fit

Analyze > Regression > Curve Estimation

Transfer the variable Series to the Dependent (s) box


Choose Independent Variable: time
Choose Models: Linear

Figure 10.38 Selecting the linear trend model in SPSS Curve Estimation mode

Select Save

Choose Save Variables:

• Predicted values
• Residuals
• Prediction intervals, 95% confidence interval
• Predict Cases – Predict through Observation = 20

Figure 10.39 Selecting 5 future observations (to observation 20) to be forecasted (the time series has 15 observations)

Select Continue

Select OK

When you click OK the following menu appears, choose OK

Figure 10.40 A notification box from SPSS informing about the creation of 4
additional variables

SPSS Data File

The SPSS data file, modified to include predicted values, residuals, forecast values for time points 1-15, prediction intervals and the forecast values for time points 16-20 [manually entered time points t = 16 to 20], is given in Figure 10.41.

Figure 10.41 The forecasts (future 5 observations of data from the table 10.3) and four
new variables created whilst building the forecasting model

SPSS Output

Model summary

The trend line equation statistics are provided in Figure 10.42 (T = 98.9 + 1.835 * time
point) with a time series plot in Figure 10.43.

Figure 10.42 SPSS model summary for linear trend forecasts for data from Table
10.3

Figure 10.43 SPSS graph of the data from table 10.3 and the fitted linear trend

Up to this point all the steps are identical to what we covered in Example 10.4. From this point we will continue and add a confidence interval to the graph.

Modify graph to include the confidence interval

Double click on the graph and then double click on the trendline – this will open the
Chart Editor, as in Figure 10.44.

Figure 10.44 SPSS graph of observations and a linear trend

Click on the symbol shown in Figure 10.44. This will trigger a dialogue box as in Figure 10.45.

Figure 10.45 Add an interpolation line

This joins the data points together with a straight line.

Figure 10.46 SPSS progress graph of the time series and the trend line

Click on File > Close as illustrated in Figure 10.47

Figure 10.47 A dialogue box from SPSS

Figure 10.48 SPSS final graph of the time series and the trend line

Now fit the confidence interval using LCL_1 and UCL_1 calculated values

Select Graphs > Legacy Dialogs > Line
Choose Multiple
Choose Data in Chart Are: Values of individual cases

Figure 10.49 SPSS option box for multiple lines

Select Define
Transfer into Lines Represent box: Series, Fit for Series …., 95% LCL…,
and 95% UCL ….
In Category Labels, choose Variable and transfer Period into box

Figure 10.50 Selecting in SPSS which variables to include in the graph

Select OK

SPSS Output

Figure 10.51 The final version of the time series, the trend line and the prediction
interval (compare with Figure 10.36 for Excel version)

From the SPSS solutions we observe that the results agree with the Excel solutions: the trend line, the associated trend line statistics, the forecasts for time points 16-20, and the graphs.

Check your understanding

X10.15 What is the difference between the confidence level and the prediction interval?

X10.16 What do you think is appropriate to use to calculate the prediction interval for a time series that has 20 observations? Would you use the z-values or t-values?

X10.17 Is it logical to expect that the prediction interval should get wider and wider the further into the future we extrapolate the forecasts?

10.6 Seasonality and Decomposition in classical time series analysis
We started this chapter with a simplified model. We stated that if we are interested in
forecasting the long-term future of a data set, all we have to do is to assume that data set
consists of only two components. Equation (10.1) stated that the first component is just
a trend and everything else is treated as a residual.

The residuals were the second component. In other words, if we can produce a good
trend line (linear or any curve), and fit it to our data, the difference between the
historical values of the time series and the trend should be considered as residuals.

These residuals, we also learned, need to be randomly fluctuating around this trend line.
If they are not, then the trend does not represent (or fit) the time series well and it is a
wrong forecasting model. Sometimes no matter what we do, these residuals are not
random. Why? We will use an example to explain why.

Example 10.13

We will look at a real-life time series, which happens to be Quarterly index for US
Consumer energy products from Q1 2015 until Q2 2020, where 2012=100.

We are showing in Figure 10.52 and 10.53 the time series and the trend we fitted to it
(calculations not shown).

Figure 10.52 Quarterly index for US Consumer energy products where 2012=100
with the trend and error calculations

Figure 10.53 a graph of the time series from Figure 10.52

We can see that the linear trend was appropriately selected, but when we plot the errors, as in Figure 10.54, we can see that they show a clear regularity in their movements. It seems that something else must be embedded in our actual data. No matter what curve we pick, some other regular pattern keeps pulsating in our time series, which means that a simple trend method will not be enough to model such a time series.

Figure 10.54 Errors chart for column F from Figure 10.52

If there are other components embedded in our historical time series, then this
simplistic approach is preventing us from incorporating them into the forecasting
model. Time to introduce the seasonal component.

We all intuitively know that the sales of ice cream will have a strong seasonal component. General retail sales, we also know, have a strong seasonal component, showing strong peaks around Christmas time, for example. We obviously need to incorporate this periodic component into our model, as otherwise our model will never be a good fit. In other words, if we do not do it, we will never get the errors to fluctuate in a random fashion, which is essential to declare a model or a method fit for purpose.

A method that is called the time series decomposition method is precisely one of
these methods that will help us capture other components from historical time series.
This will enable us to build a credible forecasting model. In fact, the method captures
not only a seasonal component, but a cyclical too. Let’s explain.

The classical time series decomposition method starts with an assumption that every
time series can be decomposed into four elementary components:

(i) Underlying Trend (T)


(ii) Cyclical variations (C)
(iii) Seasonal variations (S)
(iv) Irregular variations (I)

Depending on the model, these components can be put together in several different
ways to represent the time series. The simplest of all is the so-called additive model. It
states that time series Y, implicitly consists of the four components that are all added
together:

Y=T+C+S+I (10.19)

If you compare equation (10.19) with (10.1) you will see that what used to be called R
(Residuals) is now broken down into three new components, (C, S and I). All the
components in equation (10.19) share the plus sign, implying that this is an additive
model. However, beside the additive model, another alternative is the multiplicative
model that can also be used:

Y = T × C × S × I      (10.20)

Sometimes the most appropriate model is in fact a mixed model. Here is an example of
one such model:

Y = (T × C × S) + I      (10.21)
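
To illustrate how such models recombine the components, here is a small hedged Python sketch with invented component values; it simply recomposes a series using the mixed form of equation (10.21) and is not the calculation behind Example 10.14.

```python
import numpy as np

# Invented component values for 8 time points (purely illustrative)
T = np.array([100, 102, 104, 106, 108, 110, 112, 114], dtype=float)  # trend
C = np.array([1.02, 1.05, 1.03, 0.99, 0.96, 0.97, 1.00, 1.03])       # cyclical index
S = np.array([1.10, 0.90, 0.95, 1.05, 1.10, 0.90, 0.95, 1.05])       # seasonal index
I = np.array([0.4, -0.3, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2])            # irregular, units of Y

# Mixed model, equation (10.21): the systematic components multiply,
# the irregular component is added on top
y_hat = T * C * S + I
print(np.round(y_hat, 2))
```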

Example 10.14

To illustrate how the components make up the estimated time series, we will use the
same artificially short time series from Example 10.3, as illustrated in Figure 10.55.

Figure 10.55 An example of a time series and its constituent components

Column B contains the values of the time series Y and columns C to F show the constituent
components of this time series (trend, cyclical, seasonal and irregular component). We
are not showing here how they were calculated, as this is just for illustration purposes.
However, once we have these components (columns C to F), we can recompose them, as
we did, and show in column G as the estimated variable Ŷ.

We used the mixed model, as per equation (10.21), to “reconstitute” the time series Ŷ
from the components that were calculated (we’ll explain shortly how these components
are calculated). As in previous examples, this new time series Ŷ is effectively an estimate,
or an approximation (or a fit), of the actual time series Y. The character of the data in time
series will determine which model is the most appropriate. We will come back to this
point too. Let us describe briefly what exactly is meant by every component, symbolized
by T, C, S and I.

Underlying trend (T) is a general tendency and the direction that the time series follows.
We are already very familiar with this component. It can be horizontal (stationary time
series), or upward/downward (non-stationary time series). This trend line does not have
to be a straight line, it can be a curve, or even a periodic function.

The cyclical component (C) is a new one. The cyclical component consists of the long-
term variations that happen over a period of several years. If the time series is not long
enough, we might not even be able to observe this component. If you have annual data
over a long period of time, the cyclical component will move up and down around some
imaginary trend line. If we used economic data, we all know that although an economy
might grow over a number of years, there will be a cluster of periods when the growth is
stronger (the prosperity years) and a number of years when the growth is sluggish or
non-existent (recession, for example). This is a typical example of the cyclical component.

The seasonal component (S), on the other hand, applies to seasonal effects happening
within one year. Therefore, if the time series consists of annual data, there is no need to
worry about the seasonal component. At the same time, if we have monthly data (or
quarterly, or weekly, for example) and our time series is several years long, then it will
possibly include the seasonal component.

The irregular component (I) is everything else that does not fit into one of the previous
three components. This component is also called the residuals, or errors, though we
changed the notation to I in order not to confuse it with R as used in the previous chapter.

We know that we need to analyse this component, as it is important for the quality and
accuracy of our forecasting model.

A method of isolating different components in a time series or, decomposing the time
series as we will do here, is called the classical time series decomposition method. This
is one of the oldest approaches to forecasting. The whole area of classical time series
analysis is concerned with this theory and practise of how to decompose a time series
into these components. Once you have identified the components and estimated them,
you then recompose them to produce forecasts.

Now we know that a time series can be decomposed into up to four constituent
components, and we know that these components could form additive or multiplicative
models, how do we know which model is appropriate for our time series? A general rule
that applies not only to the decomposition models, but to many other models, is that
stationary data are usually better approximated using the additive model and that non-
stationary data are better fitted with a multiplicative model.

The question now is how do we isolate each of these four components from the data? To
demonstrate the principle, we will use some very simple algebra. We will take a
multiplicative model from equation (10.20).

If Y = T × C × S × I, then by dividing the historical time series Y by the trend component T that we have calculated using the linear regression or simple trend approach, we can “isolate” the remaining three components:

Y / T = (T × C × S × I) / T = C × S × I      (10.22)

We already said that if we have annual data, then the cyclical component will be visible,
but seasonal component is potentially hidden in the trend data. This means that for
annual data we do not have to worry about seasonality, so the above equation (10.22)
becomes:

Y / T = (T × C × I) / T = C × I      (10.23)

Equally, if we have quarterly, monthly, weekly or daily data, in other words data
expressed in less than annual form, then probably the cyclical component is not visible
(it might be potentially hidden inside the trend component), but the seasonal
component is. In this case, we have a new equation:

Y / T = (T × S × I) / T = S × I      (10.24)

As we can see, it is easy to get down to just one component, either C or S, but which is still
“polluted” with some Irregular component. There are various ways to isolate the seasonal
(or cyclical) component from the irregulars (residuals), and we will now learn how to do
it.
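
The detrending step itself is only a division, as the short Python sketch below illustrates with an invented quarterly series (for sub-annual data this leaves the S × I component of equation (10.24); for annual data the same division would leave C × I).

```python
import numpy as np

# Invented quarterly series with a mild upward trend and a seasonal pulse
y = np.array([120, 91, 102, 99, 115, 92, 108, 106, 116, 93, 104, 108,
              124, 98, 105, 112], dtype=float)
t = np.arange(1, len(y) + 1, dtype=float)

# Fit and evaluate the linear trend T (the Excel =TREND() step)
b, a = np.polyfit(t, y, 1)
T = a + b * t

# Equation (10.24): Y / T = S x I  (for sub-annual data)
s_times_i = y / T
print(np.round(s_times_i, 3))   # index values oscillating around 1
```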

Both the cyclical and the seasonal component have similar behaviour, i.e. they both repeat their pattern at some level of periodicity. The only difference is that the seasonal pattern repeats itself within every year and the cyclical pattern takes several years to repeat. The time span for each of the two components is different, but their behaviour is similar. This means that most methods for isolating the cyclical component can also be applied to the seasonal component.

Cyclical component

To illustrate a simple method of extracting the cyclical component, we will use data from the Encyclopaedia of Mathematics that shows the number of Canadian lynx "trapped" in the Mackenzie River district of North-West Canada for the period 1878–1931. In fact, the time series on the website is even longer and covers 1821–1934 (https://www.encyclopediaofmath.org/index.php/Canadian_lynx_data).

Example 10.15

Year No. Year No. Year No. Year No. Year No.
1878 299 1889 39 1900 387 1911 1388 1922 399
1879 201 1890 49 1901 758 1912 2713 1923 1132
1880 229 1891 59 1902 1307 1913 3800 1924 2432
1881 469 1892 188 1903 3465 1914 3091 1925 3574
1882 736 1893 377 1904 6991 1915 2985 1926 2935
1883 2042 1894 1292 1905 6313 1916 3790 1927 1537
1884 2811 1895 4031 1906 3794 1917 674 1928 529
1885 4431 1896 3495 1907 1836 1918 81 1929 485
1886 2511 1897 587 1908 345 1919 80 1930 662
1887 389 1898 105 1909 382 1920 108 1931 1000
1888 73 1899 153 1910 808 1921 229
Table 10.8 A time series showing the number of the Canadian lynx "trapped" in
the Mackenzie River district of the North-West Canada for the period 1878–1931

Excel Solution

The time series consists of 54 observations and when we plot it, it looks as in Figure
10.56.

Figure 10.56 A graph of the time series from Table 10.8

This appears to be a cyclical time series, so we will now apply the classical decomposition
principles and calculations in Excel, shown in Figure 10.57.

The graph in Figure 10.56 indicates a repeating pattern approximately every 9 years, so
the length of the lynx trapping cycle seems to be 9 years.

Figure 10.57 Calculating the trend and the C×I components from the data from
Table 10.8

Column C in Figure 10.57 shows the number of Lynx trapped per year (Y). The data are
annual, so according to the principles of time series decomposition, they contain three
components: T, C and I. We need to isolate the trend (T) component first. We achieved
this in column D in Figure 10.57. This was calculated using Excel function =TREND(). Cell
D3, for example, is calculated as: D3=TREND($C$3:$C$56,$B$3:$B$56,B3), etc.

According to equation (10.23), to eliminate the trend value from the data Y, we need to
divide the historical values Y (column C in Figure 10.57) with the trend data T (column D
in Figure 10.57). This is done in column E where, for example, cell E3=C3/D3, etc. The
result is the cyclical component (C) entangled with the irregular component (I) which is
also shown in a graph form in Figure 10.60.

To take the next step, we will need another table, as per Figure 10.58.

Figure 10.58 Grouping data from column E in Figure 10.57 to identify typical cycle values

As the cycle length was taken to be 9 years, there are 6 cycles in this time series. The block of 9 values (E3:E11) from Figure 10.57 is copied to I3:Q3 in Figure 10.58. This was repeated
6 times until all the blocks of 9 cells were copied. The typical cycle values in row 9 (I9:Q9
in Figure 10.58) are calculated as simple average values (for example,
I9=AVERAGE(I3:I8), etc.).

The averaging effectively removes the irregular variations, and the average value for every position in the cycle now becomes a typical cycle value. In other words, by averaging the corresponding values across all six cycles, we have eliminated the irregular component. What we have in I9:Q9 in Figure 10.58 is a pure C component.

Now we need to move the cyclical component C from Figure 10.58 to Figure 10.59. Cells I9:Q9 from Figure 10.58 are now, as a block of nine values, copied into column F in Figure 10.59. Note that in Figure 10.59 rows 15-45 are hidden.

Figure 10.59 Extracting a pure C component from Figure 10.58 and calculation of the
estimates ŷ

Column G in Figure 10.59 recomposes the time series (Ŷ) using a simple formula
G3=D3*F3, which is copied down.
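
For readers who prefer code to spreadsheets, the whole procedure of this example can be sketched in a few lines of Python; the series below is generated artificially (it is not the lynx data) and, like the text, assumes a 9-period cycle.

```python
import numpy as np

cycle_len = 9
rng = np.random.default_rng(1)

# Generated illustrative series: mild trend x a repeating 9-period cycle x noise
t = np.arange(1, 55, dtype=float)                       # 54 observations, 6 cycles
cycle_pattern = 1 + 0.6 * np.sin(2 * np.pi * np.arange(cycle_len) / cycle_len)
y = (500 + 5 * t) * np.tile(cycle_pattern, 6) * rng.normal(1, 0.05, t.size)

# Step 1: trend T (the role of column D in Figure 10.57)
b, a = np.polyfit(t, y, 1)
T = a + b * t

# Step 2: C x I = Y / T (column E)
c_times_i = y / T

# Step 3: average each position across the 6 cycles to strip the irregulars
# (the role of the block and the averages in Figure 10.58)
C_typical = c_times_i.reshape(6, cycle_len).mean(axis=0)

# Step 4: recompose the fitted series, y_hat = T x C (column G in Figure 10.59)
C = np.tile(C_typical, 6)
y_hat = T * C
print(np.round(C_typical, 3))
```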

Figure 10.60 A graph of the C×I components from Figure 10.59 (column E)

We decided that 9 years is the periodicity of this time series by just observing the graph.
There are other more accurate methods, such as autocorrelations, that can be used to
decide precise periodicity, but they are beyond the scope of this textbook, though you can
learn about it in one of the online chapters.

Figure 10.61 shows typical C component after the Irregular component has been
removed (column F in Figure 10.59).

Figure 10.61 A graph for typical C component from figure 10.59 column F

Figure 10.62 shows the original data and the fitted values Ŷ (G in Figure 10.59). As the
value of C is effectively an index number oscillating around 1, all we had to do was to
multiply every trend value T with its corresponding cycle value C.

Figure 10.62 The result of recomposing the components and showing them
against the actual values for Example 10.15

As we can see our model does not fit the actual data perfectly well. However, it is much
better than a simple trend line fitted to data. We will learn later how to improve on this
model.

As a side comment, if we used only trend (column D in Figure 10.59) as a model and then
calculated the errors or residuals, we would get the chart that looks as Figure 10.63.

Figure 10.63 Errors if the model used for data in Figure 10.59 was just a simple
linear trend (column D)

Clearly the errors in Figure 10.63 show strong pattern, indicating that fitting just a linear
model is not adequate.

On the other hand, the errors calculated from the model constructed through the
decomposition method look as in Figure 10.64. They indicate some improvements, but
we can say that we could have improved even more (which we will demonstrate shortly).

Figure 10.64 A graph of errors calculated from the forecasts by the decomposition
method

Classical decomposition method makes sense if we have a strong cyclical or seasonal time
series and we will make further improvements as we proceed through this section and
the following chapter.

SPSS Solution

SPSS has an option for time series decomposition, and it is called Seasonal Decomposition, under the Forecasting sub-menu of the Analyze tab. However, the same method is used for both cyclical and seasonal data, so we will defer showing how to use it until after we have covered seasonal data below.

Seasonal component

The above example applied to the cyclical component. If we had a seasonal component
only, we use the same technique. Other more complex, and more accurate, techniques
exist, but they are beyond the scope of this textbook. To demonstrate how to isolate the
seasonal component specifically, we will use an example that contains quarterly data.

Example 10.16

The data set used represents quarterly index number for the US consumer energy
products, based on 2012=100 (same as Example 10.13). As before, we will use Excel
first to show the calculations.

Year Quarter Y
2015 1 120.2
2 90.6
3 101.8
4 98.7
2016 1 114.6
2 92.0
3 107.8
4 105.8
2017 1 116.3
2 92.6
3 99.7
4 108.1
2018 1 124.2
2 98.1
3 104.5
4 112.2
2019 1 125.9
2 93.6
3 103.4
4 110.2
2020 1 116.8
2 88.3
Table 10.9 Quarterly index number for the US consumer energy products, based
on 2012=100

Figure 10.65 illustrates the data graph with a trend line fitted to the data set.

Figure 10.65 A graph for time series in Table 10.9 and the corresponding trend

Excel Solution

Figures 10.66 and 10.67 show the table covering the period from 2015 until 2020 (column A). Every year is broken into four quarters (column B) and the whole time series is only 22 observations long (column C). The values of the time series are given in column D. Figure 10.66 also shows some calculations that we will explain in a moment.

Figure 10.66 Calculating T and S×I components for data in Table 10.9

As before, column E is calculated using the =TREND() function and column F is column D divided by column E (for example, F4=D4/E4).

To calculate seasonal components, we need a few more tables, as in Figure 10.67.

Figure 10.67 Extracting seasonal components for data from Table 10.9 and Figure 10.66

The same principle as before applies: the values from column F in Figure 10.66 are copied to cells K6:P9. The details are explained further down.

We can now calculate the forecast values for each time point, as in Figure 10.68.

Figure 10.68 Re-composition of the components and forecasts for data from Figure 10.66

We calculated the values up to column F in Figure 10.68 in an identical way to the calculations we did in Example 10.15. The only difference is that in this Example 10.16
we are handling seasonal components and in Example 10.15 we handled cyclical
components. Because this time series in Example 10.16 consists of quarterly data, the
seasonality present is clearly based on four quarters in the year. In Figure 10.67 in cells
K6:P9 we can see how the data from column F in Figure 10.66 was transposed in the
appropriate cells.

The next step, the cells K15:P18 in Figure 10.67, is identical to the previous range K6:P9, with one difference. In every row of this little table (every row corresponds to one quarter) we have “greyed out” the min and the max value. This helps us visually, as we need to eliminate possible extremes that might be recorded in that quarter over the range of years.

After we have excluded the min and max value in each row, we calculate the average value of the remaining cells. These average values for every quarter are given in cells Q15:Q18 in Figure 10.67. The sum of the average values per quarter in cells Q15:Q18 should add up to 4 (because they are like index numbers per quarter, and four quarters times 1 should be 4). However, as we can see in cell Q19 in Figure 10.67, they add up to 4.000330. This means we need to come up with a correction, or scaling factor, which is given in cell Q20 in Figure 10.67. The value of the scaling factor is 0.999918 and it is calculated as: Q20=4/Q19.

Once we have the scaling factor, we can calculate the final values for every quarter, and
this is given in cells K24:K27 in Figure 10.67. They are calculated by multiplying every
seasonal factor from Q15:Q18 with the scaling factor from Q20, all in Figure 10.67. What
we get are the typical quarterly indices for this time series. These typical indices are then
copied into column G in Figure 10.68.

From there, we return to column H in Figure 10.68. This column represents the model estimates Ŷ and they are calculated by multiplying the trend values in column E by the typical seasonal indices in column G, both in Figure 10.68.
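
The same trimming-and-scaling logic can be sketched in Python; the quarterly series below is invented and covers five complete years, so the handling of the two incomplete 2020 quarters in the Excel solution is deliberately left out of this simplified illustration.

```python
import numpy as np

# Invented 5 full years of quarterly data (20 observations), for illustration only
y = np.array([118, 90, 101, 99, 115, 92, 107, 105, 117, 93, 100, 108,
              123, 97, 104, 111, 125, 94, 103, 110], dtype=float)
t = np.arange(1, len(y) + 1, dtype=float)
n_years, n_q = 5, 4

# Trend and the S x I ratios (the role of columns E and F in Figure 10.66)
b, a = np.polyfit(t, y, 1)
ratios = y / (a + b * t)

# Arrange the ratios so that each row is one quarter across the years
by_quarter = ratios.reshape(n_years, n_q).T          # shape (4 quarters, 5 years)

# For every quarter drop the min and the max, then average the rest
# (the "greyed out" cells and the quarterly averages in Figure 10.67)
trimmed_mean = (by_quarter.sum(axis=1) - by_quarter.min(axis=1)
                - by_quarter.max(axis=1)) / (n_years - 2)

# Scale the four indices so they add up to 4 (the scaling factor step)
seasonal_index = trimmed_mean * (n_q / trimmed_mean.sum())

# Recompose the fitted values: trend x typical seasonal index (column H role)
y_hat = (a + b * t) * np.tile(seasonal_index, n_years)
print(np.round(seasonal_index, 3), round(seasonal_index.sum(), 6))
```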

The result is the re-composed time series and it represents the fit, or model values, for our time series. You can see the result in Figure 10.69. The model seems to fit the data extremely well.

Figure 10.69 The result of recomposing the components and showing them
against the actual values for Example 10.16

SPSS Solution

SPSS data file: Chapter 10 Example 16 Seasonal time series.sav

Enter series data into SPSS

Figure 10.70 SPSS data window for Example 10.16

Set time series (2015 Q1 – 2020 Q2)

We enter the data into SPSS, which is the quarterly index for US consumer energy products from 2015 Q1 until 2020 Q2 (2012=100). Before we proceed, we need to assign date labels to the data.

Select Data > Define date and time

Figure 10.71 Time-stamping the data in SPSS

Click OK

The time data will then be added to the SPSS data file as illustrated in Figure
10.72.

Figure 10.72 SPSS creating a new time label variables

Run the SPSS test: Seasonal Decomposition

Analyze > Forecasting > Seasonal Decomposition

Transfer variable Quarterly_index in to the Variable(s): box

Figure 10.73 SPSS dialogue box for seasonal decomposition

Click OK

The message will appear warning that 4 new variables will be added to the data
set.

Figure 10.74 SPSS notification that new variables will be added to the file

SPSS Output

After we clicked OK, SPSS provides several outputs. Figure 10.75 shows that we have
selected the multiplicative model and what the values of the typical quarterly seasonal
index are.

Figure 10.75 A summary of the typical values for seasonal components for
Example 10.16

SPSS adds new variables to your data set. Figure 10.76 shows the ones that we selected
during the execution process.

Furthermore, we can now fit this model to the time series data given the model fit is
given by the equation SAF_1 * STC_1.

Select Transform > Compute Variable

Type Fit in the Target Variable box

In the Numeric Expression box type the equation = SAF_1 * STC_1. This will add an extra
column called fit to your SPSS data file as illustrated in Figure 10.76.

Figure 10.76 New variables created by SPSS when using seasonal decomposition as a
forecasting model

SPSS has created four new variables and they are called SAS_1, SAF_1, STC_1 and ERR_1.
The interpretation, as per SPSS Help file, is as follows:

• SAS. Seasonally adjusted series, representing the original series with seasonal
variations removed. Working with a seasonally adjusted series, for example,
allows a trend component to be isolated and analyzed independent of any
seasonal component.
• SAF. Seasonal adjustment factors, representing seasonal variation. For the
multiplicative model, the value 1 represents the absence of seasonal variation;
for the additive model, the value 0 represents the absence of seasonal variation.
• STC. Smoothed trend-cycle component, which is a smoothed version of the
seasonally adjusted series that shows both trend and cyclic components.
• ERR. The residual component of the series for a particular observation.

If you compare these column values with the Excel version for Example 10.16, then:

1. SAS variable is comparable to Column F (Seasonally Adjusted series) in Figure 10.68
2. SAF variable is comparable to cells K24:K27 in Figure 10.67 (Typical Seasonal
Index) or to column G in Figure 10.68
3. STC variable is somewhat comparable to Column E (Trend component) in Figure
10.68 (this is not quite the case as SPSS uses moving averaged trend, which will
be explained in the next Chapter)
4. ERR is calculated and shown in the section below

You will notice that the numbers from Excel do not match the numbers from SPSS 100%, though they are very close. For example, the typical quarterly indices for Q1-Q4 in SPSS are: 1.13, 0.89, 0.97 and 1.01. In Excel they are: 1.13, 0.87, 0.98 and 1.02. The reason for the discrepancies is that SPSS uses a slightly more sophisticated estimation method to make it universally applicable to both seasonal and cyclical components.

For example, the method for trend that SPSS uses is not a simple trend method, but a combined centred moving average method that incorporates both the T and C components. This has a “knock-on” effect on both the SAS and SAF variables, hence the small differences from our Excel solution. To establish how well this decomposition method models the actual data, we need to measure the differences between the actual values in column D and the fitted values in column H in Figure 10.68. In fact, we can also apply some of the lessons from the linear regression chapter and return to the model goodness of fit concept.

Error measurement

We will now calculate the errors, as well as the squared errors, for Example 10.16. From there, we will check whether the errors are random, whether they are distributed as per the normal distribution, and what the mean square error (MSE) and the root mean square error (RMSE) are. You should by now be familiar with all these calculations, so we will go directly to Excel to perform these tasks.

Example 10.17

We use the data from column D and H in Figure 10.68 to calculate the errors. They are
copied into columns B and C in Figure 10.77.

Excel Solution

Figure 10.77 Error analysis for the classical decomposition forecasting method

The errors in column D in Figure 10.77 are simple differences between columns B and C
(for example, D4=B4-C4, etc.). As the first step, we will plot the errors from column D in
Figure 10.77. The plot is given in Figure 10.78.

Figure 10.78 A graph of errors (column D in Figure 10.77) calculated from the
forecasts by the decomposition method

As we can see from Figure 10.78 errors appear to be random, which means that our model
potentially modelled the actual data well. However, the fact that they appear random is
not enough. We need to do some tests. One of them is the test for normality of errors, i.e.
a check if errors are distributed per normal distribution. Columns E to J in Figure 10.77
show the procedure. Errors are ranked in column E using Excel function =RANK(). For
example, cell E4=RANK(D4,$$4:$$25,1), which is copied down. We copy the values from
this column into column F, but then we just use Excel SORT utility to sort all the errors in
this column in ascending order. Column G now shows the rank for every error, calculated
as G4=RANK(F4,$F$4:$F$25,1). This column is labelled rt.

The rank values rt from column G in Figure 10.77 are used to calculate the cumulative
probabilities for this time series. The probabilities are calculated using a simple
equation (10.25):

pt = (rt − 0.5) / n      (10.25)

Where rt is the rank and n is the number of errors. These values pt are shown in column H in Figure 10.77. To calculate the p-values in column H, we convert equation (10.25) into Excel as H4=(G4-0.5)/n, where n is the number of errors (22 in this example, which can be obtained with =COUNT($G$4:$G$25)), and copy the cells down. The z-values in column I are calculated as I4=NORM.S.INV(H4), and they are also copied down. And finally, cell J4=D4^2 is also copied down. In cell L3 we are showing the sum of all e², calculated as L3=SUM(J4:J25). L4 contains the MSE, and it is calculated as L4=L3/COUNT(J4:J25). Cell L5
is RMSE, calculated as L5=SQRT(L4). To summarise, to check if our errors follow normal
distribution, we need to calculate the z-values for every pt. In other words, we will use
the Excel cumulative distribution function =NORM.S.INV(), or total area under the curve
to the left of every pt point. As it is shown in column I in Figure 10.77, this is all easily
executed in Excel. Finally, we can create a plot of errors et vs the zt values (columns D
and I in Figure 10.77), shown as a scatter diagram. Figure 10.79 shows the plot.
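
The rank-based normality check is easy to reproduce outside Excel; the hedged Python sketch below takes a vector of errors (invented here, standing in for column D of Figure 10.77) and produces the pt values of equation (10.25), the corresponding z-values, and the MSE and RMSE.

```python
import numpy as np
from scipy import stats

# Illustrative forecast errors; in the text these come from the error column
e = np.array([1.2, -2.3, 0.4, 3.1, -0.8, -1.9, 2.2, 0.1, -3.4, 1.7,
              -0.5, 2.8, -1.1, 0.9, -2.0, 1.4, 0.3, -1.6, 2.5, -0.2])
n = len(e)

# Rank every error from smallest to largest (the Excel =RANK(..., 1) step)
r = stats.rankdata(e)

# Equation (10.25): cumulative probability for each rank
p = (r - 0.5) / n

# z-values from the inverse standard normal CDF (Excel =NORM.S.INV())
z = stats.norm.ppf(p)

# MSE and RMSE of the errors
mse = np.mean(e ** 2)
rmse = np.sqrt(mse)

# If the errors are roughly normal, the (e, z) points fall close to a straight line
print(np.column_stack([np.sort(e), np.sort(z)]))
print(mse, rmse)
```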

Figure 10.79 Error normality graph for classical decomposition forecasts in
Example 10.16

As we know from Chapter 9, the dots in this scatter diagram must be aligned in a more-
or-less straight line. To help us see the alignment, we right click on any of the dots in the
plot in Excel and then select Add Trendline from the dialogue box that appears. This will automatically add the straight line that best fits the dots. As we can see, the line fits almost all the errors very well and the errors follow it closely, which means that they are approximately normally distributed. This is another confirmation that our model is fitting the data well.

We also calculated the mean square error, which happens to be 12.84 (cell L4 in Figure 10.77). To put this error in the same units as the original time series (which happen to be indices), we take the square root of that value. The root mean square error (RMSE) is 3.58 (cell L5 in Figure 10.77). This means that on average our fitted model is “adrift” from the original time series by 3.58 index points. Figure 10.69 showed us visually that we have a good fit, but now we have also been able to quantify how good this fit is. If we used more than one model or method to fit the actual data, we would use the MSE and RMSE to decide which one to keep as the final model. As we know, the lower the MSE or RMSE, the better the model fits the time series.

Prediction interval

Equation (10.16) already defined how to use the standard error of the estimate SEŷ,y to
calculate the interval where the true value of y might be. We called this a prediction
interval. We also said that for the larger time series, we can use either the t-value or z-
values to define the prediction interval. So, our prediction interval, if we use the most
generic format, is defined as per equation (10.16), which we repeat here:

Ŷt ± z × SEŷ,y      (10.26)

Because we said that our errors must follow the normality assumption, we will use one
“trick” which enables us, under these conditions, to say that RMSE (root mean square
error) is effectively the same as SEŷ,y (the standard error of the estimate). You can
compare equation (9.19) with (10.9) to understand why we said that. This means that
equation (10.26) becomes:

Ŷt ± z × RMSE      (10.27)

The difference between equation (9.19) and (10.9) is that the denominator in (9.19) is n-
2 and in (10.9) is n. Equation (10.9) shows a general formula for calculating the RMSE,
whilst (9.19) shows a specific version suited to regression analysis, where n-2 indicates
the number of degrees of freedom for linear regression. Out of convenience, let's label the expression z × RMSE in equation (10.27) as k. In this case, equation (10.27) becomes:

Ŷt ± k (10.28)

The above equation (10.28) is perfectly suited for one step-ahead forecasts. The ex-post
forecasts that fit the historical time series are effectively one-step ahead forecasts, so it
is appropriate to use this approach to estimate the prediction interval for the historical
data. However, when we come to the last observation in the time series and we forecast
the future values, we are producing forecasts for multiple steps ahead (because we
reached the last actual time series observation). As we already know, it is intuitive to
assume that as we go further into the future, the prediction interval is going to get wider
and wider. Unfortunately, without going into some very complex methods which are
beyond the scope of this textbook, there are no easy analytic methods to apply this
principle to seasonal data. However, we can use a workaround, which is an empirical
method.

The empirical method that could be used is as follows. We take h to be the number of steps ahead for which we are producing forecasts, i.e. h = 1, 2, 3, … If we multiply the MSE by h, this will intuitively address the need to make the prediction interval grow wider as we move further into the future. In this case, the RMSE is calculated using the following equation:

RMSEh = √(h × MSE) = √(h × Σe² / n)      (10.29)

Equation (10.27) no longer contains just single value RMSE, but RMSEh, which is a
dynamic value, changing with h. The factor k in the prediction interval now also
becomes a dynamic value, subject to the number of future steps h. This means our
equation (10.28) keeps the shape, but it is “enhanced” by the number of steps h:

Ŷt+h ± kh (10.30)
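
A minimal Python sketch of this empirical rule is given below; the in-sample errors and the future point forecasts are invented, and the sketch simply applies equations (10.29) and (10.30) for six future steps.

```python
import numpy as np
from scipy import stats

# Illustrative in-sample errors and future point forecasts (not the real data)
errors = np.array([2.1, -1.4, 0.8, -2.6, 1.9, 0.5, -1.1, 2.4, -0.7, 1.2,
                   -2.2, 0.9, 1.6, -0.4, -1.8, 2.0, 0.2, -1.3, 1.1, -0.9, 0.6, -1.5])
future_forecast = np.array([128.4, 101.2, 110.7, 118.9, 131.0, 103.5])

n = len(errors)
mse = np.mean(errors ** 2)
t_val = stats.t.ppf(0.975, n - 2)           # 95% two-tailed t-value

h = np.arange(1, len(future_forecast) + 1)  # steps ahead: 1, 2, ..., 6
rmse_h = np.sqrt(h * mse)                   # equation (10.29)
k_h = t_val * rmse_h                        # dynamic half-width

lower, upper = future_forecast - k_h, future_forecast + k_h  # equation (10.30)
print(np.column_stack([h, lower, upper]))   # the interval widens with h
```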

Example 10.18

To illustrate how to calculate a prediction interval for seasonal time series, we will use
the data from Example 10.16.

Excel Solution

Figure 10.80 Prediction interval for Example 10.16 based on classical decomposition
forecasts

Columns A:H in Figure 10.80 are copied from Figure 10.68 and errors in column I are
calculated as the differences between the cells in column D and H, as before. Cell J3
(MSE) is calculated as =SUMXMY2(D4:D25,H4:H25)/COUNT(D4:D25) and cell J4
(RMSE) as =SQRT(J3). Cell J6 contains the value of 0.05, which is the value of alpha for
the 95% confidence interval. Cell J7 is the t-value for 95% confidence interval,
calculated as =T.INV.2T(J6,COUNT(D4:D25)-2). The second argument in this function
refers to degrees of freedom, which is n-2, or in our case 22-2=20.

The new values are the steps h in cells J26:J31, which are just the future time periods, marked as 1, 2, …, 6. The dynamic RMSEh is in cells K26:K31 and it is calculated as K26=SQRT(J26*$J$3), copied down. Cells L26:L31 contain the kh values and, for the 95% confidence interval, they are calculated as L26=K26*$J$7, etc.

To show how well our model fits the actual data, as well as the future forecasts and their
prediction interval, we put all these variables in one graph in Figure 10.81. As we can see
from Figure 10.81, the historical prediction interval tracks actual observations in a
consistent manner. The future prediction interval projects the 95% confidence that the
future values will be somewhere within this interval. It also complies with the intuitive
expectation that it should get wider the further into the future we extrapolate our
forecasts.

Figure 10.81 The result of recomposing the components and showing them
against the actual values for Example 10.18 together with the historical and future
prediction intervals

This might not be so obvious from the graph due to the seasonal character of the time
series. However, if we use T from column E in Figure 10.80 and put it in equation (10.30),
so that it reads: Tt+h ± kh, then Figure 10.81 changes into Figure 10.82.

Figure 10.82 The same as Figure 10.81, except that T is used for forecasts instead of ŷ

To demonstrate the point, in Figure 10.82 the prediction interval was calculated using
just the trend value T, rather than the estimates ŷ as in Figure 10.81. This is a general
method for showing the widening prediction interval for future steps h and is often used
in regression analysis. In Figure 10.83 we show how the two future prediction intervals
differ (compare the rows 26:31).

Figure 10.83 Prediction interval calculated on the basis of ŷ values (left) and T
values (right)

The historical prediction interval in cells N4:O25 in Figure 10.83 (for both cases) is calculated as N4=H4-$J$7*$J$4 (and, for the upper limit, O4=H4+$J$7*$J$4), copied down to the 25th row. Cells N26:O31 use N26=H26-L26 and O26=H26+L26, copied down.

Note that the prediction interval in cells N26:O31 is different in the left-hand chart
(where ŷ is used as the basis for the prediction interval) from the right-hand chart
(where T is used as the basis for the prediction interval).

Although in the context of seasonal time series forecasting you would not expect to use T
rather than ŷ for forecasting (though it is a common practise in regression analysis), we
have done it here just to make it more obvious to show how the future prediction interval
gets wider the further into the future we go, despite the fact that the confidence level
remains the same, i.e. 95% in this specific case.

Check your understanding

X10.18 Decide the periodicity of the time series in Table 10.10, isolate the cyclical component and produce forecasts using the classical decomposition method.

Year y Year y
2001 25 2010 26
2002 26 2011 26
2003 27 2012 27
2004 27 2013 29
2005 26 2014 28
2006 25 2015 27
2007 26 2016 26
2008 28
2009 27
Table 10.10

X10.19 Which model did you use in X10.18 and why?

X10.20 Go to one of the websites that allow you to download financial time series (such as http://finance.yahoo.com/) and find one with a strong seasonal/cyclical component. Use it to practise the classical decomposition method.

Chapter summary
In this chapter we introduced time series as a special type of data sets moving
sequentially through time. We declared that time series analysis is mainly used for
extrapolations and forecasting. We also defined various terms that define different
types of time series (stationary vs non-stationary), seasonality, as well as types of
methods used to produce forecasts. And finally, we emphasized that most of the
material from this chapter is similar to the previous chapter, with the exception that in
this chapter we only used one set of observations (one variable) that is implicitly
defined by time (another variable).

We also assumed that the easiest way to think about time series extrapolation and
forecasting is to assume that a time series can be extrapolated using a simple trend
function. In this case everything else that is not included in the trend function is called a
residual. This approach is particularly useful if we are interested in the long-term
forecasts where deviations from the trend are not as important as the overall direction
and the slope of the trend.

As the initial focus of this chapter was just on the trend component, we explained that it
can come in different shapes (linear, curve, sinusoid, etc.). We then used a simple trend
extrapolation technique to demonstrate how to fit it to the time series and how long-
term forecasting is done. Once the trend was fitted to the time series, we were able to
measure a variety of error terms.

We covered six specific error measurement statistics, which are: the mean error (ME),
mean absolute error (MAD), mean square error (MSE), root mean square error (RMSE),
mean percentage error (MPE) and mean absolute percentage error (MAPE). After we
demonstrated how they are calculated, we also provided interpretation of every error
measurement method.

The following section was dedicated to measuring the goodness of fit of our forecasts
and establishing the future uncertainty interval for our forecasts. We used the standard
error of the estimate to calculate the prediction interval and to show that it can be made
to grow wider the further into the future we extrapolate our time series.

The last section was dedicated to time series decomposition method. We concluded that
in certain instances, especially if we are dealing with seasonal or cyclical data, a simple
trend extrapolation is too “crude” as a method to produce acceptable forecasts. We
needed to introduce other components hidden in a time series, such as cycles and
seasons, and learn how to decompose the time series to extract them. Once we isolated
all the relevant components, we were able to produce forecasts that were much more
realistic than simple trend extrapolations.

We concluded the chapter by demonstrating how to calculate a prediction interval for seasonal time series and how it complies with an intuitive assumption that the interval should grow in width, the further into the future our forecasts are extended.

Test your understanding


Basic concepts

TU10.1 How would you classify the two-time series shown below and why:

Figure 10.84

Figure 10.85

TU10.2 Is it possible for a stationary time series to be seasonal at the same time? If you think it is, sketch how such a time series would look as a graph.

TU10.3 What is the difference between using the =TREND() method vs. the =SLOPE()
and =INTERCEPT() method in Excel to calculate linear trend? What other built in
functions in Excel enable you to calculate a non-linear trend?

TU10.4 To calculate forecasting errors, you subtract:

A. et = At - Ft   or   B. et = Ft - At

Does this create a minor inconvenience in interpretation, and what is the inconvenience?

TU10.5 What is the main advantage, and the main disadvantage, of MSE?

TU10.6 If your forecasting errors, when graphed as below in Figure 10.86, exhibit this
kind of pattern, what would you conclude about your forecasts.

Figure 10.86

TU10.7 Which dedicated Excel function is used for the formula below:

SEy,ŷ = √[ Σi=1..n (yi − ŷi)² / (n − 2) ]

What alternative Excel functions would you use to build the same formula?

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:

1. A10Wa Other types of trends


2. A10Wb Index numbers refresher

11. Short and medium-term forecasts
11.1 Introduction and chapter overview
In the previous chapter we introduced long-term forecasts. They are typically nothing
but trend extrapolations and are very similar to simple regression analysis. Short-term
forecasts, on the other hand, bring completely fresh perspective on time series analysis
and forecasting.

We already know that we do not expect our long-term forecasts to be too accurate. If
they clearly indicate the direction and the general shape of the curve they follow, our
objectives are met, and we are happy with the results. Our short-term forecasts, on the other hand, are not required to follow a trend. On the contrary, they are supposed to forecast just one period ahead, so they need to be very accurate. The medium-term forecasts go several periods ahead, so they are expected to be as accurate as possible, or at least to have a prediction interval as narrow as possible. This means that both the short-term and the medium-term forecasts need to be handled differently from the long-term forecasts.

We must remember that words such as “short-term” and “medium-term” are not used here in a time context. Short-term means just one forecast ahead, regardless of whether we use data expressed in minutes or annual data. Medium-term implies two to several (arbitrarily, up to four or six) forecasts ahead. If the data are given in seconds and we need to forecast the next 24 observations, although this is only 24 seconds in the future, it is a long-term forecast. Equally, if the data are measured annually and we must forecast just one year into the future, then this is a short-term forecast. As we said, it is not the time dimension but the number of future observations that we are forecasting (the time horizon) that determines whether we are talking about short-term or long-term forecasts.

Short-term and medium-term forecasting methods require an understanding of relatively simple and intuitive concepts, such as moving averages and exponential smoothing. After we introduce the concept of a moving average, we will learn how to use it as a short-term forecasting technique. If we want to use the same technique for medium-term forecasts, certain modifications are needed, and we will introduce double moving averages to explain how to achieve this. As a more sophisticated alternative to moving averages, we will introduce exponential smoothing.
This method can also be used as a short-term and a medium-term approach to
forecasting. We will explain how both types of forecasts are executed using this
technique.

For both the short-term and medium-term forecasts we need to measure how
appropriate they are to be used with a specific data set. This is achieved by
understanding the data sets, as well as by measuring forecasting errors. The error
measurement is identical to the approach we used in the previous chapter, so we will
just explain the specifics related to the short-term and medium-term forecasts. This will
lead us to the section on how the prediction intervals are constructed for these types of
forecasts.

The final section will combine some of the lessons from this and the previous chapter
and we will learn how to handle short to medium-term seasonal time series by
combining the classical decomposition method with exponential smoothing. The
chapter is completed by introducing the Holt-Winters’ method, one of the most
powerful methods for forecasting seasonal time series.

Why is short term forecasting so important? Let us assume that you own some shares
and you are contemplating selling them. You are not in a rush, so your strategy is to wait
until you think the shares have reached a reasonably high level. When this happens, you
want to “pull the trigger” and sell. It is, therefore, important for you to monitor the
shares movement from one day to another and anticipate what might happen the
following day. If you forecast that the value will go up, you will wait. If the forecast is
that it will go down, you might decide to sell today.

This is very crude scenario that describes why you really want to know what will
happen the following day. Another one, that you are likely to experience if you end up in
supply chain, is the question of inventory. Assume that you are running a shop that gets
supplies from the distribution centre. Because of the distance from the distribution
centre, you have a 12-hour delivery notice. If you can predict that you will be down
significantly on one item, you quickly order it and within 12 hours it is delivered. Your
short-term forecasts are the only tool to defend you from empty shelf space for this
product, or from overstocking an item that you cannot sell. Again, short term forecast is
your only tool to manage your business efficiently.

We used the phrase earlier that forecasting is one of the areas of statistics that will help
you today to gain glimpses of tomorrow. These glimpses imply that these statistical
techniques will help you narrow down the most probable area, or range of numbers,
likely to happen in the future. It is the objective of short-term forecasting to narrow
down this area as much as possible and deliver as accurate and as reliable forecasts as
possible.

Learning objectives

On completing this unit, you should be able to:

1. Understand moving averages.


2. Be able to forecast using single and double moving averages.
3. Understand exponential smoothing.
4. Be able to forecast using single and double exponential smoothing.
5. Calculate errors to establish if the model fits the data set.
6. Construct a prediction interval.
7. Produce seasonal short to medium range forecasts.
8. Be able to apply Holt-Winters’ method to seasonal mid-range forecasts.
9. Solve problems using the Microsoft Excel and SPSS.

11.2 Moving averages
Short-term and medium-term forecasting techniques are all based around some sort of moving averages, or smoothed values where past errors play a role in forecasting. One way or the other, both sets of techniques essentially “smooth” the original time series. What do we mean by this? We mean that the time series created by using one of these techniques to fit the historical values will, in its appearance, be smoother than the original time series. It will have fewer dramatic ups and downs and it will appear as if someone “ironed” the original time series.

Smoothing the original time series and treating this smoothed time series as an
approximation of the original time series, or the fit, means that we are eliminating some
random elements from the time series. Just like in the previous chapter, we assumed
that the long-term forecast is the trend, plus some random variations, here we are
looking not for a trend, but for a moving average. In fact, it can be either a moving
average, or exponentially smoothed values (to be explained below). Everything
beyond that is also treated as a residual. So, in principle we kept the same philosophy as
in the beginning of the previous chapter, but we are substituting the trend values with
the moving averages, or exponentially smoothed values.

Simple moving averages

To understand moving averages, we must remind ourselves of some of the basic properties of the mean value, or an average, in the context of time series analysis. Let us use the simple Excel =AVERAGE() function to calculate the average value of a time series. We’ll use a very short and artificial time series just for illustration purposes.

Example 11.1

A very short time series in Figure 11.1 has an average value of 206. This average value
represents the series well, because the series flows very much horizontally.

Figure 11.1 A short stationary time series and the average value

Figure 11.2 illustrates this graphically. The average of 206 is shown as a horizontal line
that runs across the time series.

Figure 11.2 A graph for the time series from Figure 11.1

As we know, the above sample time series can be called a stationary time series. This
implies that an average is a very good predictor of a stationary time series. If we know the
average (the mean value, for example), then we can predict that the next future value
will probably be somewhere around this mean value.

However, if the series was moving upwards (as in Figure 11.3), or downwards, we have
a non-stationary time series. In this case this average value would not be the best
representation of the series (see Figure 11.4).

Figure 11.3 A short non-stationary time series and the average value

Figure 11.4 A graph for the time series from Figure 11.3 with fitted mean value

In this case a much more realistic representation would be a moving average. How do
we calculate moving averages?

Moving averages are dynamic averages and they change depending on the number of
periods for which they are calculated. A general formula for moving averages is given by
equation (11.1).

M_t = (1/N) ∑_{i=t−N+1}^{t} x_i        (11.1)

In equation (11.1), t is the time period, N is the number of observations in the interval
taken into the calculation and xi are the observations. A simplified expression of
equation (11.1) is shown as equation (11.2).
M_t = (x_t + x_{t−1} + … + x_{t−N+1}) / N        (11.2)

This means that the moving average for M3 is calculated as:

M_3 = (x_3 + x_2 + x_1) / 3

Using the data from Figure 11.3:

M_3 = (200 + 250 + 150) / 3 = 200

In this case, the moving average value for the first three observations is placed in the
third period, i.e. at the end of the interval for which it is calculated.

However, sometimes you will see the following equation:

M_2 = (x_3 + x_2 + x_1) / 3

M_2 = (200 + 250 + 150) / 3 = 200

This is called a centred moving average. It is the same value, but it is positioned in the
middle of the interval for which it is calculated, rather than placed in line with the last
observation in the interval, as above.

Notice that above we used an example with an odd number of observations (3) in the
interval. If we had an even number of observations in the interval and we wanted to
centre the moving average, it would be difficult to place it between the two middle
observations. For this reason, it is easier to take an odd number of observations in the
moving average interval.

It is a convention to place the moving average value either aligned with the last
observation from the moving average interval, or to centre it in the middle of the
moving average interval. For these purposes, we recommend that, where appropriate, the
moving average interval consists of an odd number of observations.

Another way to express the equation (11.2) is as follows:

M_t = M_{t−1} + (x_t − x_{t−N}) / N        (11.3)

Equation (11.3) implies that we can still estimate the current moving average even if we
do not know all the values in the moving average interval. All we need is the previous
value of the moving average, plus two further values: the newest observation entering the interval and the oldest one leaving it.

If we had 5 elements in the moving average interval (N=5) and, for example, we tried to
estimate the 8th moving average value in the series, the equation (11.3) would look as
follows:

M_8 = M_7 + (x_8 − x_3) / 5

Although this might appear to be a useless fact here, you will see why we mentioned it
when we discuss exponential smoothing. The key point here is that the current value of
the moving average can be extracted from the previous value of the moving average,
plus some combination of the actual historical values from the time series. To
standardise the notation, we will use an abbreviation MA for moving averages, or SMA
(single, or simple moving averages). If you see 3MA or 5MA, this means: single moving
averages for the interval of 3 or 5 observations respectively.
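For readers who want to verify these calculations outside Excel or SPSS, the short sketch below (written in Python purely for illustration; it is not part of the textbook workflow, and the helper names trailing_ma and centred_ma are ours) computes trailing and centred moving averages in the spirit of equations (11.1) and (11.2), using the short artificial series from Figure 11.3.

    # Minimal sketch: trailing and centred simple moving averages (SMA).
    # Illustrative only; not taken from the textbook's workbooks.
    series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]

    def trailing_ma(y, n):
        # M_t = (y_t + y_{t-1} + ... + y_{t-n+1}) / n, placed at the END of its interval
        return [None] * (n - 1) + [sum(y[i - n + 1:i + 1]) / n for i in range(n - 1, len(y))]

    def centred_ma(y, n):
        # the same values, but placed in the MIDDLE of the interval (n should be odd)
        half = n // 2
        vals = [v for v in trailing_ma(y, n) if v is not None]
        return [None] * half + vals + [None] * half

    print(trailing_ma(series, 3))   # the first value, 200.0, appears at period 3
    print(centred_ma(series, 3))    # the same values, centred, as in column E of Figure 11.5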

Example 11.2

We will use the non-stationary time series from Example 11.1 and calculate moving
averages as in Figure 11.5.

Figure 11.5 An example of 3 and 5 centred and not-centred moving averages

Column D shows 3-point moving averages (3MA), not centred (written at the end of the
moving average interval), column E also shows 3MA, but centred, i.e. written in the
middle of the moving average interval. Columns F and G show the same example, but for
5MA, and column H shows the moving averages calculated as per equation (11.3), and
as expected they are the same as the values from column F.

Excel Solution

The cells in Figure 11.5 are executed in Excel as follows:

The first cell in column D is D6=SUM(C4:C6)/3, or D6=AVERAGE(C4:C6). The same
formula is used for cell E5. Correspondingly, we have F8=SUM(C4:C8)/5, or
F8=AVERAGE(C4:C8), and the same formula in G6. In cell H9 we are using equation
(11.3), so H9=G8+(C9-C4)/5. All the formulas are copied down to the last cell for which they
make sense.

To illustrate what the time series and the centred 3MA series look like, Figure 11.6 shows
the original time series and the 3MA time series.

Figure 11.6 A graph of the time series and centred 3MA from Figure 11.5

SPSS Solution

SPSS provides a standard function for calculating centred moving averages. Figure 11.7
contains the same time series as in Figure 11.5. The file used is Chapter 11 Example 2
Moving averages.spv.

Figure 11.7 Time series from Figure 11.5 in SPSS

To calculate moving averages, we need to create a new time series.

Select Transform > Create Time Series

Figure 11.8 A dialogue box to create a new time series

This brings a dialogue box as illustrated in Figure 11.9.

• Transfer Series to the Variable -> New name box
• Use the drop-down menu Function to choose Centered moving average.
• In the Span entry box type 3.
• Click on Change

Figure 11.9 A dialogue box to create a 3MA centred time series from the original
time series

Click OK

SPSS output

This creates a new series, as in Figure 11.10, where Series_1 represents the 3-point
centred moving average.

Figure 11.10 3MA centred time series created in SPSS

To chart both time series (the original time series and the newly created 3MA centred
time series) on the same chart:

Select Analyze, Select Forecasting, Select Sequence charts

Figure 11.11 Charting the two time series from Figure 11.10

Move Periods to Time Axis Labels, and Series and MA(Series,3,3) to Variables.

Figure 11.12 Defining the chart for the two time series

Click OK

SPSS output

Figure 11.13 A graph of the time series from Figure 11.10

The SPSS chart given in Figure 11.13 is identical to the Excel solution given in Figure
11.6.

In Example 11.2 we used only 3-period (3MA) and 5-period (5MA) moving averages.
What happens if we extend the number of observations in the moving average interval?
As the number of observations in the moving average interval increases and ultimately
reaches the full data set, the moving averages line becomes smoother and smoother,
until it becomes a straight horizontal line that represents the overall average (mean)
value.

Our simple data set from Example 11.2 is too short to illustrate this, so we will use the
data representing the average annual UK unleaded petrol prices in pence per litre 1983-
2020 (the same dataset was previously used in Example 9.4). The data set contains 38
observations and we decided arbitrarily to calculate 3MA, 12MA and the overall average
(effectively 38MA). Figure 11.14 shows the results as a graph.

Figure 11.14 A time series from Example 9.4 and three other types of averages
(3MA, 12MA and simple average)

Remember that the larger the number of observations in the moving average interval, the
“smoother” the time series of moving averages will be when compared to the original
time series. This is quite obvious from Figure 11.14. If you look at the 3-interval moving
average time series, you will see that it closely tracks the actual time series. On the
other hand, the 12-interval moving average line is much “flatter” and eliminates the
extreme “jumps”, as it averages over much larger intervals and is, therefore, not so
sensitive to the most recent events.

Moving averages are one of the techniques most often used in business reports.
Typically, you will find 3-month, 6-month or 12-month moving averages.

Short-term forecasting with moving averages

Now we know how to create moving averages, the question is: how do we use them as a
forecasting tool? The answer is very simple: we just shift the moving average by one
period into the future and this becomes our forecast. Equation (11.2) can be re-written as
a forecast in the following way:
F_{t+1} = M_t = (x_t + x_{t−1} + … + x_{t−N+1}) / N        (11.4)

In other words, the moving average for the first three periods (if we are using 3 moving
average periods, for example), becomes a forecast for the fourth period:

F_4 = (x_3 + x_2 + x_1) / 3
.
.
F_11 = (x_10 + x_9 + x_8) / 3

If we take the data from Example 11.2, then according to equation (11.4) the forecasts
for one period ahead are as follows:

Period Series 3MA forecasts Calculations
1 150
2 250
3 200
4 360 200.0 =(150+250+200)/3
5 330 270.0 =(250+200+360)/3
6 380 296.7 =(200+360+330)/3
7 280 356.7 =(360+330+380)/3
8 300 330.0 =(330+380+280)/3
9 490 320.0 =(380+280+300)/3
10 450 356.7 =(280+300+490)/3
11 413.3 =(300+490+450)/3
Table 11.1 Forecasts using 3MA

From Table 11.1, the forecast at time point 11 is 413.3. If we present the results in a
graphical form, then our forecast for one period ahead is as illustrated in Figure 11.15:

Figure 11.15 A graph for the values from Table 11.1

If, for example, a 12-month moving average is used in a business report, you will often
find that the report will calculate the moving average for the previous 12 months and
declare that this is most likely to be the forecast for next month’s value.

Example 11.3

Let us now use a slightly longer time series and see how to use the moving averages
function in Excel for forecasting purposes. Figures 11.16 and 11.17 show the UK birth
rate per 1000 people from 1960-2018, in total 59 observations (rows 15-55 are hidden
in the table), and the corresponding graph.

Figure 11.16 UK birth rate per 1000 people from 1960-2018, Source: ONS

Figure 11.17 A graph of the time series from Figure 11.16

Excel Solution

We have several ways to apply moving averages in Excel.

The first method is similar to what we did when we added a trend line to a time series.
We right click on the time series in a graph and select Add Trendline. This invokes a
dialogue box with several options included. We click on the option called “Moving
Average” and change the number of periods to 5 as illustrated in Figure 11.18.

Figure 11.18 A dialogue box to invoke moving averages in Excel

Excel will automatically chart the moving averages from the last observation in the
period specified. If we selected a 5-period moving average, then the moving average
function would start from observation five.

Figure 11.19 A graph of the time series from Figure 11.16 and its 5MA values

This shows us how to include moving averages graphically, but we still do not have the
actual values of moving averages. In order to do this, we need to go to Data > Data
analysis option.

Select Data > Data analysis

This will invoke further options, and we select Moving Average option, as shown in
Figure 11.20.

Figure 11.20 A dialogue box to invoke Moving Averages algorithm in Excel Data
Analysis Add-in

In Figure 11.21 we are showing that we selected the range, i.e. cells C4:C62, which is
where our time series resides, and we selected Interval of 5, which is the number of
moving averages we decided to use. Further down in the same dialogue box, we selected
the output to start in cell D4. We also selected the Chart Output and Standard Errors
(more on which a bit later).

Figure 11.21 A dialogue box to define the length of the interval for moving
averages in Excel

What we get is shown in Figure 11.22 (again, rows 15 – 55 hidden). Excel has inserted
5MA values in column D and produced Standard Errors in column E. However,
column D shows that the first 4 periods are not calculated, implying that Excel does not
place moving averages as centred (as we showed in the previous SPSS example), but
places them at the end of the moving average interval (in our case the first one is
placed in the fifth period, because we are calculating 5MA).
Figure 11.22 Excel output after the dialogue box as in Figure 11.21

As before, D8=AVERAGE(C4:C8), which is copied down. Column E shows an interesting-looking
formula. Cell E12=SQRT(SUMXMY2(C8:C12,D8:D12)/5), which is the function
for the standard error. However, in this case the denominator is 5, which corresponds
to the number of periods in the moving average. We’ll explain this later in the chapter when we
return to the standard error.

The graph that was produced by Excel is given in Figure 11.23.

Figure 11.23 Excel automatic chart showing the original time series and its 5MA
as forecasts

Once again, how is the moving average approach used to produce forecasts? As we
already know, all we need to do is to shift the moving average calculations, as produced
by Excel, by one observation. In other words, the moving average value for the first five
observations (assuming we are using moving averages for five periods) becomes the
forecast for the sixth observation. The seventh observation is predicted by using the
next five period moving average (observations two to six), and so on.

Figure 11.24 illustrates the point for five-period moving averages (5MA). In Figure
11.24 we are showing that we just inserted one blank cell D4 (we also inserted a blank
cell in E4 to shift the SE down), which shifted all the calculations one row down (rows
15 – 55 hidden).

Figure 11.24 Modified Figure 11.22 to show how moving averages are converted
into moving average forecasts

The forecast based on 5MA looks as in Figure 11.25.

Figure 11.25 A graph of the time series and its 5MA forecasts from Figure 11.24

By observing Figure 11.25 we can say that the 5-period moving average (5MA) follows
the actual time series quite well and that the forecast of 11.62 for 2019 looks reasonably
credible. We know from the previous two chapters that this statement requires more
scrutiny, but we’ll settle for it for now.

We also calculated a 13MA interval and produced a one-step-ahead forecast. Figure 11.26
shows the comparison with the 5MA forecasts. As we would expect, the 13MA forecasts are
even “smoother” than the 5MA forecasts.

Figure 11.26 A comparison between 5MA and 13MA forecasts for the time series

However, there is a difficulty associated with this approach. Moving averages cannot
extend our forecast beyond just one future period, which means that this method can
only be used as a short-term forecasting method that predicts only one future
observation. Moving averages are an acceptable forecasting technique, provided we are
interested in forecasting only one future period.

SPSS Solution

Refer to Figure 11.9. All we need to do is to modify the SPSS Function from Centered
moving average to Prior moving average for the same span of observations. We will
obtain results identical to those in Figure 11.24.

Mid-range forecasting with moving averages

If we need to extend our forecasts just beyond the immediate next forecast, for example,
to the following 2-6 observations, then we are effectively aiming to produce mid-range,
or mid-term forecasts. This implies that we will need to modify our simple moving
average formula.

In order to use moving averages as mid-term forecasts, we need to introduce one more
concept, and that is the concept of double moving averages. For this reason, we will
modify the notation. Simple (or single) moving averages will be called 𝑀𝑡′ or SMA and
double moving averages 𝑀𝑡′′ or DMA.

If moving averages represent a ‘rolling’ average of the actual observations in the series,
then we can also imagine a ‘rolling’ average of these moving averages, or, double
moving averages. Single moving averages are defined by equation (11.1) and double
moving averages are:

M″_t = (1/N) ∑_{i=t−N+1}^{t} M′_i        (11.5)

In other words, a moving average of the moving averages.

Using single and double moving averages, we can construct a dynamic intercept and a
dynamic slope coefficient, which will move and fluctuate as the original time series
moves. These two coefficients are calculated as follows:

a_t = 2M′_t − M″_t        (11.6)

b_t = (2/(N−1)) (M′_t − M″_t)        (11.7)

Where N in the denominator is the number of moving averages in the interval. These
two coefficients enable us to calculate forecasts that dynamically change as the time
series changes. The formula is:

𝐹𝑡+1 = 𝑌̂𝑡+1 = 𝑎𝑡 + 𝑏𝑡 (11.8)

The equation (11.8) will produce forecasts just one period ahead. We said that we were
looking for a method that can forecast further into the future. Well, the answer is now
very simple. If we need to extend forecasts m periods into the future, then the equation
for double moving average (DMA) forecasts becomes:

𝐹𝑡+𝑚 = 𝑌̂𝑡+𝑚 = 𝑎𝑡 + 𝑏𝑡 𝑚 (11.9)

Where m is the number of future periods (1, 2, 3, ..., m).

Equation (11.9) looks identical to a simple regression equation or a simple trend
extrapolation. However, there are two major differences. Both simple regression and
simple trend extrapolation use fixed values of a and b. With the double moving averages
(DMA) equation, the coefficients a_t and b_t are dynamic and they change from period to
period.

The second difference is that a simple trend extrapolation uses the variable x, which
represents time periods, starting from 1 and proceeding as consecutive numbers until
the end of the time series. The forecasts are the continuation of the same number
stream (if the time series has 20 observations, the values of x for the future calculations
are 21, 22, 23, …). With DMA, the value of m applies to the future periods, and m always
starts with 1 (the future values of m are 1, 2, 3, …).
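The sketch below (Python, purely illustrative; the helper names trailing_ma and dma_forecasts are ours) strings equations (11.5)-(11.9) together: it computes the single and double moving averages, evaluates a_t and b_t at the last observation, and extrapolates m periods ahead. It is a minimal sketch of the idea, not a substitute for the Excel and SPSS procedures shown in Example 11.4.

    # Sketch of DMA forecasting, equations (11.5)-(11.9). Illustrative only.
    def trailing_ma(y, n):
        return [sum(y[i - n + 1:i + 1]) / n for i in range(n - 1, len(y))]

    def dma_forecasts(y, n, m_max):
        sma = trailing_ma(y, n)                        # M'_t
        dma = trailing_ma(sma, n)                      # M''_t, a moving average of the moving averages
        a_last = 2 * sma[-1] - dma[-1]                 # equation (11.6) at the last observation
        b_last = 2 / (n - 1) * (sma[-1] - dma[-1])     # equation (11.7)
        return [a_last + b_last * m for m in range(1, m_max + 1)]   # equation (11.9)

    series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
    print(dma_forecasts(series, 5, 3))   # three future values, all on one straight line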

Example 11.4

We’ll use the same data as in Example 11.3. Figure 11.27 summarises the whole
procedure (as before, rows 15 - 55 are hidden). We are using 5-period moving averages.

Figure 11.27 Fitting the UK birth rate per 1000 between 1960-2018 time series
with DMA forecasts

To calculate the first two SMAs, for years 1964 and 1965, for example, we use equation
(11.1):

SMA_5 = (17.5 + 17.9 + 18.3 + 18.5 + 18.8) / 5 = 18.2

SMA_6 = (17.9 + 18.3 + 18.5 + 18.8 + 18.3) / 5 = 18.4

The last SMA, for year 2018, is

SMA_59 = (12.0 + 11.9 + 11.8 + 11.4 + 11.0) / 5 = 11.6

The DMAs are calculated using equation (11.5) as:

DMA_5 = (18.2 + 18.4 + 18.4 + 18.2 + 17.9) / 5 = 18.2

DMA_6 = (18.4 + 18.4 + 18.2 + 17.9 + 17.5) / 5 = 18.1

The last DMA, for year 2018, is

DMA_59 = (12.5 + 12.3 + 12.1 + 11.8 + 11.6) / 5 = 12.1

Coefficients a_t and b_t are calculated using equations (11.6) and (11.7). The examples for
time periods 9 and 10 are:

a_9 = 2M′_9 − M″_9 = 2 × 17.9 − 18.2 = 17.7

a_10 = 2M′_10 − M″_10 = 2 × 17.5 − 18.1 = 16.9, etc.

b_9 = (2/(5−1)) (M′_9 − M″_9) = 0.5 × (17.9 − 18.2) = −0.1

b_10 = (2/(5−1)) (M′_10 − M″_10) = 0.5 × (17.5 − 18.1) = −0.3, etc.

(Small discrepancies in the last digit occur because the SMA and DMA values are shown here rounded to one decimal place, whilst the spreadsheet works with unrounded values.)

One-step forecasts Ŷ_{t+1} are calculated using equation (11.8). Again, we show here just
periods 10 and 11:

Ŷ_10 = F_10 = a_9 + b_9 = 17.7 − 0.1 = 17.5

Ŷ_11 = F_11 = a_10 + b_10 = 16.9 − 0.3 = 16.6, etc.

When we reach the last observation, the values are SMA_59 = 11.6, DMA_59 = 12.1, a_59 = 11.2 and
b_59 = −0.2. To calculate forecasts m periods ahead, we use equation (11.9):

Ŷ_{59+1} = F_60 = a_59 + b_59 × 1 = 11.2 − 0.2 × 1 = 11.0

Ŷ_{59+2} = F_61 = a_59 + b_59 × 2 = 11.2 − 0.2 × 2 = 10.8
.
.
Ŷ_{59+5} = F_64 = a_59 + b_59 × 5 = 11.2 − 0.2 × 5 = 10.2

Excel Solution

Implementing these calculations in Excel is very easy, as shown in Figure 11.28.

Figure 11.28 Double Moving Average (DMA) forecasts (5 observations)

In this example we used 5-period single moving averages (5SMA) and 5-period
double moving averages (5DMA). They are shown in columns D and E. Columns F and G
calculate a_t and b_t, using equations (11.6) and (11.7). The first part of equation (11.7)
is the term 2/(N-1). As N represents the number of observations in the moving average
interval (5 in this case), this translates into 2/(5-1)=2/4=1/2.

The past forecasts, or the DMA fit of the existing time series is given in cells H13:H61,
the future forecasts are given in cells H62:H66, and they were calculated using equation
(11.9). The chart in Figure 11.29 shows the result, i.e. the graph of the original time
series and its DMA forecasts.

Figure 11.29 Chart showing extrapolation using Double Moving Averages (DMA)
forecasts

As we can see, because the values of the coefficients a_t and b_t are dynamically calculated,
the historical DMA forecast values ‘mimic’, or emulate, the movements of the
original time series. Unfortunately, once we have reached the last observation in the
series, these coefficients are “frozen” and all the future extrapolated values are linear.
Clearly, the reason they are linear (a straight line) is that the formula for DMA
forecasts, equation (11.9), is a linear formula.

DMA, like any other method, has some advantages and some disadvantages.

The values of the regression coefficients a and b (explained in the previous chapter) used
for extrapolating the linear trend, for example, were based on minimising the squares of
all the distances of every observation from the trend line. This gives some statistical
credibility to these two coefficients. Unfortunately, the values of the coefficients a_t and b_t
used for forecasting the future with DMA are based only on the last moving average interval.
This means that they do not really represent the whole time series. In other words, the basis
for extrapolating the time series into the future is relatively short. This potentially increases
the uncertainty of our forecasts and implies that the DMA forecasting method is only
acceptable for short to medium term forecasts.

On the other hand, if the time series is non-stationary, then the long-term history has
very little relevance. It is the most recent history that is much more relevant for our
immediate forecasts. Because DMA forecasts explicitly rely on the most recent history,
this method is often much better suited to produce the short-to-medium range forecasts
for non-stationary time series.

SPSS Solution

SPSS does not offer a ready-made solution for DMA, but formulae can be recreated using
the Create Time Series option in the Transform menu.

Check your understanding

X11.1 Why is the simple mean (or, the average) the easiest and the “safest” way to
predict future values of a stationary time series?

X11.2 What is the difference between using moving averages as a general technique,
versus moving averages as a forecasting method?

X11.3 What is the impact on newly generated time series of moving averages if you
increase the number of elements in the interval for moving averages?

X11.4 What curve does the double moving average (DMA) forecasting method follow
when extrapolating the values in the future?

X11.5 Is it appropriate to use double moving averages (DMA) forecasting method for
long-term forecasting?

11.3 Introduction to exponential smoothing


In order to introduce the exponential smoothing method, we need to assume that one
of the ways to think about observations in a time series is to say that the previous value
in the series (yt-1), plus some error element (et), is the best predictor of the current
value (𝑦̂𝑡 ). This can be expressed by equation (11.10).

ŷt = yt-1 + et (11.10)

We can modify equation (11.10) and state that every new forecast is equal to the old
forecast plus an adjustment for the error that occurred in the last forecast i.e. et-1 = yt-1 -
Ft-1, as presented in equation (11.11).

Ft = Ft-1 + et-1 or Ft = Ft-1 + (yt-1- Ft-1) (11.11)

Where yt-1 is the actual result from period t – 1 and Ft-1 is the forecast result for period t
- 1.

Remember equation (10.1) from the previous chapter? It states that Y = T + R, which is a
similar principle we are using here. Instead of using the trend value T, equation (11.11)
uses a more general term Ft-1. This makes equations (10.1) and (11.11) similar in terms
of an idea they convey.

Let us now assume that the error element, i.e. (yt-1 - Ft-1) is zero. In this case the current
forecast is the same as the previous forecast. However, if it is not zero, we can take the
full impact of the error e_{t-1}, or just a fraction of it. If we are going to take a fraction
of the error, then this means that we need to multiply it by a value somewhere
between 0 and 1. This is done using equation (11.12).

F_t = F_{t−1} + α(y_{t−1} − F_{t−1})        (11.12)

We use the letter α (alpha) to describe the fraction, and the word ‘fraction’ implies that α
takes values between zero and one. If α = 0, then the current forecast is the same as the
previous one. If α = 1, then the current forecast is the same as the previous one, plus
the full amount of the deviation between the previous actual and forecasted value.

If, using the same formula, you substitute F_{t-1} in equation (11.12) with the same
expression, and then F_{t-2}, etc., you will see that the current F_t depends on all the past
values of y_t. When the past values of y_t change over time by moving upwards or
downwards (a non-stationary time series), then more recent observations are more
important than older observations and they should be weighted appropriately. Simple
exponential smoothing is a forecasting method that applies unequal weights to the time
series data, i.e. we have the power to decide whether older or more recent data should gain
more weight in deciding about the future.

Why a fraction of an error? If every current forecast/observation depends on the
previous one, and that one depends on the one before, etc., then all the previous errors
are in fact embedded in every current observation/forecast. By taking a fraction of the
error, we are in fact discounting the influence that every previous observation and its
associated error has on current observations/forecasts.

So, in order to take just a fraction of that deviation, α must be greater than zero and
smaller than one, i.e. 0 < α < 1. The forecasts calculated in such a way form a line
that is in fact smoother than the line formed by the actual observations. If we plot both
the original observations and these newly calculated ex-post forecasts, we’ll see that the
ex-post forecast curve eliminates some of the dynamics that the original observations
exhibit. It is a smoother time series, just like the moving average series.

Equations (11.11) and (11.12) originate from Brown’s single exponential
smoothing method. Brown’s original formula states that:

S′_t = αy_t + (1 − α)S′_{t−1}        (11.13)

Note that Brown uses y_t rather than y_{t−1}. This effectively means that Brown uses the
exponentially smoothed values in the same way as we initially used moving averages, in
other words, just as a smoothing technique. If we use Brown’s original smoothing equation,
then we must remember that F_t = S′_{t−1} (see equations (11.11) and (11.12)).
Remember, this is the same principle as with moving averages, where we said that a
moving average for an interval can be used either as an approximation value for the
interval, or as a forecast for the time period that follows the interval.

By changing the notation, equation (11.12) can also be rewritten as equation (11.14).

𝐹𝑡 = 𝛼𝑦𝑡−1 + (1 − 𝛼)𝐹𝑡−1 or 𝐹𝑡+1 = 𝛼𝑦𝑡 + (1 − 𝛼)𝐹𝑡 (11.14)

Equation (11.12) and the two forms of equation (11.14) are all identical and it is a
matter of preference which one to use. They all provide identical forecasts based on
smoothed approximations of the original time series.
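A minimal sketch of equation (11.14) is shown below (Python, illustrative only; the function name ses_forecasts and the choice to seed the first forecast with the first observation are our assumptions, and other initialisations are possible).

    # Simple exponential smoothing used as a one-step-ahead forecast, equation (11.14).
    def ses_forecasts(y, alpha):
        f = [y[0]]                                    # F_1 seeded with the first observation
        for t in range(1, len(y) + 1):                # the last element is the forecast beyond the data
            f.append(alpha * y[t - 1] + (1 - alpha) * f[t - 1])
        return f

    series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
    print(round(ses_forecasts(series, 0.3)[-1], 1))   # one-step-ahead forecast after the last observation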

We implied that the smaller the α (i.e. the closer α is to zero), the smoother and more
horizontal the series of newly calculated values is going to be. Conversely, the larger the
α (i.e. the closer α is to one), the more impact the deviations have and, potentially, the
more dynamic the fitted series is. When α = 1, the smoothed values are identical to the
actual values, i.e. no smoothing is taking place.

There is also a connection between Brown’s formula, i.e. equation (11.13) and the
moving averages concept. You will recall from the section on moving averages that we
used equation (11.3):

M_t = M_{t−1} + (y_t − y_{t−N}) / N

The above equation can be written as:

M_t = M_{t−1} + (1/N)(y_t − y_{t−N})

This looks very much like equation (11.12). In this case, it looks as if α = 1/N. Although this
is not strictly true (see the text below), the similarity between moving averages and
exponential smoothing is obvious. This is the reason why exponential smoothing is
sometimes called the exponentially weighted moving average (EWMA) method.

The smoothing constant (α) and the number of elements in the interval for calculating
moving averages are in fact related. The equation that defines this relationship is given
by equation (11.15).

α = 2 / (M + 1)        (11.15)

In equation (11.15), M is the number of observations in the interval used to calculate the
moving average. The formula indicates that the moving average for three observations
that we used earlier is equivalent to α = 0.5. Equally, α = 0.2 is equivalent to M = 9. So,
the smaller the value of the smoothing constant, the more horizontal the series will be,
just as when a larger number of observations is used in the moving average.

To reiterate, if using equation (11.14) we inserted y_{t-2} and F_{t-2}, and then y_{t-3} and F_{t-3}, etc.,
we would see that effectively we are multiplying the newer observations by higher weights
and the older observations by smaller weights. By doing this we are in effect assigning a
higher importance to the more recent observations. As we track the weight applied to
progressively older observations, its value drops exponentially. This is the reason why we
have the word “exponential” in the phrase exponential smoothing. Every value in the time
series is affected by all those that precede it, but the relative weight (importance) of these
preceding values declines exponentially the further we go into the past.
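The decay of the weights can be made visible with a couple of lines of code (Python, illustrative only). Assuming α = 0.3, the weight attached to an observation k periods in the past is α(1−α)^k, and these weights sum to (almost) one:

    # The weight on an observation k periods in the past is alpha*(1-alpha)**k.
    alpha = 0.3
    print([round(alpha * (1 - alpha) ** k, 4) for k in range(6)])         # 0.3, 0.21, 0.147, 0.1029, 0.072, 0.0504
    print(round(sum(alpha * (1 - alpha) ** k for k in range(1000)), 6))   # the weights sum to (almost) 1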

There is another useful interpretation of this fact. If we choose a small value of α (closer
to zero), we are putting more weight on all observations, including the older ones.
Therefore, the time series of such exponentially smoothed values looks smoother. If we
choose a larger value of α (closer to one), we are putting more weight on the more recent
observations. Therefore, the time series of such exponentially smoothed values looks
more like our original time series.

Forecasting with exponential smoothing

When discussing moving averages, we learned that, depending where we place the
moving average value, it can be considered either just a simple moving average value (if
it is centred or at the end of the moving average interval), or a forecast obtained using a
moving average (if it is placed one period after the moving average interval). The
example we can use is: if we have a 3MA value and we place it in the third period, then we
are simply implying that this value is a moving average of this interval of three
observations. If, on the other hand, we put it in the fourth period position, then we
imply that this moving average value is the forecast for the next period, based on the
previous three. The same principle is valid when using exponential smoothing. If you
remember that the past smoothed value can also be used as a forecast, then hopefully
there is no confusion:

𝐹𝑡+1 = 𝑆𝑡′ (11.16)

Example 11.5

As an example, let’s use a short time series to demonstrate how to use exponential
smoothing to create forecasts using Brown’s exponential smoothing method. The time
series is given in Table 11.2 (same as Example 11.2):

Period Yi
0 150
1 250
2 200
3 360
4 330
5 380
6 280
7 300
8 490
9 450
Table 11.2 A short time series with zero starting period

To start the smoothing process, we must choose the value of the smoothing
constant α and the initial estimate S′_0. The value of S′_0 is needed to determine S′_1,
which is calculated as: S′_1 = αy_0 + (1 − α)S′_0.

In this example, we have chosen α = 0.3 and S′_0 = y_0 = 150.

Exponentially smoothed values are calculated using equation (11.14):

F_1 = αy_0 + (1 − α)F_0 = 0.3 × 150 + 0.7 × 150 = 150
F_2 = αy_1 + (1 − α)F_1 = 0.3 × 250 + 0.7 × 150 = 180
F_3 = αy_2 + (1 − α)F_2 = 0.3 × 200 + 0.7 × 180 = 186
F_4 = αy_3 + (1 − α)F_3 = 0.3 × 360 + 0.7 × 186 ≈ 238
F_5 = αy_4 + (1 − α)F_4 = 0.3 × 330 + 0.7 × 238 ≈ 266

Again, the calculations in Excel are easy to implement, as in Figure 11.30. Cell D6=C6,
then D7=$C$3*C7+(1-$C$3)*D6, which is copied down. Forecasts in column F use the
same formulae, except that they are shifted down by one cell.

Excel Solution

Figure 11.30 Applying simple exponential smoothing as a forecasting method

As was the case with moving averages, in order to forecast one value in the future, we
need to shift the exponential smoothing calculations by one period ahead. The last
exponentially smoothed value will in effect become a forecast for the following period.
That is what we did in column F in Figure 11.30.

As an alternative to this formula method, Excel provides the exponential smoothing
method from the Data Analysis add-in pack (Select Data > Select Data Analysis > Select
Exponential Smoothing).

Example 11.6

We will use data from Example 11.3 and 11.4 to create a new Example 11.6 that
demonstrates the use of exponential smoothing via the Excel Data Analysis method.
Figure 11.31 illustrates the data in Excel.

Figure 11.31 The same data as in Examples 11.3 and 11.4

The first step is to go to tab Data > Select Data Analysis > Select Exponential Smoothing.

Figure 11.32 Excel dialogue box to invoke exponential smoothing in Data Analysis
Add-In
Click OK to access the Exponential Smoothing menu.

WARNING: Excel uses the expression “Damping factor”, rather than smoothing
constant, or α. The damping factor is defined as (1 - α). In other words, if you want α to
be 0.3, you must specify in Excel a damping factor value of 0.7.

Input the menu inputs as illustrated in Figure 11.33 and click OK.

Figure 11.33 Dialogue box to define parameters in exponential smoothing

In our case α=0.9, so we had to enter 0.1 in the dialogue box in Figure 11.33. If you
wanted the Chart Output and the Standard Errors (more about that later), then you can
tick the two boxes at the bottom, as shown in Figure 11.33. The final output is as in
Figure 11.34 (we are again hiding rows 15:55).

Figure 11.34 Output from Excel after exponential smoothing dialogue box as in
Figure 11.33 was completed (columns E and F) and manual calculations (column
G)

The values in column E were produced by the Excel routine, and the values in column G are
reproduced manually by converting equation (11.13) into an Excel formula: cell
G6=$G$2*C5+(1-$G$2)*E5, etc. As we said, we’ll return to the SE values from column F
later. At present, just note that they are automatically calculated by Excel using the formula
F8=SQRT(SUMXMY2(C5:C7,E5:E7)/3).

From Figure 11.34, we observe the forecast for time point 2019 is 11.04.

As we can see, the only difference between the formulae from column E (Excel
calculation) and column G (manual calculation) is that the manual calculation uses alpha
(α=0.9) and Excel uses the damping factor (1 - α = 1 - 0.9 = 0.1).

The Data Analysis > Exponential Smoothing solution illustrated in Figure 11.34
always ignores the first observation and produces exponential smoothing from the
second observation. It also cuts the exponentially smoothed values short, as the last
exponentially smoothed value corresponds with the last observation in the series. You
can easily extend the last cell one period into the future, which is what we did (just
Copy/Paste the last formula to the next cell down).

What becomes obvious from the example above is that by using the Excel routine you
cannot change the value of α and automatically see what effect this has on your
forecasts. This means that, as far as simple exponential smoothing is concerned, you are
better off producing your own set of formulae. We will do this and compare two
different values of alpha in Example 11.7.
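The same comparison can also be sketched outside Excel (Python, illustrative only; for brevity we reuse the short artificial series rather than the birth-rate data, and the function name is ours). The snippet runs the same recursion with two alpha values and prints the resulting one-step-ahead forecasts:

    # The SES recursion from equation (11.14), run with two different smoothing constants.
    def ses_forecasts(y, alpha):
        f = [y[0]]
        for t in range(1, len(y) + 1):
            f.append(alpha * y[t - 1] + (1 - alpha) * f[t - 1])
        return f

    series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
    for alpha in (0.1, 0.9):
        print(alpha, round(ses_forecasts(series, alpha)[-1], 1))   # the smaller alpha reacts more slowly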

Example 11.7

We are using the same data set as in Example 11.6, and the two different alpha values
used are 0.1 and 0.9. Rows 15:55 are hidden in Figure 11.35.

Excel Solution

Figure 11.35 The use of two different values of alpha (0.1 and 0.9) for simple
exponential smoothing

As before, D5=C4, D6=$D$2*C5+(1-$D$2)*D5 and, for the second value of alpha, F6=$E$2*C5+(1-$E$2)*F5, etc.

From Figure 11.35, we observe that the forecasts for time point 2019, given the two values
α = 0.1 and α = 0.9, are 12.15 and 11.04 respectively. The results are charted in the graph in
Figure 11.36. As we can see, the ex-post forecasts for α=0.9 follow the original time series
more closely, whilst the ex-post forecasts for α=0.1 give us a much smoother line,
as expected.

Figure 11.36 A graph of the two exponentially smoothed time series, using α=0.1
and α=0.9

The way we implemented these formulas, with a dedicated cell for the smoothing
constant alpha (cells D2 and E2 in Figure 11.35), means that by changing just this one
cell we can see the impact on how well our ex-post forecasts fit the original time
series.

SPSS Solution

In this section we will explore how we use SPSS to produce forecasts using exponential
smoothing for the same Example 11.7. SPSS data file: Chapter 11 Example 7 Exponential
smoothing.sav

Enter data into SPSS – first 15 data values shown in Figure 11.37

Figure 11.37 Same data from Examples 11.3 and 11.4 in SPSS

The last data points are illustrated in Figure 11.38 including the time point 60 for the
2019 forecast.

Figure 11.38 The last five observations of the data set from Figure 11.37 and the
empty cell for the first future value of the time series

Data > Define date and time…

Figure 11.39 A dialogue box for time-stamping the data set

Define date by clicking on Define date and time (year, starting 1960)

Figure 11.40 Assigning time stamps to observations from the time series

Click OK

Figure 11.41 The first 15 observations from the time series with the newly
created time stamps

Figure 11.42 The last five and the first future observations with the time stamps

Now run the analysis

Analyze > Forecasting > Create Traditional Models

Figure 11.43 Selecting Create Traditional Models in SPSS to apply
exponential smoothing

Analyze > Forecasting > Create Traditional Models

• Transfer variable Series into the Dependent Variables box
• Select Method: Exponential Smoothing

Figure 11.44 A dialogue box to define exponential smoothing

Click on Criteria button and choose Model Type – Nonseasonal – Simple

Figure 11.45 Selecting a simple exponential smoothing method

Click Continue

Figure 11.46 Defining the series for exponential smoothing

Select Statistics tab and choose options selected in Figure 11.47

Figure 11.47 Defining the outputs for exponential smoothing

Select Plots tab and choose options selected in Figure 11.48

Figure 11.48 Defining the plots for exponential smoothing

Select Save tab

In the Variables box choose to Save: Predicted values, Lower Confidence Limits,
Upper Confidence Limits

Figure 11.49 Defining the prediction interval for exponential smoothing
forecasts

Select Options tab and choose Forecast Period: First case after end of estimation
period through last case in active data set (2018) as illustrated in Figure 11.50

Figure 11.50 Defining the width of the confidence limits for the prediction
interval

Click OK

SPSS data file

Figures 11.51 and 11.52 illustrate the SPSS data file (rows 15 – 54 hidden)

Figure 11.51 The first 15 observations, their exponentially smoothed forecasts
and the prediction interval

Figure 11.52 The last five observations and one future exponentially smoothed
forecast and the prediction interval

From Figures 11.51 and 11.52, we observe the predicted (or forecast) value for time
point 2019 is 11.0.

SPSS Output

SPSS will come up with the following printout.

Figure 11.53 Output from SPSS with a variety of error statistics

Figure 11.54 Output from SPSS with model statistics and model parameters

Figure 11.55 A single future exponentially smoothed forecast and the prediction
interval

From Figure 11.55, we observe the predicted (or forecast) value for time point 2019 is
11.0. The forecast values illustrated in Figure 11.55 are comparable but not identical to
those calculated by Excel (the value for 2019 was 11.04) and shown in Figure 11.35.
However, you will notice that SPSS did not ask us to determine the value of alpha. It

optimised it to find the smallest RMSE. We are somewhat familiar with this already, but
we will come back to this soon.

Figure 11.56 illustrates a graph of the observed data values, the fitted (predicted)
values from the smoothing model and the forecast value for time point 60
(the prediction interval is also included, though we’ll cover it later).

Figure 11.56 SPSS graph with the actual time series, exponentially smoothed
forecasts and the prediction interval

Like single moving averages, simple exponential smoothing as a forecasting method can
produce forecasts only one period ahead. To have longer, i.e. mid-range forecasts, a
modification is needed.

Mid-range forecasting with exponential smoothing

In order to use simple exponential smoothing beyond just a single forecast, we must add
a few simple equations. Like we did with equations (11.5) – (11.8), we need to introduce
double exponential smoothing values (DES) and the related parameters at and bt that
will be used to produce linear forecasts beyond just one future value.

To apply the analogy with the double moving averages, we can say that double
exponential smoothing values are exponentially smoothed values of the exponentially
smoothed values. Single exponential smoothing (SES) equation (11.13) could be applied
to construct double exponentially smoothed (DES) values:
S″_t = αS′_t + (1 − α)S″_{t−1}        (11.17)

We are using a single apostrophe (S’) to symbolize single (SES) and a double apostrophe
(S”) to symbolize double exponentially smoothed (DES) values. Again, using the analogy
with the DMA forecasting method, double exponential smoothing (DES) forecasting
method is:

𝐹𝑡+𝑚 = 𝑌̂𝑡+𝑚 = 𝑎𝑡 + 𝑏𝑡 𝑚 (11.18)

The only difference is that the coefficients at and bt are calculated as:

a_t = 2S′_t − S″_t        (11.19)

b_t = (α / (1 − α)) (S′_t − S″_t)        (11.20)
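A compact sketch of the DES recursion and forecast, equations (11.17)-(11.20), is given below (Python, illustrative only; seeding both smoothed series with the first observation is our assumption, and other initialisations exist).

    # Brown's double exponential smoothing (DES) and its m-step-ahead forecasts.
    def des_forecasts(y, alpha, m_max):
        s1 = s2 = y[0]                               # single and double smoothed values
        for obs in y:
            s1 = alpha * obs + (1 - alpha) * s1      # equation (11.13)
            s2 = alpha * s1 + (1 - alpha) * s2       # equation (11.17)
        a = 2 * s1 - s2                              # equation (11.19)
        b = alpha / (1 - alpha) * (s1 - s2)          # equation (11.20)
        return [a + b * m for m in range(1, m_max + 1)]   # equation (11.18)

    series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
    print([round(f, 1) for f in des_forecasts(series, 0.3, 5)])   # five future values on a straight line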

Example 11.8

In Example 11.4 we produced forecasts using the DMA (Double Moving Average)
method. In the following Example 11.8, we will use the same data set and produce
forecasts using the DES (Double Exponential Smoothing) method. We will skip manual
calculations and show only an Excel implementation. The value of alpha used is α = 0.3.

Excel Solution

Figure 11.57 DES forecasts for the same example as in 11.6

As before, the past forecasts, or the DES fit of the existing time series is given in cells
H4:H62. However, the future forecasts are given in cells H63:H67 and they were
calculated using equation (11.18). The example below shows these calculations:

Ŷ_{59+1} = F_60 = a_59 + b_59 × 1
Ŷ_{59+2} = F_61 = a_59 + b_59 × 2
.
.
Ŷ_{59+5} = F_64 = a_59 + b_59 × 5

In Figure 11.57, alpha is given in cell G2. SES values in column D are calculated as
D5=$G$2*C5+(1-$G$2)*D4 and DES values in column E using the same formula
E5=$G$2*D5+(1-$G$2)*E4. The dynamic intercept at in column F is F4=2*D4-E4 and the
dynamic slope bt in column G is G4=($G$2/(1-$G$2))*(D4-E4). And finally, DES ex-post
forecasts in cells H5:H61 use simple formula H5=F4+G4, whilst the forecasts in
H63:H67 use formula H63=$F$62 + $G$62*B63.

The chart in Figure 11.58 shows the final result, i.e. the graph of the original time series
and its DES forecasts.

Figure 11.58 A graph containing the DES forecasts for 5 years ahead

Compare these forecasts with the ones in Example 11.4. Although the future values are
not identical (these were calculated using the DES method and the previous ones using the
DMA method), both show linear extrapolation when going forward into the future.

SPSS Solution

Input data into SPSS and include dates, as illustrated in Figures 11.59 and 11.60. Figure
11.37 to Figure 11.42 in Example 11.7 show how to time stamp the data, so we are not
repeating them here.

Figure 11.59 Same as Figure 11.37, but with only 5 observations shown

Figure 11.60 Modified Figure 11.38 to show five future periods

Notice the forecast time points are now 60 – 64.

Select Analyze > Forecasting > Create Traditional Models

Transfer Series into Dependent Variables box

Choose Method: Exponential Smoothing

Figure 11.61 Selecting a variable for DES forecasts

Click on Criteria and select Model Type Nonseasonal: Brown’s linear trend (this
is how SPSS refers to the DES method)

Figure 11.62 Selecting Brown’s linear trend, which is equivalent to DES

Click Continue

Figure 11.63 Defining the series for double exponential smoothing (DES)

The steps that follow are identical to Figure 11.47-50, so will not show them
here.

Figure 11.64 Selecting the future forecasting horizon

The result is shown in Figures 11.65 and 11.66, where SPSS added more
variables that define this DES, or Brown’s model.

Figure 11.65 The first 5 observations, their DES forecasts and the prediction
interval

Figure 11.66 The last five observations and five future DES forecasts and the
prediction interval

From Figure 11.66, the double exponential smoothing forecasts are 10.6, 10.2, 9.8, 9.5
and 9.1. The values in Figure 11.66 are slightly different from the values in Figure 11.57,
which is the Excel version.

If you look at the SPSS output, you get the following:

Figure 11.67 Output from SPSS with model parameter details

Clearly SPSS used the value of alpha as 0.864, whilst we used 0.3 in Excel. If you
manually input alpha = 0.864 into Cell G2, then you will get the same solutions as SPSS.
Later we will show how to optimise this value in Excel to get better Excel forecasts.
Finally, Figure 11.68 shows the graph.

Figure 11.68 SPSS graph with the actual time series, DES forecasts and the
prediction interval

There are methods based on exponential smoothing principle that are not necessarily
linear, such as the triple exponential smoothing method (TES), but this is beyond the
scope of this textbook.
Check your understanding

X11.6 What is the difference between single exponential smoothing as a technique to
eliminate variations (a smoothing technique) and single exponential smoothing as a
forecasting method?

X11.7 What is the range for the smoothing constant alpha and how does it impact the
newly created time series of smoothed values?

X11.8 Is there a difference between the smoothing constant alpha and damping factor in
Excel?

X11.9 What is the relationship between the smoothing constant and moving averages?

X11.10 Would you describe the future double exponential smoothing forecasts as
linear?

11.4 Handling errors for the moving averages or exponential smoothing forecasts

In the previous chapter we demonstrated how to use errors to validate the long-term
forecasts. In principle, there is no difference between the long-term and medium-term
forecasts, at least as far as the error handling is concerned. However, in the context of
the mid-term forecasts, in particular when using the exponential smoothing method, we can
use errors to help us optimise the forecasts. Excel is very well suited to help us with this.
We will use Example 11.9 and demonstrate how to execute the optimization.

Example 11.9

Same example as before, and rows 15-55 are again hidden. All the formulae are the
same as before.

Excel Solution

As we can see, the only addition to the spreadsheet from Example 11.8 is the cell G1.
This is the cell where we introduced the Mean Squared Error (MSE).

Rather than adding a new column with errors and then squaring them, adding them all up and
finding an average (which would be the MSE), we used one single formula to calculate the
MSE: =SUMXMY2(C5:C62,H5:H62)/COUNT(C5:C62).
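The same calculation can be written in one line outside Excel. In the sketch below (Python, illustrative only) the “fitted” numbers are hypothetical and serve only to show the arithmetic behind SUMXMY2 divided by COUNT:

    # Equivalent of =SUMXMY2(actual, fitted)/COUNT(actual): the mean of the squared errors.
    actual = [12.0, 11.9, 11.8, 11.4, 11.0]   # the last few birth-rate values quoted in Example 11.4
    fitted = [11.9, 11.8, 11.6, 11.5, 11.1]   # hypothetical fitted values, for illustration only
    mse = sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual)
    print(round(mse, 4))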

Figure 11.69 DES forecasts for the same example as in Figure 11.57, but with the
added MSE value

In our case, the value of the MSE is 0.37359, providing the value of alpha is α=0.3. Let’s see
how, by manipulating the value of alpha, we can reduce the MSE. To use the MSE as a tool
for determining the optimal value of the smoothing constant α we need to use a small
trick. We will use Excel’s Solver add-in option, which can be found by clicking on the Data >
Solver menu (see Figure 11.70).

Figure 11.70 Solver option in Excel

By selecting the Solver option, a dialogue box appears, as in Figure 11.71. As we can see
we are aiming (or as Excel would say: Set Objective) to change the values in the cell G1,
which is the MSE, and we specified that we want this value to be the smallest possible
value, i.e. the minimum. How are we going to do this? We are going to do this by
allowing Excel to change all the possible values in G2 (the value of the smoothing
constant alpha), until the value in G1 is the minimum.

Figure 11.71 Complete Solver dialogue box ready to optimise the content of the
cell G1 (get the minimum value), given the restrictions for cell G2

To recap, the logic behind the Solver function is to:

• Set Objective: G1 in our case
• By Changing Variable Cells: G2 in our case

If we want to impose some restrictions, under the heading of Subject to the Constraints
we need to click on the Add button. This triggers another dialogue box, as in Figure
11.72. We need to restrict the value of alpha to a maximum of one and a minimum of
zero.

Figure 11.72 Adding the Solver constraints

Note that we restricted the smoothing constant in cell G2 to a maximum of 0.9999.
Sometimes Excel Solver does not converge to a solution if we put in the value of 1, hence
0.9999. Equally, we defined the minimum value as 0.0001, rather than 0, to help with the
convergence towards the optimum solution.

Figure 11.72 shows all the boxes completed in the Solver dialogue box before we click
on the Solve button.

Once we have clicked the Solve button, Excel offers additional report sheets (reports on
Answers, Sensitivity and Limits), but we will let you explore this independently. The
result that Excel offers as the optimum smoothing constant that minimises the MSE is α
= 0.85686. This value of alpha yields an MSE of 0.11894, which is an improvement over
the existing 0.37359 for α=0.3.

Figure 11.73 DES forecasts for the same example as in Figure 11.69, but with the
optimised MSE value

If we look at the graph now, it is visible that our new ex-post forecasts are even
closer to the historical values of the time series. Compare the forecast values from
Figure 11.73 with the ones from Figure 11.68, which we obtained using SPSS, and you
will see that they are identical. You can also see from Figure 11.67 that SPSS used an
almost identical value of alpha (0.863) to the one obtained in our Excel example after
using the Solver option (0.85559).

Figure 11.74 A graph containing the optimised DES forecasts for 5 years ahead

It suffices to say that we could have added other constraints, if we wanted to. The
objective, or the cell containing the MSE as a target, could also be changed. We could have
used the ME, MAPE, RMSE, SE, or any other measure as a target. We could also have set a
desired value of the MSE, for example. The possibilities are numerous, which makes this
simple and elegant approach to optimising forecasts an ideal tool.

Sometimes when you run the Solver, it will either not converge to any meaningful value,
or it will return #DIV/0!. Do not be discouraged. Either change the Solving
Method in Excel from GRG Nonlinear to Simplex LP or Evolutionary, or modify the
limit values for the constraints, as we did in Figure 11.72. Because alpha should be
neither 1 nor 0, we modified the criteria and put 0.9999 and 0.0001. This helped with the
convergence towards the solution.
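Outside Excel, Solver’s job can be mimicked with a simple scan over the admissible values of alpha. The sketch below (Python, illustrative only; it uses the short artificial series and our own helper names) keeps the alpha that minimises the MSE of the one-step-ahead SES fit:

    # A crude stand-in for Solver: scan alpha over (0, 1) and keep the MSE-minimising value.
    def ses_fit(y, alpha):
        f = [y[0]]
        for t in range(1, len(y)):
            f.append(alpha * y[t - 1] + (1 - alpha) * f[t - 1])
        return f                                   # one fitted value per observed period

    def mse(y, alpha):
        f = ses_fit(y, alpha)
        return sum((a - b) ** 2 for a, b in zip(y, f)) / len(y)

    series = [150, 250, 200, 360, 330, 380, 280, 300, 490, 450]
    best_alpha = min((a / 1000 for a in range(1, 1000)), key=lambda a: mse(series, a))
    print(round(best_alpha, 3), round(mse(series, best_alpha), 1))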

Prediction interval for short and mid-term forecasts

In Examples 11.3 and 11.6, when we selected moving averages and exponential
smoothing respectively, through one of Excel’s Data Analysis apps, one of the dialogue
boxes gave us an option to get Standard Errors displayed. As we deferred explanations
for this statistic, we will briefly repeat the procedure, but this time we will focus on the
Standard Error option.

In Example 11.3 in Figure 11.24, column E contained standard errors, as calculated by
Excel. The formula that Excel used (the first cell was E12) is:
=SQRT(SUMXMY2(C8:C12,D8:D12)/5).

This is effectively the standard error of the estimate as defined by equation (10.14)
from the previous chapter. The only difference is that equation (10.14) has n-2 in the
denominator, whilst Excel used number 5. The reason why Excel used number 5 was to
match the number of moving averages we selected. This means that the standard error,
as calculated by Excel automatically in this case, is a dynamic standard error. In other

words, it is not constant for the whole data set, but it varies as the moving averages
change.

In Example 11.6 in Figure 11.34, column F contains standard errors calculated by Excel,
but this time for the exponentially smoothed values. The first value, in cell F8, has an almost
identical formula: =SQRT(SUMXMY2(C5:C7,D5:D7)/3). The only difference is that the
denominator is 3. It is unclear why Excel uses 3, and it is used consistently regardless of
the value of alpha. However, as before, this indicates that the standard error of the
estimate is a dynamic value that changes from period to period.

Example 11.10

We will now use the same data as in Example 11.9 where we optimised the value of
alpha using Excel Solver. In this Example 11.10, we will extend the calculations and use
standard errors to calculate the prediction interval.

Excel Solution

Figure 11.75 illustrates the Excel solution (row 15-55 hidden).

Columns A:H contain formulae identical to Figure 11.69. We introduced a few new cells.
L3 is 0.05 (this is the level of significance, also called alpha, but not to be confused with
alpha for the smoothing constant). L4 is the number of degrees of freedom, which is 57
(L4=COUNT(C4:C62)-2). L5 is the t-value calculated as L5=T.INV.2T(L3,L4). Cells L6 and
L7 give the same value of the standard error of the estimate for the whole data set, but
they are calculated in two different ways, as:

L6=SQRT(SUMXMY2(C5:C62,H5:H62)/(COUNT(C5:C62)-1)), and:
L7=STEYX(H5:H62,C5:C62).

Figure 11.75 Same as Figure 11.73, but with the SE values calculated and the prediction
interval

As we said, column I contains the dynamic Standard Errors, as produced automatically
by Excel. We used these values to calculate the prediction interval for every ex-post

forecast given in columns J and K. We are familiar with how to calculate the prediction
interval, as we applied it in Chapter 10 (see equations (10.26)-(10.30) and Example
10.18): I8=SQRT(SUMXMY2(C5:C7,H5:H7)/3), J8=H8-($L$5*I8) and K8=H8+($L$5*I8).
All of them are copied down to row 62.

When we reached the end of the actual time series (row 62 in Figure 11.75), the series
of standard errors of the estimate calculated using the automatic Excel formula
stopped. This means that in order to calculate the future prediction interval for the
forecasts, we have to use some other value of the standard error. We have two options
here. One is that for the first future prediction interval we use the last actual standard
error (the prediction interval is in this case: ŷ_60 ± t-value × SE_59). However, as this last value of
the standard error might be skewed (because it is calculated for a small rolling number
of observations), it is much safer to use the overall standard error of the estimate. Cell
J63=H63-($L$5*I63*SQRT(B63)) and K63=H63+($L$5*I63*SQRT(B63)). The cells from
J64 and K64 copied downwards are: J64=H64-($L$5*$L$6*SQRT(B64)) and
K64=H64+($L$5*$L$6*SQRT(B64)).

As a reminder the equation for the future prediction interval is: ŷt  tvalue  SE  √ℎ. See
equation (10.29) for the explanation for what is meant by h and why we are taking the
square root of h. Also as a reminder, h represents the future increments (h=1, 2, 3, …)
that are used to correct the future prediction interval and anticipate the future
uncertainty. Cells J63:K67 in Figure 11.75 show the future prediction interval.
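To tie the pieces together, here is a hedged Python sketch of the whole procedure (our own variable names and layout, not the workbook's): the ex-post intervals use the rolling, dynamic standard error, while the future intervals use the overall standard error multiplied by √h. We use the n − 2 denominator for the overall SE here, in line with STEYX.

import numpy as np
from scipy import stats

def prediction_intervals(actual, fitted, future_forecasts, window=3, sig=0.05):
    # Ex-post intervals: fitted +/- t * rolling SE (the "dynamic" standard error).
    # Future intervals:  forecast +/- t * overall SE * sqrt(h), h = 1, 2, 3, ...
    actual = np.asarray(actual, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    n = len(actual)
    t_val = stats.t.ppf(1 - sig / 2, df=n - 2)
    overall_se = np.sqrt(np.sum((actual - fitted) ** 2) / (n - 2))

    expost = []
    for t in range(window, n):
        diff = actual[t - window:t] - fitted[t - window:t]
        se = np.sqrt(np.sum(diff ** 2) / window)
        expost.append((fitted[t] - t_val * se, fitted[t] + t_val * se))

    future = []
    for h, f in enumerate(future_forecasts, start=1):
        half = t_val * overall_se * np.sqrt(h)
        future.append((f - half, f + half))
    return expost, future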

Figure 11.76 The actual time series, DES forecasts and the prediction interval

As we can see, the two dotted lines, symbolizing (DES – t × SE) and (DES + t × SE), follow
the red line that represents the DES forecasts. The interval is not of consistent width, because it is not
calculated on the basis of a single standard error for the whole data set. It is calculated
as a dynamic and moving series of standard errors for every rolling 3-period interval.
For the final five prediction intervals we use the constant standard error (for the whole

Page | 696
data set), but they are also multiplied by the square root of the h values 1, 2, 3, …, m for the
future prediction intervals (see equation 10.29). We know from equations (10.26) and
(10.27) that SEŷ,y and RMSE can be used interchangeably for a prediction interval ŷt ± z ×
SEŷ,y, or ŷt ± z × RMSE. Equation (10.29) defines RMSEh (RMSE for multiple steps h) as:

RMSEh = √(h × MSE) = √(h × ∑e²⁄n)

This means that for the prediction interval for the multiple steps in the future, our
equation is: ŷt ± tvalue × SEŷ,y × √h. Why the square root of h? SEŷ,y or RMSE is already a
square root of MSE, which means that we only need to take the square root of h.
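To see the effect of the √h correction with purely illustrative numbers of our own (not taken from this example): if the overall RMSE were 2.0 ppm, the half-width of the future interval, tvalue × RMSE × √h, would be 2.0 × tvalue for h = 1, 4.0 × tvalue for h = 4 and 6.0 × tvalue for h = 9; the interval therefore widens with the square root of the forecast horizon rather than linearly.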

If you compare Figure 11.76 with Figure 11.68, you will see some differences between
the Excel and SPSS solutions. First of all, if in Figure 11.75 we had used the constant value of
SE, rather than the dynamic one, the Excel prediction interval would be identical to the one we
have in SPSS. The only true difference is the future prediction interval between the
Excel and SPSS solutions. The SPSS future prediction interval is much wider than the
one calculated using Excel. We stated that in Excel we would use a quick workaround, ŷt ± t
× RMSEh, as the proper algorithms are too complicated for manual calculation. SPSS uses the
proper algorithm to calculate the future prediction interval, hence the difference.
However, these are the only differences between the two solutions.

SPSS Solution

There is no need to show how this works in SPSS given that these values were already
printed out in Figures 11.65 - 11.68.

Check your understanding

X11.11 How would you use MSE to optimise your forecasts?

X11.12 How is the standard error of forecast calculated in the Excel application related
to exponential smoothing?

X11.13 How would you construct the future prediction interval for forecasts produced
using either double moving averages or double exponential smoothing?

X11.14 Compare the results obtained using Excel function =STEYX() and manual
formula =SQRT(SUMXMY2(known_Y, predicted_Ŷ)/(COUNT(Y)-2)). What conclusions
can you draw?

X11.15 How do you calculate the prediction interval for the future values, as opposed to
historical values when fitting the model to the time series?

Page | 697
11.5 Handling seasonality using exponential smoothing
forecasting
As an alternative to the classical time series decomposition approach, we can also use the
exponential smoothing approach to forecast seasonal data. If you look at equation
(11.12), you will see that effectively we said that the next forecast is equal to the previous
forecast, plus a fraction of the past forecast error. We know that et = yt - Ft, and
therefore, equation (11.12) becomes:

Ft = Ft-1 + αet-1 (11.21)
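The equivalence between this error-correction form and the usual SES update can be checked in a few lines of Python; the sketch below is ours, not taken from the book's workbooks, and assumes the first forecast is initialised with the first observation.

import numpy as np

def ses_two_ways(y, alpha=0.5):
    # Single exponential smoothing written in two equivalent ways:
    #   weighted form: F[t] = alpha*y[t-1] + (1 - alpha)*F[t-1]
    #   error form:    F[t] = F[t-1] + alpha*e[t-1], where e = y - F
    y = np.asarray(y, dtype=float)
    f_weighted = np.empty_like(y)
    f_error = np.empty_like(y)
    f_weighted[0] = f_error[0] = y[0]      # a common initialisation choice
    for t in range(1, len(y)):
        f_weighted[t] = alpha * y[t - 1] + (1 - alpha) * f_weighted[t - 1]
        f_error[t] = f_error[t - 1] + alpha * (y[t - 1] - f_error[t - 1])
    assert np.allclose(f_weighted, f_error)   # the two forms agree
    return f_weighted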

Earlier we introduced the concept of double exponential smoothing and applied it through
the double exponential smoothing (DES) method, which is suitable for following a linear
time series trend and producing mid-range forecasts. This method is sometimes called
Brown’s linear exponential smoothing method. Equations (11.17)-(11.20) are used to
execute the DES method, or Brown’s linear exponential smoothing method. We will reuse
these equations and combine them into one single equation.

Effectively, combining equations (11.19)-(11.20) into equation (11.18) and substituting
equations (11.13) and (11.17) for the single and double exponentially smoothed values, we
get equation (11.22):

𝐹𝑡 = 𝑌̂𝑡 = 2𝑦𝑡−1 − 𝑦𝑡−2 − 2(1 − 𝛼)𝑒𝑡−1 + (1 − 𝛼)²𝑒𝑡−2 (11.22)

We will use this equation (11.22) and combine it with the trend/cycle/seasonal data
principles we learned in the previous chapter.

Classical decomposition combined with exponential smoothing

Let’s use an example to show how the principles of time series decomposition can be
combined with exponential smoothing to produce credible seasonal forecasts.

Example 11.11

The data set used is the average quarterly CO2 emissions in ppm measured at the
Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020. The objective is to
forecast the next four quarters, up to Q2 of 2021. As we already covered most of the
equations, we will go directly to Excel and implement the method. Figures 11.77 and
11.78 show the complete solution.

Excel Solution

The Excel solution is illustrated in Figure 11.77.

Page | 698
Figure 11.77 Time series decomposition combined with exponential smoothing
for seasonal data

Figure 11.78 Calculations for time series decomposition combined with
exponential smoothing

We are showing a somewhat different approach to decomposition here, so let’s go through
it, step by step. The steps are numbered in blue boxes in Figures 11.77-11.78.

In line with the decomposition philosophy, we will first try to isolate the Trend and Cycle
component together. In examples from the previous chapter, we isolated the Trend
component by a simple trend calculation. However, if the time series is too short to reveal

Page | 699
cycles that appear on top of a trend, it is safer to use a different approach. The approach
is: isolate the Trend and Cycle as two combined components bundled together (call it TC).
Step 1: If we calculate the moving averages of the data set, this moving average is
effectively the TC component bundled together. All we need to do is to centre them. If we
have an even number of seasons (four quarters in our case), then centring can be
achieved by calculating an average of the two neighbouring averages. In column E in
Figure 11.77 we take the average of the first four observations and the average of the next
four observations. The average of these two averages is a centred average, and we put it
in the middle of the year, or to the nearest quarter, which is Q3, i.e. cell E6. As we proceed,
column E contains the centred, combined Trend and Cycle component (TC). In
Figure 11.77 cell E6=(AVERAGE(D4:D7)+AVERAGE(D5:D8))/2 is copied down to cell
E23.
Step 2: To establish how much the time series is oscillating around this Trend/Cycle
line, we need to divide the Y data values by the Trend/Cycle values. This ratio indicates
seasonality, and we have it in column F in Figure 11.77. A value of 1 means that the observation
is the same as the Trend/Cycle component. Numbers below 1 show the dips against the
Trend/Cycle line, and numbers above 1 show the upswings against the same line. In
Figure 11.77 cell F6=D6/E6 is copied down to F23.
Step 3: Because our time series consists of quarterly data, the next step is to find
what could be called the typical quarterly indices. This has been achieved in cells P2:R5
in Figure 11.78. We can see that we take all the Q1 ratios and average them, then Q2,
etc. The typical seasonal indices are given in R2:R5 in Figure 11.78. Cell R6 is the sum of
all quarterly indices. Here it adds up to 4, as it should (1 for every quarter), but sometimes this
does not happen, which is the reason why we are showing it. If the sum were above or
below the true sum (4 for quarterly data, or 12 for monthly data, for example), we would
need to adjust our typical seasonal quarterly indices. We have done this in cells R2:R5 in
Figure 11.78.
Step 4: The values of the typical seasonal index are copied to column G in Figure
11.77. You can see that the values from R2:R5 in Figure 11.78 are copied for every
appropriate quarter in column G in Figure 11.77. We can now adjust our time series for
seasonal effects. This is done in column H in Figure 11.77, and every value is nothing but
the original data value divided by the typical seasonal index, i.e. H4=D4/G4 copied down.
Step 5: The forecasts for the seasonally adjusted series are calculated in column I in
Figure 11.77, but you will notice that the first two cells (I4:I5) are just copies of the
original time series. Equation (11.22) “kicks in” from cell I6. What we have in this
column are double exponential smoothing (DES) forecasts produced based on the
seasonally adjusted time series from column H. Cell I6=2*H5-H4-2*(1-$M$3)*J5+((1-
$M$3)^2)*J4 is copied down to I29.

Step 6: You will notice that the formula in column I in Figure 11.77 makes a
reference to column J, which contains the errors. It might seem paradoxical to take errors into
account at the same time as we are producing forecasts. After all, an error is something that
you only realise afterwards, when you compare the actual with the forecasted value. However,

Page | 700
thanks to Excel’s automatic recalculation, we can do this immediately and the cells will be
filled in as we copy them down. Cell J4=H4-I4, copied down to J29.
Step 7: To complete the forecasting process, we need to recompose our DES forecasts
by multiplying them by the typical seasonal index. This is achieved in column K in
Figure 11.77. Cell K4=I4*G4 copied down to K29. How well does this approach fit the
original time series? Let us first look at the graph of the data set vs. its fit and forecasts. Figure
11.79 contains the graph. The alpha smoothing constant used (cell M3 in Figure 11.78)
was 0.5.
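For readers who prefer to see Steps 1-7 as one procedure, here is a compact Python sketch written under our own naming and assumptions (quarterly data starting at the first season of the cycle, unknown future errors treated as zero); it follows the logic described above rather than the exact cell layout of the workbook.

import numpy as np

def seasonal_decomp_des_forecast(y, s=4, alpha=0.5, horizon=4):
    # Steps 1-7: centred MA -> seasonal ratios -> typical indices ->
    # deseasonalise -> DES via eq. (11.22) -> reseasonalise.
    y = np.asarray(y, dtype=float)
    n = len(y)

    # Step 1: centred moving average = combined Trend and Cycle (TC)
    tc = np.full(n, np.nan)
    for t in range(s // 2, n - s // 2):
        tc[t] = (y[t - s // 2:t + s // 2].mean() +
                 y[t - s // 2 + 1:t + s // 2 + 1].mean()) / 2

    # Steps 2-3: seasonal ratios and typical indices, rescaled to sum to s
    ratios = y / tc
    idx = np.array([np.nanmean(ratios[q::s]) for q in range(s)])
    idx *= s / idx.sum()

    # Step 4: seasonally adjusted series
    season = np.array([idx[t % s] for t in range(n + horizon)])
    adj = y / season[:n]

    # Steps 5-6: DES forecasts of the adjusted series using eq. (11.22)
    f = np.empty(n + horizon)
    f[:2] = adj[:2]                      # first two values copied, as in I4:I5
    e = np.zeros(n + horizon)            # unknown future errors assumed to be zero
    hist = list(adj)
    for t in range(2, n + horizon):
        f[t] = (2 * hist[t - 1] - hist[t - 2]
                - 2 * (1 - alpha) * e[t - 1] + (1 - alpha) ** 2 * e[t - 2])
        if t < n:
            e[t] = adj[t] - f[t]
        else:
            hist.append(f[t])            # beyond the data, feed forecasts back in

    # Step 7: recompose with the typical seasonal index
    return f * season

For instance, calling seasonal_decomp_des_forecast(co2, s=4, alpha=0.5, horizon=4), with co2 holding the quarterly values, would return the fitted values for the historical quarters followed by the four recomposed forecasts.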

Figure 11.79 A graph of the quarterly CO2 emissions in ppm measured at the
Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020 and forecasts using
the time series decomposition and exponential smoothing combined

As we can see, visually we have a good fit. In the first 5 periods our forecasts are
not completely accurate, which is a consequence of using a very short time series (only
20 observations). For a longer time series, this initial mismatch would become even less
relevant.

From Example 11.8 we know that we can optimise the forecast by changing the value of
alpha. This is achieved using Excel Solver. We have not done this here, but you can try to
see how much you can improve the forecasts. In Figure 11.78 we have the value of MSE
in cell M4=SUMXMY2(D4:D25,K4:K25) /COUNT(D4:D25) and RMSE in cell
M5=SQRT(M4), and you can use them to optimise the forecasts.
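If you would rather do what Solver does outside Excel, a sketch along the following lines searches for the alpha that minimises the in-sample MSE; it relies on the hypothetical seasonal_decomp_des_forecast helper sketched earlier in this section.

import numpy as np
from scipy.optimize import minimize_scalar

def optimise_alpha(y, s=4):
    # Find the smoothing constant alpha that minimises the in-sample MSE
    # between the original data and the recomposed forecasts.
    y = np.asarray(y, dtype=float)

    def mse(alpha):
        fitted = seasonal_decomp_des_forecast(y, s=s, alpha=alpha, horizon=0)
        return float(np.mean((y - fitted) ** 2))

    result = minimize_scalar(mse, bounds=(0.01, 0.99), method="bounded")
    return result.x, result.fun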

We know that to properly validate our forecasts we need to do some analysis with
forecast errors as well as to show the prediction interval for our forecasts. As before, we
are keeping this for the very end of the section in this chapter.

SPSS Solution

Page | 701
SPSS does not have a “ready-made” solution that is identical to the approach we took here.
However, it offers some other useful methods, and we will show how to use one of them
in the next section.

Holt-Winters’ seasonal exponential smoothing

One of the most effective forecasting methods for seasonal data, based on exponential
smoothing, is the Holt-Winters’ method. It uses three smoothing equations and three
smoothing constants, but it is not as complex as it sounds. The three equations are
dedicated to three different components, namely:

1. Level (ℓt)
2. Trend (bt)
3. Seasonality (St)

Effectively we are combining the linear regression components (intercept or level, and
slope) with the seasonal component. The only difference is that in the case of the Holt-
Winters’ method, these components are dynamic. Recall that in simple regression we had
only one value of a, which is the intercept, and one value of b, which is the slope. Here, we
have changing values of at and bt. However, to avoid confusion, we will call the intercept
the level ℓt, the slope the trend bt, and the seasonal component St.

There is one more point to remember. Holt-Winters’ method comes in two “flavours”. One
is the additive model, where the components are combined additively, and the other is the
multiplicative model, where they are combined multiplicatively. We use the additive Holt-
Winters’ model mainly for stationary time series and the multiplicative Holt-Winters’ model
predominantly for non-stationary time series.

And lastly, the smoothing constant alpha (α) is used only for the level equation, the smoothing
constant beta (β) only for the trend equation, and gamma (γ) only for the seasonal equation.
Let us show these three equations.

For the additive model, the equations are:

ℓ𝑡 = 𝛼(𝑦𝑡 − 𝑆𝑡−𝑠 ) + (1 − 𝛼)(ℓ𝑡−1 + 𝑏𝑡−1 ) (11.23)

𝑏𝑡 = 𝛽(ℓ𝑡 − ℓ𝑡−1 ) + (1 − 𝛽)𝑏𝑡−1 (11.24)

𝑆𝑡 = 𝛾(𝑦𝑡 − ℓ𝑡 ) + (1 − 𝛾)𝑆𝑡−𝑠 (11.25)

For the multiplicative model, the equations are:


ℓ𝑡 = 𝛼(𝑦𝑡 / 𝑆𝑡−𝑠) + (1 − 𝛼)(ℓ𝑡−1 + 𝑏𝑡−1) (11.26)

𝑏𝑡 = 𝛽(ℓ𝑡 − ℓ𝑡−1 ) + (1 − 𝛽)𝑏𝑡−1 (11.27)


𝑆𝑡 = 𝛾(𝑦𝑡 / ℓ𝑡) + (1 − 𝛾)𝑆𝑡−𝑠 (11.28)

Page | 702
The small s in the above equations is used for periodicity or seasonality (for quarterly
data s=4, for monthly s=12, etc.). This means that forecasts for the additive and
multiplicative models are produced respectively as:

𝐹𝑡+𝑚 = ℓ𝑡 + 𝑏𝑡 𝑚 + 𝑆𝑡−𝑠+𝑚 (11.29)

𝐹𝑡+𝑚 = (ℓ𝑡 + 𝑏𝑡 𝑚)𝑆𝑡−𝑠+𝑚 (11.30)

Where m is the number of forecasts ahead and the values of the smoothing constants
will be as follows:

0 < α ≤ 1, 0 < β ≤ 1 and 0 < γ ≤ 1 − α

To start the calculations, the initial values of ℓt, bt and St are:


ℓ0 = (∑𝑡=1..𝑠 𝑦𝑡) / 𝑠 (11.31)

𝑏0 = 0 (11.32)

𝑆0,𝑠 = 𝑦𝑡 / ℓ0 (11.33)

The above equations indicate that ℓ0 is equal to the average value of just the first-year
data. b0 is zero and S0,s is calculated only for the first year (because here t=1, …, s) as the ratio
between the individual observations and their first-year average.
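A minimal Python sketch of the multiplicative recursion, using the initial values just described, may make the equations more concrete; the function and variable names are ours, and the sketch assumes the series starts at the first season of the cycle.

import numpy as np

def holt_winters_multiplicative(y, s=4, alpha=0.5, beta=0.5, gamma=0.5, horizon=4):
    # Multiplicative Holt-Winters, eqs (11.26)-(11.28); forecasts via eq (11.30).
    # Initialisation follows eqs (11.31)-(11.33).
    y = np.asarray(y, dtype=float)
    n = len(y)

    level = y[:s].mean()                 # eq (11.31): average of the first year
    trend = 0.0                          # eq (11.32)
    season = list(y[:s] / level)         # eq (11.33): one index per season

    fitted = []
    for t in range(n):
        s_old = season[t % s]            # seasonal index from s periods earlier
        fitted.append((level + trend) * s_old)           # one-step-ahead forecast
        new_level = alpha * (y[t] / s_old) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % s] = gamma * (y[t] / new_level) + (1 - gamma) * s_old
        level = new_level

    forecasts = [(level + trend * m) * season[(n + m - 1) % s]
                 for m in range(1, horizon + 1)]          # eq (11.30)
    return np.array(fitted), np.array(forecasts)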

Note that if β = 0 and γ = 0, then the Holt-Winters’ model is equivalent to single
exponential smoothing (SES).

Let us see how this works. We will use the same data set as in Example 11.11.

Example 11.12

The data set used is the average quarterly CO2 emissions in ppm measured at the
Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020. The objective is to
forecast the next four quarters, up to Q2 of 2021. We will skip the manual equations and
move directly to the Excel solution.

Excel Solution

Figure 11.80 shows how the model was implemented in Excel. Note that in all three
equations we used the initial value of 0.5 for all three smoothing constants. This is
arbitrary. We know that the range is 0 < α ≤ 1, 0 < β ≤ 1 and 0 < γ ≤ 1 − α, so 0.5 is as good
as any other value in this range.

Page | 703
Figure 11.80 Holt-Winters’ forecasting method for seasonal data

Cell E7=AVERAGE(D8:D11) and cell E8=$J$3*(D8/G4)+(1-$J$3)*(E7+F7), copied down
to E29. Cell F7=0 and cell F8=$J$4*(E8-E7)+(1-$J$4)*F7, copied down to F29. Cell
G4=D8/AVERAGE($D$8:$D$11) copied to G7 and G8=$J$5*(D8/E8)+(1-$J$5)*G4, copied
down to G29. This covers ℓt, bt and St. Forecasts Ft are calculated in column H, starting
with H8=(E7+F7)*G4, copied down to H29. And finally, errors et in column I are calculated
as I8=D8-H8, copied down to I29.

The formulae in Figure 11.80 are the direct implementations of all the equations from
this section in the Excel format. Column E in Figure 11.80 is equation (11.26), column F
equation (11.27), column G equation (11.28) and column H is equation (11.30). To
check the visual appearance of our forecasts (column H in Figure 11.80), we produce
the graph in Figure 11.81.

Page | 704
Figure 11.81 A graph of the quarterly CO2 emissions in ppm measured at the
Mauna Loa observatory in Hawaii from Q1 2015 to Q2 2020 and forecasts using
the Holt-Winters’ method

We can see more than a reasonable fit with the data set. However, we can again
optimise this further using Excel Solver, as we did in Example 11.9. Example 11.13
demonstrates how we did the optimization here.

Example 11.13

We will continue where we stopped in Example 11.12 by using Figure 11.80 to optimise
the values of the three smoothing constants.

Excel Solution

For optimisation, we invoked the following Solver dialogue box:

Page | 705
Figure 11.82 Solver dialogue box for Holt-Winters’ forecasts

As before, we are aiming to minimise cell J6 in Figure 11.80, which is the value of
MSE. The objective will be achieved by changing cells J3:J5 in Figure 11.80, which are
the three smoothing constants. The constraints we used to limit the changes are that both α
and β should be between zero and one, whilst γ should be between 0 and 1 − α.

After we clicked on the Solve button, the improvement was not dramatic. The MSE
went down from 0.979 to 0.891. We set the initial value of all three smoothing constants
to 0.5. After optimisation we got: α=0.41, β=0.31 and γ=0.58. Although this is not a large
drop in MSE, we can still say that we optimised our forecasts. Figure 11.83 shows the
optimised version of the same example.
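The same optimisation can be reproduced outside Excel. The sketch below mirrors the Solver set-up (minimise the in-sample MSE over α, β and γ, with γ constrained to be at most 1 − α) and relies on the hypothetical holt_winters_multiplicative helper sketched earlier.

import numpy as np
from scipy.optimize import minimize

def optimise_holt_winters(y, s=4):
    # Minimise the in-sample MSE over (alpha, beta, gamma) with gamma <= 1 - alpha.
    y = np.asarray(y, dtype=float)

    def mse(params):
        a, b, g = params
        fitted, _ = holt_winters_multiplicative(y, s=s, alpha=a, beta=b,
                                                gamma=g, horizon=0)
        return float(np.mean((y - fitted) ** 2))

    result = minimize(
        mse,
        x0=[0.5, 0.5, 0.5],
        bounds=[(0.01, 1.0)] * 3,
        constraints=[{"type": "ineq", "fun": lambda p: 1.0 - p[0] - p[2]}],
        method="SLSQP",
    )
    return result.x, result.fun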

Page | 706
Figure 11.83 Optimised constants for Holt-Winters’ method using Solver in Excel

As we have not changed any formulae in the spreadsheet, apart from the optimised values in
cells J3:J7, there is no need to repeat the Excel Solution descriptions. All the cells retain
the same formulae as in Example 11.12. In Figure 11.84 we show the results of the
optimised forecasts in a visual form. The results are not much different, but some
improvement has been achieved.

Figure 11.84 Optimised forecasts (for the basic case see Figure 11.81)

Page | 707
SPSS Solution

The data are entered by year/quarter, rather than just by year as in the previous examples.

Figure 11.85 Quarterly CO2 emissions in ppm measured at the Mauna Loa
observatory in Hawaii from Q1 2015 to Q2 2020 time series in SPSS

Select Analyze > Forecasting > Create Traditional Model

Transfer Series into Dependent Variables box

Select Method: Exponential Smoothing

Click on Criteria box. Select Winters’ multiplicative model

Figure 11.86 A dialogue box for selecting Winters’ multiplicative model, which is
the Holt-Winters’ method

Page | 708
Click on Continue

Figure 11.87 Selecting the variables for analysis

Click on Statistics tab and choose options

Figure 11.88 Selecting the statistics to display in outputs

Click on Plots tab and select options

Page | 709
Figure 11.89 Selecting the plots to display

Click on Save tab and select options

Figure 11.90 Selecting the confidence interval

Click on Options tab and select options

Page | 710
Figure 11.91 Defining the confidence interval width
Click OK

New variables have been created showing the predicted values, the lower and upper confidence
limits (i.e. the prediction interval), and the residuals, as in Figure 11.92.

Figure 11.92 Complete output with fitted data using the Winters’ multiplicative
model (Holt-Winters’ method) together with the prediction interval and the future
forecasts

SPSS Output

Page | 711
Figure 11.93 Error statistics for the selected model

Figure 11.94 Model statistics

Figure 11.95 Constants for the Winters’ multiplicative model (Holt-Winters’
method)

Figure 11.96 Forecasts and the prediction interval for the future 4 observations

Note that for the level we used the word alpha for the smoothing constant, which is the
same term that SPSS uses. However, we used the word beta for the trend component, whereas
SPSS uses the word gamma. On the other hand, we used the word gamma for the seasonal
component, whereas SPSS uses the word delta.

And finally, the graph output is as in Figure 11.97.

Page | 712
Figure 11.97 SPSS graph of the historical data and the future forecasts with the
prediction interval

The results are not completely identical to the ones from Excel, for the simple reason that
SPSS optimises the three constants differently from what we did with the Solver function in
Excel, and it initialises the starting values in a different way. Nevertheless, the forecasted
values are in close vicinity to one another. We did not show how to calculate the
prediction interval in this example using Excel, but it is identical to all the previous ones.
However, in Figure 11.98 we show the Excel graph of the forecasts and the prediction
interval. As an exercise, readers might like to reproduce these calculations in their
versions of the spreadsheet.

Figure 11.98 Excel graph of the historical data and the future forecasts with the
prediction interval

Check your understanding

Page | 713
X11.16 Use the dataset from Table 11.3 and apply the decomposition method using
centred moving averages as the TC component. Forecast the next four periods in
2021.

Year Q Y Year Q Y
2017 1 100.0 2019 1 141.0
2 160.0 2 204.0
3 150.0 3 201.0
4 140.0 4 179.0
2018 1 122.0 2020 1 150.0
2 188.0 2 200.0
3 160.0 3 220.0
4 159.0 4 200.0
Table 11.3 A seasonal time series data

X11.17 Use forecasts from X11.16 and construct the future prediction interval using
RMSE and h.

X11.18 Use the data from Table 11.3 (same as X11.16) and apply Holt-Winters method
to forecasts for the next four periods in 2021. Use 0.5 as the value for alpha, beta and
gamma.

X11.19 Optimise forecasts from X11.18 by using Microsoft Solver. What are the new
values of alpha, beta and gamma?

X11.20 Calculate the prediction interval for forecasts from X11.19.

Chapter summary
Short-term and medium-term forecasting methods rely very much on moving average
techniques and exponential smoothing, both of which were introduced in this chapter.
We started with simple, or single, moving averages (SMA), and demonstrated how this
“smoothing technique” can also be used to produce short-term (one-period-ahead)
forecasts. This was followed by the introduction of double moving averages (DMA). They
were necessary to introduce a simple DMA linear method, which enables us to produce
mid-term forecasts (2 to approximately 6 periods ahead).

The next forecasting method was based on Brown’s simple exponential smoothing
technique. When this technique was converted into a short-term forecasting method, it
became the single exponential smoothing (SES) method. We demonstrated how to use it
to produce just one-period-ahead forecasts. Just as with DMA, we subsequently
introduced the double exponential smoothing (DES) technique, which enabled us to
produce DES forecasts that were linear in nature and capable of producing reasonable
forecasts on a medium-term basis (2-6 periods ahead).

We also reminded ourselves of the concept of forecasting error and explored how these
errors could be used to improve forecasts. Specifically, we used MSE together with the
Excel Solver function to optimise the value of the smoothing constant alpha. This

Page | 714
enabled us to achieve highly optimised forecasts that fit the actual time series more
closely.

The construction of the prediction interval for short-term and medium-term forecasts
was also introduced. The application was very similar to what we learned in the previous
chapter, with the exception that in this case we used the notion of the rolling standard
error.

We concluded the chapter by combining the principles of the classical time series
decomposition method and exponential smoothing. At first, we used an improvised
method to combine the two. This was followed by the introduction of the Holt-Winters’
method, as one of the best performing short to medium-range forecasting techniques
suitable for seasonal time series.

Test your understanding


TU11.1 Take the below time series that consists of 30 observations:

Time Series Time Series Time Series


1 26.21 11 36.41 21 29.24
2 25.72 12 29.14 22 28.75
3 27.51 13 28.42 23 28.02
4 28.32 14 28.58 24 26.69
5 28.42 15 29.05 25 25.08
6 28.28 16 30.25 26 23.39
7 27.1 17 29.42 27 22.65
8 32.35 18 27.38 28 22.02
9 35.33 19 27.68 29 23.39
10 33.35 20 30.22 30 26.35
Table 11.4

a) Calculate single moving average forecasts (SMA) for the following periods: 3,
5, 7 and 9. What can you conclude?
b) Calculate double moving averages (DMA) for the SMA values from a).

TU11.2 For the same time series as in TU1, calculate the following:
a) Calculate single exponential smoothing forecasts (SES) for the following values
of alpha: 0.1, 0.3, 0.5 and 0.9. What can you conclude?
b) Calculate double exponential smoothing values (DES) for the SES values from
a).

TU11.3 For the same time series as in TU1, calculate the following:
a) Produce forecasts using 5DMA for 6 periods in the future.
b) Produce forecasts using alpha=0.3 DES for 6 periods in the future.
c) Calculate the ME and MSE for DMA and DES. Which forecast would you
recommend?

TU11.4 Optimise alpha using the Solver. Which forecast is better now?

Page | 715
TU11.5 For the forecast you decided is the best from TU4, calculate the prediction
interval for the fitted series and the future 6 forecasts.

TU11.6 As part of a longitudinal study, look at the UK birth rates. Table 11.5 below
shows data for 1960-2018.

       UK Birth rate             UK Birth rate             UK Birth rate
Year   per 1,000 people   Year   per 1,000 people   Year   per 1,000 people
1960 17.5 1980 13.4 2000 11.5
1961 17.9 1981 13 2001 11.3
1962 18.3 1982 12.8 2002 11.3
1963 18.5 1983 12.8 2003 11.7
1964 18.8 1984 12.9 2004 11.9
1965 18.3 1985 13.3 2005 12
1966 17.9 1986 13.3 2006 12.3
1967 17.5 1987 13.7 2007 12.6
1968 17.2 1988 13.8 2008 12.9
1969 16.6 1989 13.6 2009 12.7
1970 16.2 1990 13.9 2010 12.9
1971 16.1 1991 13.8 2011 12.8
1972 14.9 1992 13.6 2012 12.8
1973 13.9 1993 13.2 2013 12.1
1974 13.1 1994 13 2014 12
1975 12.4 1995 12.6 2015 11.9
1976 12 1996 12.6 2016 11.8
1977 11.7 1997 12.5 2017 11.4
1978 12.2 1998 12.3 2018 11.0
1979 13.1 1999 11.9 2019
Table 11.5

a) Produce forecasts for 2019 using a Linear trend, Parabola, Power trend and
Polynomial trend, as well as 5MA and SES with alpha=0.3.
b) Use error metrics to evaluate the best method (use ME, MAD, MSE, RMSE, MPE
and MAPE).
c) Calculate the prediction interval for the chosen method.

Want to learn more?


The textbook online resource centre contains a range of documents to provide further
information on the following topics:
1. S11Wa Different ways to implement exponential smoothing forecasting in Excel
2. S11Wb Excel ETS seasonal exponential forecasting

Page | 716
Appendices
Appendix A Microsoft Excel Functions
Table A.1 provides a list of all Excel functions that you may find helpful in solving
business statistics type problems. The Excel function includes a link to the Microsoft
support website for that Excel function.

Function Description
1 AVEDEV Returns the average of the absolute deviations of
data points from their mean
2 AVERAGE Returns the average of its arguments
3 AVERAGEA Returns the average of its arguments, including
numbers, text, and logical values
4 AVERAGEIF Returns the average (arithmetic mean) of all the
cells in a range that meet a given criteria
5 AVERAGEIFS Returns the average (arithmetic mean) of all cells
that meet multiple criteria.
6 BASE Converts a number into a text representation
with the given radix (base)
7 BINOM.DIST Returns the individual term binomial distribution
probability
8 BINOM.DIST.RANGE Returns the probability of a trial result using a
binomial distribution
9 BINOM.INV Returns the smallest value for which the
cumulative binomial distribution is less than or
equal to a criterion value
10 BINOMDIST Returns the individual term binomial distribution
probability
11 CHIDIST Returns the one-tailed probability of the chi-
squared distribution
12 CHIINV Returns the inverse of the one-tailed probability
of the chi-squared distribution
13 CHISQ.DIST Returns the cumulative beta probability density
function
14 CHISQ.DIST.RT Returns the one-tailed probability of the chi-
squared distribution
15 CHISQ.INV Returns the inverse of the left-tailed probability
of the chi-squared distribution.
16 CHISQ.INV.RT Returns the inverse of the one-tailed probability
of the chi-squared distribution
17 CHISQ.TEST Returns the test for independence
18 CHITEST Returns the test for independence

Page | 717
19 COMBIN Returns the number of combinations for a given
number of objects
20 COMBINA Returns the number of combinations with
repetitions for a given number of items
21 CONFIDENCE Returns the confidence interval for a population
mean
22 CONFIDENCE.NORM Returns the confidence interval for a population
mean
23 CONFIDENCE.T Returns the confidence interval for a population
mean, using a Student's t distribution
24 CORREL Returns the correlation coefficient between two
data sets
25 COUNT Counts how many numbers are in the list of
arguments
26 COUNTA Counts how many values are in the list of
arguments
27 COUNTBLANK Counts the number of blank cells within a range
28 COUNTIF Counts the number of cells within a range that
meet the given criteria
29 COUNTIFS Counts the number of cells within a range that
meet multiple criteria
30 COVAR Returns covariance, the average of the products
of paired deviations
31 COVARIANCE.P Returns covariance, the average of the products
of paired deviations
32 COVARIANCE.S Returns the sample covariance, the average of
the products deviations for each data point pair in
two data sets
33 CRITBINOM Returns the smallest value for which the
cumulative binomial distribution is less than or
equal to a criterion value
34 DEVSQ Returns the sum of squares of deviations
35 EXP Returns ‘e’ raised to the power of a given number
36 EXPON.DIST Returns the exponential distribution
37 EXPONDIST Returns the exponential distribution
38 F.DIST Returns the F probability distribution
39 F.DIST.RT Returns the (right-tailed) F probability distribution
for two data sets.
40 F.INV Returns the inverse of the F probability
distribution
41 F.INV.RT Returns the inverse of the (right-tailed) F
probability distribution.

Page | 718
42 F.TEST Returns the result of an F-test
43 FACT Returns the factorial of a number
44 FACTDOUBLE Returns the double factorial of a number
45 FDIST Returns the F probability distribution
46 FINV Returns the inverse of the F probability
distribution
47 FORECAST Returns a value along a linear trend
48 FORECAST.ETS Uses an exponential smoothing algorithm to
predict a future value on a timeline, based on a
series of existing values
49 FORECAST.ETS.CONFINT Returns a confidence interval for a forecast value
at a specified target date.
50 FORECAST.ETS.SEASONALITY Returns the length of the repetitive pattern Excel
detects for a specified time series.
51 FORECAST.ETS.STAT Returns a statistical value relating to a time series
forecasting.
52 FORECAST.LINEAR Predicts a future point on a linear trend line fitted
to a supplied set of x- and y- values.
53 FREQUENCY Returns a frequency distribution as a vertical
array
54 FTEST Returns the result of an F-test
55 GEOMEAN Returns the geometric mean
56 HARMEAN Returns the harmonic mean
57 HYPGEOM.DIST Returns the hypergeometric distribution
58 HYPGEOMDIST Returns the hypergeometric distribution
59 IF Specifies a logical test to perform
60 IFS Tests a number of supplied conditions and
returns a result corresponding to the first
condition that evaluates to TRUE.
61 INT Rounds a number down to the nearest integer
62 INTERCEPT Returns the intercept of the linear regression line
63 KURT Returns the kurtosis of a data set
64 LARGE Returns the k-th largest value in a data set
65 LINEST Returns the parameters of a linear trend
66 MAX Returns the largest value from a list of supplied
numbers

Page | 719
67 MAXIFS Returns the largest value from a subset of values
in a list that are specified according to one or
more criteria.
68 MIN Returns the smallest value from a list of supplied
numbers
69 MINIFS Returns the smallest value from a subset of
values in a list that are specified according to one
or more criteria.
70 MEDIAN Returns the median of the given numbers
71 MODE Returns the most common value in a data set
72 NEGBINOM.DIST Returns the negative binomial distribution
73 NEGBINOMDIST Returns the negative binomial distribution
74 NORM.DIST Returns the normal cumulative distribution
75 NORM.INV Returns the inverse of the normal cumulative
distribution
76 NORM.S.DIST Returns the standard normal cumulative
distribution
77 NORM.S.INV Returns the inverse of the standard normal
cumulative distribution.
78 NORMDIST Returns the normal cumulative distribution
79 NORMINV Returns the inverse of the normal cumulative
distribution
80 NORMSDIST Returns the standard normal cumulative
distribution
81 NORMSINV Returns the inverse of the standard normal
cumulative distribution
82 PEARSON Returns the Pearson product moment correlation
coefficient
83 PERCENTILE Returns the k-th percentile of values in a range
84 PERCENTILE.EXC Returns the k-th percentile of values in a range,
where k is in the range 0 to 1, exclusive
85 PERCENTILE.INC Returns the k-th percentile of values in a range,
where k is in the range 0 to 1, inclusive.
86 PERCENTRANK Returns the percentage rank of a value in a data
set
87 PERCENTRANK.EXC Returns the rank of a value in a data set as a
percentage (0..1, exclusive) of the data set
88 PERCENTRANK.INC Returns the percentage rank of a value in a data
set
89 PERMUT Returns the number of permutations for a given
number of objects

Page | 720
90 PERMUTATIONA Returns the number of permutations for a given
number of objects (with repetitions) that can be
selected from the total objects
91 PI Returns the value of pi
92 POISSON Returns the Poisson distribution
93 POISSON.DIST Returns the Poisson distribution
94 POWER Returns the result of a number raised to a power
95 QUARTILE Returns the quartile of a data set
96 QUARTILE.EXC Returns the quartile of the data set, based on
percentile values from 0..1, exclusive
97 QUARTILE.INC Returns the quartile of a data set
98 RAND Returns a random number between 0 and 1
99 RANDBETWEEN Returns a random number between the numbers
you specify
100 RANK Returns the rank of a number in a list of numbers
101 RANK.AVG Returns the rank of a number in a list of numbers
102 RANK.EQ Returns the rank of a number in a list of numbers
103 ROUND Rounds a number to a specified number of digits
104 ROUNDDOWN Rounds a number down, toward zero
105 ROUNDUP Rounds a number up, away from zero
106 RSQ Returns the square of the Pearson product
moment correlation coefficient
107 SKEW Returns the skewness of a distribution
108 SKEW.P Returns the skewness of a distribution based on a
population: a characterization of the degree of
asymmetry of a distribution around its mean
109 SLOPE Returns the slope of the linear regression line
110 SMALL Returns the k-th smallest value in a data set
111 SQRT Returns a positive square root
112 STANDARDIZE Returns a normalized value
113 STDEV Estimates standard deviation based on a sample
114 STDEV.P Calculates standard deviation based on the entire
population
115 STDEV.S Estimates standard deviation based on a sample
116 STDEVA Estimates standard deviation based on a sample,
including numbers, text, and logical values

Page | 721
117 STDEVP Calculates standard deviation based on the entire
population
118 STDEVPA Calculates standard deviation based on the entire
population, including numbers, text, and logical
values
119 STEYX Returns the standard error of the predicted y-
value for each x in the regression
120 SUM Adds its arguments
121 SUMIF Adds the cells specified by a given criteria
122 SUMIFS Adds the cells in a range that meet multiple
criteria
123 SUMPRODUCT Returns the sum of the products of corresponding
array components
124 SUMSQ Returns the sum of the squares of the arguments
125 SUMX2MY2 Returns the sum of the difference of squares of
corresponding values in two arrays
126 SUMX2PY2 Returns the sum of the sum of squares of
corresponding values in two arrays
127 SUMXMY2 Returns the sum of squares of differences of
corresponding values in two arrays
128 T Converts its arguments to text
129 T.DIST Returns the Student's left-tailed t-distribution.
130 T.DIST.2T Returns the cumulative, two-tailed Student's t-
distribution.
131 T.DIST.RT Returns the cumulative, right-tailed Student's t-
distribution.
132 T.INV Returns the t-value of the Student's t-distribution
as a function of the probability and the degrees of
freedom
133 T.INV.2T Returns the two-tailed inverse of the Student's t-
distribution.
134 T.TEST Returns the probability associated with a
Student's t-test
135 TDIST Returns the Student's t-distribution
136 TINV Returns the inverse of the Student's t-distribution
137 TREND Returns values along a linear trend
138 TRUNC Truncates a number to an integer
139 TTEST Returns the probability associated with a
Student's t-test
140 VAR Estimates variance based on a sample

Page | 722
141 VAR.P Calculates variance based on the entire
population
142 VAR.S Estimates variance based on a sample
143 VARA Estimates variance based on a sample, including
numbers, text, and logical values
144 VARP Calculates variance based on the entire
population
145 Z.TEST Returns the one-tailed probability-value of a z-
test
146 ZTEST Returns the one-tailed probability-value of a z-
test

Page | 723
Appendix B Areas of the standardised normal curve

Page | 724
Appendix C Percentage points of the Student’s t distribution (5%
and 1%)

Page | 725
Appendix D Percentage points of the chi-square distribution

Page | 726
Appendix E Percentage points of the F distribution

Upper 5%

Page | 727
Upper 2.5%

Page | 728
Upper 1%

Page | 729
Appendix F Binomial critical values

Page | 730
Appendix G Critical values of the Wilcoxon matched-pairs signed-
ranks test
Source: White, Yeats, Skipworth - Tables for statisticians (1979, table 21, p32, Stanley
Thornes Ltd).

Page | 731
Appendix H Probabilities for the Mann–Whitney U test
Source: White, Yeats, Skipworth - Tables for statisticians (1979, table 22, pages 33 - 35,
Stanley Thornes Ltd).

Mann–Whitney p-values (n2 = 3)

Mann–Whitney p-values (n2 = 4)

Mann–Whitney p-values (n2 = 5)

Page | 732
Mann–Whitney p-values (n2 = 6)

Mann–Whitney p-values (n2 = 7)

Page | 733
Mann–Whitney p-values (n2 = 8)

Page | 734
Appendix I Statistical glossary
Adjusted r2 This is the same as R-squared but adjusted for the sample size (see R-
squared or coefficient of determination).
Alpha, α Alpha refers to the probability that the true population parameter lies outside
the confidence interval. Not to be confused with the symbol alpha in a time series
context (exponential smoothing), where alpha is the smoothing constant.
Alternative hypothesis (H1) The alternative hypothesis, H1, is a statement of what a
statistical hypothesis test is set up to establish.
Availability sampling See convenience sampling.
Average This is a vague term for central tendency that usually is equivalent to the
arithmetic average (or mean) but could also be represented by the median, mode, or
geometric mean.
Average from a frequency distribution Arithmetic average for data in a frequency
distribution.
Arithmetic mean The sum of a list of numbers divided by the number of numbers.
Bar chart A bar chart is a way of summarising a set of categorical data using bar shapes.
Beta, β Beta refers to the probability that a false population parameter lies inside the
confidence interval. In the context of exponential smoothing, it is one of the smoothing
constants.
Binomial experiment An experiment consisting of a fixed number of independent
trials each with two possible outcomes, success and failure, and the same probability of
success. The probability of a given number of successes is described by a binomial
distribution.
Binomial distribution A distribution created by binomial experiments. A binomial
experiment is a statistical experiment that has the following properties: n repeated
trials, each trial with two possible outcomes; probability of success p; and trials are
independent.
Bootstrapping It is a statistical procedure that resamples a single dataset to create
many simulated samples.
Box plot A box plot is a way of summarising a set of data measured on an interval scale
(also called a box-and-whisker plot).
Brown’s single exponential smoothing method Identical to single exponential
smoothing (SES), with the exception that it is placed in line with the observation that is
smoothed, as opposed to SES which is placed as the next value and, as such, becomes a
forecast.
Categorical A variable whose value ranges over categories, such as (red, green, blue), or
(male, female).
Category A class or division of people or things regarded as having shared
characteristics.
Central limit theorem States that when a large number of simple random samples are
selected from the population and the mean is calculated for each sample, the
distribution of these sample means will assume the normal probability distribution,
even if the original population from which the samples were selected was not normally
distributed.
Central tendency This is a catch-all term for the location of the middle or the centre of
a distribution.

Page | 735
Characteristics of a random experiment A random experiment is a statistical
experiment that has the following properties: the experiment can be repeated any
number of times; a random trial consists of at least two possible outcomes (e.g.
success/failure, Monday/Tuesday/Wednesday); and the result depends on chance and
cannot be predicted uniquely.
Chart A chart is a graphical representation of data.
Chi-square distribution The chi-square distribution is a mathematical distribution
that is used directly or indirectly in many tests of significance.
Chi-square test This applies the chi-square distribution to test for homogeneity,
independence, or goodness of fit.
Chi-square test for independence A chi-square test for independence is applied when
you have two categorical variables from a single population.
Chi-square test for two independent samples The chi-square test of independence
tests if there is no difference in the distribution of responses to the outcome across
comparison groups.
Chi-square test for variance The chi-square test for variance is used to test the null
hypothesis that the variance of the population from which the data sample is drawn is
equal to a hypothesised value.
Class interval In creating a frequency distribution or plotting a histogram, one starts by
dividing the range of values into a set of non-overlapping intervals, called class
intervals, in such a way that every datum is contained in some class interval.
Class limit (or class boundary) A point that is the left endpoint of one class interval,
and the right endpoint of another class interval.
Class mid-point For any given class interval of a frequency distribution, this is the
value halfway across the class interval, (upper class limit + lower class limit)/2.
Class widths equal The distance between the lower- and upper-class limits for each
class has the same numerical value.
Class widths unequal The distance between the lower- and upper-class limits for each
class does not have the same numerical value.
Cluster sampling This is a sampling technique used when ‘natural’ but relatively
homogeneous groupings are evident in a statistical population. It is often used in
marketing research. In this technique, the total population is divided into these groups
(or clusters) and a simple random sample of the groups is selected.
Coefficient of correlation This measures the strength and direction of the linear
relationship between two variables. Sometimes it is referred to as the Pearson product
moment coefficient of correlation.
Coefficient of determination (COD) This is the proportion of the variance in the
dependent variable that is predicted from the independent variable. Also called R-
squared or r2.
Coefficient of variation The coefficient of variation measures the spread of a set of
data as a proportion of its mean. It is often expressed as a percentage.
Component bar chart A subdivided or component bar chart is used to represent data
in which the total magnitude is divided into different or components.
Confidence interval A range of likely values for estimates and forecasts, usually
expressed as 90%, 95%, 99% or any other interval from the trend, within which the
likely values of the forecast reside (also known as a prediction interval).
Confidence interval of a population mean This is an interval estimate for the
population mean based upon the sample mean.

Page | 736
Confidence interval of a population proportion This is an interval estimate for the
population proportion based upon the sample proportion.
Confidence interval of a population mean when the sample size is small This is an
interval estimate for the population mean based upon the sample mean where the
sample size is small.
Consistent estimator This is an estimator having the property that as the number of
data points used increases indefinitely, the resulting sequence of estimates converges to
the population value.
Contingency table A contingency table is a table of frequencies classified according to
the values of the variables in question.
Continuous A set of data is said to be continuous if the values/observations belonging
to it may take on any value within a finite or infinite interval. You can count, order and
measure continuous data.
Continuous probability distribution If a random variable is a continuous variable, its
probability distribution is called a continuous probability distribution.
Continuous random variable This is a random variable where the data can take
infinitely many values.
Convenience sampling This is a non-probability sampling technique where subjects
are selected because of their convenient accessibility and proximity to the researcher.
This method is also sometimes referred to as haphazard, accidental, or availability
sampling.
Coverage error Coverage error occurs in statistical estimates of a survey. It results
from gaps between the sampling frame and the total population. This can lead to biased
results and can affect the variance of results.
Critical test statistic The critical value for a hypothesis test is a limit at which the value
of the sample test statistic is judged to be such that the null hypothesis may be rejected.
Cumulative distribution function A function whose value is the probability that a
corresponding continuous random variable has a value less than or equal to the
argument of the function. See also probability density function.
Cumulative frequency distribution A cumulative frequency distribution is the sum of
the class and all classes below it in a frequency distribution. Rather than displaying the
frequencies from each class, a cumulative frequency distribution displays a running
total of all the preceding frequencies.
Damping factor A unique factor only named like this in Microsoft Excel. Its value is 1 –
α (where α is a smoothing constant). This naming convention is unique to Excel and is
not used in any other software or textbook.
Degrees of freedom The number of degrees of freedom is the number of values in the
final calculation of a statistic that are free to vary.
Dependent populations Two populations are dependent if the measured values of the
items observed in one population directly affect the measured values of the items
observed in the other population.
Dependent variable A variable that is expected to show the change as an independent
variable is manipulated or changed.
Determining the sample size This is the act of choosing the number of observations to
include in a statistical sample. The sample size is an important feature of any empirical
study in which the goal is to make inferences about a population from a sample.
Discrete A set of data is said to be discrete if the values/observations belonging to it are
distinct and separate, that is, they can be counted (1,2,3, …).

Page | 737
Discrete probability distributions If a random variable is a discrete variable, its
probability distribution is called a discrete probability distribution.
Discrete random variable A set of data is said to be discrete if the values belonging to
it can be counted as 1, 2, 3, ….
Dispersion (or spread) The variation between data values is called dispersion.
Disproportionate stratification This is a type of stratified sampling. With
disproportionate stratification, the sample size of each stratum does not have to be
proportionate to the population size of the stratum. This means that two or more strata
will have different sampling fractions.
Distribution shape A graph plotting frequency (or probability) against actual values.
There are some typical shapes: normal (bell-shaped), Student’s t, and F distributions.
Double exponential smoothing (DES) A linear forecasting method that uses
smoothing values of the smoothing values in order to define linear components at and
bt, which are used to produce linear forecasts.
Double moving averages (DMA) A linear forecasting method that uses moving
averages of the moving averages in order to define linear components at and bt, which
are used to produce linear forecasts.
Durbin–Watson test A test used to detect serial correlation, or autocorrelation, among
the residuals, which means a relationship between residuals separated by each other by
a given time lag.
Efficient estimator Among a number of estimators of the same class, the estimator
having the least variance is called the efficient estimator.
Empirical approach This denotes information gained by means of observation,
experience, or experiment.
Error analysis or residual analysis After a forecasting model is fitted to the actual
time series and deviations between the actual and forecasted values are calculated,
these deviations are called errors or residuals. They are scrutinised as in regression
analysis to validate the model, but also to decide which model is the best, if more than
one model is used to forecast the same time series.
Estimator An estimator is a rule for calculating an estimate of a given quantity based on
observed data.
Explanatory variable Another expression for independent variable.
Explained variations See Sum of squares for regression (SSR).
Extrapolate Extend the results produced by a method into the future, by assuming that
the existing rules will continue to apply.
Exponential smoothing A method that relies on a constant which is used to correct the
past forecasts’ deviations from the actual values (errors). A time series obtained in such
a way is invariably ‘smoother’ than the original time series that it has been derived
from.
Event An event is a set of outcomes of an experiment to which a probability is assigned.
Expected frequency In a contingency table the expected frequencies are the
frequencies that you would predict in each cell of the table, if you knew only the row
and column totals, and if you assumed that the variables under comparison were
independent.
Expected value of the probability distribution The expected value of a random data
variable indicates its population average value.
F distribution The F distribution is a probability distribution used in analysis of
variance when comparing the variance of two samples for significance.

Page | 738
F test for equality of population variance The F test for two population variances
(variance ratio test) is used to test if the variances of two populations are equal.
F test for multiple regression models A test to determine whether any of the
independent variables is significant.
F test for simple regression models A test to determine if the independent variable is
a significant contributor to the predicted values ŷ .
First quartile, Q1 This is also referred to as the 25th percentile and is the value below
which 25% of the population falls.
Fisher kurtosis coefficient Kurtosis is a measure of the ‘peakedness’ of the probability
distribution of a random variable.
Fisher–Pearson skewness coefficient Skewness is a measure of the asymmetry of the
probability distribution of a random variable about its mean.
Fitting a trend Finding the line (linear or curve) that best fits the historical data.
Five-number summary A five-number summary is especially useful when we have so
many data that it is enough to present a summary of the data rather than the whole data
set.
Forecasting A method of predicting the future values of a variable, usually represented
as the time series values.
Forecasting errors The difference between the actual and forecasted values in a time
series.
Frequency definition of probability This defines the probability of an outcome as the
frequency or the number of times the outcome occurs relative to the number of times
that it could have occurred.
Frequency density The frequency density can be calculated by using the following
formula: frequency density = frequency ÷ class width.
Frequency distribution A summary of data presented in table form representing
frequency and class intervals.
Frequency polygon Frequency polygons are line graphs joined by all the midpoints at
the top of the bars of histograms.
Goodness-of-fit test A chi-square goodness-of-fit test attempts to answer the following
question: are sample data consistent with a hypothesised distribution?
Graph A graph or chart is a visual illustration of a set of data.
Grouped frequency distribution In a grouped frequency distribution, data are sorted
and separated into groups called classes.
Histogram A histogram is a way of summarising data that are measured on an interval
scale (either discrete or continuous). It is often used in exploratory data analysis to
illustrate the major features of the distribution of the data in a convenient form. In this
case, the class intervals are constant.
Histogram with unequal class intervals A histogram is a way of summarising data
that are measured on an interval scale (either discrete or continuous). It is often used in
exploratory data analysis to illustrate the major features of the distribution of the data
in a convenient form. In this case, the class intervals are not constant.
Homogeneity of variance This is another name for equal population variances.
Homoscedasticity This is also called homogeneity or uniformity of variance. It refers to
the requirement that the variance of regression errors, or residuals, is constant for all
values of X. This is one of the linear regression assumptions.
Hypothesis A testable statement about the relationship between two or more variables
or a proposed explanation for some observed phenomenon.

Page | 739
Hypothesis test procedure A series of steps to determine whether to accept or reject a
null hypothesis, based on sample data.
IBM SPSS Statistics SPSS Statistics is a software package used for statistical analysis.
Independence of errors Current error must not be dependent on the previous error.
Another term is serial correlation.
Independent populations Two populations are said to be independent if the measured
values of the items observed in one population do not affect the measured values of the
items observed in the other population.
Independent variable A variable that stands alone and is expected to cause some
changes to the dependent variable. Also called explanatory variable.
Inference This is the process of deducing properties of an underlying probability
distribution by analysis of data. Inferential statistical analysis infers properties about a
population; this includes testing hypotheses and deriving estimates.
Independent events Two events are independent if the occurrence of one of the events
has no influence on the occurrence of the other event.
Intercept This is the value of the regression equation (y) when the x value is 0.
Interquartile range The interquartile range is a measure of the spread of or dispersion
within a data set.
Interval An interval scale is a scale of measurement where the distance between any
two adjacent units of measurement (or ‘intervals’) is the same but the zero point is
arbitrary.
Interval estimates This refers to the use of sample data to calculate an interval of
plausible values of an unknown population parameter.
Kurtosis This is a measure of the ‘peakedness’ of the probability distribution of a
random variable.
Least squares The method of least squares is a criterion for fitting a specified model to
observed data. It refers to finding the smallest (least) sum of squared differences
between fitted and actual values.
Left-skewed (or negatively skewed) A distribution is said to be left-skewed (or left-
tailed), even though the curve itself appears to be leaning to the right; ‘left’ here refers
to the left tail being drawn out, with the mean value being skewed to the left of a typical
centre of the data set.
Leptokurtic A statistical distribution is leptokurtic when the points along the X-axis are
clustered, resulting in a higher peak, or higher kurtosis, than the curvature found in a
normal distribution.
Level of confidence This is the likelihood that the true value being tested will lie within
a specified range of values.
Likelihood of an event happening This is the probability that an event will occur.
Linear relationship The relationship between two variables is linear, that is, they
move together in the same or opposite direction, but in a linear fashion. This is one of
the linear regression assumptions.
Lower class limit A point that is the left endpoint of one class interval.
Lower one-tailed test A lower one-tail test is a statistical hypothesis test in which the
values for which we can reject the null hypothesis, H0, are located entirely in the left tail
of the probability distribution.
Mann–Whitney U test The Mann–Whitney U test is used to test the null hypothesis that
two populations have identical distribution functions against the alternative hypothesis
that the two distribution functions differ only with respect to location (median), if at all.

Margin of error This provides an estimate of how much the results of the sample may
differ due to chance when compared to what would have been found if the entire
population were interviewed. It is a statistic expressing the amount of random sampling
error in a survey’s results.
McNemar’s test for matched pairs McNemar’s test is a nonparametric method used on
nominal data to determine whether the row and column marginal frequencies are equal.
Mean The mean is a measure of the average data value for a data set.
Mean absolute error, or deviation (MAD) The mean value of all the differences
between the actual and forecasted values in a time series. The differences between
these values are represented as absolute values, with the effects of the sign ignored.
Mean absolute percentage error (MAPE) The mean value of all the differences between
the actual and forecasted values in a time series. The differences between these values
are represented as absolute percentage values, that is, the effects of the sign are
ignored.
Mean error (ME) The mean value of all the differences between the actual and
forecasted values in a time series.
Mean of the binomial distribution The expected value, or mean, of a binomial
distribution is calculated by multiplying the number of trials by the probability of
successes (np).
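For example, for n = 20 trials each with probability of success p = 0.3, the mean number of successes is np = 20 × 0.3 = 6.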
Mean of the Poisson distribution The expected value, or mean, of a Poisson distribution is the average number of events in a given time or space interval and is represented by the symbol λ (lambda).
Mean percentage error (MPE) The mean value of all the differences between the
actual and forecasted values in a time series. The differences between these values are
represented as percentage values.
Mean squared error (MSE) The mean value of all the differences between the actual
and forecasted values in a time series. The differences between these values are
squared to avoid positive and negative differences cancelling each other.
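To see how these error measures relate to one another, the following short Python sketch (illustrative only; it is not taken from the book's Excel workbooks, and the data values are hypothetical) computes ME, MAD, MPE, MAPE, MSE, and RMSE for a small set of actual and forecast values.

import math

actual   = [112, 118, 132, 129, 121]    # hypothetical observed values
forecast = [110, 120, 128, 131, 119]    # hypothetical forecast values

n = len(actual)
errors = [a - f for a, f in zip(actual, forecast)]

me   = sum(errors) / n                                             # mean error
mad  = sum(abs(e) for e in errors) / n                             # mean absolute deviation
mpe  = 100 * sum(e / a for e, a in zip(errors, actual)) / n        # mean percentage error
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n   # mean absolute percentage error
mse  = sum(e ** 2 for e in errors) / n                             # mean squared error
rmse = math.sqrt(mse)                                              # root mean squared error

print(me, mad, mpe, mape, mse, rmse)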
Measure of average A measure of average is a number that is typical for a set of figures
(e.g. mean, median, or mode).
Measure of central tendency Measures of central tendency are numbers that describe what is average or typical of the distribution of data (e.g. mean, median, or mode).
Measure of dispersion Dispersion is the extent to which a distribution is stretched or
squeezed (e.g. standard deviation, and interquartile range).
Measure of location Measures of location are numbers that describe what is average or typical of the distribution of data (e.g. mean, median, or mode).
Measure of shape The principal measures of distribution shape used in statistics are
skewness and kurtosis.
Measure of spread Spread is the extent to which a distribution is stretched or
squeezed (e.g. standard deviation, and interquartile range).
Measure of variation Variation is the extent to which a distribution is stretched or
squeezed (e.g. standard deviation, and interquartile range).
Measurement error (or observational error) This is the difference between a
measured value or quantity and its true value. In statistics, an error is not a ‘mistake’.
Variability is an inherent part of things being measured and of the measurement
process.
Median The median is the value halfway through an ordered data set.
Mesokurtic A statistical distribution is mesokurtic when its kurtosis is similar, or
identical, to that of a normally distributed data set.

Microsoft Excel Microsoft Excel is a spreadsheet developed by Microsoft. It organises
numeric or text data in spreadsheets or workbooks.
Mode The mode is the most frequently occurring value in a set of discrete data.
Model An abstraction of something, in this context typically referring to an equation, or
a series of equations, that mimic a variable.
Moving averages are averages calculated for a limited number of rolling periods. Each subsequent moving average drops the earliest observation from the rolling period and includes the next one.
Moving average forecasts If the moving average is placed as the value following the
last observation taken into the moving average interval, then this moving average
becomes a forecast.
Multiple regression Multiple linear regression aims to find a linear relationship
between a response variable and several possible predictor variables.
Multistage sampling This refers to sampling plans where the sampling is carried out in
stages using smaller and smaller sampling units at each stage.
Mutually exclusive events Two or more events are said to be mutually exclusive if
they cannot occur at the same time.
Nominal A set of data is said to be nominal if the values/observations belonging to it
can be assigned a code in the form of a number, where the numbers are simply labels.
You can count but not order or measure nominal data.
Nonparametric tests are often used in place of their parametric counterparts when
certain assumptions about the underlying population are questionable.
Non-probability sampling This is a sampling technique where the samples are
gathered in a process that does not give all the individuals in the population equal
chances of being selected.
Non-proportional quota sampling This captures a minimum number of respondents
in a specific group.
Non-response error Non-response errors occur when a survey fails to get a response
to one, or possibly all, of the questions.
Non-stationary A time series that does not have a constant mean and/or variance and
oscillates around a moving mean.
Normal distribution The normal distribution is a symmetrical, bell-shaped curve,
centred at its expected value.
Normal probability plot This is a graphical technique to assess whether a set of data is
normally distributed.
Normality of errors Regression errors, or residuals, need to be distributed in
accordance with the normal distribution. This is one of the linear regression
assumptions.
Null hypothesis (H0) The null hypothesis, H0, represents a theory that has been put
forward but has not been proved.
Observational error (see Measurement error).
Observed frequency In a contingency table the observed frequencies are the
frequencies obtained in each cell of the table, from our random sample.
Ogive A cumulative frequency graph.
One-sample test A one-sample test is a hypothesis test for answering questions about the mean (or median) where the data are a random sample of independent observations from an underlying distribution.

One-sample t test A one-sample t test is a hypothesis test for answering questions
about the mean where the data are a random sample of independent observations from
an underlying normal distribution whose population variance is unknown.
One-sample z test A one-sample z test is used to test whether a population parameter
is significantly different from some hypothesised value.
One-tailed test A one-tailed test is appropriate if the estimated value may depart from
the reference value in only one direction.
Ordinal A set of data is said to be ordinal if the values/observations belonging to it can
be ranked (put in order) or have a rating scale attached. You can count and order, but
not measure, ordinal data.
Outcome An outcome is a possible result of an experiment.
Outlier An outlier is an observation in a data set which is far removed in value from the
others in the data set.
p-value The p-value is the probability of getting a value of a test statistic as extreme as
or more extreme than that observed by chance alone, if the null hypothesis is true.
Paired-samples t test A two-sample t test for population mean (dependent or paired
samples) is used to compare two dependent population means inferred from two
samples (‘dependent’ indicates that the values from both samples are numerically
dependent upon each other – there is a correlation between corresponding values).
Parametric Any statistic computed by procedures that assume the data were drawn from a particular probability distribution (often the normal distribution).
Pearson coefficient of skewness A measure of the asymmetry (lack of symmetry) of a distribution.
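One commonly quoted form (sometimes called Pearson's second skewness coefficient; the exact form used in the descriptive statistics chapter may differ) is
\[ Sk = \frac{3(\bar{x} - \text{median})}{s} \]
where \(\bar{x}\) is the sample mean and s is the sample standard deviation.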
Percentile The xth percentile is the value beneath which x% of the population falls.
Pie chart A pie chart is a way of summarising a set of categorical data using a circle
which is divided into segments, where each segment represents a category.
Pivot table The Excel term for a table generated from a 'flat' table that contains raw data. A pivot table reorganises and summarises data by enabling the rotation of variables, effectively facilitating a structural (or pivotal) change in the way the data are presented.
Platykurtic A statistical distribution is platykurtic when the points along the X-axis are extremely dispersed, resulting in less peakedness, or lower kurtosis, than found in a normal distribution.
Point estimate This is the use of sample data to calculate a single point value of an unknown population parameter.
Point estimate of the population mean This is a single value of the sample mean statistic that represents the population mean.
Point estimate of the population proportion This is a single value of the sample proportion statistic that represents the population proportion.
Point estimate of the population variance This is a single value of the sample variance statistic that represents the population variance.
Poisson probability distribution This is a discrete probability distribution that
expresses the probability of a given number of events occurring in a fixed interval of
time and/or space if these events occur with a known average rate and independently
of the time since the last event.
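For reference, the probability of observing exactly k events when the average rate is λ is
\[ P(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \qquad k = 0, 1, 2, \dots \]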
Pooled estimates Pooled estimates (also known as combined, composite, or overall estimates) are obtained by a method for estimating a statistic (such as a mean or variance) from several different populations.
Pooled-variance t test A two-sample t test that assumes the two populations have the same variance.
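For two independent samples of sizes n1 and n2 with sample variances s1² and s2², the pooled variance used in this test is
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]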

Population A collection of persons, objects, or items of interest.
Population mean The population mean is the mean value of all possible values.
Population mean for the normal distribution This is the expected value for a variable
that follows a normal distribution.
Population parameter This is a characteristic of a population. The population is a set
of individuals, items, or data from which a statistical sample is taken.
Population standard deviation The population standard deviation is the standard
deviation of all possible values.
Population variance The population variance is the variance of all possible values.
Prediction interval See confidence interval.
Probability A measure of the likelihood that events will occur.
Probability density function A statistical expression for the probability distribution of a continuous variable. It is used to find the probability that the variable falls within some interval.
Probability distribution A listing of all possible events or outcomes associated with a
course of action and their associated probabilities.
Probability sampling This is any method of sampling that utilises some form of
random selection. To have a random selection method, you must set up some process or
procedure that ensures that the different units in your population have equal
probabilities of being chosen.
Probability trees A method of using tree diagrams to aid the solution of problems involving probability.
Proportional quota sampling This represents the major characteristics of the
population by sampling a proportional amount of each.
Proportionate stratification This is a type of stratified sampling. With proportionate
stratification, the sample size of each stratum is proportionate to the population size of
the stratum. This means that each stratum has the same sampling fraction.
Purposive sampling This is a non-probability sample that is selected based on
characteristics of a population and the objective of the study. Purposive sampling is also
known as judgmental, selective, or subjective sampling.
Qualitative A qualitative variable is one whose values are descriptive categories rather than numbers, such as colours, genders, nationalities, etc.
Quantitative A variable that takes numerical values for which arithmetic makes sense
(e.g. counts, temperatures, weights, amounts of money).
Quartile A quartile is one of three values that divide a sample of data into four groups
containing an equal number of observations.
Quota sampling This is a method for selecting survey participants that is a non-
probabilistic version of stratified sampling. In quota sampling, a population is first
segmented into mutually exclusive subgroups, just as in stratified sampling. Then
judgement is used to select the subjects or units from each segment based on a specified
proportion.
R-squared or R² See coefficient of determination.
Random experiment Randomised experiments are experiments that allow the greatest reliability and validity of statistical estimates of treatment effects.
Random sample A random sample is one in which every element in the population has
an equal chance of being selected.
Random variable A random variable is a set of possible values from a random
experiment.
Range The range of a data set is a measure of the dispersion of the observations.
Rank To rank is to list data in order of size.

Ratio data are interval data with a natural zero point.
Raw data Raw data, also known as primary data, are data collected from a source.
Regression analysis This is a set of statistical processes used to estimate relationships
between variables.
Regression assumptions In order to apply linear regression, four assumptions need to
be met: (i) linearity; (ii) independence of errors; (iii) normality of errors; and (iv)
constant variance, or homoscedasticity.
Regression coefficients In linear regression there is one coefficient that describes the
value of the intercept (when x = 0) and one coefficient that describes the slope (or
gradient).
Rejection region This is the range of values that leads to rejection of the null
hypothesis.
Relative frequency This is defined as how often something happens divided by the
total number of outcomes.
Residual The residual represents the unexplained variation (or error) after fitting a regression model. It is also the difference between an actual value and the corresponding predicted value.
Residual analysis This is the analysis of the residuals of a regression. It typically checks that the residuals comply with certain assumptions. See regression assumptions.
Right-skewed (or positively skewed) A distribution is said to be right-skewed (or
right-tailed), even though the curve itself appears to be leaning to the left; ‘right’ here
refers to the right tail being drawn out, with the mean value being skewed to the right of
a typical centre of the data set.
Robust test Robust statistics are statistics with good performance for data drawn from
a wide range of probability distributions, especially for distributions that are not
normal.
Root mean square error (RMS) This is the square root of the MSE. It takes the error statistic back to the same units as used by the variable. It can also be used as the standard error of the estimate to calculate a prediction interval.
Sample A sample is a set of data collected and/or selected from a statistical population
by a defined procedure. The elements of a sample are known as sample points, sampling
units or observations.
Sample mean The sample mean (or empirical mean) is computed from a collection of
data from a population.
Sample point This is a single possible observed value of a variable; a member of the
sample space of an experiment.
Sample statistics These are numerical measures (such as the sample mean, variance, or proportion) calculated from the observations in a sample, typically used to estimate the corresponding population parameters.
Sample standard deviation This is a measure of the extent to which a collection of data sampled from a population is stretched or squeezed along the X-axis; it is the square root of the sample variance.
Sample space The sample space is an exhaustive list of all the possible outcomes of an
experiment.
Sample variance This is a measure of the dispersion of a sample of observations taken
from a population.
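For a sample x1, x2, …, xn with sample mean \(\bar{x}\), the (Bessel-corrected) sample variance is
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]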
Sampling distribution This is the probability distribution of a given statistic based on a
random sample. Sampling distributions are important in statistics because they provide
a major simplification en route to statistical inference.

Sampling distribution of the sample mean This is a theoretical distribution of the
values that the mean of a sample takes on in all the possible samples of a specific size
that can be drawn from a given population.
Sampling distribution of the sample proportion This is a theoretical distribution of
the values that the proportion of a sample takes on in all the possible samples of a
specific size that can be drawn from a given population.
Sampling distribution of the sample variance This is a theoretical distribution of the
values that the variance of a sample takes on in all the possible samples of a specific size
that can be drawn from a given population.
Sampling error This occurs when the statistical characteristics of a population are
estimated from a subset, or sample, of that population. Since the sample does not
include all members of the population, statistics on the sample, such as means and
standard deviations, generally differ from the characteristics of the entire population,
which are known as parameters.
Sampling frame This is the source material or device from which a sample is drawn. It
is a list of all those within a population who can be sampled, and may include
individuals, households, or institutions.
Sampling with replacement Sampling is called ‘with replacement’ when a unit
selected at random from the population is returned to the population and then a second
element is selected at random.
Sampling without replacement Sampling is called ‘without replacement’ when a unit
selected at random from the population is not returned to the population and then a
second element is selected at random.
Scaling Changing the scale of the ordinate (y-axis) on a graph to modify the resolution of the diagram.
Scatter plot A scatter plot is a useful summary of a set of bivariate data (two variables),
usually drawn before working out a linear correlation coefficient or fitting a
regression/time series line. It gives a good visual picture of the relationship between
the two variables and aids the interpretation of the correlation coefficient or regression
model.
Seasonal A time series, measured in units of time smaller than a year, that shows a regular pattern which repeats itself over a number of these units of time.
Second quartile, Q2 This is also referred to as the 50th percentile or the median and is the value that divides the population in the middle and has 50% of the population values below it.
Serial correlation There should be no serial correlation among the residuals, and this
is one of the linear regression assumptions. See independence of errors.
Set of all possible outcomes The set of all possible outcomes of a random experiment.
Shapiro–Wilk test The Shapiro–Wilk test calculates a statistic that tests whether a
random sample comes from a normal distribution.
Simple random sampling This is the basic sampling technique where we select a
group of subjects (a sample) for study from a larger group (a population). Everyone is
chosen entirely by chance and each member of the population has an equal chance of
being included in the sample.
Simple regression analysis Simple linear regression aims to find a linear relationship
between one response variable and one possible predictor variable by the method of
least squares.
Sign test The sign test is designed to test a hypothesis about the location of a population
distribution.

Single (or simple) moving averages (SMA) These are identical to moving average forecasts, but the word 'single' is used to differentiate them from double moving averages. Sometimes they are shown as 3MA or 5MA, for example, where the prefix number indicates how many periods are taken into the rolling interval.
Single exponential smoothing (SES) Simple weighted averages of the past deviations
of the forecasts from the actual values are used to create a new fit for the original time
series. If these values are shifted one period in the future, they become simple
exponential smoothing forecasts.
Significance level, α The significance level of a statistical hypothesis test is a fixed
probability of wrongly rejecting the null hypothesis, H0, if it is in fact true.
Skewness is a measure of the asymmetry of the probability distribution of a real-valued
random variable about its mean.
Slope This is the gradient of the fitted regression line.
Smoothing constant, α This defines what fraction of the past forecasting error will be considered when producing future forecasts. The smoothing constant varies between 0 and 1. The closer it is to zero, the smoother the SES time series will be. This also implies that a larger smoothing constant puts more emphasis on more recent data in the time series, while a smaller smoothing constant puts more emphasis on the whole history of the time series.
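In common notation (the symbols used in the forecasting chapters may differ), with A_t the actual value and F_t the fitted or forecast value at time t, the single exponential smoothing update is
\[ F_{t+1} = F_t + \alpha (A_t - F_t) = \alpha A_t + (1 - \alpha) F_t \]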
Snowball sampling This is a true multipurpose sampling technique. Through its use, it
is possible to make inferences about social networks and relations in areas in which
sensitive, illegal, or deviant issues are involved.
Spearman’s rank coefficient of correlation This measures dependence between the
ranking of two variables.
Standard deviation A measure of the dispersion of the observations (the square root of the variance).
Standard deviation for the probability distribution The standard deviation of a
random variable, statistical population, data set, or probability distribution is the square
root of its variance.
Standard deviation of a normal distribution The standard deviation of a normal distribution is σ.
Standard deviation of the sample mean This is the standard deviation of the sampling
distribution of the mean. It is therefore the square root of the variance of the sampling
distribution of the mean. It is also called the standard error of the mean.
Standard error The standard error of any parameter (or statistic) is the standard
deviation of its sampling distribution, or an estimate of the standard deviation. If the
parameter or statistic is the mean, then we call it the standard error of the mean.
Standard error in time series This is a measure of accuracy of predictions made using
one of the extrapolation techniques.
Standard error of the estimate (SEE) This is a measure of the accuracy of predictions
made with a regression line.
Standard error of the sample proportion This is the standard error of the sampling distribution of the proportion. It is therefore the square root of the variance of the sampling distribution of the proportion.
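With π denoting the population proportion and n the sample size, the standard error of the sample proportion is
\[ \sigma_{p} = \sqrt{\frac{\pi (1 - \pi)}{n}} \]
In practice π is replaced by the sample proportion when it is unknown.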
Standard normal distribution This is the normal distribution with mean 0 and
standard deviation 1.
Standard normal distribution table This is a mathematical table of the values of the
cumulative distribution function of the normal distribution. It is used to find the probability that a statistic is observed below, above, or between values on the standard
normal distribution.
Standardised sample mean Z value A standardised value is what you get when you take a sample mean and scale it using population data (the population mean and the standard error).
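In symbols, with \(\bar{x}\) the sample mean, μ the population mean, σ the population standard deviation, and n the sample size:
\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]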
Stated limits (true or mathematical) Class limits are the smallest and largest observations (data, events, etc.) in each class. Therefore, each class has two limits: a lower limit and an upper limit.
Stationary A time series that has a constant mean and variance and oscillates around
this mean is referred to as stationary.
Statistical independence Two events are independent if the occurrence of one of the
events gives us no information about whether the other event will occur.
Statistical inference This is the process of deducing properties of an underlying
probability distribution by analysis of data. Inferential statistical analysis infers
properties about a population; this includes testing hypotheses and deriving estimates.
Statistical power The power of a statistical test is the probability that it will correctly
lead to the rejection of a false null hypothesis.
Stratified random sampling This is a method of sampling from a population. In
statistical surveys, when subpopulations within an overall population vary, it is
advantageous to sample each subpopulation (stratum) independently.
Student’s t distribution This is any member of a family of continuous probability
distributions that arises when estimating the mean of a normally distributed population
in situations where the sample size is small, and population standard deviation is
unknown.
Student’s t distribution table This is a mathematical table of the values of the
cumulative distribution function of the Student’s t distribution. It is used to find the
probability that a statistic is observed below, above, or between values on the Student’s
t distribution.
Student’s t test This is a method of testing hypotheses about the mean of a small
sample drawn from a normally distributed population when the population standard
deviation is unknown.
Sum of squares for error (SSE) This is the sum of squared distances of every point
from the regression line. Also called unexplained variations.
Sum of squares for regression (SSR) This is the sum of squared distances of every
point on the regression line from the mean value. Also called explained variations.
Symmetric distributions A data set is symmetrical when the data values are
distributed in the same way above and below the middle value.
Systematic random sampling This is a method of choosing a random sample from
among a larger population. The process of systematic sampling typically involves first
selecting a fixed starting point in the larger population and then obtaining subsequent
observations by using a constant interval between samples taken.
t test for multiple regression models This tests whether the predictor variables in the
regression equation are significant contributors.
t test for simple regression models This is a test that validates if the regression model
is usable. It tests whether the slope coefficient in the regression equation is significantly
different from zero.
Table A set of facts or figures systematically displayed, especially in columns.
Tally chart A table used to record values for a variable in a data set, by hand, often as
the values are collected. One tally mark is used for each occurrence of a value. Tally marks are usually grouped in sets of five, to aid the counting of the frequency for each
value.
Test of association The chi-square test of association allows the comparison of two
attributes in a sample of data to determine if there is any relationship between them.
Test statistic A test statistic is a quantity calculated from our sample of data.
Third quartile Q3 This is also referred to as the 75th percentile and is the value below
which 75% of the population falls.
Tied ranks Two or more data values share a rank value.
Time period A unit of time by which the variable is defined (an hour, day, month, year,
etc.)
Time series This is a variable whose observations are measured over time (and/or time-stamped) at equidistant intervals of time.
Time series plot A time series plot is a graph that you can use to evaluate patterns and
behaviour in a data variable over time.
Total sum of squares (SST) This is the total variation of the data from their mean,
consisting of the regression sum of squares (SSR) and the error sum of squares (SSE).
Total variation See total sum of squares (SST).
Tree diagram A tree diagram may be used to represent a probability space. Tree
diagrams may represent a series of independent events or conditional probabilities.
Trend This is a component in the classical time series analysis approach to forecasting
that covers underlying directional movements of the time series. In general, it is the
underlying tendency of any time series, indicating the direction and pace.
Trend parameters In the case of a linear trend, these are the slope and intercept. If a
more complex trend equation is used, they do not have a name, but are referred to as a,
b, c, etc.
Triple exponential smoothing (TES) A nonlinear forecasting method that uses three levels of smoothing values in order to define components aₜ, bₜ and cₜ, which are used to produce linear forecasts.
Two-sample test A two-sample test is a hypothesis test for answering questions about
the mean where the data are collected from two random samples of independent
observations, each from an underlying distribution.
Two sample t test A two-sample t test for the population mean (independent samples,
equal variance) is used when two separate sets of independent and identically
distributed samples are obtained, one from each of the two populations being
compared.
Two sample t test (independent samples, unequal variances) A two-sample t test
for population mean (independent samples, unequal variances) is used when two
separate sets of independent but differently distributed samples are obtained, one from
each of the two populations being compared.
Two-sample z test for the population mean A two-sample z test for the population
mean is used to evaluate the difference between two group means.
Two-sample z test for the population proportion A two-sample z test for the
population proportion is used to evaluate the difference between two group
proportions.
Two-tailed test This is a statistical hypothesis test in which the values for which we can
reject the null hypothesis, H0, are in both tails of the probability distribution.
Type I error, α A Type I error occurs when the null hypothesis, H0, is rejected when it
is in fact true.

Type II error, β A Type II error occurs when the null hypothesis, H0, is not rejected
when it is in fact false.
Types of trends Any curve can be a trend. Most often used are linear, logarithmic,
polynomial, power, and exponential trends.
Unbiased estimator If the expected value of the sample statistic is equal to the
population value then we say that the sample statistic is an unbiased estimator of the
population value.
Uncertainty This is a situation which involves imperfect and/or unknown information.
Unexplained variation See Sum of squares for error (SSE).
Uniform distribution A type of probability distribution in which all outcomes are
equally likely.
Univariate methods These use only one variable and try to predict its future value
based on the past values of the same variable.
Upper class limit A point that is the right endpoint of one class interval.
Upper one-tail test An upper one-tail test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0, are located entirely in the right tail of the probability distribution.
Variable A numerical value or a characteristic that can differ from individual to
individual or change through time.
Variance This is a measure of the dispersion of the observations.
Variance of a binomial distribution The variance of the binomial distribution is npq.
Variance of a normal distribution The variance of a normal distribution is σ².
Variance of a Poisson distribution The variance of a Poisson distribution is λ.
Variance of a probability distribution This is the expectation of the squared deviation
of a random variable from its mean. Informally, it measures how far a set of numbers
are spread out from their average value.
Visual display This is a visual representation of data in graphic form.
Weighted average This is an average in which each quantity to be averaged is assigned
a weight, and these weightings determine the relative importance of each quantity for
the average.
Welch's unequal-variances t test This is a two-sample location test which is used to
test the hypothesis that two populations have equal means.
Welch–Satterthwaite equation This equation is used to calculate an approximation to
the effective degrees of freedom of a linear combination of independent sample
variances, also known as the pooled degrees of freedom, corresponding to the pooled
variance.
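For reference, the approximate degrees of freedom ν for two samples with variances s1², s2² and sizes n1, n2 are
\[ \nu \approx \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{(s_1^2 / n_1)^2}{n_1 - 1} + \dfrac{(s_2^2 / n_2)^2}{n_2 - 1}} \]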
Wilcoxon signed-rank test This is designed to test a hypothesis about the location of the population median (for one sample, or for two matched/paired samples).
Yates' correction An adjustment made to chi-square values obtained from 2 by 2
contingency table analyses.

Book index

5 steps in hypothesis testing ............................................................................................................... 345


Alpha (α).............................................................................................................................................. 346
Alternative hypothesis (H1) ......................................................................................................... 337, 345
Assumptions for the equal-variance t-test ......................................................................................... 384
Bar charts .............................................................................................................................................. 47
Bessel correction factor ...................................................................................................................... 293
Beta, β ................................................................................................................................................. 355
Biased estimate ................................................................................................................................... 262
Binomial .............................................................................................................................................. 171
Binomial experiment........................................................................................................................... 216
Binomial probability distribution ........................................................................................................ 211
Bootstrapping ..................................................................................................................................... 291
Box plot ....................................................................................................................................... 152, 191
Categories ............................................................................................................................................. 21
Central limit theorem.......................................................................................................................... 172
Central Limit Theorem ................................................................................................................ 273, 341
Central tendency ............................................................................................................................. 95, 96
Centred moving average ..................................................................................................................... 649
Chance................................................................................................................................................. 167
Chart...................................................................................................................................................... 47
Chi-square distribution ............................................................................................................... 208, 423
Chi-square test .................................................................................................................................... 419
Chi-square test for two independent samples ................................................................................... 441
Chi-square test of independence ........................................................................................................ 421
Class interval ......................................................................................................................................... 29
Class limits............................................................................................................................................. 22
Class widths equal ................................................................................................................................. 66
Classical time series analysis ............................................................................................................... 616
Classical time series decomposition ................................................................................................ 618
Cluster sampling.................................................................................................................................. 246
Coefficient of determination .............................................................................................................. 529
Coefficient of variation ....................................................................................................................... 130
Component bar chart ............................................................................................................................ 49
Confidence interval ............................................................................................................................. 309
Confidence interval for the population mean .................................................................................... 291
Confidence interval for the population proportion ............................................................................ 291
Confidence interval of a population mean when the sample size is small ......................................... 316
Consistent estimator ........................................................................................................................... 292
Contingency table ............................................................................................................................... 423
Continuous probability distribution .................................................................................................... 171
Continuous random variable .............................................................................................................. 171
Continuous variable .............................................................................................................................. 20
Convenience sampling ........................................................................................................................ 247
Covariance........................................................................................................................................... 519
Coverage error .................................................................................................................................... 249
Critical test statistic ............................................................................................................................. 349
Critical value of the test statistic......................................................................................................... 347

Critical Z test statistic .......................................................................................................................... 362
Cyclical ................................................................................................................................................ 616
Degrees of freedom ............................................................................................................ 294, 367, 421
Dependent .......................................................................................................................................... 340
Determining the sample size .............................................................................................................. 326
Discounting ......................................................................................................................................... 668
Discrete probability distributions ....................................................................................................... 211
Discrete random variable............................................................................................................ 171, 213
Discrete variable ................................................................................................................................... 20
Dispersion ............................................................................................................................................. 95
Disproportionate stratification ........................................................................................................... 245
Distribution shape ............................................................................................................................... 131
Double exponential smoothing........................................................................................................... 684
Double moving averages..................................................................................................................... 662
Efficient estimator............................................................................................................................... 292
Empirical ............................................................................................................................................. 168
Error measurements ........................................................................................................................... 594
Error sum of squares, SSE................................................................................................................. 543
Estimate .............................................................................................................................................. 250
Event ................................................................................................................................................... 169
Expected frequency ............................................................................................................................ 423
Expected value of the probability distribution ................................................................................... 213
Experiment .......................................................................................................................................... 168
Exponential smoothing ....................................................................................................................... 667
Exponentially smoothed ..................................................................................................................... 646
Exponentially weighted moving average (EWMA) ............................................................................. 669
F-distribution....................................................................................................................................... 205
First quartile ........................................................................................................................................ 111
Fisher - Pearson skewness coefficient ................................................................................................ 134
Fisher’s kurtosis coefficient ................................................................................................................ 141
Five-number summary ................................................................................................................ 146, 191
Forecasting errors ............................................................................................................................... 589
Four assumptions of regression .......................................................................................................... 545
Frequency distribution .................................................................................................................. 25, 213
Frequency distributions ........................................................................................................................ 21
Graph .................................................................................................................................................... 47
Group frequency distribution ............................................................................................................... 66
Grouped frequency distribution ........................................................................................................... 28
Histogram .............................................................................................................................................. 66
Holt-Winters method ........................................................................................................................ 702
Homogeneity of variance assumption ................................................................................................ 384
Homoscedasticity ................................................................................................................................ 546
Hypothesis test procedure .................................................................................................................. 345
IBM SPSS Statistics ................................................................................................................................ 18
Independent........................................................................................................................................ 340
Inference ............................................................................................................................................. 242
Interquartile range .............................................................................................................................. 119
Interval data .......................................................................................................................................... 98
Interval estimate ......................................................................................................................... 290, 309
Irregular .............................................................................................................................................. 616
Kurtosis ................................................................................................................................. 97, 132, 140

Left skewed ......................................................................................................................................... 133
Leptokurtic .......................................................................................................................................... 140
Likelihood ............................................................................................................................................ 167
Linear regression analysis ................................................................................................................ 533
Linearity regression assumption ......................................................................................................... 545
Long-term forecasts ............................................................................................................................ 644
Mann-Whitney U test ......................................................................................................................... 488
Margin of error ................................................................................................................................... 249
McNemar’s test for matched pairs ..................................................................................................... 450
Mean of a standard normal distribution ............................................................................................ 178
Mean of the sampling distribution for the proportions ..................................................................... 282
Measure of average .............................................................................................................................. 95
Measure of dispersion ........................................................................................................................ 108
Measure of spread ........................................................................................................................ 95, 109
Measure of variation........................................................................................................................... 109
Measurement Error ............................................................................................................................ 249
Measures of average............................................................................................................................. 96
Measures of central tendency .............................................................................................................. 95
Measures of dispersion ................................................................................................................... 95, 96
Measures of shape ................................................................................................................................ 95
Median ................................................................................................................................................ 100
Medium-term forecasts ...................................................................................................................... 644
Mesokurtic .......................................................................................................................................... 140
Microsoft Excel...................................................................................................................................... 18
Mid-term forecasts ............................................................................................................................. 662
Mode ................................................................................................................................................... 102
Moving average .................................................................................................................................. 646
Multistage sampling............................................................................................................................ 246
Mutually exclusive .............................................................................................................................. 169
Negatively skewed .............................................................................................................................. 133
Nominal data......................................................................................................................................... 98
Nominal variable ................................................................................................................................... 20
Non-parametric................................................................................................................................... 291
Non-parametric hypothesis tests........................................................................................................ 334
Non-parametric test - Mann-Whitney U test ............................................................................. 385, 386
Non-parametric tests .......................................................................................................................... 459
Non-probability samples ..................................................................................................................... 243
Non-probability sampling ................................................................................................... 243, 246, 247
Non-proportional quota sampling ...................................................................................................... 248
Non-response error............................................................................................................................. 249
Non-seasonal time series .................................................................................................................... 573
Non-stationary time series ................................................................................................................. 572
Non-symmetrical distributions ........................................................................................................... 147
Normal distribution............................................................................................................................. 172
Normal probability plot....................................................................................................................... 191
Normality assumption......................................................................................................................... 384
Normally distributed population ........................................................................................................ 349
Null hypothesis (H0) .................................................................................................................... 337, 345
Observed frequency............................................................................................................................ 422
Odds .................................................................................................................................................... 167
One sample t test ................................................................................................................................ 368

One sample test .................................................................................................................................. 340
One sample z test for the proportion, π ............................................................................................. 377
One sample z-test statistic .................................................................................................................. 361
One-tail test ........................................................................................................................................ 338
Ordinal data .......................................................................................................................................... 98
Ordinal variable............................................................................................................................... 20, 21
Outcome ............................................................................................................................................. 168
Outlier ................................................................................................................................................. 518
Outliers.................................................................................................................................................. 99
Parametric ........................................................................................................................................... 291
Parametric hypothesis test ................................................................................................................. 334
Pearson chi square test statistic ......................................................................................................... 422
Pearson’s product moment correlation coefficient............................................................................ 523
Pearson's coefficient of skewness ...................................................................................................... 133
Percentile .................................................................................................................................... 101, 109
Pie chart ................................................................................................................................................ 58
Pivot table ............................................................................................................................................. 38
Platykurtic ........................................................................................................................................... 140
Point estimate ..................................................................................................................................... 290
Point estimate of the population mean .............................................................................................. 292
Poisson ................................................................................................................................................ 171
Poisson distribution ............................................................................................................................ 229
Poisson distribution experiment ......................................................................................................... 230
Poisson probability distribution .......................................................................................................... 211
Pooled estimates................................................................................................................................. 308
Pooled standard deviation .................................................................................................................. 385
Pooled-variance t-test......................................................................................................................... 384
Population ........................................................................................................................................... 250
Population mean ................................................................................................................. 100, 173, 350
Population mean of the normal distribution ...................................................................................... 172
Population parameters ....................................................................................................................... 250
Population standard deviation ................................................................................................... 122, 173
Population standard deviation for a normal distribution ................................................................... 172
Population variance ............................................................................................................................ 122
Positively skewed ................................................................................................................................ 133
Probability distribution ....................................................................................................................... 213
Probability samples ............................................................................................................................. 243
Probability sampling ........................................................................................................................... 243
Probable .............................................................................................................................................. 167
Proportional quota sampling .............................................................................................................. 248
Proportionate stratification ................................................................................................................ 245
Purposive sampling ..................................................................................................................... 247, 248
P-value ................................................................................................................................ 343, 347, 349
Qualitative variable ............................................................................................................................... 20
Quantitative variable ............................................................................................................................ 20
Quartile ............................................................................................................................................... 109
Quota sampling ................................................................................................................................... 248
Random experiment properties.......................................................................................................... 168
Random variable ................................................................................................................................. 170
Range .................................................................................................................................................. 118
Ranks ................................................................................................................................................... 461
Ratio ...................................................................................................................................................... 98
Ratio data .............................................................................................................................................. 21
Raw data ............................................................................................................................................... 18
Region of rejection .............................................................................................................................. 338
Regression equal variance assumption............................................................................................... 546
Regression independence of errors assumption ................................................................................ 545
Regression sum of squares, SSR ....................................................................................................... 543
Relative frequency .............................................................................................................................. 169
Residuals ............................................................................................................................................ 542
Residuals (R) ........................................................................................................................................ 576
Right skewed ....................................................................................................................................... 133
Robust test .......................................................................................................................................... 377
R-squared ............................................................................................................................................ 529
Sample......................................................................................................................................... 242, 250
Sample correlation coefficient ............................................................................................................ 523
Sample covariance .............................................................................................................................. 519
Sample mean .............................................................................................................................. 100, 292
Sample point ....................................................................................................................................... 168
Sample space .............................................................................................................................. 168, 170
Sample standard deviation ......................................................................................................... 123, 293
Sample statistics.................................................................................................................................. 250
Sample variance .......................................................................................................................... 123, 293
Sampling distribution .................................................................................................................. 259, 339
Sampling distribution of the mean ..................................................................................................... 259
Sampling distribution of the proportion ..................................................................................... 259, 282
Sampling distribution of the sample proportion ................................................................................ 282
Sampling error .................................................................................................................................... 249
Sampling frame ................................................................................................................................... 243
Scatter plot............................................................................................................................................ 78
Scatter plots ........................................................................................................................................ 514
Seasonal .............................................................................................................................................. 616
Seasonal time series............................................................................................................................ 573
Second quartile ................................................................................................................................... 109
Semi-interquartile range .................................................................................... 119
Shape..................................................................................................................................................... 96
Short-term forecasts ........................................................................................................................... 644
Sign test............................................................................................................................................... 460
Significance level ................................................................................................................................. 346
Simple moving average ....................................................................................................................... 662
Simple random sampling .................................................................................................................... 244
Skewness ....................................................................................................................................... 97, 132
Snowball sampling .............................................................................................................................. 248
Spearman’s rank correlation coefficient............................................................................................. 530
Stacked chart ........................................................................................................................................ 49
Standard deviation ........................................................................................................................ 96, 120
Standard deviation for the probability distribution............................................................................ 214
Standard deviation of a standard normal distribution ....................................................................... 178
Standard error..................................................................................................................................... 262
Standard error of the estimate ........................................................................................................... 548
Standard error of the sample means .................................................................................................. 260
Standard normal distribution.............................................................................................................. 178
Stated limits .......................................................................................................................................... 22
Stationary time series ......................................................................................................................... 572
Statistical inference............................................................................................................................. 250
Statistical power ................................................................................................................................. 355
Statistical test to be used .................................................................................................................... 346
Stratified random sampling ................................................................................................................ 245
Student’s t distribution ....................................................................................................................... 316
Student’s t distribution tables ............................................................................................................ 318
Student’s t-distribution ....................................................................................................................... 198
Student’s t-test ................................................................................................................................... 366
Symmetrical distributions ................................................................................................................... 147
Systematic random sampling .............................................................................................................. 244
Table...................................................................................................................................................... 23
Tally chart.............................................................................................................................................. 26
T-distribution ...................................................................................................................................... 198
Test of association .............................................................................................................................. 422
Test statistic ................................................................................................................................ 343, 346
Third quartile ...................................................................................................................................... 111
Time period ......................................................................................................................................... 573
Time periods ...................................................................................................................................... 575
Time series plot ..................................................................................................................................... 80
Total sum of squares, SST ................................................................................................................. 543
Trend ................................................................................................................................................... 616
T-test assumptions .............................................................................................................................. 366
Two sample tests ................................................................................................................................ 383
Two-sample tests ................................................................................................................................ 340
Two-tail test ........................................................................................................................................ 338
Type I error.......................................................................................................................................... 354
Type II error......................................................................................................................................... 354
Unbiased estimator..................................................................................................................... 259, 292
Unbiased estimator of the population mean ..................................................................................... 262
Unbiased estimator of the population standard deviation ................................................................ 294
Uncertainty ......................................................................................................................................... 167
Variance .............................................................................................................................................. 120
Variance of a binomial distribution .................................................................................................... 221
Variance of a Poisson distribution ...................................................................................................... 230
Variance of the binomial distribution ................................................................................................. 282
Variance of the probability distribution.............................................................................................. 214
Visual display......................................................................................................................................... 47
Welch’s unequal variances t-test ........................................................................................................ 385
Welch–Satterthwaite equation........................................................................................................... 385
Wilcoxon signed rank test ................................................................................................................... 474
Yates’s chi-square statistic ................................................................................... 426
Yates's correction for continuity ......................................................................................................... 426