Statistical Techniques in Business

Module Guide
Copyright © 2021
MANCOSA
All rights reserved; no part of this book may be reproduced in any form or by any means, including photocopying machines, without the
written permission of the publisher. Please report all errors and omissions to the following email address:
modulefeedback@mancosa.co.za
This Module Guide,
Statistical Techniques in Business (STB7)
will be used across the following programmes:
• Advanced Diploma in Business Management
• Bachelor of Commerce in Entrepreneurship
• Bachelor of Commerce in Retail Management
Preface
Unit 1: Introduction to Business Statistics
Unit 2: Types of Data
Unit 3: Management Statistics
Unit 4: Probability and Probability Distributions
Unit 5: Index Numbers
Unit 6: Linear Correlation and Regression
Unit 7: Time Series Forecasting
References
List of Contents
List of Tables
Table 3.1: Grouped frequency table of number of years in an organisation
Table 3.2: Crosstabulation of task grade and level of achievement
Table 3.4: Extended table to calculate statistics using grouped frequency data
Table 3.5: Suggested statistics for varying levels of measurement
Table 3.6: Suggested statistics and graphical representation of varying levels of measurement
Table 4.1: The z-table
Table 5.1: Table calculating Laspeyres Price and Quantity Index
Table 5.2: Table of prices and quantities in millions of shares for three different car brands
Table 7.1: Table illustrating the forecasted figures
Table 7.2: Table of demand quarterly, for three consecutive years
List of Figures and Illustrations
Figure 3.1: Bar graph graphically illustrating years of work experience
Figure 3.2: Pie chart of the percentage of males and females (Gender)
Figure 3.3: Scatterplot of productivity by year
Figure 4.5: Scatterplot depicting a non-linear, inverse relationship
Figure 3.10
Figure 3.11
Figure 3.14: Histogram illustrating a roughly symmetrical shape
Figure 3.15: Histogram illustrating a right, positive skew
Figure 3.16: Histogram illustrating a left, negative skew
Figure 3.17: Graphical illustration of skewness (Source: Groebner et al., 2011)
Figure 3.18: Graphical illustration of a normal distribution using IQ as an example (Source: MANCOSA)
Figure 4.1: The Binomial distribution
Figure 4.2: The Normal Distribution
Figure 4.3: Histograms illustrating normally distributed data
Figure 4.4: The Standard Normal Distribution
Figure 4.6: Graphical representation of the theory underlying the sampling distribution of the mean
Figure 7.1: Graphical representation of a Time Series
Figure 7.2: Graphical representation of trend
Figure 7.3: Graphical representation of cyclical variation
Figure 7.4: Graphical representation of seasonal variation
Figure 7.5: Illustration of the additive model
Figure 7.6: Illustration of the multiplicative model
Figure 7.7: Graphical representation of seasonal demand
Preface
A. Welcome
Dear Student
It is a great pleasure to welcome you to Statistical Techniques in Business (STB7). To make sure that you
share our passion for this area of study, we encourage you to read this overview thoroughly. Refer to it as
often as you need to, since it will make studying this module a lot easier. The intention of this module is
to develop both your confidence and your proficiency in business statistics.
The field of Statistical Techniques in Business is dynamic and challenging. The learning content,
activities and self-study questions contained in this guide will therefore provide you with opportunities to explore
the latest developments in this field and help you to discover the field of Statistical Techniques in Business as
it is practised today.
This is a distance-learning module. Since you do not have a tutor standing next to you while you study, you need
to apply self-discipline. You will have the opportunity to collaborate with other students via social media tools.
Studying in this way calls for self-direction and responsibility, but you will gain a lot from the experience. These
study skills will contribute to your life skills, which will help you to succeed in all areas of life.
We hope you enjoy the module.
MANCOSA does not own or purport to own, unless explicitly stated otherwise, any intellectual property rights in or
to multimedia used or provided in this module guide. Such multimedia is copyrighted by the respective creators
thereto and used by MANCOSA for educational purposes only. Should you wish to use copyrighted material from
this guide for purposes of your own that extend beyond fair dealing/use, you must obtain permission from the
copyright owner.
B. Module Overview
• The purpose of this module is to serve as an introduction to Statistical Techniques in Business, to
orientate and equip you to deal with basic business statistics in the real world, and to facilitate and
inform sound business decisions.
• This module covers basic managerial statistics such as basic descriptives, and explores other business
statistics tools such as Time Series Forecasting, Index Numbers, Seasonal Indices and Probability.
• These basic managerial statistics constitute a key contributor to your programme efficacy by equipping
you for the realities of the business world.
• The module is a 20-credit module at NQF level 7.
• Statistics is a subject that is best learned through doing. Thus, the suggested order of learning for this
module is first to grapple with the theoretical underpinnings and uses of statistical tests, and thereafter to work
through the examples and the activities provided. Try to work towards your own answers, and keep
practising.
C. Learning Outcomes and Associated Assessment Criteria of the Module
LEARNING OUTCOMES OF THE MODULE
• Explain why quantitative techniques and calculations are important to a manager;
• Perform statistical analyses in practice and extract additional information from business data;
• Manipulate gathered (grouped and ungrouped) data through various statistical methods to generate useful
information to support management decisions;
• Prepare and interpret reports expressed in statistical terms; and
• Assess the validity of statistical findings and the relevance and reliability of results

ASSOCIATED ASSESSMENT CRITERIA OF THE MODULE
• Understand what is implied by business statistics
• Explain why statistics is important within the management context
• Identify ways in which managers need to rely on business statistics
• Identify ways in which statistics can be useful within business environments
• See the relevance of particular types of statistical tests, their use within particular contexts, and in relation to
particular business problems
• Be able to manipulate data in optimal ways to inform decisions
• Be able to calculate descriptive statistics
• Calculate the mean, median, mode and standard deviation on both raw and grouped data
• Interpret the output with reference to the scenario
• Utilise learned statistical techniques to generate and interpret reports that can be used for decision making
• Determine the statistical techniques that are appropriate for the task at hand and that best inform the
decisions that need to be made
• Understand what is implied by validity and reliability
• Be able to interpret and apply the concepts
D. Acronyms
IV Independent Variable
DV Dependent Variable
E. Learning Outcomes of the Units

You will find the Unit Learning Outcomes and the Associated Assessment Criteria on the introductory pages of
each Unit in the Module Guide. The Unit Learning Outcomes and Associated Assessment Criteria give an
overview of the areas in which you must demonstrate knowledge, and of the practical skills you must have
acquired by the end of each Unit in the Module Guide.
F. How to Use this Module

This Module Guide was compiled to help you work through your units and textbook for this module, by breaking
your studies into manageable parts. The Module Guide gives you extra theory and explanations where
necessary, and so enables you to get the most from your module.
The purpose of the Module Guide is to allow you the opportunity to integrate the theoretical concepts from the
prescribed textbook and recommended readings. We suggest that you skim through the entire guide
to get an overview of its contents. At the beginning of each Unit, you will find a list of Learning Outcomes and
Associated Assessment Criteria. These outline the main points that you should understand when you have
completed the Unit/s. Do not attempt to read and study everything at once. Each study session should be 90
minutes without a break.
This module should be studied using the prescribed and recommended textbooks/readings and the relevant
sections of this Module Guide. You must read about the topic that you intend to study in the appropriate section
of this guide before you start reading the textbook in detail. Ensure that you make your own notes as you work
through both the textbook and this module. In the event that you do not have the prescribed and recommended
textbooks/readings, you must make use of any other source that deals with the sections in this module. If you
want to do further reading, and want to obtain publications that were used as source documents when we wrote
this guide, you should look at the reference list and the bibliography at the end of the Module Guide. In addition,
at the end of each Unit there may be a link to the PowerPoint presentation and other useful reading.
G. Study Material
The study material for this module includes tutorial letters, the programme handbook, this Module Guide and a list of
prescribed and recommended textbooks/readings, which may be supplemented by additional readings.
H. Prescribed and Recommended Textbook/Readings

There is at least one prescribed or recommended textbook/reading allocated for the module.
The prescribed and recommended readings/textbooks present a tremendous amount of material in a simple,
easy-to-learn format. You should read ahead during your course. Make a point of re-reading the learning
content in your module textbook. This will increase your retention of important concepts and skills. You may wish
to read more widely than just the Module Guide and the prescribed and recommended textbooks/readings; the
Bibliography and Reference list provide you with additional reading.
The prescribed textbook for this module is:

• Wegner, T. (2015). Applied Business Statistics. (4th ed.). Juta: Cape Town.
In addition to the prescribed textbook, the following should be considered for recommended books/readings:
• Durrheim, K., and Tredoux, C. (2012). Numbers, Hypotheses & Conclusions: A Course in Statistics for
the Social Sciences. (2nd ed.). Cape Town: UCT Press.
I. Special Features
In the Module Guide, you will find the following icons together with a description. These are designed to help you
study. It is imperative that you work through them as they also provide guidelines for examination purposes.
Special Feature: Explanation
LEARNING OUTCOMES: The Learning Outcomes indicate aspects of the particular Unit you have to master.
ASSOCIATED ASSESSMENT CRITERIA: The Associated Assessment Criteria are used to evaluate students'
understanding and are aligned to the outcomes. They set the standard for the successful demonstration of the
understanding of a concept or skill.
THINK POINT: A Think Point asks you to stop and think about an issue. Sometimes you are asked to apply a
concept to your own experience or to think of an example.
ACTIVITY: You may come across Activities that ask you to carry out specific tasks. In most cases, there are no
right or wrong answers to these activities. The purpose of the activities is to give you an opportunity to
apply what you have learned.
READINGS: At this point, you should read the references supplied. If you are unable to acquire the suggested
readings, then you are welcome to consult any current source that deals with the subject.
PRACTICAL APPLICATION OR EXAMPLES: Practical Applications or Examples will be discussed to enhance
understanding of this module.
KNOWLEDGE CHECK QUESTIONS: You may come across Knowledge Check Questions (KCQs) at the end of
each Unit that will test your knowledge. You should refer to the Module Guide or your textbook(s) for the answers.
REVISION QUESTIONS: You may come across Revision Questions that test your understanding of what you
have learned so far. These may be attempted with the aid of your textbooks, journal articles and Module Guide.
CASE STUDY: Case Studies are included in different sections in this Module Guide. This activity provides
students with the opportunity to apply theory to practice.
VIDEO ACTIVITY: You may come across links to Video Activities as well as instructions on activities to attend to
after watching the video.
Unit 1: Introduction to Business Statistics
Unit Learning Outcomes
CONTENT LIST AND LEARNING OUTCOMES OF THIS UNIT:
1.1. Introduction
• Introduce topic areas for the unit
1.2. What is statistics?
• Define what is meant by statistics
• Discern the need for business statistics
1.3. Types of statistics
• Define and differentiate between the types of available statistics
1.4. Conclusion
• Summarise the topic areas of the unit
Unit One: Introduction to Business Statistics

Statistics - the science of designing studies, gathering data, and then classifying, summarising, interpreting
and drawing conclusions from the data.
Descriptive statistics - summarising and describing data using frequencies, percentages, measures of central
tendency, dispersion/spread and by looking at shapes of distributions.
Inferential statistics - moving beyond merely describing data to make inferences or generalisations about the
population from which the sample was drawn.
Prescribed Readings
• Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western, Cengage Learning.
• Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses & Conclusions: A Course in Statistics for the
Social Sciences. Cape Town: UCT Press.
1.1 Introduction to Business Statistics

Managers are required to make decisions every day. They often do so in rushed and pressurised situations,
drawing on problem-solving, leadership, communication, reasoning, experience and sound judgement,
amongst other skills. In many instances they rely heavily on numbers to inform their
decisions. If you think about it, numbers are everywhere: income, profit, turnover, levels of production,
forecasting, demand, costing and so forth. To grapple adequately with such numbers and reach sound
conclusions from them, managers need the necessary skill and ability. At the most fundamental level, this
means being able to understand statistical reasoning and to interpret results accurately. In addition, they
should be skilled enough to detect anomalies, shortfalls or errors when presented with quantified information.
Numbers are useful in that they not only provide a clear, precise and objective measure, but can also be
manipulated via calculations in order to arrive at particular answers. These answers can be used to increase our
understanding of things, and thus to inform decisions more accurately.
1.2 What is Statistics?

Statistics refers to the science of designing studies, gathering data, and then classifying, summarising,
interpreting and drawing conclusions from the data. It covers a range of techniques and procedures for
analysing, interpreting and displaying data, and it presents data in an informative way so as to support the
decisions that need to be made.
Why statistics is important for researchers and managers

Managers and researchers use statistics so that they can better comprehend, and convey in a more
understandable fashion, the information being relayed.
Statistics as a field is highly variable, and findings can range from the trivial to the highly useful. For example, in
2011 the Pew Internet & American Life Project study found that 8% of internet users do not complete their
searches using search engines. This could imply that they complete searches by typing in URLs directly.
A finding reported by the University of Connecticut in Mansfield challenged the conventional view of cholesterol.
Traditionally, coronary heart disease was attributed to saturated fats clogging the arteries. However, it was
found that clogged arteries may be a result of bacteria, not diet. As you can see, statistics varies in
importance and use across the political, medical, educational and social science arenas, but in business it is
a formidable and invaluable tool for assimilating and using information in ways that facilitate
sound business decisions.
For example, in marketing research, our behaviour as consumers will generate statistics that will inform
companies about what products can be retained, discontinued or modified.
Think Point
Try to identify ways in which business statistics can be useful in today’s business environment.
1.3 Types of Statistics

There are generally two main types of statistics: descriptive and inferential statistics. Descriptive statistics
involves summarising and describing data using frequencies, percentages, measures of central tendency,
dispersion/spread and by looking at shapes of distributions. For example, if you were to observe the make of
people’s pencils, you might find that 62% of people in your class use Staedtler pencils. You are not trying to say
that everybody in KwaZulu-Natal uses Staedtler pencils; you are merely describing your observation.
Inferential statistics, on the other hand, moves beyond merely describing data to make inferences or
generalisations about the population from which the sample was drawn. For example, based on a survey of the
preferences of 80% of Netflix users, a decision might be made to cancel a show, under the assumption that the
entire Netflix-watching population holds similar views regarding that series.
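The pencil example of descriptive statistics can be sketched in a few lines of Python. This is only an illustration: the class size of 50 and the brand counts are hypothetical figures chosen so that the result matches the 62% in the text, and the names used are not part of the module.

```python
from collections import Counter

# Hypothetical observations: the pencil brand used by each of 50 students.
# The counts are invented for illustration so that Staedtler comes to 62%.
pencil_brands = ["Staedtler"] * 31 + ["Other"] * 19

# Frequencies: how many students use each brand.
counts = Counter(pencil_brands)

# Percentages: each frequency as a share of the class.
percentages = {
    brand: 100 * count / len(pencil_brands)
    for brand, count in counts.items()
}

for brand, pct in percentages.items():
    print(f"{brand}: {counts[brand]} students ({pct:.0f}%)")
```

The output only describes this one class. Claiming anything about pencil users outside the class would be an inference, and would need a properly drawn sample.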
Activity
1.1. Research reveals that men are twice as likely to watch soccer on TV in
South Africa. Is this statistic descriptive or inferential? Why?
1.2. What is meant by the term “statistics”?
1.3. What is the importance of statistics in decision-making related to business

activities?
1.4. There are generally two types of statistics.
a. What are the two types of statistics?
b. Define each making reference to their purpose.
1.4 Conclusion
This chapter provided an introduction to statistics and defined the role of, and need for, statistics, with special
reference to business contexts.
Summary
This chapter described what is meant by terms such as statistics, and made a case
for statistics, particularly in business contexts.
1.5 Answers to Activities

1.1. It is inferential as you have used a sample of males to extrapolate and make inferences about all males in
South Africa.
1.2. Statistics refers to the science of designing studies, gathering data, and then classifying, summarising, and
interpreting and drawing conclusions from the data. It facilitates the informational presentation of data to
support decisions that are needed and refers to a range of techniques and procedures for analysing,
interpreting and displaying data in order to make decisions based on data. A "statistic" is defined as a
numerical quantity (such as the mean) calculated in a sample.
1.3. Numbers are useful in that they not only provide a clear, precise and objective measure, but they can also
be manipulated via calculations in order to arrive at particular answers. These answers can be used to
increase our understanding of things, and thus more accurately inform decisions. Managers make decisions
daily, and therefore, statistics is an invaluable tool to assist in making accurate, responsive and reliable
decisions. Thus, statistics in business is a formidable and invaluable tool in order to effectively assimilate
and use information in ways as to facilitate sound business decisions.
1.4. a. Descriptive and Inferential statistics

b. Descriptive statistics involves summarising and describing data using frequencies, percentages,
measures of central tendency, dispersion/spread and by looking at shapes of distributions.
Inferential statistics, on the other hand, moves beyond merely describing data to make inferences or
generalisations about the population from which the sample was drawn.
Unit 2: Types of Data
2.1 Introduction to data and information
• Introduce data and information
2.2 Sampling
• Differentiate between the types and methods of sampling
2.3 Data Collection
• Define and describe data
• Differentiate between data and information
• List and describe types and sources of data
• List and describe methods of data collection
• Understand and describe the characteristics of data
2.4 Variables
• Understand what is meant by the term variable, and discuss the various types and characteristics of variables
• Identify the importance of defining variables in statistics
2.5 Levels of Measurement
• List and understand the four levels of measurement
2.6 Constructs
• Define and determine the importance of constructs to measurement
2.7 Conclusion
• Summarise the topic areas of the unit
Unit 2: Types of Data

Data - raw numbers that we process in various ways in order to give us meaningful information.
Information - something meaningful conveyed or represented through a particular arrangement, processing
or sequence of things.
Raw data - data that has not been manipulated or treated in any way since its collection, and still exists
in its original form.
Variables - a variable is any characteristic, number, or quantity that can be measured or counted. Variables are
derived when converting a construct into a measurable form.
Constructs - concepts that derive their meaning from theory; they are often abstract and not directly observable.
Primary data - data that has been collected and captured for the first time by the researchers themselves,
with a particular purpose in mind.
Secondary data - data that has already been collected and is already in existence for purposes other
than the current study, and is accessed for the purposes of that study.
Categorical or discrete variables - variables that take whole-number (integer) values.
Continuous variables - variables that can take on any value on the number line from negative to positive
infinity, including decimals and fractions.
Mutually exclusive - categories are mutually exclusive when belonging to one category automatically excludes
an element from belonging to another.
Independent variables - variables which are under the control of the researcher, and are manipulated in order
to bring about changes in dependent variables.
Dependent variables - outcome variables, which are observed to see if they change when the independent
variable changes.
Prescribed and Recommended Textbooks/Readings

• Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western, Cengage Learning.
• Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses & Conclusions: A Course in
Statistics for the Social Sciences. Cape Town: UCT Press.
2.1 Introduction to Data and Information

2.1.1 Data and information
Data is of integral importance when performing a valid analysis and reaching reliable conclusions. Considering
the type and characteristics of data is important, as their mathematical properties determine what we can do
with them statistically and how to go about interpreting our findings. Data can exist in many forms, and each type
needs to be collected, analysed and presented differently. You often hear the terms “data” and “information”,
but what do they mean?
Data are facts, especially numerical facts, that we process in various ways in order to give us meaningful
information. Raw data has not been manipulated or treated in any way since its collection, and still exists
in its original form. Take for example the numbers 76, 52, 78, 38 and 80. These numbers represent data and,
unless processed, mean very little to us. They could represent steps taken, percentages, scores, anything really.
If I were to say they were percentages, then by processing them we can draw conclusions such as the mean score
being 64.8%, which translates to average performance. Processing also gives us other information, such as
being able to discern the high performers (78% and 80%) from the poor performer (38%). Alone the data means
very little, but once we understand what the numbers represent, we are in a better position to manipulate and
process them in ways that give us more holistic and usable information upon which we can base our decisions.
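The step from raw data to information described above can be sketched as a short calculation. The five values come from the text; the two cut-off values are assumptions introduced here only to reproduce the grouping given in the text, and are not official performance bands.

```python
from statistics import mean

# Raw data from the text, taken here to be percentage scores.
scores = [76, 52, 78, 38, 80]

# Processing step 1: summarise the data with the mean.
average = mean(scores)  # (76 + 52 + 78 + 38 + 80) / 5

# Processing step 2: separate high and poor performers.
# Both cut-offs are hypothetical, chosen to match the grouping in the text.
HIGH_CUTOFF = 78
LOW_CUTOFF = 50

high_performers = [s for s in scores if s >= HIGH_CUTOFF]
poor_performers = [s for s in scores if s < LOW_CUTOFF]

print(f"Mean score: {average:.1f}%")
print("High performers:", high_performers)
print("Poor performers:", poor_performers)
```

The same five numbers would need different processing if they represented, say, steps taken, which is why knowing what the data represents comes first.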
For example, market researchers collect data daily in order to provide information about consumer opinions so
that organisational leaders can make strategic business decisions. The type of data that managers need
must therefore be relevant to the informational requirements of decision makers. The types of available data
in turn inform the types of processing we can do and the conclusions we can draw.
[Illustration 1. Processing data into information. Flow shown: sources and types of data, data collection, raw
data, processing, information (useful form), presentation, managers (use information to make decisions)]
When we are turning data into usable information, we generally take four steps:
1. Sampling
2. Data Collection
3. Processing data using descriptive and inferential statistics
4. Presenting results
2.2 Sampling
We sample because often we do not have sufficient resources, like time or money, to collect data from
everyone in the population. Please note that when we refer to the population, we do not mean everyone in South
Africa; we are referring rather to all of the people we wish to say something about. For example, if I wished to say
something about MANCOSA employees, my target population would be all MANCOSA employees, from which I
would draw the sample.
When drawing a sample, you need to consider:
• Who will be surveyed? (The Sample)
• How many people will be surveyed? (Sample Size)
• How should the sample be chosen? (Sampling Methodology)
At all times, you should guard against bias and error.
There are two approaches to sampling, probability and non-probability sampling.
2.2.1 Probability Sampling

Probability sampling refers to a sampling method whereby each element in the population has a known,
non-zero chance of being selected (an equal chance, in the simplest case). Ensuring that every element has an
equal chance of selection requires that the sample is drawn randomly. Random selection is thus pivotal in
ensuring a representative sample is drawn. A representative sample is necessary to ensure that conclusions
drawn from the sample accurately represent the target population the researcher wishes to say something about.
2.2.1.1 Simple Random Sampling

Sample members are chosen at random from the population, so that each unit within that population has an equal,
known chance of being selected. The assumption, therefore, is that the selected sample is representative of the
population from which it is drawn, as there are no factors biasing it in one direction or another. In order to draw a
random sample, you need an exhaustive list of all the elements comprising the population. You need to assign
an identifier to each element, and then randomly select (by means of a random number generator, or another
programme capable of true random selection) the required number of elements to make up your sample. The
difficulty with this type of sampling, however, is that it can sometimes be near impossible to collect and collate a
list of every element in the target population.
PRACTICAL APPLICATION – EXAMPLE OF SIMPLE RANDOM SAMPLING
Say, for example, you wished to draw a simple random sample from your place of
work, which has 467 employees. A comprehensive list of all employees
should be compiled, and a unique number assigned to each. Thereafter,
you place all the numbers in a hat, and randomly draw numbers. Alternatively, you
could utilise a computerised random number generator, or a table of random
numbers, to select the employees to whom the questionnaire should be sent.
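The computerised draw described above can be sketched in a few lines of Python. This is a minimal illustration only: the sample size of 50, the seed, and the use of the numbers 1 to 467 as employee identifiers are assumptions made for the example and are not prescribed by this guide.

```python
import random

# Hypothetical sampling frame: one identifier per employee (1 to 467).
population_size = 467
employee_ids = list(range(1, population_size + 1))

# Assumed sample size, chosen only for illustration.
sample_size = 50

# A fixed seed makes the draw repeatable; leave it out for a fresh draw each time.
rng = random.Random(2021)

# random.sample draws without replacement, so no employee is selected twice.
sample = rng.sample(employee_ids, sample_size)

print(sorted(sample))
```

Because every identifier is equally likely to be drawn, this plays the same role as drawing numbers from a hat.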
2.2.1.2 Systematic Sampling

This is when you select your sample systematically from an ordered sampling frame; in other words, elements are
chosen at regular intervals. The units making up the population of interest are placed into
a list, and every kth element is chosen. Here “k”, your sampling interval, is the gap between
selected elements (for example, every 9th person), and is calculated by dividing the population size by
the sample size. Firstly, you need to decide how many people you would like your sample to comprise, then
divide the total number in your population by the number of elements you wish to include in your sample to
derive your “k” value. Systematic sampling differs from simple random
sampling in that once your first case is selected, this automatically determines all subsequent selections. This
means that your selections are not independent of one another.
PRACTICAL APPLICATION – EXAMPLE OF SYSTEMATIC SAMPLING
If you wished to survey MANCOSA employees, you would create a comprehensive and
exhaustive list of all MANCOSA employees across all of the regions. Once you
have the list, you decide how many employees you wish to survey,
then divide the population size by that number to determine k. For argument’s sake, let’s say there are 510
employees and you wish to draw a sample of 250 people. You divide 510 by
250 to get 2.04, which rounds to a sampling interval of 2. You then pick a random starting point on the list and send the
survey to every second person.
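The arithmetic in this example can be checked with a short Python sketch. The figures 510 and 250 come from the example; the numeric employee identifiers, the seed and the choice to round k down are assumptions made for illustration.

```python
import random

population_size = 510   # employees on the list (from the example)
desired_sample = 250    # intended sample size (from the example)

# Sampling interval k = population size / sample size, rounded down.
k = population_size // desired_sample   # 510 // 250 = 2

# Hypothetical ordered sampling frame of employee identifiers.
employee_ids = list(range(1, population_size + 1))

rng = random.Random(7)        # assumed seed, for a repeatable illustration
start = rng.randrange(k)      # random starting position within the first k names

# From the random start, take every k-th employee on the list.
sample = employee_ids[start::k]

print("k =", k, "| sample size =", len(sample))
```

Note that rounding 2.04 down to 2 yields 255 people rather than exactly 250; in practice the researcher would either accept the slightly larger sample or trim it.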
2.2.1.3 Stratified Sampling

This sampling technique involves dividing the entire population into sub-groups, or strata, based on a
particular attribute, and then randomly sampling within each stratum. Selection within each stratum can be done
systematically or randomly.
PRACTICAL APPLICATION – EXAMPLE OF STRATIFIED SAMPLING
Say, for example, we wished to investigate attitudes towards work-life balance
amongst organisations situated in rural, suburban, and urban areas. The strata
would consist of rural, suburban and urban organisations, and a random selection within
each stratum would ensue. The advantage of stratified sampling is that not only
will you be able to measure general employee attitudes towards work-life balance,
but you will also be able to compare the various sub-groups of the population.
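A stratified draw along these lines might look as follows in Python. The organisation identifiers, the stratum sizes (40, 60 and 100) and the 10% sampling fraction are all hypothetical values invented for this sketch.

```python
import random

rng = random.Random(11)  # assumed seed, for a repeatable illustration

# Hypothetical sampling frame: organisation identifiers grouped by stratum.
strata = {
    "rural": [f"R{i}" for i in range(1, 41)],      # 40 organisations
    "suburban": [f"S{i}" for i in range(1, 61)],   # 60 organisations
    "urban": [f"U{i}" for i in range(1, 101)],     # 100 organisations
}

sampling_percent = 10  # assumed: sample 10% of every stratum

sample = {}
for name, members in strata.items():
    n = len(members) * sampling_percent // 100
    # Simple random sampling is applied separately within each stratum.
    sample[name] = rng.sample(members, n)

for name, chosen in sample.items():
    print(name, len(chosen), chosen)
```

Sampling the same fraction from every stratum keeps each sub-group's share of the sample equal to its share of the population, which is what makes the cross-comparison between sub-groups possible.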
2.2.1.4 Cluster Sampling

This type of sampling is used when there are “natural” and homogeneous sub-groups within a population. The
population is divided into these groups, known as clusters, and a simple random sample of clusters is drawn. The
difference between cluster sampling and stratified sampling is that in cluster sampling the cluster itself is the
sampling unit and selection is done on a population of clusters, whereas in stratified sampling the selection is done on
elements within each stratum.
PRACTICAL APPLICATION – EXAMPLE OF CLUSTER SAMPLING
Because it is impossible to list all of the people in, say,
KwaZulu-Natal, you could instead sample the people living within an x km
radius of a given point. Those people represent a cluster of subjects, similar in certain
characteristics associated with living in proximity to each other. Your clusters
need to be randomly chosen from the population of clusters, and all members of
each selected cluster need to be included.
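The two steps in this example (randomly choose clusters, then include everyone in the chosen clusters) can be sketched in Python. The eight areas, the five residents per area and the decision to select three clusters are hypothetical and serve only to illustrate the mechanics.

```python
import random

rng = random.Random(3)  # assumed seed, for a repeatable illustration

# Hypothetical population of clusters: area name -> residents in that area.
clusters = {
    f"Area {c}": [f"{c}-{i}" for i in range(1, 6)]   # 5 residents per area
    for c in "ABCDEFGH"                              # 8 areas
}

n_clusters = 3  # assumed number of clusters to select

# Step 1: the clusters themselves are the sampling units and are chosen randomly.
chosen_areas = rng.sample(sorted(clusters), n_clusters)

# Step 2: every member of each chosen cluster is included in the sample.
sample = [person for area in chosen_areas for person in clusters[area]]

print(chosen_areas)
print(sample)
```

Only a list of clusters is needed to draw the sample, not a list of every individual, which is why the method suits large populations that cannot be enumerated.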
2.2.2 Non-Probability Sampling
Non-probability, or non-random, sampling is used when random sampling is either not possible, not
permissible, or not required. In non-random sampling, not every element in the target population has an
equal chance of being selected.
2.2.2.1 Convenience Sampling

This type of sampling involves the use of people who are readily available to participate in the study. It is the
least rigorous and weakest sampling technique. An example would be to survey fellow shoppers in the mall
you are shopping in simply because they are there, and available. Generalisation to the population is not possible, and
findings need to be interpreted with extreme caution.
PRACTICAL APPLICATION – EXAMPLE OF CONVENIENCE SAMPLING
You are a market researcher, and are interested in determining people’s preferred
coffee brand. You pop down to your nearest shopping centre and interview shoppers as
they walk past.
2.2.2.2 Judgement/Purposeful Sampling

This sampling technique is most common in qualitative studies. The researcher or field workers select the
sample based on their knowledge of the population and the
characteristic of interest. The aim is to gather a sample that would yield the most insight into the phenomenon at
hand, that is, those who would best answer the research question. It involves selecting a sample judged
to be most representative or typical of the population. Questions do arise regarding the use of judgement in
deciding what is “typical”, and whether what is “typical” remains so over time.
PRACTICAL APPLICATION – EXAMPLE OF JUDGEMENT/PURPOSIVE SAMPLING
In order to explore employee experiences of a recent acquisition, you survey
employees who have undergone the acquisition.
2.2.2.3 Quota Sampling

A quota sampling technique is one in which the researcher chooses the sample based on fixed quotas. The researcher
divides the population into mutually exclusive sub-groups, and units are then selected into the
sample based on pre-determined characteristics, such that each sub-group makes up the same proportion of the sample
as it does of the total population. The purpose of quota sampling is to represent the
major population characteristics by sampling proportionally from each. The main difference between stratified and
quota sampling is that in quota sampling judgement is used instead of randomness to select within each sub-group.
The number of sampling units chosen within each sub-group depends on that sub-group’s proportion of the population.
PRACTICAL APPLICATION – EXAMPLE OF QUOTA SAMPLING
If you wished to investigate the effects of socio-economic levels on experiences of
the workplace, you would ensure that your sample includes
representation of low, middle and high socio-economic status in accordance with
their prevalence in the population.
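To make the idea of proportional quotas concrete, the sketch below sets quotas from population shares and then fills them in the order respondents happen to be encountered. The shares (50%, 35%, 15%), the sample size of 20 and the stream of respondents are all invented for illustration.

```python
# Hypothetical population shares (in percent) for each socio-economic group.
population_share_pct = {"low": 50, "middle": 35, "high": 15}

sample_size = 20  # assumed total sample size

# Each quota is the group's share of the population applied to the sample size.
quotas = {group: sample_size * pct // 100
          for group, pct in population_share_pct.items()}

# Hypothetical stream of respondents in the order the researcher meets them.
pattern = ["middle", "high", "low", "middle", "low"]
respondents = [(i, pattern[i % len(pattern)]) for i in range(40)]

# Respondents are accepted in order of arrival (not randomly) until each
# group's quota is full; this is what distinguishes quota from stratified sampling.
accepted = {group: [] for group in quotas}
for respondent_id, group in respondents:
    if len(accepted[group]) < quotas[group]:
        accepted[group].append(respondent_id)

print(quotas)
print({group: len(ids) for group, ids in accepted.items()})
```

The sample ends up with the same group proportions as the population, but because selection within each group is not random, the usual cautions about non-probability samples still apply.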
2.3 Data Collection

When we collect data, we need to focus on the purpose for which the data is required, and how that data will be
used. This will determine what type of data is needed, and what the best method is to gather that data. When
deciding on the method of data collection, it is important to determine the amount of data required, the sources
from which the data can be obtained, and the means by which the data can be collected. Up to this point, we have
primarily dealt with quantitative data, but for certain topics, qualitative data is preferable. Quantitative studies,
as the name suggests, deal with quantification, or the measurement of the quantity of something. Qualitative
studies, on the other hand, deal with gaining in-depth descriptions and understandings of phenomena. To illustrate
the difference, a quantitative study would count how many tins of The Best coffee were sold in
a month, whereas a qualitative study would focus on understanding why people prefer purchasing The
Best coffee over other brands. Again, the purpose of your study will determine what type of data to collect, and
this in turn determines the best method of collecting it.
There are several means by which to collect data. Some involve primary data, and others secondary data.
Primary data refers to data that is collected for the first time by the researcher themselves,
with a particular purpose in mind. Secondary data is data that was already collected
for some other purpose, and is accessed for the purposes of the current
study. Primary data is useful in that it is directly related to the research problem, and the researcher
exercises greater control over the data collection process, and can therefore better ensure the accuracy and
credibility of the data. Despite these benefits, primary data can be
somewhat cumbersome to collect, often taking time and eliciting a poor response rate. It also generally costs
more to collect than data that is already in existence.
Secondary data has the advantage of already being in existence, with short access times and lower
cost, but because it was not collected for the purposes of the study, it may not
be entirely relevant. Secondary data may also be dated, and its accuracy and reliability
may be difficult to assess.
If the researcher intends to collect primary data, quantitative data collection instruments include
questionnaires, rating scales and observations. Secondary sources of quantitative data may include market
research figures, financial statements, census data, government data, economic indexes, and epidemiological and
population statistics, to mention a few. Qualitative primary data collection instruments include interviews (often
face-to-face and semi-structured), qualitative observations, and focus groups.
As previously mentioned, there exist several means by which we collect primary data.
2.3.1. Questionnaires
A questionnaire is most often used in research to collect data from several people using a standard, ordered
list of questions in written form. A researcher is able to email, personally administer, telephonically administer or
post questionnaires. Although questionnaires are useful in collecting large amounts of data in a relatively short
period of time, researchers often battle with poor response rates, and questionnaires are highly inflexible in that you are
unable to ask follow-up questions or seek clarification of responses. The reliability of answers to questionnaires is
highly influenced by the wording and understanding of the questions, the layout, the level of
literacy of respondents and so forth.
2.3.2. Observation
This method of data collection involves directly observing and counting instances of an event, taking
measurements, or determining how things work, or how they have changed as a result of an intervention of some sort.
Observation can be either a qualitative or a quantitative method of data collection. For quantitative observations,
the researcher is first required to select the aspect of behaviour they wish to observe, then define the markers
characteristic of that behaviour, and develop a system to quantify observations and procedures to record behaviour.
Whereas the observation involved in quantitative research is systematic and structured, qualitative
observers aim to describe a particular behaviour within the setting in which it naturally occurs, and observation
tends to take place over extended periods of time (more so than quantitative observation). Unlike quantitative
observation, there are no previously determined hypotheses, and whereas quantitative observation involves the
use of checklists and other tools developed prior to investigation, qualitative researchers rely on narratives and
words about the setting, behaviours and interactions.
Direct observation is made to ascertain the extent to which a particular behaviour is demonstrated, and the
number of occurrences is recorded. Before the behaviour can be recorded, there is a need to identify the
behaviour of interest, define markers thereof and devise a procedure to identify, categorise, and record it.
Observation is useful in that it provides a record of behaviour as it occurs, without having to ask subjects
what they feel or think, and is therefore particularly useful for studies involving young children who cannot yet
effectively articulate their feelings. In addition, observation can occur in natural
settings, like the classroom or playground. Although directly observing and gathering data is reliable and relatively
easy to do, human observers are prone to get tired, make mistakes, get distracted and misinterpret what they are
observing.
2.3.3. Interviews
Interviews are among the most widely used qualitative data collection techniques, and usually involve a
conversation between two people. They allow for the collection of opinions, beliefs, and feelings about situations,
expressed in words, that cannot otherwise be collected through, for example, observation. The difference between
a normal, day-to-day conversation and an interview is that an interview is a conversation with a particular
topic or focus in mind. Interviews are often recorded and transcribed, and the words are analysed in order to find
common themes and findings across the data. The advantages of interviews are that they allow the respondent to
relax (unlike questionnaires, where respondents may feel like they are being tested) and you can often gain in-depth
insight into and understanding of the phenomenon at hand. They are highly flexible in that you can ask probing and
follow-up questions based on interviewee answers, and can provide clarification where necessary.
2.4. Variables
Variables are constructs or characteristics that can take on different values or scores. Variables are created by
converting constructs into a measurable form. Researchers study variables, and are particularly interested in the
relationships that exist between them. The variable under the control of the researcher, and which he/she
manipulates is known as the independent variable. The observed, measured or outcome variable is known as
the dependent variable.
There are two types of variables, namely, qualitative and quantitative variables. Qualitative variables indicate that
the person or object belongs to a particular category, and are therefore also called categorical variables. Examples
include gender (male/female) and whether or not a person is in possession of a car. Quantitative variables, on the
other hand, can be either discrete or continuous. Categorical variables represent qualitative data, and take on names
or labels, whereas quantitative data is essentially numerical.
Discrete variables take on whole numbers (integers), whereas continuous variables can take on any value on
the number line, from negative to positive infinity, including decimals and fractions. Height is a good example
of a continuous variable. Categorical variables represent categories, and are coded as whole numbers when
represented by numbers. The categories are mutually exclusive in the sense that by belonging to one category, a case is
automatically excluded from belonging to another. For example, a person may be in possession
of a valid driver’s licence, or not. They cannot be in possession of a valid driver’s licence whilst at the same time
not possessing one. It is important to consider the type of variable, as this determines the level and type
of statistics that can be run on the data. For categorical data, you would produce frequencies and percentages, and
measures of central tendency like the mode.
In research we are particularly interested in independent and dependent variables. Independent variables are
those variables which are under the control of the researcher, and are manipulated in order to bring about
changes to dependent variables. Dependent variables are outcome variables, and are observed to see if they change when
the independent variable changes. For example, imagine a researcher is interested in the effect of
training on performance. In this case, we have two variables:
Independent variable (IV) – Training
Dependent variable (DV) – Performance
The researcher divides subjects into two groups, one of which receives training and the other no training. The
researcher then observes and compares performance (DV) between the groups to see if there is any difference in
performance as a result of having received training.
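The comparison between the two groups can be illustrated with a few lines of Python. The performance scores below are made up purely for the sketch; they are not data from any study referred to in this guide.

```python
from statistics import mean

# Made-up performance scores (the DV), for illustration only.
trained = [72, 68, 75, 80, 70]      # group that received training
untrained = [65, 60, 70, 62, 68]    # group that received no training

# The IV (training: yes/no) defines the groups; the DV is compared between them.
mean_trained = mean(trained)
mean_untrained = mean(untrained)
difference = mean_trained - mean_untrained

print("Mean performance (trained):  ", mean_trained)
print("Mean performance (untrained):", mean_untrained)
print("Difference in means:         ", difference)
```

A difference in group means is only a description of these two samples. Whether such a difference can be attributed to training, rather than to chance, is a question for inferential statistics.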
2.5. Levels of Measurement

Whether you are dealing with discrete or continuous data, invariably you need to work with numbers in order to
analyse data statistically. This involves representing categorical data using numbers. For instance, if we were to
represent five colours using numbers, it would look something like 1 = blue, 2 = yellow, 3 = red, 4 =
white, and 5 = black. Can you see how numbers assigned to categories differ from numbers
used to represent, for example, height? That is because numbers have different properties or
characteristics depending on the level of measurement. These different properties allow for particular
statistics and exclude others, and are therefore important to consider when analysing data and deciding which
statistical tests are appropriate.
There are four levels of measurement, or scales of measurement: nominal, ordinal, interval,
and ratio.
Nominal
– Used to identify membership to a group or category
– Variables with no inherent order or ranking sequence
– Simply assign numerical values to mutually exclusive categories
– The numbers are meaningless, they do not represent a quantity - they simply represent a category,
and therefore you cannot add, subtract, multiply or divide them
– Calculating frequencies of occurrence and using the mode is ideal for this level of measurement
e.g. male = 1, female = 2
Ordinal
– Variables with an ordered series or rank according to how much of an attribute they possess
– Numbers assigned to such variables indicate rank order only - the "distance" between the numbers
has no meaning
– Ideal for statistics that indicate the points below which certain percentages of the cases fall in a
distribution of scores
– Use median, and basic descriptives for this level of measurement
e.g. "greatly dislike, moderately dislike, indifferent, moderately like, greatly like"
Interval
– Equally spaced variables
– No true zero – zero denotes a value, not the absence of the quantity
e.g. temperature. The difference between a temperature of 66 degrees and 67 degrees is taken to be
the same as the difference between 76 degrees and 77 degrees. Because interval variables do not have a true
zero, ratios are not meaningful, e.g. 88 degrees is not double the temperature of 44 degrees.
Ratio
– Highest level of measurement
– True zero: zero denotes the absence of the quantity, so it is numeric data with a zero origin
– Has equal intervals
– Can take on fractions and decimals, and ratios between values are meaningful
e.g. age, distance, time, mass, sales, units and income
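The difference between interval and ratio data can be checked numerically. The sketch below assumes that the temperatures in the interval example are in degrees Fahrenheit (the guide does not state the unit) and converts them to kelvin, a temperature scale that does have a true zero.

```python
def fahrenheit_to_kelvin(degrees_f):
    """Convert degrees Fahrenheit to kelvin (a scale with a true zero)."""
    return (degrees_f - 32) * 5 / 9 + 273.15

# On the Fahrenheit scale the numbers suggest that 88 is "double" 44 ...
ratio_fahrenheit = 88 / 44

# ... but on a scale with a true zero, the same two temperatures are far from double.
ratio_kelvin = fahrenheit_to_kelvin(88) / fahrenheit_to_kelvin(44)

# Differences, by contrast, are meaningful on an interval scale:
# a one-degree step is the same size anywhere on the scale.
step_low = 67 - 66
step_high = 77 - 76

print("Ratio of the Fahrenheit readings:", ratio_fahrenheit)
print("Ratio of the same temperatures in kelvin:", round(ratio_kelvin, 3))
print("Equal one-degree steps:", step_low == step_high)
```

The ratio changes when the zero point changes, which is why multiplication and division are only meaningful for ratio-level data, whereas differences are meaningful for both interval and ratio data.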
2.6. Constructs
Constructs refer to phenomena that are not directly observable. A construct is an abstract idea, characteristic or subject
matter that one wishes to measure. Intelligence is an example of a construct. It is not directly observable, but in
order to account for differences in scholastic performance, researchers came up with the idea that
this thing, called intelligence, accounts for the differences. Other examples of constructs include motivation,
school and reading readiness, creativity, emotional intelligence and so forth. Constructs are defined according to
their general meaning and characteristics, and according to how they will be measured or manipulated in a study.
When we define a construct by its general meaning, using a formal definition much like the ones provided in a
dictionary, this is known as a constitutive definition. If we were to look at a constitutive definition of intelligence,
it may be defined as the ability to acquire and apply knowledge and skills. Although this definition is useful in
conveying the general meaning of the construct, it is insufficient for the purposes of research because it lacks the
level of specificity required to replicate a study.
If, on the other hand, we define a construct by the operations by which it will be measured, this is known as an
operational definition. It defines how researchers are to measure a construct, and determines how to collect the
data relevant to that observable event. It serves the purpose of delimiting a term to ensure that everyone knows
what is meant by that term, and provides indicators of the construct.
ACTIVITY 2.1.
1. In each of the following examples, think about the level of measurement of the
data, and whether it is categorical/discrete or continuous:
a) A person’s gender
b) A person’s height (in cm)
c) A student’s business stats average for the semester
d) The number of children in a class
e) An index of learner motivation aggregated from three measures
f) The items on a 5-point Likert scale ranging from 1 – strongly agree, to 5 –
strongly disagree
2. Define what a variable is, and provide an example to illustrate your answer
3. Define a construct and provide an example to illustrate your answer
4. What is the difference between a qualitative variable, and a quantitative variable?
2.7 Conclusion
This chapter provided a brief introduction to the types and characteristics of data that need to be considered
when deciding on the most appropriate statistics to run in order to make sense of data in meaningful and
accurate ways.
2.8 Answers to Activities

1. In each of the following examples, think about the level of measurement of the data, and whether it is
categorical/discrete or continuous:
a) A person’s gender – Nominal, Discrete
b) A person’s height (in cm) – Ratio, Continuous
c) A student’s business stats average for the semester – Ratio, Continuous
d) The number of children in a class – Ratio, Discrete
e) An index of learner motivation aggregated from three measures – Ratio, Continuous
f) The items on a 5-point Likert scale ranging from 1 – strongly agree, to 5 – strongly disagree –
Ordinal, Discrete
2. Define what a variable is, and provide an example to illustrate your answer
Variables are constructs or characteristics that can take on different values or scores. Age, sex, business
income and expenses, country of birth, capital expenditure, class grades, eye colour and vehicle type are
examples of variables. We sometimes differentiate between two types of variables, independent and
dependent variables, e.g. IV = Hours slept, DV = Test scores.
3. Define a construct and provide an example to illustrate your answer
Constructs refer to phenomena that are not directly observable. A construct is an abstract idea, characteristic or
subject matter that one wishes to measure. For example, motivation is not directly observable, so
we create indices to measure this unobservable phenomenon.
4. What is the difference between a qualitative variable, and a quantitative variable?
Qualitative variables indicate that the person or object belongs to a particular category, whereas quantitative
variables are numerical and can be either discrete or continuous. Qualitative variables are only categorical;
where numbers are used, they merely represent categories, so these variables are very limited in their mathematical
and statistical use. Quantitative variables, by contrast, are expressed in actual numbers, and therefore possess more
mathematical and statistical properties because of their numeric nature.
Unit 3: Management Statistics
3.1 Introduction
• Introduces topic areas of the unit
3.2 Presentation of Data
• Understand and discuss the various ways in which data can be presented
• Demonstrate an understanding of the best methods of presenting different types of data
3.3 Measures of Central Tendency
• Demonstrate an understanding of measures of central location
• Calculate measures of central location for grouped and ungrouped data
3.4 Measures of Dispersion or spread
• Define and understand what is implied by spread, and why it’s important
• List and understand the various measures of spread
• Calculate the range, variance, standard deviation, interquartile range and quartile deviation
3.5 The shape of the distribution
• Determine the need to assess the shape of distributions
• Understand what is implied by skewness and kurtosis
• Interpret skewness and kurtosis, and conclude with reference to findings
3.6 Inferential Statistics
• Define inferential statistics, and when they are necessary
• Discern between descriptive and inferential statistics
• Describe and understand the theoretical underpinnings of inferential statistics
• Demonstrate an understanding of choosing appropriate statistical tests
3.7 Summary
• Summarises content areas of the unit
Class - each category of the frequency distribution

Central tendency - describes the most typically occurring values. There are three commonly used numerical
measures of central tendency or central location of a dataset: the mean, the median and the mode
Mean - Average of a group of scores
Median - the score that falls in the middle of an ordered dataset
Mode – most frequently occurring score
Range - the difference between the highest and lowest values in a dataset
Variance - the average of the squared deviations of the data values from the mean
Standard deviation - the square root of the variance. It offers a measure of the average deviation from the
mean
Coefficient of variation - measure of the dispersion relative to the mean
Interval - the range of scores that you are summarising data into
Interval width - the actual number of values that each interval consists of
Dispersion (or spread) - the extent to which the data values of a numeric random variable are scattered about
their central location value

• Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western,
Cengage Learning.
• Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses & Conclusions: A
Course in Statistics for the Social Sciences. Cape Town: UCT Press.
3.1 Introduction
In previous chapters, it was mentioned that statistics is the science of collecting, analysing and presenting data in
informative ways. Statistical analysis of numerical data can be done descriptively and/or
inferentially. Descriptive statistics involves summarising the data by looking at the central
tendency, dispersion/spread, and shape of the distribution. Inferential statistics, on the other hand, involves
analysing sample data in such a way as to make inferences about the population from which
the sample is drawn. This chapter covers the various methods with which data can be presented and
analysed.
3.2 Presentation of data
3.2.1 Frequency Distributions

One way in which we can organise data is through frequencies. Frequency distributions indicate the number of
times a variable takes each of its possible values. They are used to summarise a single categorical variable. We
generate frequency tables as a basic descriptor of the number of times a particular response/outcome occurs.
When the count of each instance is summed and represented graphically, you are able to ascertain the “shape”
and spread of the distribution. By examining the spread, you are able to determine if and where scores cluster
together.
For example, suppose we are trying to assess the number of years employees have spent at an organisation, and
receive raw data as follows:
0.5 0.5 0.5 1 1 1 1 1 2 2 3 3 3 2 2 3
2 3 4 4 5 5 4 4 5 4 5 4 6 6 7
7 7 8 6 7 6 7 9 10
An example of a frequency table output of working experience in SPSS (statistical software for data analysis) is
as follows:
Table 3.1 Grouped frequency table of number of years in an organisation
Number of years at the organisation

                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid   0 - 1 years                8      20.0            20.0                 20.0
        2 - 3 years               10      25.0            25.0                 45.0
        4 - 5 years               10      25.0            25.0                 70.0
        6 or more years           12      30.0            30.0                100.0
        Total                     40     100.0           100.0
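The figures in Table 3.1 can be reproduced from the raw data without SPSS. The Python sketch below is one possible way of doing so; the class limits are taken from the table, while the variable and function names are simply choices made for this illustration.

```python
# Raw data: number of years at the organisation (n = 40), as listed above.
raw_years = [
    0.5, 0.5, 0.5, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 2, 2, 3,
    2, 3, 4, 4, 5, 5, 4, 4, 5, 4, 5, 4, 6, 6, 7,
    7, 7, 8, 6, 7, 6, 7, 9, 10,
]

# Classes as (label, lower limit, upper limit); None marks an open-ended class.
classes = [
    ("0 - 1 years", 0, 1),
    ("2 - 3 years", 2, 3),
    ("4 - 5 years", 4, 5),
    ("6 or more years", 6, None),
]

def classify(value):
    """Return the label of the class that a value falls into."""
    for label, lower, upper in classes:
        if value >= lower and (upper is None or value <= upper):
            return label
    raise ValueError(f"{value} does not fall into any class")

# Count how many values fall into each class.
counts = {label: 0 for label, _, _ in classes}
for value in raw_years:
    counts[classify(value)] += 1

# Build the table rows: frequency, percent and cumulative percent.
n = len(raw_years)
table = []
running_total = 0
for label, _, _ in classes:
    running_total += counts[label]
    table.append((label, counts[label], 100 * counts[label] / n, 100 * running_total / n))

for label, frequency, percent, cumulative in table:
    print(f"{label:<16}{frequency:>10}{percent:>10.1f}{cumulative:>12.1f}")
print(f"{'Total':<16}{n:>10}")
```

The raised `ValueError` is a safeguard: if a value fell into a gap between classes (for example 1.5 years), the classes would not be exhaustive and the table would silently undercount.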

Take note that:
• Classes are mutually exclusive (belonging to one class automatically excludes a value from belonging
to another)
• There is no overlap between classes, i.e. the first class ends at 1 and the lower limit of the following class
starts at 2. You cannot allow for an overlap, e.g. 0-1 year, 1-2 years
• All categories are exhaustive, i.e. all people within that sample or population are accounted for
• The classes are of equal width where possible to facilitate easy analysis and interpretation
From the table above, 20% of respondents have been working at the organisation for 0-1 years, 25% for 2-3 years,
25% for 4-5 years, and 30% for 6 or more years (n = 40).
The frequency distribution is as follows:
[Bar graph of working experience: bars showing frequencies of 8, 10, 10 and 12 for the four classes]
Figure 3.1. Bar graph graphically illustrating years of work experience
The frequency distribution indicates a moderate left skew, with the bulk of frequencies falling to the right, i.e. 6 or
more years. Skewness and kurtosis will be discussed a little further on in the chapter. This
chart is a graphic presentation of categorical data, and as such is known as a bar chart. The two most
commonly used charts for presentations are bar charts and pie charts; both convey a large amount of information
clearly and simply. We will look at the bar chart first. A bar chart consists of a series of bars,
the length of each bar representing the value of the variable being plotted. The bars can be drawn either
vertically or horizontally. When we graphically present continuous data, the chart used is a
histogram, and there are no gaps between the rectangles used to represent classes.
When we convert raw data into frequency distributions, there are a number of decisions we need to make when
grouping the data.
You first need to decide on the classes. A class refers to each category of the frequency distribution. Class
limits refer to the boundaries of each class, and determine which scores fall within that class. The class
interval refers to the width of each class, as delineated by its lower and upper limits. When we are trying to
ascertain the optimal class width we can use the following formula:
$$\text{Approximate class width} = \frac{\text{Highest value in the raw data} - \text{Lowest value in the raw data}}{\text{Number of desired classes}}$$
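As a quick worked illustration of the formula, the sketch below applies it to the years-of-service data from section 3.2.1. The choice of four desired classes is an assumption made here to match Table 3.1.

```python
# Raw data from section 3.2.1: number of years at the organisation.
raw_years = [
    0.5, 0.5, 0.5, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 2, 2, 3,
    2, 3, 4, 4, 5, 5, 4, 4, 5, 4, 5, 4, 6, 6, 7,
    7, 7, 8, 6, 7, 6, 7, 9, 10,
]

desired_classes = 4  # assumed, to match the four classes of Table 3.1

highest = max(raw_years)   # 10
lowest = min(raw_years)    # 0.5

# Approximate class width = (highest value - lowest value) / number of classes
approximate_width = (highest - lowest) / desired_classes

print("Approximate class width:", approximate_width)
```

The result, (10 − 0.5) / 4 = 2.375, is only a guide. In practice it is adjusted to a convenient figure; Table 3.1, for instance, uses classes spanning two whole-year values each, with an open-ended final class.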
3.2.2. Cross-tabulations
Whereas frequency tables are used to summarise a single categorical variable, cross-tabulations are used to
summarise the relationship between two categorical variables. A cross-tabulation (or crosstab for short, also
called a contingency table) is a table that depicts the number of times each of the possible category
combinations occurred in the sample data.
An example of a crosstab in SPSS is as follows:
Table 3.2: Crosstabulation of task grade and level of achievement

Q1_3. Task grade * Level of achievement Cross-tabulation
Count
                            Q2_2. How easy is it to communicate with your Manager?
Q1_3. Position category     Not satisfied   Not         Somewhat    Satisfied   Very        Total
                            at all          satisfied   satisfied               satisfied
Task grade 5-7                          2           4           7          10           7      30
Task grade 8-11                         0           0           3           4           3      10
Task grade 12-14                        1           1           5           3           1      11
Task grade 16 - 18                      0           1           4           1           3       9
Total                                   3           6          19          18          14      60
From the table above, you can see that the left-hand column of the table shows task grades, and the top row
shows the response categories (levels of satisfaction). Each cell contains the number of respondents who fall into
that particular combination of task grade and level of satisfaction.
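The row, column and grand totals of a crosstab are simply sums of the cell counts, which can be verified with a short Python sketch. The cell counts are those of Table 3.2; the variable names are choices made for this illustration.

```python
# Row and column labels of Table 3.2.
task_grades = ["Task grade 5-7", "Task grade 8-11", "Task grade 12-14", "Task grade 16 - 18"]
responses = ["Not satisfied at all", "Not satisfied", "Somewhat satisfied",
             "Satisfied", "Very satisfied"]

# Cell counts: one inner list per task grade, one entry per response category.
counts = [
    [2, 4, 7, 10, 7],
    [0, 0, 3, 4, 3],
    [1, 1, 5, 3, 1],
    [0, 1, 4, 1, 3],
]

# Row totals: number of respondents in each task grade.
row_totals = [sum(row) for row in counts]

# Column totals: number of respondents giving each response.
column_totals = [sum(column) for column in zip(*counts)]

# Grand total: all respondents in the table.
grand_total = sum(row_totals)

for grade, total in zip(task_grades, row_totals):
    print(f"{grade:<20}{total:>4}")
for response, total in zip(responses, column_totals):
    print(f"{response:<22}{total:>4}")
print("Grand total:", grand_total)
```

Checking that the row totals and the column totals both add up to the same grand total (60 here) is a simple way of confirming that no respondent has been double-counted or omitted.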
3.2.3. Pie Chart

A pie chart is used when proportions are to be depicted relative to a whole. It is a circle divided into segments,
with the size of each segment proportional to the value of the variable, relative to the whole, and is usually
expressed in percentage terms.
• Advantage:
– The visual impact that they have in conveying information
• Disadvantage:
– Pie charts are limited to a relatively small number of categories; with more categories you will need to
resort to a bar chart.
[Figure: pie chart titled "Pie chart of gender composition" with two segments, Male and Female, labelled 11 and 89]
Figure 3.2 Pie chart of the percentage of males and females (Gender)
From the pie chart above, it is evident that females constitute the majority of the sample.
Activity 3.1
3.1.1. What is a frequency distribution and how does it assist in summarising
and reporting data?
3.1.2. Generally speaking, how do we determine how many classes to include
in a frequency distribution?
3.1.3. Stats SA collected information regarding the percentage breakdown of reasons
for driving in South Africa per gender:
Reason                        Male (%)   Female (%)
Leisure/holiday               1.5        1.1
Shopping - business           3.1        2.2
Shopping - personal           35         39.6
Shopping - spectator          1.1        0.5
Visit friends/family          20.8       15.3
Medical                       3.2        5.2
Wellness (Spa, health farm)   2.2        0.2
Religious                     5          6.5
Wedding                       1.9        2.3
Other                         26.2       27.1
3.1.3.a. What do we call this table?
3.1.4. Provide examples of the following in this scenario:
3.1.4.a Nominal data
3.1.4.b Qualitative data
3.1.4.c Discrete data
3.1.5 Using the example above to illustrate your answer – provide a definition for
mutually exclusive and exhaustive.
3.1.6 If you were to make sense of this data, what type of statistical activities
would you perform?
3.1.7 What conclusions can be drawn based on the stats above?
3.1.8 Given the following temperatures (in Fahrenheit):
51 55 92 71 87 73 52 62 40 54
40 41 31 16 38 11 23 25 23 9
3.1.8 a. Construct a frequency distribution
3.1.8 b. Is your distribution a histogram or a bar chart? Why?
3.2.4 Scatter diagrams

The relationship between two quantitative variables can be depicted in a scatter diagram. A scatter diagram, also
called a scatterplot, is a plot of all pairs of values (x, y) for the variables x and y. Each dot thus represents one
pair of observed values. The x-values lie along the horizontal x-axis and the y-values along the vertical y-axis.
Here x denotes the independent variable and y the dependent variable.
In the example below, we wished to compare the rate of productivity before and after 2015. Company X noticed
that productivity rates were dropping after an acquisition in 2015. Management hypothesised that, owing to the
acquisition, motivation, morale and performance were dropping. In response, they held multiple teambuilding
sessions, increased communication and transparency towards all employees, and worked to build trust and
reduce uncertainty among staff. They decided to gauge the success of these initiatives by measuring
productivity pre- and post-intervention.
Year Productivity
2014 89%
2015 76%
2016 80%
2017 90%
2018 93%
Table 3.2. Table illustrating the levels of productivity between 2014-2018
To illustrate the relationship between the year and productivity, a scatterplot was produced.
[Figure: scatterplot with year (2014 to 2018) on the x-axis and productivity (0% to 100%) on the y-axis]
Figure 3.3: Scatterplot of productivity by year
The scatterplot clearly illustrates a decline in productivity in 2015, followed by a steady recovery over the
next three years after changes were implemented in response to the perceived negative effects of
the acquisition. Think of the independent variable (x-axis) as the ‘cause’ and the dependent variable (y-axis) as
the ‘effect’. The scatter diagram allows us to observe characteristics of the relationship between year
(x) and productivity (y). From 2015 onwards the two variables move together, i.e. productivity increases as the
years increase, so there is a positive relationship between the two variables over that period.
If we feel the value of one variable (such as productivity) depends to some degree on the value of the
other variable (such as the intervention), the first variable (productivity) is the dependent variable and is
plotted on the vertical or y-axis. The second variable is the independent variable and is plotted on the
x-axis. The pattern of a scatter diagram provides us with information about the relationship between two variables.
[Figure: scatterplot titled "Negative linear relationship", with X and Y axes]
Figure 3.4. Scatterplot depicting a linear, negative, inverse relationship
[Figure: scatterplot titled "Non-linear relationship", with X and Y axes]
Figure 3.5. Scatterplot depicting a non-linear, inverse relationship
[Figure: scatterplot titled "Non-linear relationship", with X and Y axes]
Figure 3.6. Scatterplot depicting a non-linear, cubic relationship
[Figure: scatterplot titled "No relationship", with X and Y axes]
Figure 3.7. Scatterplot indicating a non-existing relationship
If we wish to determine the overall trend, we can fit a “best-fit” line to the data, that is, the line that minimises
the average distance between each point and the line. By fitting a best-fit line, we can determine how the
variables are related. We can also develop a straight-line equation to describe the relationship and to make
predictions. This will be covered in further detail later in the Study Guide. For example:
[Figure: scatterplot titled "Negative linear relationship", with X and Y axes and a fitted line y = -1.4448x + 28.649]
Figure 3.8. Scatterplot indicating a negative, linear relationship
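A best-fit line of this kind can be sketched in a few lines of Python. The example below assumes the ordinary least-squares method (one common way to define "best fit"; the guide covers the method formally in a later unit) and applies it to the year and productivity figures from the table above. The function name is an assumption for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Productivity (%) by year, from the table in this section.
years = [2014, 2015, 2016, 2017, 2018]
productivity = [89, 76, 80, 90, 93]

slope, intercept = least_squares_line(years, productivity)
print(round(slope, 2), round(intercept, 1))
```

A positive slope means productivity rose, on average, over the period, even though it dipped in 2015. With only five points, such a line should be read as a rough summary rather than a reliable forecast.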
In addition to frequencies and counts, there are three general ways in which we can summarise data:
1. Central tendency, or the single value that best describes the sample (Mean, Median, Mode)
2. The spread of the distribution (Variance and Standard Deviation)
3. The shape of the distribution (Skewness and Kurtosis)
On their own, raw data make very little intuitive sense to us. They therefore need to be processed and
converted to an intelligible form. By looking at the most frequently occurring scores, the shape and the
spread, we make a good start at turning data into more useful and understandable forms.
3.3 Measures of Central Tendency
Central tendency describes the most typically occurring values. There are three commonly used numerical
measures of central tendency or central location of a dataset: the mean, the median and the mode. You are
expected to know how to compute each of these measures for a given dataset. Moreover, you are expected to
know the advantages and disadvantages of each of these measures, as well as the type of data for which each
is an appropriate measure.
An easy way to remember the difference between the mean, median and mode is through the following useful
rhyme:
Hey diddle diddle,
The median is the middle
You add and divide for the mean
The mode is the one that appears the most
And the range is the difference in between
Which measure to use?

If the data are qualitative (categorical, nominal), the only appropriate measure of central location is the mode,
e.g. gender.
If the data are ranked (ordinal), the most appropriate measure of central location is the median, e.g. a scale
from strongly agree to strongly disagree.
For quantitative (interval and ratio) data, however, it is possible to compute all three measures, i.e. the mean,
median and mode.
Calculating the Mean, Median and Mode
3.3.1. Central Tendency: Ungrouped or raw data
3.3.1.1. Mean
The formula for calculating the mean of a set of individual scores is:
$$\bar{x} = \frac{\text{the sum of all of the observations}}{\text{the number of observations}} \quad \text{OR} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
To illustrate, take for example the marks received for a spelling test for seven learners:
85% 48% 52% 64% 64% 80% 75%
The mean is the sum of all the scores divided by the total number of scores (n = 7):
$$\bar{x} = \frac{48 + 52 + 64 + 64 + 75 + 80 + 85}{7} = \frac{468}{7} = 66.86\%$$
The problem with the mean is visually depicted as follows:
Figure 3.9. Drawing depicting the effect of outliers on the mean
The problem depicted in the illustration above is that the mean is susceptible to extreme values, which pull
the mean up or down. So whereas the majority of the scores lie to the left of the see-saw, the extreme value on
the right pulls the mean up. One way to overcome this problem is to use the median.
3.3.1.2. Median
The median is the score that falls in the middle of an ordered dataset: it has as many values above it as
below it. You therefore need to order the dataset from smallest to largest (or largest to smallest), and
the median is the score occurring in the middle of the ordered dataset. If there is an even number of scores
in the dataset, average the two middle scores. In the example above:
48% 52% 64% 64% 75% 80% 85%
The median is thus 64%.
3.3.1.3. Mode
The mode is the most frequently occurring score. In the case of the marks above, 64% occurs twice, so the
mode is 64%.
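The three measures for the spelling-test marks can be checked with Python's built-in `statistics` module. This is only a quick verification sketch; the variable names are assumptions for this example.

```python
import statistics

# Spelling-test marks (%) for the seven learners in the example above.
marks = [85, 48, 52, 64, 64, 80, 75]

mean_mark = statistics.mean(marks)      # add all the marks and divide by n
median_mark = statistics.median(marks)  # middle value of the ordered marks
mode_mark = statistics.mode(marks)      # most frequently occurring mark

print(round(mean_mark, 2), median_mark, mode_mark)
```

Note that `statistics.median` orders the data itself, so the marks do not need to be sorted first.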
Activity 3.2.
This is the average closing price (in rands) of 15 stocks held.
31.69 56.69 65.50 83.50 56.88 72.06

121.44 97.00 42.25 71.88 70.63
35.81 83.19 43.63 40.061
3.2.1. Determine the average/mean of the stock price
3.2.2. Determine the median
3.2.3. Group the data into a grouped frequency table with a lower limit of 30
and a class width of 10.
3.3.2 Central Tendency: Grouped data

So far, we have dealt with deriving the middle measure from a group of raw data, or actual unmanipulated
scores. However, there will be instances where you will need to analyse grouped frequency data.
Take for example the following set of scores:
38 29 33 41 56 35 19 27 32 52
24 45 34 51 28 17 46 31 44 22
These are the numbers of cars sold at 20 different dealerships throughout Durban for the month of
February. These numbers represent raw scores. If we group them into intervals and count the frequency in
each interval, the result looks as follows:
Interval Frequency
10 - 19 2
20 - 29 5
30 - 39 6
40 - 49 4
50 - 59 3
Σ 20
Table 3.3: Grouped frequencies of cars sold
Raw data are converted into grouped data by counting the number of scores falling within each category or
interval. For example, two scores fell in the interval 10-19, namely 17 and 19. The interval refers to the range
of scores into which you are summarising the data. When determining the interval width, always include the first
number in the interval, i.e. count from 10. The interval width is the number of values that each interval covers;
in this case the width is 10 units (10, 11, ..., 19). The width remains the same for each row.
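The counting procedure described above can be sketched as a short Python function. The function name and its parameters are assumptions for this illustration, and the sketch assumes whole-number scores, as in the car sales data.

```python
def group_into_intervals(values, lower_limit, width, n_classes):
    """Count how many whole-number values fall into each interval of equal width."""
    counts = [0] * n_classes
    for value in values:
        # Integer division tells us which interval the value belongs to.
        index = (value - lower_limit) // width
        counts[index] += 1
    labels = [
        f"{lower_limit + i * width} - {lower_limit + (i + 1) * width - 1}"
        for i in range(n_classes)
    ]
    return dict(zip(labels, counts))


# Monthly car sales for the 20 dealerships in the example above.
cars_sold = [38, 29, 33, 41, 56, 35, 19, 27, 32, 52,
             24, 45, 34, 51, 28, 17, 46, 31, 44, 22]

frequency_table = group_into_intervals(cars_sold, lower_limit=10, width=10, n_classes=5)
for interval, frequency in frequency_table.items():
    print(interval, frequency)
```

The frequencies should add up to the number of raw scores (20 here), which is a useful check that no score was missed or counted twice.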
Because we are now dealing with grouped data, rather than individual scores, the formulae for calculating the
mean, median and mode change. Because we are dealing with tables, the starting point is to create an even
bigger table (below). Then:
1. Calculate the mid-point: The mid-point is calculated by adding the lower limit to the upper limit of each
interval and dividing the result by two. For example, using the table below, the mid-point of the first row is
(10+19)/2 = 14.5. We know the width is 10, so you can add ten to arrive at each subsequent mid-point.
2. Multiply each column by the next: The mid-point is denoted by x and the frequency by f. Multiply the
frequency by the mid-point to give fx, then multiply the mid-point by fx to give fx².
3. Calculate the cumulative frequency: The cumulative frequency is the running total of the frequencies,
obtained by adding each frequency to the sum of those before it.
4. Summate the scores: Work out the summation (Σ) of the fx and fx² columns. Check that the final cumulative
frequency equals the sum of the frequency column to ensure all data have been accounted for.
5. Determine the modal class and median class: The modal class is the row with the highest frequency. The
median class is the row into which the (n/2)th observation falls, i.e. the middle of the dataset. If you can,
highlight these rows, as they inform all subsequent substitutions.
6. Use the table to substitute numbers into the formulae.
Table 3.4: Extended table to calculate statistics using grouped frequency data
Interval   Frequency (f)   Midpoint (x)   fx      fx²       Cum. Freq.
10 - 19    2               14.5           29      420.5     2
20 - 29    5               24.5           122.5   3001.25   7
30 - 39    6               34.5           207     7141.5    13
40 - 49    4               44.5           178     7921      17
50 - 59    3               54.5           163.5   8910.75   20
Σ          20                             700     27395
3.3.2.1. Mean of grouped data:

$$\bar{x} = \frac{\sum fx}{n}$$
Where:
$x$ = the mid-point of the interval
$f$ = frequency of the interval
$$\bar{x} = \frac{700}{20} = 35 \text{ cars sold}$$
3.3.2.2. Median of grouped data:

$$M_e = O_{me} + \frac{c\left[\frac{n}{2} - f(<)\right]}{f_{me}}$$
Where:
$O_{me}$ = lower limit of the median class
$f_{me}$ = absolute frequency of the median interval
$f(<)$ = cumulative absolute frequency of the interval before the median interval
$n$ = sample size
c or i = interval or class width
Using the example above:

$$M_e = O_{me} + \frac{c\left[\frac{n}{2} - f(<)\right]}{f_{me}} = 30 + \frac{10\left[\frac{20}{2} - 7\right]}{6} = 35 \text{ cars sold}$$
3.3.2.3. Mode of grouped data:

$$M_o = O_{mo} + \frac{c\left[f_m - f_{m-1}\right]}{2f_m - f_{m-1} - f_{m+1}}$$
Where:
$O_{mo}$ = lower limit of the modal class
$f_m$ = frequency of the modal class
$f_{m-1}$ = the frequency of the class before or above the modal class
$f_{m+1}$ = the frequency of the class after or below the modal class
c or i = interval or class width
Using the example above:

$$M_o = 30 + \frac{10\left[6 - 5\right]}{2(6) - 5 - 4} = 30 + \frac{10}{3} = 33.33 \text{ cars sold}$$
Check that the answers you’ve calculated for all three (the mean, median and
mode) are roughly the same. If they are very different – you may have made a
mistake with your calculations.
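The three grouped-data formulae above can be checked with a short Python sketch. It uses the intervals and frequencies of the car sales table; the variable names are assumptions for this illustration, and the loop that locates the median class is simply one way of finding the row into which the (n/2)th observation falls.

```python
# Grouped car sales data (intervals and frequencies from the table above).
lower_limits = [10, 20, 30, 40, 50]
upper_limits = [19, 29, 39, 49, 59]
frequencies = [2, 5, 6, 4, 3]
width = 10  # class width c

n = sum(frequencies)
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Mean: sum of f * x divided by n.
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Median: find the class containing the (n/2)th observation.
cumulative_before = 0
for i, f in enumerate(frequencies):
    if cumulative_before + f >= n / 2:
        median_index = i
        break
    cumulative_before += f
grouped_median = (lower_limits[median_index]
                  + width * (n / 2 - cumulative_before) / frequencies[median_index])

# Mode: use the class with the highest frequency and its two neighbours.
m = frequencies.index(max(frequencies))
f_m = frequencies[m]
f_before = frequencies[m - 1] if m > 0 else 0
f_after = frequencies[m + 1] if m < len(frequencies) - 1 else 0
grouped_mode = lower_limits[m] + width * (f_m - f_before) / (2 * f_m - f_before - f_after)

print(grouped_mean, grouped_median, round(grouped_mode, 2))
```

Because grouped calculations treat every score as if it sat at its interval mid-point, the results are approximations of the values you would get from the raw data.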
As a rule of thumb, when central tendency is reported, so should spread. Once you have your measure of central
tendency, it is easy to ascertain the degree to which cases are dispersed around it.
Activity 3.3.
Given the following hourly wage rates (in Rands):

12.5 9.45 13.85 7.25 8.7 14.6 11.75 14.6 11.75 14.5 10.8
12.45 7.5 15.9 9.75 11.5 13.3 6.25 15.5 12.8 5.35 9.5
3.3.1. Group the data into a grouped frequency table, with a lower limit of R5
and a class width of R2.
3.3.2. Determine the mean, and median using the RAW data
Activity 3.4.
Given the following age (in years) of passengers going to Thailand:

Ages (years) Frequency
1-13 7
14-26 6
27-39 5
40-52 9
53-65 15
66-78 12
3.4.1. Expand the table to include the mid-point (x), fx, fx² and the cumulative
frequency
3.4.2. Determine the mean and modal age
3.4 Measures of Dispersion or spread

3.4.1 Spread
“Dispersion (or spread) refers to the extent to which the data values of a numeric random variable are scattered
about their central location value” (Wegner, 2012). It numerically describes how data are scattered or spread.
Previously, the concept of central location was introduced. The degree to which scores are spread around the
mean, or average, is the second way in which we can summarise data. Variability is one characteristic of data
to which averages are not sensitive. It is possible to have two datasets with identical measures of
central location but with very different spreads.
The measures that are used to measure dispersion are:
• Range
• Variance and Standard deviation
• Interquartile range
• Quartile deviation
Once again, the level of measurement will determine how we gauge the spread of a distribution (GAO, 1992:41).
Table 3.5: Suggested statistics for varying levels of measurement
                                       Use of Measure
Level of measurement   Index of Dispersion   Range       Interquartile Range   Std Deviation
Nominal                Yes                   No          No                    No
Ordinal                Sometimes             Sometimes   Yes                   No
Interval/Ratio         No                    Yes         Yes                   Yes
The best ways to illustrate the spread of the distribution for each level of measurement are as follows:
Table 3.6: Suggested statistics and graphical representation of varying levels of measurement
Level of measurement   Representation
Nominal                Table or frequency distribution showing frequencies
Ordinal                Tables/frequency distribution, but choosing a single measure is problematic. Use
                       interquartile range if single measure is chosen.
Interval/Ratio         Graphic dispersion, standard deviation provided cases have an approximately normal
                       distribution.
When there is a possibility that the underlying distribution may not be normal, interquartile range is a good
alternative.
Consider two groups of data:
Dataset A Dataset B
65 42
66 54
67 58
68 62
71 67
73 77
74 77
77 85
77 93
77 100
Computed measures of central location
Mean = 71.5 Mean = 71.5
Median = 72 Median = 72
Mode = 77 Mode = 77
Table 3.7: Table illustrating the effect of spread

Although there is no difference in the computed central measures between the two groups, the scores of
dataset B are much more widely scattered than those of dataset A. You may be wondering why dispersion
matters. To demonstrate, consider companies that try to maximise shipments by combining products to fill
containers before shipping them internationally, and thus save costs.
If we were to plot them graphically using histograms:
Figure 3.10. Histogram illustrating a roughly normal distribution around mean 4.72, with a small standard
deviation of 2.207
Figure 3.11. Histogram illustrating a roughly normal distribution around mean 9.22, with a larger standard
deviation of 4.463
Figure 3.12. Histogram illustrating a roughly normal distribution around mean 10.643, with a marginal standard
deviation of 1.805
Mean Std. Deviation
CONTAINERCOMPA 4.7209 2.20741
CONTAINERCOMPB 9.2188 4.46314
CONTAINERCOMPC 10.6429 1.80543
If you look at the distributions for each company packing their containers:
Company A: The mean is 4.721 tonnes with a standard deviation of 2.21
Company B: The mean is 9.219 tonnes with a standard deviation of 4.46
Company C: The mean is 10.643 tonnes with a standard deviation of 1.81
Company A has the smallest mean tonnage of goods packed into containers. Although Company B has one of the
higher means, it also has the highest standard deviation, which means it is the most inconsistent when it comes
to packing containers efficiently. Company C is the most efficient, as it has not only the highest mean but also
the smallest standard deviation. So, of all the companies, Company C is the most economical and cost saving.
3.4.2. Range
The range measures the difference between the highest and lowest values in a dataset. It is considered a rough
measure of spread as it depends on only two values. It is affected by outliers or extreme values and gives no
indication of the clustering of the data.
Formula: Range for ungrouped data:
𝑟𝑎𝑛𝑔𝑒 = ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑣𝑎𝑙𝑢𝑒 − 𝑙𝑜𝑤𝑒𝑠𝑡 𝑣𝑎𝑙𝑢𝑒
EXAMPLE
For the data in the previous example (datasets A and B):
Dataset A                     Dataset B
range = 77 − 65 = 12          range = 100 − 42 = 58
Table 3.8 Table indicating the range calculations for datasets A and B
The ranges indicate that the data in dataset B are more widely spread than those in dataset A.
3.4.3. Variance and Standard Deviation

3.4.3.1. Variance
Variance measures the average squared distance of the data values from the mean, or average. It is the most
commonly used measure of spread. It includes all the data, and is calculated through the use of a statistical
formula. The standard deviation is simply the square root of the variance. Because the variance focuses on how
numbers fall around the mean, it is based on deviations (sometimes loosely called residuals). A deviation is the
difference between an individual data value and the mean.
Notation:
s = standard deviation of a set of sample scores.
σ = standard deviation of a set of population scores.
s² = variance of a set of sample scores.
σ² = variance of a set of population scores.
The variance (s²) measures the average squared deviation from the mean for a dataset.
Formula: Variance for ungrouped data:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
or
$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$
Where:
x is each value of the dataset
$\bar{x}$ is the mean of the dataset
n is the sample size
When dealing with ungrouped data, the following steps should be followed:
1. Calculate the mean: Find the mean of the dataset
2. Subtract the mean from each score: Use a table to help you subtract the mean from each individual score,
create a new column for the result
3. Square it: Square the newly acquired figure from step two
4. Summate it: Add all of the squared deviations together
5. Divide it: Divide this figure by n-1
Let’s look at a sample of the ages of 7 cars sold in the previous example of car sales:
7 9 10 12 13 15 18
Step 1: Calculate the mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{7 + 9 + 10 + 12 + 13 + 15 + 18}{7} = \frac{84}{7} = 12 \text{ years old}$$
Step 2: Subtract the mean from the raw score (x)
Step 3: Square the value in the deviation column
Step 4: Add the column together

Car age (x)   Mean   Deviation   Squared deviation
7             12     -5          25
9             12     -3          9
10            12     -2          4
12            12     0           0
13            12     1           1
15            12     3           9
18            12     6           36
Σ                    0           84

Step 5: Divide by n − 1:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{84}{7 - 1} = \frac{84}{6} = 14$$
The alternative method of calculation is as follows:
$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$
Step 1: Compute the mean: Calculate the mean of all the raw scores
Step 2: Square all of the raw scores
Step 3: Summate the result of the squared values
Step 4: Substitute it into the formula
For illustrative purposes:
Step 1: Calculate the mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{7 + 9 + 10 + 12 + 13 + 15 + 18}{7} = \frac{84}{7} = 12 \text{ years old}$$
Step 2: Square each raw score (x)
Step 3: Add all the squared values together

Car age (x)   x²
7             49
9             81
10            100
12            144
13            169
15            225
18            324
Σ             1092

Step 4: Substitute the values into the formula:
$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1} = \frac{1092 - (7)(12)^2}{7 - 1} = \frac{84}{6} = 14$$
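Both variance formulae can be verified with a few lines of Python. The sketch below uses the car ages from the worked example and, as a cross-check, the `variance` function from Python's built-in `statistics` module, which also divides by n − 1. The variable names are assumptions for this illustration.

```python
import statistics

# Ages (in years) of the 7 cars in the worked example.
car_ages = [7, 9, 10, 12, 13, 15, 18]
n = len(car_ages)
mean_age = sum(car_ages) / n

# Method 1: sum of squared deviations from the mean, divided by n - 1.
variance_deviation = sum((x - mean_age) ** 2 for x in car_ages) / (n - 1)

# Method 2: (sum of x squared - n * mean squared), divided by n - 1.
variance_shortcut = (sum(x ** 2 for x in car_ages) - n * mean_age ** 2) / (n - 1)

# Cross-check with the standard library (sample variance).
variance_library = statistics.variance(car_ages)

print(variance_deviation, variance_shortcut, variance_library)
```

All three results agree, which shows that the two formulae are algebraically equivalent; the second is often quicker by hand because it avoids a deviation column.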
Practical Application
Calculate the variance of the sample scores: 2, 3, 5, 6, 9, 17
Both variance formulae are used in this example, with all the necessary table columns included
for both formulae.
First it is necessary to calculate the mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{42}{6} = 7$$
$$\bar{x}^2 = 49$$

x      x̄    x − x̄   (x − x̄)²   x²
2      7    -5      25         4
3      7    -4      16         9
5      7    -2      4          25
6      7    -1      1          36
9      7    2       4          81
17     7    10      100        289
Σ 42        0       150        444

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{150}{6 - 1} = 30$$
or
$$s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1} = \frac{444 - 6 \times 49}{6 - 1} = \frac{150}{5} = 30$$
For grouped data, the original dataset values are changed to the interval midpoints.
Formula: Variance for grouped data:
$$s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}$$
Where:
f is the interval frequency
x is the interval midpoint
n is the sample size
The steps for calculating the variance for grouped data are as follows:
Step 1: Calculate the mean
Step 2: Determine the mid-point
Step 3: Multiply the mid-point by the frequency
Step 4: Square the mid-point
Step 5: Multiply the frequency by the squared mid-point
Step 6: Summate the columns f, fx and fx²
Step 7: Substitute the summed values into the formula
Whenever you see the ∑ sign in an equation, you will need a column in the
table for the expression immediately following the ∑ sign. Consider also having
columns for each of the components of the expression.
Remember – when dealing with grouped data, it is important to draw a table, as illustrated below:
Step 1: Calculate the mean:
$$\bar{x} = \frac{\sum fx}{n} = \frac{5080}{100} = 50.8$$
Step 2: Calculate the mid-point
Steps 3-5: Multiply out the columns (Step 4: Square the mid-point)
Step 6: Summate the columns

i        Frequency (f)   Mid-point (x)   fx       x²        fx²
1-10     2               5.5             11       30.25     60.5
11-20    2               15.5            31       240.25    480.5
21-30    6               25.5            153      650.25    3901.5
31-40    6               35.5            213      1260.25   7561.5
41-50    30              45.5            1365     2070.25   62107.5
51-60    34              55.5            1887     3080.25   104728.5
61-70    14              65.5            917      4290.25   60063.5
71-80    2               75.5            151      5700.25   11400.5
81-90    3               85.5            256.5    7310.25   21930.75
91-100   1               95.5            95.5     9120.25   9120.25
Σ        100                             5080               281355.00

Step 7: Substitute the summed values into the formula

$$s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} = \frac{281355 - \frac{(5080)^2}{100}}{99} = 235.26$$
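The grouped-variance calculation can be reproduced with a short Python sketch. It rebuilds the fx and fx² column totals from the frequencies and mid-points in the table above; the variable names are assumptions for this illustration.

```python
# Frequencies and mid-points from the grouped table above (intervals 1-10 to 91-100).
frequencies = [2, 2, 6, 6, 30, 34, 14, 2, 3, 1]
midpoints = [5.5 + 10 * i for i in range(10)]  # 5.5, 15.5, ..., 95.5

n = sum(frequencies)                                            # sample size
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))     # total of the fx column
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))  # total of the fx² column

# Variance for grouped data: (sum of fx² - (sum of fx)² / n) / (n - 1)
grouped_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

print(n, sum_fx, sum_fx2, round(grouped_variance, 2))
```

Checking the column totals first (100, 5080 and 281355) makes it easier to trace an arithmetic slip before the values are substituted into the formula.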
3.4.3.2. Standard deviation

The standard deviation is the square root of the variance. It offers a measure of the average deviation from the
mean.
$$s = \sqrt{s^2}$$
Formula: Standard deviation for ungrouped data:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
or
$$s = \sqrt{\frac{\sum x^2 - n\bar{x}^2}{n - 1}}$$
where,
$x$ is each value of the dataset
$\bar{x}$ is the mean of the dataset
$n$ is the sample size
Activity 3.5
Find the standard deviation of the sample scores in the practical application above.
For grouped data, the original dataset values have been changed to the interval midpoints.
Formula: Standard deviation for grouped data:
$$s = \sqrt{\frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1}}$$
Where:
f is the interval frequency
x is the interval midpoint
$n$ is the sample size
Note: Articles in professional journals and reports often use SD for standard deviation and Var for variance.
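Since the standard deviation is just the square root of the variance, it can be obtained from the variance results worked out earlier in this section. The Python sketch below does this for the car ages (ungrouped) and for the grouped table of 100 observations; the variable names are assumptions for this illustration.

```python
import math
import statistics

# Ungrouped data: ages of the 7 cars from the variance example (variance = 14).
car_ages = [7, 9, 10, 12, 13, 15, 18]
sd_ungrouped = statistics.stdev(car_ages)  # sample standard deviation (divides by n - 1)

# Grouped data: column totals from the grouped variance example.
n = 100
sum_fx = 5080
sum_fx2 = 281355
sd_grouped = math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

print(round(sd_ungrouped, 2), round(sd_grouped, 2))
```

Unlike the variance, the standard deviation is expressed in the same units as the original data (years and interval units here), which makes it easier to interpret.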
3.4.4 Coefficient of variation

The coefficient of variation offers a measure of the dispersion relative to the mean. This enables comparison
between datasets with different means, different units of measurement (dollars vs tonnes) or different sizes.
Coefficient of variation:
$$CV = \frac{s}{\bar{x}}\%$$
Where:
s is the standard deviation
$\bar{x}$ is the mean
TIP: In order to express a result as a percentage (%), multiply the expression by 100.

The coefficient of variation is therefore:
$$CV = \frac{s}{\bar{x}} \times 100$$
Practical Application or Examples
Let the mean of the data be 33, and the standard deviation 11.21:
$$\bar{x} = \frac{\sum fx}{n} = 33$$
$$s = 11.21$$
$$CV = \frac{s}{\bar{x}}\% = \frac{11.21}{33}\% = 33.97\%$$
Interpretation: the data are moderately dispersed around the mean.
All the measures of dispersion described so far have dealt with a single set of data. In practice, it is often
important to compare two or more sets of data with different means, sample sizes or measurement units
and the coefficient of variation can be used to do this.
The higher the coefficient of variation result, the more variability there is in a set of data.
1. A manufacturing company produces a product in two sizes, a 1 000 ml bottle
and a 500 ml bottle. Different filling equipment is used for each size. Because
of mechanical variability in the filling equipment, there is a standard deviation
of 5 ml and 4 ml respectively.
Calculate the coefficient of variation for each filling machine and determine which
machine is more consistent.
Coefficient of variation for the 1 000 ml product:
$$CV = \frac{s}{\bar{x}}\% = \frac{5}{1\,000}\% = 0.5\%$$
Coefficient of variation for the 500 ml product:
$$CV = \frac{s}{\bar{x}}\% = \frac{4}{500}\% = 0.8\%$$
Interpretation
Although the machine filling the smaller bottle has a lower standard deviation,
the CVs indicate that the machine filling the larger bottle is relatively more
consistent.
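The comparison of the two filling machines can be sketched in Python as follows. The function name is an assumption for this illustration; it simply applies CV = (s / mean) × 100.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


# Filling machines from the example above (standard deviation and bottle size in ml).
cv_large_bottle = coefficient_of_variation(std_dev=5, mean=1000)
cv_small_bottle = coefficient_of_variation(std_dev=4, mean=500)

# The machine with the lower CV is relatively more consistent.
more_consistent = "1 000 ml machine" if cv_large_bottle < cv_small_bottle else "500 ml machine"

print(round(cv_large_bottle, 2), round(cv_small_bottle, 2), more_consistent)
```

Because the CV is unit-free, the same function can compare datasets measured in different units, which the standard deviation alone cannot do.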
Activity 3.6
Two growers of apples have obtained statistics regarding the mass of their current
crops:
Grower A: x̅ = 300 g with s = 20 g
Grower B: x̅ = 280 g with s = 40 g
Which grower’s apples are more uniform in mass?
3.4.5 Interquartile and inter-percentile ranges

In order to limit the effect of outliers (very low and very high values) on measures of central location and
dispersion, ranges that include only the middle values of a dataset are often used:
• The interquartile range excludes the highest and lowest quarters of values.
• An inter-percentile or mid-percentile range excludes a certain percentage of values at the lowest and
highest ends of the dataset.
[Figure: number line from the lowest data value (minimum) to the highest data value (maximum), divided into four
segments of 25% each by the 1st quartile (lower quartile, 25th percentile), the 2nd quartile (middle quartile,
50th percentile, median) and the 3rd quartile (upper quartile, 75th percentile), with the interquartile range marked]
Figure 3.13. Graphical representation of quartiles, and interquartile ranges
Formula: Interquartile range:
𝑖𝑛𝑡𝑒𝑟𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑟𝑎𝑛𝑔𝑒 = 𝑄3 − 𝑄1
where,
𝑄3 is the third or upper quartile
𝑄1 is the first or lower quartile

The time taken to complete an assembling task has been measured for 250
employees:
Time taken (minutes) Number of people (𝒇) Cumulative frequency 𝒇(<)
0<5 2 2
5 < 10 2 4
10 < 15 3 7
15 < 20 5 12
20 < 25 5 17
25 < 30 18 35
30 < 35 85 120
35 < 40 92 212
40 < 45 37 249
45 < 50 1 250
Total 250
Calculate the interquartile range (the first and third quartiles were
calculated in a previous self-assessment exercise).
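SOLUTION
Quartiles already calculated in a previous self-assessment activity:
$$Q_1 = L_{Q_1} + \frac{c\left[\frac{1 \times n}{4} - f(<)\right]}{f_{Q_1}} = 30 + \frac{5(62.5 - 35)}{85} = 31.62 \text{ minutes}$$
$$Q_3 = L_{Q_3} + \frac{c\left[\frac{3 \times n}{4} - f(<)\right]}{f_{Q_3}} = 35 + \frac{5(187.5 - 120)}{92} = 38.67 \text{ minutes}$$
$$\text{interquartile range} = Q_3 - Q_1 = 38.67 - 31.62 = 7.05 \text{ minutes}$$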
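The quartile calculation for grouped data can be sketched in Python. The helper function below (its name and structure are assumptions for this illustration) finds the class containing a given position, such as n/4 or 3n/4, and interpolates within that class using the same formula as in the solution.

```python
# Assembly-time data from the example above (class lower limits in minutes, and frequencies).
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
frequencies = [2, 2, 3, 5, 5, 18, 85, 92, 37, 1]
width = 5  # class width c


def grouped_value_at(position):
    """Interpolate the value at a given position (e.g. n/4) within grouped data."""
    cumulative_before = 0
    for lower, f in zip(lower_limits, frequencies):
        if cumulative_before + f >= position:
            return lower + width * (position - cumulative_before) / f
        cumulative_before += f
    raise ValueError("position lies beyond the last class")


n = sum(frequencies)
q1 = grouped_value_at(1 * n / 4)
q3 = grouped_value_at(3 * n / 4)
interquartile_range = q3 - q1

print(round(q1, 2), round(q3, 2), round(interquartile_range, 2))
```

The same helper gives the grouped median when called with position n/2, since the median is the second quartile.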
Formula: Inter-percentile or mid-percentile range:
The mid-percentile range is the range covered by a chosen percentage of the values lying exactly in the middle
of the dataset.
To calculate the upper and lower percentiles required for the upper and lower limits of the range:
$$\text{lower percentile of range} = \frac{100\% - \text{required range percentile}}{2}$$
$$\text{upper percentile of range} = \text{lower percentile of range} + \text{required range percentile}$$
Calculate the required positions and values for these percentiles. Then:
$$\text{inter-percentile or mid-percentile range} = \text{value of upper percentile of range} - \text{value of lower percentile of range}$$
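The first step, finding which two percentiles bound the required middle range, can be sketched as a small Python function. The function name and the 80% example are assumptions chosen for this illustration.

```python
def mid_percentile_bounds(required_range_percent):
    """Return the (lower, upper) percentiles that enclose the middle part of the data."""
    lower = (100 - required_range_percent) / 2
    upper = lower + required_range_percent
    return lower, upper


# Example: a mid-80% range leaves 10% of the values out at each end.
lower, upper = mid_percentile_bounds(80)
print(lower, upper)
```

Once the two percentiles are known, their values are found in the same way as the quartiles, and the range is the difference between them.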
Activity 3.7
Using the data from the example in previous section, calculate the mid-70% range:
Interval Frequency Cum. Freq.
(weight in lbs)
140 < 150 1 1
150 < 160 4 5
160 < 170 8 13
170 < 180 7 20
180 < 190 5 25
Calculate the upper and lower percentiles required for the upper and lower limits
of the range
3.4.6 Quartile deviation

The quartile deviation is half the interquartile range of a dataset. It is a measure of the spread through the middle
half of the dataset. It can be useful because it is not influenced by extremely high or extremely low values.
Formula: quartile deviation:
$$\text{quartile deviation} = \frac{IQR}{2} = \frac{Q_3 - Q_1}{2}$$
where,
IQR is the interquartile range
$Q_3$ is the third or upper quartile
$Q_1$ is the first or lower quartile

Using the data from the example in previous section, calculate the quartile
deviation:
Interval Frequency Cumulative frequency
(weight in lbs)
140 < 150 1 1
150 < 160 4 5
160 < 170 8 13
170 < 180 7 20
180 < 190 5 25
The interquartile range was calculated in example
$$IQR = 16.65 \text{ lbs}$$
$$\text{quartile deviation} = \frac{IQR}{2} = \frac{16.65}{2} = 8.33 \text{ lbs}$$
3.5 The shape of the distribution

The shape of the distribution is the third way to summarise findings and understand data as information.
The most common way to ascertain this information is by looking at graphs.
3.5.1. Skewness
If there are large extreme values in the data, the mean is pulled to the right or left and we say that the
distribution exhibits skewness.
For a symmetrical distribution, such as the normal distribution, the mean, median and mode will be about the same.
𝑚𝑒𝑎𝑛 ≈ 𝑚𝑒𝑑𝑖𝑎𝑛 ≈ 𝑚𝑜𝑑𝑒
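[Figure: histogram titled "Symmetrical distribution histogram", categories 1 to 11 on the horizontal axis, with bar
heights rising to a single peak of 27 in the middle and falling away evenly on both sides]
Figure 3.14. Histogram illustrating a roughly symmetrical shape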
For a distribution that is skewed to the right, the mode will be less than the median and the median will be less
than the mean.
𝑚𝑜𝑑𝑒 < 𝑚𝑒𝑑𝑖𝑎𝑛 < 𝑚𝑒𝑎𝑛
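[Figure: histogram titled "Skewed to the right (positive skew) histogram", categories 1 to 11 on the horizontal axis,
with the tallest bars (peak of 68) towards the left and a long tail to the right]
Figure 3.15. Histogram illustrating a right, positive skew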
For a distribution that is skewed to the left, the mean is the smallest, followed by the median, while the mode is
the largest.
TIP: A negatively skewed distribution (skewed to the left) has the mean, median and mode in alphabetical order.
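[Figure: histogram titled "Skewed to the left (negative skew) histogram", categories 1 to 11 on the horizontal axis,
with the tallest bars (peak of 68) towards the right and a long tail to the left]
Figure 3.16. Histogram illustrating a left, negative skew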
As a general rule, the difference between the median and the mode is about twice the difference between the
mean and the median.
If the data are skewed to the left, there are some outliers on the left (small values). If the data are skewed to the
right, then there are some large outliers.
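A rough check for the direction of skew is to compare the mean with the median. The Python sketch below does this for two samples used earlier in this unit; the function name and its wording of the result are assumptions for this illustration, and the comparison is only an indication, not a formal measure of skewness.

```python
import statistics


def skew_direction(values):
    """Compare mean and median as a rough indicator of the direction of skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "skewed to the right (positive)"
    if mean < median:
        return "skewed to the left (negative)"
    return "approximately symmetrical"


# Sample scores from the variance practical application: one large value (17) pulls the mean up.
scores = [2, 3, 5, 6, 9, 17]
# Car ages from the variance example: mean and median coincide.
car_ages = [7, 9, 10, 12, 13, 15, 18]

print(skew_direction(scores))
print(skew_direction(car_ages))
```

For small samples, a histogram remains the more reliable way to judge the shape of a distribution.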
Revision:
For a dataset that is approximately symmetrical with one mode, the mean, median and mode tend to have about
the same value. For a dataset that is obviously asymmetrical, it is preferable to report both the mean and
median. The mean is relatively reliable; that is, when samples are drawn from the same population, the sample
means tend to be more consistent than other averages.
A comparison of the mean and median can reveal information about skewness. Data can be identified as skewed
to the left, symmetrical or skewed to the right. Data skewed to the left will have the mean and median to the left
of the mode:
Figure 3.17. Graphical illustration of skewness (Source: Groebner et al., 2011)
If we extend our analysis beyond merely descriptive statistics (level of measurement permitting), we
delve into the world of inferential statistics.
3.6. Inferential Statistics

The crux of inferential statistics is to use a sample to say something about the whole population of
interest. Due to financial and other resource constraints, we are often unable to investigate all the
members of a given population. Thus, in order for us to use a sample to say something about every unit under
observation, it is important that we draw representative samples. Samples are usually selected randomly
to ensure representativeness, so that we can generalise our findings back to the broader
population. This act is known as drawing inferences: we infer the characteristics of interest in the
population from a sample of that population.
Why?
Can you imagine having to collect information from the whole population to draw conclusions? This would be
near impossible! Statisticians have devised ways of drawing samples so that generalisations can be
made to populations. In other words, we are able to use sub-sets of the population in order to say something
about that population.
Once you have set up your research questions, it is necessary to ascertain what level of measurement is needed
to yield data of a certain complexity in order to answer them. This should be done from the outset of your
research project.
Inferential statistics include methods for answering questions about cases we have no observations for by using
a sample of that population of interest to make inferences about that population.
Quick Recollection!
Why do we collect samples and not information about the whole population of interest?
1. Time constraints
2. Financial constraints
3. Practically infeasible
4. Populations are just too large – can’t handle and process ALL that data
We cannot draw the sample in any haphazard way. It needs to be drawn in such a way that the sample is
representative of our population, and therefore generalisable to the larger population, and so that the
manner in which we select our sample tells us about any inherent biases of that sample that
may affect our outcomes and conclusions. Sampling that meets these criteria is called probabilistic
sampling, or statistical sampling: every member of the population has a known chance of being
selected (an equal chance under simple random sampling), and random selection is key. Random assignment,
in turn, helps to ensure that the groups to which units are randomly assigned are approximately equal with
respect to the variables in question. If we were to take, as a crude example, the population distribution of IQ:
Figure 3.18 Graphical illustration of a normal distribution using IQ as an example (Source: MANCOSA)
The mean population IQ is 100, with a standard deviation of 15. This means that most people (about 68%) will
have an IQ between 85 and 115. Similarly, few people (as demonstrated in the figure above) have an IQ above
130, or below 50. This means that, if I were to randomly select a sample, the more common a value is in the
population, the more likely it is to be included in my sample. Similarly, the rarer a value is, the less likely it
is to appear in the sample at a high frequency. So the theory is that, if the sample is chosen randomly, with each
unit in the population having an equal chance of being selected (randomness), the sample is more likely to be
representative of the population.
Furthermore, the larger our sample, the more likely we can reduce sampling errors, and thereby ensure that the
difference we are seeing actually exists, as our sample is representative of the population from which it was
drawn.
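The effect of sample size can be illustrated with a small simulation. This is a sketch only, assuming a hypothetical simulated population of 100 000 IQ scores with a mean of 100 and a standard deviation of 15; the function name `sample_mean` and the chosen sample sizes are illustrative.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: 100 000 simulated IQ scores (mean 100, SD 15).
population = [random.gauss(100, 15) for _ in range(100_000)]

def sample_mean(n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return mean(random.sample(population, n))

# Larger random samples tend to give sample means closer to the population mean.
for n in (10, 100, 10_000):
    print(n, round(sample_mean(n), 2))
```

Any single small sample may land close to 100 by chance; the point is that large random samples do so far more reliably.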
3.6.1. How Inferential Statistics Work

Population Parameters
These are numbers that describe a population. Parameters are numbers about the characteristics of a given
population, whereas statistics are numbers that summarise and describe samples. Statistics are used to estimate
population parameters. The population we wish to draw conclusions about is known as the target population.
There are two estimates that we can make about population parameters based on statistics. When we decide on
what statistics to produce, and what output is required to yield evidence related to answering our research
questions, achieving our objectives, and ultimately the aim of our study – it is important that we choose the
correct statistical tests to run. The most useful technique for doing so, is through the use of the Decision Making
Tree (Adapted from Tredoux and Durrheim, 2002: 427).
THINK POINT
Now that you are familiar with inferential statistics, can you see why it is important
to have large, representative samples in quantitative research?
The Decision-Making Tree

We use what is known as the decision making tree in order to identify the appropriate statistical tests we need to
run in order to derive accurate, reliable, and valid results in order to answer our research questions.
If the wrong decision is made, and the inappropriate tests are run, we will have erroneous output. As a result,
decisions made based on that output will be problematic.
The appropriate test is determined by:

 Level of measurement
 Number of groups/samples in the comparison
 Testing for differences, associations, or trends (determined by the purpose of the research)
 Whether or not the data meet the assumptions underlying parametric tests
 Whether the data are single units or paired
 The number of Dependent and Independent Variables
 Once-off measure, or the same sample is measured numerous times (One-shot vs. Repeated
measures)
3.7 Summary
This chapter covered the fundamentals of business statistics. It covered ways of presenting data and descriptive
statistics (central tendency, dispersion/spread, and the shape of the distribution). It demonstrated, by worked
examples, the calculations for both grouped and ungrouped data, and the calculations for spread. These statistics are at
the heart of basic reporting and presentation of findings in order to support and make decisions in business
contexts.
Revision exercises
1.1. What is meant by the term “central location”?
1.2. You are given the following marks:
12% 40% 48% 52% 56% 56% 56% 58% 64% 72% 90%
Summarise their central location using:
a) The mode
b) The median
c) The mean
1.3. State which is the most appropriate for this range of scores, and state why.
2. If the data were graphically presented as follows:
a) Comment on the skewness and kurtosis of the distribution
b) Comment on the students' overall achievement by looking at the distribution
3. Given the following table:
Interval (weight in lbs)   Frequency (f)   Midpoint (x)   x²        fx      fx²
140 < 150                  1               145            21 025    145     21 025
150 < 160                  4               155            24 025    620     96 100
160 < 170                  8               165            27 225    1 320   217 800
170 < 180                  7               175            30 625    1 225   214 375
180 < 190                  5               185            34 225    925     171 125
∑                          25              825            137 125   4 235   720 425
Calculate the standard deviation

4. The errors in seven invoices are recorded as follows: 120, 30, 40, 8, 5, 20, 29
Calculate the standard deviation.

5. The times (in hours per week) that 50 office staff members spend using personal computers are given below. Calculate the standard deviation.
Time (hours per week) Frequency, f

0<3 14
3<6 6
6<9 6
9 < 12 7
12 < 15 14
15 < 18 3
∑ 50

6. Given the table from activity 3.4.
Ages (years) Frequency
1-13 7
14-26 6
27-39 5
40-52 9
53-65 15
66-78 12
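7.
[Diagram: the mid-70% range. 15% of the data values lie between the minimum and the 15th percentile, 70% lie between the 15th and 85th percentiles, and 15% lie between the 85th percentile and the maximum (highest data value).]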
8. The time taken to complete an assembling task has been measured for 250 employees:
Time taken (minutes) Number of people (𝒇) Cumulative frequency 𝒇(<)
0<5 2 2
5 < 10 2 4
10 < 15 3 7
15 < 20 5 12
20 < 25 5 17
25 < 30 18 35
30 < 35 85 120
35 < 40 92 212
40 < 45 37 249
45 < 50 1 250
Total 250
Calculate the upper and lower percentiles required for the upper and lower limits of the mid-60% range, and
calculate the required positions and values for these percentiles.
9. The time taken to complete an assembling task has been measured for 250 employees:
Time taken (minutes) Number of people (f) Cumulative frequency f(<)

0<5 2 2
5 < 10 2 4
10 < 15 3 7
15 < 20 5 12
20 < 25 5 17
25 < 30 18 35
30 < 35 85 120
35 < 40 92 212
40 < 45 37 249
45 < 50 1 250
Total 250
Calculate the quartile deviation (the interquartile range was calculated in a previous self-assessment
exercise).
READINGS
Prescribed Textbook:
Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western,
Cengage Learning. Chapter 2, pages 16-37, Chapter 3, pages 58-85.
Recommended Reading:
Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses & Conclusions: A
Course in Statistics for the Social Sciences. Cape Town: UCT Press. Tutorials
2, 3, 4: Pages 18-52.
Solutions to Chapter activities

Activity 3.1
3.1. A frequency distribution indicates the number of instances in which a variable takes each of its possible values. It is
used to summarise a single categorical variable. We generate frequency tables as a basic descriptor of the
number of times a particular response/outcome occurs. They assist in representing data graphically, and thus
make the data more intelligible.
3.2.
\[ \text{Approximate class width} = \frac{\text{Highest value in the raw data} - \text{Lowest value in the raw data}}{\text{Number of desired classes}} \]
3.1.3.a. Crosstabulation
3.1.4a Gender, Colours (purple, red, yellow, green etc)
3.1.4b colour of the wall (red, blue, green, yellow), breed of dog (Collie, Alsatian, Labrador, Staffie etc)
3.1.4c Number of cars sold, units of pens sold, number of people in a classroom
3.1.5 Mutually exclusive means that belonging to one category automatically discounts the possibility of
belonging to the other. Using the example, if you use the money for a wedding, it automatically discounts that
money being allotted to leisure. Exhaustive means that all elements in the population have been
represented. This is accounted for by the summation to 100% of the sample as per breakdown.
3.1.6 I would look at the highest frequencies, I would look at the minimum and maximum activity on which the
money was spent. I would generate a clustered bar chart. I would do a crosstabulation. I would compare males
and females, and their expenditure.
3.1.7
[Bar chart not reproduced; the vertical axis runs from 0 to 16]
Males spend more on visiting family, business related shopping, and wellness than females do. Females spend
more on weddings, medical and personal shopping than their male counterparts. Both males and
females spend the highest percentage on visiting friends/family.
3.1.8
Histogram – data is continuous/numerical/interval
Activity 3.2.
3.2.1.
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
Closing_Price 15 31.69 121.44 64.8141 24.93420
Valid N (listwise) 15
3.2.2.
Statistics
Closing_Price
N Valid 15
Missing 0
Median 65.5000
Activity 3.3.
GROUPED_FREQ Count (Binned)
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   1  30<40      2           13.3      13.3            13.3
        2  40<50      3           20.0      20.0            33.3
        3  50<60      2           13.3      13.3            46.7
        4  60<70      1           6.7       6.7             53.3
        5  70<80      3           20.0      20.0            73.3
        6  80<90      2           13.3      13.3            86.7
        7  90<100     1           6.7       6.7             93.3
        10 120<130    1           6.7       6.7             100.0
        Total         15          100.0     100.0
3.3.1
WAGE_GRP Count (Binned)
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   2  R5<R7      2           9.1       9.1             9.1
        3  R7<R9      3           13.6      13.6            22.7
        4  R9<R11     4           18.2      18.2            40.9
        5  R11<R13    6           27.3      27.3            68.2
        6  R13<R15    5           22.7      22.7            90.9
        7  R15<R17    2           9.1       9.1             100.0
        Total         22          100.0     100.0
3.3.3. Mean – R11.343

Median – R11.75
3.4.1.
Ages (years)   Frequency   m    fm     fm²      cf
1-13           7           7    49     343      7
14-26          6           20   120    2400     13
27-39          5           33   165    5445     18
40-52          9           46   414    19044    27
53-65          15          59   885    52215    42
66-78          12          72   864    62208    54
Σ              54               2497   141655
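3.4.2.
\[ \text{Mean} = \frac{\sum fm}{n} = \frac{2497}{54} = 46.24 \approx 46 \text{ years} \]
3.1.2.
\[ \text{Mode} = 53 + \frac{13(15 - 9)}{2(15) - 9 - 12} = 53 + 8.67 = 61.67 \approx 61 \text{ years} \]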
3.5.
\[ s = \sqrt{s^2} = \sqrt{235.26} = 15.34 \]
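3.6. SOLUTION TO ACTIVITY
\[ \text{coefficient of variation for grower A}, \; CV = \frac{s}{\bar{x}} \times 100\% = \frac{20}{300} \times 100\% = 6.67\% \]
\[ \text{coefficient of variation for grower B}, \; CV = \frac{s}{\bar{x}} \times 100\% = \frac{40}{280} \times 100\% = 14.29\% \]
Grower A’s apples have the lower CV and are therefore more consistent.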
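The comparison of the two growers can be reproduced with a few lines of Python. This is a minimal sketch; the function name `coefficient_of_variation` is an illustrative assumption, and the inputs are the standard deviations and means used in the solution above.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean * 100

cv_a = coefficient_of_variation(20, 300)  # grower A: s = 20, mean = 300
cv_b = coefficient_of_variation(40, 280)  # grower B: s = 40, mean = 280

print(round(cv_a, 2), round(cv_b, 2))

# The grower with the smaller CV has the more consistent produce.
more_consistent = "A" if cv_a < cv_b else "B"
print(more_consistent)
```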
3.7.
\[ \text{lower percentile of range} = \frac{100\% - \text{required range percentile}}{2} = \frac{100\% - 70\%}{2} = 15\% \]
\[ \text{upper percentile of range} = \text{lower percentile of range} + \text{required range percentile} = 15\% + 70\% = 85\% \]
Answers to revision exercises

SOLUTIONS TO REVISION QUESTIONS
1.1. Central location refers to the value around which the scores in a given dataset tend to cluster. It indicates
the “average” or the centre of a dataset, or group of scores.
1.2. a. Mode – 56%
b. Median – 56%
c. Mean – 54.91%
1.3. The median or the mode. The mean is being dragged down by the outlier/extreme score – 12%.
2. a. It is not skewed, and it is rather peaked.
b. Average – and normally distributed – unimodal and bell-shaped. The test was fair as student marks were
normally distributed.
3.
\[ \text{standard deviation}, \; s = \sqrt{\frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n-1}} = \sqrt{\frac{720\,425 - \frac{4\,235^2}{25}}{25-1}} = \sqrt{\frac{720\,425 - 717\,409}{24}} = \sqrt{125.67} = 11.21 \text{ lbs} \]

4.
Number of errors, x    x − x̄    (x − x̄)²
120                    84       7 056
30                     -6       36
40                     4        16
8                      -28      784
5                      -31      961
20                     -16      256
29                     -7       49
∑ 252                  0        9 158
\[ \text{standard deviation}, \; s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}} = \sqrt{\frac{9\,158}{7-1}} = 39.07 \approx 39 \text{ errors} \]
5.
Interval (time in hours per week)   Frequency (f)   Midpoint (x)   x²       fx       fx²
0 < 3                               14              1.5            2.25     21.00    31.50
3 < 6                               6               4.5            20.25    27.00    121.50
6 < 9                               6               7.5            56.25    45.00    337.50
9 < 12                              7               10.5           110.25   73.50    771.75
12 < 15                             14              13.5           182.25   189.00   2 551.50
15 < 18                             3               16.5           272.25   49.50    816.75
∑                                   50              54.0           643.50   405.00   4 630.50
\[ \text{standard deviation}, \; s = \sqrt{\frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n-1}} = \sqrt{\frac{4\,630.50 - \frac{405^2}{50}}{50-1}} = \sqrt{\frac{4\,630.50 - 3\,280.50}{49}} = \sqrt{27.55} = 5.25 \text{ hours} \]
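The grouped-data standard deviations in solutions 3 and 5 can be verified with a short Python sketch. The function name `grouped_std_dev` is an illustrative assumption; it applies the same computational formula used above, with class midpoints standing in for the individual observations.

```python
from math import sqrt

def grouped_std_dev(midpoints, frequencies):
    """Sample standard deviation for grouped data, using class midpoints."""
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))
    return sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

# Solution 3: weights in lbs
weights_sd = grouped_std_dev([145, 155, 165, 175, 185], [1, 4, 8, 7, 5])

# Solution 5: hours per week spent using personal computers
hours_sd = grouped_std_dev([1.5, 4.5, 7.5, 10.5, 13.5, 16.5],
                           [14, 6, 6, 7, 14, 3])

print(round(weights_sd, 2), round(hours_sd, 2))
```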
6.
\[ \bar{x} = \frac{\sum x}{n} = \frac{252}{7} = 36 \]
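7.
\[ \text{85th percentile position} = \frac{j \times n}{100} = \frac{85 \times 25}{100} = 21.25 \]
\[ \text{15th percentile position} = \frac{j \times n}{100} = \frac{15 \times 25}{100} = 3.75 \]
\[ \text{upper percentile}, \; P_{85} = L_{P_{85}} + \frac{c\left[\frac{85 \times n}{100} - f(<)\right]}{f_{P_{85}}} = 180 + \frac{10(21.25 - 20)}{5} = 182.5 \text{ lbs} \]
\[ \text{lower percentile}, \; P_{15} = L_{P_{15}} + \frac{c\left[\frac{15 \times n}{100} - f(<)\right]}{f_{P_{15}}} = 150 + \frac{10(3.75 - 1)}{4} = 156.88 \text{ lbs} \]
\[ \text{interpercentile or mid percentile range} = \text{value of upper percentile of range} - \text{value of lower percentile of range} = 182.5 - 156.88 = 25.62 \text{ lbs} \]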
8.
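\[ \text{lower percentile of range} = \frac{100\% - \text{required range percentile}}{2} = \frac{100\% - 60\%}{2} = 20\% \]
\[ \text{upper percentile of range} = \text{lower percentile of range} + \text{required range percentile} = 20\% + 60\% = 80\% \]
[Diagram: the mid-60% range. 20% of the data values lie between the minimum and the 20th percentile, 60% lie between the 20th and 80th percentiles, and 20% lie between the 80th percentile and the maximum (highest data value).]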
Calculate the required positions and values for these percentiles
\[ \text{80th percentile position} = \frac{j \times n}{100} = \frac{80 \times 250}{100} = 200 \]
\[ \text{20th percentile position} = \frac{j \times n}{100} = \frac{20 \times 250}{100} = 50 \]
\[ \text{upper percentile}, \; P_{80} = L_{P_{80}} + \frac{c\left[\frac{80 \times n}{100} - f(<)\right]}{f_{P_{80}}} = 35 + \frac{5(200 - 120)}{92} = 39.35 \text{ minutes} \]
\[ \text{lower percentile}, \; P_{20} = L_{P_{20}} + \frac{c\left[\frac{20 \times n}{100} - f(<)\right]}{f_{P_{20}}} = 30 + \frac{5(50 - 35)}{85} = 30.88 \text{ minutes} \]
\[ \text{interpercentile or mid percentile range} = \text{value of upper percentile of range} - \text{value of lower percentile of range} = 39.35 - 30.88 = 8.47 \text{ minutes} \]
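The percentile interpolation used in solution 8 can be sketched in Python as follows. The function name `grouped_percentile` and its parameters are illustrative assumptions; the sketch locates the class containing the percentile position and interpolates within it, as in the formula above.

```python
def grouped_percentile(j, lower_limits, width, frequencies):
    """Interpolate the j-th percentile from a grouped frequency table."""
    n = sum(frequencies)
    position = j * n / 100
    cumulative = 0  # cumulative frequency below the current class, f(<)
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= position:
            return lower + width * (position - cumulative) / f
        cumulative += f
    raise ValueError("percentile position lies outside the table")

# Assembly-task data from revision exercise 8 (class width of 5 minutes)
lower_limits = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
frequencies = [2, 2, 3, 5, 5, 18, 85, 92, 37, 1]

p80 = grouped_percentile(80, lower_limits, 5, frequencies)
p20 = grouped_percentile(20, lower_limits, 5, frequencies)
print(round(p80, 2), round(p20, 2))
```

Rounding is applied only when printing, so a range computed from the unrounded values can differ from the hand calculation in the last decimal place.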
9. SOLUTION TO SELF-ACTIVITY
The interquartile range was already calculated in the previous self-assessment activity:
\[ \text{interquartile range} = 7.05 \text{ minutes} \]
\[ \text{quartile deviation} = \frac{IQR}{2} = \frac{7.05}{2} = 3.53 \text{ minutes} \]
Unit 4: Probability and Probability Distributions
4.1 An introduction to probability
 Introduces the topic area for the unit
4.2 Probabilities of multiple outcomes
 Understand the purpose and tenets of probability
 Understand the basic principles of probability calculations
4.3 Binomial Distribution
 Understand the characteristics and purpose of binomial distributions
 Solving binomial problems
 Distinguish between the binomial and normal distribution
4.4 The Normal Distribution
 Understand the underlying characteristics and purpose of normal distributions
 Estimating probability using normal distributions
 Distinguish between the binomial, normal and standard normal distribution
4.5 The Standard Normal Distribution
 Understand the characteristics, theory and application of standard normal distributions
 Using standard normal distributions to estimate probability
 Demonstrate the use of z-tables
4.6 Sampling distribution of the mean
 Distinguish between the normal and sampling distribution of the mean
 Demonstrate a theoretical understanding of sampling
 Calculating probability estimates using the sampling distribution of the mean
4.7 Summary
 Summarises topic areas of the unit
Probability - the likelihood with which something will occur

Odds - the likelihood that something will happen
Factorial - the number of possibilities
Permutations - the number of different ways in which objects can be arranged in order
Binomial Distribution - hypothetical frequency distribution that allows us to estimate the probability of one out of
two mutually exclusive and jointly exhaustive possible outcomes
The Normal Distribution - the normal distribution is a smooth continuous curve representing the form a binomial
distribution would take for an infinite number of events with equiprobable outcomes.
The Standard Normal Distribution - the standard normal distribution has standardised z-values. z-scores are
standardised scores.
Sampling distribution of the mean - a variant of the normal distribution. A frequency distribution of sample
means and not individual scores

 Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western,
Cengage Learning.
 Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses & Conclusions: A

Course in Statistics for the Social Sciences. Cape Town: UCT Press.
4.1 An Introduction to Probability

Basically, probability refers to the likelihood with which something will occur. It can be calculated using a classical
approach, which gives the proportion of times an event is theoretically expected to occur.
The classical probability formula looks as follows:
\[ \text{Probability} = \frac{\text{Number of possible outcomes in which an event occurs}}{\text{Total number of possible outcomes}} \]
So, for example, if we were to roll a die (with 6 sides), the probability of getting a “3” is:
\[ \text{Probability} = \frac{1}{1+1+1+1+1+1} = \frac{1}{6} \]
Similarly, the probability of throwing a head when tossing a coin is 1/2.
Although the classic probability theory is pertinent to games of chance, it has limited real-world applications,
where all possible outcomes are not equally likely, or where there is little known about the underlying processes.
That is where the relative frequency approach steps in.
The relative frequency approach is calculated as the proportion of times an event is observed to occur in a very
large number of trials. The formula is as follows:
\[ \text{Probability} = \frac{\text{Number of trials in which an event occurs}}{\text{Total number of trials}} \]
The assumption underlying this approach is known as the law of large numbers. The law of large numbers
assumes that as you increase the number of trials, the overall probability starts tending towards the actual
probability with which that event is supposed to occur. So for example, if we were to toss a coin, the probability of
getting heads or tails will start to converge towards 0.5 the larger the number of trials.
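The law of large numbers can be illustrated by simulating coin tosses. This is a sketch only, assuming a fair coin simulated with Python's pseudo-random number generator; the function name `relative_frequency_of_heads` is illustrative.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def relative_frequency_of_heads(trials):
    """Toss a simulated fair coin `trials` times; return the proportion of heads."""
    heads = sum(1 for _ in range(trials) if random.random() < 0.5)
    return heads / trials

# The relative frequency tends towards 0.5 as the number of trials grows.
for trials in (10, 100, 1_000, 100_000):
    print(trials, relative_frequency_of_heads(trials))
```

With only 10 tosses the proportion of heads can be far from 0.5; with 100 000 tosses it is typically within a fraction of a percentage point.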
The term odds is often used to express the likelihood that something will happen. So, for example, if someone
said the odds in favour of something are 4 to 1, they are saying that it is four times as likely to occur
as not to occur (a probability of 4/5 = 0.8).
Probability and games of chance

Probabilities are based on how frequently something has happened before, and are therefore used to determine the
probability with which it is likely to re-occur. To demonstrate, suppose that 5 out of 7 people drink coffee. Simply
stated, that is 5/7 = 0.7143, or 71.43% of people drink coffee. As demonstrated, decimals and percentages
are easy to discern and facilitate comparisons. When we ascribe a number to the probability with
which something is likely to occur, we give a measure of confidence in our assertion. The higher the
probability, the greater the confidence.
If we are 100% sure or confident in the outcomes, we will ascribe the probability a 1. Alternatively, if we are
utterly sure that the event will not occur, we ascribe the probability a value of 0. Between 0 – 1, the stronger the
evidence, the closer to 1 the ascribed probability will be. For example, given the ongoing increases in petrol
prices, and given that the current petrol price has hit an all-time high of R15.73 with a pending increase of 71c,
the probability of it dropping to R6.92 (the price as of July 2008) is very small and probably closer to 0. The
probability, however, of it increasing another 20c to 30c within the next few months, given previous trends, is
quite high, with a probability closer to 1.
When discussing probabilities, the probability with which something will occur is denoted by p, whereas the
probability of that thing not occurring is denoted by q.
It can therefore be expressed as:
p+q=1
So,
q=1–p
p=1–q
When we are interested in achieving a particular outcome (p), achieving that outcome is known as a
success. Sometimes, the probability of success does not hinge on a one-shot attempt at achieving it, but rather on a
succession of events. Take, for example, the probability of throwing exactly two heads in three tosses of a coin. In
this case, the possible outcomes are HHT, TTH, HTH, THT, HHH, TTT, HTT and THH. In other words, there
are eight possible outcomes when flipping our coin thrice, of which three (HHT, HTH and THH) contain exactly two
heads. The probability is thus 3/8 = 37.5%.
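The eight outcomes can also be listed and counted programmatically. The following is a minimal Python sketch using the standard library; the variable names are illustrative.

```python
from itertools import product

# All possible sequences of three coin tosses (H = head, T = tail).
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# Keep only the sequences that contain exactly two heads.
two_heads = [outcome for outcome in outcomes if outcome.count("H") == 2]

probability = len(two_heads) / len(outcomes)
print(outcomes)
print(two_heads, probability)
```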
As mentioned, sometimes there are more than two possible outcomes (unlike the two-outcome case described by the
binomial distribution). If we think of the example of rolling a die, there are six possible outcomes. If we think of
the probability of selecting a Diamond from a pack of cards, we know there are 52 cards and four suits, which makes
it 13 cards for each suit. Therefore, the probability of selecting a Diamond from a pack of cards is as follows:
\[ p = \frac{a}{n} \]
Where:
a = number of outcomes counted as a success
n = total number of equally possible outcomes
so,
\[ p(\text{drawing a diamond}) = \frac{13}{52} = 0.25, \text{ or } 25\% \]
A jar contains 100 marbles, identical except that 30 are red, 20 black, 5 green
and the rest white.
If a marble is taken from the jar at random, what is the probability that the
marble is:
a. red ?
b. black or green?
c. not red?
d. multicolor?
SOLUTION:
A. 3/10
B. ¼
C. 7/10
D. 0
Sometimes, though, we need to consider overlapping probabilities, and as such need to distinguish between the
probability laws of conjunctions and disjunctions. Take the instance of drawing a card from a deck
of cards. You need to consider whether or not, once you have drawn the card, you replace it in the pack. If you draw a
card, and put it back afterwards, it is known as sampling with replacement. If you draw a card but do not put it
back in the deck before the next draw, you are sampling without replacement, and you have now altered the
probability of drawing a particular card in the next draw. In this instance, your draws are not independent in that,
through drawing one card, you affect the probability of the next draw.
When you have independent events, the law of conjunctions applies, and you multiply the probabilities. In other
words, you are looking at the probability of two jointly occurring events, i.e. a and b.
Alternatively, when you are looking at the probability of either of two mutually exclusive events, you will concern
yourself with the law of disjunctions, and add the probabilities, i.e. a or b.
For example, consider the probability of drawing two Diamonds in two successive draws. When we replace the card
after having drawn it (sampling with replacement), the probability will be:
p = 0.25 x 0.25 (a 1/4 chance of selecting a diamond on each draw)
p = 0.0625, or 6.25%
However, if we were to draw the cards without replacement, then we alter the probabilities:
\[ p(\text{first draw}) = \frac{13}{52} \]
\[ p(\text{second draw}) = \frac{12}{51} \]
\[ p(a \text{ and } b) = \frac{13}{52} \times \frac{12}{51} = 0.059 \]
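The two cases (with and without replacement) can be compared using exact fractions. This is a minimal Python sketch for drawing two diamonds from a standard 52-card deck; the variable names are illustrative.

```python
from fractions import Fraction

p_first = Fraction(13, 52)  # 13 diamonds among 52 cards on the first draw

# With replacement the deck is unchanged, so the second draw is also 13/52.
with_replacement = p_first * Fraction(13, 52)

# Without replacement only 12 diamonds remain among 51 cards.
without_replacement = p_first * Fraction(12, 51)

print(with_replacement, float(with_replacement))
print(without_replacement, round(float(without_replacement), 3))
```

Working in fractions avoids rounding until the final step, which makes it easy to see that removing a diamond slightly lowers the chance of drawing a second one.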
However, when we are looking at disjunctions, we are trying to estimate the probability of either of two
events occurring. The events need to be mutually exclusive (belonging to one category
automatically excludes belonging to another). So, for example, if you were to estimate the probability of
drawing a heart or a diamond in a single draw, the probability would be:
\[ p(\text{hearts}) = \frac{1}{4} \]
\[ p(\text{diamonds}) = \frac{1}{4} \]
Therefore:
\[ p(\text{Heart or Diamond}) = \frac{1}{4} + \frac{1}{4} = 0.5 \]
Half the faces of a fair die are painted blue, half yellow. The die is rolled twice. What is
the probability the die will turn up blue both times? Can you cite a probability “rule” that
models your answer?
SOLUTION
Multiplicative rule (law of conjunctions):
p(A and B) = p(A) x p(B) = ½ x ½ = ¼ for independent events A and B
4.2 Probabilities of Multiple Outcomes

Let’s say one event can occur in m ways and a second event can occur in n ways; the number of possible combined
outcomes is then m x n. So, for example, if you had two cars, and three different ways of getting to the beach,
there are (2 x 3) 6 possible ways in which you can get to the beach. In certain instances, several events have
different possibilities, and when you make a choice, it limits the number of choices for subsequent events. Let’s
say, for example, you had 5 tasks to complete, and need to choose the order in which to complete them. The number
of possibilities then becomes 5!, or 5 x 4 x 3 x 2 x 1 = 120.
This is known as a factorial.
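The factorial result can be confirmed by listing every possible ordering of five tasks. This is a minimal Python sketch; the task labels A to E are hypothetical.

```python
from itertools import permutations
from math import factorial

tasks = ["A", "B", "C", "D", "E"]  # five hypothetical tasks

# Every possible order in which the five tasks could be completed.
orderings = list(permutations(tasks))

print(len(orderings), factorial(5))
```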
Permutations refer to the number of different ways in which objects can be arranged in order. Whilst permuting
possibilities, each item can only appear once, and each order constitutes a separate permutation. To
calculate the number of ways in which r successes can occur in n trials, where the order does not matter (the
number of combinations), we use:
\[ \binom{n}{r} = \frac{n!}{r!(n-r)!} \]
Where:
n = number of trials or events
r = number of successes
! = factorial – multiply each number by the number before it e.g. 3! = 3 x 2 x 1
The formula above gives the potential number of possible outcomes. The formula for calculating probabilities is thus:
\[ p(r) = \frac{n!}{r!(n-r)!} \times p^r \times q^{n-r} \]
So,
Step 1: Work out the total possible number of outcomes
Step 2: Define what is meant by a success, and how many successes in n trials
Step 3: Determine what the probability is associated with each success
Step 4: Substitute it into the probabilities formula
If we had a four-sided die, each side representing a suit (hearts, diamonds, spades, clubs), then when trying to
determine the probability of obtaining 2 hearts in 6 rolls of the die:
Step 1: The total number of possible outcomes is:
\[ \binom{n}{r} = \frac{n!}{r!(n-r)!} \]
\[ \binom{6}{2} = \frac{6!}{2!(6-2)!} = \frac{720}{2 \times 24} = \frac{720}{48} = 15 \text{ possible outcomes} \]
Step 2: Success is rolling 2 hearts (r = 2, n = 6)
Step 3: Probability of success is 1/4 = 0.25, therefore p = 0.25 and q = 0.75
Step 4:
\[ p(\text{2 hearts in 6 rolls}) = \frac{n!}{r!(n-r)!} \times p^r \times q^{n-r} \]
\[ p(\text{2 hearts in 6 rolls}) = \frac{720}{2 \times 24} \times 0.25^2 \times 0.75^{6-2} \]
\[ p(\text{2 hearts in 6 rolls}) = 15 \times 0.0625 \times 0.3164 \]
\[ p(\text{2 hearts in 6 rolls}) = 0.297 \]
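The four steps can be expressed as a short Python function. This is a sketch, assuming Python 3.8 or later for `math.comb`; the function name `binomial_probability` is illustrative.

```python
from math import comb

def binomial_probability(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    q = 1 - p  # probability of failure on a single trial
    return comb(n, r) * p ** r * q ** (n - r)

ways = comb(6, 2)  # Step 1: number of ways to obtain 2 successes in 6 trials
p_two_hearts = binomial_probability(r=2, n=6, p=0.25)  # Steps 2 to 4

print(ways, round(p_two_hearts, 4))
```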
Please note that the difference between a permutation and a combination is that for permutations the order matters,
whilst for combinations it does not.
4.3. Binomial Distribution
Figure 4.1 The Binomial distribution
The Binomial distribution is a hypothetical frequency distribution that allows us to estimate the probability of one
out of two mutually exclusive and jointly exhaustive possible outcomes. Although it looks the same as other
distributions, such as the sampling distribution of the mean, and the normal distribution, it has the following
characteristics:
 Events must be independent
 Events must only have two possible outcomes
 Equally probable outcomes or two outcomes of unequal probability
 Discrete variables only
 Mutually Exclusive: Occurrence of one event makes the occurrence of all other events impossible
 Exhaustive: All possible outcomes or states of phenomena are represented
 Estimates r occurrences of successful outcomes in n events
The binomial question is “What is the probability that r successes will occur in n trials of the process
under study?”
There are six steps you need to work through a binomial story problem:
1. Define Success first - Success must be for a single trial
2. Define the probability of success
3. Find the probability of failure
4. Define the number of trials
5. Define the number of successes out of those trials
6. Plug all values into the formula
A car hire firm rents out only Toyota and VW cars. Experience has shown that one in
four clients choose a VW. If 5 reservations are randomly selected from today’s
bookings, what is the probability that 2 will have requested a VW?
SOLUTION:
One in four clients hire a VW, so:
Step 1: What is the success? In this case it is hiring a VW.
Step 2: Determine the probability of success - Probability of a success outcome p = ¼

= 0.25 (A VW is hired)
Step 3: Probability of a failure - q = ¾ = 0.75 (A VW is not hired)
Step 4 & 5: We want to know the probability of 2 success outcomes, i.e. we require
p(2) and we have 5 observations, thus n = 5.
Step 6:
\[ p(2) = \frac{5!}{2!(5-2)!} \times 0.25^2 \times 0.75^{5-2} = 10 \times 0.0625 \times 0.4219 = 0.2637 \]
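To see where this answer sits among all the possibilities, the full binomial distribution for the car hire example (n = 5, p = 0.25) can be tabulated. This is a sketch, assuming Python 3.8 or later; the variable names are illustrative.

```python
from math import comb

n, p = 5, 0.25  # 5 reservations; probability 0.25 that a client requests a VW
q = 1 - p

# Probability of r VW requests for every possible r from 0 to 5.
distribution = [comb(n, r) * p ** r * q ** (n - r) for r in range(n + 1)]

for r, probability in enumerate(distribution):
    print(r, round(probability, 4))

# The outcomes are exhaustive, so the probabilities should sum to 1.
total = sum(distribution)
print(total)
```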
4.4 The Normal Distribution
Figure 4.2 The Normal Distribution
The normal distribution is a smooth continuous curve representing the form a binomial distribution would take for
an infinite number of events with equiprobable outcomes. The characteristics of a Normal Distribution are as
follows:
 Bell-shaped curve
 Symmetrical
 Unimodal (Mean, Median, Mode all coincide)
 Asymptotic - tails extend indefinitely to the left and right
It is a model of the shape of the frequency distribution of many naturally occurring phenomena and helps us
understand the “relative position” of a case relative to other cases. In other words it allows us to determine where
an individual score lies in relation to other scores. The area under the curve of a normal distribution represents
probability.
Different normally distributed phenomena have different distributions (different means and variances), but all have
the same shape (normal shape). Think, for example, of female shoe sizes. In South Africa, the average (mean) shoe
size for females is size 6, with a standard deviation of 2.5. In China, on the other hand, the average female shoe
size is 3.5, with a standard deviation of 2. If we were to plot them graphically:
Figure 4.3. Histograms illustrating normally distributed data
You can see from the histograms above that both distributions are normally distributed despite having different
means and standard deviations. Distributions allow us to predict probability or proportion from an individual
score, but in order for us to do so, we need three pieces of information:
 Mean
 Variance
 Shape
But, because there are so many different types of distributions – each distribution has a different proportion of
cases falling below any particular score so it becomes difficult to determine positions when all the distributions
are different. As such, we standardise distributions in order to determine absolute positions.
4.5 The Standard Normal Distribution
Whereas normal distributions have x-values along the x-axis (individual scores), the standard normal distribution has
standardised z-values. z-scores are standardised scores. They do not depict real values of individuals and are
essentially hypothetical values to show where an individual case lies relative to other cases. They indicate the
number of standard deviation units a score lies above or below the mean.
Figure 4.4 The Standard Normal Distribution
You use the z-table in order to determine the proportion of cases above and below a particular score, taking the
mean and standard deviation into consideration when calculating the z-score.
Table 4.1 The z-table
z-tables are used to determine the exact proportion of cases falling above and below a particular score, and
contain z-scores and proportions. You will find the z-scores in the horizontal and vertical margins, and their
associated proportions in the body of the table. When we convert the raw scores from a normal distribution into
z-scores, we facilitate comparisons across distributions with different means and standard deviations, and make
real-world distributions comparable. In order to convert raw scores into z-scores, we use the following formula:
$$ z = \frac{x - \mu}{\sigma} $$
Where:
z = z-score
x = score in real-world distribution
µ = population mean
σ = population standard deviation
When dealing with a z-score related problem, you follow the following steps:
Step 1: Determine the mean and standard deviation
Step 2: Plug the statistics into the formula and derive a z-value
Step 3: Draw the standard normal distribution in order to determine the proportion (larger or smaller) you are
interested in
Step 4: Determine the actual proportion using the column determined above using your z-table
Step 5: Conclude with reference to the scenario
In the United States, the average IQ is 100, with a standard deviation of 15. What
percentage of the population would you expect to have an IQ lower than 85?
SOLUTION
Step 1: mean = 100, standard deviation = 15
Step 2:
$$ z = \frac{85 - 100}{15} = -1 $$
Step 3: Plot z = -1 on the standard normal distribution. Because we want scores lower than 85, we look to
the left of -1. The shaded area to the left of -1 is the smaller part of the whole, which means you are
interested in the smaller proportion (smaller p).
Step 4: You can ignore the sign (+/-) as the distribution is symmetrical. Look up the value 1 in the
z column and read across to the smaller p column: you will read off a value of 0.15866 = 15.87%.
Step 5: Conclude with reference to the scenario: about 16% of the population has
an IQ score lower than 85.
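The five steps above can also be checked with a few lines of code. The following is a minimal sketch rather than part of the prescribed method: the function names `z_score` and `proportion_below` are illustrative assumptions, and the proportion is computed from the standard normal cumulative distribution (via the error function) instead of being read off the z-table.

```python
import math

def z_score(x, mean, std_dev):
    """Convert a raw score into a z-score: z = (x - mean) / std_dev."""
    return (x - mean) / std_dev

def proportion_below(z):
    """Proportion of the standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# IQ example: mean 100, standard deviation 15, score of 85
z = z_score(85, 100, 15)
p = proportion_below(z)
print(f"z = {z:.2f}, proportion below = {p:.5f}")
```

The printed proportion should agree with the smaller p value read from the z-table for z = 1.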
Activity 4.1.
4.1.1. Find the area between the mean and:
1. 1.17
2. -0.85
3. 2.07
4. -1.37
4.1.2. The percentage of individuals scoring below:
1. 2.24
2. -1.65
3. 1.47
4. -0.47
4.1.3. The percentage of individuals falling above:

1. 0.24
2. -1.11
3. 1.22
4. -2.07
4.1.4. For the numbers below, find the percentage of cases falling between the two z-scores:
1. z = -0.38 & z = 1.63
2. z = 0.88 & z = 1.55
3. z = -1.93 & z = 1.09
4. z = -2.22 & z = -1.34
4.6. Sampling Distribution of the Mean

The sampling distribution of the mean is a variant of the normal distribution. It looks like a normal distribution,
and is defined by the mean and the variance. What makes a sampling distribution different from the binomial and
standard normal distributions is that it is a frequency distribution of sample means and not of individual scores.
It forms the foundation of all inferential statistics. Remember that the point of statistics is to make inferences
about populations based on samples. Therefore, it is imperative that the samples we draw are, in fact,
representative of the population we wish to say something about. When we make inferences about population
parameters based on sample statistics, we are making generalisations from a random and representative sample to
the larger population. The accuracy of inferences is critically affected by the sampling procedures used.
Because we use samples to make inferences, we need sampling distributions. To draw scientific inferences, we
need to know where a sample mean stands in a distribution relative to other sample means. The mean of the
sampling distribution is equal to the mean of the population. In other words, as we draw more and more of the
available samples, the average of the sample means will tend towards the population mean.
Figure 4.6 Graphical representation of the theory underlying the sampling distribution of the mean (sample means
x̅1 to x̅6 scattered around the population mean µ)
When we sample repeatedly from the same population, we expect the means of these samples to differ.
Plotting the means of an infinite number of samples of size n, drawn from a population, will give us a sampling
distribution of the mean. The tendency for the statistics of repeated samples to tend towards the population
parameters is described by the Central Limit Theorem. The Central Limit Theorem states that
“Given a population with a mean µ and a variance σ², the sampling distribution of the mean will have a mean
equal to µ and a variance equal to σ²/n. The shape of the sampling distribution approaches normal as the sample
size (n) increases”.
Therefore, the mean of the sampling distribution of the mean is equal to the population mean:
$$ \mu_{\bar{x}} = \mu $$
The variance of the sampling distribution of the mean is equal to the population variance divided by n:
$$ \sigma^2_{\bar{x}} = \frac{\sigma^2}{n} $$
The distribution will be approximately normal as long as the sample size is not too small.
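These two properties can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not a proof: it assumes a normally distributed population with mean 100 and standard deviation 15, draws 20 000 samples of size 10 (both numbers are arbitrary choices), and compares the mean and variance of the sample means with µ and σ²/n. The seed is fixed only so that the run is repeatable.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

POP_MEAN, POP_SD = 100, 15        # assumed population parameters
N, NUM_SAMPLES = 10, 20_000       # sample size and number of repeated samples

# Draw many samples of size N and record each sample mean
sample_means = [
    statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(N))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print("mean of sample means:", round(mean_of_means, 2), "(theory:", POP_MEAN, ")")
print("variance of sample means:", round(var_of_means, 2),
      "(theory:", POP_SD ** 2 / N, ")")
```

With these settings the simulated values should land close to 100 and 22.5 respectively, although the exact figures vary with the seed.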
Using the Central Limit theorem concept, we can compute the proportion of cases lying above or below a
specified value, and ask:
What proportion of samples have a mean greater or smaller than a particular value?
Step 1: Transform x̅ into a z-value
Whereas we previously used the formula:
$$ z = \frac{x - \mu}{\sigma} $$
This changes to:
$$ z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} $$
The variance of the sampling distribution is not the same as the variance of the population, so we substitute
σ/√n (the standard error of the mean) for σx̅:
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Step 2: Calculate
From years of testing, we know that IQ scores for individuals are normally
distributed with a mean of 100 and a standard deviation of 15. If we select a
random sample of 10 secondary school pupils, what is the probability that their
mean is less than 95?
SOLUTION
Transform a sample mean of 95, from a sampling distribution of the mean with a mean of 100 and a standard
deviation of 4.744, into a z-score of -1.054 on the standard normal distribution:
$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
$$ z = \frac{95 - 100}{15 / \sqrt{10}} $$
$$ z = \frac{-5}{15 / 3.162} = \frac{-5}{4.744} $$
$$ z = -1.054 $$
A proportion of 0.146 lies above a z-score of 1.054. By symmetry, a proportion of 0.146 lies below a z-score
of -1.054. Therefore a proportion of 0.146 lies below a score of 95 on our sampling distribution of the mean.
This means that the probability that our sample would have a mean lower than 95 is 0.146 (i.e. a 14.6% chance
that their mean will be less than 95).
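The same calculation can be sketched in code. The helper names `standard_error` and `proportion_below` are illustrative assumptions, and the proportion is computed directly from the standard normal distribution rather than read from the z-table, so small rounding differences are to be expected.

```python
import math

def standard_error(std_dev, n):
    """Standard deviation of the sampling distribution of the mean."""
    return std_dev / math.sqrt(n)

def proportion_below(z):
    """Proportion of the standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Sample of 10 pupils, population mean 100, population standard deviation 15
se = standard_error(15, 10)
z = (95 - 100) / se
p = proportion_below(z)
print(f"standard error = {se:.3f}, z = {z:.3f}, P(mean < 95) = {p:.3f}")
```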
4.7 Summary
This chapter covered probability. Probability is the likelihood of something occurring. It introduced and explained
the types and use of the theoretical distributions underlying probability calculations. Probability is important to
business in that we are able to estimate the likelihood of certain outcomes occurring, and take the necessary
precautions.
Solutions to activities
4.1.1.
1. 0.37900
2. 0.30234
3. 0.48077
4. 0.41466
4.1.2.
1. 0.98745 = 98.75%
2. 0.04947 = 4.95%
3. 0.92922 = 92.92%
4. 0.31918 = 31.92%
4.1.3.
1. 0.40517 = 40.52%
2. 0.86650 = 86.65%
3. 0.11123 = 11.12%
4. 0.98077 = 98.08%
4.1.4.
1. 0.94845 - 0.35197 = 0.59648 = 59.65%
4. 7.69% (subtract the smaller proportion from the larger)
Revision Questions
1. Serena Williams is known to serve an ace at Wimbledon 70% of the time. If she continues to serve at
the same rate for her next match, and serves 5 times, what is the probability that:
1.1. All five serves will be aces
1.2. At least 2 serves will be aces
2. What is the area under the standard normal distribution between z = -0.6 and z = 2.4?
3. Research has shown that children can concentrate on average for four minutes, with a standard deviation
of 1 minute. What is the probability that a child will be able to concentrate for:
3.1. Between 5-6 minutes
3.2. Less than 2 minutes
Answers to revision questions
1.1.
$$ P(r) = \frac{n!}{r!(n-r)!} \, p^r q^{n-r} $$
Let the successful outcome be that a serve is an ace (successful serve).
p = 0.70 and q = 0.30. Also n = 5.
$$ P(5) = \frac{5!}{5! \, 0!} (0.70)^5 (0.30)^0 = 0.1681 $$
1.2.
$$ P(0) = \frac{5!}{0! \, 5!} (0.70)^0 (0.30)^5 = 0.00243 $$
$$ P(1) = \frac{5!}{1! \, 4!} (0.70)^1 (0.30)^4 = 0.0284 $$
P(at least 2 aces) = 1.0 – P(0) – P(1) = 1.0 – 0.00243 – 0.0284 = 0.969
2. 0.2257 + 0.4918 = 0.7175
3.1. P(5 < x < 6) = P((5 - 4)/1 < z < (6 - 4)/1)
= P(1.00 < z < 2.00)
= 0.4772 - 0.3413 = 0.1359
= 13.59%
3.2. P(x < 2) = P(z < (2 - 4)/1) = P(z < -2.00)
= 0.5 - 0.4772
= 0.0228
= 2.28%
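The binomial answers to question 1 can be verified with a short script. This is a minimal sketch; the function name `binomial_probability` is an illustrative assumption, and `math.comb` supplies the n!/(r!(n-r)!) term.

```python
from math import comb

def binomial_probability(r, n, p):
    """P(r successes in n trials) = C(n, r) * p^r * q^(n - r), with q = 1 - p."""
    q = 1 - p
    return comb(n, r) * p ** r * q ** (n - r)

n, p = 5, 0.70  # five serves, 70% chance of an ace on each

p_all_five = binomial_probability(5, n, p)
p_at_least_two = 1 - binomial_probability(0, n, p) - binomial_probability(1, n, p)

print(f"P(5 aces) = {p_all_five:.4f}")
print(f"P(at least 2 aces) = {p_at_least_two:.3f}")
```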
Unit 5: Index Numbers
5.1 Introduction to Index Numbers  Introduce the topic areas of the unit
5.2 Base Period  Defining an index number and its purpose
5.3 Price Index  Differentiate between price and quantity indices
5.4 Quantity Index  Calculate both price and quantity indices
5.5 Summary  Summarises the topic areas of the unit
Index number - relative figure, expressed as a percentage, which is used to measure how much an economic
variable changes over time or differs between two locations. It is a summary measure of the change in the
activity of an item or a collection of items (known as a basket) from one time period to another.
Base Period - Point in time to which the comparison is made
Price Index - Measures the percentage change in price between any two periods of time
Quantity index - Measures the percentage change in consumption level of individual items or baskets of items
from one time period to another
Composite index - Combines both the relative prices and the quantities

Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western,
Cengage Learning. Chapter 18, pages 724-728.
Goodridge, P. (2007). Methods explained: Index Numbers. Available from:
https://www.ons.gov.uk/ons/rel/elmr/economic-and-labour-market-review/no--3--
march-2007/methods-explained--index-numbers.pdf
5.1. Introduction to Index Numbers

An index number is a relative figure, expressed as a percentage, which is used to measure how much an
economic variable changes over time or differs between two locations. It is most often used in economics to
measure trends in, for example, stock market prices, cost of living, imports etc. Index numbers are not directly
measurable; instead, they represent general, relative changes. An index number is therefore a summary
measure of the change in the activity of an item or a collection of items (known as a basket) from one time period
to another.
An index is constructed by expressing the value of an item in the current period as a ratio of its value in the base
period. This is then expressed as a percentage.
$$ \text{Index Number} = \frac{\text{Current period value}}{\text{Base period value}} \times 100 $$
5.2. Base Period

The base period is the point in time to which the comparison is made. The choice of a base period will determine the usefulness of any
data. In choosing a base period, the following should be considered:
 the period should be recent enough for comparisons with the base to be meaningful;
 it should be a typical period with respect to the activity of interest, and should be the same period used
by other series with which you are likely to compare your data;
 the period should be one of relative economic stability, without any abnormal influences
The base period is usually ascribed the value of 100.
Each subsequent year will be above or below 100, depending on whether there has been an increase or
decrease in the data compared with the base year.
There are generally two major categories of index numbers – price and quantity.
For both, we can use a single or composite index.
Single index
According to the Concise Encyclopaedia of Statistics (2008), a simple index number is “the ratio of two values
representing the same variable, measured in two different situations or in two different periods”. For example,
a simple price index gives the change in price between the current period and the reference period. Other
examples of simple index numbers include quantity and value indices.
Complex or Composite index

A composite index combines both the relative prices and the quantities. A composite index number allows us to
“measure, with a single number, the relative variations within a group of variables upon moving from one
situation to another” (Encyclopaedia of Statistics, 2008).
5.3. Price Index

The price index measures the percentage change in price between any two periods of time. You calculate the
price relative in order to determine the relative price change from one time period to another.
$$ \text{Price relative} = \frac{p_1}{p_0} \times 100 $$
Where:
𝑝1 = Price in the current period
𝑝0 = Price in the base period
5.4. Quantity Index

This index measures the percentage change in consumption level of individual items or baskets of items from
one time period to another. For single items, the relative quantity change from one time period to
another is found by computing the quantity relative.
$$ \text{Quantity relative} = \frac{q_1}{q_0} \times 100 $$
Where:
𝑞1 = Quantity in the current period
𝑞0 = Quantity in the base period
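Price and quantity relatives share the same arithmetic, which the short sketch below makes explicit. The function name `relative` and the figures used (a price rising from 50 to 60, a quantity falling from 200 to 180) are hypothetical values chosen purely for illustration.

```python
def relative(current, base):
    """Simple index (relative): current period value as a percentage of the base."""
    return 100 * current / base

# Hypothetical single item: price moves from 50 to 60, quantity from 200 to 180
price_relative = relative(60, 50)
quantity_relative = relative(180, 200)

print(f"price relative = {price_relative:.1f}")        # above 100: price rose
print(f"quantity relative = {quantity_relative:.1f}")  # below 100: quantity fell
```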
Weighted index numbers

Whereas simple index numbers grant equal importance to all items regardless of what share they hold, weighted
index numbers weigh or load items according to their relative importance. For example, when calculating the
price index number, if the price of a unit of petrol is fifteen times the price of a unit of rice, then the petrol
will be given a weight of ‘15’ whereas rice will be given a weight of ‘1’. This creates a more realistic picture of
the real state of affairs than simple index numbers do.
There are two types of weighted indexes, the fixed weight index, and the simple weighted (aggregative) index.
Fixed weighted index:

This utilises weights that are based on a period or periods considered representative. The weight and base prices
do not necessarily have to be drawn from the same period. The formula is as follows:
$$ \text{Price Index} = \frac{\sum P_1 w}{\sum P_0 w} \times 100 $$
Where w is the weight
Simple Weighted (Aggregative) Index:

This places the base year values for both price and quantity in the denominator. It does, however, run the risk
of a lack of representivity in the base period.
The formula is as follows:
$$ \frac{\sum P_1 Q_1}{\sum P_0 Q_0} \times 100 $$
Commonly used composite indexes based on weighted aggregates include the Laspeyres index and the
Paasche index. The most commonly used composite index is the Laspeyres Index.
$$ \text{Laspeyres Index} = \frac{\sum (p_1 \times q_0)}{\sum (p_0 \times q_0)} \times 100 $$
Quantities are held constant at base period levels.
For a portfolio of shares, a price index above 100 indicates an increase in the value of the portfolio if all
quantities of shares remain the same. Similarly, a quantity index above 100 indicates an increase in the number of
shares bought, since all prices have been kept constant in the calculation.
Index numbers are based on samples and are thus error prone, and changes in technology, purchasing
behaviours, quality changes etc can create inconsistencies.
Share   Base Year 1992     Current year      Base       Price      Quantity
        p0       q0        p1       q1       p0 × q0    p1 × q0    p0 × q1
A       70       362       120      305      25340      43440      21350
B       202      250       122      72       50500      30500      14544
C       1280     52        1900     102      66560      98800      130560
TOTAL                                        142400     172740     166454
Table 5.1. Table calculating Laspeyres Price and Quantity Index
$$ \text{Laspeyres Price Index} = \frac{\sum (p_1 \times q_0)}{\sum (p_0 \times q_0)} \times 100 $$
$$ \text{Laspeyres Price Index} = \frac{172740}{142400} \times 100 $$
Laspeyres Price Index = 121.31%

This means that the value of the shares increased by roughly 21.31%
$$ \text{Laspeyres Quantity Index} = \frac{\sum (p_0 \times q_1)}{\sum (p_0 \times q_0)} \times 100 = \frac{166454}{142400} \times 100 $$
Laspeyres Quantity Index = 116.89%
This means that the number of units of shares held increased by roughly 16.89%
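The Laspeyres calculation for Table 5.1 can be reproduced in code. This is a minimal sketch: the dictionary layout and the function names are illustrative assumptions, while the prices and quantities are those of the three shares in the table.

```python
# (p0, q0, p1, q1) for each share in Table 5.1
portfolio = {
    "A": (70, 362, 120, 305),
    "B": (202, 250, 122, 72),
    "C": (1280, 52, 1900, 102),
}

def laspeyres_price_index(items):
    """Current prices weighted by base quantities, relative to the base value."""
    numerator = sum(p1 * q0 for p0, q0, p1, q1 in items)
    denominator = sum(p0 * q0 for p0, q0, p1, q1 in items)
    return 100 * numerator / denominator

def laspeyres_quantity_index(items):
    """Current quantities weighted by base prices, relative to the base value."""
    numerator = sum(p0 * q1 for p0, q0, p1, q1 in items)
    denominator = sum(p0 * q0 for p0, q0, p1, q1 in items)
    return 100 * numerator / denominator

price_index = laspeyres_price_index(portfolio.values())
quantity_index = laspeyres_quantity_index(portfolio.values())
print(f"Laspeyres price index = {price_index:.2f}")
print(f"Laspeyres quantity index = {quantity_index:.2f}")
```

Both functions divide by the same base-period value, which is why the table needs only one p0 × q0 column.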
Paasche index
The Paasche index is an example of a weighted aggregate index which uses current time period weights. It is
useful when the relative importance of the items making up the basket of goods is continuously changing because
the quantities differ from year to year. It reflects what the industry is actually using in the current year more
closely than the Laspeyres Index does, and therefore takes account of both the price changes and the
quantity changes.
The formula for Paasche’s Price Index is:
$$ \frac{\sum (p_1 q_1)}{\sum (p_0 q_1)} \times 100 $$
The formula for Paasche’s Quantity Index is:
$$ \frac{\sum (p_1 q_1)}{\sum (p_1 q_0)} \times 100 $$
The table shows the 2005 and 2006 prices and volumes in millions of shares for Toyota,
VW, and BMW. Calculate the Paasche Index using 2005 as the base period.
        Toyota              VW                  BMW
Year    Price    Quantity   Price    Quantity   Price    Quantity
2005    45.51    0.8        13.17    7          36.81    5.6
2006    61.41    0.2        7.51     10         30.72    6.1
Table 5.2. Table of prices and quantities in millions of shares for three different car brands
The calculation for Paasche’s Price Index is:
$$ \frac{\sum (p_1 q_1)}{\sum (p_0 q_1)} \times 100 $$
$$ = \frac{0.2(61.41) + 10(7.51) + 6.1(30.72)}{0.2(45.51) + 10(13.17) + 6.1(36.81)} \times 100 $$
$$ = \frac{274.774}{365.343} \times 100 $$
= 75.2
2006 prices represent a 24.8% (100 – 75.2) decrease from 2005 (assuming quantities
were at 2006 levels for both periods).
The calculation for Paasche’s Quantity Index is:
$$ \frac{\sum (p_1 q_1)}{\sum (p_1 q_0)} \times 100 $$
$$ = \frac{0.2(61.41) + 10(7.51) + 6.1(30.72)}{0.8(61.41) + 7(7.51) + 5.6(30.72)} \times 100 $$
$$ = \frac{274.774}{273.73} \times 100 $$
= 100.381
Quantity increased by 0.38%
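The Paasche example can be checked in the same way. The sketch below assumes the same illustrative data layout, (p0, q0, p1, q1) per share, with 2005 as the base period and 2006 as the current period; the function names are not from any library.

```python
# (p0, q0, p1, q1): 2005 is the base period, 2006 the current period (Table 5.2)
shares = {
    "Toyota": (45.51, 0.8, 61.41, 0.2),
    "VW": (13.17, 7, 7.51, 10),
    "BMW": (36.81, 5.6, 30.72, 6.1),
}

def paasche_price_index(items):
    """Prices compared using current-period quantities as weights."""
    numerator = sum(p1 * q1 for p0, q0, p1, q1 in items)
    denominator = sum(p0 * q1 for p0, q0, p1, q1 in items)
    return 100 * numerator / denominator

def paasche_quantity_index(items):
    """Quantities compared using current-period prices as weights."""
    numerator = sum(p1 * q1 for p0, q0, p1, q1 in items)
    denominator = sum(p1 * q0 for p0, q0, p1, q1 in items)
    return 100 * numerator / denominator

price_index = paasche_price_index(shares.values())
quantity_index = paasche_quantity_index(shares.values())
print(f"Paasche price index = {price_index:.1f}")
print(f"Paasche quantity index = {quantity_index:.3f}")
```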
5.5 Summary
This unit covered price and quantity indices used to measure trends, and track changes between time points.
Indices are useful measures of the change in the activity of an item or a collection of items (known as a basket)
from one time period to another.
Revision questions
5.1. The following table represents the portfolio for shares. Use the Laspeyres price and quantity indices to
determine how the shares fared.
Share Base Year 1997
𝒑𝟎 𝒒𝟎 𝒑𝟏 𝒒𝟏
A 60 350 118 299
B 180 221 115 68
C 1113 49 1750 97
5.2. The following table represents the portfolio for shares. Use the Laspeyres price and quantity indices to
determine how the shares fared.
Share Base Year 2000
𝒑𝟎 𝒒𝟎 𝒑𝟏 𝒒𝟏
A 80 300 100 250
B 100 100 120 60
C 200 50 800 85
5.3. Given the following prices and quantities for copper and steel for the following periods:
           Copper                      Steel
Period     Price    Quantity (Tons)    Price    Quantity (Tons)
Base       1000     200                130      8700
Current    1010     190                120      9000
Calculate Paasche’s Price and Quantity Index.

SOLUTION
5.1.
        p0      q0     p1      q1     p0 × q0    p1 × q0    p0 × q1
A       60      350    118     299    21000      41300      17940
B       180     221    115     68     39780      25415      12240
C       1113    49     1750    97     54537      85750      107961
TOTAL                                 115317     152465     138141
$$ \text{Laspeyres Price Index} = \frac{152465}{115317} \times 100 = 132.2138\% $$
$$ \text{Laspeyres Quantity Index} = \frac{138141}{115317} \times 100 = 119.7924\% $$
This means that the number of units of shares held increased by roughly 19.79%
5.2.
        p0     q0     p1     q1    p0 × q0    p1 × q0    p0 × q1
A       80     300    100    250   24000      30000      20000
B       100    100    120    60    10000      12000      6000
C       200    50     800    85    10000      40000      17000
TOTAL                              44000      82000      43000
$$ \text{Laspeyres Price Index} = \frac{82000}{44000} \times 100 = 186.364\% $$
$$ \text{Laspeyres Quantity Index} = \frac{43000}{44000} \times 100 = 97.727\% $$
This means that the number of units of shares held decreased by roughly 2.27%
5.3. The calculation for Paasche’s Price Index is:
$$ \frac{\sum (p_1 q_1)}{\sum (p_0 q_1)} \times 100 = \frac{1271900}{1360000} \times 100 = 93.5\% $$
The calculation for Paasche’s Quantity Index is:
$$ \frac{\sum (p_1 q_1)}{\sum (p_1 q_0)} \times 100 = \frac{1271900}{1246000} \times 100 = 102.079\% $$
Unit 6: Linear Correlation and Regression
6.1 Introduction to Correlation and Regression  Introduce topic areas for the unit
6.2 Regression  Describe and understand what is meant by Correlation and Regression
6.3 Correlation  Practically apply the principles of correlation and regression to real-world problems
6.4 Summary  Summarise topic areas of the unit
Correlation - allows us to gauge the strength and direction of a relationship
Regression - observes the spread of scores to create a mathematical summary of what we think the relationship
between the two variables might be. We can use this mathematical relationship to make predictions


Western, Cengage Learning. Chapter 16, pages 600-634.
 Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses &

Conclusions: A Course in Statistics for the Social Sciences. Cape
Town: UCT Press. Tutorial 10, Pages 160-179: Tutorial 11, pages 181-
199.
6.1 Introduction to Correlation and Regression

Up to this point, we have been dealing with unpaired data, otherwise known as univariate data. When we deal
with correlation and regression, we are dealing with bivariate data. Bivariate analysis is interested in measuring
the relationship between two variables, using paired data. Paired data consist of two measurements taken for each
case, which allows us to measure the relationship between the two measures. In order to determine the strength and
direction of a relationship between an independent and dependent variable, correlation utilises a correlation
coefficient. Another measure of strength in linear relationships is the coefficient of determination, represented
by r². It gives the proportion of the variation in y that is explained by the best-fit equation.
6.2 Regression
Regression presents a refined way of analysing scatterplots, and observes the overall shape of plotted points. It
observes the spread of scores, and creates a best fitting line that can be drawn through the points on a
scatterplot. When the lines that fit the data best are straight – we refer to it as linear regression. When the best fit
line is curved – it is non-linear. A regression equation is essentially a mathematical summary of what we think the
relationship between the two variables might be. We can use this mathematical relationship to make predictions,
though not without some danger of making a mistake.
In order to fit a straight line to the data, two pieces of information are required:
– Slope
– Intercept: the point on the graph where the line crosses the y-axis
The straight line formula is:
y = a + bx
y represents the criterion (dependent) variable
x represents the predictor variable
a and b represent the two pieces of information required to fit the line (i.e. b is the slope, and a is the
intercept). They are known as the regression coefficients.
When calculating regression coefficients, we first calculate the covariance:
$$ s_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{n - 1} $$
Where:
n: the number of pairs of values
Σx: the sum of the x values
Σy: the sum of the y values
Σx²: the sum of the squares of the x values
Σxy: the sum of x multiplied by y values
These intermediate values are substituted into the covariance formula to find s_xy, and following
this, the slope, b:
$$ b = \frac{s_{xy}}{s_x^2} $$
where s_x² is the variance of x.
Having calculated b, we can find the intercept a.

The regression line passes through the midpoint of the scatterplot, which is the point defined by the mean of x
and the mean of y (x̅, ȳ).
Substitute these mean values into the general equation for a line (y = a + bx) and then rearrange to solve for a:
a = ȳ - bx̅
When calculating regression, the following steps ensue:

Step 1: Draw a scatterplot. Drawing a scatterplot first provides information on possible departures from
important assumptions, e.g. non-linearity of the relationship, strength of relationship etc.
Step 2: Draw the table
Step 3: Calculate the covariance
$$ s_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{n - 1} $$
Step 4: Calculate b, the slope coefficient
$$ b = \frac{s_{xy}}{s_x^2} $$
OR
$$ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} $$
Step 5: Calculate a, the intercept:
$$ a = \bar{y} - b\bar{x} $$
OR
$$ a = \frac{\sum y - b \sum x}{n} $$
Given the following dataset:

Score A Score B Score A Score B
27 92 46 122
27 84 35 120
23 113 25 102
23 106 27 123
27 96 41 139
18 103 28 99
26 92 44 131
15 82 31 117
22 75 44 120
34 113 45 123
44 128 39 128
19 99 30 101
26 124 28 110
17 99 45 147
30 113 42 127
32 109 30 119
20 93 33 136
40 130 39 127
20 89 36 127
38 128 37 132
Step 1: Draw a scatterplot
Step 2: Draw the table

x y xy x² y²
27 92 2484 729 8464
27 84 2268 729 7056
23 113 2599 529 12769
23 106 2438 529 11236
27 96 2592 729 9216
18 103 1854 324 10609
26 92 2392 676 8464
15 82 1230 225 6724
22 75 1650 484 5625
34 113 3842 1156 12769
44 128 5632 1936 16384
19 99 1881 361 9801
26 124 3224 676 15376
17 99 1683 289 9801
30 113 3390 900 12769
32 109 3488 1024 11881
20 93 1860 400 8649
40 130 5200 1600 16900
20 89 1780 400 7921
38 128 4864 1444 16384
46 122 5612 2116 14884
35 120 4200 1225 14400
25 102 2550 625 10404
27 123 3321 729 15129
41 139 5699 1681 19321
28 99 2772 784 9801
44 131 5764 1936 17161
31 117 3627 961 13689
44 120 5280 1936 14400
45 123 5535 2025 15129
39 128 4992 1521 16384
30 101 3030 900 10201
28 110 3080 784 12100
45 147 6615 2025 21609
42 127 5334 1764 16129
30 119 3570 900 14161
33 136 4488 1089 18496
39 127 4953 1521 16129
36 127 4572 1296 16129
37 132 4884 1369 17424
∑ 1253 4518 146229 42327 521878
Step 3: Calculate the covariance
$$ s_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{n - 1} $$
$$ s_{xy} = \frac{146229 - \frac{(1253)(4518)}{40}}{40 - 1} $$
$$ s_{xy} = 120.58 $$
Step 4: Calculate b
$$ b = \frac{40(146229) - (1253)(4518)}{40(42327) - (1253)^2} $$
b = 1.528 (1.52844 before rounding)
Step 5: Calculate a
$$ a = \frac{4518 - 1.52844(1253)}{40} $$
a = 65.072
Equation: y = 65.07 + 1.53x
How it works:
So for the scores 37, 21, and 48, it would be calculated as follows:
Score b’ = 65.07 + 1.53 x 37 = 121.68 ~ 122
Score b’ = 65.07 + 1.53 x 21 = 97.2 ~ 97
Score b’ = 65.07 + 1.53 x 48 = 138.51 ~ 139
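The slope, intercept and predictions above can be reproduced from the column totals of the table. The following is a minimal sketch; the function names `fit_line` and `predict` are illustrative assumptions, and the code keeps the unrounded coefficients, so predictions may differ in the second decimal place from the hand calculation that uses 65.07 and 1.53.

```python
def fit_line(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares slope b and intercept a from the column totals."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Totals from the worked example: n = 40 pairs of scores
a, b = fit_line(n=40, sum_x=1253, sum_y=4518, sum_xy=146229, sum_x2=42327)

def predict(x):
    """Predicted Score B for a given Score A, using y = a + bx."""
    return a + b * x

print(f"a = {a:.3f}, b = {b:.3f}")
for score_a in (37, 21, 48):
    print(score_a, "->", round(predict(score_a)))
```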
Whilst the regression line is a useful statement of the underlying trend, it tells us nothing about the
strength of the relationship. Correlation is a measure of the strength of linear association between two variables.
6.3. Correlation
Correlations, on the other hand, allow us to gauge the strength and direction of a relationship. Correlations are
calculated on the basis of how far the points lie from the ‘best-fit’ regression line. Correlations are measured
using the correlation coefficient, symbolised by the small letter r.
r will fall within the range –1 to +1:
 -1 means a perfect negative correlation (a perfect inverse relationship, where, as the value of x rises, so
the value of y falls)
 +1 means a perfect positive correlation (where the values of x and y rise or fall together)
 An r of 0 means zero correlation, which means that there is no relationship between x and y
When calculating the correlation coefficient, the formula is as follows:
$$ r = \frac{s_{xy}}{s_x s_y} $$
OR
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{(n \sum x^2 - (\sum x)^2)(n \sum y^2 - (\sum y)^2)}} $$
x is the variable on the horizontal axis
y is the variable on the vertical axis
s_x and s_y are the standard deviations of x and y, respectively
s_xy is the covariance between x and y
The resultant r score will fall between -1 and 1. As a general rule, the size of r (ignoring its sign) is
interpreted as follows:
 0.0 to 0.2 indicates a very weak to negligible correlation
 0.2 to 0.4 indicates a weak, low correlation (not very significant)
 0.4 to 0.7 indicates a moderate correlation
 0.7 to 0.9 indicates a strong, high correlation
 0.9 to 1.0 indicates a very strong correlation
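The second form of the formula needs only the column totals, so it can be sketched directly in code. The example below is an illustration, not part of the prescribed method: the function name `pearson_r` is an assumption, and the totals are those of the 40-pair regression example from section 6.2 (including Σy² = 521878 from its table).

```python
import math

def pearson_r(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson correlation coefficient computed from column totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Totals from the table in the regression worked example (n = 40)
r = pearson_r(n=40, sum_x=1253, sum_y=4518,
              sum_xy=146229, sum_x2=42327, sum_y2=521878)
print(f"r = {r:.3f}")
```

For these totals r comes out at about 0.79, which the scale above would class as a strong positive correlation.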
Activity 6.1
Given the following scores:

x y
5.77 385
6.55 321
9.9 265
7.21 256
6.37 287
6.51 309
5.77 370
1. Display the information on a scatterplot
2. Draw the table, and calculate 𝑠𝑥𝑦 and state the linear regression equation
3. Calculate r, or the correlation coefficient
6.4 Summary
This unit provided an overview of how we analyse bivariate, or paired data using correlation and regression.
Whereas regression can help to make predictions, correlations allow us to observe the relative strength and
direction of the relationship between variables. These measures are useful in order to determine the relationship
between variables, and how strong the relationship is. Just remember, though, that the existence of a
relationship does not imply causality.
Answers to activities
6.1.
1. Draw the scatterplot
[Scatterplot of y (vertical axis, 0 to 450) against x (horizontal axis, 0 to 12)]
2. Draw the table and calculate the covariance
x y xy x² y²
5.77 385 2221.45 33.2929 148225
6.55 321 2102.55 42.9025 103041
9.9 265 2623.5 98.01 70225
7.21 256 1845.76 51.9841 65536
6.37 287 1828.19 40.5769 82369
6.51 309 2011.59 42.3801 95481
5.77 370 2134.9 33.2929 136900
∑ 48.08 2193 14767.94 342.4394 701777
$$ s_{xy} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{n - 1} $$
$$ s_{xy} = \frac{14767.94 - \frac{(48.08)(2193)}{7}}{7 - 1} $$
$$ s_{xy} = -49.14 $$
Step 3: Calculate b
$$ b = \frac{7(14767.94) - (48.08)(2193)}{7(342.4394) - (48.08)^2} $$
$$ b = \frac{-2063.86}{85.3894} $$
b = -24.169
Step 4: Calculate a
$$ a = \frac{2193 - (-24.169)(48.08)}{7} $$
a = 479.29
y = a + bx
y = 479.29 - 24.169x
3. Calculate r
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{(n \sum x^2 - (\sum x)^2)(n \sum y^2 - (\sum y)^2)}} $$
$$ r = \frac{7(14767.94) - (48.08)(2193)}{\sqrt{(7(342.44) - (48.08)^2)(7(701777) - (2193)^2)}} $$
r = −0.695
Revision Questions
1. You are interested in investigating the relationship between hours spent studying, and test performance. You
collect data from 5 of your fellow students, and it looks as follows:
Hours spent studying    Test performance
8 65
2 30
4 45
12 70
9 67
1.1. Draw a scatterplot depicting the relationship between the variables

1.2. Calculate the Pearson’s correlation coefficient
1.3. Determine the linear regression equation
2. An estate agent is interested in determining the relationship between the average distance a house (in KM’s)
is from the Durban CBD, and the average rent paid (in thousands of rands). He samples houses from five
different sub-areas throughout Durban’s surrounds.
Distance from CBD (km)    Average rent (R thousands)
20 12
16 10
30 15
25 16
34 22
2.1. Draw a scatterplot depicting the relationship between the variables

2.2. Calculate the Pearson’s correlation coefficient
2.3. Determine the linear regression equation
3. What is the difference between correlation and Regression?

1.1.
1.2.
x y xy x2 y2
8 65 520 64 4225
2 30 60 4 900
4 45 180 16 2025
12 70 840 144 4900
9 67 603 81 4489
35 277 2203 309 16539
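$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{(n \sum x^2 - (\sum x)^2)(n \sum y^2 - (\sum y)^2)}} $$
$$ = \frac{(5 \times 2203) - (35)(277)}{\sqrt{(5 \times 309 - 35^2)(5 \times 16539 - 277^2)}} = 0.955 $$
1.3.
$$ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{(5 \times 2203) - (35)(277)}{5 \times 309 - 35^2} = 4.125 $$
$$ a = \frac{\sum y - b \sum x}{n} = \frac{277 - (4.125 \times 35)}{5} = 26.525 $$
Regression equation: y = a + bx = 26.525 + 4.125x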
2.1.
x y xy x2 y2
20 12 240 400 144
16 10 160 256 100
30 15 450 900 225
25 16 400 625 256
34 22 748 1156 484
125 75 1998 3337 1209
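2.2.
$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{(n \sum x^2 - (\sum x)^2)(n \sum y^2 - (\sum y)^2)}} $$
$$ = \frac{(5 \times 1998) - (125)(75)}{\sqrt{(5 \times 3337 - 125^2)(5 \times 1209 - 75^2)}} = 0.9217 $$
2.3.
$$ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{(5 \times 1998) - (125)(75)}{5 \times 3337 - 125^2} = 0.5802 $$
$$ a = \frac{\sum y - b \sum x}{n} = \frac{75 - (0.5802 \times 125)}{5} = 0.4953 $$
Regression equation: y = a + bx = 0.4953 + 0.5802x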
3. The difference between correlation and regression is that, although they both deal with bivariate data and the
relationship between the two variables, regression uses the relationship to make predictions, whereas correlation
allows you to observe the strength and direction of the relationship.
Unit 7: Time Series Forecasting
7.1 Introduction to Time Series and Forecasting  Introduce topic areas for the unit
7.2 Decomposition and Smoothing  Determine the need for, and uses of, time series forecasts
 Account for variation by looking at the four underlying contributors
 Define and understand smoothing
 Competently handle and calculate smoothing techniques
7.3 Seasonal Analysis  Understand the need for, and purpose of, seasonal decomposition
 Calculate seasonal demands
7.4 Summary  Summarises the topic areas for the unit
Exponential Smoothing - allows us to calculate a “smoothed average” which consists of two parts:
– The most recent demand (new information) and
– The historical smoothed average (old information)
Moving average - removes the short-term fluctuations in a time series by taking successive averages of groups
of observations
Seasonal Analysis - deseasonalising data by removing seasonal fluctuations or patterns in the data in order to
make predictions about potential future values


Western, Cengage Learning. Chapter 18 pages 688-715.
 Brownlee, J. (2016). What is Time Series Forecasting?

https://machinelearningmastery.com/time-series-forecasting/
 Yaffee, R.A. & McGee, M. Introduction to Time Series Analysis and
Forecasting: https://core.ac.uk/download/pdf/44191640.pdf
7.1 Introduction to Times Series and Forecasting

The ability to make predictions and forecasts with a degree of certainty is pivotal for business, and for
making business-related decisions. It facilitates our ability to plan, provided we have taken into consideration
the types of data and the trends across them. Invariably, some forecasting methods are better suited
than others, depending on what type of trends exist across the data. Whereas we have been dealing with both
univariate and bivariate data, time series analysis looks at longitudinal data, or data that has some degree of
temporality. Time series analysis uses different models to predict future values based on past observed values.
Take, for example, the petrol price. Based on a sequence of well-defined data points measured at consistent time
intervals over a period of time, a time series analysis can be used to extract meaningful statistics and
characteristics about the data, and to make future predictions regarding the price of petrol.
Figure 7.1 Graphical representation of a Time Series
Forecast
Model                        Q1 2016  Q2 2016  Q3 2016  Q4 2016  Q1 2017  Q2 2017  Q3 2017  Q4 2017  Q1 2018
VAR00002-Model_1  Forecast   50.37    50.76    51.14    51.53    51.91    52.30    52.68    53.07    53.45
                  UCL        51.92    52.63    53.43    54.32    55.28    56.29    57.35    58.46    59.61
                  LCL        48.83    48.89    48.86    48.74    48.55    48.31    48.02    47.68    47.30
For each model, forecasts start after the last non-missing in the range of the requested estimation period, and end at the last period for which non-missing values of all the predictors are available or at the end date of the requested forecast period, whichever is earlier.
Table 7.1. Table illustrating the forecasted figures
As you can see from the above graphical model utilising the Holt model, and the associated forecast table, SPSS used the existing data collected every quarter from 2004 to 2015 in order to make forecasts up to the first quarter of 2018. It also produced the upper and lower confidence limits (UCL and LCL), but we are most interested in the actual forecasted figures. Using those forecasts, we are then able to make decisions regarding expected growth, resource requirements to accommodate the growth, and so forth. This is a typical example of a time series. Time series analysis assumes that the actual values of a random variable in a time series are influenced by a variety of environmental forces operating over time. As such, it attempts to isolate and quantify the influence of these different environmental forces on the time series as a number of different components. Four underlying forces individually and collectively determine the random variable's value:
– Trend (T)
– Cyclical Variations (C)
– Seasonal Variations (S)
– Random (irregular) variation (R)
Each of these accounts for a type of variation that causes a fluctuation in the data.
7.1.1 Trend
Trend is denoted T, and is defined as the long-term, smooth underlying movement in a time series. It describes the effect that long-term factors have on the series. These long-term factors tend to operate fairly gradually and in one direction for a long period of time, lasting longer than a year. Trends may be linear or non-linear.
Figure 7.2. Graphical representation of trend
7.1.2. Cyclical Variation (C)

Cycles are medium to long term deviations from the trend. They reflect alternating periods of relative expansion
and contraction. They are wave-like, quasi-regular movements in a time series that can vary in duration and
amplitude, and last longer than a year.
Figure 7.3 Graphical representation of cyclical variation
7.1.3. Seasonal Variation (S)

Seasonal variations are fluctuations that are repeated periodically, usually within a year (i.e. daily, weekly, monthly or quarterly), and are attributable to the effect of the seasons of the year or to systematic patterns within each week or within each day. Seasonal fluctuations are regular, wave-like fluctuations of more or less constant length, caused by recurring events such as climatic conditions, special events (e.g. Easter, Christmas) and religious, public and school holidays. Each pattern completes itself within a year.
Figure 7.4 Graphical representation of seasonal variation
7.1.4. Random (irregular) Variation (R)

These are caused by unpredictable occurrences, which may or may not be evident. They are deviations from the underlying pattern of the observed time series. Examples of evident events include natural disasters such as floods, droughts or fires, and man-made disruptions such as strikes and protests.
The four components of the time series (trend, seasonal, cyclical and random variations) combine in different ways. Using time series analysis, we try to isolate the influence of each of the four components in the series.
There are two models for doing so: the additive and the multiplicative models.
Additive:
Y = T + S + C + R
In additive models, the seasonal, cyclical and random variations are absolute deviations from the trend, and do not depend on the level of the trend.
Multiplicative:
Y = T × S × C × R
In multiplicative models, the seasonal, cyclical and random variations are relative deviations from the trend, thus, the higher the trend, the more intensive the variations.
To illustrate:
Figure 7.5. Illustration of the additive model
Figure 7.6. Illustration of the multiplicative model
Both of these time series have a general upward trend, but the fluctuations in the additive model have roughly the same intensity throughout, whereas the fluctuations in the multiplicative model become increasingly intense as the trend rises.
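The difference between the two models can be sketched numerically. The snippet below is only an illustration: the trend and seasonal values are hypothetical (they do not come from this guide), and the cyclical and random components are left at their neutral values (0 in the additive model, 1 in the multiplicative model) to keep the comparison simple.

```python
# Hypothetical trend values for four periods (illustrative only).
trend = [100, 110, 120, 130]

# Additive model: the seasonal effect is an absolute amount added to the trend.
seasonal_add = [10, -10, 10, -10]
additive = [t + s for t, s in zip(trend, seasonal_add)]

# Multiplicative model: the seasonal effect is a proportion of the trend.
seasonal_mult = [1.10, 0.90, 1.10, 0.90]
multiplicative = [t * s for t, s in zip(trend, seasonal_mult)]

# Size of the swing around the trend in each period.
additive_swing = [abs(y - t) for y, t in zip(additive, trend)]
multiplicative_swing = [abs(y - t) for y, t in zip(multiplicative, trend)]

print("Additive swings:      ", additive_swing)
print("Multiplicative swings:", [round(s, 2) for s in multiplicative_swing])
```

In the additive series the swing stays the same size in every period, while in the multiplicative series it grows as the trend grows.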
There are generally two types of methods used to identify the underlying pattern, namely smoothing and decomposition.
7.2 Decomposition and Smoothing

Smoothing allows us to reduce the random fluctuations so as to more easily identify the other components.
Smoothing techniques include Moving Averages, and Exponential Smoothing. Some texts have cited Regression
analysis as another method to isolate trends.
7.2.1. Moving Average

The moving average removes the short-term fluctuations in a time series by taking successive averages of
groups of observations.
Let’s say we sold 30 fidget spinners during the month of September. We want to estimate what our sales will be for October. Our best guess might be that we will sell 30 fidget spinners during October; in effect we have used a “one-month moving average” as our forecast. When we want to forecast for November, we may want to take into account what happened during both September and October. Let’s say we had sales of 40 during October. If we took a two-month moving average, our forecast for November would be:
FC_{Nov} = \frac{Actual_{Sept} + Actual_{Oct}}{2}
FC_{Nov} = \frac{30 + 40}{2}
FC_{Nov} = 35
If we sold 30 in November, we will have three types of moving averages available to us when we try to make the next prediction. The forecast for December would thus be:
Option 1: 1 month moving average
FC_{Dec} = Actual_{Nov} = 30
Option 2: 2 month moving average
FC_{Dec} = \frac{Actual_{Oct} + Actual_{Nov}}{2} = \frac{40 + 30}{2} = 35
Option 3: 3 month moving average
FC_{Dec} = \frac{Actual_{Sept} + Actual_{Oct} + Actual_{Nov}}{3} = \frac{30 + 40 + 30}{3} = 33.3
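The three December options can be reproduced with a short sketch. The function name `moving_average_forecast` is an illustrative choice rather than something defined in this guide; it simply averages the most recent actual values.

```python
def moving_average_forecast(actuals, window):
    """Forecast the next period as the mean of the last `window` actual values."""
    if window < 1 or window > len(actuals):
        raise ValueError("window must be between 1 and the number of observations")
    recent = actuals[-window:]
    return sum(recent) / window


# Actual sales for September, October and November from the example above.
sales = [30, 40, 30]

for window in (1, 2, 3):
    forecast = moving_average_forecast(sales, window)
    print(f"{window}-month moving average forecast for December: {forecast:.1f}")
```

The same function can be applied to a longer series by passing the observations up to the month before the one being forecast.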
Activity 7.1.
7.1.1. You are trying to estimate the number of earphones using the moving averages method. From the following amounts, draw a table to estimate the figures using 1, 3 and 5 month moving averages, then draw and comment on the graph:
Months Earphones
1 30
2 42
3 50
4 43
5 42
6 48
7 52
8 36
9 40
10 37
11 40
12 53
13 52
14 51
15 55
16 58
17 52
18 53
19 59
20 61
7.1.2. Can you see any problems inherent in this technique?
The moving average technique has the advantage that it is simple to use and easy to understand. Two of its
major disadvantages, however, are:
• The forecast always lags the actual

• No account is taken of the error in previous forecasts
7.2.2. Exponential Smoothing

The exponential smoothing technique allows us to calculate a “smoothed average” which consists of two parts:
– The most recent demand (new information) and
– The historical smoothed average (old information)
The formula for exponential smoothing is:
F_{t+1} = F_t + a (D - F_t)
Where:
F_{t+1} = Forecast for the next period (t + 1)
F_t = Forecast for the latest period (t)
a = Smoothing coefficient (alpha)
D = Actual demand for period t
The objective is to select a suitable value for the smoothing coefficient. The error measure used to test each candidate value is the Mean Absolute Deviation (MAD), calculated as:
MAD = \frac{\sum_{t=1}^{n} |D - F|}{n}
Where:
|D - F| = absolute value of the error
n = number of periods reviewed
Let’s say you estimated that you would sell 100 fidget spinners, but actually only sold 90. You chose the smoothing coefficient to equal 0.2.
F_t = 100
a = 0.2
D = 90
F_{t+1} = F_t + a (D - F_t)
F_{t+1} = 100 + 0.2 (90 - 100) = 98
MAD = \frac{\sum_{t=1}^{n} |90 - 100|}{1}
MAD = 10
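As a rough check of the arithmetic in this example, the single smoothing step and the MAD can be written out as below. The helper names `smooth_next` and `mean_absolute_deviation` are hypothetical, chosen here for illustration.

```python
def smooth_next(forecast, demand, alpha):
    """Return F(t+1) = F(t) + alpha * (D - F(t))."""
    return forecast + alpha * (demand - forecast)


def mean_absolute_deviation(demands, forecasts):
    """Average of the absolute errors |D - F| over the periods reviewed."""
    errors = [abs(d - f) for d, f in zip(demands, forecasts)]
    return sum(errors) / len(errors)


# Forecast of 100, actual demand of 90, smoothing coefficient of 0.2.
next_forecast = smooth_next(forecast=100, demand=90, alpha=0.2)
mad = mean_absolute_deviation(demands=[90], forecasts=[100])

print(f"Next forecast: {next_forecast:.0f}")
print(f"MAD: {mad:.0f}")
```

Because actual demand fell short of the forecast, the next forecast is pulled downward by 20% of the shortfall.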
The best way to forecast using different coefficients is to create a table. Given the following dataset:
Month Actual Demand
1 22
2 18
3 23
4 21
5 17
6 24
7 20
8 19
9 18
10 21
Step 1: Create a first forecast. We usually do this using the first known measure.
Step 2: For each subsequent period, take the previous forecast and add to it alpha multiplied by the difference between the previous actual demand and the previous forecast: F_{t+1} = F_t + a (D - F_t)
Step 3: Work out the error ( | D – F | )
Step 4: Square the error
Step 5: Calculate the mean of the squared errors
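The five steps can be sketched as a small table-building routine applied to the ten-month dataset above. This is a minimal sketch under stated assumptions: the function name `smoothing_table` and the optional `first_forecast` argument are illustrative, and the first forecast defaults to the first known demand as Step 1 suggests, so the month 1 error is zero and is included in the mean.

```python
def smoothing_table(demands, alpha, first_forecast=None):
    """Return rows of (month, demand, forecast, error, squared error)
    plus the forecast for the period after the last observation."""
    # Step 1: default the first forecast to the first known demand.
    forecast = demands[0] if first_forecast is None else first_forecast
    rows = []
    for month, demand in enumerate(demands, start=1):
        error = abs(demand - forecast)                     # Step 3
        rows.append((month, demand, forecast, error, error ** 2))  # Step 4
        forecast = forecast + alpha * (demand - forecast)  # Step 2
    return rows, forecast


demand = [22, 18, 23, 21, 17, 24, 20, 19, 18, 21]
rows, next_forecast = smoothing_table(demand, alpha=0.2)

# Step 5: mean of the squared errors (includes the zero error for month 1).
mean_squared_error = sum(row[4] for row in rows) / len(rows)

print("Month Demand Forecast  Error Squared")
for month, actual, forecast, error, squared in rows:
    print(f"{month:>5} {actual:>6} {forecast:>8.2f} {error:>6.2f} {squared:>7.2f}")
print(f"Forecast for month {len(demand) + 1}: {next_forecast:.2f}")
print(f"Mean squared error: {mean_squared_error:.2f}")
```

Running the routine with several values of alpha and comparing the resulting error measures is one way to choose between candidate coefficients.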
Activity 7.2.
Using exponential smoothing, calculate the forecast with alpha = 0.2 for the following values, starting from an initial forecast of 70 for month 1, and include the forecast for the 11th month:
Month Actual Demand

1 20
2 30
3 36
4 40
5 42
6 41
7 38
8 42
9 43
10 45
7.3 Seasonal Analysis
To date, we have covered two ways in which we can separate the various elements (trend, cyclical, seasonal and random) in data in order to discover underlying trends. Seasonal indices are one way in which we can deseasonalise data, removing seasonal fluctuations or patterns in order to make predictions about potential future values. We develop a seasonal index as the ratio of the demand for a particular season to the demand for an average season.
If demand is 100 units in an average season and demand for the summer season is 80, the summer season index is 80 / 100 = 0.8.
There are four seasons in a year. Thus, the mean seasonal demand is calculated using a four-period moving average that is centred on the middle of a given season, that is, a month and a half into the season. It includes demands going back six months and forward six months from that point.
You own a beach store in Scottburgh, and you sell mainly beachwear, especially costumes. You are interested in determining how many costumes you can expect to sell going forward into 2019, and look back on sales data since the inception of your store in 2015.
Season Actual Demand
Spring 2015 100

Summer 2015 120
Autumn 2015 80
Winter 2015 78
Spring 2016 88
Summer 2016 105
Autumn 2016 89
Winter 2016 81
Spring 2017 87
Summer 2017 96
Autumn 2017 90
Winter 2017 78
Spring 2018 88
Summer 2018 102
Autumn 2018 98
Winter 2018 87
Using the data presented above, we can create a forecast to 2019.
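Step 1: Create a mean seasonal demand by adding 1/2 of the first season + the middle 3 seasons + 1/2 of the last season, and dividing the total by 4. Your first mean will fall in the middle season, half-way through the seasonal cycle, e.g.
Mean seasonal demand = \frac{\frac{1}{2} Spring + Summer + Autumn + Winter + \frac{1}{2} Spring}{4}
This spans one full seasonal cycle of four seasons. For example, the first mean, which falls in Autumn 2015, is (50 + 120 + 80 + 78 + 44) / 4 = 93.
Calculate as follows: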
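Step 1 can be sketched as a centred moving average over the costume sales data. The function name `mean_seasonal_demand` is hypothetical; the weighting (half of the first and last seasons, all of the middle three, divided by four) follows the description above.

```python
def mean_seasonal_demand(demands, seasons_per_year=4):
    """Centred moving average aligned with `demands`.

    Each mean uses half of the first season, all of the middle seasons and
    half of the last season in a window of seasons_per_year + 1 values.
    Positions without a full window are left as None.
    """
    half = seasons_per_year // 2
    means = [None] * len(demands)
    for i in range(half, len(demands) - half):
        window = demands[i - half:i + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        means[i] = total / seasons_per_year
    return means


# Spring 2015 to Winter 2018, in the order given in the table above.
costumes = [100, 120, 80, 78, 88, 105, 89, 81, 87, 96, 90, 78, 88, 102, 98, 87]
means = mean_seasonal_demand(costumes)

print("Autumn 2015 mean seasonal demand:", means[2])
print("Winter 2015 mean seasonal demand:", means[3])
```

The first two and last two seasons have no mean because a full window is not available at either end of the series.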
Step 2: Create a seasonal index by dividing the actual demand by the mean seasonal demand.
Step 3: Rearrange the data to a single index per season. You do this by adding together the index scores for each season and dividing by the number of index scores available for that season. In this case, it was 3. In other words, you average the index scores to generate an index for the year you wish to forecast. The resultant table looks as follows:
        Spring    Summer    Autumn    Winter
2015                        0.860215  0.870293
2016    0.990155  1.161826  0.982069  0.906294
2017    0.984441  1.089362  1.024182  0.878873
2018    0.972376  1.101215
2019    0.982324  1.117467  0.955489  0.885153
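Steps 2 and 3 can be checked with a few lines. The sketch below takes the index values straight from the table above rather than recomputing them, so the dictionary layout and variable names are illustrative assumptions only.

```python
# Step 2, one cell as an example: Autumn 2015 demand of 80 divided by
# its mean seasonal demand of 93.
autumn_2015_index = 80 / 93

# Step 3: the three index scores available for each season (2015 to 2018).
indices = {
    "Spring": [0.990155, 0.984441, 0.972376],
    "Summer": [1.161826, 1.089362, 1.101215],
    "Autumn": [0.860215, 0.982069, 1.024182],
    "Winter": [0.870293, 0.906294, 0.878873],
}

# Average the scores per season to obtain the 2019 index row.
index_2019 = {season: sum(scores) / len(scores) for season, scores in indices.items()}

print(f"Autumn 2015 index: {autumn_2015_index:.6f}")
for season, value in index_2019.items():
    print(f"{season} 2019 index: {value:.6f}")
```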
Step 4: Based on the mean of previous years’ demands, you estimate that you are going to sell roughly 356 costumes in 2019, an average of 89.94 costumes a season, which rounds to 90.
You then multiply this average number of costumes per season by each season's index number to give it a weighting.
The resultant forecasts for 2019 are as follows:

              Weighted average   TOTAL 2019
              per season         (rounded up)
Spring 2019   88.40915           89
Summer 2019   100.5721           101
Autumn 2019   85.99398           86
Winter 2019   79.66379           80
                                 356
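The weighting in Step 4 is a single multiplication per season. The sketch below assumes the rounded average of 90 costumes per season and the 2019 index row from Step 3; rounding up to whole costumes is inferred from the totals shown in the table, and the variable names are illustrative.

```python
import math

# 2019 seasonal indices from Step 3 and the rounded average per season.
index_2019 = {
    "Spring": 0.982324,
    "Summer": 1.117467,
    "Autumn": 0.955489,
    "Winter": 0.885153,
}
average_per_season = 90

# Weight the average season by each seasonal index.
forecast = {season: average_per_season * idx for season, idx in index_2019.items()}

# Round up to whole costumes, as the table does.
rounded = {season: math.ceil(value) for season, value in forecast.items()}

for season in forecast:
    print(f"{season} 2019: {forecast[season]:.4f} -> {rounded[season]}")
print("Total for 2019:", sum(rounded.values()))
```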
Activity 7.3.
You are the owner of a painting company. Painting is best done when conditions are dry, and
thus favours Autumn/Winter seasons. You have owned the business since 2015, and wish to
make projections to 2019 for paint sales. Using a seasonal analysis, calculate the forecasted
seasonal paint estimates for 2019.
Season Actual Demand
Spring 2015 110

Summer 2015 112
Autumn 2015 256
Winter 2015 300
Spring 2016 120
Summer 2016 134
Autumn 2016 274
Winter 2016 321
Spring 2017 115
Summer 2017 120
Autumn 2017 265
Winter 2017 301
Spring 2018 112
Summer 2018 134
Autumn 2018 256
Winter 2018 294
When we are dealing with quarters, rather than actual seasons, it is not imperative that we deal with half seasons as per the examples above. If we plot the data, and it is apparent that there is a pattern in the data that occurs repeatedly in each quarter every year, seasonality is present. For example:
Table 7.2 Table of demand quarterly, for three consecutive years
Year     Quarter     Demand
Year 1   Quarter 1   145
Year 1   Quarter 2   185
Year 1   Quarter 3   132
Year 1   Quarter 4   94
Year 2   Quarter 1   140
Year 2   Quarter 2   190
Year 2   Quarter 3   135
Year 2   Quarter 4   90
Year 3   Quarter 1   145
Year 3   Quarter 2   188
Year 3   Quarter 3   130
Year 3   Quarter 4   95
Figure 7.7 Graphical representation of seasonal demand
You can see that the data varies with the same or a similar pattern annually. In other words, it dips every fourth quarter and peaks every second quarter in each year from year 1 to year 3. In these instances, you do not have to find the centred mean seasonal demand, but rather the mean for each year, which simply involves averaging quarters 1 to 4 of that year.
Year Quarter Actual Demand

Year 1 Quarter 1 72
Year 1 Quarter 2 64
Year 1 Quarter 3 63
Year 1 Quarter 4 75
Year 2 Quarter 1 75
Year 2 Quarter 2 66
Year 2 Quarter 3 64
Year 2 Quarter 4 89
Year 3 Quarter 1 76
Year 3 Quarter 2 68
Year 3 Quarter 3 67
Year 3 Quarter 4 95
If you were to plot it using a scatterplot:
[Scatterplot: Actual Demand per quarter, Year 1 to Year 3]
You can see a general pattern: the data peaks around quarter 4 and dips around quarter 3. There is a similar pattern across the four quarters in years 1 to 3. The data is seasonal.
Step 1: Find the average or mean for each year
Year     Average
Year 1   68.5
Year 2   73.5
Year 3   76.5
Step 2: Work out the proportion or index for each quarter by dividing the actual demand by the average/mean demand for that year
This yields an index for each quarter for each year:
Year Quarter Actual Demand Seasonal Index
Year 1 Quarter 1 72 1.051094891
Year 1 Quarter 2 64 0.934306569
Year 1 Quarter 3 63 0.919708029
Year 1 Quarter 4 75 1.094890511
Year 2 Quarter 1 75 1.020408163
Year 2 Quarter 2 66 0.897959184
Year 2 Quarter 3 64 0.870748299
Year 2 Quarter 4 89 1.210884354
Year 3 Quarter 1 76 0.993464052
Year 3 Quarter 2 68 0.888888889
Year 3 Quarter 3 67 0.875816993
Year 3 Quarter 4 95 1.241830065
You can see that any number over 1 indicates demand above that year's average for that quarter.
Step 3: Calculate the overall seasonal index for each quarter by averaging that quarter's index across the three years

Overall seasonal index for each quarter
Q1         Q2         Q3         Q4
1.021656   0.907052   0.888758   1.182535
Your seasonal indices will always add up to the number of time periods, i.e. 4 for quarters, 12 for months, etc.
For example, if we add the seasonal indices above, they tally to four.
Step 4: Calculate the deseasonalised values. This is done by dividing the actual value by the
seasonal index.
To yield the following scores:

Year Q1 Q2 Q3 Q4
Year 1 70 71 71 63
Year 2 73 73 72 75
Year 3 74 75 75 80
When you plot the seasonal data against the deseasonalised data, it looks as follows:
[Scatterplot of Seasonal Data: Seasonal Demand and Deseasonalised demand]
From the graph above it is evident that once you remove the seasonality, the line is a lot smoother. While the data retains its seasonality, there are rather large fluctuations, with high peaks and deep troughs. When you take the seasonality out of the data, you get to see the overall trend. From the above, you can see a slight upward trend, noting a dip in quarter 4, but only in the first year. This means that sales at that time of the year were relatively low.
Step 5: Calculate the forecast for the first quarter in the fourth year, assuming the expected sales for the year will be 312. The average per quarter is 312 / 4 = 78, which is multiplied by the overall seasonal index for quarter 1:
78 × 1.021656
= 79.68914476
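Steps 1 to 5 of the quarterly method can be sketched end to end as follows. The variable names are illustrative, and the sketch assumes, as in Step 5, that the expected annual total of 312 is spread evenly across the four quarters before the seasonal index is applied.

```python
# Actual demand per quarter (Q1 to Q4) for the three years in the example.
demand = {
    "Year 1": [72, 64, 63, 75],
    "Year 2": [75, 66, 64, 89],
    "Year 3": [76, 68, 67, 95],
}

# Step 1: mean demand for each year.
yearly_mean = {year: sum(q) / len(q) for year, q in demand.items()}

# Step 2: seasonal index per quarter = actual demand / that year's mean.
index = {year: [d / yearly_mean[year] for d in q] for year, q in demand.items()}

# Step 3: overall seasonal index = average of each quarter's index across years.
overall = [sum(index[year][i] for year in demand) / len(demand) for i in range(4)]

# Step 4: deseasonalised values = actual demand / overall seasonal index.
deseasonalised = {
    year: [round(d / overall[i]) for i, d in enumerate(q)]
    for year, q in demand.items()
}

# Step 5: forecast for year 4 from an expected annual total of 312.
expected_total = 312
forecast = [expected_total / 4 * s for s in overall]

print("Yearly means:", yearly_mean)
print("Overall indices:", [round(s, 6) for s in overall])
print("Sum of indices:", round(sum(overall), 6))
print("Deseasonalised:", deseasonalised)
print("Year 4 forecast per quarter:", [round(f, 2) for f in forecast])
```

The first element of the forecast list corresponds to the quarter 1 figure worked out in Step 5; the remaining elements apply the same calculation to quarters 2 to 4.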
Activity 7.4.
You are given the following dataset:
Year     Quarter     Demand
Year 1   Quarter 1   145
Year 1   Quarter 2   185
Year 1   Quarter 3   132
Year 1   Quarter 4   94
Year 2   Quarter 1   140
Year 2   Quarter 2   190
Year 2   Quarter 3   135
Year 2   Quarter 4   90
Year 3   Quarter 1   145
Year 3   Quarter 2   188
Year 3   Quarter 3   130
Year 3   Quarter 4   95
1. Is this data seasonal? If so, why?
2. Calculate the mean for each year
3. Calculate the Seasonal Index for each year, for every quarter
4. Calculate the average seasonal index of each quarter
5. Let’s say the predicted value for year 4 was expected to be around 570, calculate the predicted values for each quarter.
7.4 Summary
This final unit explored the various ways in which we can analyse temporal data, often to make predictions based on past values. It demonstrated the various ways in which we can decompose data in order to discover underlying trends. This included seasonal decomposition for seasonal data. Time series forecasting presents a very powerful way of analysing and using historical data in order to plan for the future.
Answers to activities
7.1.1.
Months   Earphones   Moving Average
                     1 Month   3 Month   5 Month
1        30
2        42          30
3        50          42
4        43          50        41
5        42          43        45
6        48          42        45        41
7        52          48        44        45
8        36          52        47        47
9        40          36        45        44
10       37          40        43        44
11       40          37        38        43
12       53          40        39        41
13       52          53        43        41
14       51          52        48        44
15       55          51        52        47
16       58          55        53        50
17       52          58        55        54
18       53          52        55        54
19       59          53        54        54
20       61          59        55        55
[Chart: Moving averages at months 1, 3 and 5, plotted for months 1 to 20]
7.1.2. The problem with moving averages is that they will always lag. There is no estimate until month 2 for the 1-month, month 4 for the 3-month and month 6 for the 5-month moving average. This means that there will be no forecast available for months 1 to 5 for the 5-month moving average and, similarly, no forecast for months 1 to 3 for the 3-month moving average.
The more months included in the estimate, the smoother the curve.
7.2.1.
Month   Actual Demand   Exp Smoothing (alpha = 0.2)   Error    Squared
1       20              70
2       30              60.00                         -30.00   900.00
3       36              54.00                         -18.00   324.00
4       40              50.40                         -10.40   108.16
5       42              48.32                         -6.32    39.94
6       41              47.06                         -6.06    36.68
7       38              45.84                         -7.84    61.54
8       42              44.28                         -2.28    5.18
9       43              43.82                         -0.82    0.67
10      45              43.66                         1.34     1.80
11                      43.93
7.3.
Season Actual Demand Mean Seasonal Demand Seasonal Index
Spring 2015 110
Summer 2015 112
Autumn 2015 256 195.75 1.307790549
Winter 2015 300 199.75 1.501877347
Spring 2016 120 204.75 0.586080586
Summer 2016 134 209.625 0.639236732
Autumn 2016 274 211.625 1.29474306
Winter 2016 321 209.25 1.534050179
Spring 2017 115 206.375 0.557238038
Summer 2017 120 202.75 0.591861899
Autumn 2017 265 199.875 1.325828643
Winter 2017 301 201.25 1.495652174
Spring 2018 112 201.875 0.554798762
Summer 2018 134 199.875 0.670419012
Autumn 2018 256
Winter 2018 294
        Spring     Summer     Autumn     Winter
2015                          1.307791   1.501877
2016    0.586081   0.639237   1.294743   1.53405
2017    0.557238   0.591862   1.325829   1.495652
2018    0.554799   0.670419
2019    0.566039   0.633839   1.309454   1.510527
Estimated projection
Spring 2019   114.0569   115
Summer 2019   127.7186   128
Autumn 2019   263.855    264
Winter 2019   304.3711   305
                         812
7.4.1. You can see a general pattern: the data peaks around quarter 2 and dips around quarter 4. There is a similar pattern across the four quarters in years 1 to 3. The data is seasonal.
Year     Year 1   Year 2   Year 3
Q1       145      140      145
Q2       185      190      188
Q3       132      135      130
Q4       94       90       95
TOTAL    556      555      558
AVG      139      138.75   139.5

Seasonal Index   Year 1   Year 2   Year 3   Average
Q1               1.0432   1.0090   1.0394   1.0305
Q2               1.3309   1.3694   1.3477   1.3493
Q3               0.9496   0.9730   0.9319   0.9515
Q4               0.6763   0.6486   0.6810   0.6686
TOTAL                                       4.0000
Year 4 Quarter 1 146.8510

Revision Questions
1. You are a distributor of pop grips, and wish to determine how many pop-grips you will need to order from
China. You decide to use the moving average in order to estimate different numbers using a 1 month, 3
month and 5 month moving average.
Months Pop-Grips
1 30
2 42
3 50
4 43
5 42
6 48
7 52
8 36
9 40
10 37
11 40
12 53
13 52
14 51
15 55
1.1. Draw a temporal chart to illustrate the trend.

1.2. Draw a table to estimate the 1, 3, and 5 month moving average.
2. Name and elaborate on the components of a time series.
1.1.
[Chart: Pop grips as per 1, 3, 5 month moving average]
1.2.
Trend (T)
Trend is defined as a long-term smooth underlying movement in time series. It describes the effect that long-term
factors have on the series. These long-term factors tend to operate fairly gradually and in one direction for a long
period of time.
Cyclical Fluctuation (C)

Cycles are medium to long term deviations from the trend. They reflect alternating periods of relative expansion
and contraction. They are wave like movements in a time series that can vary greatly in duration and amplitude.
They are difficult to measure statistically and their use in statistical forecasting is limited.
The most common form of cycle is the business cycle, which moves between periods of relatively good and relatively poor economic activity. The causes of these are difficult to determine. Actions by government, trade unions and world organisations induce levels of pessimism and optimism in the economy, which are reflected in changes in the time series levels. Index numbers are used to describe cyclical fluctuations.
Seasonal Variations (S)

Seasonal variations are fluctuations that are repeated periodically, usually within a year (i.e. daily, weekly,
monthly or quarterly). They are readily isolated through statistical analysis. Seasonal fluctuations are caused by
re-occurring events such as climatic conditions, special occurring events (e.g. Easter, Christmas) and religious,
public and school holidays.
Random Fluctuations (R)

These are caused by unpredictable occurrences, which may be evident or sometimes not so evident. Examples
of evident events are natural disasters such as floods, droughts or fires and man-made disasters such as strikes
or boycotts.
These variations follow no specific pattern, and cannot be analysed statistically, and thus cannot be incorporated
into forecasts.
References
Brownlee, J. (2016). What is Time Series Forecasting? Accessed September 26, 2018, from:
https://machinelearningmastery.com/time-series-forecasting/
Durrheim, K., and Tredoux, C. (2002). Numbers, Hypotheses & Conclusions: A Course in Statistics for the
Social Sciences. Cape Town: UCT Press.
Durrheim, K., and Tredoux, C. (2012). Numbers, Hypotheses & Conclusions: A Course in Statistics for the
Social Sciences. (2nd ed.). Cape Town: UCT Press.
GAO. (1992). Quantitative Data Analysis: An Introduction. Accessed February 4th, 2018, from:
http://archive.gao.gov/t2pbat6/146957.pdf
Goodridge, P. (2007). Methods explained: Index Numbers. Available from:
https://www.ons.gov.uk/ons/rel/elmr/economic-and-labour-market-review/no--3--march-2007/methods-explained--index-numbers.pdf
Groebner, D.F., Shannon, P.W., Fry, P.C., and Smith, K.D. (2011). Business Statistics: A Decision-Making
Approach. 8th ed. Boston: Prentice Hall.
Wegner, R. (2012). Applied Business Statistics Method and Excel-Based Applications. (3rd ed.). Cape Town: Juta
& Company Limited.
Wegner, T. (2015). Applied Business Statistics. (4th ed.). Cape Town: Juta.
Weiers, R. M. (2011). Introduction to Business Statistics. 7th ed. South Western, Cengage Learning. Chapter 18
pages 688-715.
Yaffee, R.A. & McGee, M. Introduction to Time Series Analysis and Forecasting:
https://core.ac.uk/download/pdf/44191640.pdf
APPENDICES
APPENDIX 1 – z-table
Source: Tredoux, Durrheim (2002, p. 485)