BACHELOR OF BUSINESS
ADMINISTRATION
MODULE GUIDE
Copyright © 2022
REGENT BUSINESS SCHOOL
All rights reserved; no part of this book may be reproduced in any form or by any means, including
photocopying machines, without the written permission of the publisher.
Table of Contents
BIBLIOGRAPHY
Figure 5.1: Scatter plot for the number of Hi-Fi sold and the number of
advertisements placed in local newspapers
Figure 5.2: Fitted regression line for the number of Hi-Fi sold and the number of
advertisements placed in local newspapers
Figure 5.3: Interpretation of correlation coefficients
Figure 5.4: Perfect positive correlation
Figure 5.5: Perfect negative correlation
Figure 5.6: Positive (direct) linear correlation
Figure 5.7: Negative (indirect) linear correlation
Figure 5.8: Positive linear correlation
Figure 5.9: Negative linear correlation
Figure 5.10: No linear correlation
Figure 5.11: Regression output for Example 5.1 using the Data Analysis add-in
Figure 5.12: Scatter plot for Example 5.1
Figure 6.1: Time series for number of customers served
Figure 6.2: Trend line
Figure 6.3: Moving average
Figure 6.4: Time series plot and the fitted trend line
1. Introduction
Business statistics is a very dynamic and challenging field. Therefore, the learning
materials, exercises, and self-study questions in this manual will give you the chance
to examine the most recent advancements in this area and help in your discovery of
the field of Business Statistics as it is currently used.
2. Module Overview
Davis, G. W., Pecar, B., & Santana, L. (2017). Business statistics using Excel: A first
course for South African students. Oxford University Press Southern Africa.
Croucher, J. (2016). Introductory mathematics and statistics for business (6th ed.).
McGraw-Hill.
Wegner, T. (2016). Applied business statistics (3rd ed.). Juta and Co, Ltd.
This module should be studied using the recommended and prescribed textbook/s and
the relevant sections of this module. You must read about the topic that you intend to
study in the appropriate section before you start reading the textbook/s in detail.
Ensure that you make your own notes as you work through both the textbook/s and
this module. You will find a list of objectives and outcomes at the beginning of each
section. These outline the main points that you need to understand when you have
completed the section/s. The purpose of this guide is to help you study. It is important
for you to work through all the tasks and self-assessment exercises as they provide
guidelines for examination purposes.
6. Navigational Icons
Think Point
When you see this icon, you should think about and reflect on the
issues/challenges/themes presented.
Tasks
When you see this icon, you will know that you are required to perform
a task to gauge how well you remember or understand what you have
read or how good you are at applying what you have learnt.
Definitions
This icon will alert you to a specific definition related to the topic under
discussion.
Case Studies
Case studies are often used to illustrate a concept within the setting
of a real-life scenario. Answer the questions that follow to ensure that
you have a proper understanding of what has been discussed.
CHAPTER ONE
STATISTICS DEFINITIONS
Learning Outcomes
Statistics is the scientific method that enables us to collect, organise, analyse, and
interpret data in order to make decisions as responsibly as possible. Statistics can also
be defined as “a way to get information from data” (Davis, Pecar and Santana, 2017).
The study of statistics has two major branches: descriptive statistics and inferential
statistics. Figure 1.1 presents these two main branches.
[Figure 1.1: The two main branches of statistics: descriptive statistics and inferential statistics]
Descriptive statistics is the meaningful presentation of data such that its characteristics
can be effectively observed.
Statistical inference relates to decision-making and is the subject that leads to future
action rather than an inspection of the past.
EXAMPLE 1.1
In a recent study, volunteers who had less than 6 hours of sleep were four times more
likely to answer incorrectly on a science test than participants who had at least 8
hours of sleep. Decide which part is the descriptive statistic and what conclusion
might be drawn using inferential statistics.
Solution
The statement “four times more likely to answer incorrectly” is a descriptive statistic.
An inference drawn from the sample is that all individuals who sleep less than 6 hours
are more likely to answer science questions incorrectly than individuals who sleep at
least 8 hours.
1.4.1 Data
Data is a ‘scientific’ term for facts, figures, information, and processing. Data are the
raw materials for data processing e.g.
1.4.2 Information
Information is data that has been processed in such a way as to be meaningful to the
person who receives it. Information is anything that is communicated.
performed on the data. A report which summarises the results and discusses their
significance is sent to the company that commissioned the survey.
Individually, a completed questionnaire would not tell the company very much, only
the views of one consumer. In this case, the individual questionnaires are data. Once
they have been processed, and analysed, the resulting report is information. The
company will use it to inform its decisions regarding the product. If the report revealed
that consumers disliked the product, the company would be wrong, and poor decisions
would be made.
Quantitative data is data that can be measured. It will be assigned a numerical value
called a variable. Qualitative data is data that cannot be measured but which reflects
some quality of what is being observed. The data are said to have (non-numeric)
attributes.
Primary data is collected especially for the purpose of whatever survey is being
conducted. Raw data is primary data which has not been processed at all, and which
is still just a list of numbers. Secondary data is data which has already been
collected elsewhere, for some other purpose, but which can be used or adapted for
the survey being conducted.
Discrete data is data which can only take on a finite or countable number of values
within a given range, e.g. the number of goals scored by Arsenal could be 0, 1, 2 or 4
goals but not 1½ or 2½. Continuous data is data which can take on any value within a
given range. Such data are measured rather than counted, e.g. the heights of all members
of your family, which might be 1.542 m, 1.639 m and 1.492 m.
These terms are used to denote the nature of data and the measurement level at which
such data has been acquired.
This is the weakest level of measurement. Such a level entails the classification of
data qualitatively by name, hence the term “nominal”. For example, when data is labelled
in two categories, “men” and “women”, these two categories can be known only
by name. Meat can be classified as “fresh” and “stale”. Names like Caroline, Khumalo,
Kadija, and so on, are classifications on the nominal scale. If you classify cats as
“black” and “white”, you are measuring them using the nominal scale.
Analysis and manipulation of this data requires those statistical techniques which can
handle names and nominal data. The Chi-Square statistic in Chapter five of this
module is one of the few applications of this level of data.
This is the kind of data which is categorized using qualities that can be differentiated
by size. In other words, the data is transitive: it has magnitude and direction. Thus,
data classified as big, bigger, biggest, or large, larger, largest, and similar qualities,
is data which has been acquired and arranged in an ordinal manner. It is ordinal data,
and the level of measurement for such data is ordinal. Chi-Square and Analysis of
Variance (to be learned later), together with any other measures and statistics that can
handle this type of data, are used to manipulate and to make deductions with this kind
of data.
This is data acquired through a process of measurement where equal measuring units
are employed. Such data has magnitude and direction (is transitive), and the size of
the interval between each observation and the one above it is the same for all
observations. This data therefore contains all the characteristics of nominal data and
ordinal data. In addition, the scale of measurement moves uniformly in equal intervals
up and down the respective sizes of the data, hence the name “interval” data. The only
weakness with this kind of data is that the position of zero is not clear, unless it can
be assumed. Thus, data like 2001, 2002, 2003, and so on, is interval data. The zero year
can then be assumed as 2001. Data like temperature readings have an absolute zero so far
away that it is not practical to find it and use it in everyday data manipulation. The
same applies to time in hours or even in minutes, and so on. Statistics used for analysis
include analysis of variance and regression-correlation. However, ratios are
difficult to compute because the zero point is arbitrary.
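The difficulty with ratios on an interval scale can be illustrated with a minimal sketch. The two temperature readings below are illustrative assumptions, not data from this module. The sketch shows that differences stay consistent when the scale changes, whereas ratios do not.

```python
# Illustrative temperatures (assumed values, not taken from the module).
celsius_a, celsius_b = 10.0, 20.0

def to_fahrenheit(celsius):
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32

fahrenheit_a = to_fahrenheit(celsius_a)  # 50.0
fahrenheit_b = to_fahrenheit(celsius_b)  # 68.0

# Differences are consistent between the scales (apart from the unit size).
diff_c = celsius_b - celsius_a           # 10 Celsius degrees
diff_f = fahrenheit_b - fahrenheit_a     # 18 Fahrenheit degrees (10 * 9/5)

# Ratios are not, because each scale places zero at a different point.
ratio_c = celsius_b / celsius_a
ratio_f = fahrenheit_b / fahrenheit_a

print(f"Ratio in Celsius:    {ratio_c:.2f}")
print(f"Ratio in Fahrenheit: {ratio_f:.2f}")
```

The same pair of readings appears "twice as hot" on one scale and only about 1.36 times as hot on the other, which is why ratio statements are avoided for interval data.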
This is the highest level of measurement, with transitivity - magnitude and direction;
equal interval qualities; and the zero can be identified and used conveniently. It is
possible to perform all mathematical manipulations using this data, whereas with other
data such exercises are not possible due to lack of zero levels. Division and ratio
computation between one group of observations and another is possible - hence the
use of the word ratio. All the known statistical techniques are useful with this kind of
data. This is the kind of data most people can handle with ease, because the
observations are countable and divisible.
Table 1.1 summarizes the different measurement scales with examples provided of
these different scales.
1.6 Sampling
It is rare to have access to all the information that we would like to know about a given
situation. Usually, we need to examine some portion (a sample) of the total system (a
population), and then extend our knowledge of that portion to the total system
(inference).
It may not be practical to analyse all the given data in a particular set since:
EXAMPLE 1.2
A company that manufactures matches claims that in each box the ‘average’ content
is 50 matches. Suppose you wish to test whether such a claim is true. It would certainly
be impractical to take every box of matches, count the contents of each and arrive at
an average figure. Instead, you might draw a sample of (say) 100 boxes from the
population of all the boxes manufactured and measure the mean contents of these
100 boxes. You could then decide, based on your results, whether the claim is
valid. In this case, you would be using a sample to make an inference about the
population.
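The idea in Example 1.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 50 000 boxes and the range of 47 to 53 matches per box are hypothetical, since the module gives no actual counts.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: contents of 50 000 boxes, each holding 47 to 53 matches.
population = [random.randint(47, 53) for _ in range(50_000)]

# Draw a simple random sample of 100 boxes, as suggested in Example 1.2.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"Sample size: {len(sample)}")
print(f"Mean contents of sampled boxes: {sample_mean:.2f}")
```

The sample mean is then compared with the claimed average of 50 matches. How close it must be before the claim is accepted is a question for hypothesis testing, not for this sketch.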
1.6.2. Bias
Example 1.3
Consider homes close to an airport. If a survey were conducted on whether the noise
levels are intolerable, the result would depend entirely on which sample of people was
interviewed. Say, for example, that the people interviewed work away from the airport
the whole day and hence are not affected by the noise. They would probably say the
noise level is tolerable simply because they don’t hear most of it, being away from
their homes so much. Making a judgment solely on these people’s views would be totally
biased and the result of the survey worthless.
There are several rules that can be used to help eliminate bias in sampling. These
include:
Even if a sampling process were completely free of bias, there would still be
fluctuations due to naturally occurring variation. In general, no two samples will be
identical, and it is necessary to assess how much variation can be expected to occur
from one sample to another. This variation between samples means that information
from any one sample will not be an exact representation of the population.
It is often assumed that samples are drawn randomly. Choosing a random sample
from a population means that each member of the population has the same chance
of being selected. No member is favoured over any other. This is intended
to ensure that the sample is unbiased, although of course there will still be sampling
error resulting from random fluctuations.
When selecting any sample, you must answer several questions before taking any
action:
1. What precisely are the members from which the sample is to be chosen (i.e. the
population)?
Sample size depends on the type of problem and the desired accuracy of any
conclusion.
Example 1.4
A clothing designer has designed a new style of men’s socks and would like to test the
possible market reaction. She selects a random sample of adult males and finds that
a large proportion of those chosen state that they wear such socks. However, when
the product comes onto the market, very few pairs are sold. The problem here may be
one of population identification. In practice, although men will be the ones who wear
the socks, they may not be the ones who buy them (it could be their wives or girlfriends).
In this case, the user is not the one who makes the purchase.
Most firms would find it either impractical or too expensive to survey all their customers
or to carefully examine every item that flows from their production line. Instead, they
usually resort to selecting a sample from the whole group or, as it is often called, the
population. We then go on to consider the distribution of sample means which, as we
shall see, can under certain conditions be regarded as being normally distributed. This
is used as the basis for testing various theories or hypotheses.
The idea behind this type of probability sampling is random selection. More
specifically, each sample from the population of interest has a known probability of
selection under a given sampling scheme.
There are four categories of probability samples, as illustrated in Figure 1.2.
[Figure 1.2: Probability sampling: simple random, systematic, stratified, and cluster]
The most widely known type of random sample is the simple random sample. This
is characterized by the fact that the probability of selection is the same for every case
in the population. Simple random sampling is a method of selecting n units from
a population of size N such that every possible sample of size n has an equal chance
of being drawn.
Example 1.5
Consider a marketing researcher who must select a random sample of 200 shoppers
who shop at a supermarket during a particular time period. The supermarket would like
to seek the views of its customers on a proposed redevelopment of the store. The total
footfall (the number of people visiting a shop or a chain of shops in a period of time)
within this time period is 10,000.
With a footfall (or population) of this size we could employ a number of ways to select
an appropriate sample of 200 from the potential 10,000. For example, we could place
10,000 consecutively numbered pieces of paper (1–10000) in a box, draw a number
at random from the box, shake, and select another number to maximize the chances
of the second pick being random, shake, and continue the process until all 200
numbers are selected. These would then be used to select a customer entering the
store with the customer chosen based upon the number selected from the random
process. To maximize the chances that customers selected would agree to complete
the survey we could enter them into a prize draw. These 200 customers will form our
sample with each number in the selection having the same probability of being chosen.
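The paper-in-a-box procedure above can be mimicked with a random number generator. The following is a minimal sketch, assuming the shoppers are numbered 1 to 10 000 in order of arrival; the fixed seed is also an assumption, used only so that the draw can be repeated.

```python
import random

random.seed(42)  # fixed seed (an assumption) so the draw can be repeated

footfall = 10_000   # population size from Example 1.5
sample_size = 200   # required sample size from Example 1.5

# Number the shoppers 1 to 10 000 and draw 200 distinct numbers.
# random.sample draws without replacement, like slips of paper from a box.
customer_numbers = range(1, footfall + 1)
selected = sorted(random.sample(customer_numbers, sample_size))

print(selected[:10])   # first ten selected customer numbers
print(len(selected))   # 200
```

Each customer number has the same chance of appearing in the selection, which is the defining property of a simple random sample.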
When undertaking the collection of data via random sampling we generally find it
difficult to devise a selection scheme to guarantee that we have a random sample. For
example, the selection from a population might not be the total population that you
wish to measure or, during the time period when the survey is conducted, we may find
that the customers sampled may be unrepresentative of the population as a result of
unforeseen circumstances.
With systematic random sampling we create a list of every member of the population.
From the list, we randomly select the first sample element from among the first k
entries on the population list, where k is the sampling interval. Thereafter, we select
every kth entry on the list. This method involves choosing every kth element from a
population list as follows:
1. Step 1: Divide the number of cases in the population by the desired sample size.
2. Step 2: Select a random number between one and the value attained in step 1. For
3. Step 3: Starting with case number chosen in step 2, take every twenty-eighth
record, as per this example.
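The three steps can be sketched as a short function. The population size of 2 800 and the sample size of 100 are hypothetical values, chosen only because they give a sampling interval of 28; the function name and the seed are likewise illustrative.

```python
import random

def systematic_sample(population_size, sample_size, rng=random):
    """Return the 1-based list positions chosen by systematic sampling."""
    # Step 1: sampling interval = population size / desired sample size.
    interval = population_size // sample_size
    # Step 2: random starting point between 1 and the interval (inclusive).
    start = rng.randint(1, interval)
    # Step 3: take every interval-th record, beginning at the starting point.
    positions = list(range(start, population_size + 1, interval))
    return positions[:sample_size], interval

random.seed(7)  # fixed seed so the sketch is repeatable
positions, interval = systematic_sample(population_size=2_800, sample_size=100)

print(f"Sampling interval: {interval}")
print(f"First five positions: {positions[:5]}")
print(f"Number selected: {len(positions)}")
```

Only the starting point is random. Once it is fixed, the rest of the sample follows mechanically from the list order.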
With stratified random sampling, the population is divided into two or more mutually
exclusive groups (subpopulations or strata), where the choice of groups depends on the
research area of interest. All the groups together comprise the entire population. The
sampling procedure is to organize the population into homogeneous subsets before
sampling, then draw a random sample from each group. As an example, suppose we conduct
a national survey in the Netherlands. We might divide the population into groups (or
strata) based on the regions of the Netherlands. Then we would randomly select from
each group (or stratum). The advantage of this method is the guarantee that every group
within the population is represented, and it provides an opportunity to undertake group
comparisons.
Example 1.7
To illustrate, consider the situation where we wish to sample the views of graduate job
applicants to a major financial institution. The nature of this survey is to collect data on
the application process from the applicants’ perspective. The survey will therefore
have to collect the views from the different specified groups within the identified
population.
For example, this could be based on gender, race, type of employment requested (full-
or part-time), or whether an applicant is classified as disabled. If we use simple random
sampling, it is possible that we may miss a representative sample from one of these
groups as a result, for example, of the size of the group relative to the
population. In this case, we would employ stratified random sampling to ensure that
appropriate numbers of sample values are drawn from each group in proportion to its
share of the population as a whole. Stratified sampling offers several advantages
over simple random sampling:
(i) It guards against an unrepresentative sample (e.g., all males from a predominantly
female group).
(ii) It provides sufficient group data for separate group analysis; it requires a smaller
sample.
(iii) Greater precision is achievable compared with simple random sampling for a
sample of the same size.
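Drawing from each group in proportion to its share of the population can be sketched as follows. The applicant pool (800 full-time and 200 part-time applicants), the total sample size of 50 and the function name are all hypothetical, introduced only for illustration.

```python
import random

def proportional_stratified_sample(strata, total_sample_size, rng=random):
    """Draw a simple random sample from each stratum, sized in
    proportion to the stratum's share of the population."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = len(members) / population_size
        # Rounding can make the overall total differ slightly from the target.
        stratum_size = round(total_sample_size * share)
        sample[name] = rng.sample(members, stratum_size)
    return sample

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical applicant pool split by type of employment requested.
strata = {
    "full-time": [f"FT-{i}" for i in range(800)],
    "part-time": [f"PT-{i}" for i in range(200)],
}
sample = proportional_stratified_sample(strata, total_sample_size=50)

for name, members in sample.items():
    print(name, len(members))
```

With these assumed figures the part-time group supplies 10 of the 50 respondents, so it cannot be missed by chance as it could be under simple random sampling.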
Stratified random sampling nearly always results in a smaller variance for the
estimated mean or other population parameters of interest. The main disadvantage of
a stratified sample is that it may be more costly to collect and process the data
compared with a simple random sample. Two different categories of stratified random
sampling are available:
After clusters are selected, then all data points within the clusters are selected. No
data points from non-selected clusters are included in the sample. This differs from
stratified sampling, in which some data values are selected from each group. When
all the data values within a cluster are selected, the technique is referred to as one-
stage cluster sampling. If a subset of units is selected randomly from each selected
cluster, it is called two-stage cluster sampling. Cluster sampling can also be made in
three or more stages: it is then referred to as multistage cluster sampling. The main
reason for using cluster sampling is that it is usually much cheaper and more
convenient to sample the population in clusters rather than randomly. In some cases,
constructing a sampling frame that identifies every population element is too
expensive or impossible. Cluster sampling can also reduce cost when the population
elements are scattered over a wide area.
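The difference between one-stage and two-stage cluster sampling can be sketched as below. The population of 10 branches with 30 customers each is hypothetical, as are the function and variable names.

```python
import random

def cluster_sample(clusters, n_clusters, units_per_cluster=None, rng=random):
    """One-stage cluster sampling when units_per_cluster is None,
    two-stage cluster sampling otherwise."""
    # Stage 1: select whole clusters at random.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        units = clusters[name]
        if units_per_cluster is None:
            sample[name] = list(units)  # one-stage: take every unit
        else:
            # Stage 2: take a random subset of units from the chosen cluster.
            sample[name] = rng.sample(units, units_per_cluster)
    return sample

random.seed(11)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10 branches (clusters) with 30 customers each.
clusters = {f"branch-{b}": [f"b{b}-c{c}" for c in range(30)] for b in range(10)}

one_stage = cluster_sample(clusters, n_clusters=3)
two_stage = cluster_sample(clusters, n_clusters=3, units_per_cluster=5)

print({name: len(units) for name, units in one_stage.items()})
print({name: len(units) for name, units in two_stage.items()})
```

In both cases the clusters that were not selected contribute nothing to the sample, which is the key contrast with stratified sampling.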
In many situations it is not possible to select the kinds of probability samples used in
large-scale surveys. For example, we may be required to seek the views of local,
family-run businesses that have experienced financial difficulties during the bank credit
crunch of 2007–2012. In this situation there are no easily accessible lists of businesses
experiencing difficulties, or there may never be a list created or available. The question
[Figure: Non-probability sampling: convenience and purposive (quota and snowball)]
Purposive sampling is a sampling method in which elements are chosen based on the
purpose of the study. Purposive sampling may involve studying the entire population
of some limited group (the accounts department at a local engineering firm) or a subset
of a population (chartered accountants). As with other non-probability sampling methods,
purposive sampling does not produce a sample that is representative of a larger
population, but it can be exactly what is needed in some cases, such as the study of an
organization, community, or some other clearly defined and relatively limited group.
Two popular purposive sampling methods are quota sampling and snowball sampling.
Quota sampling is designed to overcome the most obvious flaw of convenience (or
availability) sampling. Rather than taking just anyone, quotas are set to ensure that
the sample you get represents certain characteristics in proportion to their prevalence
in the population. Note that for this method you have to know something about the
characteristics of the population ahead of time. There are two types of quota sampling
(1) proportional and (2) non-proportional.
In proportional quota sampling you want to represent the major characteristics of the
population by sampling a proportional amount of each. For instance, if you know the
population has 25% women and 75% men, and that you want a total sample size of
400, you will continue sampling until you get those percentages and then you will stop.
So, if you’ve already got 100 women for your sample, but not 300 men, you will
continue to sample men, even if legitimate women respondents come along—you will
not sample them because you have already ‘met your quota’.
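The stopping rule for proportional quota sampling can be sketched as a simple loop. The quotas come from the figures above; the stream of passers-by, in which each person is a woman with probability 0.5, is a hypothetical assumption made only to drive the simulation.

```python
import random

random.seed(5)  # fixed seed so the sketch is repeatable

sample_size = 400
population_shares = {"women": 0.25, "men": 0.75}  # shares given in the text

# Quotas: 100 women and 300 men.
quotas = {group: round(sample_size * share)
          for group, share in population_shares.items()}
counts = {group: 0 for group in quotas}
turned_away = 0

# Hypothetical stream of passers-by: each is a woman with probability 0.5.
while counts != quotas:
    respondent = "women" if random.random() < 0.5 else "men"
    if counts[respondent] < quotas[respondent]:
        counts[respondent] += 1   # quota not yet met: include the respondent
    else:
        turned_away += 1          # quota already met: skip the respondent

print(quotas)       # {'women': 100, 'men': 300}
print(counts)       # equals the quotas once sampling stops
print(turned_away)  # legitimate respondents skipped because a quota was full
```

The count of respondents turned away shows the cost of the method: people who would otherwise qualify are passed over once their category is full.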
The primary problem with this form of sampling is that even when we know that a quota
sample is representative of the particular characteristics for which quotas have been
set, we have no way of knowing if the sample is representative in terms of any other
characteristics. If we set quotas for age, we are likely to attain a sample with good
representativeness on age, but one that may not be very representative in terms of
gender, education, or other pertinent factors.
In non-proportional quota sampling you specify the minimum number of sampled data
points you want in each category. In this case you are not concerned with having the
correct proportions, but with achieving the numbers in each category. This method is
the non-probabilistic analogue of stratified random sampling in that it is typically used
to assure that smaller groups are adequately represented in your sample. Finally,
researchers often introduce bias when allowed to self-select respondents, which is
usually the case in this form of survey research. In choosing males, interviewers are
more likely to choose those that are better-dressed, or who seem more approachable
or less threatening. That may be understandable from a practical point of view, but it
introduces bias into research findings.
In snowball sampling, you begin by identifying someone who meets the criteria for
inclusion in your study. You then ask them to recommend others who they may know
who also meet the criteria. Thus, the sample group appears to grow like a rolling
snowball.
This sampling technique is often used in hidden populations, which are difficult for
researchers to access, including firms with financial difficulties or students struggling
with their studies. The method creates a sample with questionable representativeness,
and it can be difficult to judge how a sample compares with a larger population.
Furthermore, an issue arises in whom the respondents refer you to. For example, friends
will refer you to friends but are less likely to refer you to people they don’t consider
friends, for whatever reason. This creates a further bias within the sample that makes
it difficult to say anything about the population.
SUMMARY
In this chapter, we discussed key concepts in statistics, and introduced data as the
raw material of statistics. We distinguished between primary and secondary data. We
also looked at the different data types and different types of measurement scales.
Finally, we covered the methods of sampling.
CHAPTER TWO
VISUAL REPRESENTATION OF DATA
Learning Outcomes
Before presentation always check the source of the data to ensure that the:
• Data has been accurately transcribed.
• Figures are relevant to the problem.
2.3.1 Tables
When constructing a table, it is important to show which relationships are being
emphasised. It is often wise to have more than one table relating the same set of
data. Some of the important points that should be followed when constructing a table
are:
• Every table should have a clear label, such as Table 1, or Table 1.1, depending
on the table number and the chapter in which it appears.
• Every table should have a proper title that describes the type of information
given in the table.
• Rows and columns should be precise, and the units of the values included.
• Categories should not overlap, i.e., the same item should not appear in more
than one category.
• The correctness of any calculations should be verified. For example, check
that the sum of the column totals equals the sum of the row totals.
• Omit any unnecessary and/or irrelevant data.
• Clearly state the units.
• Use your imagination and common sense.
• Compute and show percentages and ratios where appropriate.
Example 2.1
Below is information concerning participation in soccer, rugby, and tennis by South
African adults all under the age of 50 years. South Africa has a total of 2 055 000
adults under 50 years of age who participate in soccer, rugby, and tennis, with
1 700 000 soccer players, 270 000 rugby players and 85 000 tennis players. The actual
breakdown of these numbers (in different age groups) is 600 000 soccer players,
120 000 rugby players and 30 000 tennis players between 18 and 25 years; 800 000 soccer
players, 100 000 rugby players and 40 000 tennis players between 26 and 35 years;
and 300 000 soccer players, 50 000 rugby players and 15 000 tennis players between
36 and 49 years. Present the data in tabular form.
Solution
It is difficult to gain any meaningful information from this narrative. Instead, we can
construct a table from the above data as shown below.
The numbers of adults in South Africa under the age of 50, who participate in soccer,
rugby, or tennis, are shown in Table 2.1 below.
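Table 2.1: Sport discipline in South Africa and the age groups
Age group (years)      Soccer      Rugby     Tennis      Total
18-25                 600 000    120 000     30 000    750 000
26-35                 800 000    100 000     40 000    940 000
36-49                 300 000     50 000     15 000    365 000
Total               1 700 000    270 000     85 000  2 055 000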
The table is much more compact and useful than the narrative and helps us
distinguish patterns, namely that, in general, the most active adult South Africans
are between 26 and 35 years of age.
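The tabulation, together with the check that row totals and column totals agree, can be sketched in Python. The figures are those of Example 2.1; the variable names and the layout are illustrative choices.

```python
# Participation figures from Example 2.1 (number of adults).
participation = {
    "18-25": {"Soccer": 600_000, "Rugby": 120_000, "Tennis": 30_000},
    "26-35": {"Soccer": 800_000, "Rugby": 100_000, "Tennis": 40_000},
    "36-49": {"Soccer": 300_000, "Rugby": 50_000, "Tennis": 15_000},
}
sports = ["Soccer", "Rugby", "Tennis"]

row_totals = {age: sum(row.values()) for age, row in participation.items()}
column_totals = {s: sum(row[s] for row in participation.values()) for s in sports}

# Verification step: both sets of totals must give the same grand total.
grand_total = sum(row_totals.values())
assert grand_total == sum(column_totals.values())

# Print the table with a totals row and a totals column.
print(f"{'Age group':<10}" + "".join(f"{s:>12}" for s in sports) + f"{'Total':>12}")
for age, row in participation.items():
    cells = "".join(f"{row[s]:>12,}" for s in sports)
    print(f"{age:<10}" + cells + f"{row_totals[age]:>12,}")
totals = "".join(f"{column_totals[s]:>12,}" for s in sports)
print(f"{'Total':<10}" + totals + f"{grand_total:>12,}")
```

The row totals make the pattern noted above easy to confirm: the 26 to 35 age group has the largest number of participants.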
2.3.2 Graphs
Graphs are picture representations of data, and better depict relationships between
variables than a table does. It is also possible to depict more than one set of data on
the same graph.
When constructing a graph illustrating data:
• Give the graph a clear and appropriate title.
• Label the axes (x and y) clearly, with the units of measurement of the
quantities.
• Do not plot too many curves on the same set of axes, as it will confuse the
relationships between the variables.
• It is recommended to accompany the graph with the table of data; and,
• Try, as much as possible to show the zero on each scale.
A pie chart is often used to give a visual presentation of data to indicate the
proportions that make up a given total. It is one of a number of so-called area
diagrams, which consist of geometrical figures (for example, squares and rectangles).
The pie chart is a circle that is divided into sectors by lines, in such a way that the
area of each sector is proportional to the size of the quantity represented by that
sector (Croucher, 2016).
Example 2.2
Consider the previous table. Draw a pie chart that illustrates the percentage of
the total number of South Africans between 18 and 25 years that play soccer, rugby
and tennis.
Solution
Percentage that play soccer = (600 000 / 750 000) × 100 = 80%
Percentage that play rugby = (120 000 / 750 000) × 100 = 16%
Percentage that play tennis = (30 000 / 750 000) × 100 = 4%
These percentages can be represented on a pie chart as follows:
[Pie chart: Soccer 80%, Rugby 16%, Tennis 4%]
Figure 2.1: South Africans between 18-25 years old that play soccer, rugby, and tennis
The sector for soccer is 80 % of the total pie. The sector for rugby is 16 % of the
total pie. The sector for tennis is 4 % of the total pie. The sum of the individual
percentages must be 100 % (i.e. the whole pie).
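The percentages, and the sector angles needed to draw the pie chart by hand, can be computed with a minimal sketch. The figures are those for the 18 to 25 age group in Example 2.1; the angle calculation (share of 360 degrees) is an addition for illustration.

```python
# Players aged 18 to 25, from Example 2.1.
players = {"Soccer": 600_000, "Rugby": 120_000, "Tennis": 30_000}
total = sum(players.values())  # 750 000

# Each sector's share of the total, as a percentage and as an angle.
percentages = {sport: count / total * 100 for sport, count in players.items()}
angles = {sport: count / total * 360 for sport, count in players.items()}

for sport in players:
    print(f"{sport:<7} {percentages[sport]:5.1f}%  {angles[sport]:6.1f} degrees")

print(f"Sum of percentages: {sum(percentages.values()):.1f}%")
```

The angles sum to 360 degrees in the same way that the percentages sum to 100 %, which is a quick check that no category has been left out.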
• Make the widths of each bar equal, since it is only the lengths which are being
compared.
• Clearly label the axes.
• Include footnotes, sources of data and tables.
EXAMPLE 2.3
Table 2.2 below shows the total number of points scored by four rugby teams
in ten matches. Draw a bar chart to represent the given information.
Solution
The points are represented on a bar chart in Figure 2.2 below:
Figure 2.2: Bar chart of points scored by rugby teams in ten matches
EXAMPLE 2.4
The number of points conceded by each team is shown in Table 2.3 below. Using
information from Table 2.2 in Example 2.3 above, draw a multiple bar chart
to represent the points scored and those conceded by the four rugby teams.
Table 2.3: Points scored and those conceded by the four rugby teams
Solution
Figure 2.3: Points scored and those conceded by the four rugby teams
Source: Croucher, 2016
From this multiple bar chart, it is easy to see that the Stormers and the Sharks
are the better performing sides, with both teams scoring more points than they concede.
The Bulls are the worst performing team, conceding more than double the
number of points that they score.
2.3.3 Pictograms
A pictogram is a graph in which data is displayed using pictures, rather than
traditional methods discussed earlier. The pictures are indicative of the type of data
being presented.
EXAMPLE 2.5
A survey was conducted by a local pharmaceutical company to try and determine the
most common type of medication people use to cure a common cold. The survey
was conducted on a group of 36 randomly chosen South Africans. Research
resulted in the data in table 2.4 below:
Panado
Degoran
Flutex
Other
EXAMPLE 2.6
An educational economist wants to establish the relationship between an individual's
income and education. She takes a random sample of 10 individuals and asks for their
income (in R '000s) and education (in years).
EDUCATION (years)   INCOME (R '000s)
11                  25
12                  33
11                  22
15                  41
8                   18
10                  28
11                  32
11                  24
17                  53
11                  26
If we feel the value of one variable (such as income) depends to some degree on
the value of the other variable (such as years of education), the first variable
(income) is called the dependent variable and is plotted on the vertical or y-axis.
The second variable is the independent variable and is plotted on the x-axis.
TIP: Think of the independent variable (x-axis) as the ‘cause’ and the dependent
variable (y-axis) as the ‘effect’.
The scatter diagram allows us to observe two characteristics about the relationship
between education (x) and income (y). As these two variables move together, i.e.
their values tend to increase together and decrease together, there is a positive
relationship between the two variables. The relationship between income and
years of education appears to be linear, since we can imagine drawing a straight
line (as opposed to a curved line) through the scatter diagram that approximates
the positive relationship between the two variables. The pattern of a scatter
diagram provides us with information about the relationship between two
variables.
If two variables move in opposite directions and the scatter diagram consists of
points that appear to cluster around a straight line, then the variables have a
negative linear relationship (Figure 2.6).
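The direction of a linear relationship can also be checked numerically. The following is a minimal sketch only: it assumes that the first column of the Example 2.6 data is education and the second is income, and it uses the Pearson correlation coefficient, a measure that is not introduced at this point in the text. The function name pearson_r is illustrative.

```python
from math import sqrt

# Data from Example 2.6; the column order (education, income) is an assumption.
education = [11, 12, 11, 15, 8, 10, 11, 11, 17, 11]  # years
income = [25, 33, 22, 41, 18, 28, 32, 24, 53, 26]    # R '000s


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(education, income)
# A positive r means the variables tend to increase together;
# a negative r means they move in opposite directions.
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.2f} ({direction} linear relationship)")
```

A value of r close to +1 or -1 suggests that the points cluster tightly around a straight line, while a value near 0 suggests no linear relationship.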
Often, in practice, data is collected and is initially presented with the observations
(i.e. sample values) in some random order. Various techniques are used for
condensing data into comprehensible form.
Example 2.4
The number of transactions per month at an Automated Teller Machine (ATM) for 10
account holders at a particular bank is as follows:
05 01 09 17 11 05 10 03 19 04
2.4.1 Array
The raw data from above can be arranged in some meaningful order. As it stands,
not much information can be gauged from the data, except that every account
holder made fewer than 20 transactions.
One such method is to arrange the data in increasing order of magnitude i.e. from
smallest to largest (ascending order).
01 03 04 05 05 09 10 11 17 19
any pattern in the data. The solution is to summarise the data by creating tables that
report how often certain sections of the data appear in the data set. We do this by
drawing a frequency table.
Presenting raw data in a table can make even the most comprehensive collection
of data more readily understandable. Apart from taking up less room, a table allows
us to locate figures more quickly, makes it easy to compare different classes and
reveals patterns that we might otherwise not have noticed.
Frequency tables come in a variety of formats and range from a simple table that
contains frequencies of categories to frequency tables for continuous data that contain
groupings of values (or classes). Tables allow us to summarise data in a form that
allows us to access important information (Davis, Pecar and Santana, 2014).
The class midpoint represents the middle value of a class interval. This component
of a grouped frequency distribution table only applies when we use quantitative data
in a table with class intervals.
The frequencies (or the count) represent the number of times that an observation
falls within a specified class. Each observation must fall into exactly one class.
Hence, the sum of these frequencies is equal to the total sample size. When the
frequencies are expressed as a percentage of the total sample size, i.e.
$\frac{\text{frequency}}{\text{sample size}} \times 100$, they are called the percentage
frequencies. It is worth noting that the percentage frequencies add up to 100%.
The class frequencies (percentage frequencies) are denoted $f_j$ ($p_j$). The total
number of observations is denoted by $n = \sum_{j=1}^{k} f_j$, where k is the number of classes.
The general form of a frequency table is presented in Table 2.6 below.
NOTE
When constructing a frequency table, always check that the total frequency equals
the number of observations in the raw data, n=10 in this case.
NUMBER OF TRANSACTIONS   FREQUENCY
01                       1
03                       1
04                       1
05                       2
09                       1
10                       1
11                       1
17                       1
19                       1
Total (k = 9 classes)    10
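The array and the frequency table for the ATM data can be reproduced with a few lines of code. This is a minimal sketch using Python's standard library; the variable names are illustrative and are not part of the module guide.

```python
from collections import Counter

# Raw ATM transaction counts for the 10 account holders.
transactions = [5, 1, 9, 17, 11, 5, 10, 3, 19, 4]

# Array: the raw data arranged in ascending order.
array = sorted(transactions)

# Frequency table: how often each distinct value occurs.
freq = Counter(array)
n = sum(freq.values())

for value in sorted(freq):
    pct = 100 * freq[value] / n  # percentage frequency
    print(f"{value:02d}  {freq[value]}  {pct:.0f}%")

# Check recommended in the NOTE: total frequency equals the number of observations.
print("total frequency:", n, "number of observations:", len(transactions))
```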
NOTE
In constructing a frequency table from continuous data, also known as a grouped
frequency table, there are several rules that should be followed. They are listed
below:
• The class intervals must never overlap.
• In most instances, the intervals should be of the same width.
• The first and/or last class interval could be open-ended, but this should be
avoided as much as possible, e.g. open-ended classes of the type ‘under 100’
or ‘over 200’ rather than a closed class such as [100; 200).
Table 2.8 shows an example of a grouped frequency distribution. The data were the
total sales registered by 15 till points operated by cashiers working in a large
supermarket. In this case the number of classes is k = 5.
When constructing a frequency table for continuous data, we need to set up class
intervals in order to group the values into classes. Naturally, the number of classes will
affect the size of each of the class intervals and vice versa, as each class is one part
of the whole observed range.
The ease of reading the table depends on choosing the right number and width of
classes. With too many classes, the information is not much better than raw data, whereas
too few classes drastically reduce the amount of meaningful information that we can
glean from the table.
The following is a list of steps involved in constructing a frequency table from a set of
raw continuous data.
Step 2: Determine the difference between the largest observed value in the data
set (xmax) and the lowest observed value in the data set (xmin). The
difference xmax - xmin is the range of the data. Note that it is easier to work
with ‘round’ numbers when constructing a frequency table, but the values
of xmax and xmin are often decimals. To overcome this, we often replace
xmin with the largest ‘round’ number smaller than xmin and we replace xmax
with the smallest ‘round’ number larger than xmax. We then calculate
the (approximate) range from these two new rounded numbers.
of 2, 5 or 10 units
Step 5: Next we determine the class limit for each class. We can do this in one
of the two ways:
(i) Calculate or choose the number of classes, k (using the guidelines
in the steps above). From this value calculate the appropriate class
width, CW = (xmax - xmin)/k. Then, starting with the smallest chosen value,
xmin, add the value of CW to obtain the first class, i.e., the first class will
be xmin to (xmin + CW). Each subsequent class is then created by adding
the class width to the upper boundary of the previous class. For example,
suppose that for a raw data set we choose k = 4, xmin = 10 and xmax = 34.
Using these values, we find that CW = 6. The k = 4 classes that we
construct will then be [10; 16), [16; 22), [22; 28) and [28; 34).
(ii) Alternatively, you may decide that the class width must equal a
specific value. In this case, you will determine the class width, CW, and
then reverse the equation for CW to get a value for k:
$$k = \frac{x_{max} - x_{min}}{CW}$$
Then, carry on with the procedure as above, i.e., start with the smallest
chosen value, xmin, add the value of CW to obtain the first class, and so
on, so that each subsequent class is then created by adding the
class width to the upper boundary of the previous class. For example,
suppose that we choose CW =25, xmin =150 and xmax= 275. The number
of classes would then be k = 5, and the classes would be [150; 175),
[175;200), [200; 225), [225; 250) and [250; 275).
NOTE: The square brackets represent ‘inclusion’ and the round brackets
represent ‘exclusion’, i.e. the class [10; 16) represents all the values starting with
and including 10, up to, but excluding, 16. In doing this we are able to avoid
overlap in classes.
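Both ways of setting the class limits in Step 5 can be sketched in code. The function below is illustrative only (build_classes is a hypothetical name): it accepts either the number of classes k or the class width CW, and returns half-open classes [lower; upper) as pairs. It assumes that xmin and xmax have already been rounded as described in Step 2.

```python
from math import ceil


def build_classes(x_min, x_max, k=None, cw=None):
    """Return half-open classes [lower, upper) that cover x_min to x_max.

    Supply exactly one of k (number of classes) or cw (class width).
    """
    if (k is None) == (cw is None):
        raise ValueError("supply exactly one of k or cw")
    if k is None:
        # Option (ii): the class width is fixed, so solve for k.
        k = ceil((x_max - x_min) / cw)
    else:
        # Option (i): the number of classes is fixed, so solve for CW.
        cw = (x_max - x_min) / k
    # Each class starts at the upper boundary of the previous class.
    return [(x_min + j * cw, x_min + (j + 1) * cw) for j in range(k)]


# Option (i): k = 4, xmin = 10, xmax = 34 gives CW = 6.
print(build_classes(10, 34, k=4))
# Option (ii): CW = 25, xmin = 150, xmax = 275 gives k = 5.
print(build_classes(150, 275, cw=25))
```

Because each pair is treated as [lower; upper), a value equal to an upper boundary belongs to the next class, which is what prevents overlap.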
Step 6: The class midpoints for each class interval are equal to the average of
the lower and upper stated class limits, i.e. the class midpoint for the jth
class is:
$$m_j = \frac{LCB_j + UCB_j}{2}$$
where $UCB_j$ and $LCB_j$ are the upper class and lower class bounds of the
jth class.
Step 7: Finally, we use the practical limits to determine the frequencies for
each class by tallying the observations from the raw data set that fall
within the class limits
Example 2.5
Consider the following data set in Table 2.9 depicting the weight of 20 teenagers
(sorted from smallest to largest).
Using the seven steps given above, we can construct the frequency table.
$$k = 1 + \frac{\log(20)}{\log(2)} \approx 5$$
Step 4: The class width is calculated as $CW = \frac{\text{range}}{k} = \frac{100}{5} = 20$.
Step 5: The classes for the table are then [40; 60), [60; 80), [80; 100), [100;
120) and [120; 140).
Step 6: The class midpoints are the average of the lower and the upper class
limits, for example, $\frac{40 + 60}{2} = 50$. The midpoints are then 50, 70, 90, 110
and 130.
Step 7: The frequencies are then tallied for each class and given in Table 2.10.
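The calculations in this example can be sketched in code as follows. The number of classes uses the rule shown above, k = 1 + log(n)/log(2), rounded to the nearest whole number. The values n = 20, xmin = 40 and xmax = 140 come from the example; the short list called sample is purely hypothetical (it is not the data of Table 2.9) and is included only to show how tallying into half-open classes works.

```python
from math import log


def number_of_classes(n):
    """k = 1 + log(n)/log(2), rounded to the nearest whole number."""
    return round(1 + log(n) / log(2))


def tally(data, classes):
    """Count the observations falling in each half-open class [lower, upper)."""
    return [sum(lower <= x < upper for x in data) for lower, upper in classes]


n, x_min, x_max = 20, 40, 140            # rounded limits from the example
k = number_of_classes(n)                 # number of classes
cw = (x_max - x_min) / k                 # class width = range / k
classes = [(x_min + j * cw, x_min + (j + 1) * cw) for j in range(k)]
midpoints = [(lower + upper) / 2 for lower, upper in classes]

sample = [45, 58, 60, 79.5, 101, 139]    # hypothetical weights, for illustration only
frequencies = tally(sample, classes)

print(k, cw)
print(midpoints)
print(frequencies)
```

Note that the value 60 is counted in [60; 80) and not in [40; 60), in line with the bracket convention.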
Example 2.6
A histogram for the grouped frequency distribution for the weight of students in Table
2.10 is shown in Figure 2.10 below.
Figure 2.10: Histogram of the weight of teenagers (horizontal axis: weight in kg, classes [40; 60) to [120; 140); vertical axis: number of students)
Example 2.7
A frequency polygon for the grouped frequency distribution for the weight of students
in Table 2.10 is shown in Figure 2.11 below.
Figure 2.11: Frequency polygon of the weight of teenagers (horizontal axis: weight)
Step 3: Plot points that represent the upper-class boundaries and their
corresponding cumulative frequencies.
Step 4: Connect the points in order from left to right.
Step 5: The graph should start at the lower boundary of the first class
(cumulative frequency is zero) and should end at the upper boundary
of the last class (cumulative frequency is equal to the sample size).
Example 2.8
Construct an ogive for the weight of teenagers shown in Table 2.11.
Step 2: Plot the values of the cumulative frequency class bounds against their
cumulative frequencies:
• Plot the point 40 on the horizontal axis against the point 0 on the
vertical axis (where the ogive is connected to the x-axis).
• Plot the point 60 on the horizontal axis against the point 2 on the
vertical axis.
• Plot the point 80 on the horizontal axis against the point 5 on the
vertical axis.
• Plot the point 100 on the horizontal axis against the point 12 on
the vertical axis.
• Plot the point 120 on the horizontal axis against the point 17 on
the vertical axis.
• Plot the point 140 on the horizontal axis against the point 20 on
the vertical axis.
Step 3: Connect the points with straight lines. The results are shown in
Figure 2.12.
Figure 2.12: Ogive of the weight of teenagers (horizontal axis: weight)
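The points plotted in Example 2.8 can be generated from the class frequencies. The sketch below is illustrative: the class frequencies 2, 3, 7, 5 and 3 are not quoted directly in the example but are implied by the differences between the cumulative values 0, 2, 5, 12, 17 and 20 given above.

```python
from itertools import accumulate

lower_bound_first_class = 40
upper_bounds = [60, 80, 100, 120, 140]
frequencies = [2, 3, 7, 5, 3]  # implied by the cumulative values in Example 2.8

# Running total of the frequencies gives the cumulative frequencies.
cumulative = list(accumulate(frequencies))

# The ogive starts at the lower boundary of the first class with a cumulative
# frequency of zero, then pairs each upper boundary with its cumulative frequency.
points = [(lower_bound_first_class, 0)] + list(zip(upper_bounds, cumulative))

for weight, cum_freq in points:
    print(weight, cum_freq)
```

The last cumulative frequency should always equal the sample size, which is a quick check that no class has been omitted.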
The tools available in Excel for creating summary tables and charts are outlined
below.
• From the PivotTable Field List box, drag the categorical variable to the Row
Labels box (or the Column Labels box) and then again drag it to the Σ Values
box. The one-way pivot table is constructed as the variable is dragged to each
box in turn.
• Check that the Σ Values box displays Count of <variable name>. If not, click
the down arrow in the Σ Values box and select Count from the Value Field
Settings dialog box.
The PivotTable Field List dialog box is used to create the category frequency table
(one-way pivot table) as shown in Figure 2.13 below:
To construct a cross-tabulation table (or two-way pivot table), follow the same steps
as for a one-way pivot table, but drag one of the categorical variables to the Row
Labels box and the other categorical variable to the Column Labels box. Then drag
either of these two variables to the Σ Values box, and again check that the Count
operation is displayed.
To use any of the statistical tools within the data analysis add-in, select the Data tab
and then the Data Analysis option in the Analysis section of the menu bar.
Figure 2.14 and Figure 2.15 show all 19 statistical techniques available in Data
Analysis
Figure 2.15: Excel’s Data Analysis dialog box (remaining nine techniques)
The Histogram option within Data Analysis is used to create numeric frequency
distributions, histograms, ogives (both count and percentages) and the Pareto curve.
To apply the histogram option, first create a data range consisting of a label heading
and the upper limits of each interval in column format. Excel calls this data range of
interval upper limits a Bin Range. Then complete the data input dialog box
for the histogram, as shown in Figure 2.16.
Conclusion
This chapter outlined different modes of displaying data and conveying the information
from statistical analyses. Charts such as the pie and bar charts vividly display data
associated with qualitative (categorical) random variables. Examples of how data may
be presented using bar charts, pie charts, histograms, scatter plots, frequency tables
and ogive curves were shown in this chapter.
Additionally, this chapter covered how to use pivot tables, or summary tables, in Excel
to display data graphically. In conclusion, when presenting statistical findings to
management, graphical representations should always be taken into account.
Compared to written reports and tables, a graphical depiction encourages quicker
assimilation of the information to be communicated.
Self-Assessment Questions
2.1 Consider the following data set of the number of calls received per day by a
call centre for motor vehicle insurance claims during the month of December.
31 13 21 25 61
14 19 17 30 72
14 13 21 246 298
8 29 21 217
9 25 17 80
7 17 30 118
17 25 37 80
2.1.1 According to Sturges’ rule of thumb, how many class intervals should
be used when summarising this data set?
2.1.2 Determine the class width and where the first class interval should
start.
2.1.6 Use the table from question 2.1.5 to draw a frequency polygon.
Table 2.12
AREA PERCENTAGE (%)
North 31.7
South 22.4
East 17.3
West 28.6
Total 100.0
2.2.2 There were 2000 employees. Convert the data into frequency form.
2.3. The number of Namibians aged 15 years and older attending a sporting
event or competition increased from 6.1 million in 1995 to 6.4 million in 1999.
Table 2.13 below shows the six sports events that had the largest
change in attendance during that period along with their attendance rates:
ATTENDANCE RATE
SPORT 1995 1999
Soccer 29.6 32.3
Rugby 11.4 19.7
Cricket 9.8 12.2
Tennis 6.3 7.1
Golf 6.1 6.6
Bowls 2.1 1.4
Total 65.3 79.3
2.4. The number of burglaries reported in a major city each day over a period of
20 days is shown below:
17 16 26 36 19 8 41 32 26 27
8 38 41 19 26 17 54 16 36 26
2.5 Discuss the two main differences between a bar chart and a histogram.
2.6 The following table shows the number of passenger cars sold by each
manufacturer in each half-year (first and second half) of last year.
2.6.1 Construct a multiple bar chart showing the number of new car sales by
the manufacturers between the first and the second half of last year.
(Use Excel’s Column (Bar) chart option in the Insert > Chart tab.)
2.6.2 By inspection of the multiple bar chart, identify which car manufacturers
performed better in terms of new car sales in the first half of the year
compared to the second half of the year.
2.6.3 Also by inspection of the multiple bar chart, identify which car
manufacturer showed the largest percentage change (up or down) in
sales from the first half to the second half of the year.
CHAPTER THREE
DATA DESCRIPTORS
Learning Outcomes
The notation for the mean depends on whether the observations under consideration
are from a sample or a population. When the mean is calculated from sample
observations, it is denoted by $\bar{x}$ (pronounced ‘x bar’) and when it is calculated from
the entire population, it is denoted by μ (pronounced ‘mu’).
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Where:
μ = Population mean.
N = Population size.
xi = The ith observation for the random variable x.
$\sum_{i=1}^{N} x_i$ = Sum of all the observations starting from the first (i = 1) to the last (i = N).
Sample mean, $\bar{x}$
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Where:
$\bar{x}$ = Sample mean.
n = Sample size.
xi = The ith observation for the random variable x.
$\sum_{i=1}^{n} x_i$ = Sum of all the observations starting from the first (i = 1) to the last (i = n).
Example 3.1
The number of workdays lost due to illness in a business per week is given below
(for a 10-week period):
36 28 33 29 28 32 33 33 34 32
Calculate the mean number of days lost per week during the above period.
Solution
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{36+28+33+29+28+32+33+33+34+32}{10} = \frac{318}{10} = 31.8$$
Therefore, the mean number of working days lost per week due to illness is about 32 days.
In this case we assume that the observations are spread evenly throughout each class
interval. This essentially means that the calculations are based on the assumption that
all observations occur at the midpoint (m) of their class, so the formula for the
calculation of the mean from a frequency distribution may be used. That is, in the
equation for calculating the mean from a frequency distribution, we replace x by fm.
This yields the formula:
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i}$$
Where:
$\bar{x}$ = Sample mean.
$m_i$ = The midpoint of the ith class interval.
$f_i m_i$ = The product of the midpoint $m_i$ of the ith class interval and the
frequency, $f_i$, of the ith class interval.
$\sum_{i=1}^{k} f_i m_i$ = Sum of all the observations in the sample.
$\sum_{i=1}^{k} f_i$ = Sum of the frequencies.
Example 3.2
In mid-winter 2018 in Eswatini, all common colds in a group of locals wore off in less
than 10 days. A summary of the duration of these colds is shown in the grouped
frequency distribution in Table 3.1 below:
DURATION (days)   FREQUENCY
[0; 1)            0
[1; 2)            3
[2; 5)            8
[5; 10)           4
Total             15
Solution
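Using the equation for calculating the mean from a grouped frequency distribution:
DURATION (days)   FREQUENCY ($f_i$)   MIDPOINT ($m_i$)   $f_i m_i$
[0; 1)            0                    0.5                0
[1; 2)            3                    1.5                4.5
[2; 5)            8                    3.5                28
[5; 10)           4                    7.5                30
Total             15                                      62.5
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i} = \frac{0 + 4.5 + 28 + 30}{15} = \frac{62.5}{15} \approx 4.2$$
Thus, the common cold lasted a mean of 4.2 days.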
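The same calculation can be sketched in code. The class limits and frequencies are those of Example 3.2; the variable names are illustrative.

```python
# Classes [lower; upper) and frequencies from Example 3.2.
classes = [(0, 1), (1, 2), (2, 5), (5, 10)]
frequencies = [0, 3, 8, 4]

# Each observation is assumed to lie at the midpoint of its class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))  # sum of f_i * m_i
n = sum(frequencies)                                         # sum of f_i
grouped_mean = sum_fm / n

print(midpoints)
print(sum_fm, n, round(grouped_mean, 1))
```

Because the midpoint stands in for every observation in a class, the result is an estimate of the mean of the original data rather than its exact value.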
Under certain conditions, other types of means are used. For example, the geometric
mean is used in economic data to average ratios or rates of change. On some
occasions, when we are dealing with quantities that change over a period of time, we
would like to know the rate of change. Examples may include the mean growth rate of
savings over several years and the ratios of annual price fluctuations. In these
circumstances, the arithmetic mean would be misleading and an alternative measure,
known as the geometric mean should be used.
The geometric mean of n observations is the nth root of their product.
If there are n observations x1, x2, x3, ..., xn, the geometric mean is given by:
$$\text{Geometric mean} = (x_1 \times x_2 \times x_3 \times \cdots \times x_n)^{\frac{1}{n}}$$
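As a minimal sketch, the formula can be written in code as follows. The growth factors used here are hypothetical values chosen for illustration (a factor of 1.10 stands for growth of 10 %); they are not taken from the module guide.

```python
from math import prod


def geometric_mean(values):
    """nth root of the product of n positive observations."""
    n = len(values)
    return prod(values) ** (1 / n)


# Hypothetical annual growth factors: +10 %, +20 % and -10 %.
growth_factors = [1.10, 1.20, 0.90]
average_factor = geometric_mean(growth_factors)

print(f"average growth factor per year: {average_factor:.4f}")
```

Applying the average factor three times in a row gives the same overall change as applying the three individual factors, which is why the geometric mean is preferred to the arithmetic mean for rates of change.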
Example 3.3
The estimated change in population as at 30 June in the five years from 1998 to
2002 for South Africa is shown in Table 3.3:
Solution
$$\text{Geometric mean} = \sqrt[n]{x_1 \times x_2 \times x_3 \times x_4 \times \cdots \times x_n}$$
NOTE:
The mean is a very popular measure of central tendency and is widely used
because it is easy to understand and easy to calculate.
• The mean can only be used for quantitative data: We can only calculate
the mean of numerical, quantitative data. It does not apply to qualitative
data, or even qualitative data that has been numerically coded.
• The mean makes use of all observations: When calculating a mean,
whether it is a sample mean or a population mean, we make use of every
available observation value. This implies that the mean makes use of all
available information to determine the central tendency of the data (in fact,
of the three measures of central tendency, it is the only one that makes use
of all available information).
• Each variable in a data set will have only one mean value: The mean is
unique to a collection of data, in that there can only be one mean value per
variable in an observed data set.
• The mean is sensitive to outliers: Unfortunately, since the mean is based
on all data points, it can easily be influenced by values that have been
erroneously recorded or are simply unusually large or small. These
extreme values are called outliers and they have a tendency to skew the
data distribution. In data sets that have outliers, we would use a different
measure to calculate the value of central tendency. For example, if we
included an eleventh observation in the workdays example (Example 3.1), and
the number of days recorded was 182 (which is very large
compared to the other data entries), we would find that the new dataset mean
would be:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{36+28+33+29+28+32+33+33+34+32+182}{11} = \frac{500}{11} \approx 45.45 \approx 45$$
Clearly, introducing this outlier significantly increases the mean (Davis,
Pecar and Santana, 2014).
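The effect of the outlier can be checked with a short sketch. It uses the workdays data from Example 3.1 and, for comparison, the median, which is one of the other measures of central tendency; the variable names are illustrative.

```python
from statistics import mean, median

days_lost = [36, 28, 33, 29, 28, 32, 33, 33, 34, 32]
with_outlier = days_lost + [182]  # the eleventh, unusually large observation

print("mean without outlier:  ", mean(days_lost))
print("mean with outlier:     ", round(mean(with_outlier), 2))
print("median without outlier:", median(days_lost))
print("median with outlier:   ", median(with_outlier))
```

The mean shifts by more than 13 days when the single outlier is added, whereas the median moves by only half a day, which illustrates why a different measure may be preferred when outliers are present.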
The mode is defined as the value that occurs most frequently in a data set. It can be
used for both quantitative and qualitative data variables.
Finding the entry that appears the most often is all that is required to determine the
mode of ungrouped data. If two entries occur with the same greatest frequency, each
entry is a mode and the data set is said to be bimodal. If no entry is repeated, the
data set has no mode.
Example 3.4
Solution
3 6 4 12 5 7 9 3 5 1 5
The most frequently occurring number is 5 (which occurs three times). Hence the
mode, Mo = 5.
Sometimes, a set of data may have two modes if there are two numbers that appear
more than any of the others and they appear the same number of times. For example,
if the number 3 was added to the above set, the new set would look like:
3 6 4 12 5 7 9 3 5 1 5 3
The numbers 5 and 3 each appear three times. In this particular case, the modes are
Mo = 5 and Mo = 3.
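The rules for the mode of ungrouped data can be sketched as follows. The function name modes is illustrative; it returns an empty list when no entry is repeated, following the convention stated above, and it assumes that the data set is not empty.

```python
from collections import Counter


def modes(data):
    """Return all modes of a non-empty data set, or [] if no entry is repeated."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # no entry is repeated, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


data = [3, 6, 4, 12, 5, 7, 9, 3, 5, 1, 5]

print(modes(data))        # one mode
print(modes(data + [3]))  # two modes (bimodal)
print(modes([1, 2, 3]))   # no mode
```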
It is not possible to calculate the exact value of the mode of the original data in a
grouped frequency distribution since information is lost when the data are grouped.
However, it is possible to make an estimate of the mode.
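The class with the largest frequency is called the modal class. The
estimate of the mode itself is given by the equation below:
$$M_o = L + \left( \frac{f_{mode} - f_{below}}{2 f_{mode} - f_{below} - f_{above}} \right) C$$
Where:
Mo = Mode.
L = Lower limit of the modal class.
C = Width of the modal class.
$f_{mode}$ = Frequency of the modal class.
$f_{below}$ = Frequency of the class preceding the modal class.
$f_{above}$ = Frequency of the class following the modal class.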
The modal formula weights (‘pulls’) the modal value from the midpoint position
towards the adjacent interval with the higher frequency count. If the interval to the
left of the modal interval (preceding the modal interval) has higher frequency count
than the interval to the right of the modal interval (following the modal interval),
then the modal value is pulled down below the midpoint value, and vice versa,
(Croucher,2016).
Example 3.5
A courier company recorded 30 delivery times (in minutes) to deliver parcels to their
clients from its depot. The data is summarised in the grouped frequency distribution
below. Calculate the mode.
TIME (minutes)   FREQUENCY ($f_i$)
[10; 20)         3
[20; 30)         5
[30; 40)         9
[40; 50)         7
[50; 60)         6
Total            30
Solution
From the frequency distribution, the modal interval (interval with the highest
frequency) is [30; 40) minutes. The midpoint of 35 can be used as an approximate
modal courier delivery time.
To calculate a more representative modal value, apply the modal formula with:
L = 30
C = 10
$f_{mode}$ = 9
$f_{below}$ = 5
$f_{above}$ = 7
$$M_o = L + \left( \frac{f_{mode} - f_{below}}{2 f_{mode} - f_{below} - f_{above}} \right) C = 30 + \frac{9 - 5}{2(9) - 5 - 7} \times 10 = 30 + 6.67 = 36.67 \text{ minutes}$$
Thus, the most common courier delivery time from the depot to customers is 36.67
minutes.
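The modal formula can be sketched in code as shown below. The function name grouped_mode is illustrative. Two assumptions are made that the text does not cover: if the modal class is the first or last class, the missing neighbouring frequency is taken as zero, and if several classes share the highest frequency, the first of them is used.

```python
def grouped_mode(classes, frequencies):
    """Estimate the mode of a grouped frequency distribution."""
    i = frequencies.index(max(frequencies))  # position of the modal class
    lower, upper = classes[i]
    c = upper - lower                        # width of the modal class
    f_mode = frequencies[i]
    # Assumption: a missing neighbouring class has a frequency of zero.
    f_below = frequencies[i - 1] if i > 0 else 0
    f_above = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    return lower + (f_mode - f_below) / (2 * f_mode - f_below - f_above) * c


# Delivery times (minutes) from Example 3.5.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [3, 5, 9, 7, 6]

print(round(grouped_mode(classes, frequencies), 2))
```

Because the class after the modal class (frequency 7) is larger than the class before it (frequency 5), the estimate is pulled above the midpoint of 35.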
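The method of calculating the median from a set of raw data (once it has been
arranged in ascending order of magnitude) is as follows:
If n is the number of observations and n is odd, the median is the value of the
$\left(\frac{n+1}{2}\right)^{th}$ observation. If n is even, the median is the mean of the
$\left(\frac{n}{2}\right)^{th}$ observation and the $\left(\frac{n}{2} + 1\right)^{th}$ observation,
i.e., adding these two values and dividing by two.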
Example 3.6
The manager of a hairdressing salon was interested to know how long her customers
had to wait before getting their hair cut and styled. On a certain day she recorded the
waiting times (in minutes) of 15 randomly chosen customers. The times were as
follows:
14 28 36 15 29 16 9 40 16 21 36 17 4 15 22
Solution
4 9 14 15 15 16 16 17 21 22 28 29 36 36 40
NB: In this case, n =15 (this is an odd number), the middle value is the 8th observation
which is 17 minutes.
Example 3.7
For the data recorded in example 3.6 above, suppose, for some reason that the
manager realized that the last waiting time was not worthy to be recorded. What is the
median waiting time for the 14 observations recorded?
Solution
14 28 36 15 29 16 9 40 16 21 36 17 4 15
4 9 14 15 15 16 16 17 21 28 29 36 36 40
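The median lies at the $\left(\frac{n+1}{2}\right)^{th} = \left(\frac{14+1}{2}\right)^{th} = \left(\frac{15}{2}\right)^{th} = 7.5^{th}$ position, i.e. midway between the 7th and 8th observations:
$$\tilde{x} = \frac{16 + 17}{2} = 16.5$$
Therefore, the median waiting time is 16.5 minutes.
NB: Since n is even, the median is the mean of the $\left(\frac{n}{2}\right)^{th}$ observation and the $\left(\frac{n}{2} + 1\right)^{th}$ observation.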
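Both cases (odd and even n) can be sketched in a single function. The data are the waiting times of Examples 3.6 and 3.7; the function name find_median is illustrative.

```python
def find_median(values):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th observation (list positions start at 0).
        return ordered[middle]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


waiting_times = [14, 28, 36, 15, 29, 16, 9, 40, 16, 21, 36, 17, 4, 15, 22]

print(find_median(waiting_times))       # Example 3.6, n = 15
print(find_median(waiting_times[:-1]))  # Example 3.7, last waiting time removed, n = 14
```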
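$$\tilde{x} = L + \frac{C\left(\frac{n+1}{2} - F_{below}\right)}{f_{med}}$$
Where:
L = The lower limit of the median class.
C = The width of the median class.
n = The sample size (number of observations).
$f_{med}$ = The frequency of the median class.
$F_{below}$ = The cumulative frequency below the median class (i.e. the number of
observations less than the lower bound of the ‘median class’).
$F_{below}$ can also be described as the cumulative frequency count of all intervals before
the median class interval.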
Example 3.8
Refer to example 3.5 for the problem description and Table 3.5 for the sample data
of 30 delivery times that have been summarised into a grouped frequency distribution.
Total: $\sum_{i=1}^{k} f_i = 30$
Estimate the median delivery time of parcels to clients from this courier company's depot.
Solution
Since n = 30, the median delivery time will lie in the $\left(\frac{30+1}{2}\right)^{th} = 15.5^{th}$ ordered data
position. The 15.5th data position falls in the [30; 40) minutes interval. An approximate
median delivery time for parcels is therefore 35 minutes (the interval midpoint).
However, a more representative median value can be found by using the formula of
the median for grouped frequency distribution data where:
L = 30 minutes
C = 10 minutes
n = 30 (sample size, i.e. the number of observations)
$f_{med}$ = 9 deliveries
$F_{below}$ = 8 deliveries
$$\tilde{x} = L + \frac{C\left(\frac{n+1}{2} - F_{below}\right)}{f_{med}} = 30 + \frac{10\left(\frac{30+1}{2} - 8\right)}{9} = 30 + 8.33 = 38.33 \text{ minutes}$$
Thus, the median parcel delivery time is 38.33 minutes. This means that half the
deliveries occurred within 38.33 minutes while the other half took longer than 38.33
minutes.
Quartiles divide data into four equal parts and are often used with scores for aptitude
tests, examinations and other testing situations. They are also used in commerce and
industry when a large number of observations are involved.
If the data are divided into four equal parts, the points of separation are:
1. First quartile — Q1
2. Second quartile — Q2
Q2 is the median, and hence the former name is seldom
used.
3. Third quartile — Q3
There are 75% of observations below Q3 and 25% above Q3 .
Q3 is often called the upper quartile.
The following rules should be applied when calculating quartiles in small
samples:
Let n = the number of observations. The positions of the quartiles are given
as follows:
Q1 is the $\left(\frac{n+1}{4}\right)^{th}$ position
Q2 is the $\left(\frac{2(n+1)}{4}\right)^{th}$ position
Q3 is the $\left(\frac{3(n+1)}{4}\right)^{th}$ position
To determine these positions, first sort the data in ascending order. The values
of the positions can take either whole numbers or decimal numbers.
4. If the value of the position is a whole number, simply take the observation
with that position.
5. Suppose the positions for Q1, Q2 and Q3 are not a whole number. Then
Similarly, the percentiles divide a sample into 100 equal-sized groups and the xth
percentile is defined as the value in the data set such that x% of the observations are
smaller than it, and (100 - x)% of the observations are larger than it. It is denoted by
Px. For example, the 33rd percentile is denoted by P33.
The quartiles (and median) are simply special cases of percentiles. For example, the
50th percentile is the same as the median (or second quartile), i.e., P50 = Q2 = $\tilde{x}$.
As we can express all of these values in terms of percentiles, we will only focus on
calculating the percentile values and then use the above relationships to determine
To determine the value of the xth percentile in a data set, it is important to first sort or
order the data set from the smallest value to the largest value. Once sorted, we can
obtain the xth percentile, Px, using the expression:
$P_x$ = the $\left[\frac{x}{100}(n+1)\right]^{th}$ value in the ordered data set.
For example, in a sample of n = 30, the 15th percentile, P15, will be the
$\frac{15}{100}(30 + 1) = 4.65^{th}$ value in the ordered data set.
As with the median, the index value $\frac{x}{100}(n+1)$ can result in a fraction. In order to
determine the correct value associated with the index, you will need to carry out further
steps.
The index value $\frac{x}{100}(n+1)$ is a fractional number that can be broken down into
two parts:
• A ‘whole number’ part (for example 0,1,2,3,… etc.) denoted by w.
• A ‘decimal’ part (for example, 0.12, 0.67, 0.78, etc.) denoted by d.
For example, the index number 7.85 can be broken up into w = 7 and d = 0.85 (note
that w + d = 7 + 0.85 = 7.85). We can use this partitioning of the index value to find the
percentile value for a set of ordered data values X1, X2, X3, …, Xn as follows.
If the index number $\frac{x}{100}(n+1)$ is equal to a fraction with value w + d, then the
$\left[\frac{x}{100}(n+1)\right]^{th}$ value in the ordered data set (i.e. the xth percentile) is given by the following:
$$P_x = X_w + d \times (X_{w+1} - X_w)$$
Where:
w and d = the whole number and decimal parts of the index number $\frac{x}{100}(n+1)$,
respectively.
Example 3.9
Solution
First the data is ordered from the smallest value to the largest value.
24 27 36 48 52 52 53 53 59 60 85 90 95
We obtain Q1 from this sorted data set by first calculating the index
$\frac{25}{100}(n+1) = \frac{25}{100}(13+1) = \frac{25}{100}(14) = 3.5$.
Note that we can express this index in terms of a whole number, w = 3, and the decimal
number d = 0.5.
Now, since w = 3 and w + 1 = 4, we must find the values $X_w = X_3 = 36$ and
$X_{w+1} = X_4 = 48$. From the formula, the 25th percentile is then:
$$Q_1 = P_{25} = X_w + d \times (X_{w+1} - X_w) = X_3 + 0.5 \times (X_4 - X_3) = 36 + 0.5 \times (48 - 36) = 42$$
Similarly, to obtain the 33rd percentile, we first determine the index value
$\frac{33}{100}(n+1) = \frac{33}{100}(13+1) = \frac{33}{100}(14) = 4.62$. The index number expressed in terms of a
whole number and a decimal number is w + d = 4 + 0.62, i.e. w = 4 and d = 0.62. Note that
$X_w = X_4 = 48$ and $X_{w+1} = X_5 = 52$. Then the 33rd percentile is:
$$P_{33} = X_w + d \times (X_{w+1} - X_w) = X_4 + 0.62 \times (X_5 - X_4) = 48 + 0.62 \times (52 - 48) = 50.48$$
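The percentile procedure can be sketched in code as follows. The function name percentile is illustrative. One assumption goes beyond the text: when the index falls before the first observation or at or beyond the last one, the function simply returns the smallest or largest value, since there is no neighbouring value to interpolate towards.

```python
def percentile(values, x):
    """xth percentile of raw data using the index (x / 100) * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    index = x / 100 * (n + 1)
    w = int(index)   # whole-number part of the index
    d = index - w    # decimal part of the index
    # Assumption: clamp to the extremes when the index falls outside 1..n.
    if w < 1:
        return ordered[0]
    if w >= n:
        return ordered[-1]
    # X_w is ordered[w - 1] because list positions start at 0.
    return ordered[w - 1] + d * (ordered[w] - ordered[w - 1])


data = [24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95]

print(percentile(data, 25))            # first quartile, Q1
print(round(percentile(data, 33), 2))  # 33rd percentile
print(percentile(data, 50))            # median, Q2
```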
To estimate the value of the percentile, Px, begin by first determining the ‘percentile
class’, i.e., the class that contains the $\left[\frac{x}{100}(n+1)\right]^{th}$ value of the data set. Once you have
found the percentile class, you can use the formula below to estimate the value of the
xth percentile in the frequency table.
$$P_x = L + \left( \frac{\frac{x}{100}(n+1) - F_{below}}{f_x} \right) \times C$$
Example 3.10
To calculate the 10th percentile for Example 3.8, we find that x = 10, n = 30 and the
position of the 10th percentile is the $\left[\frac{10}{100}(30+1)\right]^{th} = 3.1^{th}$ number in the data set. The
percentile class is then [20; 30) and the elements of the formula above are then L = 20,
C = 10, $F_{below} = 3$ and $f_x = 5$. The 10th percentile is then:
$$P_{10} = L + \left( \frac{\frac{x}{100}(n+1) - F_{below}}{f_x} \right) \times C = 20 + \left( \frac{\frac{10}{100}(30+1) - 3}{5} \right) \times 10 = 20.2$$
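The grouped percentile formula can be sketched in code as shown below. The function name grouped_percentile is illustrative. It assumes that a position exactly equal to a cumulative frequency belongs to the class that ends there, a boundary convention the text does not specify. Since the median is the 50th percentile, the same function also reproduces the median of Example 3.8.

```python
def grouped_percentile(classes, frequencies, x):
    """Estimate the xth percentile of a grouped frequency distribution."""
    n = sum(frequencies)
    position = x / 100 * (n + 1)
    f_below = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, frequencies):
        # Assumption: a position equal to the cumulative frequency falls in this class.
        if f > 0 and position <= f_below + f:
            return lower + (position - f_below) / f * (upper - lower)
        f_below += f
    return classes[-1][1]  # position lies beyond the last observation


# Delivery times (minutes) from Example 3.5.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [3, 5, 9, 7, 6]

print(round(grouped_percentile(classes, frequencies, 10), 2))  # 10th percentile
print(round(grouped_percentile(classes, frequencies, 50), 2))  # median
```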
A frequency distribution can assume any one of a large number of shapes. The
skewness or the shape of distributions can be measured by comparing the relative
positions of the mean, median and mode.
The most common shapes of a frequency distribution are:
• Symmetric.
• Uniform (or rectangular).
• Skewed.
A symmetric distribution is identical on both sides of its central point. If the distribution
is symmetrical:
A skewed distribution is non-symmetrical with the tail on one side longer than the
tail on the other side. It can be skewed to the left or to the right.
Previously, the concept of central location was introduced. The variability among data
is one characteristic to which averages are not sensitive. It is possible to have two
datasets with identical measures of central location but with very different spreads of data.
Once again, the level of measurement will determine how we gauge the spread of a
distribution. Table 3.6 summarises the levels of measurement and their measures of
dispersion.
The best ways to analyse the spread of the distribution for each level of measurement
are as follows:
LEVEL OF MEASUREMENT   REPRESENTATION
Nominal                Table or frequency distribution showing frequencies.
Ordinal                Tables/frequency distribution, but choosing a single measure is
                       problematic. Use the interquartile range if a single measure is chosen.
Interval/Ratio         Graphic dispersion; standard deviation, provided cases have an
                       approximately normal distribution.
When there is a possibility that the underlying distribution may not be normal, the
interquartile range is a good alternative.
EXAMPLE 3.11
Consider two groups of data:
Dataset A Dataset B
65 42
66 54
67 58
68 62
71 67
73 77
74 77
77 85
77 93
77 100
Computed measures of central location
Mean = 71.5 Mean = 71.5
Median = 72 Median = 72
Mode = 77 Mode = 77
Although there is no difference in the computed central measures between the two
groups, the scores of dataset B are much more widely scattered than those of dataset
A. The measures that are used to measure dispersion are:
• Range.
• Variance.
• Standard deviation.
• Interquartile range.
• Quartile deviation.
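The comparison in Example 3.11 can be checked with a short Python sketch. It uses only the standard `statistics` module; the helper name `summarise` is an illustrative choice and not part of the module guide.

```python
import statistics

# The two datasets from Example 3.11
dataset_a = [65, 66, 67, 68, 71, 73, 74, 77, 77, 77]
dataset_b = [42, 54, 58, 62, 67, 77, 77, 85, 93, 100]


def summarise(values):
    """Return measures of central location and spread for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "sample_sd": statistics.stdev(values),  # divides by n - 1
    }


summary_a = summarise(dataset_a)
summary_b = summarise(dataset_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```

Both datasets share the same mean, median and mode, but the range and sample standard deviation of dataset B are much larger, which is exactly the point the example makes.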
The simplest measure of dispersion is the range, which is defined as simply the
difference between the largest and smallest values in a set of data.
The range is useful for situations like daily temperature fluctuations or share price
movements. It is considered primitive as it considers only the extreme values which
may not be useful indicators of the bulk of the population.
The formula is:
Range = largest observation − smallest observation
Note: If the largest observation = the smallest observation, the range is zero.
Example 3.12
The observed temperatures (in degrees Celsius) in a certain country for a week
are given in Table 3.8 below:
Calculate:
3.12.1. The mean week temperature
3.12.2. The range of the temperatures
Solution
Standard deviation is the measure of spread most commonly used in statistics when
the mean is used to calculate central tendency. The variance and standard deviation
provide a measure of how dispersed the data values (x) are about the mean value (𝑋̅).
Because of its close links with the mean, the standard deviation can be greatly affected
if the mean gives a poor measure of central tendency.
If we calculated the difference $(X - \bar{X})$ for each data value, some would be positive
and some negative. Thus, if we were to sum all these differences, we would find that
$\sum (X - \bar{X}) = 0$, i.e., the positive and negative values would cancel out. To avoid this,
the differences are squared before they are summed.
The aim is basically to find an ‘average’ measure of how far each observation lies from
the mean of the set of observations. If the actual residuals are used (i.e. retaining their
signs), their mean is always zero and therefore quite useless. If we take the mean of
the absolute values of the residuals, we obtain the mean deviation, with its
troublesome absolute value signs.
An alternative approach is to work with the squares of the residuals. This will
eliminate the effect of the signs, since squares of numbers cannot be negative. A
first step is therefore to find the sum of the squares of the residuals, and then to find
the mean. To take into account the fact that we have squared the residuals, we take
the square root of this mean. The result of this technique is the population
standard deviation, which is denoted by the Greek letter sigma, $\sigma$. In
most cases the standard deviation is derived from the variance ($\sigma^2$), which is given
by:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}} = \sqrt{\sigma^2} = \sqrt{\text{population variance}}$$
Where:
$\sigma^2$ = Population variance.
$x_i$ = ith observation.
$\mu$ = Population mean.
$\sum_{i=1}^{N}$ = Sum of, starting from i = 1 up to i = N.
N = Population size.
The population standard deviation is the square root of the population variance.
NOTE: The square root of a given number is the same as raising that number to
the power of $\frac{1}{2} = 0.5$, i.e. $\sqrt{\sigma^2} = (\sigma^2)^{\frac{1}{2}} = (\sigma^2)^{0.5}$.
In practice, most populations are very large, and it is more common to calculate the
sample standard deviation (denoted by s) rather than the population standard
deviation.
The formula is quite similar to that for calculating the population standard deviation.
However, instead of dividing by N as above, we divide by n-1, i.e. we divide by one
less than the total number of sample observations. For the mean we now use 𝑥̅
instead of 𝜇.
As in the population case, the sample standard deviation (s) is derived from the
sample variance ($s^2$), which is given by:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Where:
$s^2$ = Sample variance.
$x_i$ = ith observation.
$\bar{x}$ = Sample mean.
$\sum_{i=1}^{n}$ = Sum of, starting from i = 1 up to i = n.
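The simplest way to calculate the variance is to use the statistical mode of your
scientific calculator, recall the values and substitute them into the following
formula:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
This is equivalent to:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$
The sample standard deviation is then $s = \sqrt{s^2} = (\text{sample variance})^{\frac{1}{2}}$.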
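1. Calculate the sample mean $\bar{x}$.
2. Calculate the residuals, i.e. find $(x_i - \bar{x})$ for each observation.
3. Square the residuals, i.e. find $(x_i - \bar{x})^2$.
4. Sum the squared residuals, i.e. find $\sum_{i=1}^{n}(x_i - \bar{x})^2$.
5. Divide this sum by n − 1 to obtain the sample variance, i.e. find $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$.
6. To find the standard deviation, take the square root of the above quantity
(Step 5), i.e. find $\sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$.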
Example 3.13
89 72 77 78 82 94 76 78 73 80 88 85
Calculate the:
Solution
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{89 + 72 + 77 + 78 + 82 + 94 + 76 + 78 + 73 + 80 + 88 + 85}{12} = \frac{972}{12} = 81$$
Therefore the mean price is 81 cents.
3.13.2 To find the range it is wise to first arrange the values in ascending
order as follows:
72 73 76 77 78 78 80 82 85 88 89 94
The range is therefore 94 − 72 = 22 cents.
3.13.3 To calculate the mean deviation and the standard deviation, we use
either of the following two methods:
Method 1:
We construct Table 3.9 as follows:
Table 3.9: Calculation of the standard deviation for raw data using the first method
x_i      x_i − x̄     (x_i − x̄)²
89       8           64
72       -9          81
77       -4          16
78       -3          9
82       1           1
94       13          169
76       -5          25
78       -3          9
73       -8          64
80       -1          1
88       7           49
85       4           16
Total    0           504
That is, $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x})^2 = 504$.
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{64 + 81 + 16 + \dots + 1 + 49 + 16}{12 - 1} = \frac{504}{11} = 45.818$$
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} = \sqrt{s^2} = \sqrt{45.818} = 6.769$$
Method 2:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Add a column of $x_i^2$ as shown in Table 3.10 below:
Table 3.10: Calculation of the standard deviation for raw data using the second method
x_i      x_i²
89       7 921
72       5 184
77       5 929
78       6 084
82       6 724
94       8 836
76       5 776
78       6 084
73       5 329
80       6 400
88       7 744
85       7 225
Total    $\sum_{i=1}^{n} x_i = 972$ and $\sum_{i=1}^{n} x_i^2 = 79\,236$
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{79\,236 - 12(81)^2}{12 - 1} = \frac{504}{11} = 45.818$$
$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}} = \sqrt{s^2} = \sqrt{45.818} = 6.769$$
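As a cross-check of Example 3.13, the following minimal Python sketch computes the sample variance with both methods and confirms that they agree. The variable names are illustrative assumptions; the data are the twelve prices from the example.

```python
import math

# Prices (in cents) from Example 3.13
prices = [89, 72, 77, 78, 82, 94, 76, 78, 73, 80, 88, 85]
n = len(prices)
mean = sum(prices) / n

# Method 1: sum of squared deviations from the mean
ss_method1 = sum((x - mean) ** 2 for x in prices)

# Method 2: sum of squares minus n times the squared mean
ss_method2 = sum(x ** 2 for x in prices) - n * mean ** 2

# Sample variance divides by n - 1, not n
variance = ss_method1 / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance:.3f}, standard deviation = {std_dev:.3f}")
```

Method 2 avoids computing each residual, which is why it suits calculator work, but both routes give the same numerator.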
As with the previous examples, we start by calculating the variance of the grouped
data and then find the square root to obtain the standard deviation.
The formula for estimating the variance from a grouped frequency distribution is
given by:
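$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}$$
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}}$$
or, equivalently:
$$s^2 = \frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{\sum_{i=1}^{k} f_i - 1}$$
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{\sum_{i=1}^{k} f_i - 1}}$$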
Example 3.14
One of the ways in which the market value of a used motor vehicle can be estimated
is by using the prices that vehicles of the same type bring in at auctions. Each vehicle
may be classified to be in one of several conditions (e.g. quite good; good; very
good). Much of the guesswork is taken away from car sale yards and insurance
companies, since all they have to do is consult a book (which is usually updated
each month) to find the market value for any make, model and year of any vehicle
in a particular condition. Suppose that we had the task of estimating the current
market value of a six-year-old small sedan of which 50 had been sold at auction the
previous month.
Solution
3.14.1 We use the equation for calculating the mean from a grouped
frequency distribution. The required calculations are summarized in
Table 3.12 below. Note that the midpoint of each class
interval ($m_i$) is found by taking the mean of the two endpoints
of the interval.
$$\sum_{i=1}^{k} f_i = 50 \qquad \sum_{i=1}^{k} f_i m_i = 462\,500$$
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i} = \frac{462\,500}{50} = 9\,250$$
3.14.2 Method 1:
The required calculations for the standard deviation using the first
method are summarized in Table 3.13 below.
Table 3.13: Calculation of the standard deviation from grouped data using
the first method
Let us calculate the grouped variance using the first method, which is given by the
formula:
$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}$$
From the table, $\sum_{i=1}^{k} f_i (m_i - \bar{x})^2 = 12\,500\,000$; hence,
$$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1} = \frac{12\,500\,000}{50 - 1} = 255\,102.0408$$
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{\sum_{i=1}^{k} f_i - 1}} = \sqrt{\frac{12\,500\,000}{50 - 1}} = \sqrt{s^2} = \sqrt{255\,102.0408} = 505.076$$
Method 2:
To calculate the sample variance using this method, add another column for $m_i^2$ as
shown in Table 3.14.
Table 3.14: Calculation of the standard deviation from grouped data using the second
method
$$\sum_{i=1}^{k} f_i = 50 \qquad \bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i} = 9\,250 \qquad \sum_{i=1}^{k} f_i m_i^2 = 4\,290\,625\,000 \qquad \sum_{i=1}^{k} f_i (m_i - \bar{x})^2 = 12\,500\,000$$
$$s^2 = \frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{\sum_{i=1}^{k} f_i - 1} = \frac{4\,290\,625\,000 - 50(9\,250)^2}{50 - 1} = \frac{12\,500\,000}{49} = 255\,102.0408$$
The sample standard deviation is given by:
$$s = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{\sum_{i=1}^{k} f_i - 1}} = \sqrt{s^2} = \sqrt{255\,102.0408} = 505.076$$
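The grouped-data formulas can be sketched in Python as follows. The class table for Example 3.14 is not reproduced in this text, so the midpoints and frequencies below are hypothetical values, chosen only so that their totals agree with those quoted in the example (50 vehicles, a sum of f·m of 462 500 and a sum of squared deviations of 12 500 000). They should not be read as the original data.

```python
import math

# Hypothetical class midpoints (in rand) and frequencies; NOT the original table.
# They are illustrative values whose totals match those quoted in Example 3.14.
midpoints = [8250, 8750, 9250, 9750, 10250]
frequencies = [3, 13, 18, 13, 3]

n = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
mean = sum_fm / n

# Method 1: weighted sum of squared deviations of midpoints from the mean
ss_method1 = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints))

# Method 2: weighted sum of squared midpoints minus n times the squared mean
sum_fm2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))
ss_method2 = sum_fm2 - n * mean ** 2

variance = ss_method1 / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance:.4f}, standard deviation = {std_dev:.3f}")
```

Because grouped calculations replace every observation by its class midpoint, the result is an estimate of the variance rather than the exact value for the raw data.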
Example 3.15
Suppose that the mean and standard deviation of last year's mid-term test marks
are 70 and 5, respectively. If the histogram is bell-shaped, then we know that:
• Approximately 68% of the marks fell between 65 and 75 (within 1 standard
deviation of the mean, 70 ± 5).
• Approximately 95% of the marks fell between 60 and 80 (within 2 standard
deviations of the mean, 70 ± (2 × 5)).
• Approximately 99.7% of the marks fell between 55 and 85 (within 3 standard
deviations of the mean, 70 ± (3 × 5)).
The standard score (z-score) represents the number of standard deviations a given
value (x) falls from the mean (µ).
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$
EXAMPLE 3.16
In 2007, Forest Whitaker won the Best Actor Oscar at age 45 for his role in the movie
The Last King of Scotland. Helen Mirren won the Best Actress Oscar at age 61 for
her role in The Queen. The mean age of all best actor winners is 43.7, with a
standard deviation of 8.8. The mean age of all best actress winners is 36, with
standard deviation of 11.5, (Larson and Farber, 2019).
Find the z-score that corresponds to the age for the actor and actress. Then
compare the results.
Solution
• Forest Whitaker
$$z = \frac{x - \mu}{\sigma} = \frac{45 - 43.7}{8.8} = 0.15$$
This is 0.15 standard deviations above the mean.
• Helen Mirren
$$z = \frac{x - \mu}{\sigma} = \frac{61 - 36}{11.5} = 2.17$$
This is 2.17 standard deviations above the mean.
The z-score corresponding to the age of Helen Mirren is more than two standard
deviations from the mean, so it is considered unusual. Compared to other Best
Actress winners, she is relatively older, whereas the age of Forest Whitaker is only
slightly higher than the average age of other Best Actor winners, (Larson and Farber,
2019).
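The two z-scores in Example 3.16 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `z_score` and the cut-off of two standard deviations used to flag an unusual value are illustrative, the latter following the wording of the example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


z_whitaker = z_score(45, mean=43.7, std_dev=8.8)
z_mirren = z_score(61, mean=36, std_dev=11.5)

for name, z in (("Forest Whitaker", z_whitaker), ("Helen Mirren", z_mirren)):
    verdict = "unusual" if abs(z) > 2 else "not unusual"
    print(f"{name}: z = {z:.2f} ({verdict})")
```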
$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100\%$$
Where:
$\bar{x}$ = Sample mean.
s = Sample standard deviation.
EXAMPLE 3.17
Calculate the coefficient of variation if the mean price of a sample of pet food is
R 9 250 and the standard deviation is R 505.076.
Solution
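One way to work this example through is the short Python sketch below, which applies the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The function name is an illustrative assumption.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


cv = coefficient_of_variation(std_dev=505.076, mean=9250)
print(f"Coefficient of variation = {cv:.2f}%")
```

The standard deviation is therefore roughly 5.46% of the mean, which indicates a fairly small spread relative to the average price.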
Excel functions and the Data Analysis add-in can be used to compute all the
descriptive statistical measures of location, non-central location, spread and shape.
For example:
=GEOMEAN (1.12,1.08,1.16) = 1.1195 (i.e. an 11.95% average increase).
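For readers working outside Excel, the same geometric mean can be sketched in Python with the standard library function `statistics.geometric_mean` (available from Python 3.8). The growth factors are those used in the GEOMEAN illustration above.

```python
import statistics

# Growth factors for three periods: 12%, 8% and 16% increases
growth_factors = [1.12, 1.08, 1.16]

geo_mean = statistics.geometric_mean(growth_factors)
average_increase = (geo_mean - 1) * 100

print(f"Geometric mean = {geo_mean:.4f} ({average_increase:.2f}% average increase)")
```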
The QUARTILE function can be used to compute all the values of the five-number
summary (i.e., min, Q1, median, Q3, max) by assigning different numeric codes to the
‘quart’ term in the function as follows: 0 = minimum data value; 1 = Q1; 2 = median; 3
= Q3; and 4 = maximum data value.
The Descriptive Statistics option in the Data Analysis add-in as shown in Figure 3.6
will compute all the above descriptive measures, except the quartiles.
Conclusion
This chapter discussed two categories of data descriptors, namely measures of central
tendency and measures of dispersion. In the former category were the mean, median
and mode; in the latter were the range, interquartile range and standard deviation.
Measures of central tendency describe a typical or representative score, while
measures of variability describe the spread or dispersion of scores about a central
measure. Most descriptive measures can be computed in Excel by using either
appropriate function keys or the Descriptive Statistics option in the Data Analysis add-
in.
Self-Assessment Questions
3.1. A supporter of bicycle lanes around the city recorded the number of
bicycles that passed by her house between 7 am and 9 am for each of 10
successive weekdays. The results were:
15 18 23 22 21 18 14 20 25 12
3.1.1 Find the mean, median and range of these data.
3.1.2 Find the mean deviation.
3.1.3 Find the standard deviation.
3.2 The systolic blood pressure (in mm) was recorded for 40 salespeople at 5 pm
on a Friday afternoon. The results are shown in Table 3.15 below:
3.3. A set of data has a standard deviation 1.5 times that of the mean. What is
the value of the coefficient of variation?
3.4. A company secretary has recorded the amount of petty cash used by
the records department for each of the past 10 weeks. The amounts were (in
dollars):
3.4.1 Range.
3.4.2 Mean deviation.
3.4.3 Interquartile range.
3.4.4 Quartile deviation.
3.4.5 Standard deviation.
3.4.6 Coefficient of variation.
58 50 33 51 38 43 60 55 46 43
51 47 40 37 43 48 61 55 44 35
CHAPTER FOUR:
ELEMENTARY PROBABILITY
Learning Outcomes
We are now going to move away from the summary and analysis of data and look
at a new topic, probability. ‘The likelihood of rain this afternoon is fifty percent’ warns
the weather report from your radio alarm clock. ‘There’s no chance of you catching
that bus’ grunts the helpful soul as you puff up the hill. The headline on your
newspaper screams ‘Odds of Rainbow Party winning the election rises to one in
four’. There are a number of ways of analysing uncertainty. Underlying all of these
methods is, however, one concept: probability. An understanding of the concept of
probability is vital if you are to take account of uncertainty.
From Figure 4.1 we observe that the probability values lie between 0 and 1, with 0
representing there is no possibility of the event occurring and 1 representing the
probability that the event is certain to occur.
In reality, the value of the probability will lie between 0 and 1. In order to determine
the probability of an event occurring data has to be obtained. This can be achieved
through, for example, experience or observation, or empirical methods.
The procedure or situation that produces a definite result (or outcome) is termed a
random experiment. Tossing a coin, rolling a dice, recording the income of a factory
worker, and determining defective items on an assembly line are all examples of
experiments. The characteristics of the random experiment are:
The set of all possible outcomes is defined as the sample space. For example, the
experiment of rolling a dice could produce the outcomes 1, 2, 3, 4, 5, and 6, which
would thus define the sample space.
Another basic notion is the concept of an event, which is simply a set of possible
outcomes. This implies that an event is a subset of the sample space. For example,
in the experiment of rolling a dice, the event of obtaining an even number would
be defined as the subset {2, 4, 6}.
Finally, two events are said to be mutually exclusive if they cannot occur together.
Thus, in rolling a dice, the event ‘obtaining a two’ is mutually exclusive of the event
‘obtaining a three’. The event ‘obtaining a two’ and the event ‘obtaining an even
number’ are not mutually exclusive as both can occur together, i.e. {2} is a subset
of {2, 4, 6}.
Definitions
Any probability can be dissected into its basic elements. Data can be obtained through,
for instance, experimentation, observation, or experience. A random experiment is a
process or circumstance that produces clearly defined results, which are referred to as
outcomes. The word “experiment” may be deceptive in this context since it can refer to
a wide range of methods or circumstances in which the results are not known in
advance. It does not relate only to experiments carried out by researchers.
Definition
An exhaustive list of all possible outcomes of a certain type for a given experiment is
called the sample space of the experiment. We use the symbol Ω to denote this
complete sample space. For example, consider a simple random experiment in which
we toss 2 coins. The outcome measured in the experiment is the pair of faces showing
on the two coins after they have landed. With H denoting a head and T a tail, the
complete set of possible outcomes for this experiment is given by:
Ω = {HH, HT, TH, TT}
4.3.2 An event
Suppose the event we are interested in is whether at least one head appeared on the
two coins. This event can be expressed as follows:
A = {HH, HT, TH}
This event contains three of the four outcomes that appear in the sample space Ω.
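If A is an event, the probability that it occurs is denoted by P(A). When all outcomes
are equally likely, the probability (or chance) that an event A occurs is the proportion
of possible outcomes in the sample space that yield the event A. That is:
$$P(A) = \frac{\text{Number of outcomes that yield event } A}{\text{Number of possible outcomes}}$$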
EXAMPLE 4.1
What is the probability of rolling an odd number on a dice (i.e. rolling a 1, 3 or 5)?
Solution
$$P(A) = \frac{\text{Number of outcomes that yield event } A}{\text{Number of possible outcomes}} = \frac{3}{6} = \frac{1}{2}$$
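The counting definition of probability can be illustrated with a small Python sketch, assuming all outcomes in the sample space are equally likely. The helper `probability` is a hypothetical name used only for this illustration; it is applied to the dice of Example 4.1 and to the toss of two coins.

```python
from fractions import Fraction
from itertools import product


def probability(event, sample_space):
    """Share of equally likely outcomes in sample_space that belong to event."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))


# Example 4.1: rolling an odd number on a six-sided dice
dice = {1, 2, 3, 4, 5, 6}
odd = {1, 3, 5}
p_odd = probability(odd, dice)

# Tossing two coins: the event "at least one head"
two_coins = set(product("HT", repeat=2))
at_least_one_head = {outcome for outcome in two_coins if "H" in outcome}
p_head = probability(at_least_one_head, two_coins)

print(p_odd, p_head)
```

Using `Fraction` keeps the answers as exact fractions, which matches how the worked examples are written.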
EXAMPLE 4.2
Solution
In a deck of 52 cards, there are 4 aces. Of those 4 aces, just one is the
ace of spades. Let A be the event of drawing the ace of spades. Then:
$$P(A) = \frac{1}{52}$$
Therefore, the probability of drawing ‘an ace of spades’ is $\frac{1}{52}$.
Two events A and B are said to be mutually exclusive if they cannot occur
simultaneously (at the same time).
[Figure 4.2: Venn diagram of the sample space S containing two non-overlapping events, Y and N]
As you can see from Figure 4.2 above, Y and N are mutually exclusive events,
i.e., Y and N, have no points in common.
Example
A coin is tossed. If events A and B are defined as follows:
A = Outcome is a head.
B = Outcome is a tail.
Are events A and B mutually exclusive?
Solution
When you toss a coin there are just two possibilities at a time. That is, one can get
a head or tail and never both on one toss.
Since A and B cannot occur at the same time, these events are mutually exclusive.
EXAMPLE 4.3
A person is selected at random from the population. Define events A and B to be:
A = Person is a vegetarian.
B = Person is a non-vegetarian.
Are these two events mutually exclusive?
Solution
Once a person is said to be vegetarian, then already we cannot call the same person
a non-vegetarian. In other words, we cannot call one person a vegetarian and a non-
vegetarian at the same time.
Since A and B cannot occur at the same time, then the events are mutually
exclusive.
Suppose that $A_1, A_2, A_3, \dots, A_n$ are n mutually exclusive events. Then the following
relationship holds:
$$P(A_1 \text{ or } A_2 \text{ or } A_3 \text{ or } \dots \text{ or } A_n) = P(A_1) + P(A_2) + P(A_3) + \dots + P(A_n)$$
EXAMPLE 4.4
Solution
The events A and B are clearly mutually exclusive since they cannot both occur
at the same time.
We have:
$$P(A) = \frac{12}{52} = \frac{3}{13}$$
$$P(B) = \frac{16}{52} = \frac{4}{13}$$
Hence,
$$P(A \text{ or } B) = P(A) + P(B) = \frac{3}{13} + \frac{4}{13} = \frac{7}{13}$$
Therefore, the probability that the card is a face card or has a points value less than six is $\frac{7}{13}$.
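The addition rule for mutually exclusive events can be checked by enumerating a deck of cards in Python. This is a sketch under stated assumptions: A is taken to be "the card is a face card" and B "the card has a points value less than six", and B is assumed to cover the ranks 2 to 5 only (the ace is not counted), which is the reading consistent with the 16 cards used in the solution.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Event A: face cards. Event B: ranks 2 to 5 (assumption: the ace is excluded).
event_a = {card for card in deck if card[0] in {"J", "Q", "K"}}
event_b = {card for card in deck if card[0] in {"2", "3", "4", "5"}}

# The simple addition rule only applies when the events cannot occur together
mutually_exclusive = event_a.isdisjoint(event_b)

p_a = Fraction(len(event_a), len(deck))
p_b = Fraction(len(event_b), len(deck))
p_a_or_b = p_a + p_b

print(mutually_exclusive, p_a, p_b, p_a_or_b)
```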
EXAMPLE 4.5
B = Card is 7 or 8.
C = Card is a face card.
Calculate P (A or B or C).
Solution
Two events A and B are independent if the occurrence of one does not alter the
likelihood of the other event occurring. Events that are not independent are called
dependent.
Think Point
• When two events are mutually exclusive, they cannot occur simultaneously.
• When two events are statistically independent, they can occur together, but
the occurrence of one does not affect the occurrence of the other, (Wenger,
2017).
EXAMPLE 4.6
Two coins are tossed independently. Find the probability that both coins show
heads.
Solution
EXAMPLE 4.7
Solution
Since these events are independent, to find the probability that all three occur
we use:
$$P(A \text{ and } B \text{ and } C) = P(A) \times P(B) \times P(C) = \frac{1}{2} \times \frac{1}{6} \times \frac{1}{4} = \frac{1}{48}$$
Hence, the probability that all three events will occur is $\frac{1}{48}$.
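The multiplication rule for independent events is easy to sketch in Python. The helper name `prob_all_independent` is an illustrative assumption; it multiplies the individual probabilities, which is valid only when the events are independent.

```python
from fractions import Fraction
from math import prod


def prob_all_independent(probabilities):
    """Probability that every one of several independent events occurs."""
    return prod(probabilities, start=Fraction(1))


# Three independent events with probabilities 1/2, 1/6 and 1/4
p_all_three = prob_all_independent([Fraction(1, 2), Fraction(1, 6), Fraction(1, 4)])

# Two fair coins tossed independently: both show heads
p_two_heads = prob_all_independent([Fraction(1, 2), Fraction(1, 2)])

print(p_all_three, p_two_heads)
```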
The complements of an event are those outcomes of a sample space for which the
event does not occur. Two events that are complements of each other are said to be
EXAMPLE 4.8
If two events A and B are mutually exclusive and are complementary, then:
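$$P(A) + P(B) = 1$$
The conditional probability of A given B is:
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$
or
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
The addition rule for any two events A and B is:
$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$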
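To see the general addition rule and conditional probability at work, here is a minimal Python sketch. The events are hypothetical and chosen only for illustration: a fair six-sided dice is rolled, A is "the number is even" and B is "the number is greater than 3". These events overlap, so the joint probability must be subtracted.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
event_a = {2, 4, 6}  # hypothetical event: number is even
event_b = {4, 5, 6}  # hypothetical event: number is greater than 3


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(event_a & event_b)

# General addition rule: subtract the overlap so it is not counted twice
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability of A given that B has occurred
p_a_given_b = p_a_and_b / p_b

print(p_a_or_b, p_a_given_b)
```

Counting the union directly gives the same answer as the formula, which is a useful sanity check when working by hand.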
Sample spaces and events are often presented in a visual display called a Venn
diagram. While there are several variations as to how these diagrams are drawn,
we will use the following conventions.
1. A sample space is represented by a rectangle.
2. Events are represented by regions within the rectangle. This is usually
done using circles (or parts of circles).
The union of two events A and B is the set of all outcomes that are in event A or
event B . An example of a union event is shown in diagram 4.4 below:
The intersection of two events A and B is the set of all outcomes that are in both
event A and event B.
Venn diagrams are used to assist in presenting a picture of the union and
intersection of events, and in the calculation of probabilities.
EXAMPLE 4.9
Solution
(ii)
(iii)
(iv)
(v)
(vi)
There is a reverse way of looking at probabilities. Instead of asking what the probability
is that an isolated event may occur, one is interested in the number of alternative ways
in which a set of events may turn out. Of course, if the total number n of equally likely
ways in which an event can turn out is known, then the probability that one particular
way turns out is 1/n. Consider the twenty-six letters of the alphabet. One may be
interested in the number of alternative ways of arranging them. There are
26! ≈ 4.03 × 10^26 ways of arranging these letters. The order which we have memorized
since we were small is just one of these. The probability that such an order is selected
at random is therefore about 1/(4.03 × 10^26). Such problems and many more are
One way of approaching simpler problems of this kind is by means of decision trees.
This involves dividing the problem into stages (or events) and drawing the
interconnections between them. Suppose, for example, that a car assembly line
produces cars with three engine capacities: 1100cc, 1500cc and 2000cc. For each
engine capacity there are three colours (yellow, blue and cream), and for each colour
there are three body types (saloon, pick-up and station wagon). One may wonder what
the probability is of obtaining a car that is 1500cc, yellow and a saloon. This is a typical
simple problem which can be solved by means of a decision tree, as shown in Figure 4.6.
One can physically count the end points of the decision tree and find that there are
27 alternatives available. If each alternative is equally likely, we may conclude that the
probability of obtaining one particular path of alternatives, such as 1500cc, yellow and
then saloon, is exactly 1/27.
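The end points of such a decision tree can be listed mechanically. The sketch below uses `itertools.product` to enumerate every path through the three stages, assuming (as the text does) that each path is equally likely.

```python
from fractions import Fraction
from itertools import product

engines = ["1100cc", "1500cc", "2000cc"]
colours = ["yellow", "blue", "cream"]
bodies = ["saloon", "pick-up", "station wagon"]

# Every end point of the decision tree is one (engine, colour, body) path
branches = list(product(engines, colours, bodies))

target = ("1500cc", "yellow", "saloon")
p_target = Fraction(branches.count(target), len(branches))

print(len(branches), p_target)
```

The number of end points is simply 3 × 3 × 3, which is why decision trees grow unwieldy as more stages are added.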
Decision trees are convenient only for evaluating simple alternatives like the one
given. The more stages of alternatives available, we find that it is impossible to keep
4.8.1 Permutations
This technique can save us from the pain of always drawing decision trees. It answers
questions about how many arrangements of n objects, taken r at a time, are
possible. In permutations the order of arrangement is important. We therefore use
the following formula to compute permutations:
$$P^n_r = \frac{n!}{(n - r)!}$$
The formula reads: the number of permutations of n objects, taken r at a time, is equal
to the ratio of n! to (n − r)!. Here n is the total number of objects, r is the number of
objects selected for each arrangement, and P is the number of permutations, i.e. the
number of alternative ordered arrangements.
EXAMPLE 4.10
Find the number of all possible arrangements of five objects, taking three at a time,
where the objects must be arranged in a sequential manner.
Solution
Here a specific order of arrangement is prescribed, so this is a permutation problem.
We therefore use the formula:
$$P^n_r = \frac{n!}{(n - r)!}$$
Inserting the figures given in the problem, we have:
$$P^5_3 = \frac{5!}{(5 - 3)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{2!} = 5 \times 4 \times 3 = 60$$
This means that you can arrange five objects, taking three objects at a time, in 60
different ways.
Learners should try as many problems as possible to familiarize themselves with the
use of permutations. If every arrangement is equally likely, the probability of obtaining
any one particular arrangement is 1/60 = 0.016667.
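The permutation count in Example 4.10 can be verified in Python, both from the factorial formula and by listing the arrangements. The five object labels are hypothetical placeholders; `math.perm` requires Python 3.8 or later.

```python
import math
from itertools import permutations


def count_permutations(n, r):
    """Ordered arrangements of r objects chosen from n: n! / (n - r)!."""
    return math.factorial(n) // math.factorial(n - r)


objects = ["A", "B", "C", "D", "E"]  # five hypothetical objects

count = count_permutations(5, 3)
arrangements = list(permutations(objects, 3))

print(count, len(arrangements), math.perm(5, 3))
print(arrangements[:3])
```

Note that ("A", "B", "C") and ("B", "A", "C") both appear in the list, because order matters in a permutation.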
4.8.2 Combinations
$$C^n_r = \frac{n!}{r!\,(n - r)!}$$
EXAMPLE 4.11
Find the number of all possible selections of five objects, taking two at a time,
where the objects need not be arranged in any specific order.
Solution
Here no specific order of arrangement is prescribed, so this is a combination problem.
We therefore use the formula $C^n_r = \frac{n!}{r!\,(n - r)!}$. If we now insert the given figures
into this formula, we have:
$$C^5_2 = \frac{5!}{2!\,(5 - 2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1)(3 \times 2 \times 1)} = \frac{5 \times 4}{2} = 10$$
This means that there are ten different ways of selecting two objects from five when
the order of the objects does not matter. If every combination is equally likely, the
probability of obtaining any one particular combination is therefore 1/10 = 0.1.
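A matching Python sketch for combinations is given below. As before, the object labels are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def count_combinations(n, r):
    """Unordered selections of r objects chosen from n: n! / (r! (n - r)!)."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))


objects = ["A", "B", "C", "D", "E"]  # five hypothetical objects

count = count_combinations(5, 2)
selections = list(combinations(objects, 2))

print(count, len(selections), math.comb(5, 2))
print(selections)
```

Here ("A", "B") appears but ("B", "A") does not, because a combination ignores the order in which the objects are chosen.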
4.9 SUMMARY
Self-Assessment Questions
4.2. A six-sided die is rolled 3 times. Find the probability that all 3 outcomes are
greater than 4.
4.3. A box contains 3 red balls, 5 black balls and 8 blue balls.
Find the probability that a ball chosen at random is:
4.3.1 Red.
4.3.2 Blue.
4.3.3 Black.
4.3.4 Blue or red.
4.4.1 A B
4.4.2 A B
4.5. Suppose we randomly select two persons from the members of a club and
observe whether the person selected each time is a man or a woman. Write
all the outcomes for this experiment. Draw the Venn and tree diagrams for
this experiment.
CHAPTER FIVE
LINEAR CORRELATION AND REGRESSION
Learning Outcomes
Sales performance analysis and sales forecasting are two very important activities
within the business management function. Each involves examining the impacts of the
various elements of the marketing mix (price, promotion, place, and product) on the
level of sales volumes. If the relationship between sales volumes and the various
marketing mix elements can be measured and quantified, marketing managers will
have a powerful tool at their disposal to influence sales volumes through their
decisions on pricing.
This unit examines the likely relationship between numeric random variables (e.g. sales
volume and advertising expenditure). Understanding and quantifying such
relationships can considerably enhance the planning function of marketing
management. Regression analysis and correlation analysis are two statistical methods
which collectively describe and measure the strength of possible relationships
between business management related variables.
The marketing mix variables, such as price, in-store promotion expenditure, number
of advertisement insertions, size of advertisements, level of advertising expenditure.
Consumer attributes such as personal disposal income, family size, age of
breadwinner; or such as the various economic indicators (e.g., GNP, CPI,
export/import volumes).
The other random variable is called the dependent variable and is represented by the
symbol y. The dependent variable is assumed to be influenced (or determined) by the
independent variable.
In all the above examples, the independent variable, x is considered to influence the
outcome of the dependent variable, y.
The equation of the straight line that fits through the scatter plot of two variables is
called the regression equation. A regression equation can also take quadratic, cubic or
exponential forms. In this module we will focus on straight-line regression equations,
which are given as follows:
y = a + bx
Where:
y = Dependent variable.
x = Independent variable.
Due to the fact that the regression coefficients a and b will be estimated, the values
of y will also be estimates and are denoted by ŷ instead of y. The regression equation
then becomes:
ŷ = a + bx
The regression coefficients are obtained by the method of least squares (derivations
are beyond the scope of this module) and are given as follows:
$$\hat{y} = a + bx$$
$$a = \bar{y} - b\bar{x}$$
$$b = \frac{S_{xy}}{S_{xx}}$$
or
$$b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$$
The values of a and b found from the above formulae define the best-fit linear
regression line. This means that no other straight line can be found that will give a
better fit than the regression line.
EXAMPLE 5.1
Music Centre, an electronics retail company in Durban, has kept records of the number
of Hi-Fi systems sold within a week of placing advertisements in local newspapers.
Table 5.2 shows the number of Hi-Fi systems sold each week and the corresponding
number of advertisements placed in local newspapers for 12 periods.
Table 5.2: The number of Hi-Fi sold and the number of number of advertisements placed in
local newspapers
Solution
Identify the dependent and independent variables first from the problem description.
In this example, sales of hi-fi systems are dependent upon the number of
advertisements placed. Hence sales of hi-fi systems are the dependent variable $y_i$,
and the number of advertisements placed is the independent variable $x_i$.
a) Scatter plot
Figure 5.1: Scatter plot for the number of Hi-Fi systems sold and the number of
advertisements placed in local newspapers
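b) Table
$$\bar{x} = \frac{\sum x_i}{n} = \frac{48}{12} = 4$$
$$\bar{y} = \frac{\sum y_i}{n} = \frac{336}{12} = 28$$
$$S_{xy} = \sum xy - n\bar{x}\bar{y} = 1393 - 12(4)(28) = 49$$
$$b = \frac{S_{xy}}{S_{xx}} = \frac{49}{10} = 4.9$$
$$a = \bar{y} - b\bar{x} = 28 - 4.9(4) = 8.4$$
$$\hat{y} = a + bx = 8.4 + 4.9x$$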
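The regression coefficients for Example 5.1 can be reproduced from the totals quoted in the solution. The sketch below is a minimal illustration: the raw data table is not repeated here, so it starts from the sums (n = 12, sum of x = 48, sum of y = 336, sum of xy = 1393) and takes the value 10 for S_xx directly from the worked solution. The prediction for five advertisements is a hypothetical input chosen for illustration.

```python
from fractions import Fraction

# Totals quoted in the worked solution of Example 5.1
n = 12
sum_x, sum_y, sum_xy = 48, 336, 1393
s_xx = 10  # taken from the solution: sum of x squared minus n times x_bar squared

x_bar = Fraction(sum_x, n)
y_bar = Fraction(sum_y, n)

s_xy = sum_xy - n * x_bar * y_bar
b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept


def predict(x):
    """Estimated number of Hi-Fi systems sold for x advertisements."""
    return a + b * x


print(f"y_hat = {float(a)} + {float(b)}x")
print(f"Estimate for 5 advertisements: {float(predict(5))}")
```

Exact fractions are used so that the intercept comes out as exactly 8.4 rather than a rounded floating-point value.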
[Plot of Hi-Fi systems sold against number of advertisements placed, with fitted line y = 8.4 + 4.9x]
Figure 5.2: Fitted regression line for the number of Hi-Fi systems sold and the number of
advertisements placed in local newspapers
A business manager can have great confidence in estimates based on the regression
line if there is a strong relationship between the x and y random variables. A
strong relationship will produce a more accurate and reliable estimate of y, meaning
that the estimated y value for a given value of x is likely to be close
to the actual y value.
There are two types of correlation coefficients namely Pearson’s correlation coefficient
and Spearman’s rank correlation coefficient. Their interpretation is the same.
Figure 5.3 below gives a rough guide to the interpretation of the correlation
coefficients:
Any interpretation should take the following two points into account:
• A low correlation does not necessarily imply that the variables are unrelated,
but simply that the relationship is poorly described by a straight line.
• A correlation does not necessarily imply a cause-and-effect relationship,
merely an observed association.
The following diagrams illustrate how scatter plots can be used to interpret the
correlation coefficient.
Figure 5.4 below shows a perfect positive linear correlation ( r = +1 ). All the data points
of a scatter plot will lie on a positive straight line.
Figure 5.5 shows a perfect negative linear correlation r = -1. All the data points will
again lie on a straight line, but in an inverse direction (i.e. as x increases, y decreases
and vice versa).
Figure 5.6 shows positive (direct) linear correlation 0 < r < +1 with r approaching +1.
An increase (decrease) in x results in an increase (decrease) in y (a direct
relationship).
g) No linear correlation
Fig 5.10 shows no linear correlation ( r = 0). The values of x are of no value in
estimating values of y. The data points are randomly scattered. If such a scenario
exists between sales and a particular marketing variable, a marketer should seek other
numerically scaled marketing mix, consumer behaviour, financial or economic
variables x that are more likely to have an association with the dependent variable y.
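$$r_p = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$$
or
$$r_p = \frac{S_{xy}}{\sqrt{S_{xx} \times S_{yy}}}$$
Where:
$$S_{xy} = \sum xy - n\bar{x}\bar{y}$$
$$S_{xx} = \sum x^2 - n\bar{x}^2$$
$$S_{yy} = \sum y^2 - n\bar{y}^2$$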
Pearson’s correlation coefficient formula is derived from the least squares regression
approach; hence its formula has similar terms to the regression coefficients. The
calculation of Pearson’s correlation coefficient and its interpretation is illustrated in the
following two self-assessments.
EXAMPLE 5.2
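Using the information given in Table 5.2 from Example 5.1 for the number of Hi-Fi systems
that Music Centre can expect to sell in a given week, calculate Pearson’s correlation
coefficient.
Solution
Now:
$$S_{xy} = \sum xy - n\bar{x}\bar{y} = 1393 - 12(4)(28) = 49$$
$$r = \frac{S_{xy}}{\sqrt{S_{xx} \times S_{yy}}} = \frac{49}{\sqrt{10 \times 370}} = 0.8056$$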
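The arithmetic in Example 5.2 can be checked with a short script. The following Python sketch is an illustration only: the function name `pearson_r_from_sums` and its parameter names are assumptions made for this example, while the numbers are the summary values quoted in the solution (n = 12, mean of x = 4, mean of y = 28, sum of xy = 1393, S_xx = 10, S_yy = 370).

```python
import math


def pearson_r_from_sums(n, mean_x, mean_y, sum_xy, s_xx, s_yy):
    """Pearson's r from summary totals: r = S_xy / sqrt(S_xx * S_yy)."""
    # S_xy = sum(xy) - n * mean(x) * mean(y)
    s_xy = sum_xy - n * mean_x * mean_y
    return s_xy / math.sqrt(s_xx * s_yy)


# Summary values quoted in Example 5.2
r = pearson_r_from_sums(n=12, mean_x=4, mean_y=28, sum_xy=1393, s_xx=10, s_yy=370)
print(round(r, 4))
```

Working from the summary totals rather than the raw observations mirrors the hand calculation, so each intermediate value can be compared with the worked solution.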
Regression and correlation analysis can be performed in Excel using the Regression
option in the Data Analysis add-in module. The output for Example 5.1 is shown in
Figure 5.11, with the regression equation coefficients (a and b), the correlation
Regression Statistics
Multiple R           0.805555
R Square             0.648919
Adjusted R Square    0.613811
Standard Error       3.604164
Observations         12

ANOVA
              df    SS       MS       F          Significance F
Regression    1     240.1    240.1    18.48345   0.001563
Residual      10    129.9    12.99
Total         11    370

               Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   Lower 95.0%   Upper 95.0%
Intercept      8.4            4.676163         1.796345   0.102662   -2.01914    18.81914    -2.01914      18.81914
X Variable 1   4.9            1.139737         4.299238   0.001563   2.360508    7.439492    2.360508      7.439492
Figure 5.11: Regression output for Example 5.1 using the Data Analysis add-in
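As a cross-check on the Excel output in Figure 5.11, the main figures can be recomputed from the summary values used in Example 5.2 (S_xy = 49, S_xx = 10, S_yy = 370, n = 12, mean of x = 4, mean of y = 28). The Python sketch below is illustrative only; the variable names are assumptions and do not correspond to anything Excel produces.

```python
import math

# Summary values from Example 5.2
n, mean_x, mean_y = 12, 4, 28
s_xy, s_xx, s_yy = 49, 10, 370

b = s_xy / s_xx                      # slope ("X Variable 1")
a = mean_y - b * mean_x              # intercept
ss_regression = b * s_xy             # explained variation
ss_residual = s_yy - ss_regression   # unexplained variation (Total SS = S_yy)
r_squared = ss_regression / s_yy     # coefficient of determination
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
ms_residual = ss_residual / (n - 2)  # residual mean square, df = n - 2
standard_error = math.sqrt(ms_residual)
f_stat = ss_regression / ms_residual  # regression df = 1

print(round(a, 4), round(b, 4))
print(round(r_squared, 6), round(adj_r_squared, 6))
print(round(standard_error, 6), round(f_stat, 5))
```

Each printed value should agree with the corresponding entry in the regression output, which is a useful habit when reading software output for the first time.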
A scatter plot between x and y can also be produced using the Chart – Scatter option
in the Insert tab. A scatter plot of the x–y data in Example 5.1 is shown in Figure 5.12.
Figure 5.12: Scatter plot for Example 5.1 (x-axis: Number of insertions; fitted linear trendline with R² = 0.6489)
In addition, the regression equation can be computed and superimposed on the scatter
plot, together with the coefficient of determination. To do this, right-click on any scatter
point, select Add Trendline from the drop-down option list. Then select the Linear
option, and tick the boxes Display Equation on the chart and Display R-squared on the
chart. The graph, regression equation and R² are shown in Figure 5.12. Insert the y-axis
label ‘Number of units sold per week’ and the x-axis label ‘Number of weekly
advertisements placed’.
Conclusion
Self-Assessment Questions
QUESTION ONE
5.1 A property analyst is examining the relationship between the town council’s
valuation of residential properties in Krugersdorp and the market value (selling
price) of the properties. A random sample of 12 properties was examined. The
data is shown in Table 5.4 below:
5.2 The managers for an insurance firm are interested in finding out if the number
of new clients a broker brings into the firm affects the sales generated by the
broker. They sample 12 brokers and determine the number of new clients they
have enrolled in last year and their sales amounts in thousands of Rand. The
data are presented in the table that follows:
1 22 52
2 6 37
3 37 64
4 28 55
5 10 29
6 10 34
7 20 58
8 25 48
9 12 31
10 17 38
CHAPTER SIX
TIME SERIES ANALYSIS AND FORECASTING
Learning Outcomes
A time series is a variable that is measured and recorded at equally spaced intervals
of time. Inflation is a good illustration: it can be tracked on a monthly, quarterly,
or annual basis, and each of the three data sets is a time series. In other words, it
does not matter which time unit we use, as long as the observations are consistent and
sequential. By consistent, we mean that different time units may not be combined (daily
with monthly data, or minute with hourly data, for example). By sequential, we mean that
no data points may be skipped and that no time point may hold a zero or empty value in
place of an observation. If a value is missing, we can attempt to estimate it by finding
the average of the two values closest to it, or by using any other suitable technique.
What does time series analysis seek to accomplish? Predicting a variable's future
movements is the primary goal of most time series analysis techniques. A number of
auxiliary approaches have been developed to determine whether the right forecasting
method has been applied. These are also part of time series analysis, but forecasting
remains the primary goal.
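The idea of estimating a missing value from the two values closest to it can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `fill_single_gaps` is hypothetical, missing observations are marked with `None`, only isolated gaps with a known value on each side are filled, and the inflation figures are invented for the example.

```python
def fill_single_gaps(series):
    """Replace each isolated None with the mean of its two neighbours."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        left, right = series[i - 1], series[i + 1]
        # Only fill when both neighbouring observations are known
        if series[i] is None and left is not None and right is not None:
            filled[i] = (left + right) / 2
    return filled


# Hypothetical monthly inflation rates (%) with one missing month
inflation = [4.0, None, 5.0, 5.5]
print(fill_single_gaps(inflation))
```

Gaps at the very start or end of the series, or runs of several missing values, are deliberately left untouched here, since averaging two neighbours is not defined for them and another technique would be needed.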
Definition
Figure 6.1 illustrates a time series plot, and what jumps out at us immediately is that
one of the time series seems to be moving upwards and the other one is following
some horizontal line. The first is called a non-stationary time series, while the second
one, following a horizontal line, is called a stationary time series.
Definition
Non-stationary time series: a time series that does not have a constant mean;
it oscillates around a moving mean.
Stationary time series: a time series that has a constant mean and
oscillates around this mean.
In general, all time series fall into either the first or the second category. A variety of
methods have been developed to handle either stationary or non-stationary time
series.
Think Point
Visualization and charting of a time series is not an optional extra, but one of the
most essential steps in time series analysis. You can learn a lot about a variable
just by looking at the time series graph.
Time series analysis assumes that environmental forces operate, individually and
collectively, to determine the value of a time series random variable (such as sales or
share price) in any time period. These environmental influences are known as:
• Trend (T).
• Cyclical influences (C).
• Seasonal influences (S).
• Irregular or Random effects (I).
• Population growth.
• Urbanisation.
• Technological improvements.
• Economic advancements and developments.
• Consumer shifts in habits and attitudes.
Trend analysis is the statistical technique used to isolate this underlying long-
term movement.
a) Sales of ice cream will be higher in summer than in winter, and sales of
overcoats will be higher in autumn than in spring.
b) Shops might expect higher sales shortly before Christmas.
c) Sales might be higher on Friday and Saturday than on Monday.
d) The telephone network may be heavily used at certain times of the day (such
as mid-morning and mid-afternoon) and much less used at other times (such as
in the middle of the night).
EXAMPLE 6.1
The number of customers served by a company of travel agents over the past four
years is shown in Table 6.1 below:
Solution
A trend line shows the general direction of the number of customers served. In this
case it shows a general increase in the number of customers, as shown in
Fig 6.2 below:
In this example, there would appear to be large seasonal fluctuations in demand, but
there is also a basic upward trend.
In weekly or monthly data, the cyclical component describes any regular fluctuations.
It is a non-seasonal component which varies in a recognizable cycle. Cyclical
variations are longer term than seasonal variations: they describe
regular fluctuations with a period of more than one year.
The irregular component is what is left over when the other components of the series
(trend, seasonal and cyclical) have been accounted for.
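The components of a time series can be summarized by the following equation, which
is called an additive formula since it is a sum of the components:
Y = T + S + C + I
Where:
Y = the actual value of the time series
T = the trend
S = the seasonal influences
C = the cyclical influences
I = the irregular or random effects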
Though you should be aware of the cyclical component, you will not be expected to
carry out any calculations connected with it. The mathematical model which we will
use, the additive model, therefore excludes any reference to C. The model will
therefore be as follows:
Y = T + S + I
The main problem we are concerned with in time series analysis is how to identify
the trend and seasonal variations.
a) A line of best fit (the trend line) can be drawn on a graph. The concept is the
same as that used for finding a regression line (chapter 5).
b) A statistical technique known as linear regression by the least squares method
can be used (chapter 5).
c) A technique known as the moving average can be employed to make the long-term
trend of a time series clearer by smoothing the data.
Example 6.2
The data below shows the number of sales units recorded by a manufacturing
company manager.
a) What determines the period over which a moving average should be taken?
b) Calculate the moving averages of the annual sales over a period of three years.
c) Draw a time series plot of the original data and the moving averages on the
same axis.
Solution
The moving average period that is most appropriate depends on the circumstances and
the nature of the time series.
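A three-period moving average can be sketched in Python as follows. This is an illustration only: the function name `moving_average` is an assumption, and the seven annual sales figures are hypothetical values, not the data of Example 6.2. Each average is centred on the middle year of its window, so the first and last years of the series have no moving average.

```python
def moving_average(values, period=3):
    """Moving averages over an odd period; one value per complete window."""
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


# Hypothetical annual sales units for seven consecutive years
years = list(range(2000, 2007))
sales = [390, 380, 460, 450, 470, 440, 500]

averages = moving_average(sales, period=3)
centre_years = years[1:-1]  # each average sits at the middle year of its window

for year, avg in zip(centre_years, averages):
    print(year, round(avg, 1))
```

With seven observations and a period of three, five averages are produced, one for each year from the second to the second-last. Even periods (for example four quarters) need an extra centring step that this sketch does not attempt.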
The time series plot of the original data and the moving averages is shown in
Figure 6.2 as follows:
Figure: Time series plot of sales units and the 3-point moving average (x-axis: years 2013 to 2022)
The above example averaged over a three-year period and some of the points
to consider include the following:
I. The moving average series has five figures relating to the years from 2001
to 2005. The original series had seven figures for the years from 2000 to
2006.
II. There is an upward trend in sales, which is more noticeable from the series
of moving averages than from the series of actual sales each year.
A trend line isolates the trend (T) component only. It shows the general direction
(upward, downward, or constant) in which the series is moving. It is therefore best
represented by a straight line. The method of least squares from regression analysis
is used to find the trend line of best fit, where the dependent variable y is the time
series (e.g., sales) and the independent variable t is time. To use time as an
independent variable in regression analysis, it must be coded, for example 1 = 2014; 2 = 2015, etc.
The trend line is estimated the same way as in Chapter 5. The only difference is that
the independent variable is now denoted by t instead of x. Therefore, the regression
equation is given by:
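y = a + bt
where
y = Dependent variable.
t = Independent variable (coded time period).
a = Regression coefficient (constant or y intercept).
b = Regression coefficient (slope or gradient of line).
The coefficients are calculated as follows:
$$a = \bar{y} - b\bar{t}$$
$$b = \frac{S_{ty}}{S_{tt}}$$
where
$$S_{ty} = \sum ty - n\bar{t}\bar{y}$$
and
$$S_{tt} = \sum t^2 - n\bar{t}^2$$
or
$$b = \frac{\sum ty - n\bar{t}\bar{y}}{\sum t^2 - n\bar{t}^2}$$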
Each time period t of the time series y is assigned an integer value beginning with
1 for the first time period, 2 for the second time, 3 for the third time, etc.
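The least squares trend line with sequential coding (t = 1, 2, 3, ...) can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `fit_trend_line` is hypothetical, and the four data values are invented so that the result is easy to verify by hand (they lie exactly on the line y = 3 + 2t).

```python
def fit_trend_line(y):
    """Least squares trend y = a + b*t with t coded 1, 2, ..., n."""
    n = len(y)
    t = list(range(1, n + 1))  # sequential coding of the time periods
    mean_t = sum(t) / n
    mean_y = sum(y) / n
    # S_ty = sum(t*y) - n * mean(t) * mean(y)
    s_ty = sum(ti * yi for ti, yi in zip(t, y)) - n * mean_t * mean_y
    # S_tt = sum(t^2) - n * mean(t)^2
    s_tt = sum(ti ** 2 for ti in t) - n * mean_t ** 2
    b = s_ty / s_tt
    a = mean_y - b * mean_t
    return a, b


# Hypothetical series lying exactly on y = 3 + 2t
a, b = fit_trend_line([5, 7, 9, 11])
print(a, b)

# Trend estimate for the next period (t = 5)
print(a + b * 5)
```

Once a and b are known, a trend estimate for a future period is obtained by substituting its coded value of t into y = a + bt.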
Example 6.3
The number of houses sold quarterly by Valley Estates in the Cape Peninsula is
recorded for the period 1996 to 1999 as shown in Table 6.3 below. The sales
director has requested a trend analysis of this sales data to determine the general
direction (trend) of future quarterly housing sales.
a) Find the trend line equation for the quarterly house sales data from 1996 to
1999 using the sequential method for coding the time variable.
b) Draw a time series plot and fit the trend line on to the graph.
Example 6.4
Sort the sales values to be in one column as shown in Table 6.4 below. Add
columns for t, t² and ty.
b) A
Figure 6.4: Time series plot and the fitted trend line
For any number of time periods, the sequence is consistent with incremental
steps of +2. This zero-sum coding scheme simplifies the formulae for
computing the trend line since one of the terms in the formulae, namely $\sum t_i$, equals 0.
Example 6.5
Using the information given in Table 6.4 of Example 6.4, find a trend line equation for the
house sales data of Valley Estates using the zero-sum coding scheme for the time
variable x,
where x = -1 in 1997 Q4
+1 in 1998 Q1
+3 in 1998 Q2
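The zero-sum coding scheme can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the step of 1 used for an odd number of periods is an assumed common convention rather than something stated in this guide, and the four data values are invented. Because the codes sum to zero, the trend line formulas reduce to a = mean of y and b = (sum of xy) / (sum of x²).

```python
def zero_sum_codes(n):
    """Time codes symmetric about zero, so that they sum to 0."""
    if n % 2 == 0:
        # Even number of periods: ..., -3, -1, +1, +3, ... (steps of 2)
        return [2 * i - (n - 1) for i in range(n)]
    # Odd number of periods (assumed convention): ..., -1, 0, +1, ...
    return [i - n // 2 for i in range(n)]


def fit_trend_zero_sum(y):
    """Trend line y = a + b*x using zero-sum codes for x."""
    x = zero_sum_codes(len(y))
    a = sum(y) / len(y)  # intercept is simply the mean of y
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
    return a, b


# 16 quarters from 1996 Q1 to 1999 Q4
codes = zero_sum_codes(16)
print(codes[7], codes[8], codes[9])  # codes for 1997 Q4, 1998 Q1, 1998 Q2

# Hypothetical four-period series
a, b = fit_trend_zero_sum([5, 7, 9, 11])
print(a, b)
```

Note that with steps of 2 the slope b measures the change in y per half period, so it is half the slope obtained with sequential coding on the same data.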
There are two major categories of index numbers – price and quantity. In both cases,
a single or composite index may be used.
A price index measures the percentage change in price between any two periods of
time. For a single item, the relative price change from one time period to another is
found by computing its price relative:
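$$\text{Price relative} = \frac{p_1}{p_0} \times 100\%$$
Where:
$p_1$ = price in the current period
$p_0$ = price in the base period
Similarly, for a single item, the relative quantity change is found by computing its
quantity relative:
$$\text{Quantity relative} = \frac{q_1}{q_0} \times 100\%$$
Where:
$q_1$ = quantity in the current period
$q_0$ = quantity in the base period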
Whereas simple index numbers grant equal importance to all items regardless of what
share they hold, weighted index numbers weight items according to their
relative importance. For example, when calculating the price index number, if the price
of a unit of petrol is fifteen times the price of a unit of rice, then petrol will be
given a weight of 15 whereas rice will be given a weight of 1. This creates a more
realistic picture of the real state of affairs than simple index numbers.
There are two types of weighted indexes namely, the fixed weight index, and the
simple weighted (aggregative) index.
The fixed weight index utilises weights (w) that are based on a period or periods
considered representative. The weights and base prices do not necessarily have to be
drawn from the same period. The formula is as follows:
$$\text{Price index} = \frac{\sum p_1 w}{\sum p_0 w} \times 100\%$$
The simple weighted (aggregative) index places the base year values for both price and
quantity in the denominator. It does, however, run the risk that the base period is not
representative. The formula is as follows:
$$\frac{\sum p_1 q_1}{\sum p_0 q_0} \times 100\%$$
A composite index combines the relative prices and quantities. Two commonly used
composite indices are the Laspeyres index and the Paasche index. The Laspeyres price
index is given by:
$$L_p = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100\%$$
where quantities are held constant at base period levels. The Laspeyres quantity
index is given by:
$$L_q = \frac{\sum p_0 q_1}{\sum p_0 q_0} \times 100\%$$
Example 6.6
Using 1986 as the base year, calculate the Laspeyres price and quantity indices,
and interpret your answer for the portfolio of shares provided below:
Solution
Using 1986 as base year, the Laspeyres composite indices are calculated as follows:
Laspeyres price index: $$L_p = \frac{163550}{133750} \times 100\% = 122.3\%$$
Laspeyres quantity index: $$L_q = \frac{157500}{133750} \times 100\% = 117.8\%$$
Interpretation
This implies that the value of share units (price) increased, on average, by 22.3%.
The price index indicates the increase in the value of the portfolio if all quantities of
shares remain the same. Conversely, the quantity index indicates the increase in the
number of shares bought if all prices are held constant. Index numbers are generally
based on samples of items. Hence sampling errors are introduced. Furthermore,
technological changes, product quality changes and changes in consumer purchasing
patterns can individually and collectively make comparisons over time unreliable.
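The Laspeyres indices can be computed directly from item-level prices and quantities. The Python sketch below is an illustration only: the function names are hypothetical, and the two-item basket is invented for the example (it is not the share portfolio of Example 6.6).

```python
def laspeyres_price_index(p0, q0, p1):
    """Sum(p1*q0) / Sum(p0*q0) * 100: quantities held at base period levels."""
    numerator = sum(p * q for p, q in zip(p1, q0))
    denominator = sum(p * q for p, q in zip(p0, q0))
    return numerator / denominator * 100


def laspeyres_quantity_index(p0, q0, q1):
    """Sum(p0*q1) / Sum(p0*q0) * 100: prices held at base period levels."""
    numerator = sum(p * q for p, q in zip(p0, q1))
    denominator = sum(p * q for p, q in zip(p0, q0))
    return numerator / denominator * 100


# Hypothetical two-item basket: base period (0) and current period (1)
p0, q0 = [10, 20], [5, 2]
p1, q1 = [12, 25], [6, 3]

print(round(laspeyres_price_index(p0, q0, p1), 1))
print(round(laspeyres_quantity_index(p0, q0, q1), 1))
```

An index above 100 indicates an increase relative to the base period, and the excess over 100 is the average percentage change.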
The Paasche index is an example of a weighted aggregate index which uses current
time period weights. It is useful when the relative importance of the items making up
the basket of goods is continuously changing due to a change in the quantity for each
year. It is more accurate than the Laspeyres index as it reflects what the industry is
actually using in the current year, and therefore takes account of the price changes
and the quantity changes.
$$\text{Paasche price index} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100\%$$
$$\text{Paasche quantity index} = \frac{\sum p_1 q_1}{\sum p_1 q_0} \times 100\%$$
Example 6.7
Calculate Paasche’s price and quantity indices for the same share portfolio in example
6.6 and interpret the results.
Solution
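$$\text{Paasche price index} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100\% = \frac{230700}{157500} \times 100\% = 146.48\%$$
$$\text{Paasche quantity index} = \frac{\sum p_1 q_1}{\sum p_1 q_0} \times 100\% = \frac{230700}{163550} \times 100\% = 141.06\%$$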
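The Laspeyres and Paasche results for the share portfolio can be checked from the four value totals quoted in Examples 6.6 and 6.7. The Python sketch below is illustrative only; the helper `index_number` and the variable names are assumptions made for this example.

```python
def index_number(numerator, denominator):
    """Ratio of two value totals expressed as a percentage."""
    return numerator / denominator * 100


# Value totals quoted in Examples 6.6 and 6.7
sum_p0q0 = 133750  # base prices, base quantities
sum_p1q0 = 163550  # current prices, base quantities
sum_p0q1 = 157500  # base prices, current quantities
sum_p1q1 = 230700  # current prices, current quantities

laspeyres_price = index_number(sum_p1q0, sum_p0q0)
laspeyres_quantity = index_number(sum_p0q1, sum_p0q0)
paasche_price = index_number(sum_p1q1, sum_p0q1)
paasche_quantity = index_number(sum_p1q1, sum_p1q0)

print(round(laspeyres_price, 1), round(laspeyres_quantity, 1))
print(round(paasche_price, 2), round(paasche_quantity, 2))
```

Setting the four totals side by side makes the difference between the two approaches visible: Laspeyres divides by the base period total in both indices, whereas Paasche uses the current period total as the numerator in both.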
Excel can be used in the analysis of time series data. It can produce the trendline
graph of time series data using the Insert > Chart > Line option, as illustrated in
Chapter 2 and construct and superimpose the regression trendline equation on the
line graph, as illustrated in Chapter 5.
Conclusion
This chapter focused on time series analysis as a primary tool for extrapolating and
forecasting. The components of the time series and the various methods used to
identify trends in time series were examined. The chapter also discussed how to
compute and interpret price and quantity indices used to measure trends, and track
changes between time points. Index numbers are useful measures of the change in
the activity of an item or a collection of items from one time period to another.
Self-Assessment Questions
Table 6.10: Average prices and quantities for year 2009 and year 2019
                 2009                                  2019
Item      Price (R)   Number sold (in millions)   Price (R)   Number sold (in millions)
Ruler     17          7.5                         23.7        9.6
Eraser    7.3         4.2                         12.5        6.2
Pencil    5.5         12.6                        8.7         1.65
Pen       15.8        8.5                         22.9        2.68
6.1.1 Calculate and interpret the Laspeyres price and quantity indices for
2019 using a base year of 2009.
6.1.2 Calculate and interpret the Paasche price and quantity indices for 2019
using a base year of 2009.
6.2 You want to invest money in a company. The following table represents the
portfolio for shares. Use the Laspeyres and Paasche’s price indices to
determine how the shares fared. Conclude your calculations with a summary
of how the shares fared, and whether or not you would consider investing in
the company.
BIBLIOGRAPHY
Davis, G. W., Pecar, B., & Santana, L. (2013). Business statistics using Excel.
Oxford University Press United Kingdom. Available at:
https://b-ok.africa/book/3719720/f36822
Davis, G. W., Pecar, B., & Santana, L. (2017). Business statistics using Excel: A first
course for South African students. Oxford University Press Southern Africa.
Larson, R., & Farber, B. (2019). Elementary statistics. Pearson Education Canada.
Wegner, T. (2016). Applied business statistics (3rd ed.). Juta and Co, Ltd.