

MASTERS FOR LECTURE HANDOUTS



INTRODUCTION TO THIS COURSE AND TO STATISTICS

The Course
This is a one semester course for Business Studies students. It is intended to be user
friendly, with a minimum of jargon and a maximum of explanation. It is accompanied
by a simple text-book, 'Business Statistics - A One Semester Course',
which contains many more explanations, extra worked examples, practice exercises with
answers, examination examples, etc. Having completed the course you should be able to:
 understand the use of most simple statistical techniques used in the world of
business;
 understand published graphical presentation of data;
 present statistical data to others in graphical form;
 summarise and analyse statistical data and interpret the analysis for others;
 identify relationships between pairs of variables;
 make inferences about a population from a sample;
 use some basic forecasting techniques;
 use a statistical software package (optional at the discretion of the lecturer).

The main themes of this course are:


 Descriptive statistics: graphical and numerical.
 Inferential statistics: confidence intervals and hypothesis tests.
 Pairwise relationships between variables: correlation, regression and chi-squared
tests.
 Forecasting: modelling and extension of time series.

Statistics (According to the Concise Oxford Dictionary)


 Numerical facts systematically collected
 The science of collecting, classifying and using statistics

The emphasis in this course is not on the actual collection of numerical facts (data) but
on their classification, summarisation, display and analysis. These processes are carried
out in order to help us understand the data and make the best use of them.

Business statistics:
Any decision making process should be supported by some quantitative measures
produced by the analysis of collected data. Useful data may be on:
 Your firm's products, costs, sales or services
 Your competitors' products, costs, sales or services
 Measurement of industrial processes
 Your firm's workforce
 Etc.
Once collected, this data needs to be summarised and displayed in a manner which helps
its communication to, and understanding by, others. Only when fully understood can it
profitably become part of the decision making process.

TYPES OF DATA AND SCALES OF MEASUREMENT


Any data you use can take a variety of values or belong to various categories, either
numerical or non-numerical. The 'thing' being described by the data is therefore known as
a variable. The values or descriptions you use to measure or categorise this 'thing' are the
measurements. These are of different types, each with its own appropriate scale of
measurement which has to be considered when deciding on suitable methods of
graphical display or numerical analysis.

Variable - something whose 'value' can 'vary'.


 A car could be red, blue, green, etc
 It could be classed as small, medium or large.
 Its year of manufacture could be 1991, 1995, 1999, etc.
 It would have a particular length.
These values all describe the car, but are measured on different scales.

Categorical data
This is generally non-numerical data which is placed into exclusive categories and then
counted rather than measured. People are often categorised by sex or occupation. Cars
can be categorised by make or colour.

Nominal data
The scale of measurement for a variable is nominal if the data describing it are simple
labels or names which cannot be ordered. This is the lowest level of measurement. Even
if it is coded numerically, or takes the form of numbers, these numbers are still only
labels. For example, car registration numbers only serve to identify cars. Numbers on
athletes' vests are only nominal and make no value statements.

All nominal data are placed in a limited number of exhaustive categories and any
analysis is carried out on the frequencies within these categories. No other arithmetic is
meaningful.

Ordinal Data
If the exhaustive categories into which the set of data is divided can be placed in a
meaningful order, without any measurement being taken on each case, then it is classed
as ordinal data. This is one level up from nominal. We know that the members of one
category are more, or less, than the members of another but we do not know by how
much. For example, the cars can be ordered as: 'small', 'medium', 'large' without the aid
of a tape measure. Degree classifications are only ordinal. Athletes' results depend on
their order of finishing in a race, not on 'how much' separates their times.

Questionnaires are often used to collect opinions using the categories: 'Strongly agree',
'Agree', 'No opinion', 'Disagree' or 'Strongly disagree'. The responses may be coded as 1,
2, 3, 4 and 5 for the computer but the differences between these numbers are not claimed
to be equal.

Interval and Ratio data


In both these scales any numbers are defined by standard units of measurement, so equal
difference between numbers genuinely means equal distance between measurements. If
there is also a meaningful zero, then the fact that one number is twice as big as another
means that the measurements are also in that ratio. These data are known as ratio data.
If, on the other hand, the zero is not meaningful the data are interval only.

There are very few examples of genuine interval scales. Temperature in degrees
Centigrade provides one example with the 'zero' on this scale being arbitrary. The
difference between 30°C and 50°C is the same as the difference between 40°C and 60°C
but we cannot claim that 60°C is twice as hot as 30°C. It is therefore interval data but
not ratio data. Dates are measured on this scale as again the zero is arbitrary and not
meaningful.

Ratio data must have a meaningful zero as its lowest possible value so, for example, the
time taken for athletes to complete a race would be measured on this scale. Suppose Bill
earns £20 000, Ben earns £15 000 and Bob earns £10 000. The intervals of £5000 between
Bill and Ben and also between Ben and Bob genuinely represent equal amounts of
money. Also, Bob's and Bill's earnings are genuinely in the same ratio, 1 : 2, as the
numbers which represent them, since the value of £0 represents 'no money'.
This data set is therefore ratio as well as interval.

The distinction between these types is theoretical rather than practical as the same
numerical and graphical methods are appropriate for both. Money is always 'at least
interval' data.

We have therefore identified three main measurement scales - nominal, ordinal and
interval - measuring nominal, ordinal and interval data. Data may be analysed using
methods appropriate to lower levels, but useful information is lost. For example, interval
data may be treated as ordinal: if we know that Bill earns £20 000 and Ben earns £15 000
we are throwing information away by only recording that Bill earns 'more than' Ben.

Data cannot be analysed using methods which are only appropriate for higher level data
as the results will be either invalid or meaningless. For example it makes no sense to code
the sex of students as 'male' = 1, 'female' = 2 and then report 'mean value is 1.7'. It is
however quite appropriate to report that '70% of the students are female'.

Qualitative and Quantitative data


Various definitions exist for the distinction between these two types of data. Non-
numerical, nominal data are always described as qualitative or non-metric, since they
describe some quality but do not measure it. Quantitative or metric data, which describe
some measurement or quantity, are always numerical and measured on the interval or
ratio scales. All definitions agree that interval or ratio data are quantitative.
Some text-books, however, use the term qualitative to refer to words only, whilst others
also include nominal or ordinal numbers. Problems of definition could arise with
numbers, such as house numbers, which identify or rank rather than measure.

Discrete and Continuous data


Quantitative data may be discrete or continuous. If the values which can be taken by a
variable change in steps the data are discrete. These discrete values are often, but not
always, whole numbers. If the variable can take any value within a range it is
continuous. The number of people shopping in a supermarket is discrete but the amount
they spend is continuous. The number of children in a family is discrete but a baby's birth
weight is continuous.

POPULATIONS AND SAMPLES


The population is the entire group of interest. Unlike the everyday, non-statistical sense
of the word, it is not confined to people: it may, for example, be all the houses in a local
authority area rather than the people living in them.

It is not usually possible, or not practical, to examine every member of a population, so
we use a sample, a smaller selection taken from that population, to estimate some value
or characteristic of the whole population. Care must be taken when selecting the sample:
it must be representative of the whole population under consideration, otherwise it tells
us nothing relevant to that particular population.

Occasionally the whole population is investigated by a census, such as the one carried out
every ten years in the British Isles to produce a complete enumeration of the whole
population. The data are gathered from the whole population. A more usual method of
collecting information is a survey, in which only a sample is selected from the
population of interest and its data examined. Examples of this are the Gallup polls
produced from a sample of the electorate when attempting to forecast the result of a
general election.

Analysing a sample instead of the whole population has many advantages such as the
obvious saving of both time and money. It is often the only possibility as the collecting of
data may sometimes destroy the article of interest, e.g. the quality control of rolls of film.

The ideal method of sampling is random sampling. By this method every member of the
population has an equal chance of being selected and every selection is independent of all
the others. This ideal is often not achieved for a variety of reasons and many other
methods are used.

DESCRIPTIVE STATISTICS
If the data available to us cover the whole population of interest, we may describe them
or analyse them in their own right, i.e. we are only interested in the specific group from
which the measurements are taken. The facts and figures usually referred to as 'statistics'
in the media are very often a numerical summary, sometimes accompanied by a graphical
display, of this type of data, e.g. unemployment figures. Much of the data generated by a
business will be descriptive in nature, as will be the majority of sporting statistics. In the
next few weeks you will learn how to display data graphically and summarise it
numerically.

INFERENTIAL STATISTICS
Alternatively, we may have available information from only a sample of the whole
population of interest. In this case the best we can do is to analyse it to produce the
sample statistics from which we can infer various values for the parent population. This
branch of statistics is usually referred to as inferential statistics. For example we use the
proportion of faulty items in a sample taken from a production line to estimate the
corresponding proportion of all the items.

A descriptive measure from the sample is usually referred to as a sample statistic and the
corresponding measure estimated for the population as a population parameter.
The problem with using samples is that each sample taken would produce a different
sample statistic giving us a different estimate for the population parameter. They cannot
all be correct so a margin of error is generally quoted with any sample estimations. This
is particularly important when forecasting future statistics.

In this course you will learn how to draw conclusions about populations from sample
statistics and estimate future values from past data.

(There is no tutorial sheet corresponding to this week's lecture but if your basic maths is
at all questionable it can be checked by working through the 'Basic Mathematics
Revision' sheet which covers all the basic arithmetic which is the only prior learning
required for this course.)

Exercise 1 Basic Mathematics Revision

1 Evaluate
3 4 5 2 4 1 2 14 15 16 4
a)   b) 1  2 3 c)   d) 
7 21 3 3 5 2 7 25 24 21 7
4 3 7 2 11 7
e)   f) 4  1
9 16 12 5 12 8

2 a) Give 489 267 to 3 significant figures. b) Give 489 267 to 2 sig. figs.
c) Give 0.002615 to 2 significant figures. d) Give 0.002615 to 1 sig. fig.
e) Give 0.002615 to 5 decimal places. f) Give 0.002615 to 3 dec. places.

3 Retail outlets in a town were classified as small, medium and large and their
numbers were in the ratio 6 : 11 : 1. If there were 126 retail outlets altogether, how
many were there of each type?

4 Convert:
a) 28% to a fraction in its lowest terms. b) 28% to a decimal.
c) 3/8 to a decimal. d) 3/8 to a percentage.
e) 0.625 to a fraction in its lowest terms. f) 0.625 to a percentage.

5 Express the following in standard form.


a) 296000 b) 0.000296 c) 0.4590 d) 459.0 e) 1/25000 f) 1/0.00025

6 Reduce the following expressions to their simplest form, expanding brackets if


appropriate.
a) 3a + b + 2a - 4b b) 2a + 4ab + 3a² + ab c) a²(3a + 4b + 2a)
d) (x + 2)(x + 4) e) (x + 2)² f) (x + 1)(x - 1)

7 Make x the subject of the formula.


a) y = 5x - 4 b) y = x² - 7 c) y = 2(3 + 6x) d) y = 3/x

8 Find the value of x in the formulae in number 7 when y = -3.

9 Evaluate the following when x = -2, y = 5 and z = 4


a) xy b) (xy)² c) (xy + z)²
d) zy - x² e) (x + z)(2y - x) f) x² + y² + z²

10 Solve for x:
a) 3x - 1 = 4 - 2x b) 2(x - 3) = 3(1 - 2x) c) 3/(x - 1) = 1/2

Answers

1 a) 1 19/21 b) 29/30 c) 1/10 d) 1 1/3 e) 1/7 f) 9

2 a) 489 000 b) 490 000 c) 0.0026 d) 0.003 e) 0.002 62


f) 0.003

3 42 small, 77 medium, 7 large

4 a) 7/25 b) 0.28 c) 0.375 d) 37.5% e) 5/8 f) 62.5%

5 a) 2.96 × 10⁵ b) 2.96 × 10⁻⁴ c) 4.59 × 10⁻¹ d) 4.59 × 10² e) 4.0 × 10⁻⁵ f) 4.0 × 10³

6 a) 5a - 3b b) 3a² + 5ab + 2a c) 5a³ + 4a²b d) x² + 6x + 8 e) x² + 4x + 4 f) x² - 1

1 y6 3
7 a) x  y  4 b) x y7 c) x  d) x  y
2 12

8 a) 0.2 b) +2 or - 2 c) -0.75 d) -1

9 a) -10 b) 100 c) 36 d) 16 e) 24 f) 45

10 a) x = 1 b) x = 1 1/8 c) x = 7

GRAPHICAL REPRESENTATION OF DATA


In this lecture two methods, histograms and cumulative frequency diagrams, are
described in detail for presenting frequency data. You also need to read Chapter 2 of
'Business Statistics – A One Semester Course' which describes other presentation
methods: barcharts, piecharts, boxplots, frequency polygons, dotplots, and stem-and-leaf
diagrams. Also make sure that all the diagrams in this handout are completed before next
week's lecture as they will be used again for estimating summary statistics.

A frequency distribution is simply a grouping of the data together, generally in the form
of a frequency distribution table, giving a clearer picture than the individual values.
The most usual presentation is in the form of a histogram and/or a frequency polygon.

A Histogram is a pictorial method of representing data. It appears similar to a Bar Chart


but has two fundamental differences:
 The data must be measurable on a standard scale; e.g. lengths rather than colours.
 The Area of a block, rather than its height, is drawn proportional to the Frequency, so
if one column is twice the width of another it needs to be only half the height to
represent the same frequency.

Method
 Construct a Frequency Distribution Table, grouping the data into a reasonable number
of classes, (somewhere in the order of 10). Too few classes may hide some
information about the data, too many classes may disguise its overall shape.
 Make all intervals the same width at this stage.
 Construct the frequency distribution table. Tally charts may be used if preferred, but
any method of finding the frequency for each interval is quite acceptable.
 Intervals are often left the same width but if the data is scarce at the extremes then
classes may be joined. If frequencies are very high in the middle of the data classes
may be split.
 If the intervals are not all the same width, calculate the frequency densities, i.e.
frequency per constant interval. (Often the most commonly occurring interval is used
throughout.) If some intervals are wider than others care must be taken that the areas
of the blocks and not their heights are proportional to the frequencies.
 Construct the histogram labelling each axis carefully. For a given frequency
distribution, you may have to decide on sensible limits if the first or last Class
Interval is specified as 'under ....' or 'over....', i.e. is open ended.
 Hand drawn histograms usually show the frequency or frequency density vertically.
Computer output may be horizontal as this format is more convenient for line printers.

(Your histogram will be used again next week for estimating the modal mileage.)

A frequency polygon is constructed by joining the midpoints at the top of each column
of the histogram. The final section of the polygon often joins the mid point at the top of
each extreme rectangle to a point on the x-axis half a class interval beyond the rectangle.
This makes the area enclosed by the rectangle the same as that of the histogram.

Example: You are working for the Transport manager of a large chain of supermarkets
which hires cars for the use of its staff. Your boss is interested in the weekly distances
covered by these cars. Mileages recorded for a sample of hired vehicles from 'Fleet 1'
during a given week yielded the following data:

138 164 150 132 144 125 149 157 161 150
146 158 140 109 136 148 152 144 145 145
168 126 138 186 163 109 154 165 135 156
146 183 105 108 135 153 140 135 142 128
Minimum = 105 Maximum = 186 Range = 186 - 105 = 81

Nine intervals of 10 miles width seems reasonable, but extreme intervals may be wider.
Next calculate the frequencies within each interval. (Not frequency density yet.)

Frequency distribution table


Class interval Tally Frequency Frequency density
100 and < 110 |||| 4 2.0
110 and < 120 0
120 and < 130 ||| 3 3.0
130 and < 140 |||| || 7 7.0
140 and < 150 |||| |||| | 11 11.0
150 and < 160 |||| ||| 8 8.0
160 and < 170 |||| 5 5.0
170 and < 180 0 1.0
180 and < 190 || 2
Total 40

The data is scarce at both extremes so join the extreme two classes together. This doubles
the interval width so the frequency needs to be halved to produce the frequency per 10
miles, i.e. the frequency density. Calculate all the frequency densities. Next draw the
histogram and add the frequency polygon.
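If the optional statistical software is being used, the minimal Python sketch below (an illustration, not part of the hand calculation) reproduces the frequency-density column of the table above; the class limits and frequencies are those from the table, with the two sparse end classes already joined.

```python
# Sketch: frequency densities for the mileage data, per 10-mile interval.
# Each class is (lower limit, upper limit, frequency); the sparse end
# classes 100-120 and 170-190 have already been joined, as in the example.
classes = [(100, 120, 4), (120, 130, 3), (130, 140, 7), (140, 150, 11),
           (150, 160, 8), (160, 170, 5), (170, 190, 2)]

standard_width = 10                      # frequency per 10-mile interval
for lower, upper, freq in classes:
    density = freq * standard_width / (upper - lower)
    print(f"{lower} and < {upper}: frequency = {freq:2d}, density = {density:.1f}")
```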

(Grid for the histogram and frequency polygon: frequency per 10-mile interval against mileage, 100 to 200 miles.)

Stem and leaf plots


A stem and leaf plot displays the data in the same 'shape' as the histogram, though it tends
to be shown horizontally. The main difference is that it retains all the original information
as the numbers themselves are included in the diagram so that no information is 'lost'.
The 'stem' indicates, in the same example as above, the first two digits, (hundreds and
tens), and the 'leaf' the last one, (the units). The lowest value below is 105 miles.
Sometimes the frequency on each stem is included.

Example (Same data as for the previous histogram)

Frequency Stem Leaf


4 10 | 5899
0 11 |
3 12 | 568
7 13 | 2555688
11 14 | 00244556689
8 15 | 00234678
5 16 | 13458
0 17 |
2 18 | 36
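For those using the optional software, a minimal Python sketch that builds the same stem-and-leaf layout from the raw mileages (stem = hundreds and tens, leaf = units) is given below.

```python
# Sketch: stem-and-leaf plot of the 40 weekly mileages.
from collections import defaultdict

mileages = [138, 164, 150, 132, 144, 125, 149, 157, 161, 150,
            146, 158, 140, 109, 136, 148, 152, 144, 145, 145,
            168, 126, 138, 186, 163, 109, 154, 165, 135, 156,
            146, 183, 105, 108, 135, 153, 140, 135, 142, 128]

leaves = defaultdict(list)
for m in sorted(mileages):
    leaves[m // 10].append(m % 10)       # e.g. 138 -> stem 13, leaf 8

for stem in range(10, 19):               # stems 10 (100-109) to 18 (180-189)
    row = leaves.get(stem, [])
    print(f"{len(row):2d}  {stem} | {''.join(str(leaf) for leaf in row)}")
```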

Cumulative frequency diagrams (Cumulative frequency polygons, Ogives)


A Cumulative Frequency Diagram is a graphical method of representing the accumulated
frequencies up to and including a particular value - a running total. These cumulative
frequencies are often calculated as percentages of the total frequency. This method, as we
shall see next week, is used for estimating median and quartile values and hence the
interquartile or semi-interquartile ranges of the data. It can also be used to estimate
the percentage of the data above or below a certain value.

Method
 Construct a frequency table as before.
 Use it to construct a cumulative frequency table noting that the end of the interval
is the relevant plotting value.
 Calculate a column of cumulative percentages.
 Plot the cumulative percentage against the end of the interval and join up the points
with straight lines. N.B. Frequencies are always plotted on the vertical axis.

Using the data in the section on Histograms, we work through the above method for
drawing the cumulative frequency diagram. Next week you will use it to estimate the
median mileage, the interquartile range and the semi-interquartile range of the data.

Cumulative frequency table

Interval (less than)    Frequency    Cumulative frequency    % cumulative frequency
100 0 0 0.0
120 4 4 10.0
130 3 7 17.5
140 7 14 35.0
150 11 25 62.5
160 8 33 82.5
170 5 38 95.0
190 2 40 100.0

Cumulative frequency diagram

(Grid: % cumulative frequency against mileage, 100 to 200 miles.)

We can estimate from this diagram that the percentage of vehicles travelling, say, less
than 125 miles is 14%.
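The same estimate can be obtained numerically by linear interpolation between the plotted points of the ogive. A minimal Python sketch (an illustration, not part of the handout method) is shown below; it gives about 13.8%, in line with the 14% read off the diagram.

```python
# Sketch: cumulative percentages and interpolation on the ogive points.
boundaries  = [100, 120, 130, 140, 150, 160, 170, 190]   # 'less than' values
frequencies = [0,   4,   3,   7,   11,  8,   5,   2]     # frequency per interval

total, running, cum_pct = sum(frequencies), 0, []
for f in frequencies:
    running += f
    cum_pct.append(100 * running / total)

def percent_below(x):
    """Percentage of vehicles travelling less than x miles (linear interpolation)."""
    for i in range(1, len(boundaries)):
        if x <= boundaries[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cum_pct[i - 1], cum_pct[i]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return 100.0

print(round(percent_below(125), 1))      # about 13.8, i.e. roughly 14%
```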

Next week we shall use the same histogram and cumulative frequency diagram to:
 Estimate the mode of the frequency data.
 Estimate the Median value by dotting in from 50% on the Cumulative Percentage
axis as far as your ogive and then down to the values on the horizontal axis. The
value indicated on the horizontal axis is the estimated Median.
 Estimate the Interquartile Range by dotting in from 25% and down to the horizontal
axis to give the Lower Quartile value. Repeat from 75% for the Upper Quartile
value.
The Interquartile Range is the range between these two Quartile values.
The Semi-interquartile Range is half of the Interquartile range.

 The same technique can be employed to estimate a particular value from the
percentage of the data which lies above or below it.
Completed diagrams from lecture handout (Not for the student handouts)
Example You are working for the Transport manager of a large chain of supermarkets
which hires cars for the use of its staff. He is interested in the weekly distances covered
by these cars. Mileages recorded for a sample of hired vehicles from Fleet 1 during a
given week yielded the following data:
138 164 150 132 144 125 149 157 161 150
146 158 140 109 136 148 152 144 145 145
168 126 138 186 163 109 154 165 135 156
146 183 105 108 135 153 140 135 142 128
Minimum = 105 Maximum = 186 Range = 186 - 105 = 81
Nine intervals of 10 miles width seems reasonable, but extreme intervals may be wider.
Next calculate the frequencies within each interval. (Not frequency density yet.)

Frequency distribution table


Class interval Tally Frequency Frequency density
100 and < 110 |||| 4 2.0
110 and < 120 0
120 and < 130 ||| 3 3.0
130 and < 140 |||| || 7 7.0
140 and < 150 |||| |||| | 11 11.0
150 and < 160 |||| ||| 8 8.0
160 and < 170 |||| 5 5.0
170 and < 180 0 1.0
180 and < 190 || 2
Total 40
The data is scarce at both extremes so join the extreme two classes together. This doubles
the interval width so the frequency needs to be halved to produce the frequency per 10
miles, i.e. the frequency density. Next draw the histogram and add the frequency
polygon.

(Completed histogram and frequency polygon: frequency per 10-mile interval against mileage, 100 to 200 miles.)

Cumulative frequency table

Interval (less than)    Frequency    Cumulative frequency    % cumulative frequency
100 0 0 0.0
120 4 4 10.0
130 3 7 17.5
140 7 14 35.0
150 11 25 62.5
160 8 33 82.5
170 5 38 95.0
190 2 40 100.0

Cumulative frequency diagram

(Completed cumulative frequency diagram: % cumulative frequency against mileage, 100 to 200 miles.)

We can estimate from this diagram that the percentage of vehicles travelling, say, less
than 125 miles is 14%.

NUMERICAL SUMMARY OF DATA

In this lecture we shall look at summarising data by numbers, rather than by graphs as last
week. We shall use the diagrams drawn in last week's lecture again this week for making
estimations and you will use the diagrams you drew in Tutorial 2 for estimating summary
statistics in Tutorial 3. You will first learn the meanings of the various summary statistics
and then learn how to use your calculator in standard deviation mode to save time and
effort in calculating mean and standard deviation. Further summary statistics are
discussed in Business Statistics, Chapter 3, Section 3.6.

As we saw last week, a large display of data is not easily understood but if we make some
summary statement such as 'the average examination mark was 65' and the 'marks ranged
from 40 to 75' we get a clearer understanding of the situation. A set of data is often
described, or summarised by two parameters: a measure of centrality, or location, and
a measure of spread, or dispersion.

MEASURES OF CENTRALITY
Mode: This is the most commonly occurring value.
The data may be nominal, ordinal, interval or ratio. (Section 1.4 of Business Statistics)

Median: The middle value when all the data are placed in order.
The data themselves must be ordinal or interval.
For an even number of values take the average of the middle two.

Mean: The Arithmetic Average.


The data must be measurable on an interval scale.
This is calculated by dividing the 'sum of the values' by the 'number of the values'.

x̄ = Σx / n        or        x̄ = Σfx / n   (for frequency data)

x represents the value of the data
f is the frequency of that particular value
x̄ is the shorthand way of writing 'mean'
Σ (sigma) is the shorthand way of writing 'the sum of'.

The mean may be thought of as the 'typical value' for a set of data and, as such, is the
logical value for use when representing it.

The choice of measure of centrality depends, therefore, on the scale of measurement of


the data. For example it makes no sense to find the mean of ordinal or nominal data. All
measures are suitable for interval or ratio data so we shall consider numbers in order that
all the measures can be illustrated.

Data values may also be


 discrete, i.e. not continuous. - people.
 continuous - heights.
 single - each value stated separately.
 grouped - frequencies stated for the number of cases in each value group,
or interval for continuous data

You still work for the Transport Manager, as last week, investigating the usage of cars.

Example 1 Non-grouped discrete data


Nine members of staff were selected at random from those using hire cars. The number of
days he/she left the car unused in the garage in the previous week was:
2 6 2 4 1 4 3 1 1
Rearranged in numerical order: 1 1 1 2 2 3 4 4 6

Mode = 1;   Median = 2;   Mean: x̄ = Σx / n = 24/9 = 2.67
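For anyone using the optional software, Python's standard statistics module gives the same three measures directly; the sketch below simply checks the hand calculation.

```python
# Sketch: mode, median and mean of the nine idle-day values (Example 1).
import statistics

days_idle = [2, 6, 2, 4, 1, 4, 3, 1, 1]

print(statistics.mode(days_idle))             # 1
print(statistics.median(days_idle))           # 2
print(round(statistics.mean(days_idle), 2))   # 2.67
```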

Example 2 Grouped discrete data. Days idle for a whole staff of users of hired cars:

No. of Days unused (x) No. of Staff (f) fx


0 5 0
1 24 24
2 30 60
3 19 57
4 10 40
5 5 25
6 2 12
Total 95 218

Mode = Most common number of days = 2 days


Median = Middle, 48th, number of days = 2 days
Mean = Total number of days / Total number of staff = 218/95 = 2.295 ≈ 2.3 days
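The same Σfx / Σf calculation can be sketched in Python for the grouped frequencies (an illustration only):

```python
# Sketch: mean number of idle days from the grouped frequencies (Example 2).
days  = [0, 1, 2, 3, 4, 5, 6]         # x, number of days unused
staff = [5, 24, 30, 19, 10, 5, 2]     # f, number of staff

total_days  = sum(x * f for x, f in zip(days, staff))   # sum of fx = 218
total_staff = sum(staff)                                # sum of f  = 95
print(round(total_days / total_staff, 1))               # 2.3 days
```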

Example 3 Grouped continuous data

34 employees stated their fuel costs during the previous week to be:

Value (£) Mid-value (x) No. of Employees(f) fx


59 and < 60 59.5 2 119.0
60 and < 61 60.5 5 302.5
61 and < 62 61.5 4 246.0
62 and < 63 62.5 6 375.0
63 and < 64 63.5 5 317.5
64 and < 65 64.5 7 451.5
65 and < 66 65.5 3 196.5
66 and < 67 66.5 2 133.0
Total 34 2141.0
Modal Class = £64 to £65
Median Class = class including the 17th/18th user = £62 to £64
Mean = Total value of invoices / Total number of employees = 2141/34 = £62.97
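A corresponding sketch for the grouped continuous data uses the class mid-values as x (again an illustration only):

```python
# Sketch: mean weekly fuel cost from class mid-values (Example 3).
mid_values = [59.5, 60.5, 61.5, 62.5, 63.5, 64.5, 65.5, 66.5]   # x
employees  = [2, 5, 4, 6, 5, 7, 3, 2]                           # f

total_cost = sum(x * f for x, f in zip(mid_values, employees))  # 2141.0
n = sum(employees)                                              # 34
print(round(total_cost / n, 2))                                 # 62.97, i.e. about £62.97
```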

MEASURES OF SPREAD (THE DATA MUST BE AT LEAST ORDINAL)


Range: Largest value - smallest value.

Interquartile Range: The range of the middle half of the ordered data.

Semi interquartile Range: Half the value of the interquartile range.

Population Standard Deviation: (σn) This is also known as 'Root Mean Square
Deviation' and is calculated by squaring and adding the deviations from the mean, finding
the average of the squared deviations, and then square-rooting the result, or by using
your calculator. As the name suggests we have the whole population of interest
available for analysis.

s = √[ Σ(x - x̄)² / n ]        or        s = √[ Σf(x - x̄)² / n ]   (for frequency data)
x represents the value of the data
f is the frequency of that particular value
x̄ is the shorthand way of writing 'mean'
s is the shorthand way of writing 'standard deviation'
Σ (sigma) is the shorthand way of writing 'the sum of'.
√ means 'take the positive square root of', as the negative root has no meaning.

The Standard Deviation is a measure of how closely the data are grouped about the mean.
The larger the value the wider the spread. It is a measure of just how precise the mean is
as the value to represent the whole data set.

An equivalent formula which is often used is:

s = √[ Σx²/n - (Σx/n)² ]        or        s = √[ Σfx²/n - (Σfx/n)² ]   (for frequency data)

Sample standard deviation (σn-1)
Usually we are interested in the whole population but only have a sample taken from it
available for analysis. It has been shown that the formula above gives a biased value
- always too small. If the denominator used is (n - 1) instead of n, the estimate is found to
be more accurate.

s = √[ Σ(x - x̄)² / (n - 1) ]        or        s = √[ Σf(x - x̄)² / (n - 1) ]   (for frequency data)

As we shall see shortly on the calculator, the keys for the two standard deviations are
described as σn and σn-1 respectively.

The examples below are the same as used previously for the measures of centrality
because these two parameters are generally calculated together.
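If the optional software is being used instead of a calculator, Python's statistics module provides both versions: pstdev divides by n (the σn key) and stdev divides by n - 1 (the σn-1 key). A minimal check on the nine idle-day values:

```python
# Sketch: population and sample standard deviations for the nine values.
import statistics

days_idle = [2, 6, 2, 4, 1, 4, 3, 1, 1]

print(round(statistics.pstdev(days_idle), 3))   # 1.633, dividing by n   (sigma n)
print(round(statistics.stdev(days_idle), 3))    # 1.732, dividing by n-1 (sigma n-1)
```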

Example 1 Discrete Data:


The number of idle days for nine cars: 2 6 2 4 1 4 3 1 1
Rearranged in numerical order: 1 1 1 2 2 3 4 4 6

Range: = 6 - 1 = 5 days
Interquartile Range: Lower quartile = one quarter of (9 + 1) = 2.5th value = 1 day
Upper quartile = three quarters of (9 + 1) = 7.5th value = 4 days
Interquartile range = 4 - 1 = 3 days

Semi interquartile Range: half interquartile range = 1.5 days

(A much larger sample should really be used for this type of analysis.)

Standard Deviation:
(See your calculator booklet for the method as calculators vary considerably. See next
page if you have a Casio. If still uncertain, ask in tutorials)
For these nine numbers only = 1.633 days (σn, SHIFT 2 for Casios)
As estimator for the whole population = 1.732 days (σn-1, SHIFT 3 for Casios)

Example 2. Grouped Discrete Data: The numbers of idle days for 95 cars are:

No. of Days Idle (x) No. of Staff (f) Cumulative frequency


0 5 5
1 24 29
2 30 59
3 19 78
4 10 88
5 5 93
6 2 95
Total 95
Range: = 6 - 0 = 6
Interquartile Range: Lower quartile = (95 + 1)/4 = 24th value = 1 day
Upper quartile = 3(95 + 1)/4 = 72nd value = 3 days
Interquartile Range = 3 - 1 = 2 days

Semi interquartile range: 2/2 = 1 day

Standard Deviation: (from calculator)


For this sample only = 1.345 days (σn, SHIFT 2)
As estimator for population = 1.352 days (σn-1, SHIFT 3)

Example 3 Grouped Continuous Data The invoices for 34 cars:

Value (£) Mid-value (x) No. of Employees fx


59 and < 60 59.5 2 119.0
60 and < 61 60.5 5 302.5
61 and < 62 61.5 4 246.0
62 and < 63 62.5 6 375.0
63 and < 64 63.5 5 317.5
64 and < 65 64.5 7 451.5
65 and < 66 65.5 3 196.5
66 and < 67 66.5 2 133.0
Total 34 2141.0

Range: = 66.5 - 59.5 = £7.0 (estimate)

Interquartile Range: Usually calculated from the quartiles estimated (see below) from a
cumulative frequency diagram - an ogive.

Semi interquartile range: estimated from the interquartile range from the ogive.

Standard Deviation: (from calculator)


For sample only = £1.929 (σn, SHIFT 2)
As estimator for population = £1.958 (σn-1, SHIFT 3)
N.B. The sample standard deviation, σn-1, is the standard deviation produced by
default from both Minitab and SPSS analysis.

Use of a Casio calculator in S.D. Mode for finding mean and standard deviation

Input data (x is the number being keyed in.)


1. Clear all memories SHIFT AC
2. Get into Standard Deviation Mode MODE 2
3. Input data
a) Single numbers x DT x DT etc.
b) Grouped data x; f DT x ; f DT etc.
(where x is the value and f the frequency.)
Output results
4. Check sample size (n) RCL Red C (hyp)
5. Output mean (x̄) SHIFT 1
(6. Output population standard deviation (σn) SHIFT 2)
7. Output sample standard deviation (σn-1) SHIFT 3

Estimation of the Mode from a Histogram

Mode (Last week's Mileages of Cars example) We know that the modal mileage is in
the interval between 140 and 150 miles but need to estimate it more accurately.
Draw 'crossed diagonals' on the protruding part of the highest column. These two
diagonals cross at an improved estimate of the modal mileage. By this method we take
into consideration also the frequencies of the columns adjacent to the modal column.

Histogram of Car mileage as drawn last week


(Histogram grid: frequency per 10-mile interval against mileage, 100 to 190 miles.)

Estimated Mode:

Estimation of the Median and Interquartile range from a Cumulative frequency diagram

Median (Last week's Mileages of Cars example) Dot in from 50% on the Cumulative
Percentage axis to your ogive and then down to the values on the horizontal axis. The
value indicated is the estimated Median for the Mileages.

Quartiles Dot in from 75% and the 25% values on the Cumulative Percentage axis to
your ogive and then down to the values on the horizontal axis. The values indicated are
the estimated Upper quartile and Lower quartile respectively of the Mileages.

The difference between these measures is the Interquartile range. Half the Inter-quartile
range is the Semi-interquartile range.

Cumulative frequency diagram


(Grid for the cumulative frequency diagram: % cumulative frequency against mileage, 100 to 200 miles.)

Estimated Median:

Estimated Quartiles:

Interquartile range:

Semi-Interquartile range:

% of users who travelled more than 170 miles



COMPUTER OUTPUT: NUMERICAL SUMMARY OF EXAMPLE 3

Minitab output

Example 3
Descriptive Statistics
Variable N Mean Median Tr Mean StDev SE Mean
Costs 34 62.974 63.000 62.970 1.955 0.335

Variable Min Max Q1 Q3


Costs 59.500 66.500 61.500 64.500

SPSS output
Example 3
Descriptives

Statistic Std. Error


INVOICES Mean 62.971 .336
95% Confidence Lower Bound 62.288
Interval for Mean Upper Bound 63.654
5% Trimmed Mean 62.967
Median 63.000
Variance 3.832
Std. Deviation 1.958
Minimum 59.5
Maximum 66.5
Range 7.0
Interquartile Range 3.000
Skewness -.043 .403
Kurtosis -.899 .788

Completed diagrams from lecture handout (Not for student handouts)

Estimation of the Mode from a Histogram

Mode (Last week's Mileages of Cars example) We know that the modal mileage is in
the interval between 140 and 150 miles but need to estimate it more accurately.
Draw 'crossed diagonals' on the protruding part of the highest column. These two
diagonals cross at an improved estimate of the modal mileage.

Histogram of Car mileage as drawn last week


(Completed histogram: frequency per 10-mile interval against mileage, 100 to 190 miles, with crossed diagonals drawn on the modal column.)

Estimated Mode: 146 miles
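The crossed-diagonals construction is equivalent to the usual grouped-data mode formula, Mode = L + d1/(d1 + d2) × w, where L is the lower boundary of the modal class, w its width, and d1, d2 the differences between the modal frequency and the frequencies on either side. A minimal Python sketch applying it to the mileage histogram:

```python
# Sketch: grouped-data mode formula applied to the modal class 140-150.
L = 140                                  # lower boundary of the modal class
w = 10                                   # class width
f_modal, f_below, f_above = 11, 7, 8     # frequencies per 10-mile interval

d1 = f_modal - f_below                   # 4
d2 = f_modal - f_above                   # 3
mode = L + d1 / (d1 + d2) * w
print(round(mode, 1))                    # 145.7, read off the histogram as about 146 miles
```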



Estimation of the Median and Interquartile range from a Cumulative frequency diagram

Median (Last week's Mileages of Cars example) Dot in from 50% on the Cumulative
Percentage axis to your ogive and then down to the values on the horizontal axis. The
value indicated is the estimated Median for the Mileages.

Quartiles Dot in from 75% and the 25% values on the Cumulative Percentage axis to
your ogive and then down to the values on the horizontal axis. The values indicated are
the estimated Upper quartile and Lower quartile respectively of the Mileages.

The difference between these measures is the Interquartile range. Half the Inter-quartile
range is the Semi-interquartile range.

Cumulative frequency diagram


(Completed cumulative frequency diagram: % cumulative frequency against mileage, 100 to 200 miles.)

Estimated Median: 146 miles

Estimated Quartiles: 134 and 156 miles

Interquartile range: 156 - 134 = 22 miles

Semi-Interquartile range: 22/2 = 11 miles .

% of users who travelled more than, e.g., 170 miles = 5%
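These readings can also be checked numerically by interpolating between the plotted ogive points, as in the minimal sketch below (an illustration, not part of the handout method); the interpolated values agree closely with those read off the diagram.

```python
# Sketch: median and quartiles by linear interpolation on the ogive points.
boundaries = [100, 120, 130, 140, 150, 160, 170, 190]     # 'less than' values
cum_pct    = [0, 10, 17.5, 35, 62.5, 82.5, 95, 100]       # cumulative %

def value_at(pct):
    """Mileage below which the given percentage of vehicles fall."""
    for i in range(1, len(cum_pct)):
        if pct <= cum_pct[i]:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cum_pct[i - 1], cum_pct[i]
            return x0 + (x1 - x0) * (pct - y0) / (y1 - y0)

median = value_at(50)                     # about 145.5 (read by eye as 146)
q1, q3 = value_at(25), value_at(75)       # about 134.3 and 156.3
print(round(median, 1), round(q1), round(q3), round(q3 - q1))   # 145.5 134 156 22
```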

PROBABILITY

In this lecture we will consider the basic concept of probability which is of such
fundamental importance in statistics. We will calculate probabilities from symmetry and
relative frequency for both single and combined events and use expected values in order
to make logical decisions. We will only use contingency tables for combined probabilities
in this lecture so make sure you read Chapter 4 of Business Statistics for descriptions of
tree diagrams. Make sure you complete your tutorial, if not finished in class time.

The concept of probability was initially associated with gambling. Its value was later
realised in the field of business, initially in calculations for shipping insurance.

Other areas of business:


 Quality control: Sampling schemes based on probability, e.g. acceptance sampling.
 Sample design for sample surveys.
 In decision making, e.g. the use of decision trees.
 Stock control, e.g. probable demand for stock before delivery of next order.

Probability can be calculated from various sources:


 simple probability calculated from symmetry;
 simple probability calculated from frequencies;
 conditional probability calculated from contingency tables;
 conditional probability calculated from tree diagrams;
 etc.

Probability is a measure of uncertainty


Probability measures the extent to which an event is likely to occur and can be calculated
from the ratio of favourable outcomes to the whole number of all possible outcomes.
(An outcome is the result of a trial, e.g. the result of rolling a die might be a six.
An event may be a set of outcomes, 'an even number', or a single outcome, 'a six'.)

There are two main ways of assessing the probability of a single outcome occurring:
 from the symmetry of a situation to give an intuitive expectation,
 from past experience using relative frequencies to calculate present or future
probability.

Value of Probability
Probability always has a value between 0, impossibility, and 1, certainty.

Probability of 0.0 corresponds to ‘impossible’


0.1 corresponds to ‘extremely unlikely’
0.5 corresponds to ‘evens chance’
0.8 corresponds to ‘very likely’

1.0 corresponds to ‘certainty’

Probability from symmetry


We can define the probability of an event A taking place as:
P(event A) = (Number of equally likely outcomes in which A occurs) / (Total number of equally likely outcomes)

Invoking symmetry, there is no need to actually do any experiments. This is the basis for
the theory of probability, which was developed for use in the gaming situation - 'a priori'
probability.

Examples

1. Tossing 1 coin:
Possibilities: Head, Tail. P(a head) = 1/2 = 0.5 P(a tail) = 1/2 = 0.5

2. Tossing 2 coins:
Possibilities: Head Head, Head Tail, Tail Head, Tail Tail.
P(2 heads) = P(2 tails) = P(1 head and 1 tail) =
Note that the sum of all the possible probabilities is always one.
This makes sense as one of them must occur and P(certainty) = 1.

3. Rolling a single die


Sample space: 1, 2, 3, 4, 5, 6.

P(6) = P(anything but 6) =


P(an even number) = P(an odd number) =
Again each pair of probabilities sums to 1.

4. Rolling a pair of dice


(all possibilities shown below - known as total sample space)

1,1 1,2 1,3 1,4 1,5 1,6

2,1 2,2 2,3 2,4 2,5 2,6

3,1 3,2 3,3 3,4 3,5 3,6

4,1 4,2 4,3 4,4 4,5 4,6

5,1 5,2 5,3 5,4 5,5 5,6

6,1 6,2 6,3 6,4 6,5 6,6

P(double six) = P(any double) = P(less than 4) =


P(more than 8) = P(at least one six) = P(an even total) =
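These probabilities can be checked by listing the 36 equally likely outcomes, as in the short Python sketch below; here 'less than 4' and 'more than 8' are interpreted as referring to the total score on the two dice.

```python
# Sketch: probabilities for a pair of dice from the total sample space.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))     # all 36 equally likely pairs

def p(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

print(p(lambda o: o == (6, 6)))          # 1/36   double six
print(p(lambda o: o[0] == o[1]))         # 1/6    any double
print(p(lambda o: sum(o) < 4))           # 1/12   total less than 4
print(p(lambda o: sum(o) > 8))           # 5/18   total more than 8
print(p(lambda o: 6 in o))               # 11/36  at least one six
print(p(lambda o: sum(o) % 2 == 0))      # 1/2    an even total
```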

Probability from Frequency


An alternative way to look at Probability would be to carry out an experiment to
determine the proportion of favourable outcomes in the long run.
P(Event E) = (Number of times E occurs) / (Number of possible occurrences)

This is the frequency definition of Probability.

The relative frequency of an event is simply the proportion of the possible times it has
occurred.

As the number of trials increases the relative frequency demonstrates a long term
tendency to settle down to a constant value. This constant value is the probability of the
event.

But how long is this 'long term'?

Table of Relative Frequencies from one experiment of flipping a coin:

Total number Number Proportion


of throws Result of heads of heads
1 H 1 1.00
2 H 2 1.00
3 T 2 0.67
4 T 2 0.50
5 H 3 0.60
6 T 3 0.50
7 T (if H) 3 (4) 0.43 (0.57)
8 etc. etc. etc.

In this experiment the ‘long term’ would be the time necessary for the relative frequency
to settle down to two decimal places.

(Plot: proportion of heads against throw number for the first thirty or so throws; the proportion fluctuates at first and settles towards 0.5 as the number of throws increases.)

NOTE: The outcome of an individual trial is either 1 or 0 heads.


It is the long-term tendency which gives the estimated probability.
Also: n must be a large number - the larger n is, the better the estimate of the probability.
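The long-run settling of the relative frequency can be illustrated by simulation; the sketch below (a fair coin is assumed, and the seed is fixed only so the run is repeatable) prints the running proportion of heads at a few points.

```python
# Sketch: running proportion of heads in a simulated sequence of coin flips.
import random

random.seed(1)                            # fixed seed so the run can be repeated
heads = 0
for n in range(1, 10001):
    heads += random.randint(0, 1)         # 1 = head, 0 = tail
    if n in (10, 100, 1000, 10000):
        print(n, round(heads / n, 3))     # the proportion drifts towards 0.5
```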

Contingency tables

Another alternative is that probabilities can be estimated from contingency tables.


Relative frequency is used as before, this time from past experience.

Contingency tables display all possible combined outcomes and the frequency with
which each of them has happened in the past. These frequencies are used to calculate
future probabilities.

Example 5 A new supermarket is to be built in Bradfield. In order to estimate the


requirements of the local community, a survey was carried out with a similar community
in a neighbouring area. Part of the results of this are summarised in the table below:

Expenditure on drink
Mode of travel None 1p and under £20 At least £20 Total
On foot 40 20 10
By bus 30 35 15
By car 25 33 42
Total

Suppose we put all the till invoices into a drum and thoroughly mix them up. If we close
our eyes and take out one invoice, we have selected one customer at random.

We first complete all the row and column totals.

From the 'sub-totals' we can now calculate, for example:


P(customer spends at least £20) = 67/250 = 0.268
P(customer will travel by car) =

From the ‘cells’ we can calculate, for example:


P(customer spends over £20 and travels by car) = 42/250 = 0.168
P(customer arrives on foot and spends no money on drink) =
P(customer spends over £20 or travels by car) = (10 + 15 + 42 + 25 + 33)/250 = 0.50
P(customer arrives on foot or spends no money on drink) =

These last two values are probably more easily calculated by adding the row and column
totals and then subtracting the intersecting cell which has been included twice:
P(customer spends over £20 or travels by car) = (67 + 100 - 42)/250 = 0.50
P(customer arrives on foot or spends no money on drink) =
Sometimes we need to select more than one row or column:
P(a customer will spend less than £20) = (95 + 88)/250 = 0.732

P(a customer will not travel by car) =


P(a customer will not travel by car and will spend less than £20) =
P(a customer will not travel by car or will spend less than £20) =

In the examples above all the customers have been under consideration without any
condition being applied which might exclude any of them from the overall ratio. Often
not all are included as some condition applies.

CONDITIONAL PROBABILITY
If the probability of the outcome of a second event depends upon the outcome of a
previous event then the second is conditional on the result of the first. This does not
imply a time sequence but simply that we are asked to find the probability of an event
given additional information - an extra condition, i.e. if we know that the customer
travelled by car all the other customers who did not are excluded from the calculations.

If we need P(a customer spends at least £20, if it is known that he/she travelled by car)
We eliminate from the choice all those who did not arrive in a car, i.e. we are only
interested in the third row of the table. A shorthand method of writing 'if it is known
that . . .' or 'given that . . .' is '|'.

P(a customer spends at least £20, | he/she travelled by car) = 42/100 = 0.420

P(a customer came by car | he spent at least £20 on drink) = 42/67 = 0.627

Note that
P(spending at least £20 | he/she travelled by car) ≠ P(coming by car | he spent at least £20 on drink)

P(a customer spends at least £20, | he/she travelled by bus) =

P(a customer spends less than £20, | he/she did not travel by car) =

P(a customer came on foot, | he/she spent at least £20 on drink) =

Many variations on the theme of probability have been covered in this lecture but they all
boil down to answering the question 'how many out of how many?'
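For those using the optional software, the sketch below stores the survey table and recomputes a few of the 'how many out of how many?' ratios above, including one conditional probability obtained by restricting attention to a single row.

```python
# Sketch: the Bradfield survey table and some of the probabilities above.
from fractions import Fraction

table = {                                 # frequencies: mode of travel x expenditure
    'foot': {'none': 40, 'under20': 20, 'atleast20': 10},
    'bus':  {'none': 30, 'under20': 35, 'atleast20': 15},
    'car':  {'none': 25, 'under20': 33, 'atleast20': 42},
}
total = sum(sum(row.values()) for row in table.values())            # 250

p_atleast20 = Fraction(sum(row['atleast20'] for row in table.values()), total)
print(p_atleast20, float(p_atleast20))    # 67/250 = 0.268

p_car_and_atleast20 = Fraction(table['car']['atleast20'], total)
print(float(p_car_and_atleast20))         # 0.168

# Conditional probability: only the 'car' row is now relevant.
car_total = sum(table['car'].values())    # 100
print(float(Fraction(table['car']['atleast20'], car_total)))        # 0.42
```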

EXPECTED VALUES
A commonly used method in decision making problems is the consideration of expected
values. The expected value for each decision is calculated and the option with the
maximum or minimum value (dependent upon the situation) is selected.

The expected value of each decision is defined by:


E(x) = Σpx
where x is the value associated with each outcome, E(x) is the expected value of the
event x and p is the probability of it happening. If you bought 5 out of 1000 raffle tickets
for a prize of £25 the expected value of your winnings would be 0.005 x £25 = £0.125.
If there was also a £10 prize, you would also expect to win 0.005 x £10 = £0.05 giving
£0.175 in total. (Strictly speaking there are only 999 tickets left but the difference is
negligible.)

Clearly this is only a theoretical value as in reality you would win either £25, £10 or £0.

Example 6: Two independent operations A and B are started simultaneously. The times
for the operations are uncertain, with the probabilities given below:
Operation A Operation B
Duration (days) (x) Probability (p) Duration (days) (x) Probability (p)
1 0.0 1 0.1
2 0.5 2 0.2
3 0.3 3 0.5
4 0.2 4 0.2

Determine whether A or B has the shorter expected completion time.

Operation A
E(x) = px = (1 x 0.0) + (2 x 0.5) + (3 x 0.3) + (4 x 0.2) = 2.7 days

Operation B
E(x) = px = (1 x 0.1) + (2 x 0.2) + (3 x 0.5) + (4 x 0.2) = 2.8 days

Hence Operation A has the shorter completion time.
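A minimal sketch of the E(x) = Σpx calculation for the two operations (an illustration only):

```python
# Sketch: expected completion times for Operations A and B (Example 6).
durations = [1, 2, 3, 4]                  # days, x
prob_A    = [0.0, 0.5, 0.3, 0.2]          # p for Operation A
prob_B    = [0.1, 0.2, 0.5, 0.2]          # p for Operation B

expected_A = sum(p * x for p, x in zip(prob_A, durations))
expected_B = sum(p * x for p, x in zip(prob_B, durations))
print(round(expected_A, 1), round(expected_B, 1))    # 2.7 and 2.8 days
```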

Example 7 A marketing manager is considering whether it would be more profitable to


distribute his company's product on a national or on a more regional basis. Given the
following data, what decision should be made?

National distribution
Level of Demand   Net profit £m (x)   Prob. that demand is met (p)   px
High              4.0                 0.50
Medium            2.0                 0.25
Low               0.5                 0.25

Regional distribution
Level of Demand   Net profit £m (x)   Prob. that demand is met (p)   px
High              2.5                 0.50
Medium            2.0                 0.25
Low               1.2                 0.25

Expected profits = Σpx



NORMAL PROBABILITY DISTRIBUTIONS

The normal probability distribution is one of many which can be used to describe a set
of data. The choice of an appropriate distribution depends upon the 'shape' of the data and
whether it is discrete or continuous.
You should by now be able to recognise whether the data is discrete or continuous, and a
brief look at a histogram should give you some idea about its shape. In Chapter 5 of
Business Statistics the four commonly occurring distributions below are studied. These
cover most of the different types of data.
This lecture will concentrate on just the normal distribution, but you should be aware of
the others.

Observed Data

Type:          Discrete                        Continuous
Shape:         Symmetric      Skewed           Symmetric      Skewed
Distribution:  Binomial       Poisson          Normal         Exponential

The Normal distribution is found to be a suitable model for many naturally occurring
variables which tend to be symmetrically distributed about a central modal value - the
mean.

If the variable described is continuous, its probability function is described as a


probability density function which always has a smooth curve under which the area
enclosed is unity.

The Characteristics of any Normal Distribution


The normal distribution approximately fits the actual observed frequency distributions of
many naturally occurring phenomena e.g. human characteristics such as height, weight,
IQ etc. and also the output from many processes, e.g. weights, volumes, etc.

There is no single normal curve, but a family of curves, each one defined by its mean, µ,
and standard deviation, σ; µ and σ are called the parameters of the distribution.

As we can see the curves may have different centres and/or different spreads but they all
have certain characteristics in common:
 The curve is bell-shaped,
 it is symmetrical about the mean (µ),
 the mean, mode and median coincide.

The Area beneath the Normal Distribution Curve


No matter what the values of µ and σ are for a normal probability distribution, the total
area under the curve is equal to one. We can therefore consider partial areas under the
curve as representing probabilities. The partial area between a stated number of standard
deviations below and above the mean is always the same, as illustrated below. The exact
figures for the whole distribution are to be found in standard normal tables.

Note that the curve neither finishes nor meets the horizontal axis at µ ± 3σ; it only
approaches it and actually goes on indefinitely.

Even though all normal distributions have much in common, they will have different
numbers on the horizontal axis depending on the values of the mean and standard
deviation and the units of measurement involved. We cannot therefore make use of the
standard normal tables at this stage.

The Standardised Normal Distribution


No matter what units are used to measure the original data, the first step in any
calculation is to transform the data so that it becomes a standardised normal variate
following the standard distribution which has a mean of zero and a standard deviation
of one. The effect of this transformation is to state how many standard deviations a
particular given value is away from the mean. This standardised normal variate is without
units as they cancel out in the calculation.

The standardised normal distribution


is symmetrically distributed about zero;
uses 'z-values' to describe the number of
standard deviations any value is away
from the mean; and approaches the
x-axis when the value of z exceeds 3.

The formula for calculating the exact number of standard deviations (z) away from the
mean (µ) is:

z = (x - µ) / σ
The process of calculating z is known as standardising, producing the standardised
value which is usually denoted by the letter z, though some tables may differ.

Knowing the value of z enables us to find, from the Normal tables, the area under the
curve between the given value (x) and the mean () therefore providing the probability of
a value being found between these two values. This probability is denoted by the letter Q,
which stands for quantile. (This is easier to understand with numbers!)

For standardised Normal distributions the Normal tables are used. (See back page)

Finding Probabilities under a Normal Curve


The steps in the procedure are:
 Draw a sketch of the situation,
 standardise the value of interest, x, to give z,
 use the standard tables to find the area under the curve associated with z,
 if necessary, combine the area found with another to give the required area,
 convert this to: a probability, using the area as found since total area = 1;
or a percentage, multiplying the area by 100 since total area = 100%;
or a frequency, multiplying the area by the total frequency;
as required by the question.

Example 1 (Most possible variations of questions are included here!)


In order to estimate likely expenditure by customers at a new supermarket, a sample of
till slips from a similar supermarket describing the weekly amounts spent by 500
randomly selected customers was analysed. These data were found to be approximately
normally distributed with a mean of £50 and a standard deviation of £15. Using this
knowledge we can find the following information for shoppers at the new supermarket.

The probability that any shopper selected at random:


a) spends more than £80 per week,
b) spends less than £50 per week.
The percentage of shoppers who are expected to:
c) spend between £30 and £80 per week,
d) spend between £55 and £70 per week.
The expected number of shoppers who will:

e) spend less than £70 per week,


f) spend between £37.50 and £57.50 per week.
a) The probability that a shopper selected at random spends more than £80 per week
We need P(x > £80), the probability that a customer spends over £80.

µ = £50,  σ = £15,  x = £80

(Sketch: normal curve centred on £50 with the area above £80 shaded; axis marked 5, 20, 35, 50, 65, 80, 95.)

First standardise:   z = (x - µ)/σ = (80 - 50)/15 = 30/15 = 2.00

From tables: z = 2.00 gives Q = 0.4772 (z in the margin, Q in the body of the tables)

Therefore: P(x > £80) = 0.5 - 0.4772 = 0.0228
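If software is available, the same answer can be obtained without printed tables. The sketch below uses Python's statistics.NormalDist as an illustration; the table method above is the one the handout works through.

```python
# Sketch: P(spend > £80) for a normal distribution with mean 50 and sd 15.
from statistics import NormalDist

spend = NormalDist(mu=50, sigma=15)
z = (80 - 50) / 15                        # standardised value, z = 2.00
print(z)
print(round(1 - spend.cdf(80), 4))        # about 0.0228
```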

b) The probability that a shopper selected at random spends less than £50 per week

No need to do any calculations for this question. The mean is £50 and, because the
distribution is normal, so is the median. Half the shoppers, 250, are therefore expected
to spend less than £50 per week.

Using the frequency definition of probability: 250/500 = 0.5

c) The percentage of shoppers who are expected to spend between £30 and £80 per
week

We first need P(£30 < x < £80), then we convert the result to a percentage.

µ = £50,  σ = £15,  x1 = £30,  x2 = £80

(Sketch: normal curve with the area between £30 and £80 shaded, split at the mean into Q1 and Q2.)

Our normal tables provide the area between a particular value and the mean so the area
between £30 and £80 needs splitting by the mean so that the partial areas, £30 to £50 and
£50 to £80 can be calculated separately and then the two parts recombined.

P(£50 < x < £80)

From (a): x = 80, so z2 = 2.00 and Q2 = 0.4772

x 30  50  20
P(£30 < x < £50) z1      1.333
 15 15

(The table values are all positive, so when z is negative we invoke the symmetry of the
situation and use its absolute value in the table.)

From tables: z1 = -1.333 gives Q1 = 0.4088 (by interpolation)


(Q1 lies between 0.4082 and 0.4099 and is a third of the distance between them, 0.0017,
being nearer to 0.4082 so 0.4082 + 0.0006 = 0.4088)

Therefore: P(£30 < x < £80) = Q1 + Q2 = 0.4088 + 0.4772 = 0.8860

The whole area is equivalent to 100%, so 0.8860 of it = 88.6%

d) Percentage of shoppers expected to spend between £55 and £70

We first need P(£55 < x < £70), then we convert the result to a percentage.

µ = £50,  σ = £15,  x1 = £55,  x2 = £70

(Sketch: normal curve with the area between £55 and £70 shaded.)

We now need to find the area between the mean and £70 and then subtract from it
the area between the mean and £55.

P(£50 < x < £55):   z1 = (x - µ)/σ =

From tables: Q1 =

P(£50 < x < £70):   z2 = (x - µ)/σ =

From tables: Q2 =

Therefore: P(£55 < x < £70) = Q2 - Q1 =



% of shoppers expected to spend between £55 and £70 =

e) The expected number of shoppers who will spend less than £70 per week

µ = £50,  σ = £15,  x = £70.

The area we need is that between the mean and £70 plus the 0.5 which falls below the
mean.

P(£50 < x < £70)

First standardise:   z = (x - µ)/σ =

From tables: Q =

Therefore: P(x < £70) =

The expected number of shoppers spending less than £70 is

f) The expected number of shoppers who will spend between £37.50 and £57.50 per
week

We first need P(£37.50 < x < £57.50). This is then multiplied by the total frequency.

µ = £50,  σ = £15,  x1 = £37.50,  x2 = £57.50

(Sketch: normal curve with the area between £37.50 and £57.50 shaded, split at the mean.)

P(£37.5 < x < £50)

Finding Values from Given Proportions


In this example in parts (a) to (f), we were given the value of x and had to find the area
under the normal curve associated with it. Another type of problem gives the area, even if
indirectly, and asks for the associated value of x, in this case the value of the shopping
basket. Carrying on with the same example we shall find the following information:

The value below which:


g) 70% of the customers are expected to spend,
h) 45% of the customers are expected to spend.

The value expected to be exceeded by:


i) 10% of the customers,
j) 80% of the customers.

The value below which:


k) 350 of the shoppers are expected to spend,
l) 100 of the shoppers are expected to spend.

g) The value below which 70% of the shoppers are expected to spend

µ = £50,  σ = £15,  x = £?

(Sketch: normal curve with 70% of the area below x; 0.5 lies below the mean and Q = 0.2 between the mean and x. Axis marked 5, 20, 35, 50, 65, 80, 95.)

We first need to find the value of Q.


70% of the total area, 1.00, is below x, so 20% is between µ and x, i.e. Q = 0.2000

From tables, used the other way round:


If Q = 0.2000 (in the body of the table), then z = 0.524 (in the margin of the table, by
interpolation as z lies between 0.52 and 0.53 and is a little nearer to 0.52)

We now know the value of z and need to find x.


Using the standardising formula:

z = (x - µ)/σ,   so   0.524 = (x - 50)/15
15 × 0.524 = x - 50
7.86 = x - 50
x = £57.86
The value below which 70% of customers spend is £57.86

If your algebra is a bit 'iffy' try using the fact that this value of z, 0.524, tells us that x is
0.524 standard deviations above the mean.

Its value is therefore µ + 0.524σ = 50 + 0.524 x 15 = £57.86
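The same value can be checked with the inverse of the cumulative normal function; a minimal sketch using Python's statistics.NormalDist is shown below.

```python
# Sketch: spend below which 70% of customers fall, for mean 50 and sd 15.
from statistics import NormalDist

spend = NormalDist(mu=50, sigma=15)
x = spend.inv_cdf(0.70)
print(round(x, 2))                        # about 57.87, close to the table value of £57.86
```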

h) The value below which 45% of the shoppers are expected to spend

µ = £50,  σ = £15,  x = £?

(Sketch: normal curve with 45% of the area below x and Q = 0.05 between x and the mean.)

45% is below x, so 5% lies between x and µ, i.e. Q = 0.05

From tables: if Q =            →  z =
We know, however, that x is below the mean so z is negative, -0.126.

Using the standard formula:

z = (x - µ)/σ =
The value below which 45% of customers spend is

Alternatively x is 0.126 standard deviations below the mean

x = µ - 0.126σ = 50 - 0.126 × 15 = £48.11



i) The value expected to be exceeded by 10% of the till slips

µ = £50,  σ = £15,  x = £?

10% above x → Q =

From tables: if Q =            →  z =

Using the standard formula:

z = (x - µ)/σ =
The value exceeded by 10% of the till slips is

Alternatively: x = µ + zσ =

The remainder of the questions can be completed similarly in your own time.

Answers to all the questions:

(a) 0.0228 (b) 0.5 (c) 88.6% (d) 27.8%


(e) 454 (f) 245 (g) £57.88 (h) £48.11
(i) £69.23 (j) £37.37 (k) £57.88 (l) £37.37
Table 1 AREAS UNDER THE STANDARD NORMAL CURVE

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879

0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389

1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319

1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767

2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936

2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986

3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998

ESTIMATION OF POPULATION PARAMETERS

Last week you became familiar with the normal distribution. We now estimate the
parameters of a normally distributed population by analysing a sample taken from it. In
this lecture we will be concentrating on the estimation of percentages and means of
populations but do note that any population parameter can be estimated from a sample.

Sampling
Sampling theory takes a whole lecture on its own! Since any result produced from the
sample can be used to estimate the corresponding result for the population it is absolutely
essential that the sample taken is as representative as possible of that population.
Common sense rightly suggests that the larger the sample the more representative it is
likely to be but also the more expensive it is to take and analyse. A random sample is
ideal for statistical analysis but, for various reasons, other methods also have been
devised for when this ideal is not feasible. We will not study sampling in this lecture but
just give a list of the main methods below. More details can be found in Section 6.3 of
Business Statistics.

 Simple Random Sampling


 Systematic Sampling
 Stratified Random Sampling
 Multistage Sampling
 Cluster Sampling
 Quota Sampling

It is usually neither possible nor practical to examine every member of a population so we


use the data from a sample, taken from the same population, to estimate the 'something'
we need to know about the population itself. The sample will not provide us with the
exact 'truth' but it is the best we can do. We also use our knowledge of samples to
estimate limits within which we can expect the 'truth' about the population to lie and state
how confident we are about this estimation. In other words instead of claiming that the
mean cost of buying a small house is, say, exactly £75 000 we say that it lies between
£70 000 and £80 000.

Types of Parameter estimates


These two types of estimate of a population parameter are referred to as:

 Point estimate - one particular value;


 Interval estimate - an interval centred on the point estimate.

Point Estimates of Population Parameters


From the sample, a value is calculated which serves as a point estimate for the population
parameter of interest.

a) The best estimate of the population percentage, π, is the sample percentage, p.

b) The best estimate of the unknown population mean, µ, is the sample mean,
x̄ = Σx/n. This estimate of µ is often written µ̂ and referred to as 'mu hat'.

c) The best estimate of the unknown population standard deviation, σ, is the sample
standard deviation s, where:

s = √( Σ(x - x̄)² / (n - 1) )     This is obtained from the [xσn-1] key on the calculator.

N.B.  s = √( Σ(x - x̄)² / n ), from the [xσn] key, is not used as it underestimates the value of σ.

Example 1: The Accountant at a very small supermarket wishes to obtain some


information about all the invoices sent out to its account customers. In order to obtain an
estimate of this information, a sample of twenty invoices is randomly selected from the
whole population of invoices. Use the results below obtained from them to obtain point
estimates for:
1) the percentage of invoices in the population, π, which exceed £40;
2) the population mean, µ;
3) the population standard deviation, σ.

Values of Invoices (£):

32.53 22.27 33.38 41.47 38.05 25.27 26.78 30.97 38.07 38.06
31.47 38.00 43.16 29.05 22.20 25.11 24.11 43.48 32.93 42.04

1)  p = (4/20) × 100 = 20%,  so  π̂ = 20%

2)  x̄ = 658.40/20 = £32.92,  so  µ̂ = £32.92

3)  s (from σn-1) = £7.12,  so  σ̂ = £7.12



Interval Estimate of Population Parameter (Confidence interval)


Sometimes it is more useful to quote two limits between which the parameter is expected
to lie, together with the probability of it lying in that range.

The limits are called the confidence limits and the interval between them the confidence
interval.

The width of the confidence interval depends on three sensible factors:


a) the degree of confidence we wish to have in it,
i.e. the probability of it including the 'truth', e.g. 95%;
b) the size of the sample, n;
c) the amount of variation among the members of the sample, e.g. for means this is the
standard deviation, s.

The confidence interval is therefore an interval centred on the point estimate, in this case
either a percentage or a mean, within which we expect the population parameter to lie.
The width of the interval is dependent on the confidence we need to have that it does in
fact include the population parameter, the size of the sample, n, and its standard
deviation, s, if estimating means. These last two parameters are used to calculate the
standard error, s/n, which is also referred to as the standard deviation of the mean.

The number of standard errors included in the interval is found from statistical tables -
either the normal or the t-table. Always use the normal tables for percentages which need
large samples. For means the choice of table depends on the sample size and the
population standard deviation:

                              Population standard deviation
                    Known: standard error = σ/√n        Unknown: standard error = s/√n
Sample size n
Large               Normal tables                        Normal tables
Small               Normal tables                        t-tables

Interpretation of Confidence intervals


How do we interpret a confidence interval? If 100 similar samples were taken and
analysed then, for a 95% confidence interval, we are confident that 95 of the intervals
calculated would include the true population mean. In practice we tend to say that we are
95% confident that our interval includes the true population value. Note that there is only
one true value for the population mean, it is the variation between samples which gives
the range of confidence intervals.

Confidence Intervals for a Percentage or Proportion


The only difference between calculating the interval for percentages or for proportions is
that the former total 100 and the latter total 1. This difference is reflected in the formulae
used, otherwise the methods are identical. Percentages are probably the more commonly
calculated so in Example 2 we will estimate a population percentage.

The confidence interval for a population percentage or a proportion, π, is given by:

π = p ± z √( p(100 - p)/n )  for a percentage   or   π = p ± z √( p(1 - p)/n )  for a proportion

where: π is the unknown population percentage or proportion being estimated,
       p is the sample percentage or proportion, i.e. the point estimate for π,
       z is the appropriate value from the normal tables,
       n is the sample size.

The expressions √( p(100 - p)/n ) and √( p(1 - p)/n ) represent the standard errors of a
percentage and a proportion respectively.

The samples must be large (n > 30), so that the normal table may be used in the formula.

We therefore estimate the confidence limits as being at z standard errors either side of the
sample percentage or proportion. The value of z, from the normal table, depends upon the
degree of confidence, e.g. 95%, required. We are prepared to be incorrect in our estimate
5% of the time and confidence intervals are always symmetrical so, in the tables we look
for Q to be 5%, two tails.

Example 2
In order to investigate shopping preferences at a supermarket a random sample of 175
shoppers were asked whether they preferred the bread baked in-store or that from the
large national bakeries. 112 of those questioned stated that they preferred the bread baked
in-store. Find the 95% confidence interval for the percentage of all the store's customers
who are likely to prefer in-store baked bread.
The point estimate for the population percentage, π, is  p = (112/175) × 100 = 64%

Use the formula:  π = p ± z √( p(100 - p)/n ),  where p = 64 and n = 175

From the short normal table: 95% confidence → 5%, two tails → z = 1.96

p ± z √( p(100 - p)/n ) = 64 ± 1.96 √( (64 × 36)/175 ) =

the confidence limits for the population percentage, π, are            and

            <  π  <

Confidence interval for the Population Mean, , when the population standard
deviation, , is known.

Example 3: For the small supermarket as a whole it is known that the standard deviation
of the wages for part-time employees is £1.50.
A random sample of 10 employees from the small supermarket gave a mean wage of
£4.15 per hour. Assuming the same standard deviation, calculate the 95% confidence
interval for the average hourly wage for employees of the small branch and use it to see
whether the figure could be the same as for the whole chain of supermarkets which has a
mean value of £4.50.

As we actually know the population standard deviation we do not need to estimate it


from the sample standard deviation. The normal table can therefore be used to find the
number of standard errors in the interval.


Confidence Interval:  µ = x̄ ± z σ/√n,  where z comes from the short normal table

µ = x̄ ± z σ/√n = 4.15 ± 1.96 × 1.50/√10 =

This interval includes the mean, £4.50, for the whole chain so the average hourly wage
could be the same for all employees of the small supermarket.
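A quick numerical check of this interval, as a minimal Python sketch (x-bar = £4.15, sigma = £1.50, n = 10 and z = 1.96, as above):

    import math

    x_bar, sigma, n, z = 4.15, 1.50, 10, 1.96
    half_width = z * sigma / math.sqrt(n)          # z standard errors
    print(round(x_bar - half_width, 2), round(x_bar + half_width, 2))   # about 3.22 to 5.08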

Confidence interval for the Population Mean, , when the population standard
deviation is not known, so needs estimating from the sample standard deviation, s.

The population standard deviation is unknown, so the t-table must be used to compensate
for the probable error in estimating its value from the sample standard deviation.

Example 4: Find the 99% confidence interval for the mean value of all the invoices, in
Example 1, sent out by the small supermarket branch. If the average invoice value for the
whole chain is £38.50, is the small supermarket in line with the rest of the branches?

Confidence Interval:  µ = x̄ ± t s/√n,  where the value of t comes from the table of
'percentage points of the t-distribution' using n - 1 degrees of freedom (ν = n - 1).

From Example 1:  x̄ = £32.92,  s = £7.12,  n = 20;  degrees of freedom = (20 - 1) = 19;
99% confidence; from tables t = 2.87

µ = x̄ ± t s/√n =

This interval does not include £38.50, so the small branch is out of line with the rest.
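A corresponding check for the t-interval, as a minimal Python sketch assuming scipy is available (scipy's exact t value of about 2.86 differs slightly from the interpolated 2.87 used above):

    import math
    from scipy.stats import t

    x_bar, s, n = 32.92, 7.12, 20
    t_value = t.ppf(0.995, df=n - 1)               # 99% two-sided, so 0.5% in each tail
    half_width = t_value * s / math.sqrt(n)
    print(round(x_bar - half_width, 2), round(x_bar + half_width, 2))
    # roughly £28.4 to £37.5, so the interval excludes £38.50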
Comparison of Means using 'Overlap' in Confidence Intervals

We are going to extend the previous method to see if two populations could have the
same mean or, alternatively, if two samples could have come from the same population as
judged by their means. We assume that the standard deviations of both populations are
the same.

If, for example, a supermarket chain wished to see if a local advertising campaign was
successful or not they could take a sample of customer invoices before the campaign and
another after the campaign and calculate confidence intervals for the mean spending of all
customers at both times. If the intervals were found to overlap the means could be the
same so the campaign might have had no effect. If, on the other hand, they were quite
separate with the later sample giving the higher interval then the campaign must have
been effective in increasing sales.

Example 5
The till slips of supermarket customers were sampled both before and after an advertising
campaign and the results were analysed with the following results:

Before: x = £37.60, s = £6.70, n = 25


After: x = £41.78, s = £5.30, n = 25

Has the advertising campaign been successful in increasing the mean spending of all
the supermarket customers? Calculate two 95% confidence intervals and compare the
results.

For both, n = 25 so 24 degrees of freedom giving t = 2.06 for 5%, 2-tails.

Before:  µB = x̄B ± t sB/√nB = 37.60 ± 2.06 × 6.70/√25 = 37.60 ± 2.76
£34.84 < µB < £40.36

After:  µA = x̄A ± t sA/√nA =

Interpretation: The sample mean had risen considerably but, because the confidence
intervals overlap, the mean values for all the sales may lie in the common ground. There
may be no difference between the two means so the advertising campaign has
not been proved to be successful.

We shall improve on this method in the next example.
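Both intervals can be produced with one small helper function. A minimal Python sketch assuming scipy is available (the function name ci is just an illustrative choice):

    import math
    from scipy.stats import t

    def ci(mean, s, n, conf=0.95):
        # two-sided confidence interval for a mean, with sigma estimated by s
        t_value = t.ppf(1 - (1 - conf) / 2, df=n - 1)
        half = t_value * s / math.sqrt(n)
        return round(mean - half, 2), round(mean + half, 2)

    print(ci(37.60, 6.70, 25))   # 'before' interval, roughly (34.84, 40.37)
    print(ci(41.78, 5.30, 25))   # 'after' interval, roughly (39.59, 43.97); the two overlap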


Confidence Intervals for Paired Data


If two measures are taken from each case, i.e. 'before' and 'after', in every instance then
the 'change' or 'difference' for each case can be calculated and a confidence interval for
the mean of the 'changes' calculated. If the data can be 'paired', i.e. it is not independent,
then this method should be used as a smaller interval is produced for the same percentage
confidence giving a more precise estimate.
Confidence Interval:  µd = x̄d ± t sd/√nd,  where x̄d, sd and nd refer to the calculated differences.
Example 6
The supermarket statistician realised that there was a considerable range in the spending
power of its customers. Even though the overall spending seemed to have increased, the
high spenders still spent more than the low spenders, and the individual increases
would show a smaller spread. In other words these two populations, 'before' and 'after',
are not independent.
Before the next advertising campaign at the supermarket, he took a random sample of 10
customers, A to J, and collected their till slips. After the campaign, slips from the same
10 customers were collected and both sets of data recorded. Using the paired data, has
there been any mean change at a 95% confidence level?

A B C D E F G H I J
Before 42.30 55.76 32.29 10.23 15.79 46.50 32.30 78.65 32.20 15.90
After 43.09 59.20 31.76 20.78 19.50 50.67 37.32 77.80 37.39 17.24

We first need to calculate the differences. The direction doesn't matter but it seems
sensible to take the earlier amounts away from the later ones to find the changes:
A B C D E F G H I J
Diffs. 0.79 3.44 -0.53 10.55 3.71 4.17 5.02 -0.85 5.19 1.34
We can now forget the original two data sets and just work with the differences.

From the differences:  x̄d = £3.28,  sd = £3.37  and  nd = 10

95% C.I.:  µd = x̄d ± t sd/√nd = 3.28 ± 2.26 × 3.37/√10 = 3.28 ± 2.41
£0.87 < µd < £5.69

This interval does not include zero so the possibility of 'no change' has been eliminated.
Because the mean amount spent after the campaign is greater than that before it, there has
been a significant increase in spending.

In general: If the confidence interval includes zero, i.e. it changes from negative to
positive, then there is a possibility that no change has taken place and the original
situation has remained unchanged. If both limits have the same sign, as above, then zero
is excluded and some change must have taken place.
When interpreting this change look carefully at the direction of the change as this will
depend on your order of subtraction. In the above case it is obvious that x d is positive so
there has been an increase in the average spending.
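The same paired interval can be reproduced in a few lines, and it matches the Minitab and SPSS output shown on the next page. A minimal Python sketch assuming scipy is available (the till-slip figures are those in the table above):

    import math
    import statistics
    from scipy.stats import t

    before = [42.30, 55.76, 32.29, 10.23, 15.79, 46.50, 32.30, 78.65, 32.20, 15.90]
    after  = [43.09, 59.20, 31.76, 20.78, 19.50, 50.67, 37.32, 77.80, 37.39, 17.24]

    diffs = [a - b for a, b in zip(after, before)]        # 'after minus before' changes
    mean_d, s_d, n = statistics.mean(diffs), statistics.stdev(diffs), len(diffs)
    half = t.ppf(0.975, df=n - 1) * s_d / math.sqrt(n)    # 95%, two-sided
    print(round(mean_d - half, 2), round(mean_d + half, 2))   # about 0.87 and 5.69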

COMPUTER ANALYSIS OF EXAMPLES 1 AND 6


Confidence Intervals

Example 1 (Minitab)
Variable N Mean StDev SE Mean 95.0 % CI
Invoices 20 32.92 7.12 1.59 ( 29.59, 36.25)

Descriptives (SPSS)

Statistic Std. Error


INVOICES Mean 32.9200 1.5912
95% Confidence Lower Bound 29.5896
Interval for Mean Upper Bound 36.2504
5% Trimmed Mean 32.9289
Median 32.7300
Variance 50.639
Std. Deviation 7.1161
Minimum 22.20
Maximum 43.48
Range 21.28
Interquartile Range 12.4200
Skewness -.028 .512
Kurtosis -1.304 .992

Example 6 (Minitab)
Variable N Mean StDev SE Mean 95.0 % CI
Diffs. 10 3.28 3.37 1.06 ( 0.87, 5.69)

Descriptives (SPSS)

Statistic Std. Error


DIFFS Mean 3.2830 1.0649
95% Confidence Lower Bound .8739
Interval for Mean Upper Bound 5.6921
5% Trimmed Mean 3.1089
Median 3.5750
Variance 11.341
Std. Deviation 3.3676
Minimum -.85
Maximum 10.55
Range 11.40
Interquartile Range 4.6025
Skewness .901 .687
Kurtosis 1.376 1.334

Table 2 PERCENTAGE POINTS OF THE t-DlSTRlBUTlON

One tailed Two tailed

One tail α     5%      2.5%     1%      0.5%     0.1%     0.05%

Two tails α    10%     5%       2%      1%       0.2%     0.1%

ν = 1    6.31    12.71    31.82    63.66    318.3    636.6

    2    2.92     4.30     6.96     9.92    22.33    31.60
    3    2.35     3.18     4.54     5.84    10.21    12.92
    4    2.13     2.78     3.75     4.60     7.17     8.61
    5    2.02     2.57     3.36     4.03     5.89     6.87

6 1.94 2.45 3.14 3.71 5.21 5.96


7 1.89 2.36 3.00 3.50 4.79 5.41
8 1.86 2.31 2.90 3.36 4.50 5.04
9 1.83 2.26 2.82 3.25 4.30 4.78
10 1.81 2.23 2.76 3.17 4.14 4.59

12 1.78 2.18 2.68 3.05 3.93 4.32


15 1.75 2.13 2.60 2.95 3.73 4.07
20 1.72 2.09 2.53 2.85 3.55 3.85
24 1.71 2.06 2.49 2.80 3.47 3.75
30 1.70 2.04 2.46 2.75 3.39 3.65

40 1.68 2.02 2.42 2.70 3.31 3.55


   60    1.67     2.00     2.39     2.66     3.23     3.46
    ∞    1.64     1.96     2.33     2.58     3.09     3.29
ν = no. of degrees of freedom,  α = total hatched area in tails

Table 3 % POINTS OF THE STANDARD NORMAL CURVE

One tailed Two tailed

One tail 5% 2.5% 1% 0.5% 0.1% 0.05%


Two tails 10% 5% 2% 1% 0.2% 0.1%

z 1.64 1.96 2.33 2.58 3.09 3.29



HYPOTHESIS TESTING
Last week we estimated an unknown population parameter from the corresponding
statistic obtained from the analysis of a sample from the population. This week we shall
similarly analyse a sample and then use its statistic to see whether some claim made
about the population is reasonable or not. We shall test proportions and means.

When an estimate from a sample is used to test some belief, claim or hypothesis about
the population the process is known as hypothesis testing (Significance Testing).

Some conclusion about the population is drawn from evidence provided by the sample.
We shall test for proportions and means. A formal procedure is used for this technique.

The Method which is common to all Hypothesis Tests:


(e.g. A claim has been made that the average male salary in a particular firm is £15 000.)

1 State the null hypothesis (H0) - This is a statement about the population which may, or
may not, be true. It takes the form of an equation making a claim about the population.
H0: µ = £15 000 where µ is the population mean.

2 State the alternative hypothesis (H1) - the conclusion reached if the Null Hypothesis is
rejected. It will include the terms: not equal to ( ¹ ), greater than ( > ), or less than ( < ).
H1: µ ¹ £15 000, or H1: µ < £15 000, or H1: µ > £15 000

3 State the significance level () of the test - the proportion of the time you are willing to
reject H0 when it is in fact true. If not stated specifically a 5% (0.05) significance level is
used. From computer output, the p-value is the probability of obtaining a sample result at least as extreme as the one observed when H0 is true; H0 is rejected if the p-value is smaller than α.

4 Find the critical value - this is the value from the appropriate table which the test statistic
is required to reach before we can reject H0. It depends on the significance level, , and
whether the test is one- or two-tailed.

5 Calculate the test statistic - This value is calculated from the data obtained from the
sample. From the data we find the sample statistics, such as the mean and standard
deviation, which we need for calculating the test statistic; these are then substituted into
the appropriate formula to find its value. (Each formula is used in one specific type of
hypothesis test only and so needs careful selection.)

6 Reach a conclusion - Compare the values of the Test Statistic and the Critical Value. If
the Test Statistic is greater than the Critical Value then we have enough evidence from
the sample data to cause us to reject the null hypothesis and conclude that the Alternative
Hypothesis is true, e.g. the mean salary is not £15 000. If the Test Statistic is not greater
than the Critical Value then we do not have sufficient evidence and so do not reject the
null hypothesis, e.g. the mean salary for the whole firm could be £15 000.

(NB We can prove that the Null Hypothesis is false but we can never prove it to be true
since we do not have all the evidence possible as only a sample is available for analysis.)
Graphical interpretation of conclusion

If the test statistic falls into the critical region there is sufficient evidence to cause the null
hypothesis to be rejected. The total shaded area is the significance level,  (alpha), as
either a percentage (5%) or a decimal (0.05). The methodology will seem simpler with
specific numerical examples!

TESTING FOR THE PROPORTION, OR PERCENTAGE, OF A POPULATION


The critical value comes from the Normal tables as all sample sizes are large for this test.
H0: π = c, where c is the hypothesised population proportion or %.     H1: π ≠ c, π > c, or π < c
Significance Level: as stated or 5%.

Critical Value: from Normal tables

Test Statistic: p where p and  are the sample and population
Calculated by  1    proportions respectively, and n the sample size.
n
p
For a percentage use  100   
n
Conclusion: H0 is rejected if the test statistic is so large that the sample could not have
come from a population with the hypothesised proportion.

Example 1
A company manufacturing a certain brand of breakfast cereal claims that 60% of all
housewives prefer its brand to any other. A random sample of 300 housewives include
165 who do prefer the brand. Is the true percentage as the company claims or lower at the
5% significance level?
Given Information:  π = 60%?,  p = (165/300) × 100 = 55%,  n = 300
H0: π = 60%     H1: π < 60%
Critical Value: Normal tables, 1 tail, 5% significance, 1.645
p 55  60 5
   177
.
Test Statistic:   100    60  40 8
n 300
Conclusion: Test statistic (1.77) > critical value (1.645), so reject H0. The percentage preferring the brand is less than 60%.
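The test statistic can be verified with a short calculation. A minimal Python sketch using only the standard library (figures as in the example above):

    import math

    pi0, n = 60.0, 300                      # hypothesised percentage and sample size
    p = 165 / 300 * 100                     # sample percentage, 55%
    z = (p - pi0) / math.sqrt(pi0 * (100 - pi0) / n)
    print(round(z, 2))                      # about -1.77; its size exceeds 1.645, so H0 is rejected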

Example 2
An auditor claims that 10% of invoices for a certain company are incorrect. To test this
claim a random sample of 200 invoices are checked and 24 are found to be incorrect. Test
at the 1% significant level to see if the auditor's claim is supported by the sample
evidence.

Given Information: π = 10%?,  p =          , n =

H0: π            H1: π

Critical Value: Normal tables, 2 tail,

Test Statistic:  z = (p - π) / √( π(100 - π)/n ) =

Conclusion: Test statistic is less than critical value so H0 cannot be rejected.


The percentage of incorrect invoices is consistent with the auditor's claim of 10%.

TESTING FOR THE MEAN OF A POPULATION


The methods described in this section require the population to be measured on the
interval or ratio scales and to be normally distributed. You may assume this to be correct
for any data presented here.

As with confidence intervals, before deciding which test to use we have to ask whether
the sample size is large or small and whether the population standard deviation is known
or has to be estimated from the sample. The methods used are basically the same but
different formulae and tables need to be used.

Summarising again the choice of statistical table:

                              Population standard deviation
                    Known: standard error = σ/√n        Unknown: standard error = s/√n
Sample size n
Large               Normal tables: z-test               Normal tables: z-test
Small               Normal tables: z-test               t-tables: t-test



Method
A mean value is hypothesised for the population. A sample is taken from that population
and its mean value calculated. This sample mean is then used to see if the value
hypothesised for the population is reasonable or not. If the population standard deviation
is not known then that from the sample is calculated and used to estimate it. The
appropriate test to use, either a t-test or a z-test, depends on whether this estimated value
has been used or not.

H0: µ = c     H1: µ ≠ c, µ < c, or µ > c,  where µ is the population mean and
c is the hypothesised value.
Significance Level :  = 5%, or as stated in the question.
Critical value From normal (z) table or t-tables, significance level, number of tails,
degrees of freedom.

If we know the population standard deviation and do not need to estimate it the normal
table is used. If we need to estimate it then a slightly larger critical value is found from
the t-table.
Test statistic:  z = (x̄ - µ) / (σ/√n)   or   t = (x̄ - µ) / (s/√n)
where x̄ and µ are the means of the sample and population respectively, and s and σ are
their standard deviations.
Conclusion Compare the test statistic with the critical value. Decide whether to reject Ho
or not.
Conclude in terms of question.

Example 3
A packaging device is set to fill detergent packets with a mean weight of 150g. The
standard deviation is known to be 5.0g. It is important to check the machine periodically
because if it is overfilling it increases the cost of the materials, whereas if it is
underfilling the firm is liable to prosecution. A random sample of 25 filled boxes is
weighed and shows a mean net weight of 152.5g. Can we conclude that the machine is no
longer producing the 150g. quantities? Use a 5% significance level. (0.05 sig. level)
Given:  µ = 150 g?,  σ = 5 g,  n = 25,  x̄ = 152.5 g
σ is known so a z-test is appropriate. The machine can 'no longer produce the 150 g
quantities' with quantities which are either too heavy or too light, therefore the appropriate
test is two tailed.
H0: µ = 150 g     H1: µ ≠ 150 g

Significance level (): 5% (0.05)


Critical value: σ known, therefore normal tables, 5%, two tailed: 1.96.
Test statistic:  z = (x̄ - µ) / (σ/√n) = (152.5 - 150) / (5/√25) = 2.5/1 = 2.5
Conclusion: The test statistic exceeds the critical value so reject H0 and conclude that
the mean weight produced is no longer 150g.
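A quick check of the z statistic, as a minimal Python sketch (figures as given in the example):

    import math

    mu0, sigma, n, x_bar = 150, 5.0, 25, 152.5
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    print(z)                                # 2.5, which exceeds the two-tailed 5% value of 1.96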

Example 4

The mean and standard deviation of the weights produced by this same packaging device
set to fill detergent packets with a mean weight of 150g, are known to drift upwards over
time due to the normal wearing of some bearings. Obviously it cannot be allowed to drift
too far so a large random sample of 100 boxes is taken and the contents weighed. This
sample has a mean weight of 151.0 g and a standard deviation of 6.5g. Can we conclude
that the mean weight produced by the machine has increased? Use a 5% significance
level. (0.05 sig. level)

Given:  µ = 150 g?,  n = 100,  x̄ = 151.0 g,  s = 6.5 g

We are only interested in whether the mean weight has increased or not so a ? tailed
test is appropriate.

H0: µ H1: µ

Significance level ():

Critical value: s is unknown but the sample is large therefore use the normal tables,
5%, one tailed,

Test statistic:  z = (x̄ - µ) / (s/√n) =

Conclusion: The test statistic the critical value so we reject H0


and conclude that the machine

Example 5 One sample t-test

The personnel department of a company developed an aptitude test for screening


potential employees. The person who devised the test asserted that the mean mark
attained would be 100. The following results were obtained with a random sample of
applicants:

x = 96, s = 5.2, n = 13.

Test this hypothesis against the alternative that the mean mark is less than 100, at the 1%
significance level.

Given: µ = 100?, x = 96, s = 5.2, n = 13.

H0: µ H1: µ

Significance level (): 1% (0.01)

Critical Value: s unknown so t-tables. 1% sig., 1 tail, n - 1 = 12 d. of f. = 2.68

Test statistic:  t = (x̄ - µ) / (s/√n) =

Conclusion: The test statistic the critical value H0 and


conclude that the mean mark
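For summary data like this, the t statistic is easiest to compute directly. A minimal Python sketch; it produces the value needed for the blanks above, so treat it as a check on your own working:

    import math

    mu0, x_bar, s, n = 100, 96, 5.2, 13
    t_stat = (x_bar - mu0) / (s / math.sqrt(n))
    print(round(t_stat, 2))     # about -2.77; its size exceeds the 1%, one-tailed value of 2.68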

Testing for Mean of Differences of Paired Data - Paired t-test


If we have paired data the 'difference' between each pair is first calculated and then
these differences are treated as a single set of data in order to consider whether there has
been any change, or if there is any difference between them. In practice we carry out a
one sample t-test on the differences.

Method

H0: µd = 0, where µd is the mean of the differences.     H1: µd ≠ 0, µd < 0, or µd > 0

Significance Level (): 5%, or as stated.

Critical value t-table, deg of freedom, % significance, number of tails.


Test statistic:  t = (x̄d - 0) / (sd/√nd),  where the subscript d denotes the 'differences'.
Conclusion Compare the test statistic with the critical value. Decide whether to reject H0
or not. Conclude in terms of the question.

Example 6 A training manager wishes to see if there has been any alteration in the
aptitude of his trainees after they have been on a course. He gives each an aptitude test
before they start the course and an equivalent one after they have completed it. The
scores are recorded below. Has any change taken place at a 5% significance level?

Trainee A B C D E F G H I
Score before Training 74 69 45 67 67 42 54 67 76
Score after Training 69 76 56 59 78 63 54 76 75

First the 'changes' are computed and then a simple t-test is carried out on the result.

Changes (After - Before): -5, +7, +11, -8,

H0: µd = 0;   H1: µd ≠ 0.

Critical value: t-table, % significance, deg of freedom, tailed =

Test statistic:  t = (x̄d - 0) / (sd/√nd);   x̄d =       ,  sd =       ,  n = 9.

Conclusion T.S. less extreme than C.V. so do not reject H0. There may be no change.
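With the raw scores available, a statistics package will carry out the whole paired test. A minimal Python sketch assuming scipy is available; the resulting t of about 1.63 is less extreme than the critical value of 2.31, matching the conclusion above:

    from scipy import stats

    before = [74, 69, 45, 67, 67, 42, 54, 67, 76]
    after  = [69, 76, 56, 59, 78, 63, 54, 76, 75]

    t_stat, p_value = stats.ttest_rel(after, before)   # paired (related-samples) t-test
    print(round(t_stat, 2), round(p_value, 3))         # about 1.63 and 0.14, so do not reject H0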

Only the most commonly used of hypothesis tests have been included in this lecture.
You will meet others in the next few weeks and a further selection, including non-
parametric tests, are included in 'Business Statistics', Chapter 7.

Table 2 PERCENTAGE POINTS OF THE t-DlSTRlBUTlON

One tailed Two tailed

One tail α     5%      2.5%     1%      0.5%     0.1%     0.05%

Two tails α    10%     5%       2%      1%       0.2%     0.1%

ν = 1    6.31    12.71    31.82    63.66    318.3    636.6

    2    2.92     4.30     6.96     9.92    22.33    31.60
    3    2.35     3.18     4.54     5.84    10.21    12.92
    4    2.13     2.78     3.75     4.60     7.17     8.61
    5    2.02     2.57     3.36     4.03     5.89     6.87

6 1.94 2.45 3.14 3.71 5.21 5.96


7 1.89 2.36 3.00 3.50 4.79 5.41
8 1.86 2.31 2.90 3.36 4.50 5.04
9 1.83 2.26 2.82 3.25 4.30 4.78
10 1.81 2.23 2.76 3.17 4.14 4.59

12 1.78 2.18 2.68 3.05 3.93 4.32


15 1.75 2.13 2.60 2.95 3.73 4.07
20 1.72 2.09 2.53 2.85 3.55 3.85
24 1.71 2.06 2.49 2.80 3.47 3.75
30 1.70 2.04 2.46 2.75 3.39 3.65

40 1.68 2.02 2.42 2.70 3.31 3.55


   60    1.67     2.00     2.39     2.66     3.23     3.46
    ∞    1.64     1.96     2.33     2.58     3.09     3.29
ν = no. of degrees of freedom,  α = total hatched area in tails

Table 3 % POINTS OF THE STANDARD NORMAL CURVE

One tailed Two tailed

One tail 5% 2.5% 1% 0.5% 0.1% 0.05%


Two tails 10% 5% 2% 1% 0.2% 0.1%

z 1.64 1.96 2.33 2.58 3.09 3.29



ANALYSIS OF VARIANCE (ANOVA)

Analysis of Variance, in its simplest form, can be thought of as an extension of the


two-sample t-test to more than two samples. (Further methods in Chapter 8 of Business
Statistics)
As an example: Using the 2-sample t-test we have tested to see whether there was any
difference between the size of invoices in a company's Leeds and Bradford stores. Using
Analysis of variance we can, simultaneously, investigate invoices from as many towns as
we wish, assuming that sufficient data is available.

Problem: Why can't we just carry out repeated t-tests on pairs of the variables?

If many independent tests are carried out pairwise then the probability of being correct
for the combined results is greatly reduced. For example, if we compare the average
marks of two students at the end of a semester to see if their mean scores are significantly
different we would have, at a 5% level, 0.95 probability of being correct. Comparing
more students:

Students    Pairwise tests    P(all correct)            P(at least one incorrect)
   2             1            0.95                      0.05
   3             3            0.95³ = 0.857             0.143
   n          n(n-1)/2        0.95^(n(n-1)/2)           1 - 0.95^(n(n-1)/2)
  10            45            0.95⁴⁵ = 0.10             0.90

Obviously this situation is not acceptable

Solution: We need therefore to use methods of analysis which will allow the variation
between all n means to be tested simultaneously giving an overall probability of 0.95
of being correct at the 5% level. This type of analysis is referred to as Analysis of
Variance or 'ANOVA' in short. In general, it:
 measures the overall variation within a variable;
 finds the variation between its group means;
 combines these to calculate a single test statistic;
 uses this to carry out a hypothesis test in the usual manner.

In general, Analysis of Variance, ANOVA, compares the variation between groups and
the variation within samples by analysing their variances.

One-way ANOVA: Is there any difference between the average sales at various
departmental stores within a company?

Two-way ANOVA: Is there any difference between the average sales at various stores
within a company and/or the types of department? The overall variation is split 'two
ways'.

One-way ANOVA

Total variation (SST) splits into:
 Variation due to differences between the groups, i.e. between the group means (SSG);
 Residual (error) variation not due to differences between the group means (SSE).

Two-way ANOVA

Total variation (SST) splits into:
 Variation due to differences between the groups, i.e. between the main group means (SSG);
 Residual (error) variation not due to differences between the main group means (SSE1),
  which splits further into:
   Variation due to differences between the block means, i.e. the second set of group means (SSBl);
   Genuine residual (error) variation not due to differences between either set of group means (SSE).

where SST = Total Sum of Squares; SSG = Treatment Sum of Squares between the
groups; SSBl = Blocks Sum of Squares; SSE = Sum of Squares of Errors.
(At this stage just think of 'sums of squares' as being a measure of variation.)

The method of measuring this variation is variance, which is standard deviation squared.
Total variance = between groups variance + variance due to the errors

It follows that:  Total sum of squares (SST) = Sum of squares between the groups (SSG) + Sum of squares due to the errors (SSE)

If we find any two of the three sums of squares then the other can be found by difference.

In practice we calculate SST and SSG and then find SSE by difference.

Since the method is much easier to understand with a numerical example, it will be
explained in stages using theory and a numerical example simultaneously.

Example 1

One important factor in selecting software for word processing and database management
systems is the time required to learn how to use a particular system. In order to evaluate
three database management systems, a firm devised a test to see how many training hours
were needed for five of its word processing operators to become proficient in each of
three systems.
System A 16 19 14 13 18 hours
System B 16 17 13 12 17 hours
System C 24 22 19 18 22 hours

Using a 5% significance level, is there any difference between the training time needed
for the three systems?

In this case the 'groups' are the three database management systems. These account for
some, but not all, of the total variance. Some, however, is not explained by the difference
between them. The residual variance is referred to as that due to the ‘errors’.

Total variance = between systems variance + variance due to the errors.

It follows that:  Total sum of squares (SST) = Sum of squares between systems (SSSys) + Sum of squares of errors (SSE)

In practice we transpose this equation and use: SSE = SST - SSSys

Calculation of Sums of Squares


The 'square' for each case is (x - x̄)², where x is the value for that case and x̄ is the
mean. The 'total sum of squares' is therefore Σ(x - x̄)². The classical method for
calculating this sum is to tabulate the values; subtract the mean from each value; square
the results; and finally sum the squares. The use of a statistical calculator is preferable!

In the lecture on summary statistics we saw that the standard deviation is calculated by:

sn = √( Σ(x - x̄)² / n ),  so  Σ(x - x̄)² = n sn²,  with s from the calculator using [xσn]

or  sn-1 = √( Σ(x - x̄)² / (n - 1) ),  so  Σ(x - x̄)² = (n - 1) sn-1²,  with s from the calculator using [xσn-1].

Both methods give exactly the same value for the total sum of squares.

Total SS:  Input all the data individually and output the values of n, x̄ and σn from
the calculator in SD mode. Use these values to calculate σn² and nσn².

            n        x̄        σn        σn²       nσn²
           15      17.33     3.419     11.69     175.3   = SST


SSSys:  Calculate n and x̄ for each of the management systems separately:

                 n      x̄
   System A      5     16
   System B      5     15
   System C      5     21

Input these means as frequency data (frequency 5 for each mean) in your calculator and output n, x̄ and σn.

                        n        x̄        σn        σn²       nσn²
   SS for Systems      15      17.33     2.625     6.889     103.3   = SSSys

SSE is found by difference SSE = SST - SSSys = 175.3 - 103.3 = 72.0

Using the ANOVA table


Below is the general format of an analysis of variance table. If you find it helpful then
make use of it, otherwise just work with the numbers, as on the next page.

General ANOVA Table (for k groups, total sample size N)

 Source            S.S.    d.f.                        M.S.S.               F
 Between groups    SSG     k - 1                       SSG/(k - 1) = MSG    MSG/MSE = F
 Errors            SSE     (N - 1) - (k - 1) = N - k   SSE/(N - k) = MSE
 Total             SST     N - 1

Method
 Fill in the total sum of squares, SST, and the between groups sum of squares, SSG,
after calculation; find the sum of squares due to the errors, SSE, by difference;
 the degrees of freedom, d.f., for the total and the groups are one less than the total
number of values and the number of groups respectively; find the error degree of
freedom by difference;
 the mean sums of squares, M.S.S., is found in each case by dividing the sum of
squares, SS, by the corresponding degrees of freedom.
 The test statistic, F, is the ratio of the mean sum of squares due to the differences
between the group means and that due to the errors.

In this example: (k = 3 systems, N = 15 values)

 Source             S.S.    d.f.            M.S.S.              F
 Between systems    103.3   3 - 1 = 2       103.3/2 = 51.65     51.65/6.00 = 8.61
 Errors              72.0   14 - 2 = 12      72.0/12 = 6.00
 Total              175.3   15 - 1 = 14

The hypothesis test


The methodology for this hypothesis test is similar to that described last week.

The null hypothesis, H0, is that all the group means are equal.  H0: µ1 = µ2 = µ3 = µ4 etc.

The alternative hypothesis, H1, is that at least two of the group means are different.

The significance level is as stated or 5% by default.

The critical value is from the F-tables, Fα(ν1, ν2), with the two degrees of freedom taken
from the groups, ν1, and the errors, ν2.

The test statistic is the F-value calculated from the sample in the ANOVA table.

The conclusion is reached by comparing the test statistic with the critical value and
rejecting the null hypothesis if the test statistic is the larger of the two.

Example 1 (cont.)

H0: µA = µB = µC     H1: At least two of the means are different.


Critical value: F0.05 (2,12) = 3.89 (Deg. of free. from 'between systems' and 'errors'.)
Test statistic: 8.61
Conclusion: T.S. > C.V. so reject H0. There is a difference between the mean learning
times for at least two of the three database management systems.

Where does any difference lie?


We can calculate a critical difference, CD, which depends on the MSE, the sample sizes
and the significance level, such that any difference between means which exceeds the CD
is significant and any less than it is not.
The critical difference formula is:  CD = t √( MSE (1/n1 + 1/n2) )
where t has the error degrees of freedom and one tail, and MSE comes from the ANOVA table.

Example1 (cont.)

1 1  1 1
CD  t MSE     1.78 6.00    2.76
 n1 n 2  5 5

From the samples:  x̄A = 16,  x̄B = 15,  x̄C = 21.


System C takes significantly longer to learn than Systems A and B which are similar.
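The one-way analysis and the critical difference can both be reproduced in software. A minimal Python sketch assuming scipy is available (the learning-time data are those of Example 1):

    import math
    from scipy import stats

    system_a = [16, 19, 14, 13, 18]
    system_b = [16, 17, 13, 12, 17]
    system_c = [24, 22, 19, 18, 22]

    f_stat, p_value = stats.f_oneway(system_a, system_b, system_c)
    print(round(f_stat, 2), round(p_value, 4))       # F about 8.61, p about 0.005

    # critical difference between two group means: error df = 12, MSE = 6.00, n = 5 per group
    mse, n = 6.00, 5
    cd = stats.t.ppf(0.95, df=12) * math.sqrt(mse * (1 / n + 1 / n))
    print(round(cd, 2))                              # about 2.76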

Two-way ANOVA
In the above example it might have been reasonable to suggest that the five Operators
might have different learning speeds and were therefore responsible for some of the
variation in the time needed to master the three Systems. By extending the analysis from
one-way ANOVA to two-way ANOVA we can find out whether Operator variability is a
significant factor or whether the differences found previously were just due to the
Systems.

Example 2 Operators
1 2 3 4 5
System A 16 19 14 13 18
System B 16 17 13 12 17
System C 24 22 19 18 22

Again we ask the same question: using a 5% level, is there any difference between the
training time for the three systems? We can use the Operator variation just to explain
some of the unexplained error thereby reducing it, 'blocked' design, or we can consider it
in a similar manner to the System variation in the last example in order to see if there is a
difference between the Operators.

In the first case the 'groups' are the three database management systems and the 'blocks'
being used to reduce the error are the different operators who themselves may differ in
speed of learning. In the second we have a second set of groups - the Operators.

Total variance = between systems variance + between operators variance


+ variance of errors.

So:  Total sum of squares (SST) = Sum of squares between systems (SSSys) + Sum of squares between operators (SSOps) + Sum of squares of errors (SSE)

In 2-way ANOVA we find SST, SSSys, SSOps and then find SSE by difference.
SSE = SST - SSSys - SSOps

We already have SST and SSSys from 1-way ANOVA but still need to find SSOps.

Operators

1 2 3 4 5 Means
System A 16 19 14 13 18 16.00
System B 16 17 13 12 17 15.00
System C 24 22 19 18 22 21.00
Means 18.67 19.33 15.33 14.33 19.00 17.33

From example 1: SST = 175.3 and SSSys = 103.3

SSOps:  Inputting the Operator means as frequency data (frequency 3 for each mean) gives:

n = 15,  x̄ = 17.33,  σn = 2.078  and  nσn² = 64.7 = SSOps

Two-way ANOVA table, including both Systems and Operators:


Source S.S. d.f. M.S.S. F
Between Systems 103.3 3-1=2 103.3/2 = 51.65 51.65/0.91 = 56.76
Between Operators 64.7 5-1=4 64.7/4 = 16.18 16.18/0.91 = 17.78
Errors 7.3 14 - 6 = 8 7.3/8 = 0.91
Total 175.3 15 - 1 = 14

Hypothesis test (1) for Systems


H0: µA = µB = µC     H1: At least two of them are different.
Critical value: F0.05 (2,8) = 4.46
Test Statistic: 56.76 (Notice how the test statistic has increased with the use of the more
powerful two-way ANOVA)
Conclusion: T.S. > C.V. so reject H0. There is a difference between at least two of the
mean times needed for training on the different systems.
Using a CD of 1.12 (see overhead): System C takes significantly longer to learn than Systems A and B, which do not differ significantly from each other.

Hypothesis test (2) for Operators


H0: µ1 = µ2 = µ3 = µ4 = µ5     H1: At least two of them are different.
Critical value: F0.05 (4,8) = 3.84
Test Statistic: 17.78
Conclusion: T.S. > C.V. so reject H0. There is a difference between at least two of the
Operators in the mean time needed for learning the systems.

Using CD of 1.45 calculated as previously (see overhead): Operators 3 and 4 are


significantly quicker learners than Operators 1, 2 and 5.
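The full two-way table can be reproduced with a statistics package. A hedged sketch in Python, assuming the pandas and statsmodels libraries are available (the column names hours, system and operator are illustrative choices, not part of the original example):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    hours = [16, 19, 14, 13, 18, 16, 17, 13, 12, 17, 24, 22, 19, 18, 22]
    system = ['A'] * 5 + ['B'] * 5 + ['C'] * 5
    operator = [1, 2, 3, 4, 5] * 3

    df = pd.DataFrame({'hours': hours, 'system': system, 'operator': operator})
    model = ols('hours ~ C(system) + C(operator)', data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))   # sums of squares roughly 103.3, 64.7 and 7.3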

Table 4  PERCENTAGE POINTS OF THE F DISTRIBUTION  (α = 5%)

                              ν1 = numerator degrees of freedom
 ν2 (denominator               1       2       3       4       5       6       7       8       9
 degrees of freedom)

1 161.40 199.50 215.70 224.60 230.20 234.00 236.80 238.90 240.50


2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81
4 7.71 6.94 6.56 6.39 6.26 6.16 6.09 6.04 6.00
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96
 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88

CORRELATION AND REGRESSION

Correlation and regression are concerned with the investigation of two continuous
variables.

Previously we have only considered a single variable - now we look at two variables.

We might want to know:


 If a relationship exists between those variables;
 if so, how strong that relationship is;
 what form that relationship takes.
 Can we make use of that relationship for predictive purposes?

Correlation describes the strength of the relationship. It is not concerned with 'cause'
and 'effect'.

Regression describes the relationship itself in the form of a straight line equation which
best fits the data. Curved relationships are considered in Chapter 9 of Business Statistics.

Methodology

 Some initial insight into the relationship between two continuous variables can be
obtained by plotting a scatter diagram and simply looking at the resulting graph.
Does the relationship appear to be linear or curved?

 If there appears to be a linear relationship, it can be quantified.


A correlation coefficient is calculated as the measure of the strength of this
relationship. Its symbol is usually 'r' and its value always lies between -1 and +1.
Is the relationship strong enough to be useful?

 If the relationship is found to be significantly strong, then its nature can be found
using linear regression, which defines the straight line equation of the line of
'best fit' through the bi-variate data.

 The 'goodness of fit' can be calculated to see how well the line fits the data.

 Once defined by an equation, the relationship can be used to make predictions.



Example

As an example we shall consider whether there is any relationship between 'Ice cream
Sales' for an ice-cream manufacturer and 'Average Monthly Temperature'. The collected
data is recorded below:

Month         Av. Temp. (°C)    Sales (£'000)
January 4 73
February 4 57
March 7 81
April 8 94
May 12 110
June 15 124
July 16 134
August 17 139
September 14 124
October 11 103
November 7 81
December 5 80

Scatter diagrams

We are looking for a linear relationship with the bivariate points plotted being reasonably
close to the, yet unknown, 'line of best fit'. At this stage no 'causality' is implied but it
makes sense to use the same diagram for the addition of the regression line later, so the
potentially dependent variable should be identified and plotted on the vertical axis.

Scatter diagram (from Minitab)

Sales against Average Monthly Temperature
[Scatter plot: Sales (£'000) on the vertical axis against Av. Temp. (°C) on the horizontal axis]

Seems to indicate a straight line relationship, with all points fairly close to a 'line of best
fit'. The strength of the relationship will therefore be quantified by calculating the
correlation coefficient and testing it for significance.

Pearson's Product Moment Correlation Coefficient, r.

Pearson's correlation coefficient is calculated for situations in which the data can be
measured quantitatively, i.e. either interval or ratio data. It compares how the variables
vary together with how they each vary individually and is independent of the origin or
units of the measurements. It is a ratio of the combined variance (covariance) with
individual variances and so has no units itself.

The value of the correlation coefficient, r, is best produced directly from a calculator in
LR mode. (See Appendix 1)

(If you do not have a calculator capable of carrying out correlation and regression
calculations it will be necessary for you to calculate them by tabulation and use of the
formulae shown in Appendix 2.)

Correlation Coefficient: For this data the value of r = 0.9833 ( from 'shift (' )
(Calculator usage will be practised in tutorials.)

Is the size of this correlation coefficient, 0.9833, large enough to claim that there is a
relationship between average monthly temperature and sales of ice-cream? The test
statistic obtained above in this case seems very close to 1, but is it close enough?

It needs to be compared to a table value to see if it was significantly high. Correlation


tables, appended, are used with n - 2 degrees of freedom.

Hypothesis test for a Pearson’s correlation coefficient

Null hypothesis, H0: There is no association between ice-cream sales and average
monthly temperature.

Alternative hypothesis, H1: There is an association between them.

Critical Value: 5%, (12 - 2) = 10 degrees of freedom = 0.576

Test statistic: 0.983

Conclusion: The test statistic exceeds the critical value so we reject the Null Hypothesis
and conclude that there is an association between ice-cream sales and average monthly
temperature.

Regression equation

Since we now know there is a significant relationship between the two variables, the next
obvious step is to define it. Having defined it, we can then add it to the scatter diagram
and also, assuming a causal relationship, use it to predict the next month's ice cream sales
if we know the figure for the average temperature which has been forecast for the coming
month.

As with the correlation coefficient, the coefficients of the regression equation, a and b,
can be found directly from the calculator in LR mode (shift 7 and shift 8), otherwise they
need to be calculated as shown in Appendix 2.

The regression line is described, in general, as the straight line of ‘best fit’ with the
equation:
y = a + bx

where x and y are the independent and dependent variables respectively, a the intercept
on the y-axis, and b the slope of the line.

The values of a and b for this data, as produced from a calculator in linear regression
(LR) mode, are 45.52 (shift 7) for a and 5.448 (shift 8) for b giving the regression line:
y = 45.5 + 5.45x

To draw this line on the scatter diagram, any three points are plotted and joined up.
These points can be the intercept, (0, a), the centroid, (x̄, ȳ), and/or any other points calculated
from the regression equation, as long as they are in the region of the observed data.
e.g. If x = 15 then y = 45.5 + 5.45 × 15 = 127.3
(For any value of x, the corresponding value of y can be produced directly from the
calculator in LR mode, using the key ŷ .)

Scatter diagram with Regression line added

Sales against Average Monthly Temperature
[Scatter plot of Sales against Av. Temp. with the fitted regression line y = 45.5 + 5.45x added]

Goodness of Fit
How well does this line fit the data? Goodness of fit is measured by (r2 x 100)%.

The correlation coefficient r was 0.983 so we have (0.983)² × 100 = 96.6% fit. This high
value indicates that any predictions made about y from a value of x will be good.

The 'goodness of fit' indicates the percentage of the variation in Ice-cream Sales which is
accounted for by the variation in average monthly temperature.

Prediction of Sales
Suppose that the Ice-cream manufacturer knows that the estimated average temperature
for the following month is 14oC, what would he expect his Sales to be?

The best estimate of the Sales is obtained by substituting the value of 14 for the
independent variable, x, and calculating the corresponding value of the Sales, y. This can
be done more easily directly on the calculator in LR mode: type in 14, press ŷ.

Estimated Sales = 45.5 + 5.45 x average temperature


= 45.5 + 5.45 x 14
= 121.8

Expected sales would be £122 000
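For readers without a calculator in LR mode, the same correlation coefficient, regression coefficients and prediction can be obtained in software. A minimal Python sketch assuming scipy is available (the data are those in the table above):

    from scipy import stats

    temp  = [4, 4, 7, 8, 12, 15, 16, 17, 14, 11, 7, 5]
    sales = [73, 57, 81, 94, 110, 124, 134, 139, 124, 103, 81, 80]

    fit = stats.linregress(temp, sales)              # least-squares line of best fit
    print(round(fit.rvalue, 4))                      # r, about 0.9833
    print(round(fit.intercept, 2), round(fit.slope, 3))   # a about 45.52, b about 5.448
    print(round(fit.intercept + fit.slope * 14, 1))  # predicted sales (£'000) at 14 degrees C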

Minitab Regression output


MTB > Correlation 'Av.Temp.' 'Sales'.

Correlation of Av.Temp. and Sales = 0.983

MTB > Regress 'Sales' 1 'Av.Temp.'.

The regression equation is


Sales = 45.5 + 5.45 Av.Temp.

Predictor Coef Stdev t-ratio p


Constant 45.520 3.503 13.00 0.000
Av.Temp. 5.4480 0.3186 17.10 0.000

s = 5.038 R-sq = 96.7% R-sq(adj) = 96.4%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 7420.2 7420.2 292.34 0.000
Error 10 253.8 25.4
Total 11 7674.0

Unusual Observations
Obs. Av.Temp. Sales Fit Stdev.Fit Residual St.Resid
2 4.0 57.00 67.31 2.40 -10.31 -2.33R

R denotes an obs. with a large st. resid.



SPSS Regression output


CORRELATIONS
/VARIABLES=av_temp sales
/PRINT=TWOTAIL NOSIG.

Correlations

Average Ice cream


monthly sales
temperature (£'000)
Average monthly Pearson Correlation 1.000 .983**
temperature Sig. (2-tailed) . .000
N 12 12
Ice cream sales (£'000) Pearson Correlation .983** 1.000
Sig. (2-tailed) .000 .
N 12 12
**. Correlation is significant at the 0.01 level (2-tailed).

REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/DEPENDENT sales
/METHOD=ENTER av_temp .

Model Summary

Std. Error
Adjusted R of the
Model R R Square Square Estimate
1 .983a .967 .964 5.04
a. Predictors: (Constant), Average monthly temperature

ANOVAb

Sum of Mean
Model Squares df Square F Sig.
1 Regression 7420.176 1 7420.176 292.335 .000a
Residual 253.824 10 25.382
Total 7674.000 11
a. Predictors: (Constant), Average monthly temperature
b. Dependent Variable: Ice cream sales (£'000)

Coefficientsa

Standardi
zed
Unstandardized Coefficien
Coefficients ts
Model B Std. Error Beta t Sig.
1 (Constant) 45.520 3.503 12.996 .000
Average monthly
5.448 .319 .983 17.098 .000
temperature
a. Dependent Variable: Ice cream sales (£'000)

Appendix 1

Calculator Method (only for calculators with linear regression mode.)

The data is entered as pairs of numbers. Within the many memories of the calculator
these numbers and their squares are accumulated and the correlation coefficient is then
calculated using the formulae in the appendix from these stored sums.

The method described below produces the correlation coefficient, r, the regression
coefficients, A and B, and estimates values of y for given values of x for a Casio
calculator. For any other make, and possibly some Casio models, refer to the calculator
handbook.

Data in:
1 Clear all memories Shift AC
2 Enter linear regression mode Mode 3
3 Input variables x and y together for each case x,y DT
Repeat to end of data.
Results out:
4 Check number of pairs entered RCL Red C
5 Output correlation coefficient (r) Shift ( 'open bracket'
6 Output intercept (a) and slope (b) of regression line Shift 7, Shift 8.
7 To get y value for plotting line when x = 15 15 ŷ

8 To estimate Sales (£000) for Av. Temp = 14 14 ŷ



Appendix 2
The value of the correlation coefficient is best produced directly from a calculator in LR
mode. Otherwise the following calculations are necessary:

The formula used to find the Pearson’s Product Moment Correlation coefficient is:

        r = Sxy / √(Sxx × Syy)          (-1 ≤ r ≤ +1)

where   Sxx = Σx² - (Σx)²/n
        Syy = Σy² - (Σy)²/n
        Sxy = Σxy - (Σx)(Σy)/n

and Σx means the sum of all the x values, etc.

The following example gives ice-cream sales per month against mean monthly
temperature (°C). You need the values of:

Σx,  Σy,  Σx²,  Σy²,  Σxy,  n.

Month   Average temp. (x)   Ice cream sales (y)   x²     y²       xy
Jan     4                   73                    16     5329     292
Feb     4                   57                    16     3249     228
Mar     7                   81                    49     6561     567
Apr     8                   94                    64     8836     752
May     12                  110                   144    12100    1320
Jun     15                  124                   225    15376    1860
Jul     16                  134                   256    17956    2144
Aug     17                  139                   289    19321    2363
Sep     14                  124                   196    15376    1736
Oct     11                  103                   121    10609    1133
Nov     7                   81                    49     6561     567
Dec     5                   80                    25     6400     400
Sums    120                 1200                  1450   127674   13362
76

Calculations

Σx = 120,  Σy = 1200,  Σx² = 1450,  Σy² = 127674,  Σxy = 13362,  n = 12.

Sxx = 1450 - (120 × 120)/12 = 250
Syy = 127674 - (1200 × 1200)/12 = 7674
Sxy = 13362 - (120 × 1200)/12 = 1362

r = Sxy / √(Sxx × Syy) = 1362 / √(250 × 7674) = 0.9833

Correlation Coefficient: r = 0.983 (Use of calculator will be checked in tutorials.)
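The same arithmetic can be verified with a few lines of plain Python, working directly from the stored sums above (a sketch only; the variable names are illustrative):

import math

# The sums from the table above
n, sx, sy, sx2, sy2, sxy = 12, 120, 1200, 1450, 127674, 13362

Sxx = sx2 - sx * sx / n          # 250.0
Syy = sy2 - sy * sy / n          # 7674.0
Sxy = sxy - sx * sy / n          # 1362.0

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 4))               # 0.9833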

Regression equation
As with the correlation coefficient, the regression equation can be produced directly from
a calculator in LR mode. Otherwise further use is made of the values of
Σx,  Σy,  Σx²,  Σy²,  Σxy,  n  (all produced previously).

The regression line is described, in general, as the straight line with the equation:
y = a + bx
where x and y are the independent and dependent variables respectively, a the intercept
on the y-axis, and b the slope of the line.

The gradient, b, is calculated from:

        b = Sxy / Sxx

where   Sxy = Σxy - (Σx)(Σy)/n   and   Sxx = Σx² - (Σx)²/n

Since the regression line passes through the centroid (the point given by both means), its
equation can be used to find the value of a, the intercept on the y-axis:

        ȳ = a + b x̄    so    a = ȳ - b x̄

From previously (Sales now assumed to be dependent on temperature):

Sxy = 1362  and  Sxx = 250,  so  b = 1362/250 = 5.448

a = ȳ - b x̄ = 1200/12 - 5.448 × 120/12 = 100 - 54.48 = 45.52

The values of a and b are therefore 45.5 and 5.45 respectively giving the regression
line:
y = 45.5 + 5.45x
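Continuing the sketch used for the correlation coefficient, the slope and intercept follow from the same stored sums (plain Python; names illustrative):

# Sums carried over from the correlation calculation above
n, sx, sy, sx2, sxy = 12, 120, 1200, 1450, 13362

Sxy = sxy - sx * sy / n       # 1362.0
Sxx = sx2 - sx * sx / n       # 250.0

b = Sxy / Sxx                 # slope: 5.448
a = sy / n - b * sx / n       # intercept: 45.52 (the line passes through the means)
print(f"y = {a:.2f} + {b:.3f}x")   # y = 45.52 + 5.448x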
77

Table 6 PERCENTAGE POINTS OF CORRELATION COEFFICIENT

          One tailed    5%      2.5%    1%      0.5%    0.1%    0.05%
ν         Two tailed    10%     5%      2%      1%      0.2%    0.1%

2 .900 .950 .980 .990 .998 .999


3 .805 .878 .934 .959 .986 .991
4 .729 .811 .882 .917 .963 .974
5 .669 .754 .833 .875 .935 .951

6 .621 .707 .789 .834 .905 .925


7 .582 .666 .750 .798 .875 .898
8 .549 .632 .715 .765 .847 .872
9 .521 .602 .685 .735 .820 .847
10 .497 .576 .658 .708 .795 .823

11 .476 .553 .634 .684 .772 .801


12 .457 .532 .612 .661 .750 .780
13 .441 .514 .592 .641 .730 .760
14 .426 .497 .574 .623 .711 .742
15 .412 .482 .558 .606 .694 .725

16 .400 .468 .543 .590 .678 .708


17 .389 .456 .529 .575 .662 .693
18 .378 .444 .516 .561 .648 .679
19 .369 .433 .503 .549 .635 .665
20 .360 .423 .492 .537 .622 .652

25 .323 .381 .445 .487 .568 .597


30 .296 .349 .409 .449 .526 .554
40 .257 .304 .358 .393 .463 .490
60 .211 .250 .295 .325 .385 .408
78

CONTINGENCY TABLES AND CHI-SQUARED TESTS

In this type of analysis we have two characteristics, such as gender and eye colour, which
cannot be measured but which can be used to group people by variations within them.
These characteristics may, or may not, be associated in some way. How can we decide?

We can take a random sample from the population, note which variation of each
characteristic is appropriate for each case and then cross-tabulate the data. It is then
analysed in order to see if the proportions of each characteristic in the sub-samples are the
same as the overall proportions - easier to do than describe! As an example, if there is no
relationship between gender and eye colour we would expect similar proportions of males
and females to have blue eyes.

The Variables
If the variables are, as is usually the case, nominal, (described by name only), frequencies
may be cross-tabulated by each category within each variable. Ordinal variables may be
used if there are only a limited number of orders so that each one can be classified as a
separate category. Continuous variables may be grouped and then tabulated similarly
though the results will then vary according to the grouping categories.

Contingency Tables (Cross-Tabs)


You have met this type of table before as a contingency table when calculating
probabilities. As a reminder: cases are allotted to categories and their frequencies cross-
tabulated; e.g. in the gender / eye colour example there might be blue eyed males, blue
eyed females, brown eyed males, brown eyed females, etc. These tables are known as
contingency tables. All possible 'contingencies' are included in the 'cells' which are
themselves mutually exclusive. The table is completed by calculating the 'row totals', the
'column totals' and the 'grand total'.

Expected values
If the two variables, the characteristics under scrutiny, are completely independent, the
proportions within the sub-totals of the contingency table would be expected to be the
same as those of the totals for each variable. In practice we work with frequencies rather
than proportions, distinguishing between 'observed' and 'expected' frequencies by
enclosing the latter within brackets. If gender and eye colour are independent and if a
third of the population has blue eyes, we would expect a third of males to be blue eyed
and a third of females to be blue eyed.

These proportions are obviously contrived so as to be easy to work with. How can we
cope with more awkward numbers? In the on-going example, the proportions are first
calculated as fractions which are then multiplied by the total frequency to find the
expected individual cell frequencies. This produces a formula which is applicable in all
cases:
For any cell the expected frequency is calculated by:  (Row total × Column total) / Overall total,

where the relevant row and column are those crossing in that particular cell.
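As an illustration only, the rule can be written as a small function in Python (the function name is this handout's own choice, not part of any package), applied here to the gender / job-category counts used in Example 1 below:

def expected_frequencies(observed):
    """Expected cell counts if the two variables are independent.
    'observed' is a list of rows of counts, e.g. [[20, 15], [20, 30], ...]."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Gender / job-category counts used in Example 1 below
observed = [[20, 15], [20, 30], [10, 35], [10, 40]]
for row in expected_frequencies(observed):
    print([round(e, 2) for e in row])
# [11.67, 23.33], [16.67, 33.33], [15.0, 30.0], [16.67, 33.33]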
79

Chi-squared (χ²) Test for Independence
The hypothesis test which is carried out in order to see if there is any association between
categorical variables, such as gender and eye colour, is known as the chi-squared (χ²) test.
Example 1
The following table compiled by a personnel manager relates to a random sample of 180
staff taken from the whole workforce of the supermarket chain. We shall, in this example,
test for association between a member of staff's gender and his/her type of job, at the 5%
level of significance.

Male Female Total

Supervisor 20 15

Shelf stacker 20 30

Till operator 10 35

Cleaner 10 40

Total

Completing the row and column totals, as previously with probability, gives the full table.

In this example we randomly selected a sample of 180 supermarket staff and found that
two thirds, 120, of them were female and one third, 60, were male. Assuming there is no
association between gender and job category and finding that we have 45 till operators, we
would expect two thirds, 30, of them to be female and one third, 15, of them to be male.
Note that these figures are a quarter of each gender respectively, which checks since till
operators, 45, form a quarter of the total staff, 180.

We now calculate the other expected frequencies from the probabilities and put them
into the table: (See Section 4.8)

 P (Supervisor) = 35/180
 P (Male) = 60/180
 P (Supervisor and male) = 35/180 x 60/180 assuming independence.

Therefore the expected number of male supervisors = (35/180) × (60/180) × 180 = 11.67

This is the expected frequency for members of staff who are both male and a supervisor.
Note that it is a theoretical number which does not have to be an integer.

This simplifies to:  (Row total × Column total) / Overall total  =  (35 × 60) / 180  =  11.67
80

Calculating the other expected frequencies and inserting them in the table (in brackets):

Male Female Total

Supervisor 20 (11.67) 15 (23.33) 35

Shelf stacker 20 (16.67) 30 (33.33) 50

Till operator 10 (15.00) 35 (30.00) 45

Cleaner 10 (16.67) 40 (33.33) 50

Total 60 120 180

These are the frequencies which would be expected if there is no association between
gender and job category at the supermarket.

If the expected frequencies are observed to actually occur in practice then we can deduce
that the two variables are indeed independent.

We would obviously not expect to get exact agreement with the expected frequencies, so
some critical amount of difference is allowed and we compare the difference from our
observations with that allowed by the use of a standard table. Are the values observed so
different to those expected that we must reject the idea of independence? Or are the
results just due to sampling errors, with the variables actually being independent?
It is to be hoped that you recognise the need for a hypothesis test!

The chisquared (2) Hypothesis test


To find the answer, we analyse the data and compare the result to a standard table figure.
We carry out a formal hypothesis test at 5% significance: the chi-squared test.

1) State Null Hypothesis, H0, (that of no association) and Alternative Hypothesis, H1.
2) Record observed frequencies, O, in each cell of the contingency table.
3) Calculate row, column and grand totals.
4) Calculate the expected frequency, E, for each cell: (row total × column total) / grand total.
Note that: No expected frequency should be less than 1 and the number of expected
frequencies below 5 should not be over 20% of the total number of cells. Otherwise
the test is invalid.

5) Find the critical value from the chi-squared table, as appended, with (r - 1) × (c - 1)
degrees of freedom, where r and c are the number of rows and columns respectively.

6) Calculate the test statistic:   χ² = Σ (O - E)² / E

7) Compare the two values and conclude whether the variables are independent or not.
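Steps 4 to 7 can also be checked by computer. A minimal sketch in Python, assuming the scipy package is available, applied to the Example 1 frequencies that follow:

from scipy import stats

# Observed frequencies from Example 1: rows = job category, columns = Male, Female
observed = [[20, 15], [20, 30], [10, 35], [10, 40]]

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"test statistic = {chi2:.2f}, degrees of freedom = {dof}, p = {p:.4f}")
# test statistic about 16.4 with 3 degrees of freedom

critical = stats.chi2.ppf(0.95, dof)     # 5% significance table value, about 7.81
print("reject H0" if chi2 > critical else "do not reject H0")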
81

In example 1, we have already carried out steps 2, 3 and 4 of the procedure by calculating
the expected values. Whether these are calculated before, or during, the test is up to
personal preference. Some statisticians also prefer to calculate the test statistic, as this
procedure is rather lengthy, before starting the test and to then insert the calculated value
in the formal hypothesis test.

The test statistic is an overall measure of the difference between the expected and
observed frequencies. Each cell difference is squared so that positive and negative
differences have the same weighting and proportioned by the size of the expected cell
contents. When the contributions from each cell are totalled their sum is compared with a
critical value from the chi-squared table – hence the name of this test.

Null Hypothesis (H0): There is no association between gender and job category.
(Remember that 'null' means none.)

Alternative Hypothesis (H1): There is an association between gender and job category.

Critical Value: from the chi-squared table

Number of degrees of freedom (ν) = (r - 1)(c - 1) = (4 - 1)(2 - 1) = 3 × 1 = 3;
the χ² table, as appended, is always one tailed. Level of significance = 5%.

χ² (5%, ν = 3) = 7.816

Test statistic
The test statistic is calculated from the contingency table which includes both the
observed and the expected values for the frequency of staff. The data may be tabulated, as
we shall do in this example, or the contribution of each cell may be calculated directly as
(O - E)²/E and then the test statistic found as the sum of these contributions:

Male Female Total

Supervisor 20 (11.67) 15 (23.33) 35

Shelf stacker 20 (16.67) 30 (33.33) 50

Till operator 10 (15.00) 35 (30.00) 45

Cleaner 10 (16.67) 40 (33.33) 50

Total 60 120 180


82

Test statistic  =  Σ (O - E)² / E

O E (O - E) (O - E)2/E
20 11.67 8.33 5.946
15 23.33 -8.33 2.974
20 16.67 3.33 0.665
30 33.33 -3.33 0.333
10 15.00 -5.00 1.667
35 30.00 5.00 0.833
10 16.67 -6.67 2.669
40 33.33 6.67 1.335
Total 16.422

Test Statistic: 16.422


Conclusion: Test statistic > Critical value therefore reject H0 .
Conclude that there is an association between gender and job category in the supermarket
chain.

Looking again at the data we can see that far more males than expected were supervisors
or shelf stackers and more females were cleaners or till operators.
83

Example 2
In this example we first have to set up the contingency table from the following
information collected from a questionnaire:

In a recent survey within a Supermarket chain, a random sample of 160 employees:


stackers, sales staff and administrators, were asked to grade their attitude towards future
wage restraint on the scale:
Very favourable; favourable; unfavourable; very unfavourable.
Of the 40 stackers interviewed, 7 gave the response 'favourable', 24 the response
'unfavourable', and 8 the response 'very unfavourable'. There were 56 sales staff and from
these, 10 responded 'very unfavourable', 9 responded 'favourable' and 3 responded 'very
favourable’. The rest of the sample were administrators. Of these, 16 gave the response
'very favourable' and 2 the response 'very unfavourable'. In the whole survey, exactly half
the employees interviewed responded 'unfavourable'.

We first draw up a contingency table showing these results and then test whether attitude
towards future wage restraint is dependent on the type of employment.

Setting up the table: in this example there are three types of employee giving four
different responses, i. e. we have a 3 x 4 (or a 4 x 3) table.
Adding extra rows and columns for the subtotals and titles we need 5 x 6 cells.

Have a go at compiling the table. As you come to each number in the frequency of
response above insert it into the appropriate place; then find the missing figures by
difference. There is sufficient information here to enable you to complete your table.
When complete, check with that below before calculating the expected values.

V.favourable Favourable Unfavourable V.unfavourable Total

Stackers
Sales staff
Administrators

Total

The expected values can next be calculated as  (Row total × Column total) / Overall total  and inserted.

V.favourable Favourable Unfavourable V.unfavourable Total

Stackers 1 7 24 8 40
Sales staff 3 9 34 10 56
Administrators 16 24 22 2 64

Total 20 40 80 20 160
Hypothesis test
84

Null Hypothesis (H0): There is no association between job category and attitude towards
wage restraint.

Alternative Hypothesis (H1): There is an association between job category and attitude
towards wage restraint.

Level of Significance: 5% Level Of Significance

Critical value:
Number of degrees of freedom (ν) = (r - 1)(c - 1) =
Level of significance = 5%
χ² table, 5%, 6 degrees of freedom = 12.59

Test statistic  =  Σ (O - E)² / E
(Complete the table.)

O E (O - E) (O - E)2/E
1 5 -4 3.200
7 10 -3 0.900
24 20 +4 0.800
8 5 +3 1.800
3 7 -4 2.286
9 14 -5 1.786
34
10
16
24
22
2
Total 32.969

Test statistic: 32.969

Conclusion: Test statistic > Critical value therefore reject H0


Conclude that there is an association between job category and attitude towards future
wage restraint. The administrators were for it but the others against it.

COMPLETED EXAMPLES FROM LECTURE HANDOUT


Example 2 In this example we first have to set up the contingency table from the
following information collected from a questionnaire:

In a recent survey within a Supermarket chain, a random sample of 160 employees:


stackers, sales staff and administrators, were asked to grade their attitude towards future
wage restraint on the scale:
85

Very favourable; favourable; unfavourable; very unfavourable.


Of the 40 stackers interviewed, 7 gave the response 'favourable', 24 the response
'unfavourable', and 8 the response 'very unfavourable'. There were 56 sales staff and from
these, 10 responded 'very unfavourable', 9 responded 'favourable' and 3 responded 'very
favourable’. The rest of the sample were administrators. Of these, 16 gave the response
'very favourable' and 2 the response 'very unfavourable'. In the whole survey, exactly half
the employees interviewed responded 'unfavourable'.

We first draw up a contingency table showing these results and then test whether attitude
towards future wage restraint is dependent on the type of employment.

Setting up the table: in this example there are three types of employee giving four
different responses, i. e. we have a 3 x 4 (or a 4 x 3) table.
Adding extra rows and columns for the subtotals and titles we need 5 x 6 cells.

Have a go at compiling the table. As you come to each number in the frequency of
response above insert it into the appropriate cell; then find the missing figures by
difference. There is sufficient information here to enable you to complete your table.
When complete, check with that below before calculating the expected values.

V.favourable Favourable Unfavourable V.unfavourable Total

Stackers 1 7 24 8 40
Sales staff 3 9 34 10 56
Administrators 16 24 22 2 64

Total 20 40 80 20 160

The expected values can next be calculated as  (Row total × Column total) / Overall total  and inserted.

V.favourable Favourable Unfavourable V.unfavourable Total

Stackers 1 (5) 7 (10) 24 (20) 8 (5) 40


Sales staff 3 (7) 9 (14) 34 (28) 10 (7) 56
Administrators 16 (8) 24 (16) 22 (32) 2 (8) 64

Total 20 40 80 20 160
Hypothesis test

Null Hypothesis (H0): There is no association between job category and attitude towards
wage restraint.

Alternative Hypothesis (H1): There is an association between job category and attitude
towards wage restraint.
86

Level of Significance: 5% Level of significance

Critical value:
Number of degrees of freedom (ν) = (r - 1)(c - 1) = (3 - 1)(4 - 1) = 2 × 3 = 6
Level of significance = 5%
χ² table, Table 5 in Appendix D, 5%, 6 degrees of freedom = 12.59

Test statistic  =  Σ (O - E)² / E

O E (O - E) (O - E)2/E
1 5 -4 3.200
7 10 -3 0.900
24 20 +4 0.800
8 5 +3 1.800
3 7 -4 2.286
9 14 -5 1.786
34 28 +6 1.286
10 7 +3 1.286
16 8 +8 8.000
24 16 +8 4.000
22 32 -10 3.125
2 8 -6 4.500
Total 32.969

Test statistic: 32.969

Conclusion: Test statistic > Critical value therefore reject H0


Conclude that there is an association between job category and attitude towards future
wage restraint. The administrators were for it but the others against it.

Table 5 PERCENTAGE POINTS OF THE χ²-DISTRIBUTION

ν        10%      5%       2.5%     1%       0.1%

1 2.706 3.841 5.024 6.635 10.83


2 4.605 5.991 7.378 9.210 13.82
3 6.252 7.816 9.351 11.35 16.27
4 7.780 9.488 11.14 13.28 18.47


5 9.236 11.07 12.83 15.08 20.51
6 10.64 12.59 14.45 16.81 22.46
7 12.02 14.07 16.02 18.49 24.36
8 13.36 15.51 17.53 20.09 26.13
9 14.68 16.92 19.02 21.67 27.89
10 15.99 18.31 20.48 23.21 29.59
11 17.28 19.68 21.92 24.72 31.26
12 18.55 21.03 23.34 26.22 32.91
13 19.81 22.36 24.74 27.69 34.51
14 21.06 23.68 26.12 29.14 36.12
15 22.31 25.00 27.49 30.58 37.70
16 23.54 26.30 28.85 32.00 39.25
17 24.77 27.59 30.19 33.41 40.79
18 25.99 28.87 31.53 34.81 42.31
19 27.20 30.14 32.85 36.19 43.82
20 28.41 31.41 34.17 37.57 45.32
21 29.62 32.67 35.48 38.93 46.80
22 30.81 33.92 36.78 40.29 48.27
23 32.01 35.17 38.08 41.64 49.73
24 33.20 36.42 39.36 42.98 51.18
25 34.38 37.65 40.65 44.31 52.62
26 35.56 38.89 41.92 45.64 54.05
27 36.74 40.11 43.19 46.96 55.48
28 37.92 41.34 44.46 48.28 56.89
29 39.09 42.56 45.72 49.59 58.30
30 40.26 43.77 46.98 50.89 59.70
40 51.81 55.76 59.34 63.69 73.42
50 63.17 67.50 71.42 76.15 86.66
60 74.40 79.08 83.30 88.38 99.61
70 85.53 90.53 95.02 100.4 112.3
80 96.58 101.9 106.6 112.3 124.8
90 107.6 113.1 118.1 124.1 137.2
100 118.5 124.3 129.6 135.8 149.5

INDEX NUMBERS
In business, managers may be concerned with the way in which the values of most
variables change over time: prices paid for raw materials; numbers of employees and
customers, annual income and profits, etc. Index numbers are one way of describing
such changes. (See Chapter 11 of Business Statistics)

Index numbers were originally developed by economists for monitoring and comparing
different groups of goods. It is necessary in business to be able to understand and
manipulate the different published index series, and to construct your own index series.

Index numbers measure the changing value of a variable over time in relation to its value
at some fixed point in time, the base period, when it is given the value of 100.
Such indexes are often used to show overall changes in a market, an industry or the
economy. For example, an accountant at a supermarket chain could construct an index of
88

the chain's own sales and compare it to the index of the volume of sales for the overall
supermarket industry. A graph of the two indexes will illustrate the company's
performance within the sector.

[Diagram: index values for the Company and the overall Market plotted against Time, both starting from 100 (parity) at the base period]

Index values are measured in percentage points since the base value is always 100.

Constructing an Index
Index for any time period n  =  (value in period n / value in base period) × 100

Example 1 (Base year is year 1)

Year   Value    Calculation          Index
1      12380    12380/12380 × 100    100.00
2      12490    12490/12380 × 100    100.89
3      12730    12730/12380 × 100
4      13145
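A sketch of the same calculation in code (Python; the function name is illustrative only):

def index_series(values, base_value):
    """Index numbers with the chosen base value scaled to 100."""
    return [round(v / base_value * 100, 2) for v in values]

# Values from Example 1; year 1 is the base year
values = [12380, 12490, 12730, 13145]
print(index_series(values, values[0]))   # 100.0, 100.89, and the two you are asked to complete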
89

Finding values from indexes

If we have a series of index numbers describing index linked values and know the value
associated with any one of them, the other values can be found by scaling the known
value in the same ratio as the relevant index numbers.

Example 2 The following table shows the monthly price index for an item:

Month 1 2 3 4 5 6 7 8 9 10 11 12

Index 121 112 98 81 63 57 89 109 131 147 132 126

If the price in month 3 is £240, what is the price in each of the other months?

Month 1: The index numbers in months 3 and 1 are 98 and 121 respectively. 98 is
equivalent to a price of £240 so for month 1 this must be scaled in the same ratio as the
index numbers.

Price in month 3 is £240, so the Price in month 1 is £240 × 121/98 = £296.3

Price in month 2:  £240 × 112/98 = £274.3
Price in month 4:

Price in month 5:

Completing the table:


Month 1 2 3 4 5 6 7 8 9 10 11 12

Index 121 112 98 81 63 57 89 109 131 147 132 126

Price (£) 296 274 240 ___ ___ 140 218 267 320 360 323 309
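The same scaling can be done in one line of code. A sketch in Python using the month 3 price and the index numbers above (variable names are illustrative):

indexes = [121, 112, 98, 81, 63, 57, 89, 109, 131, 147, 132, 126]

known_month, known_price = 3, 240.0               # the month 3 price is £240

# Scale the known price in the ratio of the index numbers
prices = [round(known_price * idx / indexes[known_month - 1], 1) for idx in indexes]
print(prices[0], prices[1])                       # 296.3 and 274.3, as above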

Price indexes
The most commonly quoted index is the Retail Price Index, the RPI, which is a monthly
index indicating the amount spent by a 'typical household'. It gives an aggregate value
based upon the prices of a representative selection of hundreds of products and services
with weightings produced from the 'Family expenditure survey' as a measure of their
importance.
The RPI shows price changes clearly and has many important practical implications
particularly when considering inflation in terms of wage bargaining, index linked
benefits, insurance values, etc.
90

Other Indexes
 An index can be used to measure the way any variable changes over time:
unemployment numbers; number of car registrations; gross national product; etc.
 The choice of base period can be any convenient time - it is periodically adjusted.
 Indexes can be calculated with any convenient frequency: yearly, monthly, daily, etc.
Yearly: G.N.P. Gross National Product
Monthly: Unemployment figures
Daily: Stock market prices, etc.

Measuring changes over time

One of the most important uses of indexes is to show how the price of a product changes
over time due to changing costs of raw materials; variable supply and demand; changes
in production processes; general inflation; etc. It is often of interest to compare it to the
R.P.I. or to indexes of the prices of competitors' products.

Changes over time are measured either as a percentage point change, by subtraction, or as a
percentage change, calculated as  (change in index number / original index number) × 100.

Example 3

Month                Price (£)   Price index          Percentage point change   Percentage change
(Base period Jan.)
Jan.                 20          100
Feb.                 22          22/20 × 100 = 110    10                        (110-100)/100 × 100 = 10.0
Mar.                 25          25/20 × 100 = 125    15                        (125-110)/110 × 100 = 13.6
Apr.                 26          26/20 × 100 = 130
May                  27

Changing the base period


The base period can be chosen as any convenient time. It is usual to update the base
period fairly regularly due to:
 Any significant change which makes comparison with earlier figures meaningless.
 The numbers growing so large that the size of a points change is many times that of a
percentage change.
91

We need therefore to know how to change a base period. Published index numbers may
include a sudden change such as a series 470, 478, 485 becoming series 100, 103, 105,
etc. We need to convert all the numbers to a common base year so that they can be
analysed as a whole continuous series.

In order to change from one index series to another we need the value in the first series
for the same period as that in which the rebasing to 100 for the second takes place. The
ratio of these two values forms the basis of any conversion between them.

Example 4
The amount of money spent on advertising by the supermarket, which is index linked, is
described by the following indexes:

Year 1 2 3 4 5 6 7 8

Index 1 100 138 162 196 220

Index 2 100 125 140 165

Adverts (£) 4860

Typically each index series is completed and then, given the index linked amount spent
on advertising for any one year, it is used to estimate that for any of the other years.

a) Base year for each index: Index 1 = Index 2 =

b) Missing values for Index 1 and Index 2: (Both values known for year 5)
 From year 5: ratio of old : new = 220 : 100 (Note: All new values will be less)
 From year 5: ratio of new : old = 100 : 220 (Note: All old values will be more)

Index 1 for Year 6:
Old index  =  new index × 220/100  =  125 × 220/100  =  275

Index 2 for Year 1:
New index  =  old index × 100/220  =  100 × 100/220  =  45.45

Now complete both index series checking that each is adjusted in the right direction.
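A sketch of the conversion in Python (illustrative names), built from the year 5 link where both indexes are known (old 220, new 100):

old_at_link, new_at_link = 220.0, 100.0   # year 5, where both index series are known

def to_old(new_index):
    """Convert a value in the new series to the equivalent old-series value."""
    return new_index * old_at_link / new_at_link

def to_new(old_index):
    """Convert a value in the old series to the equivalent new-series value."""
    return old_index * new_at_link / old_at_link

print(to_old(125))             # 275.0   (Index 1 for year 6, as above)
print(round(to_new(100), 2))   # 45.45   (Index 2 for year 1, as above)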
92

c) The values for amount spent on advertising can be calculated for all years if any one
year is known by multiplying that value by the ratio of the relevant index numbers
from either series.

Given that the advertising expenditure was known to be 4860 in year 3, that for:

Year 1  =  4860 × 100/162  =  3000     or     4860 × 45.45/73.64  =  3000

Year 2  =  4860 × 138/162  =  4140     or     4860 × 62.73/73.64  =  4140
Now complete the advertising expenditures in the table. There are alternative
ways of doing this depending on which set of index numbers are used.

Simple aggregate and Weighted aggregate indexes

Most published price indexes are produced from many items rather than single ones, as
used by us previously. Simple price aggregate indexes just add together the prices of all
the items included in the 'basket of goods', assuming each occurs just once.

Weighted aggregate indexes, such as the R.P.I., weight each price according to the
quantity bought in a particular time period which may be the base period (a Laspeyres
Index) or the current period (a Paasche Index). The weightings to be applied are usually
arrived at by discussion and mutual agreement, or from other information such as the
'Family expenditure survey'.

We shall work through a simple illustration of the principle of these aggregate indexes
but you must realise that the use of them in the 'real world' is far more complex than this
and needs to be studied further if to be used in practice in a business context.
93

Example 5
A company buys four products with the following characteristics:
Number of units bought Price paid per unit (£)
Items Year 1 Year 2 Year 1 Year 2

A 20 24 10 11
B 55 51 23 25
C 63 84 17 17
D 28 34 19 20

a) Find the simple price indexes for the products for year 2 using year 1 as the base year:
Product A: 11/10 × 100 = 110.0
Product B: 25/23 × 100 = 108.7     Note that all these figures come from the prices
Product C: 17/17 × 100 = 100.0     only and that no weighting has been applied.
Product D: 20/19 × 100 = 105.3

b) Find the simple aggregate index for year 2 using year 1 as the base year:

Sum of prices in Year 2 / Sum of prices in base year × 100
=  (11 + 25 + 17 + 20) / (10 + 23 + 17 + 19) × 100  =  73/69 × 100  =  105.8

c) Find the base-weighted aggregate index, the Laspeyres index, for year 2 using year 1 as
the base year.
Index for year 2  =  Sum of (year 2 prices × base-year quantities) / Sum of (base-year prices × base-year quantities) × 100

=  (11×20 + 25×55 + 17×63 + 20×28) / (10×20 + 23×55 + 17×63 + 19×28) × 100  =  3226/3068 × 100  =  105.15

d) Find the current period-weighted aggregate index, the Paasche index, for year 2 using
year 1 as the base year.

Index for year 2  =  Sum of (year 2 prices × current-year quantities) / Sum of (base-year prices × current-year quantities) × 100

=  (11×24 + 25×51 + 17×84 + 20×34) / (10×24 + 23×51 + 17×84 + 19×34) × 100  =  3647/3487 × 100  =  104.59

The type of aggregate index chosen is based on the relative importance of the weightings.
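All three indexes can be checked with a short script. A sketch in Python using the Example 5 figures, with year 1 as the base year:

# Example 5: quantities bought and prices paid for items A to D
q1 = [20, 55, 63, 28]      # units bought in year 1 (the base year)
q2 = [24, 51, 84, 34]      # units bought in year 2
p1 = [10, 23, 17, 19]      # price per unit in year 1
p2 = [11, 25, 17, 20]      # price per unit in year 2

simple_aggregate = sum(p2) / sum(p1) * 100
laspeyres = sum(p * q for p, q in zip(p2, q1)) / sum(p * q for p, q in zip(p1, q1)) * 100
paasche   = sum(p * q for p, q in zip(p2, q2)) / sum(p * q for p, q in zip(p1, q2)) * 100

print(round(simple_aggregate, 1), round(laspeyres, 2), round(paasche, 2))
# 105.8 105.15 104.59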
94

TIME SERIES ANALYSIS 1 - SEASONAL DECOMPOSITION

Last week we used time series in the form of index numbers to investigate changes over
time with the main interest being in past time periods and the main objective comparison
of different series.

This week we again study the past but with the main purpose of identifying patterns in
past events which can be used next week to forecast future values and events.

There are many standard methods used in forecasting and they all depend on finding the
best model to fit the past time series of data and then using the same model to forecast
future values. More methods are described in Chapter 12 of Business Statistics.

Generally, if a model can be found which reasonably fits past chronological data it can
be used to predict, under the same conditions, values for them in the near future.

Since the best method to use in any circumstance depends on the form taken by the past
data it is essential that these values are first plotted against time to help us to select the
most appropriate type of model, or models, to fit to this data.

Time series of past data  →  suitable method  →  best-fitting model  →  forecast

Below we see three typical patterns exhibited by time series with suggested suitable types
of model. (By no means an exhaustive list of either past data time series or suitably
fitting models.)

Time series of past data                              General description    Some suitable methods

[Plot: quarterly sales (£0,000), Q1 1991 - Q4 1995,   Linear, seasonal       Seasonal decomposition,
 repeating four-quarter pattern]                                             additive or multiplicative;
                                                                             Exponential smoothing.

[Plot: quarterly sales, Q1 1994 - Q4 1996,            Linear (?),            Linear regression (?);
 roughly linear, no seasonal pattern]                 non-seasonal           Non-linear regression (?);
                                                                             Exponential smoothing.

[Plot: monthly sales, Jan 1993 - Oct 1996,            Non-linear,            Non-linear regression;
 non-linear, no seasonal pattern]                     non-seasonal           Exponential smoothing.
95

We have previously used regression models to produce equations describing a dependent


variable in terms of an independent one. If the independent variable is 'time' then these
same methods can be used to estimate the value of the dependent at any point in time - in
practice only the near future is valid. Linear regression models work best for non-
seasonal data which appear to be fairly linear when plotted against time. 'Curve fitting’ is
the equivalent non-linear method which is easy to carry out in a computer package such
as SPSS or Minitab. This is described in detail in Chapter 12 of ‘Business Statistics’.

Exponential smoothing is not covered in this lecture but is also described fully in
Chapter 12 of ‘Business Statistics’. It is a method which is lengthy to carry out by hand
but is straightforward with the help of a computer package such as SPSS, Minitab or
Excel.

We shall concentrate in this lecture on seasonal decomposition of data which exhibit a


seasonal pattern as well as a general trend.

Seasonal decomposition of time series


How do we cope with data which exhibits marked seasonality in its scatterplot? We carry
out a procedure called seasonal decomposition which identifies the general trend and
also the seasonal components exhibited by past data so that they can be used later in an
equivalent forecasting model.

Example
A small ice-cream manufacturer who supplies a supermarket chain reports sales figures
of £50 000 for the last quarter.

Is this amount good or bad? Can this single figure be used to forecast future sales?

With only one figure we never have sufficient data to make either a judgement or a
forecast.

We also need to compare the figure with:


 that for the previous quarter;
 that for the corresponding quarter of the previous year;
 that which would be expected from the current trend, if that can be identified.

We need to identify any past pattern and use it for future sales predictions.

We also need to estimate how accurate those predictions are likely to be.
96

In order to understand the situation more fully, the following data was collected. It
represents the quarterly sales figures for the four years preceding the figure of £50 000
quoted above.
Sales (£'000)
Year Quarter 1 Quarter 2 Quarter 3 Quarter 4
1997 40 60 80 35
1998 30 50 60 30
1999 35 60 80 40
2000 50 70 100 50

Method

1 First plot the data in chronological order on a graph against time. (Grid on back page)
What do you think? Does this data exhibit a 'four point pattern'? It does and therefore,
since it is quarterly data, the sales are judged to exhibit a quarterly seasonal effect. We
shall use an additive model to identify how much has been 'added' to the sales because it
is a particular quarter of the year.
Additive model:    Observed Sales value  =  Trend value  +  Seasonal factor  +  Random factor
                   A  =  T  +  S  +  R

The Trend value, T, is the value of the sales had the seasonal effect been 'averaged out'.
The Seasonal factor, S, is the average effect of it being a particular quarter, e.g. summer.
The Random factor, R, is the random variation due to neither of the previous variables.
This cannot be eliminated but is a useful measure of the goodness of fit of the model and
therefore of the quality of any forecasts made from it.

2 We can calculate the missing trend values from the table on the next page which includes:

 The observed sales figures.


 The cycle average which is the moving average of four consecutive quarters, i.e. one
from each season. As this value is not centred on any quarter its deviation from the
observed sales cannot be found.
 The trend which is the moving average of two consecutive cycle averages and is
centred on a quarter. This second calculation is only necessary for cycles with even
numbers of periods as an odd number would produce a cycle average which was
already centred.
 The first residual is the difference between the observed values and the trend. This will
be largely due to the seasonality of the data. From these first residuals we calculate
the seasonal factor which is the deviation from the trend due to it being a particular
season.
 The fitted values are those produced by the model. i. e. trend + seasonal factor.
 The second residual is the difference between the observed and fitted model values.
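If you would rather check the arithmetic by computer, the whole decomposition can be sketched in a few lines of plain Python (no packages needed; the data are the sixteen quarterly sales figures above, and the variable names are illustrative):

sales = [40, 60, 80, 35, 30, 50, 60, 30, 35, 60, 80, 40, 50, 70, 100, 50]

# Cycle average: moving average of four consecutive quarters
cycle = [sum(sales[i:i + 4]) / 4 for i in range(len(sales) - 3)]

# Trend: moving average of two consecutive cycle averages,
# centred on quarter i + 2 (so the first trend value belongs to 1997 Q3)
trend = [(cycle[i] + cycle[i + 1]) / 2 for i in range(len(cycle) - 1)]

# First residuals: observed sales minus trend
first_resid = [sales[i + 2] - t for i, t in enumerate(trend)]

# Seasonal factor: average first residual for each quarter (0 = Q1, ..., 3 = Q4)
seasonal = []
for q in range(4):
    devs = [r for i, r in enumerate(first_resid) if (i + 2) % 4 == q]
    seasonal.append(sum(devs) / len(devs))

print([round(t, 2) for t in trend[:3]])   # [52.5, 50.0, 46.25], as in the table
print([round(s, 2) for s in seasonal])    # about [-14.17, 6.04, 22.92, -16.25]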
97

Sales (A) Cycle Trend (T) First Resid. Fitted value 2nd Resid.
Date
(£'000) Average Moving average S = (A - T) F = (T + S) R = (A - F)

1997 Q1 40

Q2 60
53.75
Q3 80 52.50 +27.50 75.42 +4.58
51.25
Q4 35 50.00 -15.00 33.75 +1.25
48.75
1998 Q1 30 46.25 -16.25 32.08 -2.08
43.75
Q2 50 43.13 + 6.87 49.17 +0.83
42.50
Q3 60 43.13 +16.87 66.04 -6.04
43.75
Q4 30 45.00 -15.00 28.75 +1.25
46.25
1999 Q1 35 48.75 -13.75 34.58 +0.42
51.25
Q2 60 52.50 + 7.50 58.54 +1.46
53.75
Q3 80 55.63 +24.37 78.54 +1.46
57.50
Q4 40 58.75 -18.75 42.50 -2.50
60.00
2000 Q1 50 ____ ____ ____ ____
____
Q2 70 ____ ____ ____ ____
____
Q3 100

Q4 50

The first missing cycle average:  (40 + 50 + 70 + 100) / 4  =  260/4  =  65.00
The second missing cycle average:

The first missing moving average trend figure:  (60.00 + 65.00) / 2  =  62.50
The second missing moving average trend figure:

Next plot the centred Trend values (T) on your graph. Note that the first value is plotted
against Quarter 3.
98

3 Next calculate values for the last two First Residuals (Sales - Trend) for each separate
quarter. From the first residuals calculate the average Seasonal factor for each of the four
quarters: (Complete Quarters 3 and 4)

Seasonal Factor (S) = Average seasonal deviation from Trend

Quarter 1 Quarter 2 Quarter 3 Quarter 4


- - +27.50
-16.25 + 6.87 +16.87
-13.75 + 7.50 +24.37
-12.50 + 3.75
Total -42.50 +18.12
Mean -14.17 +6.04

Check that the average of these means is not very far from zero.

4 Calculate the last two Fitted values (F): F = T + S. Plot the fitted values on the same
graph. If the model is a 'perfect fit' the fitted values will be the same as the observed
Sales. We expect them to be close indicating that we have a 'good' model.

5 Calculate the last two Second residuals (R): R = A - F


The second residuals are the random amounts by which the model has 'mismatched' the
observed data in the past and give an indication of how good the forecasts are likely to
be. They should be small and randomly distributed about a value of zero.

6 Assess your model


The fitted and observed values should be similar as judged by their graphs.
The discrepancy between them, the second residuals, should be very small and randomly
distributed about zero. (We shall consider these again next week.)
Compare the means and standard deviations of the observed data, the first residuals and
the second residuals. (These will be discussed further next week.)

Summary Statistics (check later with your calculator)

Sales Mean = 54.375 Standard deviation = 20.238


First residuals Mean = - 0.366 Standard deviation = 16.935
Second residuals Mean = - 0.002 Standard deviation = 2.770
99

[Blank grid for hand-plotting Sales (£'000), 20 to 100, against quarters Q1 1997 to Q4 2001]
100

Sales (A) Cycle Trend (T) First Resid. Fitted value 2nd Resid.
Date
(£'000) Average Moving average S = (A - T) F = (T + S) R = (A - F)

1997 Q1 40

Q2 60
53.75
Q3 80 52.50 +27.50 75.42 +4.58
51.25
Q4 35 50.00 -15.00 33.75 +1.25
48.75
1998 Q1 30 46.25 -16.25 32.08 -2.08
43.75
Q2 50 43.13 + 6.87 49.17 +0.83
42.50
Q3 60 43.13 +16.87 66.04 -6.04
43.75
Q4 30 45.00 -15.00 28.75 +1.25
46.25
1999 Q1 35 48.75 -13.75 34.58 +0.42
51.25
Q2 60 52.50 + 7.50 58.54 +1.46
53.75
Q3 80 55.63 +24.37 78.54 +1.46
57.50
Q4 40 58.75 -18.75 42.50 -2.50
60.00
2000 Q1 50 62.50 -12.50 48.33 +1.67
65.00
Q2 70 66.25 + 3.75 72.29 -2.29
67.50
Q3 100

Q4 50

Seasonal Factor (S.F.) = Average seasonal deviation from Trend

Quarter 1 Quarter 2 Quarter 3 Quarter 4


- - +27.50 -15.00
-16.25 + 6.87 +16.87 -15.00
-13.75 + 7.50 +24.37 -18.75
-12.50 + 3.75
Total -42.50 +18.12 +68.74 -48.75
Average -14.17 +6.04 +22.91 -16.25
TIME SERIES ANALYSIS 2 - FORECASTING
101

In this lecture we shall produce forecasts using a seasonal model. Non-seasonal forecasting
is covered by Chapter 13 of Business statistics.

Last week we produced an Additive model for the Ice-cream sales data.

Was this appropriate? If so:


1. The plot of the observed data should show a steady repeated pattern and be close to that
of the fitted values;
2. the plot of the trend should be fairly smooth and steady without too many fluctuations;
3. the first residuals, those between the trend and the observed values, should have a mean
which is much smaller than the observed values;
4. the second residuals, those between the fitted and observed values, should have a mean
of zero and a much reduced standard deviation from that of the observed values.

1 and 2 See the incomplete graph from last week repeated on the last page.
3 and 4 Summary Statistics (as calculated last week)

Sales Mean = 54.375 Standard deviation = 20.238


First residuals Mean = - 0.366 Standard deviation = 16.935
Second residuals Mean = - 0.002 Standard deviation = 2.770

This model therefore seems satisfactory so we shall use it to predict future Ice-cream sales.

Forecasting using an Additive model (from last week)

Additive model:    Observed Sales value  =  Trend value  +  Seasonal factor  +  Random factor
                   A  =  T  +  S  +  R
The Trend value is the value of the sales had the seasonal effect been 'averaged out'.
The Seasonal factor is the average effect of it being a particular quarter.
The Random factor is the random variation due to neither of the previous variables.

When using this model for forecasting the Random factor is left out of the equation for the
single figure forecast but re-introduced when the likely accuracy is estimated as a range.

The Trend value is estimated from the extended trend line over the next four quarters on
your graph. Use your own best judgement in extending this line. If you have any
background information concerning future sales, this should be taken into account.

The Seasonal factor is the average of the previous deviations from the trend line for that
particular season.
Additive model forecast:    Forecast Sales value  =  Extended Trend value  +  Seasonal factor
                            F  =  T  +  S
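As a sketch only, the forecasting step in Python. The extended trend values below are purely hypothetical placeholders for whatever you read off your own extended trend line; only the seasonal factors and the error allowance come from the handout.

# Seasonal factors from the decomposition (Q1 to Q4)
seasonal = [-14.17, +6.04, +22.92, -16.25]

# Trend values for 2001 Q1-Q4 read off an extended trend line:
# these four numbers are hypothetical placeholders, NOT values from the handout
extended_trend = [70.0, 72.0, 74.0, 76.0]

forecasts = [t + s for t, s in zip(extended_trend, seasonal)]

# Maximum likely error: about twice the standard deviation of the second residuals
max_error = 2 * 2.77          # (£'000), i.e. about £5 540

for quarter, f in enumerate(forecasts, start=1):
    print(f"2001 Q{quarter}: {f:.1f} +/- {max_error:.2f} (£'000)")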
Forecasts
102

Read off the values from your extended trend line, note them in the table below, and add to
them the appropriate seasonal factor to get your forecast.

Additive model Extended Seasonal Forecast Sales


+ =
Trend value factor value
(T) (S) (F)

2001 Quarter 1 + (- 14.17) =


2001 Quarter 2 + (+ 6.04) =
2001 Quarter 3 + (+22.92) =
2001 Quarter 4 + (-16.25) =

Add your forecasts to your graph to check that the past pattern follows into the future.

Maximum likely error in forecasts How good are these forecasts?


The second residuals are the measure of the discrepancy in the past between the model and
the observed data. Assuming that the conditions from the past are continuing into the future,
there seems no reason why they should not remain, on average, the same size. They are
therefore used to estimate the size of the forecasting errors.

If they are shown to be small and randomly distributed about zero, then it can be assumed
that the error in the forecasts is not likely to be more than twice the standard deviation of the
second residuals. (Think 95% confidence interval!)

The standard deviation of the second residuals = 2.770 (£'000) so the maximum likely error
is 2 × 2.770 = £5 540.

Your forecasts should therefore be quoted as:


2001 Quarter 1  £5540 =
2001 Quarter 2  £5540 =
2001 Quarter 3  £5540 =
2001 Quarter 4  £5540 =

A final rounding to the nearest hundred seems reasonable.


2001 Quarter 1  £5500 =
2001 Quarter 2  £5500 =
2001 Quarter 3  £5500 =
2001 Quarter 4  £5500 =

Deseasonalising seasonal data


103

In order to compare figures from consecutive quarters the effect due to it being a particular
quarter needs to be eliminated. How can we compare the ice-cream sales in summer with
those in winter? We 'deseasonalise' the observed sales figures.

Observed Sales  -  Seasonal factor  =  Deseasonalised sales

2000 Quarter 1 50 - =
2000 Quarter 2 70 - =
2000 Quarter 3 100 - =
2000 Quarter 4 50 - =

So having removed the amount due to each quarter we can see that quarters 1 and 2 were
very similar but that quarter 3 showed an increase, in real terms, of £13 120 over quarter 2.
This was followed by a decrease in sales of £10 830 between quarters 3 and 4.

It now makes sense to calculate percentage changes, compare with published indexes, etc.
The deseasonalised values can be compared to the trend value at any point in time to see
whether, say, sales are progressing better or worse than the general trend at that time.
104

Minitab output for this data

Time series plot for Ice-cream Sales: Ice cream sales (£'000, 30 to 100) plotted against Quarters (0 to 15). [plot omitted]

We can see that this is quarterly data which can be described well by an additive model.

Time series Decomposition


ROW QUARTER THE DATA TREND 1ST.RESD SEASONAL FITTED 2ND.RESD

1 1 40 * * * * *
2 2 60 * * * * *
3 3 80 52.500 27.500 22.9167 75.4167 4.58334
4 4 35 50.000 -15.000 -16.2500 33.7500 1.25000
5 5 30 46.250 -16.250 -14.1667 32.0833 -2.08333
6 6 50 43.125 6.875 6.0417 49.1667 0.83333
7 7 60 43.125 16.875 22.9167 66.0417 -6.04166
8 8 30 45.000 -15.000 -16.2500 28.7500 1.25000
9 9 35 48.750 -13.750 -14.1667 34.5833 0.41667
10 10 60 52.500 7.500 6.0417 58.5417 1.45833
11 11 80 55.625 24.375 22.9167 78.5417 1.45834
12 12 40 58.750 -18.750 -16.2500 42.5000 -2.50000
13 13 50 62.500 -12.500 -14.1667 48.3333 1.66667
14 14 70 66.250 3.750 6.0417 72.2917 -2.29166
15 15 100 * * * * *
16 16 50 * * * * *
105

These figures, with the addition of 'Seasonal', are the same as those calculated last week for
this data. Note the quarterly repetition in the 'Seasonal' column - this is the average
deviation for a quarter so is the 'Seasonal factor' for use in both forecasting and
deseasonalising.
106

Plotting the trend line with the original data shows its smoothed fit.

Time series plots for Ice cream Sales: (left) with Trend added, (right) with Fitted values added; Sales (£'000, 30 to 100) against Index (quarters 1 to 16). [plots omitted]

The first diagram could, in theory, be used to estimate the future trend but it would be
difficult to extend the trend line and read values from it with any accuracy. A hand drawn
graph is therefore preferable.
Adding the Fitted values to the previous graph shows how well the model has fitted the
observed data in the past.
The second composite graph indicates a fairly close match but the residuals still need to be
analysed before the accuracy of any predictions can be estimated numerically.

Residual analysis
Remember that residuals are required to be small, with a mean of zero and a 'small' standard
deviation and be randomly distributed in time.

Plot of Second residuals (2ND.RESD, about -5 to +5) against QUARTER (0 to 15). [plot omitted]

Summary statistics

FOR THE DATA
MEAN = 54.375
ST.DEV. = 20.238
107

FOR THE FIRST RESIDUALS


MEAN = -0.36458
ST.DEV. = 16.955

FOR THE SECOND RESIDUALS


MEAN =9.536743E-07
ST.DEV. = 2.7696

Are they randomly distributed chronologically? No obvious seasonal pattern is evident.

These residuals seem to satisfy the required conditions and also give us the additional
information that our forecasts will be accurate to within 2 x 2.77 i.e. £5540

SPSS output for this data

Time series Decomposition

-> VARIABLE LABELS SALES "ice cream sales (£000)".
-> DATE YEAR 1997 QUARTER 1 4.            (Dates can be described in many ways;
-> * Seasonal Decomposition.               years and quarters are used here.)
-> /MODEL=ADDITIVE
-> /MA=CENTERED.

Results of SEASON procedure for variable SALES.        (Method: additive, with centred
Additive Model. Centered MA method. Period = 4.         moving average.)

                      Moving                Seasonal   Seasonally        Smoothed       Irregular
DATE_       SALES     averages    Ratios    factors    adjusted series   trend-cycle    component
Q1 1997 40.000 . . -13.802 53.802 55.573 -1.771
Q2 1997 60.000 . . 6.406 53.594 54.705 -1.111
Q3 1997 80.000 52.500 27.500 23.281 56.719 52.969 3.750
Q4 1997 35.000 50.000 -15.000 -15.885 50.885 50.098 .787
Q1 1998 30.000 46.250 -16.250 -13.802 43.802 45.978 -2.176
Q2 1998 50.000 43.125 6.875 6.406 43.594 43.177 .417
Q3 1998 60.000 43.125 16.875 23.281 36.719 42.413 -5.694
Q4 1998 30.000 45.000 -15.000 -15.885 45.885 45.098 .787
Q1 1999 35.000 48.750 -13.750 -13.802 48.802 48.756 .046
Q2 1999 60.000 52.500 7.500 6.406 53.594 52.622 .972
Q3 1999 80.000 55.625 24.375 23.281 56.719 55.747 .972
Q4 1999 40.000 58.750 -18.750 -15.885 55.885 58.432 -2.546
Q1 2000 50.000 62.500 -12.500 -13.802 63.802 62.645 1.157
Q2 2000 70.000 66.250 3.750 6.406 63.594 65.955 -2.361
Q3 2000 100.000 . . 23.281 76.719 68.733 7.986
Q4 2000 50.000 . . -15.885 65.885 70.122 -4.236

The following new variables are being created:

Name Label

ERR_1 Error for SALES from SEASON, MOD_1 ADD CEN 4


SAS_1 Seas adj ser for SALES from SEASON, MOD_1 ADD CEN 4
SAF_1 Seas factors for SALES from SEASON, MOD_1 ADD CEN 4
STC_1 Trend-cycle for SALES from SEASON, MOD_1 ADD CEN 4
108

The moving averages are the trend figures calculated by hand previously and the ‘Ratios’, a
term more suited to a multiplicative model, are the same first residuals. The 'seasonal
factors' and the 'smoothed trend cycle' describe the average effect due to a particular quarter
and the sales level without that effect. These differ slightly from our calculated figures as
they have been further smoothed and also extended to cover the first and last two time
periods. We shall look at seasonally adjusted series later.

No 'fitted values' from the model have been produced by default but these can easily be
obtained as the sum of 'smoothed trend cycle' and 'seasonal factors' which have
automatically been produced and saved by SPSS as STC_1 and SAF_1. The errors and
deseasonalised sales have also been saved.
109

Time series plot for Ice cream Sales: Quarterly Icecream Sales (£'000, 20 to 120) against Date, Q1 1997 to Q4 2001. [plot omitted]

Plotting the trend line with the original data shows its smoothed fit.

Trend added: [plot of Quarterly Icecream Sales (£'000, 20 to 120) with the Trend line added, against Date, Q1 1997 to Q4 2001]

This diagram could, in theory, be used to estimate the future trend but it would be difficult
to extend the trend line and read values from it with any accuracy. A hand drawn graph is
therefore preferable.

Adding the Fitted values to the previous graph shows how well the model has fitted the
observed data in the past.
110

Fitted values added to Sales and Trend: [plot of Quarterly Icecream Sales (£'000), TREND and Fitted values from the Additive model, against Date, Q1 1997 to Q4 2001]

This composite graph indicates a fairly close match but the residuals still need to be
analysed before the accuracy of any predictions can be estimated numerically.

Residual analysis
Remember that residuals are required to be small, with a mean of zero and a 'small' standard
deviation and be randomly distributed in time.

Descriptive Statistics

                                    N    Minimum   Maximum   Mean        Std. Deviation
Quarterly Icecream Sales (£'000)    16   30        100       54.38       20.24
First residuals                     12   -18.75    27.50     -.3658      16.9534
Second residuals                    12   -6.04     4.59      1.667E-03   2.7702
Valid N (listwise)                  12

Are they randomly distributed chronologically? No obvious seasonal pattern is evident.

These residuals seem to satisfy the required conditions and also give us the additional
information that our forecasts will be accurate to within 2 × 2.77, i.e. £5540.

Plot of second residuals (RESID2, about -8 to +6) against Date, Q1 1997 to Q4 2001. [plot omitted]
111

[Blank grid for hand-plotting Sales (£'000), 20 to 100, against quarters Q1 1997 to Q4 2001]
112

Sales (A) Cycle Trend (T) First Resid. Fitted value 2nd Resid.
Date
(£'000) Average Moving average S = (A - T) F = (T + S) R = (A - F)

1997 Q1 40

Q2 60
53.75
Q3 80 52.50 +27.50 75.42 +4.58
51.25
Q4 35 50.00 -15.00 33.75 +1.25
48.75
1998 Q1 30 46.25 -16.25 32.08 -2.08
43.75
Q2 50 43.13 + 6.87 49.17 +0.83
42.50
Q3 60 43.13 +16.87 66.04 -6.04
43.75
Q4 30 45.00 -15.00 28.75 +1.25
46.25
1999 Q1 35 48.75 -13.75 34.58 +0.42
51.25
Q2 60 52.50 + 7.50 58.54 +1.46
53.75
Q3 80 55.63 +24.37 78.54 +1.46
57.50
Q4 40 58.75 -18.75 42.50 -2.50
60.00
2000 Q1 50 62.50 -12.50 48.33 +1.67
65.00
Q2 70 66.25 + 3.75 72.29 -2.29
67.50
Q3 100

Q4 50

Seasonal Factor (S) = Average seasonal deviation from Trend

Quarter 1 Quarter 2 Quarter 3 Quarter 4


- - +27.50 -15.00
-16.25 + 6.87 +16.87 -15.00
-13.75 + 7.50 +24.37 -18.75
-12.50 + 3.75

Total -42.50 +18.12 +68.74 -48.75


Average -14.17 +6.04 +22.91 -16.25
113

SELECTED REVISION QUESTIONS WITH NUMERICAL ANSWERS


(Much larger selection of questions and all the answers in Business Statistics, Chapter 16.
Numbers from that chapter.)

Probability, Contingency tables and Chi-squared tests

16.6 At the end of a semester, the students' grades for Statistics were tabulated in the following
3 x 2 contingency table to see if there is any association between class attendance and
grades received.
Grade received
No. of Days Absent Pass Fail

0 - 3 135 110
4 - 6 36 4
7 - 45 9 6

(a) Calculate the probability that a student selected at random:


(i) was absent for less than four days;
(ii) was absent for less than seven days;
(iii) passed;
(iv) passed given that he/she was absent for less than four days;
(v) passed or was absent for less than four days;
(vi) passed and was absent for less than four days.

(b) Calculate the expected value for the number of days absence.

(c) At 5% significance, do the data indicate that the proportions of students who pass
differ between the three absence categories?

(d) Calculate the 95% confidence intervals for the percentage of students who passed
and the percentage of them who failed. Interpret your findings.

(e) Briefly summarise your results so that they can be understood by a non-
statistician.
{(a) (i) 0.817; (ii) 0.950; (iii) 0.600; (iv) 0.550 (v) 0.967; (vi) 0.450
(b) 3.19 days
(c) χ² = 17.45;  χ²(0.05, 2) = 5.99;  Reject H0; Conclude that the proportions differ.
(d) 54.5% to 65.5%; 34.5% to 45.5%; No overlap, so percentages different.}
114

Normal Distribution

16.10 The weights of sugar in what are nominally '2 kilo' bags are actually normally distributed
with mean 2.1 kg and standard deviation 0.1 kg.
a) What percentage of bags will weigh less than 2.0 kg? (15.9%)
b) What percentage will weigh between 2.0 kg and 2.25 kg? (77.5%)
c) If only 4% of the bags were to contain under the stated minimum weight,
what should this value be? (1.93 kg)
d) If a random sample of 25 bags from another machine has a mean weight of 2.0 kg
and a standard deviation of 0.5 kg would its mean production be any different?
(95% C.I. 1.79 kg to 2.21 kg so not different.)

Confidence Intervals
16.17 The records of a large shoe shop show that over a given period 520 out of 1000 people
who entered the shop bought at least one pair of shoes.

a) Treating this as a random sample of all potential customers find a 95% confidence
interval for the actual percentage of the people entering the shop who will buy at
least one pair of shoes. (48.9% to 55.1%)

b) Does this support the firm’s claim that they actually sell shoes to half of their
potential customers? Why? (Yes, interval includes 50%)

16.19 The same supermarket then decided to investigate the spending habits of husbands and
wives. They were thinking of starting late 'family shopping' evenings and as an
experiment asked both partners to shop separately. The amounts spent by 14 husbands
and their wives were selected randomly from all the pairs with the following results:
(rounded to the nearest £1)

Family A B C D E F G H I J K L M N
Husband (£) 9 20 19 9 26 16 20 13 15 10 18 13 16 23
Wife (£) 23 30 13 23 12 17 29 35 18 26 32 23 26 14

a) Calculate the 95% confidence interval for the mean amount spent by the wives.
(£18.72 to £27.14)

b) Calculate the 99% confidence interval for the mean amount spent by the
husbands. (£12.05 to £20.37)

c) The store manager thinks that, on average, women spend £20 per head.
Does the confidence interval in (a) from the sample support this claim?
Explain your answer. (Yes, £20 in interval)

d) Is there any evidence that the mean amounts spent by wives and their
husbands are different?
(Calculate the 95% confidence interval for the differences in each family.)
(Yes, £0.72 to £12.70)
115

Hypothesis Testing

16.25 A new teaching method, N, is to be compared to a standard teaching method, S, which


was believed to have a mean score of 65. Children were ‘paired’, one of each pair was
taught by each method, and both were given the same reading test with the following
results:

Pair number 1 2 3 4 5 6 7 8 9 10
S method score 56 59 61 48 39 56 75 45 81 60
N method score 63 57 67 52 61 71 70 46 93 75

a) Is 65 a reasonable estimate for the mean score of all children taught by method S?

b) Calculate the 'difference in scores' for each pair of children and test to see if, at
the 5% level, new method N gives a better mean score than standard method S.

{(a) C.V. = 2.26, T.S. t = 1.73, 65 could be the mean mark for all the children.
(b) C.V. = 1.83, T.S. t = 2.80, Method N does give a higher mean score.)}

Analysis of Variance
16.28 Four salesmen in your company are competing for the title ‘Salesman of the year’. Each
has the task of selling ‘cold’ in three different types of location. Their resulting sales, in
£’000, were as follows:

Salesmen
Area      A       B       C       D
1         52.8    49.4    58.6    42.9
2         60.1    48.1    61.0    50.3
3         62.0    56.4    63.3    61.2

a) Is there any significant difference, at a 5% level, between the sales of the men if
the location is not taken into account?

b) Is there any significant difference, at a 5% level, between the sales of the men if
the location is taken into account?

c) Which salesman won? Did he do significantly better than his nearest rival?

{(a) Fsalesmen = 2.14, F0.05 (3,8) = 4.07, no significant difference.
(b) Fsalesmen = 5.88, F0.05 (3,6) = 4.76, significant difference when locations are
taken into consideration. (c) C, No}
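Part (a) is a one-way analysis of variance on the four salesmen's figures; a minimal check, assuming Python with scipy, is shown below. Part (b) requires the two-way (randomised block) analysis with Area as the blocking factor, which is more easily obtained from a dedicated package and is not reproduced here.

from scipy import stats

a = [52.8, 60.1, 62.0]
b = [49.4, 48.1, 56.4]
c = [58.6, 61.0, 63.3]
d = [42.9, 50.3, 61.2]

# (a) one-way ANOVA ignoring location: F is about 2.14 on (3, 8) degrees of freedom
print(stats.f_oneway(a, b, c, d))
print(stats.f.ppf(0.95, 3, 8))     # 5% critical value, about 4.07, so not significant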
116

Correlation and Regression

16.31 The following data refers to the Sales and Advertising expenditure of a small local
bakery.
Year Advertising (£'000) Sales (£'000)
1990 3 40
1991 3 60
1992 5 100
1993 4 80
1994 6 90
1995 7 110
1996 7 120
1997 8 130
1998 6 100
1999 7 120
2000 10 150

a) Plot a scatter diagram of the data.


b) Calculate the correlation coefficient. (0.958)
c) Test it for significance at the 5% level. (C.V. = 0.602, significant)
d) If significant, find the equation of the regression line.
(Sales = 15.2 + 14.1 Advertising)
e) What practical use could be made of this equation?
f) What percentage of the variation in sales is explained by your model?
(92%)
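Assuming Python with scipy is the optional software package, parts (b), (d) and (f) can be reproduced with scipy.stats.linregress:

import numpy as np
from scipy import stats

advertising = np.array([3, 3, 5, 4, 6, 7, 7, 8, 6, 7, 10])                     # £'000
sales       = np.array([40, 60, 100, 80, 90, 110, 120, 130, 100, 120, 150])    # £'000

fit = stats.linregress(advertising, sales)
print(fit.rvalue)                   # (b) correlation coefficient, about 0.958
print(fit.intercept, fit.slope)     # (d) Sales = 15.2 + 14.1 Advertising, approximately
print(fit.rvalue ** 2)              # (f) about 0.92 of the variation in sales explained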
Index Numbers
16.38
Year    Index
1       100
2       110
3       120
4       135
5       150

(a) Construct an index with Year 3 = 100

(b) Construct an index with Year 5 = 100

(c) If the index concerns the price of a loaf of bread, which was 65p in Year 3,
then calculate the price of a similar loaf in each of the other years
{ (a) 83.3 91.7 100.0 112.5 125.0 (b) 66.7 73.3 80.0 90.0 100.0
(c) 54p 60p 65p 73p 81p ( to the nearest 1p) }
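Re-basing an index is simply division by the index value in the new base year, and a price series moves in proportion to the index. A minimal Python sketch:

index = [100, 110, 120, 135, 150]               # years 1 to 5, originally Year 1 = 100

print([i / index[2] * 100 for i in index])      # (a) Year 3 = 100: 83.3, 91.7, 100, 112.5, 125
print([i / index[4] * 100 for i in index])      # (b) Year 5 = 100: 66.7, 73.3, 80, 90, 100
print([65 * i / index[2] for i in index])       # (c) loaf price in pence: about 54, 60, 65, 73, 81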

16.39
Indexes
Year    ‘Old’    ‘New’
1       100
2       120
3       160
4       190      100
5                130
6                140
7                150
8                165

(a) Scale to ‘New’ the ‘Old’ indexes for years 1, 2, and 3.

(b) Scale to ‘Old’ the ‘New’ indexes for years 5, 6, 7 and 8.

(c) If a company’s profits were £3.5 million in Year 3, what would they be for the
other years?

{(a) 52.6 63.2 84.2 (b) 247.0 266.0 285.0 313.5
(c) 2.19 2.63 (3.50) 4.16 5.40 5.82 6.23 6.86 £ million}
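Splicing the two series works the same way: both indexes refer to Year 4, so ‘Old’ values are multiplied by 100/190 to put them on the ‘New’ base, and ‘New’ values by 190/100 to put them on the ‘Old’ base. Profits are then assumed to move in proportion to the spliced index. A sketch, again assuming Python:

old = {1: 100, 2: 120, 3: 160, 4: 190}
new = {4: 100, 5: 130, 6: 140, 7: 150, 8: 165}

to_new = {y: old[y] * new[4] / old[4] for y in (1, 2, 3)}        # (a) 52.6, 63.2, 84.2
to_old = {y: new[y] * old[4] / new[4] for y in (5, 6, 7, 8)}     # (b) 247, 266, 285, 313.5

spliced = {**old, **to_old}                                      # whole series on the 'Old' base
profits = {y: 3.5 * spliced[y] / spliced[3] for y in sorted(spliced)}   # (c) £3.5m in Year 3
print(to_new, to_old, profits)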

16.40 The production of refrigerators can be summarised by the following table:

Year                 Year 1   Year 2   Year 3   Year 4   Year 5
Production (000s)    4500     4713     5151     5566     6000

a) Calculate index numbers for the above data taking Year 3 as the base year.

b) If the production of cookers followed the same index and 550 thousand were
made in year 1, how many thousand were made in each of the other years?
{ (a) 87.4 91.5 100.0 108.1 116.5 (b) (550) 576 630 680 733 }
16.41 The table below refers to the Annual Average Retail Price Index.

Year     1       2       3       4       5       6       7
RPI1     351.8   373.2   385.9   394.5
RPI2                             100.0   106.9   115.2   126.1

a) Calculate the percentage change between consecutive years.

b) Calculate the percentage point change between consecutive years after Year 4.

{ (a) 6.1% 3.4% 2.2% 6.9% 7.8% 9.5% (b) 6.9 8.3 10.9 }
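Percentage changes compare each year's index with the previous year's, whereas percentage point changes simply subtract consecutive index values (meaningful here only after Year 4, where both values are on the RPI2 base). A short Python check:

rpi1 = [351.8, 373.2, 385.9, 394.5]      # years 1 to 4
rpi2 = [100.0, 106.9, 115.2, 126.1]      # years 4 to 7, re-based so that Year 4 = 100

pct1 = [(b / a - 1) * 100 for a, b in zip(rpi1, rpi1[1:])]    # (a) about 6.1, 3.4, 2.2
pct2 = [(b / a - 1) * 100 for a, b in zip(rpi2, rpi2[1:])]    # (a) about 6.9, 7.8, 9.5
print(pct1 + pct2)

print([b - a for a, b in zip(rpi2, rpi2[1:])])                # (b) 6.9, 8.3, 10.9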

16.42 The following table shows the prices of three commodities in three consecutive years:

Prices
Year 1 Year 2 Year 3
Tea 8 12 16
Coffee 15 17 18
Chocolate 22 23 24

a) Calculate the simple aggregate index for Year 2 and Year 3 using Year 1 as the
base year.

b) Calculate the mean price relative index for Year 2 and Year 3 using Year 1 as the
base year.

{ (a) 115.6 128.9 (b) 122.7 143.0 }
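The simple aggregate index divides the total of the Year n prices by the total of the Year 1 prices; the mean price relative index averages the individual price relatives. A short sketch, assuming Python:

year1 = [8, 15, 22]      # tea, coffee, chocolate
year2 = [12, 17, 23]
year3 = [16, 18, 24]

def simple_aggregate(prices, base):
    return sum(prices) / sum(base) * 100

def mean_price_relative(prices, base):
    return sum(p / b * 100 for p, b in zip(prices, base)) / len(base)

print(simple_aggregate(year2, year1), simple_aggregate(year3, year1))         # (a) about 115.6, 128.9
print(mean_price_relative(year2, year1), mean_price_relative(year3, year1))   # (b) about 122.6, 143.0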


Time series analysis and Forecasting

16.44 The following quarterly data represents the number of marriages, in thousands, for four
consecutive years: (Source: Office of Population Censuses.)

Year Q1 Q2 Q3 Q4
1 71.6 111.0 135.7 79.6
2 62.2 108.8 138.0 78.0
3 61.9 108.9 140.5 78.0
4 61.3 115.0 146.2 73.3

A summary of the analysis performed on this data, using both SPSS and Minitab, is
shown overleaf.

a) Plot the given data on graph paper as a time series, leaving room for your
forecasts for Year 5.
b) Calculate the missing values indicated by the three question marks (?) on the table
in the printout.
c) Plot the Smoothed trend-cycle (labelled Trend in the Minitab output) on your graph.
d) State the four seasonal factors and use with your graph to forecast the expected
number of marriages for each of the four quarters of Year 5, stating your
forecasts clearly and showing how they were calculated.
e) Estimate the likely error in your forecasts.
f) Demonstrate how the Seasonally adjusted series (SPSS), or the Fitted values (Minitab),
for the number of marriages in each quarter of Year 4 have been found.
g) With reference to your graph and the summary statistics given in the SPSS or
Minitab outputs, discuss the suitability of the additive model.
SPSS output for use with Question 16.44

-> * Seasonal Decomposition.
-> /VARIABLES=marriage
-> /MODEL=ADDITIVE
-> /MA=CENTERED.

DATE_  MARRIAGE  Moving averages  Ratios  Seasonal factors  Seasonally adjusted series  Smoothed trend-cycle  Second resid
Q1 1 71.600 . . -35.718 107.318 101.004 6.314
Q2 1 111.000 . . 13.207 97.793 99.973 -2.181
Q3 1 135.700 98.300 37.400 40.891 94.809 97.912 -3.103
Q4 1 79.600 96.850 -17.250 -18.380 97.980 96.976 1.005
Q1 2 62.200 96.863 -34.662 -35.718 97.918 96.980 .938
Q2 2 108.800 96.950 11.850 13.207 95.593 96.799 -1.206
Q3 2 138.000 96.713 41.287 40.891 97.109 96.757 .353
Q4 2 78.000 96.688 -18.688 -18.380 96.380 96.653 -.273
Q1 3 61.900 97.012 -35.112 -35.718 97.618 97.080 .538
Q2 3 108.900 97.325 11.575 13.207 95.693 97.144 -1.451
Q3 3 140.500 97.250 43.250 40.891 99.609 97.512 2.097
Q4 3 78.000 97.938 -19.938 -18.380 96.380 97.764 -1.384
Q1 4 61.300 ? ? -35.718 97.018 99.146 ?
Q2 4 115.000 99.538 15.463 13.207 101.793 99.788 2.005
Q3 4 146.200 . . 40.891 105.309 99.594 5.715
Q4 4 73.300 . . -18.380 91.680 99.497 -7.817

Summary Statistics
Variable Mean Std Dev Minimum Maximum N Label
MARRIAGE 98.12 30.79 61.3 146.2 16 Number of marriages (0,000)
FIRST -.04 30.83 -38.0 47.0 16 First residuals
SECOND -.04 3.35 -7.81690 6.31389 16 Second residuals

Minitab output (from an older version) for use with Question 16.44

ROW QUARTER THE DATA TREND 1ST.RESD SEASONAL FITTED 2ND.RESD

1 1 71.6 * * * * *
2 2 111.0 * * * * *
3 3 135.7 98.3000 37.4000 40.6458 138.946 -3.24583
4 4 79.6 96.8500 -17.2500 -18.6250 78.225 1.37500
5 5 62.2 96.8625 -34.6625 -35.9625 60.900 1.30000
6 6 108.8 96.9500 11.8500 12.9625 109.912 -1.11250
7 7 138.0 96.7125 41.2875 40.6458 137.358 0.64166
8 8 78.0 96.6875 -18.6875 -18.6250 78.063 -0.06250
9 9 61.9 97.0125 -35.1125 -35.9625 61.050 0.85000
10 10 108.9 97.3250 11.5750 12.9625 110.287 -1.38750
11 11 140.5 97.2500 43.2500 40.6458 137.896 2.60417
12 12 78.0 97.9375 -19.9375 -18.6250 79.313 -1.31250
13 13 61.3 ? ? -35.9625 63.450 ?
14 14 115.0 99.5375 15.4625 12.9625 112.500 2.50000
15 15 146.2 * * * * *
16 16 73.3 * * * * *

PLOT OF THE DATA AND THE TREND LINE
[Minitab character plot: number of marriages (vertical axis, 60 to 150 thousand) against
observation number (horizontal axis, 0 to 15); plotted points and trend line not reproduced.]

Summary Statistics
FOR THE DATA
MEAN = 98.125
ST.DEV. = 30.791

FOR THE FIRST RESIDUALS
MEAN = -0.24479
ST.DEV. = 30.775

FOR THE SECOND RESIDUALS
MEAN = 3.178914E-07
ST.DEV. = 1.8536
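The Moving averages (SPSS) or TREND (Minitab) column is a centred four-quarter moving average, and the additive seasonal factors are the quarterly means of the first residuals: Minitab quotes these raw means, while the SPSS factors have been adjusted so that the four of them sum to zero. The sketch below, which assumes Python with pandas, reproduces those columns and therefore also the values hidden by the question marks:

import pandas as pd

marriages = pd.Series([71.6, 111.0, 135.7, 79.6,
                       62.2, 108.8, 138.0, 78.0,
                       61.9, 108.9, 140.5, 78.0,
                       61.3, 115.0, 146.2, 73.3])

ma4 = marriages.rolling(4).mean()              # four-quarter moving averages
trend = ma4.rolling(2).mean().shift(-2)        # centred on each quarter (the TREND column)

first_resid = marriages - trend                # data minus trend (the 1ST.RESD column)
quarter = marriages.index % 4                  # 0..3 correspond to Q1..Q4
raw_factors = first_resid.groupby(quarter).mean()      # Minitab-style seasonal factors
adj_factors = raw_factors - raw_factors.mean()         # SPSS-style factors, summing to zero

print(trend.round(4))          # entry 12 (Q1 of Year 4) is the missing trend value
print(first_resid.round(4))    # entry 12 is the missing first residual
print(raw_factors.round(4))
print(adj_factors.round(4))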
[Blank grid for plotting in part (a): vertical axis Marriages ('000) from 60 to 140;
horizontal axis quarters Q1 to Q4 of Years 1 to 5.]
123
