The Course
This is a one-semester course for Business Studies students. It is intended to be user-friendly, with a minimum of jargon and a maximum of explanation. It is accompanied by a simple textbook, 'Statistics for Business - A Complete One-Semester Course', which contains many more explanations, extra worked examples, practice exercises with answers, examination examples, etc. Having completed the course you should be able to:
understand the use of most simple statistical techniques used in the world of
business;
understand published graphical presentation of data;
present statistical data to others in graphical form;
summarise and analyse statistical data and interpret the analysis for others;
identify relationships between pairs of variables;
make inferences about a population from a sample;
use some basic forecasting techniques;
use a statistical software package (optional at the discretion of the lecturer).
The emphasis in this course is not on the actual collection of numerical facts (data) but
on their classification, summarisation, display and analysis. These processes are carried
out in order to help us understand the data and make the best use of them.
Business statistics:
Any decision making process should be supported by some quantitative measures
produced by the analysis of collected data. Useful data may be on:
Your firm's products, costs, sales or services
Your competitors' products, costs, sales or services
Measurement of industrial processes
Your firm's workforce
Etc.
Once collected, this data needs to be summarised and displayed in a manner which helps
its communication to, and understanding by, others. Only when fully understood can it
profitably become part of the decision making process.
Categorical data
This is generally non-numerical data which is placed into exclusive categories and then
counted rather than measured. People are often categorised by sex or occupation. Cars
can be categorised by make or colour.
Nominal data
The scale of measurement for a variable is nominal if the data describing it are simple
labels or names which cannot be ordered. This is the lowest level of measurement. Even
if it is coded numerically, or takes the form of numbers, these numbers are still only
labels. For example, car registration numbers only serve to identify cars. Numbers on athletes' vests are only nominal and make no value statements.
All nominal data are placed in a limited number of exhaustive categories and any
analysis is carried out on the frequencies within these categories. No other arithmetic is
meaningful.
Ordinal Data
If the exhaustive categories into which the set of data is divided can be placed in a
meaningful order, without any measurement being taken on each case, then it is classed
as ordinal data. This is one level up from nominal. We know that the members of one
category are more, or less, than the members of another but we do not know by how
much. For example, the cars can be ordered as: 'small', 'medium', 'large' without the aid
of a tape measure. Degree classifications are only ordinal. Athletes' results depend on their order of finishing in a race, not on 'how much' separates their times.
Questionnaires are often used to collect opinions using the categories: 'Strongly agree',
'Agree', 'No opinion', 'Disagree' or 'Strongly disagree'. The responses may be coded as 1,
2, 3, 4 and 5 for the computer but the differences between these numbers are not claimed
to be equal.
Interval Data
There are very few examples of genuine interval scales. Temperature in degrees
Centigrade provides one example with the 'zero' on this scale being arbitrary. The
difference between 30°C and 50°C is the same as the difference between 40°C and 60°C
but we cannot claim that 60°C is twice as hot as 30°C. It is therefore interval data but
not ratio data. Dates are measured on this scale as again the zero is arbitrary and not
meaningful.
Ratio data must have a meaningful zero as its lowest possible value so, for example, the
time taken for athletes to complete a race would be measured on this scale. Suppose Bill
earns £20 000, Ben earns £15 000 and Bob earns £10 000. The intervals of £5000 between
Bill and Ben and also between Ben and Bob genuinely represent equal amounts of
money. Also, the ratio of Bob's earnings to Bill's earnings is genuinely the same ratio, 1 : 2, as that of the numbers which represent them, since the value £0 represents 'no money'.
This data set is therefore ratio as well as interval.
The distinction between these types is theoretical rather than practical as the same
numerical and graphical methods are appropriate for both. Money is always 'at least
interval' data.
We have therefore identified three measurement scales - nominal, ordinal and interval (with ratio data treated as interval in practice). Data may be analysed using methods appropriate to lower levels. For example: interval data may be treated as ordinal but
useful information is lost - if we know that Bill earns £20 000 and Ben earns £15 000 we
are throwing information away by only recording that Bill earns 'more than' Ben.
Data cannot be analysed using methods which are only appropriate for higher level data
as the results will be either invalid or meaningless. For example it makes no sense to code
the sex of students as 'male' = 1, 'female' = 2 and then report 'mean value is 1.7'. It is
however quite appropriate to report that '70% of the students are female'.
Analysing a sample instead of the whole population has many advantages such as the
obvious saving of both time and money. It is often the only possibility as the collecting of
data may sometimes destroy the article of interest, e.g. the quality control of rolls of film.
The ideal method of sampling is random sampling. By this method every member of the
population has an equal chance of being selected and every selection is independent of all
the others. This ideal is often not achieved for a variety of reasons and many other
methods are used.
DESCRIPTIVE STATISTICS
If the data available to us cover the whole population of interest, we may describe them
or analyse them in their own right, i.e. we are only interested in the specific group from
which the measurements are taken. The facts and figures usually referred to as 'statistics'
in the media are very often a numerical summary, sometimes accompanied by a graphical
display, of this type of data, e.g. unemployment figures. Much of the data generated by a
business will be descriptive in nature, as will be the majority of sporting statistics. In the
next few weeks you will learn how to display data graphically and summarise it
numerically.
INFERENTIAL STATISTICS
Alternatively, we may have available information from only a sample of the whole
population of interest. In this case the best we can do is to analyse it to produce the
sample statistics from which we can infer various values for the parent population. This
branch of statistics is usually referred to as inferential statistics. For example we use the
proportion of faulty items in a sample taken from a production line to estimate the
corresponding proportion of all the items.
A descriptive measure from the sample is usually referred to as a sample statistic and the
corresponding measure estimated for the population as a population parameter.
The problem with using samples is that each sample taken would produce a different
sample statistic giving us a different estimate for the population parameter. They cannot
all be correct so a margin of error is generally quoted with any sample estimations. This
is particularly important when forecasting future statistics.
In this course you will learn how to draw conclusions about populations from sample
statistics and estimate future values from past data.
(There is no tutorial sheet corresponding to this week's lecture but if your basic maths is
at all questionable it can be checked by working through the 'Basic Mathematics
Revision' sheet which covers all the basic arithmetic which is the only prior learning
required for this course.)
1 Evaluate:
   a) 3/7 - 4/21 + 5/3            b) 1 2/3 + 2 4/5 - 3 1/2
   c) 2/7 × 14/25 × 15/24         d) 16/21 ÷ 4/7
   e) 4/9 × 3/16 ÷ 7/12           f) 4 2/5 ÷ 11/12 × 1 7/8
2 a) Give 489 267 to 3 significant figures. b) Give 489 267 to 2 sig. figs.
c) Give 0.002615 to 2 significant figures. d) Give 0.002615 to 1 sig. fig.
e) Give 0.002615 to 5 decimal places. f) Give 0.002615 to 3 dec. places.
3 Retail outlets in a town were classified as small, medium and large and their
numbers were in the ratio 6 : 11 : 1. If there were 126 retail outlets altogether, how
many were there of each type?
4 Convert:
   a) 28% to a fraction in its lowest terms.    b) 28% to a decimal.
   c) 3/8 to a decimal.                         d) 3/8 to a percentage.
   e) 0.625 to a fraction in its lowest terms.  f) 0.625 to a percentage.
10 Solve for x:
   a) 3x - 1 = 4 - 2x    b) 2(x - 3) = 3(1 - 2x)    c) 3/(x - 1) = 1/2
Answers
1 a) 1 19/21   b) 29/30   c) 1/10   d) 1 1/3   e) 1/7   f) 9
4 a) 7/25   b) 0.28   c) 0.375   d) 37.5%   e) 5/8   f) 62.5%
7 a) x y⁴   b) x y⁷   c) x^(1/2)   d) 3x y⁶/12
8 a) 0.2 b) +2 or - 2 c) -0.75 d) -1
9 a) -10 b) 100 c) 36 d) 16 e) 24 f) 45
10 a) x = 1   b) x = 1 1/8   c) x = 7
A frequency distribution is simply a grouping of the data together, generally in the form
of a frequency distribution table, giving a clearer picture than the individual values.
The most usual presentation is in the form of a histogram and/or a frequency polygon.
Method
Construct a Frequency Distribution Table, grouping the data into a reasonable number of classes (of the order of 10). Too few classes may hide some information about the data; too many classes may disguise its overall shape.
Make all intervals the same width at this stage.
Construct the frequency distribution table. Tally charts may be used if preferred, but
any method of finding the frequency for each interval is quite acceptable.
Intervals are often left the same width but if the data is scarce at the extremes then
classes may be joined. If frequencies are very high in the middle of the data classes
may be split.
If the intervals are not all the same width, calculate the frequency densities, i.e.
frequency per constant interval. (Often the most commonly occurring interval is used
throughout.) If some intervals are wider than others care must be taken that the areas
of the blocks and not their heights are proportional to the frequencies.
Construct the histogram labelling each axis carefully. For a given frequency
distribution, you may have to decide on sensible limits if the first or last Class
Interval is specified as 'under ....' or 'over....', i.e. is open ended.
Hand drawn histograms usually show the frequency or frequency density vertically.
Computer output may be horizontal as this format is more convenient for line printers.
(Your histogram will be used again next week for estimating the modal mileage.)
A frequency polygon is constructed by joining the midpoints at the top of each column of the histogram. The final section of the polygon often joins the midpoint at the top of each extreme rectangle to a point on the x-axis half a class interval beyond the rectangle. This makes the area enclosed by the polygon the same as that of the histogram.
Example: You are working for the Transport manager of a large chain of supermarkets
which hires cars for the use of its staff. Your boss is interested in the weekly distances
covered by these cars. Mileages recorded for a sample of hired vehicles from 'Fleet 1'
during a given week yielded the following data:
138 164 150 132 144 125 149 157 161 150
146 158 140 109 136 148 152 144 145 145
168 126 138 186 163 109 154 165 135 156
146 183 105 108 135 153 140 135 142 128
Minimum = 105 Maximum = 186 Range = 186 - 105 = 81
Nine intervals of 10 miles width seem reasonable, but the extreme intervals may be wider.
Next calculate the frequencies within each interval. (Not frequency density yet.)
The data is scarce at both extremes so join the extreme two classes together. This doubles
the interval width so the frequency needs to be halved to produce the frequency per 10
miles, i.e. the frequency density. Calculate all the frequency densities. Next draw the
histogram and add the frequency polygon.
[Histogram and frequency polygon: frequency density (per 10 miles) against mileage, horizontal axis 100 to 200 miles]
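If a statistical software package is being used (optional, as noted in week 1), the grouping can be cross-checked by computer. The short Python sketch below is only an illustration, not part of the course materials; the class boundaries are those chosen above, with the sparse extreme classes already joined.

    # Frequency table for the mileage data, with the end classes joined
    # (100-120 and 170-190 are double-width, so their densities are halved).
    mileages = [138, 164, 150, 132, 144, 125, 149, 157, 161, 150,
                146, 158, 140, 109, 136, 148, 152, 144, 145, 145,
                168, 126, 138, 186, 163, 109, 154, 165, 135, 156,
                146, 183, 105, 108, 135, 153, 140, 135, 142, 128]

    boundaries = [100, 120, 130, 140, 150, 160, 170, 190]

    for lo, hi in zip(boundaries, boundaries[1:]):
        freq = sum(lo <= m < hi for m in mileages)
        density = freq / ((hi - lo) / 10)   # frequency per 10-mile interval
        print(f"{lo:3}-{hi:3}: frequency {freq:2}, density {density:4.1f}")

Running this reproduces the frequencies 4, 3, 7, 11, 8, 5, 2 and the halved densities 2 and 1 for the two double-width classes.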
Method
Construct a frequency table as before.
Use it to construct a cumulative frequency table noting that the end of the interval
is the relevant plotting value.
Calculate a column of cumulative percentages.
Plot the cumulative percentage against the end of the interval and join up the points
with straight lines. N.B. Frequencies are always plotted on the vertical axis.
Using the data in the section on Histograms, we work through the above method for
drawing the cumulative frequency diagram. Next week you will use it to estimate the
median mileage, the interquartile range and the semi-interquartile range of the data.
Interval (less than)   Frequency   Cumulative frequency   % cumulative frequency
        100                0                0                      0.0
        120                4                4                     10.0
        130                3                7                     17.5
        140                7               14                     35.0
        150               11               25                     62.5
        160                8               33                     82.5
        170                5               38                     95.0
        190                2               40                    100.0
[Cumulative frequency diagram (ogive): cumulative percentage (0 to 100) against mileage, 100 to 200 miles]
We can estimate from this diagram that the percentage of vehicles travelling, say, less
than 125 miles is 14%.
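The same reading-off can be checked numerically. The Python sketch below is an illustration only (the function name is our own); it linearly interpolates along the plotted points of the ogive, which is exactly what dotting in on the graph does.

    # Cumulative-percentage table from above: (upper boundary, cumulative %).
    ogive = [(100, 0.0), (120, 10.0), (130, 17.5), (140, 35.0),
             (150, 62.5), (160, 82.5), (170, 95.0), (190, 100.0)]

    def cumulative_percent(x):
        """Linear interpolation along the ogive, as reading off the graph does."""
        for (x0, p0), (x1, p1) in zip(ogive, ogive[1:]):
            if x0 <= x <= x1:
                return p0 + (p1 - p0) * (x - x0) / (x1 - x0)
        raise ValueError("outside the range of the data")

    print(cumulative_percent(125))   # 13.75, i.e. about 14% of vehicles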
Next week we shall use the same histogram and cumulative frequency diagram to:
Estimate the mode of the frequency data.
Estimate the Median value by dotting in from 50% on the Cumulative Percentage
axis as far as your ogive and then down to the values on the horizontal axis. The
value indicated on the horizontal axis is the estimated Median.
Estimate the Interquartile Range by dotting in from 25% and down to the horizontal
axis to give the Lower Quartile value. Repeat from 75% for the Upper Quartile
value.
The Interquartile Range is the range between these two Quartile values.
The Semi-interquartile Range is half of the Interquartile range.
The same technique can be employed to estimate a particular value from the
percentage of the data which lies above or below it.
In this lecture we shall look at summarising data by numbers, rather than by graphs as last
week. We shall use the diagrams drawn in last week's lecture again this week for making
estimations and you will use the diagrams you drew in Tutorial 2 for estimating summary
statistics in Tutorial 3. You will first learn the meanings of the various summary statistics
and then learn how to use your calculator in standard deviation mode to save time and
effort in calculating mean and standard deviation. Further summary statistics are
discussed in Business Statistics, Chapter 3, Section 3.6.
As we saw last week, a large display of data is not easily understood but if we make some
summary statement such as 'the average examination mark was 65' and the 'marks ranged
from 40 to 75' we get a clearer understanding of the situation. A set of data is often
described, or summarised by two parameters: a measure of centrality, or location, and
a measure of spread, or dispersion.
MEASURES OF CENTRALITY
Mode: This is the most commonly occurring value.
The data may be nominal, ordinal, interval or ratio. (Section 1.4 of Business Statistics)
Median: The middle value when all the data are placed in order.
The data themselves must be ordinal or interval.
For an even number of values take the average of the middle two.
Mean: The sum of all the values divided by the number of values; the data must be interval or ratio. The mean may be thought of as the 'typical value' for a set of data and, as such, is the logical value for use when representing it.
You still work for the Transport Manager, as last week, investigating the usage of cars. Example 2 uses grouped discrete data: days idle for a whole staff of users of hired cars. For Example 3, 40 employees stated their fuel costs during the previous week to be:
MEASURES OF SPREAD
Range: The difference between the highest and lowest values.
Interquartile Range: The range of the middle half of the ordered data.
Population Standard Deviation: (σn) This is also known as 'Root Mean Square
Deviation' and is calculated by squaring and adding the deviations from the mean, finding
the average of the squared deviations, and then square-rooting the result, or by using
your calculator. As the name suggests we have the whole population of interest
available for analysis.
s = √[Σ(x − x̄)²/n]   or   s = √[Σf(x − x̄)²/n] for frequency data
x represents the value of the data
f is the frequency of that particular value
x̄ is the shorthand way of writing 'mean'
s is the shorthand way of writing 'standard deviation'
Σ is the shorthand way of writing 'the sum of'
√ means 'take the positive square root of' as the negative root has no meaning.
The Standard Deviation is a measure of how closely the data are grouped about the mean.
The larger the value the wider the spread. It is a measure of just how precise the mean is
as the value to represent the whole data set.
s = √[Σx²/n − (Σx/n)²]   or   s = √[Σfx²/n − (Σfx/n)²] for frequency data.
Sample standard deviation (σn-1)
Usually we are interested in the whole population but only have a sample taken from it available for analysis. It has been shown that the formula above gives a biased value - always too small. If the denominator used is (n − 1) instead of n, the estimate is found to be more accurate.
s = √[Σ(x − x̄)²/(n − 1)]   or   s = √[Σf(x − x̄)²/(n − 1)] for frequency data
As we shall see shortly on the calculator, the keys for the two standard deviations are described as σn and σn-1 respectively.
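For those using a computer instead of a calculator, the two standard deviations can be computed as below. This Python sketch is only an illustration; the example values are made up and are not data from the notes.

    import math
    import statistics

    def sigma_n(data):
        """Population standard deviation (the calculator's sigma_n key)."""
        m = sum(data) / len(data)
        return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

    def sigma_n_minus_1(data):
        """Sample standard deviation (sigma_n-1): divides by n - 1."""
        m = sum(data) / len(data)
        return math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))

    values = [2, 4, 4, 4, 5, 5, 7, 9]                          # illustrative data only
    print(sigma_n(values), statistics.pstdev(values))          # both 2.0
    print(sigma_n_minus_1(values), statistics.stdev(values))   # both about 2.14

Note that the n − 1 version is always slightly larger, correcting the bias described above.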
The examples below are the same as used previously for the measures of centrality
because these two parameters are generally calculated together.
Range: = 6 - 1 = 5 days
Interquartile Range: Lower quartile, one quarter of (9 + 1) = 2.5th value = 1 day
Upper quartile, three quarters of (9 + 1) = 7.5th value = 4 days
Interquartile range = 4 - 1 = 3 days
(A much larger sample should really be used for this type of analysis.)
Standard Deviation:
(See your calculator booklet for the method as calculators vary considerably. See next
page if you have a Casio. If still uncertain, ask in tutorials)
For these nine numbers only, s = 1.633 days (σn, shift 2 for Casios)
As estimator for the whole population, s = 1.732 days (σn-1, shift 3 for Casios)
Example 2. Grouped Discrete Data: The numbers of idle days for 95 cars are:
Interquartile Range: Usually calculated from the quartiles estimated (see below) from a
cumulative frequency diagram - an ogive.
Semi interquartile range: estimated from the interquartile range from the ogive.
Use of a Casio calculator in S.D. Mode for finding mean and standard deviation
Mode (Last week's Mileages of Cars example) We know that the modal mileage is in
the interval between 140 and 150 miles but need to estimate it more accurately.
Draw 'crossed diagonals' on the protruding part of the highest column. These two
diagonals cross at an improved estimate of the modal mileage. By this method we take
into consideration also the frequencies of the columns adjacent to the modal column.
[Histogram with crossed diagonals drawn on the modal column (140-150 miles) to locate the mode]
Estimated Mode:
Median (Last week's Mileages of Cars example) Dot in from 50% on the Cumulative
Percentage axis to your ogive and then down to the values on the horizontal axis. The
value indicated is the estimated Median for the Mileages.
Quartiles Dot in from 75% and the 25% values on the Cumulative Percentage axis to
your ogive and then down to the values on the horizontal axis. The values indicated are
the estimated Upper quartile and Lower quartile respectively of the Mileages.
The difference between these measures is the Interquartile range. Half the Inter-quartile
range is the Semi-interquartile range.
[Ogive: cumulative percentage against mileage, with dotted lines from 25%, 50% and 75% down to the horizontal axis]
Estimated Median:
Estimated Quartiles:
Interquartile range:
Semi-Interquartile range:
Minitab output
Example 3
Descriptive Statistics
Variable N Mean Median Tr Mean StDev SE Mean
Costs 34 62.974 63.000 62.970 1.955 0.335
SPSS output
Example 3
Descriptives
From the ogive we can also estimate the percentage of users who travelled more than a given mileage, e.g. more than 170 miles: 100% - 95% = 5%.
PROBABILITY
In this lecture we will consider the basic concept of probability which is of such
fundamental importance in statistics. We will calculate probabilities from symmetry and
relative frequency for both single and combined events and use expected values in order
to make logical decisions. We will only use contingency tables for combined probabilities
in this lecture so make sure you read Chapter 4 of Business Statistics for descriptions of
tree diagrams. Make sure you complete your tutorial, if not finished in class time.
The concept of probability was initially associated with gambling. Its value was later realised in the field of business, initially in calculations for shipping insurance.
There are two main ways of assessing the probability of a single outcome occurring:
from the symmetry of a situation to give an intuitive expectation,
from past experience using relative frequencies to calculate present or future
probability.
Value of Probability
Probability always has a value between 0, impossibility, and 1, certainty.
Invoking symmetry, there is no need to actually do any experiments. This is the basis of the theory of probability, which was developed for use in gaming situations - 'a priori' probability.
Examples
1. Tossing 1 coin:
Possibilities: Head, Tail. P(a head) = 1/2 = 0.5 P(a tail) = 1/2 = 0.5
2. Tossing 2 coins:
Possibilities: Head Head, Head Tail, Tail Head, Tail Tail.
P(2 heads) = 1/4   P(2 tails) = 1/4   P(1 head and 1 tail) = 2/4 = 1/2
Note that the sum of all the possible probabilities is always one.
This makes sense as one of them must occur and P(certainty) = 1.
The relative frequency of an event is simply the proportion of the possible times it has
occurred.
As the number of trials increases the relative frequency demonstrates a long term
tendency to settle down to a constant value. This constant value is the probability of the
event.
In this experiment the ‘long term’ would be the time necessary for the relative frequency
to settle down to two decimal places.
[Graph: proportion of heads against number of throws (1 to 29), settling towards 0.5]
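This settling down can be imitated by simulating coin tosses on a computer. The Python sketch below is an illustration only (the seed is fixed just to make the run repeatable); with more throws the relative frequency approaches the probability 0.5.

    import random

    random.seed(1)                          # fixed seed for a reproducible run
    heads = 0
    for throw in range(1, 10001):
        heads += random.random() < 0.5      # simulate one fair coin toss
        if throw in (10, 100, 1000, 10000):
            print(throw, heads / throw)     # relative frequency so far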
Contingency tables
Contingency tables display all possible combined outcomes and the frequency with
which each of them has happened in the past. These frequencies are used to calculate
future probabilities.
                        Expenditure on drink
Mode of travel    None    1p and under £20    At least £20    Total
On foot            40            20                 10           70
By bus             30            35                 15           80
By car             25            33                 42          100
Total              95            88                 67          250
Suppose we put all the till invoices into a drum and thoroughly mix them up. If we close
our eyes and take out one invoice, we have selected one customer at random.
These last two values are probably more easily calculated by adding the row and column totals and then subtracting the intersecting cell, which has been included twice:
P(customer spends at least £20 or travels by car) = (67 + 100 - 42)/250 = 0.50
P(customer arrives on foot or spends no money on drink) = (70 + 95 - 40)/250 = 0.50
Sometimes we need to select more than one row or column:
P(a customer will spend less than £20) = (95 + 88)/250 = 0.732
In the examples above all the customers have been under consideration without any
condition being applied which might exclude any of them from the overall ratio. Often
not all are included as some condition applies.
CONDITIONAL PROBABILITY
If the probability of the outcome of a second event depends upon the outcome of a
previous event then the second is conditional on the result of the first. This does not
imply a time sequence but simply that we are asked to find the probability of an event
given additional information - an extra condition, i.e. if we know that the customer
travelled by car all the other customers who did not are excluded from the calculations.
If we need P(a customer spends at least £20, if it is known that he/she travelled by car)
We eliminate from the choice all those who did not arrive in a car, i.e. we are only
interested in the third row of the table. A shorthand method of writing 'if it is known that . . .' or 'given that . . .' is '|'.
P(a customer spends at least £20, | he/she travelled by car) = 42/100 = 0.420
P(a customer came by car | he spent at least £20 on drink) = 42/67 = 0.627
Note that
P(spending £20 | he/she travelled by car) ≠ P(coming by car | he spent £20 on drink)
P(a customer spends less than £20 | he/she did not travel by car) = (40 + 20 + 30 + 35)/150 = 125/150 = 0.833
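All of these 'how many out of how many?' calculations can be reproduced directly from the table. The Python sketch below is an illustration only; the cell labels and the helper p() are our own shorthand, not notation from the course.

    # The contingency table above as a dictionary of cell frequencies.
    table = {("foot", "none"): 40, ("foot", "under20"): 20, ("foot", "over20"): 10,
             ("bus",  "none"): 30, ("bus",  "under20"): 35, ("bus",  "over20"): 15,
             ("car",  "none"): 25, ("car",  "under20"): 33, ("car",  "over20"): 42}
    total = sum(table.values())                               # 250 customers

    def p(travel=None, spend=None):
        """Frequency of the cells matching the given condition(s)."""
        return sum(f for (t, s), f in table.items()
                   if (travel is None or t == travel) and (spend is None or s == spend))

    print(p(spend="over20") / total)               # P(at least £20) = 67/250
    print(p("car", "over20") / p("car"))           # P(at least £20 | car) = 42/100
    print(p("car", "over20") / p(spend="over20"))  # P(car | at least £20) = 42/67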
Many variations on the theme of probability have been covered in this lecture but they all
boil down to answering the question 'how many out of how many?'
EXPECTED VALUES
A commonly used method in decision making problems is the consideration of expected
values. The expected value for each decision is calculated and the option with the
maximum or minimum value (dependent upon the situation) is selected.
Clearly this is only a theoretical value as in reality you would win either £25, £10 or £0.
Example 6: Two independent operations A and B are started simultaneously. The times
for the operations are uncertain, with the probabilities given below:
        Operation A                            Operation B
Duration (days) (x)  Probability (p)   Duration (days) (x)  Probability (p)
        1                 0.0                  1                 0.1
        2                 0.5                  2                 0.2
        3                 0.3                  3                 0.5
        4                 0.2                  4                 0.2
Operation A
E(x) = Σpx = (1 × 0.0) + (2 × 0.5) + (3 × 0.3) + (4 × 0.2) = 2.7 days
Operation B
E(x) = Σpx = (1 × 0.1) + (2 × 0.2) + (3 × 0.5) + (4 × 0.2) = 2.8 days
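The same expected values can be checked in a couple of lines. The Python sketch below is only an illustration of the E(x) = Σpx calculation above.

    # Expected durations for the two operations.
    op_a = {1: 0.0, 2: 0.5, 3: 0.3, 4: 0.2}
    op_b = {1: 0.1, 2: 0.2, 3: 0.5, 4: 0.2}

    def expected(dist):
        """E(x) = sum of p times x over the probability distribution."""
        return sum(x * p for x, p in dist.items())

    print(expected(op_a))   # 2.7 days (up to floating-point rounding)
    print(expected(op_b))   # 2.8 days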
The normal probability distribution is one of many which can be used to describe a set
of data. The choice of an appropriate distribution depends upon the 'shape' of the data and
whether it is discrete or continuous.
You should by now be able to recognise whether the data is discrete or continuous, and a
brief look at a histogram should give you some idea about its shape. In Chapter 5 of
Business Statistics the four commonly occurring distributions below are studied. These
cover most of the different types of data.
This lecture will concentrate on just the normal distribution, but you should be aware of
the others.
Observed Data
The Normal distribution is found to be a suitable model for many naturally occurring
variables which tend to be symmetrically distributed about a central modal value - the
mean.
There is no single normal curve, but a family of curves, each one defined by its mean, µ, and standard deviation, σ; µ and σ are called the parameters of the distribution.
As we can see the curves may have different centres and/or different spreads but they all
have certain characteristics in common:
The curve is bell-shaped,
it is symmetrical about the mean (µ),
the mean, mode and median coincide.
Note that the curve neither finishes nor meets the horizontal axis at µ ± 3σ; it only approaches it and actually goes on indefinitely.
Even though all normal distributions have much in common, they will have different
numbers on the horizontal axis depending on the values of the mean and standard
deviation and the units of measurement involved. We cannot therefore make use of the
standard normal tables at this stage.
The formula for calculating the exact number of standard deviations (z) away from the mean (µ) is: z = (x − µ)/σ
The process of calculating z is known as standardising, producing the standardised
value which is usually denoted by the letter z, though some tables may differ.
Knowing the value of z enables us to find, from the Normal tables, the area under the
curve between the given value (x) and the mean () therefore providing the probability of
a value being found between these two values. This probability is denoted by the letter Q,
which stands for quantile. (This is easier to understand with numbers!)
For standardised Normal distributions the Normal tables are used. (See back page)
Example: The weekly amounts spent by the 500 shoppers at a supermarket are normally distributed with mean µ = £50 and standard deviation σ = £15.
a) The probability that a shopper selected at random spends between £50 and £80 per week
µ = £50, σ = £15, x = £80.
[Sketch: normal curve, horizontal axis marked 5, 20, 35, 50, 65, 80, 95 (£); the area Q lies between 50 and 80, at standardised value z]
First standardise: z = (x − µ)/σ = (80 − 50)/15 = 30/15 = 2.00
From tables: z = 2.00 gives Q = 0.4772 (z in the margin, Q in the body of the tables), so P(£50 < x < £80) = 0.4772.
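If a computer is available, the table lookup can be reproduced exactly. The Python sketch below is an illustration only: it uses the standard library's normal distribution to produce the same Q value that the tables give for z = 2.00.

    from statistics import NormalDist

    mu, sigma = 50, 15
    standard = NormalDist()                  # the standardised normal curve

    def q(x):
        """Area under the curve between the mean and x - the Q of the tables."""
        return abs(standard.cdf((x - mu) / sigma) - 0.5)

    print(q(80))    # 0.4772..., matching the table entry for z = 2.00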
b) The probability that a shopper selected at random spends more than £50 per week
No need to do any calculations for this question. The mean is £50 and, because the
distribution is normal, so is the median. Half the shoppers, 250, are therefore expected
to spend more than £50 per week.
c) The percentage of shoppers who are expected to spend between £30 and £80 per
week
We first need P(£30 < x < £80), then we convert the result to a percentage.
µ = £50, σ = £15, x1 = £30, x2 = £80
[Sketch: normal curve with areas Q1 (between 30 and 50, at z1) and Q2 (between 50 and 80, at z2)]
Our normal tables provide the area between a particular value and the mean so the area
between £30 and £80 needs splitting by the mean so that the partial areas, £30 to £50 and
£50 to £80 can be calculated separately and then the two parts recombined.
P(£30 < x < £50): z1 = (x − µ)/σ = (30 − 50)/15 = −20/15 = −1.333
(The table values are all positive, so when z is negative we invoke the symmetry of the
situation and use its absolute value in the table.)
From tables: |z1| = 1.33 gives Q1 = 0.4082 and, from part a), Q2 = 0.4772, so P(£30 < x < £80) = 0.4082 + 0.4772 = 0.8854, i.e. about 88.5% of shoppers.
d) The percentage of shoppers who are expected to spend between £55 and £70 per week
We first need P(£55 < x < £70), then we convert the result to a percentage. Both values are above the mean, so P(£55 < x < £70) = Q2 − Q1 = 0.4082 − 0.1306 = 0.2776, i.e. about 28%.
e) The expected number of shoppers who will spend less than £70 per week
µ = £50, σ = £15, x = £70.
The area we need is that between the mean and £70 plus the 0.5 which falls below the
mean.
First standardise: z = (70 − 50)/15 = 1.333
From tables: Q = 0.4082, so P(x < £70) = 0.5 + 0.4082 = 0.9082 and the expected number of shoppers is 0.9082 × 500 ≈ 454.
f) The expected number of shoppers who will spend between £37.50 and £57.50 per
week
We first need P(£37.50 < x < £57.50). This is then multiplied by the total frequency.
µ = £50, σ = £15, x1 = £37.50, x2 = £57.50
z1 = (37.50 − 50)/15 = −0.833 giving Q1 = 0.2976; z2 = (57.50 − 50)/15 = 0.500 giving Q2 = 0.1915. P = 0.2976 + 0.1915 = 0.4891, so the expected number is 0.4891 × 500 ≈ 245 shoppers.
g) The value below which 70% of the shoppers are expected to spend
µ = £50, σ = £15, x = £?
[Sketch: normal curve, axis 5 to 95 (£); 0.5 of the area lies below the mean and Q = 0.2 between the mean and the unknown x]
From tables, Q = 0.2 gives z = 0.524. If your algebra is a bit 'iffy', try using the fact that this value of z tells us that x is 0.524 standard deviations above the mean: x = µ + zσ = 50 + 0.524 × 15 ≈ £57.86.
h) The value below which 45% of the shoppers are expected to spend
µ = £50, σ = £15, x = £?
[Sketch: normal curve; 0.45 of the area lies below x, so Q = 0.05 between x and the mean]
From tables, Q = 0.05 gives z = 0.126.
We know, however, that x is below the mean so z is negative, −0.126, and x = µ + zσ = 50 − 0.126 × 15 ≈ £48.11.
i) The value above which 10% of the shoppers are expected to spend
µ = £50, σ = £15, x = £?
[Sketch: normal curve; 10% of the area lies above x, so Q = 0.4 between the mean and x]
From tables, Q = 0.4 gives z = 1.282.
Alternatively: x = µ + zσ = 50 + 1.282 × 15 ≈ £69.23
The remainder of the questions can be completed similarly in your own time.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
Last week you became familiar with the normal distribution. We now estimate the
parameters of a normally distributed population by analysing a sample taken from it. In
this lecture we will be concentrating on the estimation of percentages and means of
populations but do note that any population parameter can be estimated from a sample.
Sampling
Sampling theory takes a whole lecture on its own! Since any result produced from the
sample can be used to estimate the corresponding result for the population it is absolutely
essential that the sample taken is as representative as possible of that population.
Common sense rightly suggests that the larger the sample the more representative it is
likely to be but also the more expensive it is to take and analyse. A random sample is
ideal for statistical analysis but, for various reasons, other methods have also been devised
devised for when this ideal is not feasible. We will not study sampling in this lecture but
just give a list of the main methods below. More details can be found in Section 6.3 of
Business Statistics.
b) The best estimate of the unknown population mean, µ, is the sample mean, x̄ = Σx/n. This estimate of µ is often written µ̂ and referred to as 'mu hat'.
c) The best estimate of the unknown population standard deviation, σ, is the sample standard deviation s, where: s = √[Σ(x − x̄)²/(n − 1)]
32.53 22.27 33.38 41.47 38.05 25.27 26.78 30.97 38.07 38.06
31.47 38.00 43.16 29.05 22.20 25.11 24.11 43.48 32.93 42.04
1) p = (4/20) × 100 = 20%, so π̂ = 20%
2) x̄ = 658.44/20 = £32.92, so µ̂ = £32.92
The limits are called the confidence limits and the interval between them the confidence
interval.
The confidence interval is therefore an interval centred on the point estimate, in this case
either a percentage or a mean, within which we expect the population parameter to lie.
The width of the interval is dependent on the confidence we need to have that it does in
fact include the population parameter, the size of the sample, n, and its standard
deviation, s, if estimating means. These last two parameters are used to calculate the
standard error, s/√n, which is also referred to as the standard deviation of the mean.
The number of standard errors included in the interval is found from statistical tables -
either the normal or the t-table. Always use the normal tables for percentages which need
large samples. For means the choice of table depends on the sample size and the
population standard deviation:
The formulae √[p(100 − p)/n] and √[p(1 − p)/n] represent the standard errors of a percentage and a proportion respectively.
The samples must be large, ( >30), so that the normal table may be used in the formula.
We therefore estimate the confidence limits as being at z standard errors either side of the
sample percentage or proportion. The value of z, from the normal table, depends upon the
degree of confidence, e.g. 95%, required. We are prepared to be incorrect in our estimate
5% of the time and confidence intervals are always symmetrical so, in the tables we look
for Q to be 5%, two tails.
Example 2
In order to investigate shopping preferences at a supermarket a random sample of 175
shoppers were asked whether they preferred the bread baked in-store or that from the
large national bakeries. 112 of those questioned stated that they preferred the bread baked
in-store. Find the 95% confidence interval for the percentage of all the store's customers
who are likely to prefer in-store baked bread.
The point estimate for the population percentage, π, is p = (112/175) × 100 = 64%
175
Use the formula: p ± z√[p(100 − p)/n] where p = 64 and n = 175
From the short normal table 95% confidence 5%, two tails z = 1.96
p ± z√[p(100 − p)/n] = 64 ± 1.96√[(64 × 36)/175] = 64 ± 7.1
56.9% < π < 71.1%
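The arithmetic of this interval is easily checked by machine. The Python sketch below is an illustration only, using the same formula and values as above.

    import math

    p, n, z = 64.0, 175, 1.96                # sample %, sample size, 95% z value
    se = math.sqrt(p * (100 - p) / n)        # standard error of a percentage
    print(p - z * se, p + z * se)            # roughly 56.9 to 71.1 (%)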
Confidence interval for the Population Mean, , when the population standard
deviation, , is known.
Example 3: For the small supermarket as a whole it is known that the standard deviation
of the wages for part-time employees is £1.50.
A random sample of 10 employees from the small supermarket gave a mean wage of
£4.15 per hour. Assuming the same standard deviation, calculate the 95% confidence
interval for the average hourly wage for employees of the small branch and use it to see
whether the figure could be the same as for the whole chain of supermarkets which has a
mean value of £4.50.
Confidence Interval: x̄ ± zσ/√n where z comes from the short normal table
x̄ ± zσ/√n = 4.15 ± 1.96 × 1.50/√10 = 4.15 ± 0.93, giving £3.22 < µ < £5.08
This interval includes the mean, £4.50, for the whole chain so the average hourly wage
could be the same for all employees of the small supermarket.
Confidence interval for the Population Mean, , when the population standard
deviation is not known, so needs estimating from the sample standard deviation, s.
The population standard deviation is unknown, so the t-table, must be used to compensate
for the probable error in estimating its value from the sample standard deviation.
Example 4: Find the 99% confidence interval for the mean value of all the invoices, in
Example 1, sent out by the small supermarket branch. If the average invoice value for the
whole chain is £38.50, is the small supermarket in line with the rest of the branches?
Confidence Interval: x̄ ± t s/√n where the value of t comes from the table of 'percentage points of the t-distribution' using n − 1 degrees of freedom (ν = n − 1)
x̄ ± t s/√n = 32.92 ± 2.861 × 7.12/√20 = 32.92 ± 4.56, giving £28.36 < µ < £37.48
This interval does not include £38.50, so the small branch is out of line with the rest.
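Again the interval can be verified by computer. The Python sketch below is an illustration only; the t value is taken from the t-table as above, not computed.

    import math

    x_bar, s, n = 32.92, 7.12, 20   # sample statistics (as in the Minitab output)
    t = 2.861                       # t-table, 19 degrees of freedom, 99%
    half_width = t * s / math.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)   # about 28.36 to 37.48 (£)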
Comparison of Means using 'Overlap' in Confidence Intervals
We are going to extend the previous method to see if two populations could have the
same mean or, alternatively, if two samples could have come from the same population as
judged by their means. We assume that the standard deviations of both populations are
the same.
If, for example, a supermarket chain wished to see if a local advertising campaign was
successful or not they could take a sample of customer invoices before the campaign and
another after the campaign and calculate confidence intervals for the mean spending of all
customers at both times. If the intervals were found to overlap the means could be the
same so the campaign might have had no effect. If, on the other hand, they were quite
separate with the later sample giving the higher interval then the campaign must have
been effective in increasing sales.
Example 5
The till slips of supermarket customers were sampled both before and after an advertising
campaign and the results were analysed with the following results:
Has the advertising campaign been successful in increasing the mean spending of all
the supermarket customers? Calculate two 95% confidence intervals and compare the
results.
Before: µB: x̄B ± t sB/√nB = 37.60 ± 2.06 × 6.70/√25 = 37.60 ± 2.76
£34.84 < µB < £40.36
After: µA: x̄A ± t sA/√nA =
Interpretation: The sample mean had risen considerably but, because the confidence
intervals overlap, the mean values for all the sales may lie in the common ground. There
may be no difference between the two means so the advertising campaign has
not been proved to be successful.
A B C D E F G H I J
Before 42.30 55.76 32.29 10.23 15.79 46.50 32.30 78.65 32.20 15.90
After 43.09 59.20 31.76 20.78 19.50 50.67 37.32 77.80 37.39 17.24
We first need to calculate the differences. The direction doesn't matter but it seems
sensible to take the earlier amounts away from the later ones to find the changes:
A B C D E F G H I J
Diffs. 0.79 3.44 -0.53 10.55 3.71 4.17 5.02 -0.85 5.19 1.34
We can now forget the original two data sets and just work with the differences.
95% C.I. for µd: x̄d ± t sd/√nd = 3.28 ± 2.26 × 3.37/√10 = 3.28 ± 2.41
£0.87 < µd < £5.69
This interval does not include zero so the possibility of 'no change' has been eliminated.
Because the mean amount spent after the campaign is greater than that before it, there has
been a significant increase in spending.
In general: If the confidence interval includes zero, i.e. it changes from negative to
positive, then there is a possibility that no change has taken place and the original
situation has remained unchanged. If both limits have the same sign, as above, then zero
is excluded and some change must have taken place.
When interpreting this change look carefully at the direction of the change as this will
depend on your order of subtraction. In the above case it is obvious that x̄d is positive so there has been an increase in the average spending.
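The whole paired calculation can be reproduced as below. This Python sketch is an illustration only; as above, the t value comes from the t-table.

    import math
    from statistics import mean, stdev

    before = [42.30, 55.76, 32.29, 10.23, 15.79, 46.50, 32.30, 78.65, 32.20, 15.90]
    after  = [43.09, 59.20, 31.76, 20.78, 19.50, 50.67, 37.32, 77.80, 37.39, 17.24]

    diffs = [a - b for a, b in zip(after, before)]      # later minus earlier
    d_bar, s_d, n = mean(diffs), stdev(diffs), len(diffs)
    t = 2.26                                  # t-table, 9 degrees of freedom, 95%
    half_width = t * s_d / math.sqrt(n)
    print(d_bar - half_width, d_bar + half_width)       # about 0.87 to 5.69 (£)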
Example 1 (Minitab)
Variable N Mean StDev SE Mean 95.0 % CI
Invoices 20 32.92 7.12 1.59 ( 29.59, 36.25)
Descriptives (SPSS)
Example 6 (Minitab)
Variable N Mean StDev SE Mean 95.0 % CI
Diffs. 10 3.28 3.37 1.06 ( 0.87, 5.69)
Descriptives (SPSS)
HYPOTHESIS TESTING
Last week we estimated an unknown population parameter from the corresponding
statistic obtained from the analysis of a sample from the population. This week we shall
similarly analyse a sample and then use its statistic to see whether some claim made
about the population is reasonable or not. We shall test proportions and means.
When an estimate from a sample is used to test some belief, claim or hypothesis about
the population the process is known as hypothesis testing (Significance Testing).
Some conclusion about the population is drawn from evidence provided by the sample.
We shall test for proportions and means. A formal procedure is used for this technique.
1 State the null hypothesis (H0) - This is a statement about the population which may, or
may not, be true. It takes the form of an equation making a claim about the population.
H0: µ = £15 000 where µ is the population mean.
2 State the alternative hypothesis (H1) - the conclusion reached if the Null Hypothesis is
rejected. It will include one of the terms: not equal to (≠), greater than (>), or less than (<).
H1: µ ≠ £15 000, or H1: µ < £15 000, or H1: µ > £15 000
3 State the significance level (α) of the test - the proportion of the time you are willing to reject H0 when it is in fact true. If not stated specifically a 5% (0.05) significance level is used. From computer output, the p-value is the probability of obtaining a sample result at least as extreme as the one observed if H0 is true.
4 Find the critical value - this is the value from the appropriate table which the test statistic
is required to reach before we can reject H0. It depends on the significance level, α, and
whether the test is one- or two-tailed.
5 Calculate the test statistic - This value is calculated from the data obtained from the
sample. From the data we find the sample statistics, such as the mean and standard
deviation, which we need for calculating the test statistic; these are then substituted into
the appropriate formula to find its value. (Each formula is used in one specific type of
hypothesis test only and so needs careful selection.)
6 Reach a conclusion - Compare the values of the Test Statistic and the Critical Value. If
the Test Statistic is greater than the Critical Value then we have enough evidence from
the sample data to cause us to reject the null hypothesis and conclude that the Alternative
Hypothesis is true, e.g. the mean salary is not £15 000. If the Test Statistic is not greater
than the Critical Value then we do not have sufficient evidence and so do not reject the
null hypothesis, e.g. the mean salary for the whole firm could be £15 000.
(NB We can prove that the Null Hypothesis is false but we can never prove it to be true
since we do not have all the evidence possible as only a sample is available for analysis.)
Graphical interpretation of conclusion
If the test statistic falls into the critical region there is sufficient evidence to cause the null
hypothesis to be rejected. The total shaded area is the significance level, α (alpha), as either a percentage (5%) or a decimal (0.05). The methodology will seem simpler with
specific numerical examples!
Test Statistic: z = (p − π)/√[π(1 − π)/n], where p and π are the sample and population proportions respectively, and n is the sample size.
For a percentage use z = (p − π)/√[π(100 − π)/n]
Conclusion: H0 is rejected if the test statistic is so large that the sample could not have
come from a population with the hypothesised proportion.
Example 1
A company manufacturing a certain brand of breakfast cereal claims that 60% of all
housewives prefer its brand to any other. A random sample of 300 housewives include
165 who do prefer the brand. Is the true percentage as the company claims or lower at the
5% significance level?
Given Information: π = 60%?, p = (165/300) × 100 = 55%, n = 300
H0: π = 60%   H1: π < 60%
Critical Value: Normal tables, 1 tail, 5% significance, 1.645
Test Statistic: z = (p − π)/√[π(100 − π)/n] = (55 − 60)/√[(60 × 40)/300] = −5/√8 = −1.77
Conclusion: |Test statistic| (1.77) > critical value (1.645) so reject H0. The % preferring the brand is less than 60%.
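The test statistic can be checked as below. This Python sketch is an illustration only, following the percentage formula used above.

    import math

    pi, p, n = 60, 55, 300                 # claimed %, sample %, sample size
    se = math.sqrt(pi * (100 - pi) / n)    # sqrt(60 * 40 / 300) = sqrt(8)
    z = (p - pi) / se
    print(z)                               # about -1.77; |z| > 1.645, so reject H0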
Example 2
An auditor claims that 10% of invoices for a certain company are incorrect. To test this
claim a random sample of 200 invoices are checked and 24 are found to be incorrect. Test
at the 1% significant level to see if the auditor's claim is supported by the sample
evidence.
H0: π = 10%   H1: π ≠ 10%
Test Statistic:
z = (p − π)/√[π(100 − π)/n]
As with confidence intervals, before deciding which test to use we have to ask whether
the sample size is large or small and whether the population standard deviation is known
or has to be estimated from the sample. The methods used are basically the same but
different formulae and tables need to be used.
Method
A mean value is hypothesised for the population. A sample is taken from that population
and its mean value calculated. This sample mean is then used to see if the value
hypothesised for the population is reasonable or not. If the population standard deviation
is not known then that from the sample is calculated and used to estimate it. The
appropriate test to use, either a t-test or a z-test, depends on whether this estimated value
has been used or not.
If we know the population standard deviation and do not need to estimate it the normal
table is used. If we need to estimate it then a slightly larger critical value is found from
the t-table.
Test statistic: z = (x̄ − µ)/(σ/√n) or t = (x̄ − µ)/(s/√n)
where x̄ and µ are the means of the sample and population respectively, and σ and s are the population and sample standard deviations.
Conclusion Compare the test statistic with the critical value. Decide whether to reject Ho
or not.
Conclude in terms of question.
Example 3
A packaging device is set to fill detergent packets with a mean weight of 150g. The
standard deviation is known to be 5.0g. It is important to check the machine periodically
because if it is overfilling it increases the cost of the materials, whereas if it is
underfilling the firm is liable to prosecution. A random sample of 25 filled boxes is
weighed and shows a mean net weight of 152.5g. Can we conclude that the machine is no
longer producing the 150g. quantities? Use a 5% significance level. (0.05 sig. level)
Given: µ = 150 g?, σ = 5 g, n = 25, x̄ = 152.5 g
σ is known so a z-test is appropriate. The machine can 'no longer produce the 150 g quantities' with quantities which are either too heavy or too light, therefore the appropriate test is two-tailed.
H0: µ = 150 g   H1: µ ≠ 150 g
Critical value: normal tables, 5%, two-tailed: 1.96
Test statistic: z = (152.5 − 150)/(5/√25) = 2.50
Conclusion: 2.50 > 1.96, so reject H0; the machine is no longer producing 150 g quantities.
Example 4
The mean and standard deviation of the weights produced by this same packaging device
set to fill detergent packets with a mean weight of 150g, are known to drift upwards over
time due to the normal wearing of some bearings. Obviously it cannot be allowed to drift
too far so a large random sample of 100 boxes is taken and the contents weighed. This
sample has a mean weight of 151.0 g and a standard deviation of 6.5g. Can we conclude
that the mean weight produced by the machine has increased? Use a 5% significance
level. (0.05 sig. level)
We are only interested in whether the mean weight has increased or not so a one-tailed test is appropriate.
H0: µ = 150 g   H1: µ > 150 g
Critical value: σ is unknown but the sample is large, therefore use the normal tables, 5%, one-tailed: 1.645
Test statistic: z = (x̄ − µ)/(s/√n) = (151.0 − 150)/(6.5/√100) = 1.54
Conclusion: 1.54 < 1.645 so do not reject H0; the mean weight could still be 150 g.
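Both z-tests can be verified with a single helper. The Python sketch below is an illustration only; the function name is our own.

    import math

    def z_test(x_bar, mu, sd, n):
        """Test statistic z = (x_bar - mu) / (sd / sqrt(n))."""
        return (x_bar - mu) / (sd / math.sqrt(n))

    print(z_test(152.5, 150, 5.0, 25))    # Example 3: 2.50 > 1.96, reject H0
    print(z_test(151.0, 150, 6.5, 100))   # Example 4: 1.54 < 1.645, do not reject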
Test this hypothesis against the alternative that the mean mark is less than 100, at the 1%
significance level.
H0: µ = 100   H1: µ < 100
Test statistic: t = (x̄ − µ)/(s/√n)
Method
Example 6 A training manager wishes to see if there has been any alteration in the
aptitude of his trainees after they have been on a course. He gives each an aptitude test
before they start the course and an equivalent one after they have completed it. The
scores are recorded below. Has any change taken place at a 5% significance level?
Trainee A B C D E F G H I
Score before Training 74 69 45 67 67 42 54 67 76
Score after Training 69 76 56 59 78 63 54 76 75
First the 'changes' are computed and then a simple t-test is carried out on the result.
H0: µd = 0; H1: µd 0.
Test statistic: t = (x̄d − 0)/(sd/√n) where x̄d = 5.0, sd = 9.21, n = 9, giving t = 1.63.
Critical value: t-tables, 8 degrees of freedom, 5%, two-tailed: 2.306.
Conclusion T.S. less extreme than C.V. so do not reject H0. There may be no change.
Only the most commonly used of hypothesis tests have been included in this lecture.
You will meet others in the next few weeks and a further selection, including non-
parametric tests, are included in 'Business Statistics', Chapter 7.
Problem: Why can't we just carry out repeated t-tests on pairs of the variables?
If many independent tests are carried out pairwise then the probability of being correct
for the combined results is greatly reduced. For example, if we compare the average
marks of two students at the end of a semester to see if their mean scores are significantly
different we would have, at a 5% level, a 0.95 probability of being correct. Comparing more students pairwise, the probability of every conclusion being correct falls rapidly: for three independent tests it is 0.95³ ≈ 0.86, and for ten it is 0.95¹⁰ ≈ 0.60.
Solution: We need therefore to use methods of analysis which will allow the variation
between all n means to be tested simultaneously giving an overall probability of 0.95
of being correct at the 5% level. This type of analysis is referred to as Analysis of
Variance or 'ANOVA' in short. In general, it:
measures the overall variation within a variable;
finds the variation between its group means;
combines these to calculate a single test statistic;
uses this to carry out a hypothesis test in the usual manner.
In general, Analysis of Variance, ANOVA, compares the variation between groups and
the variation within samples by analysing their variances.
One-way ANOVA: Is there any difference between the average sales at various
departmental stores within a company?
Two-way ANOVA: Is there any difference between the average sales at various stores
within a company and/or the types of department? The overall variation is split 'two
ways'.
One-way ANOVA: Total variation (SST) = Between-groups variation (SSG) + Error variation (SSE)
Two-way ANOVA: Total variation (SST) = Between-groups variation (SSG) + Between-blocks variation (SSBl) + Error variation (SSE)
where SST = Total Sum of Squares; SSG = Treatment Sum of Squares between the groups; SSBl = Blocks Sum of Squares; SSE = Sum of Squares of Errors.
(At this stage just think of 'sums of squares' as being a measure of variation.)
The method of measuring this variation is variance, which is standard deviation squared.
Total variance = between groups variance + variance due to the errors
It follows that: Total sum of = Sum of squares between + Sum of squares due
squares (SST) the groups (SSG) to the errors (SSE)
If we find any two of the three sums of squares then the other can be found by difference.
In practice we calculate SST and SSG and then find SSE by difference.
Since the method is much easier to understand with a numerical example, it will be
explained in stages using theory and a numerical example simultaneously.
Example 1
One important factor in selecting software for word processing and database management
systems is the time required to learn how to use a particular system. In order to evaluate
three database management systems, a firm devised a test to see how many training hours
were needed for five of its word processing operators to become proficient in each of
three systems.
System A 16 19 14 13 18 hours
System B 16 17 13 12 17 hours
System C 24 22 19 18 22 hours
Using a 5% significance level, is there any difference between the training time needed
for the three systems?
In this case the 'groups' are the three database management systems. These account for
some, but not all, of the total variance. Some, however, is not explained by the difference
between them. The residual variance is referred to as that due to the ‘errors’.
In the lecture on summary statistics we saw that the standard deviation is calculated by:
sn = √[Σ(x − x̄)²/n]   so   Σ(x − x̄)² = n s²n, with s from the calculator using [σn]
or sn-1 = √[Σ(x − x̄)²/(n − 1)]   so   Σ(x − x̄)² = (n − 1)s²n-1, with s from the calculator using [σn-1]
Both methods estimate exactly the same value for the total sum of squares.
Total SS: Input all the data individually and output the values for n, x̄ and σn from the calculator in SD mode. Use these values to calculate σ²n and nσ²n.
                  n      x̄       σn      σ²n      nσ²n
Total SS         15    17.33    3.419   11.69    175.3 = SST
                  n      x̄       σn      σ²n      nσ²n
SS for Systems   15    17.33    2.625   6.889    103.3 = SSSys
Source            Sum of squares    d.f.                Mean SS              F
Between groups    SSG               k − 1               MSG = SSG/(k − 1)    F = MSG/MSE
Errors            SSE               (N − 1) − (k − 1)   MSE = SSE/(N − k)
Total             SST               N − 1
Method
Fill in the total sum of squares, SST, and the between groups sum of squares, SSG,
after calculation; find the sum of squares due to the errors, SSE, by difference;
the degrees of freedom, d.f., for the total and the groups are one less than the total
number of values and the number of groups respectively; find the error degree of
freedom by difference;
the mean sums of squares, M.S.S., is found in each case by dividing the sum of
squares, SS, by the corresponding degrees of freedom.
The test statistic, F, is the ratio of the mean sum of squares due to the differences
between the group means and that due to the errors.
Source     Sum of squares    d.f.            Mean SS    F
Systems    103.3             3 − 1 = 2       51.67      8.61
Errors     72.0              14 − 2 = 12     6.00
Total      175.3             15 − 1 = 14
The null hypothesis, H0, is that all the group means are equal. H0: µ1 = µ2 = µ3 = µ4 etc.
The alternative hypothesis, H1, is that at least two of the group means are different.
The critical value is from the F-tables, Fν1,ν2, with the two degrees of freedom from the groups, ν1, and the errors, ν2.
The test statistic is the F-value calculated from the sample in the ANOVA table.
The conclusion is reached by comparing the test statistic with the critical value and
rejecting the null hypothesis if the test statistic is the larger of the two.
Example 1 (cont.)
Critical value: F2,12 at 5% = 3.89. The test statistic, 8.61, is larger, so reject H0: the three systems do not all need the same mean training time.
The difference needed between two group means before it is judged significant uses:
CD = t√[MSE(1/n1 + 1/n2)] = 1.78√[6.00(1/5 + 1/5)] = 2.76
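The one-way ANOVA sums of squares can be reproduced directly from the data. The Python sketch below is an illustration only; it follows the definitions above rather than any packaged ANOVA routine.

    from statistics import mean

    systems = {"A": [16, 19, 14, 13, 18],
               "B": [16, 17, 13, 12, 17],
               "C": [24, 22, 19, 18, 22]}

    all_values = [x for group in systems.values() for x in group]
    grand_mean = mean(all_values)

    sst = sum((x - grand_mean) ** 2 for x in all_values)          # total SS
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2               # between groups
              for g in systems.values())
    sse = sst - ssg                                               # by difference

    k, n = len(systems), len(all_values)
    msg, mse = ssg / (k - 1), sse / (n - k)
    print(sst, ssg, sse)        # about 175.3, 103.3, 72.0
    print(msg / mse)            # F statistic, about 8.6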
Two-way ANOVA
In the above example it might have been reasonable to suggest that the five Operators
might have different learning speeds and were therefore responsible for some of the
variation in the time needed to master the three Systems. By extending the analysis from
one-way ANOVA to two-way ANOVA we can find out whether Operator variability is a
significant factor or whether the differences found previously were just due to the
Systems.
Example 2 Operators
1 2 3 4 5
System A 16 19 14 13 18
System B 16 17 13 12 17
System C 24 22 19 18 22
Again we ask the same question: using a 5% level, is there any difference between the
training time for the three systems? We can use the Operator variation simply to explain
some of the previously unexplained error, thereby reducing it (a 'blocked' design), or we
can consider it in a similar manner to the System variation in the last example in order
to see if there is a difference between the Operators.
In the first case the 'groups' are the three database management systems and the 'blocks'
being used to reduce the error are the different operators who themselves may differ in
speed of learning. In the second we have a second set of groups - the Operators.
In 2-way ANOVA we find SST, SSSys, SSOps and then find SSE by difference.
SSE = SST - SSSys - SSOps
We already have SST and SSSys from 1-way ANOVA but still need to find SSOps.
Operators
1 2 3 4 5 Means
System A 16 19 14 13 18 16.00
System B 16 17 13 12 17 15.00
System C 24 22 19 18 22 21.00
Means 18.67 19.33 15.33 14.33 19.00 17.33
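The two-way sums of squares can be checked with a short NumPy sketch; the code below is an illustration of the SST - SSSys - SSOps arithmetic, not the calculator method used in the lecture.

    import numpy as np

    # Rows = Systems A, B, C; columns = Operators 1 to 5 (Example 2).
    x = np.array([[16, 19, 14, 13, 18],
                  [16, 17, 13, 12, 17],
                  [24, 22, 19, 18, 22]], dtype=float)

    grand_mean = x.mean()
    sst = ((x - grand_mean) ** 2).sum()                               # total SS
    ss_sys = x.shape[1] * ((x.mean(axis=1) - grand_mean) ** 2).sum()  # Systems
    ss_ops = x.shape[0] * ((x.mean(axis=0) - grand_mean) ** 2).sum()  # Operators
    ss_err = sst - ss_sys - ss_ops                                    # by difference
    print(sst, ss_sys, ss_ops, ss_err)  # about 175.3, 103.3, 64.7, 7.3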
Correlation and regression are concerned with the investigation of two continuous
variables.
Previously we have only considered a single variable - now we look at two variables.
Correlation describes the strength of the relationship. It is not concerned with 'cause'
and 'effect'.
Regression describes the relationship itself in the form of a straight line equation which
best fits the data. Curved relationships are considered in Chapter 9 of Business Statistics.
Methodology
Some initial insight into the relationship between two continuous variables can be
obtained by plotting a scatter diagram and simply looking at the resulting graph.
Does the relationship appear to be linear or curved?
If the relationship is found to be significantly strong, then its nature can be found
using linear regression, which defines the straight line equation of the line of
'best fit' through the bi-variate data.
The 'goodness of fit' can be calculated to see how well the line fits the data.
Example
As an example we shall consider whether there is any relationship between 'Ice cream
Sales' for an ice-cream manufacturer and 'Average Monthly Temperature'. The collected
data is recorded below:
Scatter diagrams
We are looking for a linear relationship with the bivariate points plotted being reasonably
close to the, yet unknown, 'line of best fit'. At this stage no 'causality' is implied but it
makes sense to use the same diagram for the addition of the regression line later, so the
potentially dependent variable should be identified and plotted on the vertical axis.
[Scatter diagram: Sales (£'000) plotted against Av.Temp., with the points lying close to a straight line]
The plot seems to indicate a straight line relationship, with all points fairly close to a 'line of best
fit'. The strength of the relationship will therefore be quantified by calculating the
correlation coefficient and testing it for significance.
Pearson's correlation coefficient is calculated for situations in which the data can be
measured quantitatively, i.e. either interval or ratio data. It compares how the variables
vary together with how they each vary individually and is independent of the origin or
units of the measurements. It is a ratio of the combined variance (covariance) with
individual variances and so has no units itself.
The value of the correlation coefficient, r, is best produced directly from a calculator in
LR mode. (See Appendix 1)
(If you do not have a calculator capable of carrying out correlation and regression
calculations it will be necessary for you to calculate them by tabulation and use of the
formulae shown in Appendix 2.)
Correlation Coefficient: For this data the value of r = 0.9833 ( from 'shift (' )
(Calculator usage will be practised in tutorials.)
Is the size of this correlation coefficient, 0.9833, large enough to claim that there is a
relationship between average monthly temperature and sales of ice-cream? The test
statistic obtained above in this case seems very close to 1, but is it close enough?
Null hypothesis, H0: There is no association between ice-cream sales and average
monthly temperature.
Conclusion: The test statistic exceeds the critical value so we reject the Null Hypothesis
and conclude that there is an association between ice-cream sales and average monthly
temperature.
Regression equation
Since we now know there is a significant relationship between the two variables, the next
obvious step is to define it. Having defined it, we can then add it to the scatter diagram
and also, assuming a causal relationship, use it to predict the next month's ice cream sales
if we know the figure for the average temperature which has been forecast for the coming
month.
As with the correlation coefficient, the coefficients of the regression equation, a and b,
can be found directly from the calculator in LR mode (shift 7 and shift 8), otherwise they
need to be calculated as shown in Appendix 2.
The regression line is described, in general, as the straight line of ‘best fit’ with the
equation:
y = a + bx
where x and y are the independent and dependent variables respectively, a the intercept
on the y-axis, and b the slope of the line.
The values of a and b for this data, as produced from a calculator in linear regression
(LR) mode, are 45.52 (shift 7) for a and 5.448 (shift 8) for b giving the regression line:
y = 45.5 + 5.45x
To draw this line on the scatter diagram, any three points are plotted and joined up.
These points may be the intercept, (0, a), the centroid, (x̄, ȳ), and/or any other points
calculated from the regression equation, as long as they are in the region of the observed data.
e.g. If x = 15 then y = 45.5 + 5.45 × 15 = 127.3
(For any value of x, the corresponding value of y can be produced directly from the
calculator in LR mode, using the key ŷ .)
[Scatter diagram of Sales against Av.Temp. with the regression line added]
Goodness of Fit
How well does this line fit the data? Goodness of fit is measured by (r² × 100)%.
The correlation coefficient r was 0.983 so we have (0.983)² × 100 = 96.6% fit. This high
value indicates that any predictions made about y from a value of x will be good.
The 'goodness of fit' indicates the percentage of the variation in Ice-cream Sales which is
accounted for by the variation in average monthly temperature.
Prediction of Sales
Suppose that the Ice-cream manufacturer knows that the estimated average temperature
for the following month is 14°C, what would he expect his Sales to be?
The best estimate of the Sales is obtained by substituting the value of 14 for that of the
independent variable, x, and calculating the corresponding value of the Sales, y:
y = 45.52 + 5.448 × 14 = 121.8, i.e. Sales of about £121 800. This can be done more
easily directly by calculator in LR mode: type in 14, press ŷ.
Analysis of Variance
SOURCE DF SS MS F p
Regression 1 7420.2 7420.2 292.34 0.000
Error 10 253.8 25.4
Total 11 7674.0
Unusual Observations
Obs. Av.Temp. Sales Fit Stdev.Fit Residual St.Resid
2 4.0 57.00 67.31 2.40 -10.31 -2.33R
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/DEPENDENT sales
/METHOD=ENTER av_temp .
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .983(a)   .967       .964                5.04
a. Predictors: (Constant), Average monthly temperature
ANOVA(b)
Model          Sum of Squares   df   Mean Square   F         Sig.
1 Regression   7420.176         1    7420.176      292.335   .000(a)
  Residual     253.824          10   25.382
  Total        7674.000         11
a. Predictors: (Constant), Average monthly temperature
b. Dependent Variable: Ice cream sales (£'000)
Coefficients(a)
                                Unstandardized Coefficients   Standardized Coefficients
Model                           B        Std. Error           Beta                        t        Sig.
1 (Constant)                    45.520   3.503                                            12.996   .000
  Average monthly temperature   5.448    .319                 .983                        17.098   .000
a. Dependent Variable: Ice cream sales (£'000)
Appendix 1
The data is entered as pairs of numbers. Within the many memories of the calculator
these numbers and their squares are accumulated and the correlation coefficient is then
calculated using the formulae in the appendix from these stored sums.
The method described below produces the correlation coefficient, r, the regression
coefficients, A and B, and estimates values of y for given values of x for a Casio
calculator. For any other make, and possibly some Casio models, refer to the calculator
handbook.
Data in:
1 Clear all memories Shift AC
2 Enter linear regression mode Mode 3
3 Input variables x and y together for each case x,y DT
Repeat to end of data.
Results out:
4 Check number of pairs entered RCL Red C
5 Output correlation coefficient (r) Shift ( 'open bracket'
6 Output intercept (a) and slope (b) of regression line Shift 7, Shift 8.
7 To get y value for plotting line when x = 15 15 ŷ
Appendix 2
The value of the correlation coefficient is best produced directly from a calculator in LR
mode. Otherwise the following calculations are necessary:
The formula used to find Pearson's Product Moment Correlation coefficient is:
r = Sxy / √(Sxx Syy),   with -1 ≤ r ≤ 1,
where Sxx = Σx² - (Σx)²/n
      Syy = Σy² - (Σy)²/n
      Sxy = Σxy - (Σx)(Σy)/n
where Σx means the sum of all the x values, etc.
The following example gives ice-cream sales per month against mean monthly
temperature (Celsius). You need the values of:
Σx, Σy, Σx², Σy², Σxy and n.
Calculations
Sxx = 1450 - 120²/12 = 250
Syy = 127674 - 1200²/12 = 7674
Sxy = 13362 - (120 × 1200)/12 = 1362
r = 1362 / √(250 × 7674) = 0.9833
Regression equation
As with the correlation coefficient, the regression equation can be produced directly from
a calculator in LR mode. Otherwise further use is made of the values of
Σx, Σy, Σx², Σy², Σxy and n (all produced previously).
The regression line is described, in general, as the straight line with the equation:
y = a + bx
where x and y are the independent and dependent variables respectively, a the intercept
on the y-axis, and b the slope of the line.
b = Sxy / Sxx
where Sxy = Σxy - (Σx)(Σy)/n   and   Sxx = Σx² - (Σx)²/n
Since the regression line passes through the centroid, both means, its equation can be
used to find the value of a, the intercept on the y-axis:
ȳ = a + b x̄,   so   a = ȳ - b x̄
The values of a and b are therefore 45.5 and 5.45 respectively giving the regression
line:
y = 45.5 + 5.45x
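Continuing the same sketch, the regression coefficients follow directly from the sums already calculated:

    # b = Sxy/Sxx and a = y-bar minus b times x-bar, using the values above.
    n, sum_x, sum_y = 12, 120, 1200
    Sxx, Sxy = 250.0, 1362.0
    b = Sxy / Sxx                        # 5.448
    a = sum_y / n - b * (sum_x / n)      # 45.52
    print(a, b)                          # giving y = 45.5 + 5.45x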
In this type of analysis we have two characteristics, such as gender and eye colour, which
cannot be measured but which can be used to group people by variations within them.
These characteristics may, or may not, be associated in some way. How can we decide?
We can take a random sample from the population, note which variation of each
characteristic is appropriate for each case and then cross-tabulate the data. It is then
analysed in order to see if the proportions of each characteristic in the sub-samples are the
same as the overall proportions - easier to do than describe! As an example, if there is no
relationship between gender and eye colour we would expect similar proportions of males
and females to have blue eyes.
The Variables
If the variables are, as is usually the case, nominal, (described by name only), frequencies
may be cross-tabulated by each category within each variable. Ordinal variables may be
used if there are only a limited number of orders so that each one can be classified as a
separate category. Continuous variables may be grouped and then tabulated similarly
though the results will then vary according to the grouping categories.
Expected values
If the two variables, the characteristics under scrutiny, are completely independent, the
proportions within the sub-totals of the contingency table would be expected to be the
same as those of the totals for each variable. In practice we work with frequencies rather
than proportions, distinguishing between 'observed' and 'expected' frequencies by
enclosing the latter within brackets. If gender and eye colour are independent and if a
third of the population has blue eyes, we would expect a third of males to be blue eyed
and a third of females to be blue eyed.
These proportions are obviously contrived so as to be easy to work with. How can we
cope with more awkward numbers? In the on-going example, the proportions are first
calculated as fractions which are then multiplied by the total frequency to find the
expected individual cell frequencies. This produces a formula which is applicable in all
cases:
For any cell the expected frequency is calculated by:
expected frequency = (Row total × Column total) / Overall total,
where the relevant row and column are those crossing in that particular cell.
Chi-squared (χ²) Test for Independence
The hypothesis test which is carried out in order to see if there is any association between
categorical variables, such as gender and eye colour, is known as the chi-squared (χ²) test.
Example 1
The following table compiled by a personnel manager relates to a random sample of 180
staff taken from the whole workforce of the supermarket chain. We shall, in this example,
test for association between a member of staff's gender and his/her type of job, at the 5%
level of significance.
                Male   Female
Supervisor       20      15
Shelf stacker    20      30
Till operator    10      35
Cleaner          10      40
Total
Completing the row and column totals, as previously with probability, gives the full table.
In this example we randomly selected a sample of 180 supermarket staff and found that
two thirds, 120, of them were female and one third, 60, were male. Assuming there is no
association between gender and job category and finding that we have 45 till operators, we
would expect two thirds, 30, of them to be female and one third, 15, of them to be male.
Note that these figures are a quarter of each gender respectively, which checks since till
operators, 45, form a quarter of the total staff, 180.
We now calculate the other expected frequencies from the probabilities and put them
into the table: (See Section 4.8)
P (Supervisor) = 35/180
P (Male) = 60/180
P (Supervisor and male) = 35/180 x 60/180 assuming independence.
Therefore the expected number of Male supervisors = (35/180) × (60/180) × 180 = (35 × 60)/180 = 11.67
This is the expected frequency for members of staff who are both male and a supervisor.
Note that it is a theoretical number which does not have to be an integer.
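The cell-by-cell rule translates directly into code; a small sketch (the function name is mine):

    def expected_frequency(row_total, col_total, grand_total):
        """Expected cell count if the two variables are independent."""
        return row_total * col_total / grand_total

    # Male supervisors: 35 supervisors, 60 males, 180 staff in total.
    print(round(expected_frequency(35, 60, 180), 2))  # 11.67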
Calculating the other expected frequencies and inserting them in the table (in brackets):
These are the frequencies which would be expected if there is no association between
gender and job category at the supermarket.
If the expected frequencies are observed to actually occur in practice then we can deduce
that the two variables are indeed independent.
We would obviously not expect to get exact agreement with the expected frequencies, so
some critical amount of difference is allowed and we compare the difference from our
observations with that allowed by the use of a standard table. Are the values observed so
different to those expected that we must reject the idea of independence? Or are the
results just due to sampling errors, with the variables actually being independent?
It is to be hoped that you recognise the need for a hypothesis test!
1) State Null Hypothesis, H0, (that of no association) and Alternative Hypothesis, H1.
2) Record observed frequencies, O, in each cell of the contingency table.
3) Calculate row, column and grand totals.
4) Calculate expected frequency, E, for each cell: E = (row total × column total) / grand total.
Note that: No expected frequency should be less than 1 and the number of expected
frequencies below 5 should not be over 20% of the total number of cells. Otherwise
the test is invalid.
5) Find the critical value from the χ² tables, using the appropriate significance level and
degrees of freedom.
6) Calculate the test statistic, Σ (O - E)²/E.
7) Compare the two values and conclude whether the variables are independent or not.
In example 1, we have already carried out steps 2, 3 and 4 of the procedure by calculating
the expected values. Whether these are calculated before, or during, the test is up to
personal preference. Some statisticians also prefer to calculate the test statistic, as this
procedure is rather lengthy, before starting the test and to then insert the calculated value
in the formal hypothesis test.
The test statistic is an overall measure of the difference between the expected and
observed frequencies. Each cell difference is squared, so that positive and negative
differences carry the same weight, and is then divided by the expected cell frequency so
that each contribution is in proportion to the size of the cell. When the contributions from
all the cells are totalled, their sum is compared with a critical value from the chi-squared
table - hence the name of this test.
Null Hypothesis (H0): There is no association between gender and job category.
(Remember that 'null' means none.)
Alternative Hypothesis (H1): There is an association between gender and job category.
Critical value: χ² (5%, ν = 3) = 7.816
Test statistic
The test statistic is calculated from the contingency table which includes both the
observed and the expected values for the frequency of staff. The data may be tabulated, as
we shall do in this example, or the contribution of each cell may be calculated directly as
(O - E)²/E and then the test statistic found as the sum of these contributions:
Test statistic = Σ (O - E)²/E
O        E        (O - E)     (O - E)²/E
20 11.67 8.33 5.946
15 23.33 -8.33 2.974
20 16.67 3.33 0.665
30 33.33 -3.33 0.333
10 15.00 -5.00 1.667
35 30.00 5.00 0.833
10 16.67 -6.67 2.669
40 33.33 6.67 1.335
Total 16.422
Conclusion: the test statistic, 16.42, exceeds the critical value, 7.816, so we reject H0 and
conclude that there is an association between gender and job category.
Looking again at the data we can see that far more males than expected were supervisors
or shelf stackers and more females were cleaners or till operators.
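For comparison, SciPy can reproduce the whole test from the observed table alone; a minimal sketch (chi2_contingency returns the test statistic, p-value, degrees of freedom and expected counts):

    import numpy as np
    from scipy import stats

    # Rows = job categories; columns = (male, female).
    observed = np.array([[20, 15],
                         [20, 30],
                         [10, 35],
                         [10, 40]])

    chi2, p, dof, expected = stats.chi2_contingency(observed)
    print(round(chi2, 2), dof)  # about 16.42 with 3 d.f., as calculated by hand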
Example 2
In this example we first have to set up the contingency table from the following
information collected from a questionnaire:
We first draw up a contingency table showing these results and then test whether attitude
towards future wage restraint is dependent on the type of employment.
Setting up the table: in this example there are three types of employee giving four
different responses, i.e. we have a 3 x 4 (or a 4 x 3) table.
Adding extra rows and columns for the subtotals and titles we need 5 x 6 cells.
Have a go at compiling the table. As you come to each number in the frequency of
response above insert it into the appropriate place; then find the missing figures by
difference. There is sufficient information here to enable you to complete your table.
When complete, check with that below before calculating the expected values.
Stackers
Sales staff
Administrators
Total
Stackers 1 7 24 8 40
Sales staff 3 9 34 10 56
Administrators 16 24 22 2 64
Total 20 40 80 20 160
Hypothesis test
Null Hypothesis (H0): There is no association between job category and attitude towards
wage restraint.
Alternative Hypothesis (H1): There is an association between job category and attitude
towards wage restraint.
Critical value:
Number of degrees of freedom (ν) = (r - 1)(c - 1) = (3 - 1)(4 - 1) = 6
Level of significance = 5%
χ² table, 5%, 6 degrees of freedom = 12.59
Test statistic = Σ (O - E)²/E
(Complete the table.)
O        E        (O - E)     (O - E)²/E
1 5 -4 3.200
7 10 -3 0.900
24 20 +4 0.800
8 5 +3 1.800
3 7 -4 2.286
9 14 -5 1.786
34
10
16
24
22
2
Total 32.969
Stackers 1 7 24 8 40
Sales staff 3 9 34 10 56
Administrators 16 24 22 2 64
Total 20 40 80 20 160
Hypothesis test
Null Hypothesis (H0): There is no association between job category and attitude towards
wage restraint.
Alternative Hypothesis (H1): There is an association between job category and attitude
towards wage restraint.
Critical value:
Number of degrees of freedom (ν) = (r - 1)(c - 1) = (3 - 1)(4 - 1) = 2 × 3 = 6
Level of significance = 5%
χ² table, Table 5 in Appendix D, 5%, 6 degrees of freedom = 12.59
Test statistic = Σ (O - E)²/E
O        E        (O - E)     (O - E)²/E
1 5 -4 3.200
7 10 -3 0.900
24 20 +4 0.800
8 5 +3 1.800
3 7 -4 2.286
9 14 -5 1.786
34 28 +6 1.286
10 7 +3 1.286
16 8 +8 8.000
24 16 +8 4.000
22 32 -10 3.125
2 8 -6 4.500
Total 32.969
Conclusion: the test statistic, 32.969, exceeds the critical value, 12.59, so we reject H0
and conclude that there is an association between job category and attitude towards
wage restraint.
INDEX NUMBERS
In business, managers may be concerned with the way in which the values of most
variables change over time: prices paid for raw materials; numbers of employees and
customers, annual income and profits, etc. Index numbers are one way of describing
such changes. (See Chapter 11 of Business Statistics)
Index numbers were originally developed by economists for monitoring and comparing
different groups of goods. It is necessary in business to be able to understand and
manipulate the different published index series, and to construct your own index series.
Index numbers measure the changing value of a variable over time in relation to its value
at some fixed point in time, the base period, when it is given the value of 100.
Such indexes are often used to show overall changes in a market, an industry or the
economy. For example, an accountant at a supermarket chain could construct an index of
the chain's own sales and compare it to the index of the volume of sales for the overall
supermarket industry. A graph of the two indexes will illustrate the company's
performance within the sector.
Company
Market
100
Time
Base period Parity
Index values are measured in percentage points since the base value is always 100.
Constructing an Index
Index for any time period n = (value in period n / value in base period) × 100
If we have a series of index numbers describing index linked values and know the value
associated with any one of them, the other values can be found by scaling the known
value in the same ratio as the relevant index numbers.
Example 2 The following table shows the monthly price index for an item:
Month 1 2 3 4 5 6 7 8 9 10 11 12
If the price in month 3 is £240, what is the price in each of the other months?
Month 1: The index numbers in months 3 and 1 are 98 and 121 respectively. 98 is
equivalent to a price of £240 so for month 1 this must be scaled in the same ratio as the
index numbers.
Price in month 3 is £240, so the price in month 1 is £240 × 121/98 = £296.3.
Price in month 2: £240 × 112/98 = £274.3
Price in month 4:
Price in month 5:
Price (£) 296 274 240 140 218 267 320 360 323 309
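The scaling rule is easy to capture in a small helper function; a sketch (the function and its defaults are mine, built on the month-3 figures above):

    def price_from_index(index_n, known_index=98, known_price=240.0):
        """Scale the known price in the ratio of the index numbers."""
        return known_price * index_n / known_index

    print(round(price_from_index(121), 1))  # month 1: 296.3
    print(round(price_from_index(112), 1))  # month 2: 274.3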
Price indexes
The most commonly quoted index is the Retail Price Index, the RPI, which is a monthly
index indicating the amount spent by a 'typical household'. It gives an aggregate value
based upon the prices of a representative selection of hundreds of products and services
with weightings produced from the 'Family expenditure survey' as a measure of their
importance.
The RPI shows price changes clearly and has many important practical implications
particularly when considering inflation in terms of wage bargaining, index linked
benefits, insurance values, etc.
Other Indexes
An index can be used to measure the way any variable changes over time:
unemployment numbers; number of car registrations; gross national product; etc.
The choice of base period can be any convenient time - it is periodically adjusted.
Indexes can be calculated with any convenient frequency: yearly, monthly, daily, etc.
Yearly: G.N.P. Gross National Product
Monthly: Unemployment figures
Daily: Stock market prices, etc.
One of the most important uses of indexes is to show how the price of a product changes
over time due to changing costs of raw materials; variable supply and demand; changes
in production processes; general inflation; etc. It is often of interest to compare it to the
R.P.I. or to indexes of the prices of competitors' products.
Changes over time are measured either as a percentage point change, found by subtraction,
or as a percentage change, calculated by (change in index number / original index number) × 100.
Example 3
We need therefore to know how to change a base period. Published index numbers may
include a sudden change such as a series 470, 478, 485 becoming series 100, 103, 105,
etc. We need to convert all the numbers to a common base year so that they can be
analysed as a whole continuous series.
In order to change from one index series to another we need the value in the first series
for the same period as that in which the rebasing to 100 for the second takes place. The
ratio of these two values forms the basis of any conversion between them.
Example 4
The amount of money spent on advertising by the supermarket, which is index linked, is
described by the following indexes:
Year 1 2 3 4 5 6 7 8
Typically each index series is completed and then, given the index linked amount spent
on advertising for any one year, it is used to estimate that for any of the other years.
b) Missing values for Index 1 and Index 2: (Both values known for year 5)
From year 5: ratio of old : new = 220 : 100 (Note: All new values will be less)
From year 5: ratio of new : old = 100 : 220 (Note: All old values will be more)
Now complete both index series checking that each is adjusted in the right direction.
c) The values for amount spent on advertising can be calculated for all years if any one
year is known by multiplying that value by the ratio of the relevant index numbers
from either series.
Given that the advertising expenditure was known to be 4860 in year 3, that for:
Year 1 = 4860 × 100/162 = 3000   or   4860 × 45.45/73.64 = 3000
Year 2 = 4860 × 138/162 = 4140   or   4860 × 62.73/73.64 = 4140
Now complete the advertising expenditures in the table. There are alternative
ways of doing this depending on which set of index numbers are used.
Most published price indexes are produced from many items rather than single ones, as
used by us previously. Simple price aggregate indexes just add together the prices of all
the items included in the 'basket of goods', assuming each occurs just once.
Weighted aggregate indexes, such as the R.P.I., weight each price according to the
quantity bought in a particular time period which may be the base period (a Laspeyres
Index) or the current period (a Paasche Index). The weightings to be applied are usually
arrived at by discussion and mutual agreement, or from other information such as the
'Family expenditure survey'.
We shall work through a simple illustration of the principle of these aggregate indexes
but you must realise that the use of them in the 'real world' is far more complex than this
and needs to be studied further if it is to be used in practice in a business context.
Example 5
A company buys four products with the following characteristics:
Number of units bought Price paid per unit (£)
Items Year 1 Year 2 Year 1 Year 2
A 20 24 10 11
B 55 51 23 25
C 63 84 17 17
D 28 34 19 20
a) Find the simple price indexes for the products for year 2 using year 1 as the base year:
Product A: 11/10 × 100 = 110.0
Product B: 25/23 × 100 = 108.7   Note that all these figures come from the prices
Product C: 17/17 × 100 = 100.0   only and that no weighting has been applied.
Product D: 20/19 × 100 = 105.3
b) Find the simple aggregate index for year 2 using year 1 as the base year:
(11 + 25 + 17 + 20) / (10 + 23 + 17 + 19) × 100 = 73/69 × 100 = 105.8
c) Find the base-weighted aggregate index, the Laspeyres index, for year 2 using year 1 as
the base year.
Index for year 2 = (Sum of prices × weights in Year 2) / (Sum of prices × weights in base year) × 100
= (11×20 + 25×55 + 17×63 + 20×28) / (10×20 + 23×55 + 17×63 + 19×28) × 100
= 3226/3068 × 100 = 105.15
d) Find the current period-weighted aggregate index, the Paasche index, for year 2 using
year 1 as the base year.
= (11×24 + 25×51 + 17×84 + 20×34) / (10×24 + 23×51 + 17×84 + 19×34) × 100
= 3647/3487 × 100 = 104.59
The type of aggregate index chosen is based on the relative importance of the weightings.
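Both weighted indexes follow the same pattern, so a short Python sketch covers Example 5 (the helper function is mine):

    # Prices and quantities for items A to D in years 1 and 2 (Example 5).
    p1, p2 = [10, 23, 17, 19], [11, 25, 17, 20]
    q1, q2 = [20, 55, 63, 28], [24, 51, 84, 34]

    def aggregate(prices, quantities):
        return sum(p * q for p, q in zip(prices, quantities))

    laspeyres = 100 * aggregate(p2, q1) / aggregate(p1, q1)  # base-year weights
    paasche = 100 * aggregate(p2, q2) / aggregate(p1, q2)    # current weights
    print(round(laspeyres, 2), round(paasche, 2))            # 105.15, 104.59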
Last week we used time series in the form of index numbers to investigate changes over
time, with the main interest being in past time periods and the main objective the
comparison of different series.
This week we again study the past but with the main purpose of identifying patterns in
past events which can be used next week to forecast future values and events.
There are many standard methods used in forecasting and they all depend on finding the
best model to fit the past time series of data and then using the same model to forecast
future values. More methods are described in Chapter 12 of Business Statistics.
Generally, if a model can be found which reasonably fits past chronological data it can
be used to predict, under the same conditions, values for them in the near future.
Since the best method to use in any circumstance depends on the form taken by the past
data it is essential that these values are first plotted against time to help us to select the
most appropriate type of model, or models, to fit to this data.
Below we see three typical patterns exhibited by time series with suggested suitable types
of model. (By no means an exhaustive list of either past data time series or suitably
fitting models.)
[Plot 1: quarterly sales, Q1 1991 - Q4 1995, showing a repeating seasonal pattern - suitable models: seasonal additive or multiplicative decomposition, exponential smoothing]
[Plot 2: quarterly sales, Q1 1994 - Q4 1996, non-seasonal - suitable models: non-linear regression (?), exponential smoothing]
[Plot 3: monthly sales, Jan 1993 - Oct 1996, non-linear and non-seasonal - suitable models: non-linear regression, exponential smoothing]
Exponential smoothing is not covered in this lecture but is also described fully in
Chapter 12 of ‘Business Statistics’. It is a method which is lengthy to carry out by hand
but is straightforward with the help of a computer package such as SPSS, Minitab or
Excel.
Example
A small ice-cream manufacturer who supplies a supermarket chain reports sales figures
of £50 000 for the last quarter.
Is this amount good or bad? Can this single figure be used to forecast future sales?
With only one figure we never have sufficient data to make either a judgement or a
forecast.
We need to identify any past pattern and use it for future sales predictions.
We also need to estimate how accurate those predictions are likely to be.
In order to understand the situation more fully, the following data was collected. It
represents the quarterly sales figures for the four years preceding the figure of £50 000
quoted above.
Sales (£'000)
Year Quarter 1 Quarter 2 Quarter 3 Quarter 4
1997 40 60 80 35
1998 30 50 60 30
1999 35 60 80 40
2000 50 70 100 50
Method
1 First plot the data in chronological order on a graph against time. (Grid on back page)
What do you think? Does this data exhibit a 'four point pattern'? It does and therefore,
since it is quarterly data, the sales are judged to exhibit a quarterly seasonal effect. We
shall use an additive model to identify how much has been 'added' to the sales because it
is a particular quarter of the year.
Additive model:   Observed Sales value (A) = Trend value (T) + Seasonal factor (S) + Random factor (R)
A = T + S + R
The Trend value, T, is the value of the sales had the seasonal effect been 'averaged out'.
The Seasonal factor, S, is the average effect of it being a particular quarter, e.g. summer.
The Random factor, R, is the random variation due to neither of the previous variables.
This cannot be eliminated but is a useful measure of the goodness of fit of the model and
therefore of the quality of any forecasts made from it.
2 We can calculate the missing trend values from the table on the next page which includes:
Date        Sales (A) (£'000)   Cycle Average   Trend (T) Moving average   First Resid. S = (A - T)   Fitted value F = (T + S)   2nd Resid. R = (A - F)
1997 Q1 40
Q2 60
53.75
Q3 80 52.50 +27.50 75.42 +4.58
51.25
Q4 35 50.00 -15.00 33.75 +1.25
48.75
1998 Q1 30 46.25 -16.25 32.08 -2.08
43.75
Q2 50 43.13 + 6.87 49.17 +0.83
42.50
Q3 60 43.13 +16.87 66.04 -6.04
43.75
Q4 30 45.00 -15.00 28.75 +1.25
46.25
1999 Q1 35 48.75 -13.75 34.58 +0.42
51.25
Q2 60 52.50 + 7.50 58.54 +1.46
53.75
Q3 80 55.63 +24.37 78.54 +1.46
57.50
Q4 40 58.75 -18.75 42.50 -2.50
60.00
2000 Q1 50 ____ ____ ____ ____
____
Q2 70 ____ ____ ____ ____
____
Q3 100
Q4 50
The first missing cycle average: (40 + 50 + 70 + 100)/4 = 260/4 = 65.00
The second missing cycle average:
The first missing moving average trend figure: (60.00 + 65.00)/2 = 62.50
The second missing moving average trend figure:
Next plot the centred Trend values (T) on your graph. Note that the first value is plotted
against Quarter 3.
3 Next calculate values for the last two First Residuals (Sales - Trend) for each separate
quarter. From the first residuals calculate the average Seasonal factor for each of the four
quarters: (Complete Quarters 3 and 4)
Check that the average of these means is not very far from zero.
4 Calculate the last two Fitted values (F): F = T + S. Plot the fitted values on the same
graph. If the model is a 'perfect fit' the fitted values will be the same as the observed
Sales. We expect them to be close indicating that we have a 'good' model.
[Grid: Sales (£'000), 20 to 100, against quarters Q1 1997 - Q4 2001, for plotting the series]
Date        Sales (A) (£'000)   Cycle Average   Trend (T) Moving average   First Resid. S = (A - T)   Fitted value F = (T + S)   2nd Resid. R = (A - F)
1997 Q1 40
Q2 60
53.75
Q3 80 52.50 +27.50 75.42 +4.58
51.25
Q4 35 50.00 -15.00 33.75 +1.25
48.75
1998 Q1 30 46.25 -16.25 32.08 -2.08
43.75
Q2 50 43.13 + 6.87 49.17 +0.83
42.50
Q3 60 43.13 +16.87 66.04 -6.04
43.75
Q4 30 45.00 -15.00 28.75 +1.25
46.25
1999 Q1 35 48.75 -13.75 34.58 +0.42
51.25
Q2 60 52.50 + 7.50 58.54 +1.46
53.75
Q3 80 55.63 +24.37 78.54 +1.46
57.50
Q4 40 58.75 -18.75 42.50 -2.50
60.00
2000 Q1 50 62.50 -12.50 48.33 +1.67
65.00
Q2 70 66.25 + 3.75 72.29 -2.29
67.50
Q3 100
Q4 50
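For reference, the whole trend and seasonal-factor calculation can be reproduced with pandas; a minimal sketch of the centred moving-average method (this is my own illustration, not the SPSS/Minitab output shown later):

    import pandas as pd

    sales = pd.Series([40, 60, 80, 35, 30, 50, 60, 30,
                       35, 60, 80, 40, 50, 70, 100, 50])

    cycle = sales.rolling(4).mean()            # four-quarter cycle averages
    trend = cycle.rolling(2).mean().shift(-2)  # centred moving average (T)
    first_resid = sales - trend                # A - T
    seasonal = first_resid.groupby(first_resid.index % 4).mean()
    fitted = trend + seasonal.loc[sales.index % 4].to_numpy()  # F = T + S
    second_resid = sales - fitted              # R = A - F
    print(seasonal.round(4).tolist())   # Q1 to Q4: -14.17, 6.04, 22.92, -16.25
    print(round(second_resid.std(), 3)) # about 2.770 (£'000)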
In this lecture we shall produce forecasts using a seasonal model. Non-seasonal forecasting
is covered by Chapter 13 of Business statistics.
Last week we produced an Additive model for the Ice-cream sales data.
1 and 2 See the incomplete graph from last week repeated on the last page.
3 and 4 Summary Statistics (as calculated last week)
This model therefore seems satisfactory so we shall use it to predict future Ice-cream sales.
When using this model for forecasting the Random factor is left out of the equation for the
single figure forecast but re-introduced when the likely accuracy is estimated as a range.
The Trend value is estimated from the extended trend line over the next four quarters on
your graph. Use your own best judgement in extending this line. If you have any
background information concerning future sales, this should be taken into account.
The Seasonal factor is the average of the previous deviations from the trend line for that
particular season.
Additive model:   Forecast Sales value (F) = Extended Trend value (T) + Seasonal factor (S)
F = T + S
Forecasts
Read off the values from your extended trend line, note them in the table below, and add to
them the appropriate seasonal factor to get your forecast.
Add your forecasts to your graph to check that the past pattern follows into the future.
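The forecasting step itself is just F = T + S; in the sketch below the trend readings are hypothetical stand-ins for whatever your own extended line gives.

    # Seasonal factors (Q1 to Q4) from the decomposition; the trend values
    # 70, 72, 74, 76 are illustrative readings only, not from the document.
    seasonal = [-14.17, 6.04, 22.92, -16.25]
    trend_2001 = [70, 72, 74, 76]
    forecasts = [t + s for t, s in zip(trend_2001, seasonal)]
    print([round(f, 2) for f in forecasts])  # F = T + S for each quarter of 2001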
If the second residuals are shown to be small and randomly distributed about zero, then it
can be assumed that the error in the forecasts is not likely to be more than twice the
standard deviation of the second residuals. (Think 95% confidence interval!)
The standard deviation of the second residuals = 2.770 (£'000), so the maximum likely
error is 2 × 2770 = £5540.
In order to compare figures from consecutive quarters the effect due to it being a particular
quarter needs to be eliminated. How can we compare the ice-cream sales in summer with
those in winter? We 'deseasonalise' the observed sales figures.
2000 Quarter 1: 50 - (-14.17) = 64.17
2000 Quarter 2: 70 - 6.04 = 63.96
2000 Quarter 3: 100 - 22.92 = 77.08
2000 Quarter 4: 50 - (-16.25) = 66.25
So having removed the amount due to each quarter we can see that quarters 1 and 2 were
very similar but that quarter 3 showed an increase, in real terms, of £13 120 over quarter 2.
This was followed by a decrease in sales of £10 830 between quarters 3 and 4.
It now makes sense to calculate percentage changes, compare with published indexes, etc.
The deseasonalised values can be compared to the trend value at any point in time to see
whether, say, sales are progressing better or worse than the general trend at that time.
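Deseasonalising is a one-line rule; sketched below with the seasonal factors from the decomposition above:

    # Deseasonalised value = observed sales minus that quarter's seasonal factor.
    seasonal = {1: -14.17, 2: 6.04, 3: 22.92, 4: -16.25}
    sales_2000 = {1: 50, 2: 70, 3: 100, 4: 50}
    for q, a in sales_2000.items():
        print(q, round(a - seasonal[q], 2))  # 64.17, 63.96, 77.08, 66.25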
[Time series plot of Ice cream sales (£'000), 30 to 100, against quarter number 0 to 16]
We can see that this is quarterly data which can be described well by an additive model.
Row  Qtr  Sales(A)  Trend(T)  First Resid.  Seasonal  Fitted(F)  2nd Resid.
1 1 40 * * * * *
2 2 60 * * * * *
3 3 80 52.500 27.500 22.9167 75.4167 4.58334
4 4 35 50.000 -15.000 -16.2500 33.7500 1.25000
5 5 30 46.250 -16.250 -14.1667 32.0833 -2.08333
6 6 50 43.125 6.875 6.0417 49.1667 0.83333
7 7 60 43.125 16.875 22.9167 66.0417 -6.04166
8 8 30 45.000 -15.000 -16.2500 28.7500 1.25000
9 9 35 48.750 -13.750 -14.1667 34.5833 0.41667
10 10 60 52.500 7.500 6.0417 58.5417 1.45833
11 11 80 55.625 24.375 22.9167 78.5417 1.45834
12 12 40 58.750 -18.750 -16.2500 42.5000 -2.50000
13 13 50 62.500 -12.500 -14.1667 48.3333 1.66667
14 14 70 66.250 3.750 6.0417 72.2917 -2.29166
15 15 100 * * * * *
16 16 50 * * * * *
These figures, with the addition of 'Seasonal', are the same as those calculated last week for
this data. Note the quarterly repetition in the 'Seasonal' column - this is the average
deviation for a quarter and so is the 'Seasonal factor' for use in both forecasting and
deseasonalising.
Plotting the trend line with the original data shows its smoothed fit.
[Two time series plots of Ice cream Sales, 30 to 100, against Index 5 to 15: one with the Trend added and one with the Fitted values added]
The first diagram could, in theory, be used to estimate the future trend but it would be
difficult to extend the trend line and read values from it with any accuracy. A hand drawn
graph is therefore preferable.
Adding the Fitted values to the previous graph shows how well the model has fitted the
observed data in the past.
The second composite graph indicates a fairly close match but the residuals still need to be
analysed before the accuracy of any predictions can be estimated numerically.
Residual analysis
Remember that residuals are required to be small, with a mean of zero and a 'small'
standard deviation, and to be randomly distributed in time.
[Plot of the second residuals (2ND.RESD), -5 to +5, against QUARTER 0 to 15]
Summary statistics
FOR THE DATA
MEAN = 54.375
ST.DEV. = 20.238
These residuals seem to satisfy the required conditions and also give us the additional
information that our forecasts will be accurate to within 2 × 2.77 (£'000), i.e. £5540.
DATE_   SALES   Moving averages   Ratios   Seasonal factors   Seasonally adjusted series   Smoothed trend-cycle   Irregular component
Q1 1997 40.000 . . -13.802 53.802 55.573 -1.771
Q2 1997 60.000 . . 6.406 53.594 54.705 -1.111
Q3 1997 80.000 52.500 27.500 23.281 56.719 52.969 3.750
Q4 1997 35.000 50.000 -15.000 -15.885 50.885 50.098 .787
Q1 1998 30.000 46.250 -16.250 -13.802 43.802 45.978 -2.176
Q2 1998 50.000 43.125 6.875 6.406 43.594 43.177 .417
Q3 1998 60.000 43.125 16.875 23.281 36.719 42.413 -5.694
Q4 1998 30.000 45.000 -15.000 -15.885 45.885 45.098 .787
Q1 1999 35.000 48.750 -13.750 -13.802 48.802 48.756 .046
Q2 1999 60.000 52.500 7.500 6.406 53.594 52.622 .972
Q3 1999 80.000 55.625 24.375 23.281 56.719 55.747 .972
Q4 1999 40.000 58.750 -18.750 -15.885 55.885 58.432 -2.546
Q1 2000 50.000 62.500 -12.500 -13.802 63.802 62.645 1.157
Q2 2000 70.000 66.250 3.750 6.406 63.594 65.955 -2.361
Q3 2000 100.000 . . 23.281 76.719 68.733 7.986
Q4 2000 50.000 . . -15.885 65.885 70.122 -4.236
The moving averages are the trend figures calculated by hand previously and the ‘Ratios’, a
term more suited to a multiplicative model, are the same first residuals. The 'seasonal
factors' and the 'smoothed trend cycle' describe the average effect due to a particular quarter
and the sales level without that effect. These differ slightly from our calculated figures as
they have been further smoothed and also extended to cover the first and last two time
periods. We shall look at seasonally adjusted series later.
No 'fitted values' from the model have been produced by default but these can easily be
obtained as the sum of 'smoothed trend cycle' and 'seasonal factors' which have
automatically been produced and saved by SPSS as STC_1 and SAF_1. The errors and
deseasonalised sales have also been saved.
[Time series plot of the observed quarterly sales against date, Q1 1997 - Q4 2001]
Plotting the trend line with the original data shows its smoothed fit.
[Time series plot of Icecream Sales (£'000) with the Trend added, Q1 1997 - Q4 2001]
This diagram could, in theory, be used to estimate the future trend but it would be difficult
to extend the trend line and read values from it with any accuracy. A hand drawn graph is
therefore preferable.
Adding the Fitted values to the previous graph shows how well the model has fitted the
observed data in the past.
[Composite plot: Quarterly Icecream Sales (£'000), TREND, and Fitted values from the Additive model, Q1 1997 - Q4 2001]
This composite graph indicates a fairly close match but the residuals still need to be
analysed before the accuracy of any predictions can be estimated numerically.
Residual analysis
Remember that residuals are required to be small, with a mean of zero and a 'small'
standard deviation, and to be randomly distributed in time.
Descriptive Statistics
                                   N    Minimum   Maximum   Mean        Std. Deviation
Quarterly Icecream Sales (£'000)   16   30        100       54.38       20.24
First residuals                    12   -18.75    27.50     -.3658      16.9534
Second residuals                   12   -6.04     4.59      1.667E-03   2.7702
Valid N (listwise)                 12
These residuals seem to satisfy the required conditions and also give us the additional
information that our forecasts will be accurate to within 2 × 2.77 (£'000), i.e. £5540.
[Plot of the second residuals against date, Q1 1997 - Q4 2001]
16.6 At the end of a semester, the students' grades for Statistics were tabulated in the following
3 x 2 contingency table to see if there is any association between class attendance and
grades received.
Grade received
No. of Days Absent Pass Fail
0 - 3 135 110
4 - 6 36 4
7 - 45 9 6
(b) Calculate the expected value for the number of days absence.
(c) At 5% significance, do the data indicate that the proportions of students who pass
differ between the three absence categories?
(d) Calculate the 95% confidence intervals for the percentage of students who passed
and the percentage of them who failed. Interpret your findings.
(e) Briefly summarise your results so that they can be understood by a non-
statistician.
{(a) (i) 0.817; (ii) 0.950; (iii) 0.600; (iv) 0.550 (v) 0.967; (vi) 0.450
(b) 3.19 days
(c) χ² = 17.45; χ² (5%, 2 d.f.) = 5.99; Reject H0; Conclude that the proportions differ.
(d) 54.5% to 65.5%; 34.5% to 45.5%; No overlap, so percentages different.}
Normal Distribution
16.10 The weights of sugar in what are nominally '2 kilo' bags are actually normally distributed
with mean 2.1 kg and standard deviation 0.1 kg.
a) What percentage of bags will weigh less than 2.0 kg? (15.9%)
b) What percentage will weigh between 2.0 kg and 2.25 kg? (77.5%)
c) If only 4% of the bags were to contain under the stated minimum weight,
what should this value be? (1.93 kg)
d) If a random sample of 25 bags from another machine has a mean weight of 2.0 kg
and a standard deviation of 0.5 kg would its mean production be any different?
(95% C.I. 1.79 kg to 2.21 kg so not different.)
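These answers can be checked directly against the normal distribution; a minimal sketch assuming SciPy:

    from scipy import stats

    bags = stats.norm(loc=2.1, scale=0.1)  # bag weights in kg
    print(bags.cdf(2.0))                   # a) about 0.159, i.e. 15.9%
    print(bags.cdf(2.25) - bags.cdf(2.0))  # b) about 0.775, i.e. 77.5%
    print(bags.ppf(0.04))                  # c) about 1.93 kg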
Confidence Intervals
16.17 The records of a large shoe shop show that over a given period 520 out of 1000 people
who entered the shop bought at least one pair of shoes.
a) Treating this as a random sample of all potential customers find a 95% confidence
interval for the actual percentage of the people entering the shop who will buy at
least one pair of shoes. (48.9% to 55.1%)
b) Does this support the firm’s claim that they actually sell shoes to half of their
potential customers? Why? (Yes, interval includes 50%)
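The interval in (a) is the usual normal approximation for a proportion; a quick sketch:

    import math

    p_hat, n = 520 / 1000, 1000
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
    low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
    print(round(100 * low, 1), round(100 * high, 1))  # 48.9 to 55.1 (%)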
16.19 The same supermarket then decided to investigate the spending habits of husbands and
wives. They were thinking of starting late 'family shopping' evenings and as an
experiment asked both partners to shop separately. The amounts spent by 14 husbands
and their wives were selected randomly from all the pairs with the following results:
(rounded to the nearest £1)
Family A B C D E F G H I J K L M N
Husband (£) 9 20 19 9 26 16 20 13 15 10 18 13 16 23
Wife (£) 23 30 13 23 12 17 29 35 18 26 32 23 26 14
a) Calculate the 95% confidence interval for the mean amount spent by the wives.
(£18.72 to £27.14)
b) Calculate the 99% confidence interval for the mean amount spent by the
husbands. (£12.05 to £20.37)
c) The store manager thinks that, on average, women spend £20 per head.
Does the confidence interval in (a) from the sample support this claim?
Explain your answer. (Yes, £20 in interval)
d) Is there any evidence that the mean amounts spent by wives and their
husbands are different?
(Calculate the 95% confidence interval for the differences in each family.)
(Yes, £0.72 to £12.70)
Hypothesis Testing
Pair number 1 2 3 4 5 6 7 8 9 10
S method score 56 59 61 48 39 56 75 45 81 60
N method score 63 57 67 52 61 71 70 46 93 75
a) Is 65 a reasonable estimate for the mean score of all children taught by method S?
b) Calculate the 'difference in scores' for each pair of children and test to see if, at
the 5% level, new method N gives a better mean score than standard method S.
{(a) C.V. = 2.26, T.S. t = 1.73, 65 could be the mean mark for all the children.
(b) C.V. = 1.83, T.S. t = 2.80, Method N does give a higher mean score.)}
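Part (b) is a paired comparison, so it can be checked with SciPy's ttest_rel; a sketch (ttest_rel returns a two-sided p-value, halved here for the one-tailed test):

    from scipy import stats

    s_scores = [56, 59, 61, 48, 39, 56, 75, 45, 81, 60]
    n_scores = [63, 57, 67, 52, 61, 71, 70, 46, 93, 75]

    t, p_two_sided = stats.ttest_rel(n_scores, s_scores)   # paired t-test
    print(round(t, 2), p_two_sided / 2)  # t about 2.80; one-tailed p < 0.05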
Analysis of Variance
16.28 Four salesmen in your company are competing for the title ‘Salesman of the year’. Each
has the task of selling ‘cold’ in three different types of location. Their resulting sales, in
£’000, were as follows:
Salesmen
Area A B C D
a) Is there any significant difference, at a 5% level, between the sales of the men if
the location is not taken into account?
b) Is there any significant difference, at a 5% level, between the sales of the men if
the location is taken into account?
c) Which salesman won? Did he do significantly better than his nearest rival?
16.31 The following data refers to the Sales and Advertising expenditure of a small local
bakery.
Year Advertising (£'000) Sales (£'000)
1990 3 40
1991 3 60
1992 5 100
1993 4 80
1994 6 90
1995 7 110
1996 7 120
1997 8 130
1998 6 100
1999 7 120
2000 10 150
{ (a) 83.3 91.7 100.0 112.5 125.0 (b) 66.7 73.3 80.0 90.0 100.0
(c) 54p 60p 65p 73p 81p ( to the nearest 1p) }
16.39
Indexes
Year   'Old'   'New'
1      100
2      120
3      160
4      190     100
5              130
6              140
7              150
8              165
(a) Scale to 'New' the 'Old' indexes for years 1, 2 and 3.
(b) Scale to 'Old' the 'New' indexes for years 5, 6, 7 and 8.
(c) If a company's profits were £3.5 million in Year 3, what would they be for the other years?
a) Calculate index numbers for the above data taking Year 3 as the base year.
b) If the production of cookers followed the same index and 550 thousand were
made in year 1, how many thousand were made in each of the other years?
{ (a) 87.4 91.5 100.0 108.1 116.5 (b) (550) 576 630 680 733 }
16.41 The table below refers to the Annual Average Retail Price Index.
Year 1 2 3 4 5 6 7
RPI1 351.8 373.2 385.9 394.5
RPI2 100.0 106.9 115.2 126.1
b) Calculate the percentage point change between consecutive years after Year 4.
{ (a) 6.1% 3.4% 2.2% 6.9% 7.8% 9.5% (b) 6.9 8.3 10.9 }
16.42 The following table shows the prices of three commodities in three consecutive years:
Prices
Year 1 Year 2 Year 3
Tea 8 12 16
Coffee 15 17 18
Chocolate 22 23 24
a) Calculate the simple aggregate index for Year 2 and Year 3 using Year 1 as the
base year.
b) Calculate the mean price relative index for Year 2 and Year 3 using Year 1 as the
base year.
16.44 The following quarterly data represents the number of marriages, in thousands, for four
consecutive years: (Source Office of Population censuses.)
Year Q1 Q2 Q3 Q4
1 71.6 111.0 135.7 79.6
2 62.2 108.8 138.0 78.0
3 61.9 108.9 140.5 78.0
4 61.3 115.0 146.2 73.3
A summary of the analysis performed on this data, using both SPSS and Minitab, is
shown overleaf.
a) Plot the given data on graph paper as a time series, leaving room for your
forecasts for Year 5.
b) Calculate the missing values indicated by the three question marks (?) on the table
in the printout.
c) Plot the Smoothed Trend-cycle, Trend for Minitab, on your graph.
d) State the four seasonal factors and use with your graph to forecast the expected
number of marriages for each of the four quarters of Year 5, stating your
forecasts clearly and showing how they were calculated.
e) Estimate the likely error in your forecasts.
f) Demonstrate how the Seasonally adjusted series for SPSS, or Fitted values for
Minitab, of the number of marriages for each of the quarters of Year 4 have been
found.
g) With reference to your graph and the summary statistics given in the SPSS or
Minitab outputs, discuss the suitability of the additive model.
SPSS output for use with Question 16.44
-> * Seasonal Decomposition.
-> /VARIABLES=marriage
-> /MODEL=ADDITIVE
-> /MA=CENTERED.
DATE_   MARRIAGE   Moving averages   Ratios   Seasonal factors   Seasonally adjusted series   Smoothed trend-cycle   Second resid
Q1 1 71.600 . . -35.718 107.318 101.004 6.314
Q2 1 111.000 . . 13.207 97.793 99.973 -2.181
Q3 1 135.700 98.300 37.400 40.891 94.809 97.912 -3.103
Q4 1 79.600 96.850 -17.250 -18.380 97.980 96.976 1.005
Q1 2 62.200 96.863 -34.662 -35.718 97.918 96.980 .938
Q2 2 108.800 96.950 11.850 13.207 95.593 96.799 -1.206
Q3 2 138.000 96.713 41.287 40.891 97.109 96.757 .353
Q4 2 78.000 96.688 -18.688 -18.380 96.380 96.653 -.273
Q1 3 61.900 97.012 -35.112 -35.718 97.618 97.080 .538
Q2 3 108.900 97.325 11.575 13.207 95.693 97.144 -1.451
Q3 3 140.500 97.250 43.250 40.891 99.609 97.512 2.097
Q4 3 78.000 97.938 -19.938 -18.380 96.380 97.764 -1.384
Q1 4 61.300 ? ? -35.718 97.018 99.146 ?
Q2 4 115.000 99.538 15.463 13.207 101.793 99.788 2.005
Q3 4 146.200 . . 40.891 105.309 99.594 5.715
Q4 4 73.300 . . -18.380 91.680 99.497 -7.817
Summary Statistics
Variable Mean Std Dev Minimum Maximum N Label
MARRIAGE 98.12 30.79 61.3 146.2 16 Number of marriages (0,000)
FIRST -.04 30.83 -38.0 47.0 16 First residuals
SECOND -.04 3.35 -7.81690 6.31389 16 Second residuals
Minitab output (from an older version) for use with Question 16.44
Row  Qtr  Marriages  Trend  First Resid.  Seasonal  Fitted  2nd Resid.
1 1 71.6 * * * * *
2 2 111.0 * * * * *
3 3 135.7 98.3000 37.4000 40.6458 138.946 -3.24583
4 4 79.6 96.8500 -17.2500 -18.6250 78.225 1.37500
5 5 62.2 96.8625 -34.6625 -35.9625 60.900 1.30000
6 6 108.8 96.9500 11.8500 12.9625 109.912 -1.11250
7 7 138.0 96.7125 41.2875 40.6458 137.358 0.64166
8 8 78.0 96.6875 -18.6875 -18.6250 78.063 -0.06250
9 9 61.9 97.0125 -35.1125 -35.9625 61.050 0.85000
10 10 108.9 97.3250 11.5750 12.9625 110.287 -1.38750
11 11 140.5 97.2500 43.2500 40.6458 137.896 2.60417
12 12 78.0 97.9375 -19.9375 -18.6250 79.313 -1.31250
13 13 61.3 ? ? -35.9625 63.450 ?
14 14 115.0 99.5375 15.4625 12.9625 112.500 2.50000
15 15 146.2 * * * * *
16 16 73.3 * * * * *
[Minitab character plot of the marriage series, 60 to 150 (thousands), against period 0 to 15]
Summary Statistics
FOR THE DATA
MEAN = 98.125
ST.DEV. = 30.791
[Grid: number of marriages (thousands), 60 to 140, against quarters Q1 Year 1 - Q4 Year 5, for plotting the data and forecasts]