
MASTERS FOR LECTURE HANDOUTS


INTRODUCTION TO THIS COURSE AND TO STATISTICS
The Course
This is a one-semester course for Business Studies students. It is intended to be user-friendly, with a minimum of jargon and a maximum of explanation. It is accompanied by a simple text-book, 'Business Statistics - A One Semester Course', which contains many more explanations, extra worked examples, practice exercises with answers, examination examples, etc. Having completed the course you should be able to:
• understand the use of most simple statistical techniques used in the world of business;
• understand published graphical presentation of data;
• present statistical data to others in graphical form;
• summarise and analyse statistical data and interpret the analysis for others;
• identify relationships between pairs of variables;
• make inferences about a population from a sample;
• use some basic forecasting techniques;
• use a statistical software package (optional at the discretion of the lecturer).

The main themes of this course are:
• Descriptive statistics: graphical and numerical.
• Inferential statistics: confidence intervals and hypothesis tests.
• Pairwise relationships between variables: correlation, regression and chi-squared tests.
• Forecasting: modelling and extension of time series.

Statistics (According to the Concise Oxford Dictionary)
• Numerical facts systematically collected.
• The science of collecting, classifying and using statistics.

The emphasis in this course is not on the actual collection of numerical facts (data) but on their classification, summarisation, display and analysis. These processes are carried out in order to help us understand the data and make the best use of them.

Business statistics
Any decision-making process should be supported by some quantitative measures produced by the analysis of collected data. Useful data may be on:
• your firm's products, costs, sales or services;
• your competitors' products, costs, sales or services;
• measurement of industrial processes;
• your firm's workforce;
• etc.

Once collected, this data needs to be summarised and displayed in a manner which helps its communication to, and understanding by, others. Only when fully understood can it profitably become part of the decision making process.

TYPES OF DATA AND SCALES OF MEASUREMENT
Any data you use can take a variety of values or belong to various categories, either numerical or non-numerical. The 'thing' being described by the data is therefore known as a variable. The values or descriptions you use to measure or categorise this 'thing' are the measurements. These are of different types, each with its own appropriate scale of measurement which has to be considered when deciding on suitable methods of graphical display or numerical analysis.

Variable - something whose 'value' can 'vary'. For example:
• A car could be red, blue, green, etc.
• It could be classed as small, medium or large.
• Its year of manufacture could be 1991, 1995, 1999, etc.
• It would have a particular length.
These values all describe the car, but are measured on different scales.

Categorical data
This is generally non-numerical data which is placed into exclusive categories and then counted rather than measured. People are often categorised by sex or occupation. Cars can be categorised by make or colour.

Nominal data
The scale of measurement for a variable is nominal if the data describing it are simple labels or names which cannot be ordered. This is the lowest level of measurement. Even if it is coded numerically, or takes the form of numbers, these numbers are still only labels. For example, car registration numbers only serve to identify cars; numbers on athletes' vests are only nominal and make no value statements. All nominal data are placed in a limited number of exhaustive categories and any analysis is carried out on the frequencies within these categories. No other arithmetic is meaningful.

Ordinal Data
If the exhaustive categories into which the set of data is divided can be placed in a meaningful order, without any measurement being taken on each case, then it is classed as ordinal data. This is one level up from nominal. We know that the members of one category are more, or less, than the members of another but we do not know by how much. For example, the cars can be ordered as 'small', 'medium', 'large' without the aid of a tape measure. Degree classifications are only ordinal. Athletes' results depend on their order of finishing in a race, not on 'how much' separates their times. Questionnaires are often used to collect opinions using the categories: 'Strongly agree', 'Agree', 'No opinion', 'Disagree' or 'Strongly disagree'. The responses may be coded as 1, 2, 3, 4 and 5 for the computer but the differences between these numbers are not claimed to be equal.

Interval and Ratio Data
In both these scales any numbers are defined by standard units of measurement, so equal difference between numbers genuinely means equal distance between measurements. If there is also a meaningful zero, then the fact that one number is twice as big as another means that the measurements are also in that ratio. These data are known as ratio data. If, on the other hand, the zero is not meaningful, the data are interval only.

There are very few examples of genuine interval scales. Temperature in degrees Centigrade provides one example, with the 'zero' on this scale being arbitrary. The difference between 30°C and 50°C is the same as the difference between 40°C and 60°C but we cannot claim that 60°C is twice as hot as 30°C. It is therefore interval data but not ratio data. Dates are measured on this scale as again the zero is arbitrary and not meaningful.

Ratio data must have a meaningful zero as its lowest possible value so, for example, the time taken for athletes to complete a race would be measured on this scale. Money is always 'at least interval' data. Suppose Bill earns £20 000, Ben earns £15 000 and Bob earns £10 000. The intervals of £5000 between Bill and Ben and also between Ben and Bob genuinely represent equal amounts of money. Also the earnings of Bob and Bill are genuinely in the ratio 1 : 2. This data set is therefore ratio as well as interval, since the value of £0 represents 'no money'. The distinction between these types is theoretical rather than practical as the same numerical and graphical methods are appropriate for both.

We have therefore identified three measurement scales - nominal, ordinal and interval - measuring nominal, ordinal and interval data. Data may be analysed using methods appropriate to lower levels. For example, interval data may be treated as ordinal but useful information is lost: if we know that Bill earns £20 000 and Ben earns £15 000 we are throwing information away by only recording that Bill earns 'more than' Ben.

Data cannot be analysed using methods which are only appropriate for higher level data as the results will be either invalid or meaningless. For example it makes no sense to code the sex of students as 'male' = 1, 'female' = 2 and then report 'mean value is 1.7'. It is however quite appropriate to report that '70% of the students are female'.

Qualitative and Quantitative Data
Various definitions exist for the distinction between these two types of data. All definitions agree that interval or ratio data are quantitative. Some text-books, however, use the term qualitative to refer to words only, whilst others also include nominal or ordinal numbers, which identify or rank rather than measure. Problems of definition could arise with numbers such as house numbers. Non-numerical data is always described as being qualitative or non-metric, as the data describes some quality but is not measured. Quantitative or metric data, which describes some measurement or quantity, is always numerical and measured on the interval or ratio scales.

Discrete and Continuous Data
Quantitative data may be discrete or continuous. If the values which can be taken by a variable change in steps the data are discrete. These discrete values are often, but not always, whole numbers. If the variable can take any value within a range it is continuous. The number of children in a family is discrete but a baby's birth weight is continuous. The number of people shopping in a supermarket is discrete but the amount they spend is continuous.

POPULATIONS AND SAMPLES
The population is the entire group of interest. This is not confined to people, as is usual in the non-statistical sense. Examples may include such 'things' as all the houses in a local authority area rather than the people living in them.

It is not usually possible, or not practical, to examine every member of a population, so we use a sample, a smaller selection taken from that population, to estimate some value or characteristic of the whole population. Care must be taken when selecting the sample as it must be representative of the whole population under consideration, otherwise it doesn't tell us anything relevant to that particular population.

Occasionally the whole population is investigated by a census, such as is carried out every ten years in the British Isles to produce a complete enumeration of the whole population. The data are gathered from the whole population. A more usual method of collecting information is by a survey, in which only a sample is selected from the population of interest and its data examined. Examples of this are the Gallup polls produced from a sample of the electorate when attempting to forecast the result of a general election. Analysing a sample instead of the whole population has many advantages such as the obvious saving of both time and money. It is often the only possibility, as the collecting of data may sometimes destroy the article of interest, e.g. the quality control of rolls of film.

The ideal method of sampling is random sampling. By this method every member of the population has an equal chance of being selected and every selection is independent of all the others. This ideal is often not achieved for a variety of reasons and many other methods are used.

DESCRIPTIVE STATISTICS
If the data available to us cover the whole population of interest, i.e. we are only interested in the specific group from which the measurements are taken, we may describe them or analyse them in their own right. Much of the data generated by a business will be descriptive in nature, as will be the majority of sporting statistics. The facts and figures usually referred to as 'statistics' in the media are very often a numerical summary, sometimes accompanied by a graphical display, of this type of data. In the next few weeks you will learn how to display data graphically and summarise it numerically.

INFERENTIAL STATISTICS
Alternatively, we may have available information from only a sample of the whole population of interest, e.g. unemployment figures. In this case the best we can do is to analyse it to produce the sample statistics from which we can infer various values for the parent population. This branch of statistics is usually referred to as inferential statistics. For example we use the proportion of faulty items in a sample taken from a production line to estimate the corresponding proportion of all the items. A descriptive measure from the sample is usually referred to as a sample statistic and the corresponding measure estimated for the population as a population parameter.

The problem with using samples is that each sample taken would produce a different sample statistic, giving us a different estimate for the population parameter. They cannot all be correct, so a margin of error is generally quoted with any sample estimations. This is particularly important when forecasting future statistics. In this course you will learn how to draw conclusions about populations from sample statistics and estimate future values from past data.

(There is no tutorial sheet corresponding to this week's lecture but if your basic maths is at all questionable it can be checked by working through the 'Basic Mathematics Revision' sheet, which covers all the basic arithmetic which is the only prior learning required for this course.)

Exercise 1: Basic Mathematics Revision

1 Evaluate:
a) 3/7 - 4/21 + 5/3
b) 1 4/5 + 2 2/3 - 3 1/2
c) 2/7 × 14/25 × 15/24
d) 16/21 ÷ 4/7
e) 4/9 × 3/16 ÷ 7/12
f) 4 2/5 ÷ 11/12 × 1 7/8

2 a) Give 489 267 to 3 significant figures.
b) Give 489 267 to 2 sig. figs.
c) Give 0.002615 to 2 significant figures.
d) Give 0.002615 to 1 sig. fig.
e) Give 0.002615 to 5 decimal places.
f) Give 0.002615 to 3 dec. places.

3 Retail outlets in a town were classified as small, medium and large and their numbers were in the ratio 6 : 11 : 1. If there were 126 retail outlets altogether, how many were there of each type?

4 Convert:
a) 28% to a fraction in its lowest terms.
b) 28% to a decimal.
c) 3/8 to a decimal.
d) 3/8 to a percentage.
e) 0.625 to a fraction in its lowest terms.
f) 0.625 to a percentage.

5 Express the following in standard form:
a) 296 000   b) 0.000296   c) 0.4590   d) 459.0   e) 1/25 000   f) 1/0.00025

6 Reduce the following expressions to their simplest form, expanding brackets if appropriate:
a) 3a + b + 2a - 4b
b) 2a + 4ab + 3a² + ab
c) a²(3a + 4b + 2a)
d) (x + 2)(x + 4)
e) (x + 2)²
f) (x + 1)(x - 1)

7 Make x the subject of the formula:
a) y = 5x - 4
b) y = x² - 7
c) y = 2(3 + 6x)
d) y = 3x³

8 Find the value of x in the formulae in number 7 when y = -3.

9 Evaluate the following when x = -2, y = 5 and z = 4:
a) xy   b) (xy)²   c) (xy + z)²   d) zy - x²   e) (x + z)(2y - x)   f) x² + y² + z²

10 Solve for x:
a) 3x - 1 = 4 - 2x
b) 2(x - 3) = 3(1 - 2x)
c) (x - 1)/2 = 3

Answers
1 a) 1 19/21   b) 29/30   c) 1/10   d) 1 1/3   e) 1/7   f) 9
2 a) 489 000   b) 490 000   c) 0.0026   d) 0.003   e) 0.00262   f) 0.003
3 42 small, 77 medium, 7 large
4 a) 7/25   b) 0.28   c) 0.375   d) 37.5%   e) 5/8   f) 62.5%
5 a) 2.96 × 10⁵   b) 2.96 × 10⁻⁴   c) 4.59 × 10⁻¹   d) 4.59 × 10²   e) 4.0 × 10⁻⁵   f) 4.0 × 10³
6 a) 5a - 3b   b) 3a² + 5ab + 2a   c) 5a³ + 4a²b   d) x² + 6x + 8   e) x² + 4x + 4   f) x² - 1
7 a) x = (y + 4)/5   b) x = ±√(y + 7)   c) x = (y - 6)/12   d) x = ∛(y/3)
8 a) 0.2   b) +2 or -2   c) -0.75   d) -1
9 a) -10   b) 100   c) 36   d) 16   e) 24   f) 45
10 a) x = 1   b) x = 1 1/8   c) x = 7

GRAPHICAL REPRESENTATION OF DATA

In this lecture two methods, histograms and cumulative frequency diagrams, are described in detail for presenting frequency data. You also need to read Chapter 2 of 'Business Statistics - A One Semester Course', which describes other presentation methods: barcharts, piecharts, dotplots, boxplots, and stem-and-leaf diagrams. Also make sure that all the diagrams in this handout are completed before next week's lecture as they will be used again for estimating summary statistics.

A frequency distribution is simply a grouping of the data together, generally in the form of a frequency distribution table, giving a clearer picture than the individual values. The most usual presentation is in the form of a histogram and/or a frequency polygon.

A Histogram is a pictorial method of representing data. It appears similar to a Bar Chart but has two fundamental differences:
• The data must be measurable on a standard scale, e.g. lengths rather than colours.
• The Area of a block, rather than its height, is drawn proportional to the Frequency, so if one column is twice the width of another it needs to be only half the height to represent the same frequency.

Method
• Construct the frequency distribution table, grouping the data into a reasonable number of classes (somewhere in the order of 10). Too few classes may hide some information about the data; too many classes may disguise its overall shape. Tally charts may be used if preferred, but any method of finding the frequency for each interval is quite acceptable.
• Make all intervals the same width at this stage, but you may have to decide on sensible limits if the first or last Class Interval is specified as 'under ...' or 'over ...', i.e. is open ended.
• Intervals are often left the same width, but if the data is scarce at the extremes then classes may be joined; if frequencies are very high in the middle of the data, classes may be split.
• If the intervals are not all the same width, calculate the frequency densities, i.e. frequency per constant interval. (Often the most commonly occurring interval is used throughout.) If some intervals are wider than others, care must be taken that the areas of the blocks, and not their heights, are proportional to the frequencies.
• Construct the histogram, labelling each axis carefully. Hand drawn histograms usually show the frequency or frequency density vertically; computer output may be horizontal as this format is more convenient for line printers.

A frequency polygon is constructed by joining the midpoints at the top of each column of the histogram. The final section of the polygon often joins the mid point at the top of each extreme rectangle to a point on the x-axis half a class interval beyond the rectangle. This makes the area enclosed by the polygon the same as that of the histogram. (Your histogram will be used again next week for estimating the modal mileage.)

Example
You are working for the Transport Manager of a large chain of supermarkets which hires cars for the use of its staff. Your boss is interested in the weekly distances covered by these cars. Mileages recorded for a sample of hired vehicles from 'Fleet 1' during a given week yielded the following data:

138 146 168 146 164 158 126 183 150 140
138 105 132 109 186 108 144 136 163 135
125 148 109 153 149 152 154 140 157 144
165 135 161 145 135 142 150 145 156 128

Minimum = 105, Maximum = 186, Range = 186 - 105 = 81

Nine intervals of 10 miles width seems reasonable. Next calculate the frequencies within each interval. (Not frequency density yet.)

Frequency distribution table

Class interval    Tally           Frequency   Frequency density (per 10 miles)
100 and < 110     ||||                4        2.0 (joined classes)
110 and < 120                         0
120 and < 130     |||                 3        3.0
130 and < 140     |||| ||             7        7.0
140 and < 150     |||| |||| |        11       11.0
150 and < 160     |||| |||            8        8.0
160 and < 170     ||||                5        5.0
170 and < 180                         0        1.0 (joined classes)
180 and < 190     ||                  2
Total                                40

The data is scarce at both extremes so join the extreme two classes together. This doubles the interval width, so the frequency needs to be halved to produce the frequency per 10 miles, i.e. the frequency density. Next draw the histogram and add the frequency polygon.

[Histogram and frequency polygon: frequency per 10 mile interval (0 to 12) against mileages (100 to 200)]
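If the optional statistical-software strand of the course is being followed, the same frequency counts can be checked in a few lines of code. The sketch below is not part of the original handout; it assumes Python (one of several packages that could be used) and simply counts the mileages into the nine 10-mile classes.

    # Count the Fleet 1 mileages into 10-mile classes (a sketch only).
    mileages = [138, 146, 168, 146, 164, 158, 126, 183, 150, 140,
                138, 105, 132, 109, 186, 108, 144, 136, 163, 135,
                125, 148, 109, 153, 149, 152, 154, 140, 157, 144,
                165, 135, 161, 145, 135, 142, 150, 145, 156, 128]
    print(min(mileages), max(mileages))       # 105 186, so the range is 81
    for lower in range(100, 190, 10):
        freq = sum(lower <= m < lower + 10 for m in mileages)
        print(lower, 'and <', lower + 10, ':', freq)   # 4 0 3 7 11 8 5 0 2

Run as written, the loop reproduces the Frequency column of the table above.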

Stem and leaf plots
A stem and leaf plot displays the data in the same 'shape' as the histogram, though it tends to be shown horizontally. The main difference is that it retains all the original information, as the numbers themselves are included in the diagram so that no information is 'lost'. The 'stem' indicates the first two digits (hundreds and tens), and the 'leaf' the last one (the units). The lowest value below is 105 miles. Sometimes the frequency on each stem is included.

Example (same data as for the previous histogram)

Frequency   Stem | Leaf
4           10 | 5789
0           11 |
3           12 | 568
7           13 | 2555688
11          14 | 00244556689
8           15 | 00234678
5           16 | 13458
0           17 |
2           18 | 36

Cumulative frequency diagrams (Cumulative frequency polygons, Ogives)
A Cumulative Frequency Diagram is a graphical method of representing the accumulated frequencies up to and including a particular value - a running total. These cumulative frequencies are often calculated as percentages of the total frequency. This method, as we shall see next week, is used for estimating median and quartile values and hence the interquartile or semi-interquartile ranges of the data. It can also be used to estimate the percentage of the data above or below a certain value.

Method
• Construct a frequency table as before.
• Use it to construct a cumulative frequency table, noting that the end of the interval is the relevant plotting value.
• Calculate a column of cumulative percentages.
• Plot the cumulative percentage against the end of the interval and join up the points with straight lines. Cumulative frequencies are always plotted on the vertical axis.

Using the data in the section on Histograms, we work through the above method for drawing the cumulative frequency diagram. Next week you will use it to estimate the median mileage, the interquartile range and the semi-interquartile range of the data.

Cumulative frequency table

Interval        Frequency   Cumulative frequency   % cumulative frequency
less than 100       0               0                      0.0
less than 120       4               4                     10.0
less than 130       3               7                     17.5
less than 140       7              14                     35.0
less than 150      11              25                     62.5
less than 160       8              33                     82.5
less than 170       5              38                     95.0
less than 190       2              40                    100.0

[Cumulative frequency diagram: % cumulative frequency (0 to 100) against mileages (100 to 200)]

We can estimate from this diagram that the percentage of vehicles travelling, say, less than 125 miles is 14%.

Next week we shall use the same histogram and cumulative frequency diagram to:
• Estimate the mode of the frequency data.
• Estimate the Median value by dotting in from 50% on the Cumulative Percentage axis as far as your ogive and then down to the values on the horizontal axis. The value indicated on the horizontal axis is the estimated Median.
• Estimate the Interquartile Range by dotting in from 25% and down to the horizontal axis to give the Lower Quartile value. Repeat from 75% for the Upper Quartile value. The Interquartile Range is the range between these two Quartile values. The Semi-interquartile Range is half of the Interquartile Range.
• Estimate a particular value from the percentage of the data which lies above or below it, using the same technique.
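The graphical 'dotting in' described above amounts to linear interpolation between the plotted points of the ogive. As a check on next week's estimates, this sketch (not in the handout, Python assumed) computes the cumulative percentages and interpolates the median:

    # Cumulative percentages and an interpolated median (a sketch only).
    ends = [100, 120, 130, 140, 150, 160, 170, 190]   # class end points
    freqs = [0, 4, 3, 7, 11, 8, 5, 2]
    total, running, cum_pct = sum(freqs), 0, []
    for f in freqs:
        running += f
        cum_pct.append(100 * running / total)
    print(cum_pct)        # [0.0, 10.0, 17.5, 35.0, 62.5, 82.5, 95.0, 100.0]
    for i in range(1, len(cum_pct)):
        if cum_pct[i] >= 50:                    # the median lies here
            x0, x1 = ends[i - 1], ends[i]
            y0, y1 = cum_pct[i - 1], cum_pct[i]
            print(x0 + (50 - y0) * (x1 - x0) / (y1 - y0))   # about 145.5
            break

The printed value, about 145.5 miles, agrees with the 146 miles read off the completed diagram later in these masters.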

Completed diagrams from lecture handout (Not for the student handouts)

Example
You are working for the Transport Manager of a large chain of supermarkets which hires cars for the use of its staff. He is interested in the weekly distances covered by these cars. Mileages recorded for a sample of hired vehicles from 'Fleet 1' during a given week yielded the following data:

138 146 168 146 164 158 126 183 150 140
138 105 132 109 186 108 144 136 163 135
125 148 109 153 149 152 154 140 157 144
165 135 161 145 135 142 150 145 156 128

Minimum = 105, Maximum = 186, Range = 186 - 105 = 81. Nine intervals of 10 miles width seems reasonable. Next calculate the frequencies within each interval.

Frequency distribution table

Class interval    Tally           Frequency   Frequency density (per 10 miles)
100 and < 110     ||||                4        2.0 (joined classes)
110 and < 120                         0
120 and < 130     |||                 3        3.0
130 and < 140     |||| ||             7        7.0
140 and < 150     |||| |||| |        11       11.0
150 and < 160     |||| |||            8        8.0
160 and < 170     ||||                5        5.0
170 and < 180                         0        1.0 (joined classes)
180 and < 190     ||                  2
Total                                40

The data is scarce at both extremes so join the extreme two classes together. This doubles the interval width, so the frequency needs to be halved to produce the frequency per 10 miles, i.e. the frequency density. Next draw the histogram and add the frequency polygon.

[Completed histogram and frequency polygon: frequency per 10 mile interval (0 to 12) against mileages (100 to 200)]

Cumulative frequency table

Interval        Frequency   Cumulative frequency   % cumulative frequency
less than 100       0               0                      0.0
less than 120       4               4                     10.0
less than 130       3               7                     17.5
less than 140       7              14                     35.0
less than 150      11              25                     62.5
less than 160       8              33                     82.5
less than 170       5              38                     95.0
less than 190       2              40                    100.0

[Completed cumulative frequency diagram: % cumulative frequency (0 to 100) against mileages (100 to 200)]

We can estimate from this diagram that the percentage of vehicles travelling, say, less than 125 miles is 14%.

NUMERICAL SUMMARY OF DATA

In this lecture we shall look at summarising data by numbers rather than by graphs as last week. We shall use the diagrams drawn in last week's lecture again this week for making estimations, and you will use the diagrams you drew in Tutorial 2 for estimating summary statistics in Tutorial 3.

As we saw last week, a large display of data is not easily understood, but if we make some summary statement such as 'the average examination mark was 65' and 'the marks ranged from 40 to 75' we get a clearer understanding of the situation. A set of data is often described, or summarised, by two parameters: a measure of centrality, or location, and a measure of spread, or dispersion (Section 1.6 of Business Statistics). You will first learn the meanings of the various summary statistics and then learn how to use your calculator in standard deviation mode to save time and effort in calculating mean and standard deviation. Further summary statistics are discussed in Business Statistics, Chapter 3.

MEASURES OF CENTRALITY (Section 3.4 of Business Statistics)

Mode: This is the most commonly occurring value. The data may be nominal, ordinal, interval or ratio.

Median: The middle value when all the data are placed in order. For an even number of values take the average of the middle two. The data themselves must be ordinal or interval.

Mean: The Arithmetic Average. This is calculated by dividing the 'sum of the values' by the 'number of the values'. The data must be measurable on an interval scale.

x̄ = Σx/n   or, for frequency data,   x̄ = Σfx/n

where x represents the value of the data, f is the frequency of that particular value, x̄ is the shorthand way of writing 'mean' and Σ (sigma) is the shorthand way of writing 'the sum of'.

The mean may be thought of as the 'typical value' for a set of data and, as such, is the logical value for use when representing it. The choice of measure of centrality depends, therefore, on the scale of measurement of the data. For example it makes no sense to find the mean of ordinal or nominal data. All measures are suitable for interval or ratio data, so we shall consider numbers in order that all the measures can be illustrated.

Data values may also be:
• discrete, i.e. not continuous, e.g. people;
• continuous, e.g. heights;
• single - each value stated separately;
• grouped - frequencies stated for the number of cases in each value group.

You still work for the Transport Manager, investigating the usage of cars, as last week.

Example 1: Non-grouped discrete data
Nine members of staff were selected at random from those using hire cars. The number of days each left the car unused in the garage in the previous week was:
2 6 2 4 1 4 3 1 1
Rearranged in numerical order: 1 1 1 2 2 3 4 4 6
Mode = 1 day
Median = 2 days
Mean = x̄ = Σx/n = 24/9 = 2.67 days

Example 2: Grouped discrete data
Days idle for a whole staff of users of hired cars:

No. of Days unused (x)   No. of Staff (f)    fx
0                           5                  0
1                          24                 24
2                          30                 60
3                          19                 57
4                          10                 40
5                           5                 25
6                           2                 12
Total                      95                218

Mode = most common number of days = 2 days
Median = middle (48th) value = 2 days
Mean = total number of days / total number of staff = 218/95 = 2.295, about 2.3 days

Example 3: Grouped continuous data
34 employees stated their fuel costs during the previous week to be:

Value (£)      Mid-value (x)   No. of Employees (f)    fx
59 and < 60       59.5                2               119.0
60 and < 61       60.5                5               302.5
61 and < 62       61.5                4               246.0
62 and < 63       62.5                6               375.0
63 and < 64       63.5                5               317.5
64 and < 65       64.5                7               451.5
65 and < 66       65.5                3               196.5
66 and < 67       66.5                2               133.0
Total                                34              2141.0

Modal Class = £64 to £65
Median Class = class including the 17/18th user = £62 to £64
Mean = total value of invoices / total number of employees = 2141/34 = £62.97

MEASURES OF SPREAD (THE DATA MUST BE AT LEAST ORDINAL)

Range: Largest value - smallest value.

Interquartile Range: The range of the middle half of the ordered data.

Semi-interquartile Range: Half the value of the interquartile range.

Population Standard Deviation (xσn): This is also known as the 'Root Mean Square Deviation' and is calculated by squaring and adding the deviations from the mean, finding the average of the squared deviations, and then square-rooting the result, or by using your calculator. As the name suggests, we have the whole population of interest available for analysis.

s = √(Σ(x - x̄)²/n)   or   s = √(Σf(x - x̄)²/n) for frequency data

where x represents the value of the data, f is the frequency of that particular value, x̄ is the shorthand way of writing 'mean', s is the shorthand way of writing 'standard deviation' and Σ is the shorthand way of writing 'the sum of'. √ means 'take the positive square root of', as the negative root has no meaning.

The Standard Deviation is a measure of how closely the data are grouped about the mean: the larger the value, the wider the spread. It is a measure of just how precise the mean is as the value to represent the whole data set. An equivalent formula which is often used is:

s = √(Σx²/n - (Σx/n)²)   or   s = √(Σfx²/n - (Σfx/n)²) for frequency data

Sample Standard Deviation (xσn-1): Usually we are interested in the whole population but only have a sample taken from it available for analysis. It has been shown that the formula above gives a biased value, always too small. If the denominator used is (n - 1) instead of (n), the estimate is found to be more accurate:

s = √(Σ(x - x̄)²/(n - 1))   or   s = √(Σf(x - x̄)²/(n - 1)) for frequency data

As we shall see shortly on the calculator, the keys for the two standard deviations are described as xσn and xσn-1 respectively. The examples below are the same as used previously for the measures of centrality because these two parameters are generally calculated together.

Example 1: Discrete Data
The number of idle days for nine cars: 2 6 2 4 1 4 3 1 1
Rearranged in numerical order: 1 1 1 2 2 3 4 4 6
Range: 6 - 1 = 5 days
Interquartile Range: Lower quartile, one quarter of (9 + 1), = 2.5th value = 1 day; Upper quartile, three quarters of (9 + 1), = 4 days; Interquartile range = 4 - 1 = 3 days
Semi-interquartile Range: half the interquartile range = 1.5 days
(A much larger sample should really be used for this type of analysis.)
Standard Deviation: (See your calculator booklet for the method as calculators vary considerably. See the next page if you have a Casio. If still uncertain, ask in tutorials.)
For these nine numbers only = 1.633 days (xσn, shift 2 for Casios)
As estimator for the whole population = 1.732 days (xσn-1, shift 3 for Casios)

Example 2: Grouped Discrete Data
The numbers of idle days for 95 cars are:

No. of Days Idle (x)   No. of Staff (f)   Cumulative frequency
0                         5                   5
1                        24                  29
2                        30                  59
3                        19                  78
4                        10                  88
5                         5                  93
6                         2                  95
Total                    95

Range: 6 - 0 = 6 days
Interquartile Range: Lower quartile = (95 + 1)/4 = 24th value = 1 day; Upper quartile = 3(95 + 1)/4 = 72nd value = 3 days; Interquartile Range = 3 - 1 = 2 days
Semi-interquartile range: 2/2 = 1 day
Standard Deviation (from calculator): For this sample only = 1.345 days (xσn, shift 2); As estimator for population = 1.352 days (xσn-1, shift 3)
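If the software option is taken, both calculator results in Example 1 can be reproduced in code. This sketch is not from the handout; it assumes Python, whose statistics module offers both divisors:

    # Both standard deviations for the nine idle-days values (a sketch).
    from statistics import pstdev, stdev

    days = [2, 6, 2, 4, 1, 4, 3, 1, 1]
    print(round(pstdev(days), 3))   # 1.633  (divisor n: the xsigma-n key)
    print(round(stdev(days), 3))    # 1.732  (divisor n-1: the xsigma-n-1 key)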

Example 3: Grouped Continuous Data
The invoices for 34 cars:

Value (£)      Mid-value (x)   No. of Employees (f)    fx
59 and < 60       59.5                2               119.0
60 and < 61       60.5                5               302.5
61 and < 62       61.5                4               246.0
62 and < 63       62.5                6               375.0
63 and < 64       63.5                5               317.5
64 and < 65       64.5                7               451.5
65 and < 66       65.5                3               196.5
66 and < 67       66.5                2               133.0
Total                                34              2141.0

Range: £66.5 - £59.5 = £7.0 (estimate)
Interquartile Range: Usually calculated from the quartiles estimated from a cumulative frequency diagram - an ogive (see below).
Semi-interquartile range: estimated from the interquartile range from the ogive.
Standard Deviation (from calculator): For sample only = £1.929 (xσn, shift 2); As estimator for population = £1.958 (xσn-1, shift 3)

N.B. The sample standard deviation, xσn-1, is the standard deviation produced by default from both Minitab and SPSS analysis.

Use of a Casio calculator in S.D. Mode for finding mean and standard deviation
(x is the number being keyed in; for grouped data x is the value and f the frequency.)
1. Clear all memories: SHIFT AC
2. Get into Standard Deviation Mode: MODE 2
3. Input data:
   a) Single numbers: x DT, x DT, etc.
   b) Grouped data: x f DT, x f DT, etc.
4. Check sample size (n): RCL C (the red C, above hyp)
5. Output mean (x̄): SHIFT 1
6. Output population standard deviation (xσn): SHIFT 2
7. Output sample standard deviation (xσn-1): SHIFT 3

Estimation of the Mode from a Histogram
(Last week's Mileages of Cars example)
We know that the modal mileage is in the interval between 140 and 150 miles but need to estimate it more accurately. Draw 'crossed diagonals' on the protruding part of the highest column. By this method we also take into consideration the frequencies of the columns adjacent to the modal column. These two diagonals cross at an improved estimate of the modal mileage.

Histogram of Car mileage as drawn last week
[Histogram: frequency per 10 mile interval (0 to 12) against mileages (100 to 190)]

Estimated Mode: ________

Estimation of the Median and Interquartile Range from a Cumulative Frequency Diagram
(Last week's Mileages of Cars example)

Median: Dot in from 50% on the Cumulative Percentage axis to your ogive and then down to the values on the horizontal axis. The value indicated is the estimated Median for the Mileages.

Quartiles: Dot in from the 75% and the 25% values on the Cumulative Percentage axis to your ogive and then down to the values on the horizontal axis. The values indicated are the estimated Upper quartile and Lower quartile respectively of the Mileages. The difference between these measures is the Interquartile range. Half the Interquartile range is the Semi-interquartile range.

Cumulative frequency diagram
[Ogive: % cumulative frequency (0 to 100) against mileages (100 to 200)]

Estimated Median: ________
Estimated Quartiles: ________
Interquartile range: ________
Semi-Interquartile range: ________
% of users who travelled more than 170 miles: ________

COMPUTER OUTPUT: NUMERICAL SUMMARY OF EXAMPLE 3

Minitab output

Descriptive Statistics
Variable    N     Mean     Median    Tr Mean    StDev    SE Mean
Costs       34    62.971   63.000    62.970     1.958    0.336

Variable    Min      Max      Q1       Q3
Costs       59.500   66.500   61.500   64.500

SPSS output

Descriptives: INVOICES
                                       Statistic   Std. Error
Mean                                    62.971       .336
95% Confidence Interval    Lower Bound  62.288
for Mean                   Upper Bound  63.654
5% Trimmed Mean                         62.967
Median                                  63.000
Variance                                 3.832
Std. Deviation                           1.958
Minimum                                 59.5
Maximum                                 66.5
Range                                    7.0
Interquartile Range                      3.500
Skewness                                 -.043       .403
Kurtosis                                 -.899       .788

Completed diagrams from lecture handout (Not for student handouts)

Estimation of the Mode from a Histogram
(Last week's Mileages of Cars example)
We know that the modal mileage is in the interval between 140 and 150 miles but need to estimate it more accurately. Draw 'crossed diagonals' on the protruding part of the highest column. These two diagonals cross at an improved estimate of the modal mileage.

Histogram of Car mileage as drawn last week
[Completed histogram: frequency per 10 mile interval (0 to 12) against mileages (100 to 190), with crossed diagonals on the 140-150 column]

Estimated Mode: 146 miles

Estimation of the Median and Interquartile Range from a Cumulative Frequency Diagram
(Last week's Mileages of Cars example)

Median: Dot in from 50% on the Cumulative Percentage axis to your ogive and then down to the values on the horizontal axis. The value indicated is the estimated Median for the Mileages.

Quartiles: Dot in from the 75% and the 25% values on the Cumulative Percentage axis to your ogive and then down to the values on the horizontal axis. The values indicated are the estimated Upper quartile and Lower quartile respectively of the Mileages. The difference between these measures is the Interquartile range. Half the Interquartile range is the Semi-interquartile range.

[Completed ogive: % cumulative frequency (0 to 100) against mileages (100 to 200)]

Estimated Median: 146 miles
Estimated Quartiles: 134 and 156 miles
Interquartile range: 156 - 134 = 22 miles
Semi-Interquartile range: 22/2 = 11 miles
% of users who travelled more than, e.g., 170 miles = 5%

PROBABILITY

In this lecture we will consider the basic concept of probability, which is of such fundamental importance in statistics. We will calculate probabilities from symmetry and relative frequency for both single and combined events and use expected values in order to make logical decisions.

The concept of probability was initially associated with gambling. Its value was later realised in the field of business, initially in calculations for shipping insurance. Other areas of business:
• Quality control: sampling schemes based on probability, e.g. acceptance sampling.
• Stock control, e.g. probable demand for stock before delivery of the next order.
• Decision making, e.g. the use of decision trees.
• Sample design for sample surveys.

Probability is a measure of uncertainty. Probability measures the extent to which an event is likely to occur and can be calculated from the ratio of favourable outcomes to the whole number of all possible outcomes. (An outcome is the result of a trial, e.g. the result of rolling a die might be a six.) An event may be a set of outcomes, e.g. 'an even number', or a single outcome, e.g. 'a six'.

There are two main ways of assessing the probability of a single outcome occurring:
• from the symmetry of a situation, to give an intuitive expectation;
• from past experience, using relative frequencies to calculate present or future probability.

Probability can be calculated from various sources:
• simple probability calculated from symmetry;
• simple probability calculated from frequencies;
• conditional probability calculated from contingency tables;
• conditional probability calculated from tree diagrams.

We will only use contingency tables for combined probabilities in this lecture, so make sure you read Chapter 4 of Business Statistics for descriptions of tree diagrams. Make sure you complete your tutorial, if not finished in class time.

Value of Probability
Probability always has a value between 0.0, impossibility, and 1.0, certainty.
0.0 corresponds to 'impossible'
0.1 corresponds to 'extremely unlikely'
0.5 corresponds to 'evens chance'
0.8 corresponds to 'very likely'
1.0 corresponds to 'certainty'

Probability from Symmetry
We can define the probability of an event A taking place as:

P(event A) = (number of equally likely outcomes in which A occurs) / (total number of equally likely outcomes)

Invoking symmetry, there is no need to actually do any experiments. This is the basis for the theory of probability, which was developed for use in the gaming situation - 'a priori' probability.

Examples
1. Tossing 1 coin. Possibilities: Head, Tail.
P(a head) = 1/2 = 0.5     P(a tail) = 1/2 = 0.5

2. Tossing 2 coins. Possibilities: Head Head, Head Tail, Tail Head, Tail Tail.
P(2 heads) =     P(2 tails) =     P(1 head and 1 tail) =
Note that the sum of all the possible probabilities is always one. This makes sense as one of them must occur and P(certainty) = 1.

3. Rolling a single die. Sample space: 1, 2, 3, 4, 5, 6.
P(6) =     P(anything but 6) =     P(an even number) =     P(an odd number) =
Again each pair of probabilities sums to 1.

4. Rolling a pair of dice (all possibilities shown below - known as the total sample space):

1,1  1,2  1,3  1,4  1,5  1,6
2,1  2,2  2,3  2,4  2,5  2,6
3,1  3,2  3,3  3,4  3,5  3,6
4,1  4,2  4,3  4,4  4,5  4,6
5,1  5,2  5,3  5,4  5,5  5,6
6,1  6,2  6,3  6,4  6,5  6,6

P(double six) =     P(more than 8) =     P(less than 4) =
P(an even total) =     P(any double) =     P(at least one six) =
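Because every outcome in the sample space is equally likely, these probabilities can also be found by brute-force enumeration. The sketch below is not part of the handout (Python assumed); it counts favourable outcomes over the 36 possibilities:

    # Enumerate the 36 outcomes for a pair of dice (a sketch only).
    outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

    def prob(event):
        return sum(1 for o in outcomes if event(o)) / len(outcomes)

    print(prob(lambda o: o == (6, 6)))    # 1/36, a double six
    print(prob(lambda o: sum(o) > 8))     # 10/36, a total of more than 8
    print(prob(lambda o: o[0] == o[1]))   # 6/36, any double
    print(prob(lambda o: 6 in o))         # 11/36, at least one six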

Probability from Frequency
An alternative way to look at probability would be to carry out an experiment to determine the proportion of favourable outcomes in the long run. The relative frequency of an event is simply the proportion of the possible times it has occurred. As the number of trials increases, the relative frequency demonstrates a long-term tendency to settle down to a constant value. This constant value is the probability of the event. It is the long-term tendency which gives the estimated probability.

P(Event E) = (number of times E occurs) / (number of possible occurrences)

This is the frequency definition of probability. But how long is this 'long term'?

Table of Relative Frequencies from one experiment of flipping a coin:

Total number of throws   Result   Number of heads   Proportion of heads
1                        H            1                 1.00
2                        H            2                 1.00
3                        T            2                 0.67
4                        T            2                 0.50
5                        H            3                 0.60
6                        T            3                 0.50
7                        T            3                 0.43
8                        (if H)      (4)               (0.50)
etc.

NOTE: The outcome of an individual trial is either 1 or 0 heads. In this experiment the 'long term' would be the time necessary for the relative frequency to settle down to two decimal places. Also: n must be a large number - the larger n is, the better the estimate of the probability.
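The 'settling down' of the relative frequency is easy to watch in a simulation. The following sketch is not in the handout; it assumes Python and fixes the random seed so the illustrative run is repeatable:

    # Relative frequency of heads over many simulated tosses (a sketch).
    import random

    random.seed(1)       # fixed seed: repeatable, illustrative run
    heads = 0
    for n in range(1, 10001):
        heads += random.randint(0, 1)     # 1 = head, 0 = tail
        if n in (10, 100, 1000, 10000):
            print(n, heads / n)           # drifts towards 0.5 as n grows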

Contingency tables
Another alternative is that probabilities can be estimated from contingency tables, this time from past experience. Contingency tables display all possible combined outcomes and the frequency with which each of them has happened in the past. These frequencies are used to calculate future probabilities. Relative frequency is used as before.

Example 5
A new supermarket is to be built in Bradfield. In order to estimate the requirements of the local community, a survey was carried out with a similar community in a neighbouring area. Part of the results are summarised in the table below:

                              Expenditure on drink
Mode of travel    None    1p and under £20    At least £20    Total
On foot            40            20                10           70
By bus             30            35                15           80
By car             25            33                42          100
Total              95            88                67          250

Suppose we put all the till invoices into a drum and thoroughly mix them up. If we close our eyes and take out one invoice, we have selected one customer at random. We first complete all the row and column totals.

From the 'sub-totals' we can now calculate, for example:
P(customer spends at least £20) = 67/250 = 0.268
P(customer will travel by car) =

From the 'cells' we can calculate, for example:
P(customer spends over £20 and travels by car) = 42/250 = 0.168
P(customer arrives on foot and spends no money on drink) =

P(customer arrives on foot or spends no money on drink) = (40 + 20 + 10 + 30 + 25)/250 = 0.50
P(customer spends over £20 or travels by car) =

These values are probably more easily calculated by adding the row and column totals and then subtracting the intersecting cell, which has been included twice:
P(customer arrives on foot or spends no money on drink) = (70 + 95 - 40)/250 = 0.50
P(customer spends over £20 or travels by car) =

Sometimes we need to select more than one row or column:
P(a customer will spend less than £20) = (95 + 88)/250 = 0.732
P(a customer will not travel by car) =
P(a customer will not travel by car and will spend less than £20) =
P(a customer will not travel by car or will spend less than £20) =

CONDITIONAL PROBABILITY
In the examples above all the customers have been under consideration, without any condition being applied which might exclude any of them from the overall ratio. Often not all are included, as some condition applies. If the probability of the outcome of a second event depends upon the outcome of a previous event, then the second is conditional on the result of the first. This does not imply a time sequence, but simply that we are asked to find the probability of an event given additional information - an extra condition. A shorthand method of writing 'if it is known that ...' or 'given that ...' is '|'.

If we need P(a customer spends at least £20 | he/she travelled by car), we eliminate from the choice all those who did not arrive in a car, i.e. if we know that the customer travelled by car, all the other customers who did not are excluded from the calculations. We are therefore only interested in the third row of the table.

P(a customer spends at least £20 | he/she travelled by car) = 42/100 = 0.420
P(a customer came by car | he/she spent at least £20 on drink) = 42/67 = 0.627

Note that P(spending at least £20 | travelling by car) is not equal to P(coming by car | spending at least £20 on drink).

P(a customer spends at least £20 | he/she travelled by bus) =
P(a customer spends less than £20 | he/she did not travel by car) =
P(a customer came on foot | he/she spent at least £20 on drink) =

Many variations on the theme of probability have been covered in this lecture, but they all boil down to answering the question 'how many out of how many?'
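All of these 'how many out of how many?' ratios can be read mechanically from the table. A sketch follows - not in the handout, Python assumed, with the cell names invented for illustration:

    # The Bradfield survey table and two probabilities from it (a sketch).
    table = {                    # rows: mode of travel; columns: expenditure
        'foot': {'none': 40, 'under20': 20, 'atleast20': 10},
        'bus':  {'none': 30, 'under20': 35, 'atleast20': 15},
        'car':  {'none': 25, 'under20': 33, 'atleast20': 42},
    }
    total = sum(sum(row.values()) for row in table.values())     # 250

    print(table['car']['atleast20'] / total)      # 0.168, joint probability
    car_total = sum(table['car'].values())        # 100, the conditioning row
    print(table['car']['atleast20'] / car_total)  # 0.42, conditional on 'car'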


EXPECTED VALUES
A commonly used method in decision-making problems is the consideration of expected values. The expected value for each decision is calculated and the option with the maximum or minimum value (dependent upon the situation) is selected. The expected value of each decision is defined by:

E(x) = Σpx

where x is the value associated with each outcome, E(x) is the expected value of the event x and p is the probability of it happening.

If you bought 5 out of 1000 raffle tickets for a prize of £25, the expected value of your winnings would be 0.005 × £25 = £0.125. If there was also a £10 prize, you would also expect to win 0.005 × £10 = £0.05, giving £0.175 in total. (Strictly speaking there are only 999 tickets left for the second prize, but the difference is negligible.) Clearly this is only a theoretical value, as in reality you would win either £25, £10 or £0.

Example 6: Two independent operations A and B are started simultaneously. The times for the operations are uncertain, with the probabilities given below:

Operation A                              Operation B
Duration (days) (x)   Probability (p)    Duration (days) (x)   Probability (p)
1                     0.0                1                     0.1
2                     0.5                2                     0.2
3                     0.3                3                     0.5
4                     0.2                4                     0.2

Determine whether A or B has the shorter expected completion time.
Operation A: E(x) = Σpx = (1 × 0.0) + (2 × 0.5) + (3 × 0.3) + (4 × 0.2) = 2.7 days
Operation B: E(x) = Σpx = (1 × 0.1) + (2 × 0.2) + (3 × 0.5) + (4 × 0.2) = 2.8 days
Hence Operation A has the shorter expected completion time.

Example 7: A marketing manager is considering whether it would be more profitable to distribute his company's product on a national or on a more regional basis. Given the following data, what decision should be made?

                  National distribution                Regional distribution
Level of Demand   Net profit £m (x)   Prob. demand     Net profit £m (x)   Prob. demand
                                      is met (p)                           is met (p)
High              4.0                 0.50             2.5                 0.50
Medium            2.0                 0.25             2.0                 0.25
Low               0.5                 0.25             1.2                 0.25

Level of Demand   National: x     p       px      Regional: x     p       px
High                 4.0        0.50     ____        2.5        0.50     ____
Medium               2.0        0.25     ____        2.0        0.25     ____
Low                  0.5        0.25     ____        1.2        0.25     ____
                        Expected profits = Σpx  ____                     ____
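The definition E(x) = Σpx translates directly into code. The sketch below is not part of the handout (Python assumed); it reproduces Example 6, and the same function will fill in the Example 7 table above:

    # Expected completion times for Example 6 (a sketch only).
    def expected(pairs):
        return sum(p * x for x, p in pairs)    # E(x) = sum of p times x

    op_a = [(1, 0.0), (2, 0.5), (3, 0.3), (4, 0.2)]
    op_b = [(1, 0.1), (2, 0.2), (3, 0.5), (4, 0.2)]
    print(expected(op_a))   # 2.7 days
    print(expected(op_b))   # 2.8 days, so Operation A is shorter on average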


NORMAL PROBABILITY DISTRIBUTIONS
The normal probability distribution is one of many which can be used to describe a set of data. The choice of an appropriate distribution depends upon the 'shape' of the data and whether it is discrete or continuous. You should by now be able to recognise whether the data is discrete or continuous, and a brief look at a histogram should give you some idea about its shape. In Chapter 5 of Business Statistics the four commonly occurring distributions below are studied. These cover most of the different types of data. This lecture will concentrate on just the normal distribution, but you should be aware of the others.

Data Type     Shape        Distribution
Discrete      Symmetric    Binomial
Discrete      Skewed       Poisson
Continuous    Symmetric    Normal
Continuous    Skewed       Exponential

The Normal distribution is found to be a suitable model for many naturally occurring variables which tend to be symmetrically distributed about a central modal value - the mean. If the variable described is continuous, its probability function is described as a probability density function which always has a smooth curve under which the area enclosed is unity.

The Characteristics of any Normal Distribution
The normal distribution approximately fits the actual observed frequency distributions of many naturally occurring phenomena, e.g. human characteristics such as height, weight, IQ etc., and also the output from many processes, e.g. weights, volumes, etc. There is no single normal curve, but a family of curves, each one defined by its mean, μ, and standard deviation, σ; μ and σ are called the parameters of the distribution.


As we can see, the curves may have different centres and/or different spreads but they all have certain characteristics in common:
• The curve is bell-shaped.
• It is symmetrical about the mean (μ).
• The mean, mode and median coincide.

The Area beneath the Normal Distribution Curve
No matter what the values of μ and σ are for a normal probability distribution, the total area under the curve is equal to one. We can therefore consider partial areas under the curve as representing probabilities. The partial area between a stated number of standard deviations below and above the mean is always the same, as illustrated below. The exact figures for the whole distribution are to be found in standard normal tables.

Note that the curve neither finishes nor meets the horizontal axis at μ ± 3σ; it only approaches it and actually goes on indefinitely. Even though all normal distributions have much in common, they will have different numbers on the horizontal axis depending on the values of the mean and standard deviation and the units of measurement involved. We cannot therefore make use of the standard normal tables at this stage.

The Standardised Normal Distribution No matter what units are used to measure the original data, the first step in any calculation is to transform the data so that it becomes a standardised normal variate following the standard distribution which has a mean of zero and a standard deviation of one. The effect of this transformation is to state how many standard deviations a particular given value is away from the mean. This standardised normal variate is without units as they cancel out in the calculation. The standardised normal distribution is symmetrically distributed about zero; uses 'z-values' to describe the number of standard deviations any value is away from the mean; and approaches the x-axis when the value of z exceeds 3.

The formula for calculating the exact number of standard deviations (z) away from the mean (μ) is:

z = (x - μ)/σ

The process of calculating z is known as standardising, producing the standardised value which is usually denoted by the letter z. Knowing the value of z enables us to find, from the Normal tables, the area under the curve between the given value (x) and the mean (μ), therefore providing the probability of a value being found between these two values. This probability is denoted by the letter Q, which stands for quantile, though some tables may differ. (See back page.)

Finding Probabilities under a Normal Curve
The steps in the procedure are:
• Draw a sketch of the situation.
• Standardise the value of interest, x, to give z.
• Use the standard tables to find the area under the curve associated with z.
• If necessary, combine the area found with another to give the required area.
• Convert this, as required by the question, to: a probability, using the area as found since total area = 1; a percentage, multiplying the area by 100 since total area = 100%; or a frequency, multiplying the area by the total frequency.

Example 1 (Most possible variations of questions are included here!)
In order to estimate likely expenditure by customers at a new supermarket, a sample of till slips from a similar supermarket describing the weekly amounts spent by 500 randomly selected customers was analysed. These data were found to be approximately normally distributed with a mean of £50 and a standard deviation of £15. Using this knowledge we can find the following information for shoppers at the new supermarket.

The probability that any shopper selected at random:
a) spends more than £80 per week;
b) spends more than £50 per week.
The percentage of shoppers who are expected to:
c) spend between £30 and £80 per week;
d) spend between £55 and £70 per week.
The expected number of shoppers who will:
e) spend less than £70 per week;
f) spend between £37.50 and £57.50 per week.

a) The probability that a shopper selected at random spends more than £80 per week
We need P(x > £80). μ = £50, σ = £15, x = £80.
First standardise: z = (x - μ)/σ = (80 - 50)/15 = 30/15 = 2.00
From tables: z = 2.00 gives Q = 0.4772 (z in the margin, Q in the body of the tables).
Therefore: P(x > £80) = 0.5 - 0.4772 = 0.0228

b) The probability that a shopper selected at random spends more than £50 per week
No need to do any calculations for this question. The mean is £50 and, because the distribution is normal, so is the median. Half the shoppers, 250, are therefore expected to spend more than £50 per week. Using the frequency definition of probability: 250/500 = 0.5

c) The percentage of shoppers who are expected to spend between £30 and £80 per week
We first need P(£30 < x < £80), then we convert the result to a percentage. μ = £50, σ = £15, x1 = £30, x2 = £80.
Our normal tables provide the area between a particular value and the mean, so the area between £30 and £80 needs splitting at the mean so that the partial areas, £30 to £50 and £50 to £80, can be calculated separately and then the two parts recombined.
P(£50 < x < £80): from (a), x = 80 gives z2 = 2.00, so Q2 = 0.4772
P(£30 < x < £50): z1 = (x - μ)/σ = (30 - 50)/15 = -20/15 = -1.333
(The table values are all positive, so when z is negative we invoke the symmetry of the situation and use its absolute value in the table.)
From tables: z1 = -1.333 gives Q1 = 0.4088 by interpolation (Q1 lies between 0.4082 and 0.4099 and is a third of the distance between them: 0.4082 + 0.0006 = 0.4088).
Therefore: P(£30 < x < £80) = Q1 + Q2 = 0.4088 + 0.4772 = 0.8860
The whole area is equivalent to 100%, so 0.8860 of it = 88.6%

d) The percentage of shoppers expected to spend between £55 and £70 per week
We first need P(£55 < x < £70), then we convert the result to a percentage. μ = £50, σ = £15, x1 = £55, x2 = £70.
We now need to find the area between the mean and £70 and then subtract from it the area between the mean and £55.
P(£50 < x < £55): z1 = (x - μ)/σ =             From tables Q1 =
P(£50 < x < £70): z2 = (x - μ)/σ =             From tables Q2 =
Therefore: P(£55 < x < £70) = Q2 - Q1 =
% of shoppers expected to spend between £55 and £70 =

e) The expected number of shoppers who will spend less than £70 per week
We first need P(x < £70), which is then multiplied by the total frequency, 500. μ = £50, σ = £15, x = £70.
The area we need is that between the mean and £70, plus the 0.5 which falls below the mean.
First standardise: z =             From tables: Q =
Therefore: P(x < £70) =
The expected number of shoppers spending less than £70 is

f) The expected number of shoppers who will spend between £37.50 and £57.50 per week
We first need P(£37.50 < x < £57.50), which is then multiplied by the total frequency. μ = £50, σ = £15, x1 = £37.50, x2 = £57.50. Note that £37.50 falls below the mean and £57.50 above it.

Finding Values from Given Proportions
In parts (a) to (f) of this example we were given the value of x and had to find the area under the normal curve associated with it. Another type of problem gives the area and asks for the associated value of x, in this case the value of the shopping basket. Carrying on with the same example, we shall find the following information:

The value below which:
g) 70% of the customers are expected to spend;
h) 45% of the customers are expected to spend.
The value expected to be exceeded by:
i) 10% of the customers;
j) 80% of the customers.
The value below which:
k) 350 of the shoppers are expected to spend;
l) 100 of the shoppers are expected to spend.

g) The value below which 70% of the shoppers are expected to spend
μ = £50, σ = £15, x = £?
We first need to find the value of Q. 70% of the total area is below x, so 20% is between μ and x, so Q = 0.2000.
From tables, if Q = 0.2000 (in the body of the table) then z = 0.524 (in the margin of the table, used the other way round; by interpolation, as z lies between 0.52 and 0.53 and is a little nearer to 0.52).
We now know the value of z and need to find x. Using the standardising formula:
z = (x - μ)/σ, so 0.524 = (x - 50)/15, so 15 × 0.524 = x - 50, so 7.86 = x - 50, so x = £57.86
The value below which 70% of customers spend is £57.86.
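Reading the table 'the other way round' is an inverse problem, which can also be solved numerically. This sketch is not in the handout; it assumes Python and uses math.erf for the exact normal areas, with a simple bisection in place of the printed table:

    # Solve part (g) numerically: find z giving an area of 0.2000 (a sketch).
    from math import erf, sqrt

    def area_mean_to_z(z):
        return 0.5 * erf(z / sqrt(2))    # area between the mean and z

    lo, hi = 0.0, 4.0
    for _ in range(50):                  # bisection on the increasing area
        mid = (lo + hi) / 2
        if area_mean_to_z(mid) < 0.2:
            lo = mid
        else:
            hi = mid
    z = (lo + hi) / 2
    print(round(z, 3))             # 0.524, as interpolated from the table
    print(round(50 + z * 15, 2))   # 57.87

The last line differs from the handout's £57.86 only because the table value of z was rounded to three figures.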

If your algebra is a bit 'iffy', try using the fact that this value of z, 0.524, tells us that x is 0.524 standard deviations above the mean. Its value is therefore x = μ + 0.524σ = 50 + 0.524 × 15 = £57.86.

h) The value below which 45% of the shoppers are expected to spend
μ = £50, σ = £15, x = £?
45% is below x, so 5% lies between x and μ, so Q = 0.05.
From tables, if Q = 0.05 then z = 0.126. We know, however, that x is below the mean, so z is negative: z = -0.126.
Using the standard formula: z = (x - μ)/σ, so
The value below which 45% of customers spend is
Alternatively, x is 0.126 standard deviations below the mean: x = μ - 0.126σ = 50 - 0.126 × 15 = £48.11

i) The value expected to be exceeded by 10% of till slips
μ = £50, σ = £15, x = £?
10% is above x, so Q =
From tables, if Q =             then z =
Using the standard formula: z = (x - μ)/σ, so
The value exceeded by 10% of the till slips is
Alternatively: x = μ + zσ =

The remainder of the questions can be completed similarly in your own time.

Answers to all the questions:
(a) 0.0228   (b) 0.5   (c) 88.6%   (d) 27.8%   (e) 454   (f) 245
(g) £57.86   (h) £48.11   (i) £69.23   (j) £37.37   (k) £57.88   (l) £37.88
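Answers (a) to (d) can also be verified against the exact normal distribution function rather than the table at the back. A sketch, not part of the handout, assuming Python:

    # Check the worked answers with the exact normal CDF (a sketch only).
    from math import erf, sqrt

    def cdf(x, mu=50.0, sigma=15.0):
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    print(round(1 - cdf(80), 4))                 # (a) 0.0228
    print(round(1 - cdf(50), 4))                 # (b) 0.5
    print(round(100 * (cdf(80) - cdf(30)), 1))   # (c) 88.6 per cent
    print(round(100 * (cdf(70) - cdf(55)), 1))   # (d) 27.8 per cent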

40

Table 1  AREAS UNDER THE STANDARD NORMAL CURVE

[Table body: areas under the standard normal curve between the mean and z, tabulated for z = 0.00 to 3.49 (rows 0.0 to 3.4, columns 0.00 to 0.09); for example z = 1.96 → 0.4750 and z = 2.00 → 0.4772.]

41

ESTIMATION OF POPULATION PARAMETERS

Last week you became familiar with the normal distribution. We now estimate the parameters of a normally distributed population by analysing a sample taken from it. In this lecture we will be concentrating on the estimation of percentages and means of populations, but do note that any population parameter can be estimated from a sample.

It is usually neither possible nor practical to examine every member of a population, so we use the data from a sample, taken from the same population, to estimate the 'something' we need to know about the population itself. The sample will not provide us with the exact 'truth' but it is the best we can do. We also use our knowledge of samples to estimate limits within which we can expect the 'truth' about the population to lie, and state how confident we are about this estimation. In other words, instead of claiming that the mean cost of buying a small house is, say, exactly £75 000, we say that it lies between £70 000 and £80 000.

Sampling

Sampling theory takes a whole lecture on its own! Since any result produced from the sample can be used to estimate the corresponding result for the population, it is absolutely essential that the sample taken is as representative as possible of that population. Common sense rightly suggests that the larger the sample the more representative it is likely to be, but also the more expensive it is to take and analyse. A random sample is ideal for statistical analysis but, for various reasons, other methods have also been devised for when this ideal is not feasible. We will not study sampling in this lecture but just give a list of the main methods below. More details can be found in Section 6.3 of Business Statistics.

• Simple Random Sampling
• Systematic Sampling
• Stratified Random Sampling
• Multistage Sampling
• Cluster Sampling
• Quota Sampling

Types of parameter estimate

The two types of estimate of a population parameter are referred to as:

• Point estimate: one particular value.
• Interval estimate: an interval centred on the point estimate.

42

Point Estimates of Population Parameters

From the sample, a value is calculated which serves as a point estimate for the population parameter of interest.

a) The best estimate of the population percentage, π, is the sample percentage, p.

b) The best estimate of the unknown population mean, μ, is the sample mean, x̄ = Σx/n. This estimate of μ is often written μ̂ and referred to as 'mu hat'.

c) The best estimate of the unknown population standard deviation, σ, is the sample standard deviation, s, where:

s = √( Σ(x − x̄)² / (n − 1) )

This is obtained from the [xσn−1] key on the calculator. N.B. s = √( Σ(x − x̄)² / n ) from the [xσn] key is not used, as it underestimates the value of σ.

Example 1: The Accountant at a very small supermarket wishes to obtain some information about all the invoices sent out to its account customers. In order to obtain an estimate of this information, a sample of twenty invoices is randomly selected from the whole population of invoices. Use the results below obtained from them to obtain point estimates for:
1) the percentage of invoices in the population, π, which exceed £40;
2) the population mean, μ;
3) the population standard deviation, σ.

Values of Invoices (£):
[Table: the twenty sampled invoice values; four of them exceed £40 and they total £658.44]

1) p = (4/20) × 100 = 20%  ⇒  π̂ = 20%

2) x̄ = 658.44/20 = £32.92  ⇒  μ̂ = £32.92

3) s (from xσn−1) = £7.12  ⇒  σ̂ = £7.12
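The three point estimates are easily computed in software. A minimal sketch follows (not from the handout; Python's statistics module is assumed, and `invoices` stands for the twenty sampled values):

    import statistics

    def point_estimates(invoices, threshold=40):
        n = len(invoices)
        p_hat = 100 * sum(v > threshold for v in invoices) / n   # estimate of pi, as a %
        mu_hat = statistics.mean(invoices)                       # estimate of mu
        sigma_hat = statistics.stdev(invoices)                   # n-1 divisor: the [x sigma n-1] key
        return p_hat, mu_hat, sigma_hat

For the sample above this returns 20%, £32.92 and £7.12.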

43

Interval Estimate of a Population Parameter (Confidence Interval)

Sometimes it is more useful to quote two limits between which the parameter is expected to lie, together with the probability of it lying in that range. The limits are called the confidence limits, and the interval between them the confidence interval.

The width of the confidence interval depends on three sensible factors:
a) the degree of confidence we wish to have in it, i.e. the probability of it including the 'truth', e.g. 95%;
b) the size of the sample, n;
c) the amount of variation among the members of the sample: if estimating means, the standard deviation, s.

These last two parameters are used to calculate the standard error, s/√n, which is also referred to as the standard deviation of the mean.

The confidence interval is therefore an interval centred on the point estimate, in this case either a percentage or a mean. The width of the interval is dependent on the confidence we need to have that it does in fact include the population parameter. In practice we tend to say that we are 95% confident that our interval includes the true population value.

The number of standard errors included in the interval is found from statistical tables: either the normal table or the t-table. Always use the normal tables for percentages, which need large samples. For means the choice of table depends on the sample size and on the population standard deviation:

                      Population standard deviation
Sample size    Known: standard error = σ/√n    Unknown: standard error = s/√n
Large          Normal tables                   Normal tables
Small          Normal tables                   t-tables

Interpretation of confidence intervals

How do we interpret a confidence interval? If 100 similar samples were taken and analysed then, for a 95% confidence interval, we are confident that 95 of the intervals calculated would include the true population mean. Note that there is only one true value for the population mean; it is the variation between samples which gives the range of confidence intervals.

44

Confidence Intervals for a Percentage or Proportion

The only difference between calculating the interval for percentages or for proportions is that the former total 100 and the latter total 1. This difference is reflected in the formulae used; otherwise the methods are identical. Percentages are probably the more commonly calculated, so in Example 2 we will estimate a population percentage.

The confidence interval for a population percentage or proportion, π, is given by:

π = p ± z √( p(100 − p)/n )  for a percentage,  or  π = p ± z √( p(1 − p)/n )  for a proportion

where: π is the unknown population percentage or proportion being estimated;
p is the sample percentage or proportion, i.e. the point estimate for π;
n is the sample size;
z is the appropriate value from the normal tables.

The formulae √( p(100 − p)/n ) and √( p(1 − p)/n ) represent the standard errors of a percentage and a proportion respectively. The samples must be large (> 30), so that the normal table may be used in the formula.

The value of z, from the normal table, depends upon the degree of confidence required, e.g. 95%. We are prepared to be incorrect in our estimate 5% of the time, and confidence intervals are always symmetrical, so in the tables we look for Q to be 5%, two tails. We therefore estimate the confidence limits as being at z standard errors either side of the sample percentage or proportion.

Example 2

In order to investigate shopping preferences at a supermarket, a random sample of 175 shoppers were asked whether they preferred the bread baked in-store or that from the large national bakeries. 112 of those questioned stated that they preferred the bread baked in-store. Find the 95% confidence interval for the percentage of all the store's customers who are likely to prefer in-store baked bread.

The point estimate for the population percentage, π, is p = (112/175) × 100 = 64%

Use the formula:  π = p ± z √( p(100 − p)/n )  where p = 64 and n = 175.

From the short normal table, 95% confidence ⇒ 5%, two tails ⇒ z = 1.96

π = 64 ± 1.96 √( (64 × 36)/175 )  =

so the confidence limits for the population percentage are        < π <        .
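The same interval can be produced in a few lines of code. This is an illustrative sketch only (Python/scipy assumed), not part of the handout:

    import math
    from scipy.stats import norm

    p, n = 64.0, 175                     # sample percentage and sample size
    z = norm.ppf(0.975)                  # 1.96 for 95% confidence, two tails
    se = math.sqrt(p * (100 - p) / n)    # standard error of a percentage
    print(round(p - z * se, 1), round(p + z * se, 1))   # ~56.9 and ~71.1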

45

Confidence Interval for the Population Mean, μ, when the population standard deviation, σ, is known

As we actually know the population standard deviation, we do not need to estimate it from the sample standard deviation. The normal table can therefore be used to find the number of standard errors in the interval.

Confidence interval:  μ = x̄ ± z (σ/√n),  where z comes from the short normal table.

Example 3: For the supermarket chain as a whole it is known that the standard deviation of the wages for part-time employees is £1.50. A random sample of 10 employees from the small supermarket branch gave a mean wage of £4.15 per hour. Assuming the same standard deviation, calculate the 95% confidence interval for the average hourly wage for employees of the small branch, and use it to see whether the figure could be the same as for the whole chain of supermarkets, which has a mean value of £4.50.

μ = x̄ ± z (σ/√n)  ⇒  4.15 ± 1.96 × 1.50/√10  =

This interval includes the mean, £4.50, for the whole chain, so the average hourly wage could be the same for all employees of the small supermarket.

Confidence Interval for the Population Mean, μ, when the population standard deviation is unknown

The population standard deviation is unknown, so it needs estimating from the sample standard deviation, s. The t-table must then be used, to compensate for the probable error in estimating σ from the sample standard deviation, with n − 1 degrees of freedom (ν = n − 1).

Confidence interval:  μ = x̄ ± t (s/√n),  where the value of t comes from the table of 'percentage points of the t-distribution' using n − 1 degrees of freedom.

Example 4: Find the 99% confidence interval for the mean value of all the invoices sent out by the small supermarket branch in Example 1. If the average invoice value for the whole chain is £38.50, is the small supermarket in line with the rest of the branches?

From Example 1: x̄ = £32.92, s = £7.12, n = 20; degrees of freedom = (20 − 1) = 19; 99% confidence ⇒ from tables t = 2.86

μ = x̄ ± t (s/√n)  =

This interval does not include £38.50, so the small branch is out of line with the rest.
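Both intervals can be reproduced as follows. The sketch assumes Python/scipy, with the normal distribution for the known-σ case and the t distribution for the estimated-σ case:

    import math
    from scipy.stats import norm, t

    # Example 3: sigma known -> normal table
    xbar, sigma, n = 4.15, 1.50, 10
    z = norm.ppf(0.975)                                  # 95%, two tails
    half = z * sigma / math.sqrt(n)
    print(round(xbar - half, 2), round(xbar + half, 2))  # ~3.22 to ~5.08, includes £4.50

    # Example 4: sigma estimated by s -> t table with n - 1 degrees of freedom
    xbar, s, n = 32.92, 7.12, 20
    tval = t.ppf(0.995, n - 1)                           # 99%, two tails -> 2.861
    half = tval * s / math.sqrt(n)
    print(round(xbar - half, 2), round(xbar + half, 2))  # ~28.37 to ~37.47, excludes £38.50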

for example.06 v 6. they were quite separate with the later sample giving the higher interval then the campaign must have been effective in increasing sales.70. If.60 s 2. Example 5 The till slips of supermarket customers were sampled both before and after an advertising campaign and the results were analysed with the following results: Before: x = £37. n = 25 so 24 degrees of freedom giving t = 2.36 After: QA ! xA s t sA nA   Interpretation: The sample mean had risen considerably but. if two samples could have come from the same population as judged by their means.60 s 2. . Before: QB ! xB s t sB nB   37. s = £6. There may be no difference between the two means so the advertising campaign has not been proved to be successful. For both. because the confidence intervals overlap. the mean values for all the sales may lie in the common ground. If the intervals were found to overlap the means could be the same so the campaign might have had no effect. If. a supermarket chain wished to see if a local advertising campaign was successful or not they could take a sample of customer invoices before the campaign and another after the campaign and calculate confidence intervals for the mean spending of all customers at both times. We shall improve on this method in the next example. 2-tails. . s = £5.30.46 Comparison of Means using 'Overlap' in Confidence Intervals We are going to extend the previous method to see if two populations could have the same mean or. We assume that the standard deviations of both populations are the same. on the other hand.06 for 5%.78.76 £34.60. n = 25 After: x = £41.84 < Q < £40.70 25 ! 37. n = 25 Has the advertising campaign been successful in increasing the mean spending of all the supermarket customers? Calculate two 95% confidence intervals and compare the results. alternatively.

47

Confidence Intervals for Paired Data

If two measures are taken from each case, e.g. 'before' and 'after', then the 'change' or 'difference' for each case can be calculated, and a confidence interval found for the mean of the 'changes'. These two populations, 'before' and 'after', are not independent. If the data can be 'paired' in this way, then this method should be used, as a smaller interval is produced for the same percentage confidence, giving a more precise estimate.

Confidence interval:  μd = x̄d ± t sd/√nd,  where x̄d, sd and nd refer to the calculated differences.

Example 6

The supermarket statistician realised that there was a considerable range in the spending power of its customers. Even though the overall spending seemed to have increased, the high spenders still spent more than the low spenders, so the individual increases would show a smaller spread; the data, in other words, is not independent. Before the next advertising campaign at the supermarket, he took a random sample of 10 customers, A to J, and collected their till slips. After the campaign, slips from the same 10 customers were collected and both sets of data recorded. Using the paired data, has there been any mean change at a 95% confidence level?

[Table: till-slip totals (£) for the same ten customers A to J, recorded before and after the campaign]

We first need to calculate the differences. The direction doesn't matter, but it seems sensible to take the earlier amounts away from the later ones to find the changes. We can then forget the original two data sets and just work with the differences. From the differences:

x̄d = £3.28,  sd = £3.37,  nd = 10

95% C.I.:  μd = x̄d ± t sd/√nd  ⇒  3.28 ± 2.26 × 3.37/√10  ⇒  3.28 ± 2.41  ⇒  £0.87 < μd < £5.69

This interval does not include zero, so the possibility of 'no change' has been eliminated. Because the mean amount spent after the campaign is greater than that before, there has been a significant increase in spending.

In general: If the confidence interval includes zero, i.e. it changes from negative to positive, then there is a possibility that no change has taken place and the original situation has remained unchanged. If both limits have the same sign, then zero is excluded and some change must have taken place. When interpreting this change, look carefully at its direction, as this will depend on your order of subtraction. In the above case it is obvious that x̄d is positive, so there has been an increase in the average spending.
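Working directly from the summary statistics of the differences, the interval is quickly checked (a sketch only; Python/scipy assumed):

    import math
    from scipy.stats import t

    xd, sd, nd = 3.28, 3.37, 10          # mean, sd and count of the ten differences
    tval = t.ppf(0.975, nd - 1)          # 2.262 for 95% with 9 degrees of freedom
    half = tval * sd / math.sqrt(nd)
    print(round(xd - half, 2), round(xd + half, 2))   # ~0.87 and ~5.69; zero excluded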

48

COMPUTER ANALYSIS OF EXAMPLES 1 AND 6

Confidence Intervals

Example 1 (Minitab)

Variable    N    Mean   StDev   SE Mean       95.0 % CI
Invoices   20   32.92    7.12      1.59   (29.59, 36.25)

Example 1 (SPSS) Descriptives
[Table: mean 32.92 with standard error, 95% confidence interval for mean (29.59, 36.25), 5% trimmed mean, median, variance, std. deviation, minimum, maximum, range, interquartile range, skewness and kurtosis]

Example 6 (Minitab)

Variable    N   Mean   StDev   SE Mean      95.0 % CI
Diffs      10   3.28    3.37      1.06   (0.87, 5.69)

Example 6 (SPSS) Descriptives for DIFFS
[Table: mean 3.2830, std. error 1.0649, 95% confidence interval for mean (0.8739, 5.6921), variance 11.341, std. deviation 3.3676, plus 5% trimmed mean, median, minimum, maximum, range, interquartile range, skewness and kurtosis]

49

Table 2  PERCENTAGE POINTS OF THE t-DISTRIBUTION

[Table body: one- and two-tailed percentage points of t for ν = 1 to 12, 15, 20, 24, 30, 40, 60 and ∞; for example ν = 9, 5% two-tailed → 2.26.]

ν = number of degrees of freedom; α = total hatched area in the tails.

Table 3  % POINTS OF THE STANDARD NORMAL CURVE

[Table body: one- and two-tailed percentage points of z; for example 5% one-tailed → 1.645, 5% two-tailed → 1.96, 1% two-tailed → 2.58.]

50

HYPOTHESIS TESTING

Last week we estimated an unknown population parameter from the corresponding statistic obtained from the analysis of a sample from the population. This week we shall similarly analyse a sample and then use its statistic to see whether some claim made about the population is reasonable or not. Some conclusion about the population is drawn from evidence provided by the sample. When an estimate from a sample is used to test some belief, claim or hypothesis about the population, the process is known as hypothesis testing (significance testing). We shall test proportions and means.

A formal procedure is used for this technique. The method, which is common to all hypothesis tests (e.g. a claim has been made that the average male salary in a particular firm is £15 000):

1  State the null hypothesis (H0). This is a statement about the population which may, or may not, be true. It takes the form of an equation making a claim about the population, e.g. H0: μ = £15 000, where μ is the population mean, i.e. the mean salary for the whole firm could be £15 000.

2  State the alternative hypothesis (H1). This is the conclusion reached if the null hypothesis is rejected. It will include one of the terms: not equal to (≠), greater than (>), or less than (<). E.g. H1: μ ≠ £15 000 (the mean salary is not £15 000), or H1: μ > £15 000, or H1: μ < £15 000.

3  State the significance level (α) of the test: the proportion of the time you are willing to reject H0 when it is in fact true. If not stated specifically, a 5% (0.05) significance level is used.

4  Calculate the test statistic. This value is calculated from the data obtained from the sample. From the data we find the sample statistics, such as the mean and standard deviation, which we need for calculating the test statistic; these are then substituted into the appropriate formula to find its value. (Each formula is used in one specific type of hypothesis test only, and so needs careful selection.)

5  Find the critical value. This is the value from the appropriate table which the test statistic is required to reach before we can reject H0. It depends on the significance level, and on whether the test is one- or two-tailed.

6  Reach a conclusion. Compare the values of the test statistic and the critical value. If the test statistic is greater than the critical value, then we have enough evidence from the sample data to reject the null hypothesis and conclude that the alternative hypothesis is true. If the test statistic is not greater than the critical value, then we do not have sufficient evidence and so do not reject the null hypothesis.

(N.B. We can prove that the null hypothesis is false, but we can never prove it to be true, since only a sample is available for analysis and so we do not have all the possible evidence. From computer output, the p-value is the probability of H0 being correct.)

51

Graphical interpretation of the conclusion

If the test statistic falls into the critical region, there is sufficient evidence to cause the null hypothesis to be rejected. The total shaded area is the significance level, α (alpha), expressed either as a percentage (5%) or as a decimal (0.05).

TESTING FOR THE PROPORTION, OR PERCENTAGE, OF A POPULATION

The critical value comes from the normal tables, as all sample sizes are large for this test. The methodology will seem simpler with specific numerical examples!

H0: π = c, where c is the hypothesised population proportion or percentage
H1: π ≠ c, π > c, or π < c
Significance level: as stated, or 5%.
Critical value: from the normal tables.
Test statistic: calculated by

z = (p − π) / √( π(1 − π)/n )  for a proportion;  for a percentage use  z = (p − π) / √( π(100 − π)/n )

where p and π are the sample and population proportions respectively, and n the sample size.

Conclusion: H0 is rejected if the test statistic is so large that the sample could not have come from a population with the hypothesised proportion.

Example 1

A company manufacturing a certain brand of breakfast cereal claims that 60% of all housewives prefer its brand to any other. A random sample of 300 housewives includes 165 who do prefer the brand. Is the true percentage as the company claims, or lower, at the 5% significance level?

Given information:  π = 60%?,  p = (165/300) × 100 = 55%,  n = 300

H0: π = 60%    H1: π < 60%

Critical value: normal tables, 1 tail, 5% significance: 1.645

Test statistic:  z = (p − π) / √( π(100 − π)/n ) = (55 − 60) / √( (60 × 40)/300 ) = −5/√8 = −1.77

Conclusion: The test statistic exceeds the critical value (ignoring the sign, 1.77 > 1.645), so reject H0. The percentage using the brand is < 60%.
2 tail. H0: T H1 : T p = n = Critical Value: Normal tables.52 Example 2 An auditor claims that 10% of invoices for a certain company are incorrect. Test at the 1% significant level to see if the auditor's claim is supported by the sample evidence. To test this claim a random sample of 200 invoices are checked and 24 are found to be incorrect. Test Statistic pT T. Given Information: T = 10%?.

You may assume this to be correct for any data presented here. Summarising again the choice of statistical table: Population standard deviation W s Known: standard error = Unknown: standard error = n n Normal tables: z-test Normal tables: z-test Normal tables: z-test t-tables: t-test Sample size Large Small . before deciding which test to use we have to ask whether the sample size is large or small and whether the population standard deviation is known or has to be estimated from the sample.  T 100 n Conclusion: Test statistic is less than critical value so H0 cannot be rejected. The methods used are basically the same but different formulae and tables need to be used. TESTING FOR THE MEAN OF A POPULATION The methods described in this section require the population to be measured on the interval or ratio scales and to be normally distributed. The percentage of incorrect invoices is consistent with the auditor's claim of 10%. As with confidence intervals.

where is the population mean and c is the hypothesised value Significance Level : E = 5%. If we need to estimate it then a slightly larger critical value is found from the t-table. depends on whether this estimated value has been used or not.5 1 = 2. Decide whether to reject Ho or not. (0. either a t-test or a z-test. Conclusion Compare the test statistic with the critical value. significance level. whereas if it is underfilling the firm is liable to prosecution. H0: = 150 g H1: { 150 g Significance level (E): 5% (0. n = 25. A random sample of 25 filled boxes is weighed and shows a mean net weight of 152. x Q xQ Test statistic z! t! or W/ n s/ n where x and Q are the means of the sample and population respectively and s and W are their standard deviations. 1.53 Method A mean value is hypothesised for the population. Q < c. { c. The standard deviation is known to be 5. Test statistic: z! xQ W/ n   152. Can we conclude that the machine is no longer producing the 150g. This sample mean is then used to see if the value hypothesised for the population is reasonable or not. Example 3 A packaging device is set to fill detergent packets with a mean weight of 150g. or as stated in the question. x = 152.96. If we know the population standard deviation and do not need to estimate it the normal table is used. The machine can 'no longer produce the 150 g quantities' with quantities which are either too heavy or too light therefore the appropriate test is two tailed. two tailed. It is important to check the machine periodically because if it is overfilling it increases the cost of the materials. degrees of freedom. 5%.5 g W is known so a z-test is appropriate.5  150 5 / 25 = 2.05 sig. or Q > c. The appropriate test to use.5g. A sample is taken from that population and its mean value calculated. Conclude in terms of question. If the population standard deviation is not known then that from the sample is calculated and used to estimate it. quantities? Use a 5% significance level. number of tails. Ho: = c H1: Critical value From normal (z) table or t-tables.5 .0g.05) Critical value: W known therefore normal tables. level) Given = 150 g? W = 5 g.

? tailed We are only interested in whether the mean weight has increased or not so a test is appropriate.54 Conclusion: The test statistic exceeds the critical value so reject H0 and conclude that the mean weight produced is no longer 150g. Example 4 The mean and standard deviation of the weights produced by this same packaging device set to fill detergent packets with a mean weight of 150g. 5%. level) Given = 150 g? n = 100.0 g. Can we conclude that the mean weight produced by the machine has increased? Use a 5% significance level. (0. Test statistic: z! xQ s/ n = Conclusion: The test statistic and conclude that the machine the critical value so we reject H 0 . H0: H1: Significance level ( E): Critical value: W is unknown but the sample is large therefore use the normal tables. one tailed. s = 6.05 sig. are known to drift upwards over time due to the normal wearing of some bearings.5g. x = 151. Obviously it cannot be allowed to drift too far so a large random sample of 100 boxes is taken and the contents weighed. This sample has a mean weight of 151.0 g and a standard deviation of 6.5g.

of f. x = 96.1 = 12 d. H1: Significance level ( E): 1% (0. s = 5. n .2. 1 tail. The person who devised the test asserted that the mean mark attained would be 100. at the 1% significance level. The following results were obtained with a random sample of applicants: x = 96.. Given: H0: = 100?. n = 13. = 2.55 Example 5 One sample t-test The personnel department of a company developed an aptitude test for screening potential employees. 1% sig. Test this hypothesis against the alternative that the mean mark is less than 100.68 Test statistic: t! x Q s/ n   Conclusion: The test statistic conclude that the mean mark the critical value H 0 and .2.01) Critical Value: W unknown so t-tables. n = 13. s = 5.

. Conclusion Compare the test statistic with the critical value. xd  0 sd / n d % significance. < 0. +11. so do not reject H0. Critical value: t-table.Before): -5. are included in 'Business Statistics'. Has any change taken place at a 5% significance level? Trainee A Score before Training 74 Score after Training 69 B 69 76 C 45 56 D 67 59 E 67 78 F 42 63 G 54 54 H 67 76 I 76 75 First the 'changes' are computed and then a simple t-test is carried out on the result. including nonparametric tests. number of tails. or as stated.V. less extreme than C. Conclude in terms of the question. Chapter 7. Test statistic xd  0 sd / n d % significance. Only the most commonly used of hypothesis tests have been included in this lecture. sd = . xd = deg of freedom. . H1: d { 0. In practice we carry out a one sample t-test on the differences. +7. You will meet others in the next few weeks and a further selection. tailed = n = 9.Paired t-test If we have paired data the 'difference' between each pair is first calculated and then these differences are treated as a single set of data in order to consider whether there has been any change. H1: d { 0. Method H0: d = 0 where d is the mean of the differences. The scores are recorded below. -8. = Conclusion T. Example 6 A training manager wishes to see if there has been any alteration in the aptitude of his trainees after they have been on a course. Critical value Test statistic t-table. There may be no change. H0: d = 0. Decide whether to reject H0 or not. He gives each an aptitude test before they start the course and an equivalent one after they have completed it. or if there is any difference between them. deg of freedom. Changes (After .56 Testing for Mean of Differences of Paired Data .S. where the subscript d is used to denote 'differences'. or > 0 Significance Level (E): 5%.

05 2.6 31.31 2.66 2.55 3.60 2.95 2.96 1% 2% 12.03 3.05% 0.1% 3.60 4.71 1.1% 0.79 4.02 2.04 2.30 4.82 2.30 3.68 2.78 2.64 2.42 2.50 3.21 4.55 3.50 4.61 6.02 1.31 2.23 3.64 2.09 0.09 2.75 1.31 3. E = total hatched area in tails Table 3 % POINTS OF THE STANDARD NORMAL CURVE One tailed One tail Two tails z 5% 10% 1.89 5.36 2.33 0.96 4.2% 63.58 0.59 4.14 3.57 2.04 4.70 1.39 2.86 1.35 2.13 2.39 3.25 3.2% 3.36 3.70 2.14 3.18 2.07 3.76 2.81 1.78 4.54 3.17 5.1% 636.29 Y = no of degrees of freedom.66 22.32 4.06 2.00 1.89 1.96 5.73 3.45 2.29 .71 3.36 3.5% 1% 31.53 2.30 4.00 2.85 3.47 3.96 1% 2% 2.65 3.75 3.17 3.05% 0.49 2.18 2.84 4.75 2.72 1.58 Two tailed 0.92 8 .5% 5% 4.09 0.92 2.57 Table 2 PERCENTAGE POINTS OF THE t-DlSTRlBUTlON One tailed One tail E Two tails E R=1 2 3 4 5 6 7 8 9 10 12 15 20 24 30 40 60 g 5% 10% 6.33 Two tailed 0.67 1.41 5.94 1.60 12.82 9.90 2.46 3.71 6.5% 5% 1.1% 0.83 1.93 3.78 1.13 2.23 2.92 5.87 5.5% 1% 2.21 7.85 2.68 1.46 2.33 10.75 3.80 2.26 2.

95 of being correct at the 5% level. uses this to carry out a hypothesis test in the usual manner. ANOVA. can be thought of as an extension of the two-sample t-test to more than two samples.875 0. One-way ANOVA: Is there any difference between the average sales at various departmental stores within a company? Two-way ANOVA: Is there any difference between the average sales at various stores within a company and/or the types of department? The overall variation is split 'two ways'.95 = 0. it: y y y y measures the overall variation within a variable. finds the variation between its group means.125 1 .9545 = 0. Using Analysis of variance we can. combines these to calculate a single test statistic.90 Obviously this situation is not acceptable Solution: We need therefore to use methods of analysis which will allow the variation between all n means to be tested simultaneously giving an overall probability of 0. at a 5% level. (Further methods in Chapter 8 of Business Statistics) As an example: Using the 2-sample t-test we have tested to see whether there was any difference between the size of invoices in a company's Leeds and Bradford stores. Comparing more students: Students 2 3 n 10 Pairwise tests 1 3 {n(n-1)}/2 45 P( all correct) 0.05 0. simultaneously. In general. if we compare the average marks of two students at the end of a semester to see if their mean scores are significantly different we would have.95 0. In general. 0.0.95n 0. in its simplest form. . Problem: Why can't we just carry out repeated t-tests on pairs of the variables? If many independent tests are carried out pairwise then the probability of being correct for the combined results is greatly reduced.95n 0. Analysis of Variance.95 probability of being correct.58 ANALYSIS OF VARIANCE (ANOVA) Analysis of Variance. This type of analysis is referred to as Analysis of Variance or 'ANOVA' in short. investigate invoices from as many towns as we wish. assuming that sufficient data is available. For example. compares the variation between groups and the variation within samples by analysing their variances.10 3 P(at least one incorrect) 0.

it will be explained in stages using theory and a numerical example simultaneously. between the group means (SSG) Residual (error) variation not due to difference between the main group means.e. (SSG) Residual (error) variation not due to difference between the group means. Total variance = between groups variance + variance due to the errors It follows that: Total sum of = squares (SST) Sum of squares between Sum of squares due + the groups (SSG) to the errors (SSE) If we find any two of the three sums of squares then the other can be found by difference. . i. SSBl = Blocks Sum of Squares. In practice we calculate SST and SSG and then find SSE by difference. (SSE) Two-way ANOVA Total variation (SST) Variation due to difference between the groups.e. (SSE1) Variation due to difference between the block means. between the group means. (At this stage just think of 'sums of squares' as being a measure of variation.) The method of measuring this variation is variance. second group means (SSBl) Genuine residual (error) variation not due to difference between either set of group means (SSE) where SST = Total Sum of Squares.e.59 One-way ANOVA Total variation (SST) Variation due to difference between the groups. which is standard deviation squared. SSG = Treatment Sum of Squares between the groups. i. SSE = Sum of Squares of Errors. Since the method is much easier to understand with a numerical example. i.

In order to evaluate three database management systems. however.SSSys Calculation of Sums of Squares The 'square' for each case is (x . is there any difference between the training time needed for the three systems? In this case the 'groups' are the three database management systems. a firm devised a test to see how many training hours were needed for five of its word processing operators to become proficient in each of three systems. 16 19 14 13 18 hours System A System B System C 16 24 17 22 13 19 12 18 17 22 hours hours Using a 5% significance level. but not all. The 'total sum of squares' is therefore § . It follows that: Total sum = Sum of squares of squares between systems (SST) (SSSys) + Sum of squares of errors (SSE) In practice we transpose this equation and use: SSE = SST .60 Example 1 One important factor in selecting software for word processing and database management systems is the time required to learn how to use a particular system. These account for some. is not explained by the difference between them. of the total variance. Total variance = between systems variance + variance due to the errors.x )2 where x is the value for that case and x is the mean. The residual variance is referred to as that due to the µerrors¶. Some.

square the results.x  x 2 . The use of a statistical calculator is preferable! In the lecture on summary statistics we saw that the standard deviation is calculated by: sn ! § . subtract the mean from each value. and finally sum the squares. The classical method for calculating this sum is to tabulate the values.

x  x n 2 so § .

x  x 2 = ns 2 with s from the calculator using [xWn] n 2 or s n 1 ! § .

x  x so § .

x  x 2 = .

x and W n from the x Wn 2 Wn 2 nW n . TotalSS n Input all the data individually and output the values for n .419 11. calculator in SD mode.n  1 s 2 1 with s from the calculator using [xWn-1 ] n n 1 Both methods estimate exactly the same value for the total sum of squares.33 3.3 = SS Total . Use these values to calculate W 2 and nW 2 .69 175. n n 15 17.

(n = 5).S. SS. SSG. k-1 M. SS for Systems 15 17.33 2. M.889 103.S.. General ANOVA Table (for k groups. by difference. y The test statistic. is the ratio of the mean sum of squares due to the differences between the group means and that due to the errors. find the sum of squares due to the errors. n x Wn 2 Wn 2 nW n .S. and the between groups sum of squares..3 = 72. y the mean sums of squares.SSSys = 175. in your calculator and output n . for the total and the groups are one less than the total number of values and the number of groups respectively. x and W n . SST. F. as on the next page. total sample size N) Source Between groups S.S SSG d.(k-1) N-1 y Fill in the total sum of squares. .f. is found in each case by dividing the sum of squares. SSG ! MSG k 1 SSE ! MSE Nk F M !F M E Errors Total Method SSE SST (N-1) . If you find it helpful then make use of it.625 6. after calculation.f. otherwise just work with the numbers. d.61 SSSys Calculate n and x for each of the management systems separately: System A System B System C 5 5 5 n 16 x 15 21 Input as frequency data. SSE. find the error degree of freedom by difference.S.0 Using the ANOVA table Below is the general format of an analysis of variance table.103.3 . by the corresponding degrees of freedom. y the degrees of freedom.3 = SSSys SSE is found by difference SSE = SST .

f.65 72. is that at least two of the group means are different.1 = 14 M. 103. 3-1=2 14 .3 72. is that all the group means are equal. H 0.2 = 12 15 . The critical value is from the F-tables.S.65/6.0/12 = 6. FE . The significance level is as stated or 5% by default. The null hypothesis.00 F 51.00 = 8.0 175.61 The hypothesis test The methodology for this hypothesis test is similar to that described last week. N = 15 values) Source Between systems Errors Total S.3 d. The alternative hypothesis. H0: Q1 = Q2 = Q3 = Q4 etc. 103.62 In this example: (k = 3 systems. H1.3/2 = 51.S.S.

There is a difference between the mean learning times for at least two of the three database management systems.S. The test statistic is the F-value calculated from the sample in the ANOVA table.61 Conclusion: T. and the errors.R 1 . The critical difference formula is: ¨1 1 ¸ ¹ CD ! t MSE ©  ¹ ©n ª 1 n2 º t has the error degrees of freedom and one tail. so reject H0.12) = 3. Example 1 (cont.) H0: Q A = QB = Q C Test statistic: 8.89 (Deg. . the sample sizes and the significance level. The conclusion is reached by comparing the test statistic with the critical value and rejecting the null hypothesis if the test statistic is the larger of the two. of free.) Where does any difference lie? We can calculate a critical difference. Critical value: F0.V. H1: At least two of the means are different. which depends on the MSE. such that any difference between means which exceeds the CD is significant and any less than it is not. CD. R 2 .05 (2. R1. with the two degrees of freedom from the groups. > C. MSE from the ANOVA table. from 'between systems' and 'errors'. R2.

SSSys . Example 2 1 System A System B System C 16 16 24 2 19 17 22 3 14 13 19 Operators 4 13 12 18 5 18 17 22 Again we ask the same question: using a 5% level. Total variance = between systems variance + between operators variance + variance of errors.78 6. is there any difference between the training time for the three systems? We can use the Operator variation just to explain some of the unexplained error thereby reducing it. So Total sum = Sum of squares + Sum of squares of squares between systems between operators (SST) (SSSys) (SSOps) + Sum of squares of errors (SSE) In 2-way ANOVA we find SST.00©  ¹ ª5 5º ! 2. SSSys.SSOps We already have SST and SSSys from 1-way ANOVA but still need to find SSOps. 'blocked' design.) ¨1 1 ¸ CD ! t MSE©  ¹ ¹ ©n ª 1 n2 º ¨1 1¸   1. By extending the analysis from one-way ANOVA to two-way ANOVA we can find our whether Operator variability is a significant factor or whether the differences found previously were just due to the Systems. SSOps and then find SSE by difference. In the second we have a second set of groups . SSE = SST . or we can consider it in a similar manner to the System variation in the last example in order to see if there is a difference between the Operators. System C takes significantly longer to learn than Systems A and B which are similar. In the first case the 'groups' are the three database management systems and the 'blocks' being used to reduce the error are the different operators who themselves may differ in speed of learning.76 From the samples x A ! 16. x B ! 15. Two-way ANOVA In the above example it might have been reasonable to suggest that the five Operators might have different learning speeds and were therefore responsible for some of the variation in the time needed to master the three Systems.the Operators. x C ! 21.63 Example1 (cont. .

18/0.84 Test Statistic: 17.7/4 = 16. Hypothesis test (2) for Operators H0: Q 1 ! Q2 ! Q3 ! Q4 ! Q5 H1: At least two of them are different.33 From example 1: SST = 175. Using CD of 1. > C.65 64.3 64. Using CD of 1.7 = SSOps Two-way ANOVA table.91 = 17.33 5 18 17 22 19.78 Conclusion: T. so reject H0.33 3 14 13 19 15. including both Systems and Operators: Source Between Systems Between Operators Errors Total S.8) = 4.45 calculated as previously (see overhead): Operators 3 and 4 are significantly quicker learners than Operators 1.S.00 15.76 16. W n = 2. There is a difference between at least two of the Operators in the mean time needed for learning the systems.S.8) = 3.3 d. There is a difference between at least two of the mean times needed for training on the different systems.64 Operators 1 System A System B System C Means 16 16 24 18.00 21.1 = 14 M. 3-1=2 5-1=4 14 .7 7. x = 17. .078 and 2 nW n = 64.91 = 56.00 17.91 F 51.46 Test Statistic: 56.33.3/2 = 51.S.67 2 19 17 22 19.f.S. so reject H0.76 (Notice how the test statistic has increased with the use of the more powerful two-way ANOVA) Conclusion: T.6 = 8 15 .33 4 13 12 18 14. > C.3/8 = 0.65/0. Critical value: F0.18 7.78 Hypothesis test (1) for Systems H0: Q A ! QB ! QC H1: At least two of them are different.05 (4.12 (see overhead): C takes significantly longer to learn than A and/or B. 103.05 (2. 103.V.3 175.00 Means 16.3 SSOps Inputting the Operator means as frequency data (n = 3) gives: n = 15.S.3 and SSSys = 103.V. Critical value: F0. 2 and 5.

45 2.40 2.35 2.68 3.00 19.85 2.27 2.87 3.61 5.51 2.56 5.39 3.66 2.60 2.70 2.50 3.63 2.79 5.79 2.06 3.10 2.84 3.30 2.49 3.96 2.44 2.23 4.94 5.60 2.55 6.42 2.62 2.45 2.20 3.58 3.37 2.35 3.69 3.18 2.26 4.30 4.59 2.79 3.14 4.96 2.12 3.64 2.48 3.51 2.99 2.55 2.24 4.09 2.78 2.68 2.51 2.44 3.00 3.29 3.81 2.96 2.85 2.29 2.03 2.37 3.40 18.74 2.71 3.28 6.01 2.02 1.73 2.05 4.53 4.41 4.80 2.86 3.35 8.38 8081 6.09 3.74 3.44 3.96 1.01 8 238.33 8.85 2.15 3.40 2.74 2.32 5.61 2.84 2 199.42 2.45 2.96 4.49 2.28 2.76 4.32 2.54 4.85 6.92 3.28 2.25 2.33 3.54 2.24 3.51 10.84 4.20 4.00 4.53 2.26 5.21 4.22 2.67 4.37 3.71 2.93 2.19 4.63 3.89 3.83 2.09 4.76 2.68 3.34 2.59 5.10 7 236.34 2.25 2.16 3.90 2.65 2.18 3.11 3.25 9.17 2.30 9.46 2.53 2.70 2.04 1.08 4.55 3.41 3.91 2.18 3.92 2.48 2.10 3.07 2.88 Denominator degrees of freedom 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 60 120 g .34 3.52 3.66 2.37 8.13 7.82 2.39 3.97 3.99 5.66 2.36 2.10 3.26 3.00 3 215.35 4.80 2.37 5 230.21 6 234.69 2.32 3.48 3.37 2.74 2.76 2.59 2.61 2.87 2.15 3.80 19.22 3.02 2.60 4.12 2.57 2.33 2.95 2.49 4.20 3.93 2.58 2.05 3.59 2.76 2.50 19.90 2.70 19.75 4.70 2.55 2.60 19.37 2.07 3.33 3.32 4.29 2.21 2.57 2.10 3.64 2.95 4.89 6.24 2.71 2.59 3.98 3.38 4.81 3.32 2.47 3.39 5.31 2.12 4.01 2.49 2.39 3.04 4.25 2.14 3.18 4.36 2.47 2.68 2.01 2.00 2.77 2.36 3.92 2.88 4.71 2.17 2.49 3.21 3.84 2.03 3.50 19.07 3.17 4.42 2.45 4.71 2.46 4.28 4.42 2.11 3.29 3.41 4.65 Table 4 PERCENTAGE POINTS OF THE F DISTRIBUTION E = 5% Numerator degrees of freedom R1 R2 1 2 3 4 5 6 7 8 9 10 1 161.12 6.73 3.13 3.34 2.07 3.23 3.20 19.27 2.90 2.49 2.98 2.84 2.59 3.74 4.53 2.01 6.46 2.43 2.23 3.40 3.42 3.60 4 224.54 2.16 4.46 2.39 2.34 3.28 3.94 6.39 2.56 2.45 2.00 9.90 19.77 4.37 2.71 6.77 2.26 4.63 3.82 4.94 9 240.95 2.16 9.35 4.55 2.

We might want to know: y If a relationship exists between those variables. which defines the straight line equation of the line of 'best fit' through the bi-variate data. Once defined by an equation. Is the relationship strong enough to be useful? y If the relationship is found to be significantly strong. A correlation coefficient is calculated as the measure of the strength of this relationship. then its nature can be found using linear regression. the relationship can be used to make predictions.now we look at two variables.66 CORRELATION AND REGRESSION Correlation and regression are concerned with the investigation of two continuous variables. Curved relationships are considered in Chapter 9 of Business Statistics. Regression describes the relationship itself in the form of a straight line equation which best fits the data. It is not concerned with 'cause' and 'effect'. The 'goodness of fit' can be calculated to see how well the line fits the data. y y . Does the relationship appear to be linear or curved? y If there appears to be a linear relationship. it can be quantified. y Can we make use of that relationship for predictive purposes? Correlation describes the strength of the relationship. y if so. y what form that relationship takes. Previously we have only considered a single variable . Its symbol is usually 'r' and its value always lies between -1 and +1. Methodology y Some initial insight into the relationship between two continuous variables can be obtained by plotting a scatter diagram and simply looking at the resulting graph. how strong that relationship is.

Scatter diagram (from Minitab) Sales agai s Average Monthly Temperature 140 130 120 110 Sales 100 90 80 70 60 50 5 10 15 Seems to indicate a straight line relationship.67 Example As an example we shall consider whether there is any relationship between 'Ice cream Sales' for an ice-cream manufacturer and 'Average Monthly Temperature'. „ ƒ Av. (oC) 4 4 7 8 12 15 16 17 14 11 7 5 Sales (£'000) 73 57 81 94 110 124 134 139 124 103 81 80 Scatter diagrams We are looking for a linear relationship with the bivariate points plotted being reasonably close to the. so the potentially dependent variable should be identified and plotted on the vertical axis.Temp. At this stage no 'causality' is implied but it makes sense to use the same diagram for the addition of the regression line later. The collected data is recorded below: Month January February March April May June July August September October November December Av. The strength of the relationship will therefore be quantified by calculating the correlation coefficient and testing it for significance. . yet unknown. with all points fairly close to a 'line of best fit'.Temp. 'line of best fit'.

) Correlation Coefficient: For this data the value of r = 0. i. (See Appendix 1) (If you do not have a calculator capable of carrying out correlation and regression calculations it will be necessary for you to calculate them by tabulation and use of the formulae shown in Appendix 2. Hypothesis test for a Pearson¶s correlation coefficient Null hypothesis. 0. It compares how the variables vary together with how they each vary individually and is independent of the origin or units of the measurements. H 1: There is an association between them. r.576 Test statistic: 0. H 0: There is no association between ice-cream sales and average monthly temperature. appended. It is a ratio of the combined variance (covariance) with individual variances and so has no units itself. It needs to be compared to a table value to see if it was significantly high.) Is the size of this correlation coefficient. either interval or ratio data. is best produced directly from a calculator in LR mode. The value of the correlation coefficient. large enough to claim that there is a relationship between average monthly temperature and sales of ice-cream? The test statistic obtained above in this case seems very close to 1. (12 . Alternative hypothesis.9833 ( from 'shift (' ) (Calculator usage will be practised in tutorials. . Correlation tables. r.9833. Critical Value: 5%.2) = 10 degrees of freedom = 0. Pearson's correlation coefficient is calculated for situations in which the data can be measured quantitatively. but is it close enough?.68 Pearson's Product Moment Correlation Coefficient.e. are used with n .2 degrees of freedom.983 Conclusion: The test statistic exceeds the critical value so we reject the Null Hypothesis and conclude that there is an association between ice-cream sales and average monthly temperature.

assuming a causal relationship. as produced from a calculator in linear regression (LR) mode.g. As with the correlation coefficient.5 + 5.69 Regression equation Since we now know there is a significant relationship between the two variables.45 x 15 = 127. e. If x = 15 then y = 45. Having defined it. the corresponding value of y can be produced directly from the Ö calculator in LR mode. the centroid. as the straight line of µbest fit¶ with the equation: y = a + bx where x and y are the independent and dependent variables respectively. The regression line is described. ( x. the intercept. can be found directly from the calculator in LR mode (shift 7 and shift 8). we can then add it to the scatter diagram and also. a the intercept on the y-axis. any three points are plotted and joined up.3 (For any value of x. (0. a and b. the coefficients of the regression equation.Temp. the next obvious step is to define it.45x To draw this line on the scatter diagram.a). otherwise they need to be calculated as shown in Appendix 2. The values of a and b for this data. These points. are 45.52 (shift 7) for a and 5. in general. and/or any other points calculated from the regression equation as long as these are in the region of the observed data. .) Scatter diagram with Regression line added Sales against Average Monthl Temperature 1 40 1 30 1 20 1 10 Sales 1 00 90 80 70 60 50 5 10 15 Av.5 + 5. y ). and b the slope of the line.448 (shift 8) for b giving the regression line: y = 45. using the key y . use it to predict the next month's ice cream sales if we know the figure for the average temperature which has been forecast for the coming month.

' 'Sales'.5 + 5.5 + 5.4480 Stdev 3.10 p 0.45 Av.Temp. This can Ö be more easily be done directly by calculator in LR mode: type in 14. The correlation coefficient r was 0.038 Coef 45.31 2.3186 t -ratio 13.Temp. Sales 2 4.34 p 0.33R R denotes an obs.983)2 x 100 = 96.Temp.Temp. Prediction of Sales Suppose that the Ice-cream manufacturer knows that the estimated average temperature for the following month is 14 oC.00 17.503 0.Temp.70 Goodness of Fit 2 How well does this line fit the data? Goodness of fit is measured by (r x 100)%. The 'goodness of fit' indicates the percentage of the variation in Ice-cream Sales which is accounted for by the variation in average monthly temperature.Resid -2. Correlation of Av.2 253.40 Residual -10.983 so we have (0.000 Unusual Observations Obs.45 x average temperature = 45. This high value indicates that any predictions made about y from a value of x will be good.0 57. y.00 Fit Stdev.45 x 14 = 121. resid.4% Analysis of Variance SOURCE Regression Error Total DF 1 10 11 SS 7420.8 Expected sales would be £122 000 Minitab Regression output MTB > Correlation 'Av. Estimated Sales = 45. and Sales = 0.6% fit.5 + 5.31 St. s = 5.000 R -sq = 96.983 MTB > Regress 'Sales' 1 'Av.4 F 292. press y .'.7% R -sq(adj) = 96.Temp.Fit 67.000 0. x.0 M S 7420. what would he expect his Sales to be? The best estimate of the Sales is obtained by substituting the value of 14 for that of the independent variable. and calculating the corresponding value of the Sales.520 5. Predictor Constant Av. Av. with a large st.2 25. The regression equation is Sales = 45.8 7674. .

Correlation is signifi ant at t e .71 SPSS Regression output CORRELATIONS /VARIABLES=av_temp sales /PRINT=TWOTAIL NOSIG. (2-tailed) N . REGRESSION /STATISTICS COEFF OUTS R ANOVA /CRITERIA=PIN(.176 25.000 . .319 .503 5. . 12 .520 3.000 . ependent ariable: I e ream sales (£'000) s s tr (Constant) Average mont l temperature nstandardized Coeffi ients B Std. . Predi tors: (Constant).098 Sig. 1 .000 ean Square 7420. Average mont l temperature ANOVAb odel Regression Residual otal df 1 10 11 1 a.10) /DEPENDENT sales /METHOD=ENTER av_temp .05) PO UT(. ** . 1 level (2-tailed).382 j k l j e h tr ~} y i e h vu | | e gf odel R R are . F 292. 12 1. 24 7674.996 17. (2-tailed) N Pear rrelati i . 12 ††† ‰ˆ‡ ††† d d ††† ™ ˜— … Average mont l temperat re 1.448 .983 s  s ‚ „  b. ependent ariable: I e ream sales (£'000) Coeffi ientsa Standardi zed Coeffi ien ts Beta xww { Sum of Squares 20.176 253. Average mont l temperature Model 1 a. 12 € ™ ˜ — ’‘– s  | z ƒ d d m Sig. a a. Predi tors: (Constant). Model Summary Adjusted R Square . Error 45. ††† ††† ‰ˆ‡ r q ’‘ ‘“ ’‘ †† — •” pon d ††† I e ream ales (£' ) ’‘ ‘“ ’‘ •” Average t l temperat re Pear rrelati i . ** . Std. rror of t e Estimate . Corr lations **. I e ream sales (£' ) .000a { t 12.

The method described below produces the correlation coefficient. Results out: 4 5 6 7 8 Check number of pairs entered Output correlation coefficient (r) Output intercept (a) and slope (b) of regression line To get y value for plotting line when x = 15 To estimate Sales (£000) for Av. Within the many memories of the calculator these numbers and their squares are accumulated and the correlation coefficient is then calculated using the formulae in the appendix from these stored sums. For any other make. and estimates values of y for given values of x for a Casio calculator. Data in: 1 2 3 Clear all memories Enter linear regression mode Input variables x and y together for each case Repeat to end of data. Temp = 14 RCL Red C Shift ( 'open bracket' Shift 15 14 7. refer to the calculator handbook.) The data is entered as pairs of numbers. A and B.72 Appendix 1 Calculator Method (only for calculators with linear regression mode. the regression coefficients.y DT . Ö y Ö y Shift 8. r. Shift AC Mode 3 x. and possibly some Casio models.

Otherwise the following calculations are necessary: The formula used to find the Pearson¶s Product Moment Correlation coefficient is: r ! Sxy S xxS yy .73 Appendix 2 The value of the correlation coefficient is best produced directly from a calculator in LR mode.

§ y 2 . The following example gives ice-cream sales per month against mean monthly temperature (Fahrenheit) You need the values of: § x. (x) 4 4 7 8 12 15 16 17 14 11 7 5 120 Ice cream sales (y) 73 57 81 94 110 124 134 139 124 103 81 80 1200 n. § x 2 . 1 e r e 1 whereS xx ! § x 2  S yy S xy § x§ x n § y§ y ! § y2  n x§ y § ! § xy  n where § x means the sum of all the x values. § xy. Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Sums 7 Average temp. § y. x2 16 16 49 64 144 225 256 289 196 121 49 80 1450 y2 5329 3249 6561 8836 12100 15376 17956 19321 15376 10609 6561 6400 127674 xy 292 228 567 752 1320 1860 2144 2363 1736 1133 567 400 13362 . etc.

b. (all produced previously. § xy.74 Calculations § x ! 120. § y ! 1200. the intercept on the y-axis: y ! a  bx   a ! y  bx From previously: (Sales now assumed to be dependent on temperature) S xy ! 1362 S xx ! 250 so b! 1362 ! 5.448 250 a! 1200 120  5. both means. the regression equation can be produced directly from a calculator in LR mode. § x 2 . § y 2 .) The regression line is described. its equation can be used to find the value of a. The gradient.983 (Use of calculator will be checked in tutorials.5 + 5. 2 § y ! 127674. in general. n. n ! 12.9833 250 v 7674 Correlation Coefficient: r = 0. 2 § x !1450. Otherwise further use is made of the values of § x.5 and 5. § xy ! 13362. and b the slope of the line. xx yy xy 120 v 120 ! 250 12 1200 v 1200 ! 127674  ! 7674 12 120 v 1200 ! 13362  ! 1362 12 ! 1450  r ! 1362 ! 0.45x .45 respectively giving the regression line: y = 45.52 The values of a and b are therefore 45. a the intercept on the y-axis. is calculated from : xy xx b! where xy ! § xy  § x§ y n and xx ! § x2  § x§ x n Since the regression line passes through the centroid. § y. as the straight line with the equation: y = a + bx where x and y are the independent and dependent variables respectively.) Regression equation As with the correlation coefficient.488 v ! 12 12 45.

Table 6

PERCENTAGE POINTS OF THE CORRELATION COEFFICIENT

[Table of critical values of r against degrees of freedom ν = 2 to 60, for one-tailed significance levels from 5% down to 0.05% and two-tailed levels from 10% down to 0.1%.]

CONTINGENCY TABLES AND CHI-SQUARED TESTS

In this type of analysis we have two characteristics, such as gender and eye colour, which cannot be measured but which can be used to group people by variations within them. These characteristics may, or may not, be associated in some way. How can we decide? We can take a random sample from the population, note which variation of each characteristic is appropriate for each case and then cross-tabulate the data. It is then analysed in order to see if the proportions of each characteristic in the sub-samples are the same as the overall proportions (easier to do than describe!).

The Variables

If the variables are, as is usually the case, nominal (described by name only), frequencies may be cross-tabulated by each category within each variable. Ordinal variables may be used if there are only a limited number of orders, so that each one can be classified as a separate category. Continuous variables may be grouped and then tabulated similarly, though the results will then vary according to the grouping categories.

Contingency Tables (Cross-Tabs)

You have met this type of table before as a contingency table when calculating probabilities. As a reminder: cases are allotted to categories and their frequencies cross-tabulated. All possible 'contingencies' are included in the 'cells', which are themselves mutually exclusive. For example, in the gender / eye colour example there might be blue eyed males, brown eyed males, blue eyed females and brown eyed females. These tables are known as contingency tables. The table is completed by calculating the 'row totals', the 'column totals' and the 'grand total'.

Expected values

If the two variables, the characteristics under scrutiny, are completely independent, the proportions within the sub-totals of the contingency table would be expected to be the same as those of the totals for each variable. For example, if there is no relationship between gender and eye colour we would expect similar proportions of males and females to have blue eyes: if gender and eye colour are independent and a third of the population has blue eyes, we would expect a third of males to be blue eyed and a third of females to be blue eyed.

These proportions are obviously contrived so as to be easy to work with. How can we cope with more awkward numbers? In practice, as in the on-going example, we work with frequencies rather than proportions: the proportions are first calculated as fractions, which are then multiplied by the total frequency to find the expected individual cell frequencies. This produces a formula which is applicable in all cases. For any cell the expected frequency is calculated by:

    (Row total × Column total) / Overall total

where the relevant row and column are those crossing in that particular cell. We distinguish between 'observed' and 'expected' frequencies by enclosing the latter within brackets.

Chi-squared (χ²) Test for Independence

The hypothesis test which is carried out in order to see if there is any association between categorical variables, such as gender and eye colour, is known as the Chi-squared (χ²) test.

Example 1

The following table, compiled by a personnel manager, relates to a random sample of 180 staff taken from the whole workforce of the supermarket chain. We shall test, at the 5% level of significance, for association between a member of staff's gender and his/her type of job.

                Male   Female   Total
Supervisor       20      15      35
Shelf stacker    20      30      50
Till operator    10      35      45
Cleaner          10      40      50
Total            60     120     180

Completing the row and column totals gives the full table. In this example we randomly selected a sample of 180 supermarket staff and found that two thirds, 120, of them were female and one third, 60, were male. Assuming there is no association between gender and job category: since till operators, 45, form a quarter of the total staff, 180, we would expect two thirds of them, 30, to be female and one third, 15, to be male. Note that these figures are a quarter of each gender respectively, which checks.

(See Section 4.8.) Assuming independence, as previously with probability:

    P(Supervisor) = 35/180
    P(Male) = 60/180
    P(Supervisor and Male) = 35/180 × 60/180

Therefore the expected number of male supervisors = 35/180 × 60/180 × 180 = (35 × 60)/180 = 11.67

This is the expected frequency for members of staff who are both male and a supervisor. Note that it is a theoretical number which does not have to be an integer. The calculation simplifies to:

    (Row total × Column total) / Overall total

We now calculate the other expected frequencies from the probabilities and put them into the table.

Calculating the other expected frequencies and inserting them in the table (in brackets):

                Male          Female        Total
Supervisor      20 (11.67)    15 (23.33)     35
Shelf stacker   20 (16.67)    30 (33.33)     50
Till operator   10 (15.00)    35 (30.00)     45
Cleaner         10 (16.67)    40 (33.33)     50
Total           60            120           180

These are the frequencies which would be expected if there is no association between gender and job category at the supermarket. If the expected frequencies are observed to actually occur in practice then we can deduce that the two variables are indeed independent. Are the values observed so different to those expected that we must reject the idea of independence? Or are the results just due to sampling errors, with the variables actually being independent? It is to be hoped that you recognise the need for a hypothesis test!

The chi-squared (χ²) Hypothesis test

We would obviously not expect to get exact agreement with the expected frequencies, so some critical amount of difference is allowed, and we compare the difference from our observations with that allowed by the use of a standard table. We carry out a formal hypothesis test at 5% significance: the chi-squared test.

1) State Null Hypothesis, H0 (that of no association), and Alternative Hypothesis, H1.
2) Record observed frequencies, O, in each cell of the contingency table.
3) Calculate row, column and grand totals.
4) Calculate expected frequency, E, for each cell: (row total × column total) / grand total.
   Note that no expected frequency should be less than 1, and the number of expected frequencies below 5 should not be over 20% of the total number of cells. Otherwise the test is invalid.
5) Find critical value from the chi-squared table, as appended, with (r - 1) × (c - 1) degrees of freedom, where r and c are the number of rows and columns respectively.

6) Calculate the test statistic:

    χ² = Σ (O - E)² / E

7) Compare the two values and conclude whether the variables are independent or not.

In Example 1 we have already carried out steps 2, 3 and 4 of the procedure by calculating the expected values. As this procedure is rather lengthy, some statisticians prefer to tabulate the data and calculate the test statistic before starting the test, and then insert the calculated value in the formal hypothesis test; others calculate it during the test, as we shall do in this example. Whether these are calculated before, or during, the test is up to personal preference.

Null Hypothesis (H0): There is no association between gender and job category. (Remember that 'null' means none.)
Alternative Hypothesis (H1): There is an association between gender and job category.
Level of significance = 5%
Critical Value: from the chi-squared table, as appended. The chi-squared test is always one tailed.
Number of degrees of freedom (ν) = (r - 1)(c - 1) = (4 - 1)(2 - 1) = 3 × 1 = 3
χ² table, 5%, ν = 3: critical value = 7.816

Test statistic

The test statistic is calculated from the contingency table, which includes both the observed and the expected values for the frequency of staff. It is an overall measure of the difference between the expected and observed frequencies. Each cell difference is squared, so that positive and negative differences have the same weighting, and proportioned by the size of the expected cell contents. When the contributions from each cell are totalled, their sum is compared with a critical value from the chi-squared table, hence the name of this test.

The cell differences may first be tabulated, or the contribution of each cell may be calculated directly as (O - E)²/E, and the test statistic then found as the sum of these contributions:

                Male          Female        Total
Supervisor      20 (11.67)    15 (23.33)     35
Shelf stacker   20 (16.67)    30 (33.33)     50
Till operator   10 (15.00)    35 (30.00)     45
Cleaner         10 (16.67)    40 (33.33)     50
Total           60            120           180

Test statistic = Σ (O - E)² / E

O     E       (O - E)   (O - E)²/E
20    11.67    +8.33      5.946
15    23.33    -8.33      2.974
20    16.67    +3.33      0.665
30    33.33    -3.33      0.333
10    15.00    -5.00      1.667
35    30.00    +5.00      0.833
10    16.67    -6.67      2.669
40    33.33    +6.67      1.335
                Total    16.422

Test statistic: 16.422

Conclusion: Test statistic > Critical value, therefore reject H0. Conclude that there is an association between gender and job category in the supermarket chain. Looking again at the data, we can see that far more males than expected were supervisors or shelf stackers, and more females were cleaners or till operators.
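The whole of steps 2 to 6 can be expressed in a few lines of code. A minimal Python sketch for Example 1 (plain Python, no statistical library assumed):

    observed = [[20, 15],   # Supervisor
                [20, 30],   # Shelf stacker
                [10, 35],   # Till operator
                [10, 40]]   # Cleaner

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand   # expected frequency
            chi_sq += (o - e) ** 2 / e                  # cell contribution

    print(f"chi-squared = {chi_sq:.3f}")   # 16.422 > 7.816, so reject H0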

Example 2

In this example we first have to set up the contingency table from the following information collected from a questionnaire:

In a recent survey within a supermarket chain, a random sample of 160 employees (stackers, sales staff and administrators) were asked to grade their attitude towards future wage restraint on the scale: very favourable, favourable, unfavourable, very unfavourable. Of the 40 stackers interviewed, 7 gave the response 'favourable', 24 the response 'unfavourable', and 8 the response 'very unfavourable'. There were 56 sales staff and, from these, 3 responded 'very favourable', 9 responded 'favourable' and 10 responded 'very unfavourable'. The rest of the sample were administrators; of these, 16 gave the response 'very favourable' and 2 the response 'very unfavourable'. In the whole survey, exactly half the employees interviewed responded 'unfavourable'.

We first draw up a contingency table showing these results and then test whether attitude towards future wage restraint is dependent on the type of employment.

Setting up the table: in this example there are three types of employee giving four different responses, i.e. we have a 3 × 4 (or a 4 × 3) table. Adding extra rows and columns for the subtotals and titles, we need 5 × 6 cells. As you come to each number in the frequencies of response above, insert it into the appropriate place, then find the missing figures by difference. There is sufficient information here to enable you to complete your table. Have a go at compiling the table. When complete, check with that below before calculating the expected values.

                  Stackers   Sales staff   Administrators   Total
V. favourable         1           3              16           20
Favourable            7           9              24           40
Unfavourable         24          34              22           80
V. unfavourable       8          10               2           20
Total                40          56              64          160

The expected values can next be calculated, (Row total × Column total) / Overall total, and inserted.

Hypothesis test

Null Hypothesis (H0): There is no association between job category and attitude towards wage restraint.
Alternative Hypothesis (H1): There is an association between job category and attitude towards wage restraint.
Level of significance: 5%
Critical value: Number of degrees of freedom (ν) = (r - 1)(c - 1) = 6
χ² table, 5%, 6 degrees of freedom: critical value = 12.59

Test statistic = Σ (O - E)² / E   (Complete the table.)

O     E      (O - E)   (O - E)²/E
1     5        -4        3.200
7     10       -3        0.900
24    20       +4        0.800
8     5        +3        1.800
3     7        -4        2.286
9     14       -5        1.786
34
10
16
24
22
2
               Total    32.969

Test statistic: 32.969

Conclusion: Test statistic > Critical value, therefore reject H0. Conclude that there is an association between job category and attitude towards future wage restraint. The administrators were for it but the others against it.

COMPLETED EXAMPLES FROM LECTURE HANDOUT

Example 2

In this example we first have to set up the contingency table from the following information collected from a questionnaire:

In a recent survey within a supermarket chain, a random sample of 160 employees (stackers, sales staff and administrators) were asked to grade their attitude towards future wage restraint on the scale: very favourable, favourable, unfavourable, very unfavourable. Of the 40 stackers interviewed, 7 gave the response 'favourable', 24 the response 'unfavourable', and 8 the response 'very unfavourable'. There were 56 sales staff and, from these, 3 responded 'very favourable', 9 responded 'favourable' and 10 responded 'very unfavourable'. The rest of the sample were administrators; of these, 16 gave the response 'very favourable' and 2 the response 'very unfavourable'. In the whole survey, exactly half the employees interviewed responded 'unfavourable'.

We first draw up a contingency table showing these results and then test whether attitude towards future wage restraint is dependent on the type of employment.

Setting up the table: there are three types of employee giving four different responses, i.e. a 3 × 4 (or 4 × 3) table; adding extra rows and columns for the subtotals and titles we need 5 × 6 cells. As you come to each number in the frequencies of response above, insert it into the appropriate cell, then find the missing figures by difference. When complete, check with that below before calculating the expected values.

                  Stackers   Sales staff   Administrators   Total
V. favourable         1           3              16           20
Favourable            7           9              24           40
Unfavourable         24          34              22           80
V. unfavourable       8          10               2           20
Total                40          56              64          160

The expected values can next be calculated, (Row total × Column total) / Overall total, and inserted:

                  Stackers   Sales staff   Administrators   Total
V. favourable      1 (5)       3 (7)         16 (8)           20
Favourable         7 (10)      9 (14)        24 (16)          40
Unfavourable      24 (20)     34 (28)        22 (32)          80
V. unfavourable    8 (5)      10 (7)          2 (8)           20
Total             40          56              64             160

Hypothesis test

Null Hypothesis (H0): There is no association between job category and attitude towards wage restraint.
Alternative Hypothesis (H1): There is an association between job category and attitude towards wage restraint.
Level of significance: 5%
Critical value: Number of degrees of freedom (ν) = (r - 1)(c - 1) = (3 - 1)(4 - 1) = 2 × 3 = 6
χ² table (Table 5 in Appendix D), 5%, 6 degrees of freedom: critical value = 12.59

Test statistic = Σ (O - E)² / E

O     E      (O - E)   (O - E)²/E
1     5        -4        3.200
7     10       -3        0.900
24    20       +4        0.800
8     5        +3        1.800
3     7        -4        2.286
9     14       -5        1.786
34    28       +6        1.286
10    7        +3        1.286
16    8        +8        8.000
24    16       +8        4.000
22    32      -10        3.125
2     8        -6        4.500
               Total    32.969

Test statistic: 32.969

Conclusion: Test statistic > Critical value, therefore reject H0. Conclude that there is an association between job category and attitude towards future wage restraint. The administrators were for it but the others against it.
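For those following the optional software strand, the same test can be reproduced in a single call. A sketch assuming the SciPy library is available (SciPy is not otherwise used in this course):

    from scipy.stats import chi2_contingency

    observed = [[1, 3, 16],     # V. favourable
                [7, 9, 24],     # Favourable
                [24, 34, 22],   # Unfavourable
                [8, 10, 2]]     # V. unfavourable

    stat, p_value, dof, expected = chi2_contingency(observed)
    print(round(stat, 3), dof)   # 32.969 on 6 degrees of freedom; p < 0.05, reject H0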

Table 5

PERCENTAGE POINTS OF THE χ²-DISTRIBUTION

[Table of critical values of χ² against degrees of freedom ν = 1 to 100, for significance levels 10%, 5%, 2.5%, 1%, 0.5% and 0.1%.]

INDEX NUMBERS

(See Chapter 11 of Business Statistics)

In business, managers may be concerned with the way in which the values of most variables change over time: prices paid for raw materials, numbers of employees and customers, annual income and profits, etc. Index numbers are one way of describing such changes. They were originally developed by economists for monitoring and comparing different groups of goods. It is necessary in business to be able to understand and manipulate the different published index series, and to construct your own index series.

Index numbers measure the changing value of a variable over time in relation to its value at some fixed point in time, the base period, when it is given the value of 100. Index values are measured in percentage points since the base value is always 100.

Such indexes are often used to show overall changes in a market, an industry or the economy. For example, an accountant at a supermarket chain could construct an index of the chain's own sales and compare it to the index of the volume of sales for the overall supermarket industry. A graph of the two indexes will illustrate the company's performance within the sector.

[Figure: 'Company' and 'Market' index series plotted against time, both passing through 100 (parity) at the base period.]

Constructing an Index

    Index for any time period n = (value in period n / value in base period) × 100

Example   (Base year is year 1)

Year   Value    Calculation          Index
1      12380    12380/12380 × 100    100.00
2      12490    12490/12380 × 100    100.89
3      12730    12730/12380 × 100    102.83
4      13145    13145/12380 × 100    106.18
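The construction is one line of arithmetic per period. A minimal Python sketch of the Example series above (year 1 is the base period):

    values = [12380, 12490, 12730, 13145]
    index = [v / values[0] * 100 for v in values]
    print([round(i, 2) for i in index])   # [100.0, 100.89, 102.83, 106.18]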

Finding values from indexes

If we have a series of index numbers describing index linked values and know the value associated with any one of them, the other values can be found by scaling the known value in the same ratio as the relevant index numbers.

Example 2

The following table shows the monthly price index for an item:

Month   1     2     3    4    5    6    7    8     9     10    11    12
Index   121   112   98   81   63   57   89   109   131   147   132   126

If the price in month 3 is £240, what is the price in each of the other months?

Month 1: The index numbers in months 3 and 1 are 98 and 121 respectively. 98 is equivalent to a price of £240, so for month 1 this must be scaled in the same ratio as the index numbers.

Price in month 3 is £240, so price in month 1 is £240 × 121/98 = £296.3
Price in month 2: £240 × 112/98 = £274.3
Price in month 4:
Price in month 5:

Completing the table:

Month      1     2     3     4    5    6     7     8     9     10    11    12
Index      121   112   98    81   63   57    89    109   131   147   132   126
Price (£)  296   274   240             140   218   267   320   360   323   309

Price indexes

The most commonly quoted index is the Retail Price Index, the RPI, which is a monthly index indicating the amount spent by a 'typical household'. It gives an aggregate value based upon the prices of a representative selection of hundreds of products and services, with weightings produced from the 'Family expenditure survey' as a measure of their importance. The RPI shows price changes clearly and has many important practical implications, particularly when considering inflation in terms of wage bargaining, index linked benefits, insurance values, etc.
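The scaling in Example 2 is just a ratio of index numbers. A minimal Python sketch, taking month 3 (index 98, price £240) as the known value:

    index = [121, 112, 98, 81, 63, 57, 89, 109, 131, 147, 132, 126]
    known_month, known_price = 3, 240.0

    prices = [known_price * i / index[known_month - 1] for i in index]
    print([round(p) for p in prices])   # month 1 = 296, month 2 = 274, ..., month 12 = 309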

Other Indexes

- An index can be used to measure the way any variable changes over time: unemployment numbers, gross national product, number of car registrations, etc. It is often of interest to compare such an index to the R.P.I., or to indexes of the prices of competitors' products.
- Indexes can be calculated with any convenient frequency: yearly (e.g. Gross National Product), monthly (e.g. unemployment figures), daily (e.g. stock market prices).
- The choice of base period can be any convenient time; it is periodically adjusted.

Measuring changes over time

One of the most important uses of indexes is to show how the price of a product changes over time due to changing costs of raw materials, changes in production processes, variable supply and demand, general inflation, etc. Changes over time are measured either as a percentage point change, by subtraction, or as a percentage change, calculated by:

    (change in index number / original index number) × 100

Example 3   (Base period Jan.)

Month   Price (£)   Price index            Point change   Percentage change
Jan.    20          100
Feb.    22          22/20 × 100 = 110      10             (110 - 100)/100 × 100 = 10.0
Mar.    25          25/20 × 100 = 125      15             (125 - 110)/110 × 100 = 13.6
Apr.    26          26/20 × 100 = 130
May     27

Changing the base period

The base period can be chosen as any convenient time. It is usual to update the base period fairly regularly due to:
- Any significant change which makes comparison with earlier figures meaningless.
- The numbers growing so large that the size of a points change is many times that of a percentage change.

We need therefore to know how to change a base period. Published index numbers may include a sudden change, such as a series 470, 478, 485 becoming series 100, 103, 105. We need to convert all the numbers to a common base year so that they can be analysed as a whole continuous series. In order to change from one index series to another we need the value in the first series for the same period as that in which the rebasing to 100 for the second takes place. The ratio of these two values forms the basis of any conversion between them.

Example 4

The amount of money spent on advertising by the supermarket, which is index linked, is described by the following indexes:

Year          1      2      3      4      5      6      7      8
Index 1       100    138    162    196    220
Index 2                                   100    125    140    165
Adverts (£)                 4860

Typically each index series is completed and then, given the index linked amount spent on advertising for any one year, it is used to estimate that for any of the other years.

a) Base year for each index:
   Index 1 =             Index 2 =

b) Missing values for Index 1 and Index 2: (Both values known for year 5.)
   From year 5: ratio of old : new = 220 : 100   (Note: all new values will be less.)
   From year 5: ratio of new : old = 100 : 220   (Note: all old values will be more.)

   Index 1 for Year 6: Old index = new index × 220/100 = 125 × 220/100 = 275
   Index 2 for Year 1: New index = old index × 100/220 = 100 × 100/220 = 45.45

   Now complete both index series, checking that each is adjusted in the right direction.
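Because year 5 is known in both series, the whole conversion reduces to one fixed ratio. A minimal Python sketch of the Example 4 rebasing:

    old = {1: 100, 2: 138, 3: 162, 4: 196, 5: 220}
    new = {5: 100, 6: 125, 7: 140, 8: 165}

    ratio = old[5] / new[5]                                    # 220/100 = 2.2
    old.update({yr: v * ratio for yr, v in new.items() if yr > 5})
    new.update({yr: v / ratio for yr, v in old.items() if yr < 5})

    print(round(old[6], 1), round(new[1], 2))                  # 275.0 and 45.45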

c) The values for the amount spent on advertising can be calculated for all years, if any one year is known, by multiplying that value by the ratio of the relevant index numbers from either series. Given that the advertising expenditure was known to be 4860 in year 3, that for:

   Year 1 = 4860 × 100/162 = 3000    or    4860 × 45.45/73.64 = 3000
   Year 2 = 4860 × 138/162 = 4140    or    4860 × 62.73/73.64 = 4140

   Now complete the advertising expenditures in the table.

Simple aggregate and Weighted aggregate indexes

Most published price indexes, such as the R.P.I., are produced from many items rather than single ones. Simple price aggregate indexes just add together the prices of all the items included in the 'basket of goods', assuming each occurs just once. Weighted aggregate indexes, as used by us previously, weight each price according to the quantity bought in a particular time period, which may be the base period (a Laspeyres index) or the current period (a Paasche index). The weightings to be applied are usually arrived at by discussion and mutual agreement, or from other information such as the 'Family expenditure survey'. We shall work through a simple illustration of the principle of these aggregate indexes, but you must realise that their use in the 'real world' is far more complex than this and needs to be studied further if it is to be used in practice in a business context.

Example 5

A company buys four products with the following characteristics:

         Number of units bought    Price paid per unit (£)
Items    Year 1      Year 2        Year 1      Year 2
A        20          24            10          11
B        55          51            23          25
C        63          84            17          17
D        28          34            19          20

a) Find the simple price indexes for the products for year 2, using year 1 as the base year:
   Product A: 11/10 × 100 = 110.0
   Product B: 25/23 × 100 = 108.7
   Product C: 17/17 × 100 = 100.0
   Product D: 20/19 × 100 = 105.3
   Note that all these figures come from the prices only and that no weighting has been applied.

b) Find the simple aggregate index for year 2, using year 1 as the base year:

   (Sum of prices in Year 2 / Sum of prices in base year) × 100
   = (11 + 25 + 17 + 20)/(10 + 23 + 17 + 19) × 100 = 73/69 × 100 = 105.8

c) Find the base-weighted aggregate index, the Laspeyres index, for year 2 using year 1 as the base year:

   Index for year 2 = (Sum of prices × weights in Year 2 / Sum of prices × weights in base year) × 100
   = (11 × 20 + 25 × 55 + 17 × 63 + 20 × 28)/(10 × 20 + 23 × 55 + 17 × 63 + 19 × 28) × 100
   = 3226/3068 × 100 = 105.15

d) Find the current period-weighted aggregate index, the Paasche index, for year 2 using year 1 as the base year:

   Index for year 2 = (Sum of prices × weights in Year 2 / Sum of prices × weights in current year) × 100
   = (11 × 24 + 25 × 51 + 17 × 84 + 20 × 34)/(10 × 24 + 23 × 51 + 17 × 84 + 19 × 34) × 100
   = 3647/3487 × 100 = 104.59

The type of aggregate index chosen is based on the relative importance of the weightings.
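Both weighted indexes use the same formula and differ only in which quantities serve as weights. A minimal Python sketch of Example 5:

    p1 = [10, 23, 17, 19]; p2 = [11, 25, 17, 20]    # prices, years 1 and 2
    q1 = [20, 55, 63, 28]; q2 = [24, 51, 84, 34]    # quantities bought

    def weighted_index(prices_now, prices_base, weights):
        return (sum(p * w for p, w in zip(prices_now, weights)) /
                sum(p * w for p, w in zip(prices_base, weights)) * 100)

    laspeyres = weighted_index(p2, p1, q1)   # base-period weights: 105.15
    paasche = weighted_index(p2, p1, q2)     # current-period weights: 104.59
    print(round(laspeyres, 2), round(paasche, 2))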

TIME SERIES ANALYSIS 1 - SEASONAL DECOMPOSITION

Last week we used time series in the form of index numbers to investigate changes over time, with the main interest being in past time periods and the main objective the comparison of different series. This week we again study the past, but with the main purpose of identifying patterns in past events which can be used next week to forecast future values and events.

There are many standard methods used in forecasting, and they all depend on finding the best model to fit the past time series of data and then using the same model to forecast future values. Generally, if a model can be found which reasonably fits past chronological data, it can be used to predict, under the same conditions, values for the near future.

    Time series of past data -> Suitable method -> Best fitting model -> Forecast

Since the best method to use in any circumstance depends on the form taken by the past data, it is essential that these values are first plotted against time to help us select the most appropriate type of model, or models, to fit to this data. Below we see three typical patterns exhibited by time series, with suggested suitable types of model. (By no means an exhaustive list of either past data time series or suitably fitting models.)

[Figures: three time series plots of quarterly and monthly sales against date, showing a linear non-seasonal pattern, a non-linear non-seasonal pattern, and a seasonal pattern.]

General description           Some suitable methods
Linear, non-seasonal          Linear regression (?). Exponential smoothing.
Non-linear, non-seasonal      Non-linear regression (?). Exponential smoothing.
Linear (?), seasonal          Seasonal decomposition, additive or multiplicative.

More methods are described in Chapter 12 of Business Statistics.

We have previously used regression models to produce equations describing a dependent variable in terms of an independent one. If the independent variable is 'time', then these same methods can be used to estimate the value of the dependent variable at any point in time; in practice only the near future is valid. Linear regression models work best for non-seasonal data which appear to be fairly linear when plotted against time. 'Curve fitting' is the equivalent non-linear method, which is easy to carry out in a computer package such as SPSS or Minitab. Exponential smoothing is not covered in this lecture but is described fully in Chapter 12 of 'Business Statistics'.

We shall concentrate in this lecture on seasonal decomposition of data which exhibit a seasonal pattern as well as a general trend.

Seasonal decomposition of time series

How do we cope with data which exhibits marked seasonality in its scatterplot? We carry out a procedure called seasonal decomposition, which identifies the general trend, and also the seasonal components exhibited by past data, so that they can be used later in an equivalent forecasting model. It is a method which is lengthy to carry out by hand but is straightforward with the help of a computer package such as SPSS, Minitab or Excel. This is described in detail in Chapter 12 of 'Business Statistics'.

Example

A small ice-cream manufacturer who supplies a supermarket chain reports sales figures of £50 000 for the last quarter. Is this amount good or bad? Can this single figure be used to forecast future sales? With only one figure we never have sufficient data to make either a judgement or a forecast. We need to compare the figure with:
- that for the previous quarter;
- that for the corresponding quarter of the previous year;
- that which would be expected from the current trend, if that can be identified.

We need to identify any past pattern and use it for future sales predictions. We also need to estimate how accurate those predictions are likely to be.

In order to understand the situation more fully, the following data was collected. It represents the quarterly sales figures for the four years preceding the figure of £50 000 quoted above.

Sales (£'000)

Year    Quarter 1   Quarter 2   Quarter 3   Quarter 4
1997    40          60          80          35
1998    30          50          60          30
1999    35          60          80          40
2000    50          70          100         50

Method

1  First plot the data in chronological order on a graph against time. (Grid on back page.) What do you think? Does this data exhibit a 'four point pattern'? It does, and therefore, since it is quarterly data, the sales are judged to exhibit a quarterly seasonal effect. We shall use an additive model to identify how much has been 'added' to the sales because it is a particular quarter of the year.

Additive model:

    Observed Sales value A = Trend value T + Seasonal factor S + Random factor R

The Trend value, T, is the value of the sales had the seasonal effect been 'averaged out'.
The Seasonal factor, S, is the average effect of it being a particular quarter, e.g. summer.
The Random factor, R, is the random variation due to neither of the previous variables. This cannot be eliminated, but it is a useful measure of the goodness of fit of the model and therefore of the quality of any forecasts made from it.

2  We can calculate the missing trend values from the table on the next page, which includes:
- The observed sales figures.
- The cycle average, which is the moving average of four consecutive quarters, one from each season. As this value is not centred on any quarter, its deviation from the observed sales cannot be found.
- The trend, which is the moving average of two consecutive cycle averages and is centred on a quarter. (This second calculation is only necessary for cycles with even numbers of periods, as an odd number would produce a cycle average which was already centred.)
- The first residual, which is the difference between the observed values and the trend. This will be largely due to the seasonality of the data. From these first residuals we calculate the seasonal factor, which is the deviation from the trend due to it being a particular season.
- The fitted values, which are those produced by the model, i.e. trend + seasonal factor.
- The second residual, which is the difference between the observed and fitted model values.

Date      Sales (A)   Cycle Average    Trend (T)   First Resid.    Fitted value   2nd Resid.
          (£'000)     Moving average               S = (A - T)     F = (T + S)    R = (A - F)
1997 Q1   40
     Q2   60          53.75
     Q3   80          51.25           52.50        +27.50          75.42          +4.58
     Q4   35          48.75           50.00        -15.00          33.75          +1.25
1998 Q1   30          43.75           46.25        -16.25          32.08          -2.08
     Q2   50          42.50           43.13        + 6.87          49.17          +0.83
     Q3   60          43.75           43.13        +16.87          66.04          -6.04
     Q4   30          46.25           45.00        -15.00          28.75          +1.25
1999 Q1   35          51.25           48.75        -13.75          34.58          +0.42
     Q2   60          53.75           52.50        + 7.50          58.54          +1.46
     Q3   80          57.50           55.63        +24.37          78.54          +1.46
     Q4   40          60.00           58.75        -18.75          42.50          -2.50
2000 Q1   50          ____            ____         ____            ____           ____
     Q2   70          ____            ____         ____            ____           ____
     Q3   100
     Q4   50

The first missing cycle average:   (40 + 50 + 70 + 100)/4 = 260/4 = 65.00
The second missing cycle average:  ____
The first missing moving average trend figure:   (60.00 + 65.00)/2 = 62.50
The second missing moving average trend figure:  ____

Next plot the centred Trend values (T) on your graph. Note that the first value is plotted against Quarter 3.

3  Next calculate values for the last two First Residuals (Sales - Trend) for each separate quarter. From the first residuals, calculate the average Seasonal factor for each of the four quarters: (Complete Quarters 3 and 4.)

Seasonal Factor (S) = Average seasonal deviation from Trend

        Quarter 1   Quarter 2   Quarter 3   Quarter 4
        -16.25      + 6.87      +27.50      -15.00
        -13.75      + 7.50      +16.87      -15.00
        -12.50      + 3.75      +24.37      -18.75
Total   -42.50      +18.12
Mean    -14.17      + 6.04

Check that the average of these means is not very far from zero.

4  Calculate the last two Fitted values (F): F = T + S. Plot the fitted values on the same graph. If the model is a 'perfect fit' the fitted values will be the same as the observed Sales. We expect them to be close, indicating that we have a 'good' model.

5  Calculate the last two Second residuals (R): R = A - F. The second residuals are the random amounts by which the model has 'mismatched' the observed data in the past, and give an indication of how good the forecasts are likely to be. They should be small and randomly distributed about a value of zero. (These will be discussed further next week.)

6  Assess your model

The fitted and observed values should be similar, as judged by their graphs. Compare the means and standard deviations of the observed data, the first residuals and the second residuals. The discrepancy between the observed and fitted values, the second residuals, should be very small and randomly distributed about zero. (We shall consider these again next week.)

Summary Statistics (check later with your calculator)

Sales              Mean = 54.375   Standard deviation = 20.238
First residuals    Mean = -0.366   Standard deviation = 16.955
Second residuals   Mean = -0.002   Standard deviation = 2.770
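The trend and seasonal-factor arithmetic above can also be checked by machine. A minimal Python sketch of the hand calculation (centred four-quarter moving average, then quarterly averages of the first residuals):

    sales = [40, 60, 80, 35, 30, 50, 60, 30, 35, 60, 80, 40, 50, 70, 100, 50]

    cycle_avg = [sum(sales[i:i + 4]) / 4 for i in range(len(sales) - 3)]
    trend = [(a + b) / 2 for a, b in zip(cycle_avg, cycle_avg[1:])]   # centred

    # First residuals (A - T); trend[0] lines up with Quarter 3 of 1997.
    first_resid = [sales[i + 2] - t for i, t in enumerate(trend)]

    # Average the residuals per quarter to get the seasonal factors S.
    quarters = [(i + 2) % 4 for i in range(len(first_resid))]   # 0 = Q1
    factors = [round(sum(r for r, q in zip(first_resid, quarters) if q == k) /
                     quarters.count(k), 2) for k in range(4)]
    print(factors)   # [-14.17, 6.04, 22.92, -16.25] for Q1..Q4, as in the completed table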

[Graph grid: Sales (£'000), 20 to 100, against quarters Q1 1997 to Q4 2001, for plotting the series, trend, fitted values and forecasts.]

Date      Sales (A)   Cycle Average    Trend (T)   First Resid.    Fitted value   2nd Resid.
          (£'000)     Moving average               S = (A - T)     F = (T + S)    R = (A - F)
1997 Q1   40
     Q2   60          53.75
     Q3   80          51.25           52.50        +27.50          75.42          +4.58
     Q4   35          48.75           50.00        -15.00          33.75          +1.25
1998 Q1   30          43.75           46.25        -16.25          32.08          -2.08
     Q2   50          42.50           43.13        + 6.87          49.17          +0.83
     Q3   60          43.75           43.13        +16.87          66.04          -6.04
     Q4   30          46.25           45.00        -15.00          28.75          +1.25
1999 Q1   35          51.25           48.75        -13.75          34.58          +0.42
     Q2   60          53.75           52.50        + 7.50          58.54          +1.46
     Q3   80          57.50           55.63        +24.37          78.54          +1.46
     Q4   40          60.00           58.75        -18.75          42.50          -2.50
2000 Q1   50          65.00           62.50        -12.50          48.33          +1.67
     Q2   70          67.50           66.25        + 3.75          72.29          -2.29
     Q3   100
     Q4   50

Seasonal Factor (S) = Average seasonal deviation from Trend

          Quarter 1   Quarter 2   Quarter 3   Quarter 4
          -16.25      + 6.87      +27.50      -15.00
          -13.75      + 7.50      +16.87      -15.00
          -12.50      + 3.75      +24.37      -18.75
Total     -42.50      +18.12      +68.74      -48.75
Average   -14.17      + 6.04      +22.92      -16.25

TIME SERIES ANALYSIS 2 - FORECASTING

In this lecture we shall produce forecasts using a seasonal model. Non-seasonal forecasting is covered by Chapter 13 of Business Statistics.

Last week we produced an Additive model for the ice-cream sales data. Was this appropriate? If so:
1. The plot of the observed data should show a steady repeated pattern and be close to that of the fitted values.
2. The plot of the trend should be fairly smooth and steady, without too many fluctuations.
3. The first residuals, those between the trend and the observed values, should have a mean which is much smaller than the observed values.
4. The second residuals, those between the fitted and observed values, should have a mean of zero and a much reduced standard deviation from that of the observed values.

Summary Statistics (as calculated last week)

Sales              Mean = 54.375   Standard deviation = 20.238
First residuals    Mean = -0.366   Standard deviation = 16.955
Second residuals   Mean = -0.002   Standard deviation = 2.770

This model therefore seems satisfactory, so we shall use it to predict future ice-cream sales.

Forecasting using an Additive model (from last week)

Additive model:

    Observed Sales value A = Trend value T + Seasonal factor S + Random factor R

The Trend value is the value of the sales had the seasonal effect been 'averaged out'. For forecasting, it is estimated from the extended trend line over the next four quarters on your graph. Use your own best judgement in extending this line; if you have any background information concerning future sales, this should be taken into account.
The Seasonal factor is the average effect of it being a particular quarter: the average of the previous deviations from the trend line for that particular season.
The Random factor is the random variation due to neither of the previous variables. When using this model for forecasting, the Random factor is left out of the equation for the single figure forecast but re-introduced when the likely accuracy is estimated as a range.

Forecasting model:

    Forecast Sales value F = Extended Trend value T + Seasonal factor S

Forecasts

Read off the values from your extended trend line, note them in the table below, and add to them the appropriate seasonal factor to get your forecast.

Additive model:   Extended Trend value (T) + Seasonal factor (S) = Forecast Sales value (F)

2001 Quarter 1    ____    +    (-14.17)    =    ____
2001 Quarter 2    ____    +    (+ 6.04)    =    ____
2001 Quarter 3    ____    +    (+22.92)    =    ____
2001 Quarter 4    ____    +    (-16.25)    =    ____

Add your forecasts to your graph to check that the past pattern follows into the future.

Maximum likely error in forecasts

How good are these forecasts? The second residuals are the measure of the discrepancy in the past between the model and the observed data. They are therefore used to estimate the size of the forecasting errors. Assuming that the conditions from the past are continuing into the future, there seems no reason why the errors should not remain, on average, the same size. If the second residuals are shown to be small and randomly distributed about zero, then it can be assumed that the error in the forecasts is not likely to be more than twice their standard deviation. (Think 95% confidence interval!)

The standard deviation of the second residuals = 2.770 (£'000), so the maximum likely error is 2 × £2770 = £5540.

Your forecasts should therefore be quoted as:

2001 Quarter 1    ____ ± £5540
2001 Quarter 2    ____ ± £5540
2001 Quarter 3    ____ ± £5540
2001 Quarter 4    ____ ± £5540

A final rounding to the nearest hundred seems reasonable:

2001 Quarter 1    ____ ± £5500
2001 Quarter 2    ____ ± £5500
2001 Quarter 3    ____ ± £5500
2001 Quarter 4    ____ ± £5500
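The forecasting step itself is a single addition per quarter. A Python sketch; the extended trend readings used here are hypothetical stand-ins for the values you read off your own extended trend line:

    seasonal = {"Q1": -14.17, "Q2": 6.04, "Q3": 22.92, "Q4": -16.25}
    trend_2001 = {"Q1": 70.0, "Q2": 71.0, "Q3": 72.0, "Q4": 73.0}   # assumed readings

    max_error = 2 * 2.77   # twice the s.d. of the second residuals (£'000)
    for q in ("Q1", "Q2", "Q3", "Q4"):
        f = trend_2001[q] + seasonal[q]   # F = T + S
        print(f"2001 {q}: {f:.2f} +/- {max_error:.2f} (£'000)")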

Deseasonalising seasonal data

In order to compare figures from consecutive quarters, the effect due to it being a particular quarter needs to be eliminated. How can we compare the ice-cream sales in summer with those in winter? We 'deseasonalise' the observed sales figures:

                  Observed Sales   -   Seasonal factor   =   Deseasonalised sales
2000 Quarter 1    50               -   ____               =   ____
2000 Quarter 2    70               -   ____               =   ____
2000 Quarter 3    100              -   ____               =   ____
2000 Quarter 4    50               -   ____               =   ____

So, having removed the amount due to each quarter, we can see that quarters 1 and 2 were very similar, but that quarter 3 showed an increase, in real terms, of £13 120 over quarter 2. This was followed by a decrease in sales of £10 830 between quarters 3 and 4.

The deseasonalised values can be compared to the trend value at any point in time to see whether, say, sales are progressing better or worse than the general trend at that time. It now makes sense to calculate percentage changes, compare with published indexes, etc.
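Deseasonalising is the reverse operation: subtract each quarter's factor. A minimal Python sketch for the Year 2000 figures, using the seasonal factors calculated earlier:

    observed_2000 = [50, 70, 100, 50]
    seasonal = [-14.17, 6.04, 22.92, -16.25]   # Q1..Q4

    deseasonalised = [o - s for o, s in zip(observed_2000, seasonal)]
    print([round(d, 2) for d in deseasonalised])   # [64.17, 63.96, 77.08, 66.25]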

Minitab output for this data

[Figure: 'Time series plot for Ice-cream sales' - quarterly sales, 30 to 100 (£'000), against quarters 0 to 15.]

We can see that this is quarterly data which can be described well by an additive model.

Time series Decomposition

ROW   QUARTER   THE DATA   TREND     1ST.RESD   SEASONAL   FITTED    2ND.RESD
1     1         40         *         *          -14.1667   *         *
2     2         60         *         *          6.0417     *         *
3     3         80         52.5000   27.5000    22.9167    75.4167   4.5833
4     4         35         50.0000   -15.0000   -16.2500   33.7500   1.2500
5     5         30         46.2500   -16.2500   -14.1667   32.0833   -2.0833
6     6         50         43.1250   6.8750     6.0417     49.1667   0.8333
7     7         60         43.1250   16.8750    22.9167    66.0417   -6.0417
8     8         30         45.0000   -15.0000   -16.2500   28.7500   1.2500
9     9         35         48.7500   -13.7500   -14.1667   34.5833   0.4167
10    10        60         52.5000   7.5000     6.0417     58.5417   1.4583
11    11        80         55.6250   24.3750    22.9167    78.5417   1.4583
12    12        40         58.7500   -18.7500   -16.2500   42.5000   -2.5000
13    13        50         62.5000   -12.5000   -14.1667   48.3333   1.6667
14    14        70         66.2500   3.7500     6.0417     72.2917   -2.2917
15    15        100        *         *          22.9167    *         *
16    16        50         *         *          -16.2500   *         *

These figures, with the addition of 'Seasonal', are the same as those calculated last week for this data. Note the quarterly repetition in the 'Seasonal' column: this is the average deviation for a quarter, so it is the 'Seasonal factor' for use in both forecasting and deseasonalising.

Plotting the trend line with the original data shows its smoothed fit. Adding the Fitted values to the previous graph shows how well the model has fitted the observed data in the past.

[Figures: 'Time series plot for Ice cream Sales with Trend' and the same plot with the Fitted values added.]

The first diagram could, in theory, be used to estimate the future trend, but it would be difficult to extend the trend line and read values from it with any accuracy. A hand drawn graph is therefore preferable. The second composite graph indicates a fairly close match, but the residuals still need to be analysed before the accuracy of any predictions can be estimated numerically.

Summary statistics

FOR THE DATA               MEAN = 54.375          ST.DEV. = 20.238
FOR THE FIRST RESIDUALS    MEAN = -0.36458        ST.DEV. = 16.955
FOR THE SECOND RESIDUALS   MEAN = 9.536743E-07    ST.DEV. = 2.7696

Residual analysis

Remember that residuals are required to be small, with a mean of zero and a 'small' standard deviation, and to be randomly distributed in time.

[Figure: plot of second residuals against time (quarters 0 to 15).]

Are they randomly distributed chronologically? No obvious seasonal pattern is evident. These residuals seem to satisfy the required conditions and also give us the additional information that our forecasts will be accurate to within 2 × 2.77, i.e. £5540.

SPSS output for this data

-> VARIABLE LABELS SALES "ice cream sales (£000)".
-> * Seasonal Decomposition.
-> /MODEL=ADDITIVE
-> /MA=CENTERED.

Results of SEASON procedure for variable SALES.
Additive Model. Centered MA method. Period = 4.

Dates can be described in many ways; years and quarters are used here.

[16-row listing (Q1 1997 to Q4 2000) of: SALES, moving averages, ratios (differences from the moving average), seasonal factors, seasonally adjusted series, smoothed trend cycle and irregular component.]

The following new variables are being created:

Name    Label
ERR_1   Error for SALES from SEASON, MOD_1 ADD CEN 4
SAS_1   Seas adj ser for SALES from SEASON, MOD_1 ADD CEN 4
SAF_1   Seas factors for SALES from SEASON, MOD_1 ADD CEN 4
STC_1   Trend-cycle for SALES from SEASON, MOD_1 ADD CEN 4

The moving averages are the trend figures calculated by hand previously, and the 'Ratios', a term more suited to a multiplicative model, are the same first residuals. The 'seasonal factors' and the 'smoothed trend cycle' describe the average effect due to a particular quarter and the sales level without that effect. These differ slightly from our calculated figures, as they have been further smoothed and also extended to cover the first and last two time periods. No 'fitted values' from the model have been produced by default, but these can easily be obtained as the sum of the 'smoothed trend cycle' and 'seasonal factors', which have automatically been produced and saved by SPSS as STC_1 and SAF_1. The errors and deseasonalised sales have also been saved. We shall look at seasonally adjusted series later.

[Figures: 'Time series plot for Ice-cream Sales' (quarterly sales, £'000, Q1 1997 to Q4 2001) and the same plot with the Trend added.]

Plotting the trend line with the original data shows its smoothed fit. This diagram could, in theory, be used to estimate the future trend, but it would be difficult to extend the trend line and read values from it with any accuracy. A hand drawn graph is therefore preferable.

Adding the Fitted values to the previous graph shows how well the model has fitted the observed data in the past.

[Figure: 'Fitted values added to Sales and Trend' - quarterly ice-cream sales (£'000), trend, and fitted values from the additive model.]

This composite graph indicates a fairly close match, but the residuals still need to be analysed before the accuracy of any predictions can be estimated numerically.

Residual analysis

Remember that residuals are required to be small, with a mean of zero and a 'small' standard deviation, and to be randomly distributed in time.

Descriptive Statistics

                                     N    Minimum   Maximum   Mean     Std. Deviation
Quarterly Ice-cream Sales (£'000)    16   30        100       54.375   20.2385
First residuals                      12   -18.75    27.50     -.3646   16.9547
Second residuals                     12   -6.04     4.58      .0000    2.7702
Valid N (listwise)                   12

[Figure: plot of second residuals against time.]

Are they randomly distributed chronologically? No obvious seasonal pattern is evident. These residuals seem to satisfy the required conditions and also give us the additional information that our forecasts will be accurate to within 2 × 2.77, i.e. £5540.

[Graph grid: Sales (£'000), 20 to 100, against quarters Q1 1997 to Q4 2001, for plotting the series, trend, fitted values and forecasts.]

Date      Sales (A)   Cycle Average    Trend (T)   First Resid.    Fitted value   2nd Resid.
          (£'000)     Moving average               S = (A - T)     F = (T + S)    R = (A - F)
1997 Q1   40
     Q2   60          53.75
     Q3   80          51.25           52.50        +27.50          75.42          +4.58
     Q4   35          48.75           50.00        -15.00          33.75          +1.25
1998 Q1   30          43.75           46.25        -16.25          32.08          -2.08
     Q2   50          42.50           43.13        + 6.87          49.17          +0.83
     Q3   60          43.75           43.13        +16.87          66.04          -6.04
     Q4   30          46.25           45.00        -15.00          28.75          +1.25
1999 Q1   35          51.25           48.75        -13.75          34.58          +0.42
     Q2   60          53.75           52.50        + 7.50          58.54          +1.46
     Q3   80          57.50           55.63        +24.37          78.54          +1.46
     Q4   40          60.00           58.75        -18.75          42.50          -2.50
2000 Q1   50          65.00           62.50        -12.50          48.33          +1.67
     Q2   70          67.50           66.25        + 3.75          72.29          -2.29
     Q3   100
     Q4   50

Seasonal Factor (S) = Average seasonal deviation from Trend

          Quarter 1   Quarter 2   Quarter 3   Quarter 4
          -16.25      + 6.87      +27.50      -15.00
          -13.75      + 7.50      +16.87      -15.00
          -12.50      + 3.75      +24.37      -18.75
Total     -42.50      +18.12      +68.74      -48.75
Average   -14.17      + 6.04      +22.92      -16.25

SELECTED REVISION QUESTIONS WITH NUMERICAL ANSWERS

(Much larger selection of questions, and all the answers, in Business Statistics, Chapter 16. Numbers are from that chapter.)

Probability, Contingency tables and Chi-squared tests

16.6  At the end of a semester, the students' grades for Statistics were tabulated in the following 3 x 2 contingency table to see if there is any association between class attendance and grades received.

                       Grade received
No. of Days Absent     Pass    Fail
0 - 3                  135     110
4 - 6                  36      4
7 - 45                 9       6

(a) Calculate the probability that a student selected at random:
    (i) was absent for less than four days; (ii) was absent for less than seven days; (iii) passed; (iv) passed, given that he/she was absent for less than four days; (v) passed or was absent for less than four days; (vi) passed and was absent for less than four days.
(b) Calculate the expected value for the number of days absence.
(c) At 5% significance, do the data indicate that the proportions of students who pass differ between the three absence categories?
(d) Calculate the 95% confidence intervals for the percentage of students who passed and the percentage of them who failed.
(e) Briefly summarise your results so that they can be understood by a non-statistician. Interpret your findings.

{(a) (i) 0.817 (ii) 0.950 (iii) 0.600 (iv) 0.551 (v) 0.967 (vi) 0.450 (b) 3.19 days (c) χ² = 17.45, χ²0.05(2) = 5.99. Reject H0. Conclude that the proportions differ. (d) 54.5% to 65.5%; 34.5% to 45.5%. No overlap, so percentages different.}

Normal Distribution

16.10  The weights of sugar in what are nominally '2 kilo' bags are actually normally distributed with mean 2.1 kg and standard deviation 0.1 kg.
a) What percentage of bags will weigh less than 2.0 kg? (15.9%)
b) What percentage will weigh between 2.0 kg and 2.25 kg? (77.5%)
c) If only 4% of the bags were to contain under the stated minimum weight, what should this value be? (1.93 kg)
d) If a random sample of 25 bags from another machine has a mean weight of 2.5 kg, would its mean production be any different?

Confidence Intervals

16.17  The records of a large shoe shop show that, over a given period, 520 out of 1000 people who entered the shop bought at least one pair of shoes.
a) Treating this as a random sample of all potential customers, find a 95% confidence interval for the actual percentage of the people entering the shop who will buy at least one pair of shoes. (48.9% to 55.1%)
b) Does this support the firm's claim that they actually sell shoes to half of their potential customers? Why? (Yes, interval includes 50%)

16.19  The same supermarket then decided to investigate the spending habits of husbands and wives. They were thinking of starting late 'family shopping' evenings, and as an experiment asked both partners to shop separately. The amounts spent by 14 husbands and their wives were selected randomly from all the pairs, with the following results (rounded to the nearest £1):

Family        A    B    C    D    E    F    G    H    I    J    K    L    M    N
Husband (£)   9    20   19   9    26   16   20   13   15   10   18   13   16   23
Wife (£)      23   30   13   23   12   17   29   35   18   26   32   23   26   14

a) Calculate the 95% confidence interval for the mean amount spent by the wives. (£18.72 to £27.14)
b) Calculate the 99% confidence interval for the mean amount spent by the husbands. (£12.06 to £20.37)
c) The store manager thinks that, on average, women spend £20 per head. Does the confidence interval in (a) from the sample support this claim? Explain your answer. (Yes, £20 in interval)
d) Is there any evidence that the mean amounts spent by wives and their husbands are different? (Calculate the 95% confidence interval for the differences in each family.) (£0.72 to £12.70; yes, zero is not in the interval)

Hypothesis Testing

16.25  A new teaching method, N, is to be compared to a standard teaching method, S, which was believed to have a mean score of 65. Children were 'paired', one of each pair was taught by each method, and both were given the same reading test with the following results:

Pair number       1    2    3    4    5    6    7    8    9    10
S method score    56   59   61   48   39   56   75   45   81   60
N method score    63   57   67   52   61   71   70   46   93   75

a) Is 65 a reasonable estimate for the mean score of all children taught by method S?
b) Calculate the 'difference in scores' for each pair of children and test to see if, at the 5% level, new method N gives a better mean score than standard method S.

{(a) C.V. = 2.26, t = 1.73; 65 could be the mean mark for all the children. (b) C.V. = 1.83, t = 2.80; Method N does give a higher mean score.}

Analysis of Variance

16.28  Four salesmen in your company are competing for the title 'Salesman of the year'. Each has the task of selling 'cold' in three different types of location. Their resulting sales, in £'000, were as follows:

[Table: sales (£'000) for salesmen A to D in each of Areas 1 to 3.]

a) Is there any significant difference, at a 5% level, between the sales of the men if the location is not taken into account?
b) Is there any significant difference, at the 5% level, between the sales of the men if the location is taken into account?
c) Which salesman won? Did he do significantly better than his nearest rival?

{(a) Fsalesmen = 2.26, C.V. = F0.05(3,8) = 4.07; no significant difference. (b) Fsalesmen = 5.73, C.V. = F0.05(3,6) = 4.76; significant difference when locations taken into consideration. (c) No.}

Correlation and Regression

16.31  The following data refers to the Sales and Advertising expenditure of a small local bakery.

Year                  1990  1991  1992  1993  1994  1995  1996  1997  1998  1999  2000
Advertising (£'000)   3     3     5     4     6     7     7     8     6     7     10
Sales (£'000)         40    60    100   80    90    110   120   130   100   120   150

a) Plot a scatter diagram of the data.
b) Calculate the correlation coefficient. (0.958)
c) Test it for significance at the 5% level. (C.V. = 0.602, significant)
d) If significant, find the equation of the regression line. (Sales = 15.2 + 14.1 Advertising)
e) What practical use could be made of this equation?
f) What percentage of the variation in sales is explained by your model? (92%)

Index Numbers

16.38
Year    1     2     3     4     5
Index   100   110   120   135   150

(a) Construct an index with Year 3 = 100. (83.3, 91.7, 100.0, 112.5, 125.0)
(b) Construct an index with Year 5 = 100. (66.7, 73.3, 80.0, 90.0, 100.0)
(c) If the index concerns the price of a loaf of bread, which was 65p in year 3, then calculate the price of a similar loaf in each of the other years. (54p, 60p, 65p, 73p, 81p, to the nearest 1p)

16.39  Indexes
Year    1     2     3     4     5     6     7     8
'Old'   100   120   160   190
'New'                     100   130   140   150   165

(a) Scale to 'New' the 'Old' indexes for years 1, 2 and 3. (52.6, 63.2, 84.2)
(b) Scale to 'Old' the 'New' indexes for years 5, 6, 7 and 8. (247.0, 266.0, 285.0, 313.5)
(c) If a company's profits were £3.5 million in Year 3, what would they be for the other years? (2.19, 2.63, (3.5), 4.16, 5.40, 5.82, 6.23, 6.86 £ million)

16.40  The production of refrigerators can be summarised by the following table:

Year               Year 1   Year 2   Year 3   Year 4   Year 5
Production (000s)  4500     4713     5151     5566     6000

a) Calculate index numbers for the above data taking Year 3 as the base year. (87.4, 91.5, 100.0, 108.1, 116.5)
b) If the production of cookers followed the same index and 550 thousand were made in year 1, how many thousand were made in each of the other years? ((550), 576, 630, 680, 733)

16.42 The following table shows the prices of three commodities in three consecutive years:

  Prices      Year 1   Year 2   Year 3
  Tea            8       12       16
  Coffee        15       17       18
  Chocolate     22       23       24

a) Calculate the simple aggregate index for Year 2 and Year 3 using Year 1 as the base year.
b) Calculate the mean price relative index for Year 2 and Year 3 using Year 1 as the base year.

{(a) 115.6 128.9 (b) 122.6 143.0}

16.41 The table below refers to the Annual Average Retail Price Index.

  Year    1      2      3      4      5      6      7
  RPI1  351.8  373.2  385.9  394.9
  RPI2                       100.0  106.9  115.1  124.7

a) Calculate the percentage change between consecutive years.
b) Calculate the percentage point change between consecutive years after Year 4.

{(a) 6.1% 3.4% 2.3% 6.9% 7.7% 8.3% (b) 6.9 8.2 9.6}
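The two indexes in 16.42 weight the commodities differently: the simple aggregate index compares the sums of the prices, so dear items dominate, while the mean price relative index averages each commodity's own price ratio, so every item counts equally. A minimal sketch, again assuming Python (illustration only):

    # Minimal sketch for 16.42: two unweighted index formulas.
    year1 = [8, 15, 22]    # tea, coffee, chocolate prices in Year 1
    year2 = [12, 17, 23]
    year3 = [16, 18, 24]

    def simple_aggregate(prices, base):
        # Ratio of total prices, x100.
        return round(100 * sum(prices) / sum(base), 1)

    def mean_price_relative(prices, base):
        # Average of the individual price ratios, x100.
        relatives = [p / b for p, b in zip(prices, base)]
        return round(100 * sum(relatives) / len(relatives), 1)

    print(simple_aggregate(year2, year1), simple_aggregate(year3, year1))
    # 115.6 128.9
    print(mean_price_relative(year2, year1), mean_price_relative(year3, year1))
    # 122.6 143.0

For 16.41, note the distinction the two parts draw: a percentage change divides the rise by the earlier value, whereas a percentage point change is the simple difference between two index values.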

Time series analysis and Forecasting

16.44 The following quarterly data represents the number of marriages, in thousands, for four consecutive years: (Source: Office of Population Censuses)

  Year     1      2      3      4
  Q1      71.3   73.2   61.6   62.2
  Q2     111.0  108.8  108.0  108.2
  Q3     135.5  146.0  140.9  138.2
  Q4      79.6   78.0   61.0   78.3

A summary of the analysis performed on this data, using both SPSS and Minitab, is shown overleaf.

a) Plot the given data on graph paper as a time series, leaving room for your forecasts for Year 5.
b) Calculate the missing values indicated by the three question marks (?) on the table in the printout.
c) Plot the Smoothed Trend-cycle, or Fitted values for Minitab, on your graph.
d) State the four seasonal factors and use them with your graph to forecast the expected number of marriages for each of the four quarters of Year 5, stating your forecasts clearly and showing how they were calculated.
e) Estimate the likely error in your forecasts.
f) Demonstrate how the Seasonally adjusted series for SPSS, or the Trend for Minitab, has been found.
g) With reference to your graph and the summary statistics given in the SPSS or Minitab outputs, discuss the suitability of the additive model.
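The printouts overleaf were produced by SPSS and Minitab, but the additive decomposition they perform is only a centred moving average plus averaged deviations. The sketch below shows the mechanics in Python (an assumed environment, offered as an illustration; the series here is synthetic so that the sketch stands alone - substitute the 16 marriage figures from the table above):

    # Minimal sketch of additive decomposition with a centred moving
    # average (the method behind /MODEL=ADDITIVE /MA=CENTERED in SPSS).
    series = [70, 110, 140, 80, 72, 112, 142, 82,
              74, 114, 144, 84, 76, 116, 146, 86]   # synthetic quarterly data
    n = len(series)

    # 1. Centred 4-point moving average (trend-cycle estimate);
    #    undefined for the first two and last two quarters.
    trend = [None] * n
    for i in range(2, n - 2):
        trend[i] = (series[i-2]/2 + series[i-1] + series[i]
                    + series[i+1] + series[i+2]/2) / 4

    # 2. Detrended values, collected by quarter.
    by_quarter = {q: [] for q in range(4)}
    for i in range(n):
        if trend[i] is not None:
            by_quarter[i % 4].append(series[i] - trend[i])

    # 3. Seasonal factors: average deviation per quarter, shifted to sum to 0.
    raw = [sum(v) / len(v) for v in by_quarter.values()]
    mean_raw = sum(raw) / 4
    factors = [r - mean_raw for r in raw]
    print("Seasonal factors:", [round(f, 3) for f in factors])

    # 4. Seasonally adjusted series = data - seasonal factor for that quarter.
    adjusted = [series[i] - factors[i % 4] for i in range(n)]
    print("Seasonally adjusted:", [round(a, 1) for a in adjusted])

A forecast for Year 5 is then the extrapolated trend plus the appropriate quarter's factor, which is exactly what part (d) asks you to do by hand from the graph.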

SPSS output for use with Question 16.44

-> * Seasonal Decomposition.
-> /VARIABLES=marriage
-> /MODEL=ADDITIVE
-> /MA=CENTERED.

For each quarter, DATE_ Q1 1 to Q4 4, the printout tabulates the columns MARRIAGE, Moving averages, Ratios, Seasonal factors, Seasonally adjusted series, Smoothed trend-cycle, and the First and Second residuals; three of the entries are replaced by question marks (?) for part (b) of the question. The four seasonal factors, repeated down the table, are:

  Q1  -35.380    Q2  13.718    Q3  40.207    Q4  -18.891

[the remaining entries of this printout are garbled beyond recovery in this master]

Summary Statistics

  Variable    Mean    Std Dev   Minimum   Maximum    N   Label
  MARRIAGE    98.04    30.2      61.0      146.0    16   Number of marriages
  FIRST         …        …         …          …     16   First residuals
  SECOND        …        …         …          …     16   Second residuals

Minitab output (from an older version) for use with Question 16.44

The printout lists ROW and QUARTER 1 to 16 against the columns THE DATA, TREND, SEASONAL, 1ST.RESD (first residuals), FITTED and 2ND.RESD (second residuals). Entries are starred (*) where the centred moving average does not exist (the first and last two quarters), and three entries are replaced by question marks (?) for part (b) of the question. The four seasonal values, repeated down the table, are:

  Q1  -35.6250    Q2  12.9625    Q3  40.9625    Q4  -18.6458

[the remaining entries of this printout are garbled beyond recovery in this master]

Summary Statistics

  FOR THE DATA              MEAN = 98.0…          ST.DEV. = 30.…
  FOR THE FIRST RESIDUALS   MEAN = …              ST.DEV. = …
  FOR THE SECOND RESIDUALS  MEAN = 3.178914E-07   ST.DEV. = …

[character-graphics plot of the data and the trend line, vertical scale 60 to 150]
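Part (f) hinges on one identity in the additive model: the seasonally adjusted figure is the raw figure with the seasonal effect removed, and the fitted value is the trend with the seasonal effect added back on:

  seasonally adjusted = data - seasonal factor
  fitted = trend + seasonal factor

As a worked check, using the Q1 Year 1 figure of 71.3 and the SPSS Q1 factor of -35.380 recovered above (the printout's own rounding may differ slightly): seasonally adjusted = 71.3 - (-35.380) = 106.68 thousand marriages.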

[Blank grid for plotting Question 16.44: vertical axis Marriages ('000) from 60 to 140; horizontal axis quarters Q1 to Q4 of Years 1 to 5, leaving room for the Year 5 forecasts]

